Parsing a text file line-by-line: skipping badly-formed lines?

  • Thread starter denis.papathanasiou

denis.papathanasiou

I have a script which reads a plain text (dos) file line-by-line and
splits it into several smaller files, based on a single attribute.

The code (below) works, except when a line is malformed (i.e., the
line contains binary or control characters), and the script just exits
with an error:

open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!\n";
binmode(IN);
while( $ln = <IN> ) {
    if( $ln =~ m/\r\n$/ ) {
        $ln =~ s/\r\n$/\n/;   # dos2unix: convert CR LF to LF
        if( $. > 0 ) {        # skip the header line
            $sym = substr($ln, 10, 16);
            $sym =~ s/ //g;
            if( $prior_sym ne $sym ) {
                if( $prior_sym ne '' ) { close(OUT); }
                $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX;
                open(OUT, ">$sym_file") or die "\n\terror: Could not write to $sym_file $!\n";
                binmode(OUT);
            }
            print OUT $ln;
            $prior_sym = $sym;
        }
    }
}
close(IN);

What I'd like it to do, instead, is if it hits a bad line, write a
warning and keep going to the end of the file.

I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
that doesn't trap the error; even with eval/warn, a bad line will
cause the script to exit.

Is there a better way of doing this?
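A minimal sketch of the symptom (using a made-up sample file, not the real data): eval {} only traps die(), and a failed read simply makes <IN> return undef, so the loop ends with nothing to catch. Testing eof() after the loop is one way to tell a normal end-of-file apart from an early exit.

```perl
#!/usr/bin/perl
# Sketch with hypothetical sample data: eval {} only traps die(), and
# a failed read just makes <$in> return undef, ending the loop with no
# exception to catch. Testing eof() after the loop distinguishes a
# normal end-of-file from an early exit.
use strict;
use warnings;

my $in_file = 'sample.txt';
open(my $mk, '>:raw', $in_file) or die "could not write $in_file: $!";
print $mk "HEADER\r\n", "line1\r\n", "line2\r\n";
close($mk);

my $lines = 0;
open(my $in, '<:raw', $in_file) or die "could not read $in_file: $!";
while (my $ln = <$in>) {
    $lines++;    # per-line processing would go here
}
warn "read stopped before EOF: $!\n" unless eof($in);
close($in);
unlink $in_file;
print "read $lines lines\n";   # prints "read 3 lines"
```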
 

Greg Bacon

: I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
: that doesn't trap the error; even with eval/warn, a bad line will
: cause the script to exit.

You say your program exits with an error, but you didn't say what
the error is.

What's the error? What version of perl are you using? What's your
operating system?

Your chances of receiving a helpful reply are even better if you can
provide input that causes the problem. Yes, transmitting non-printable
characters on Usenet is a pain, so uuencode the input or write a Perl
program that can recreate it!

Greg
 

denis.papathanasiou

> You say your program exits with an error, but you didn't say what
> the error is.

My fault, I should have been more precise.

$? actually returns 0 but I know that is incorrect because the output
is not as expected.

The large text file contains data from "A" to "Z", so a successful run
would result in 26 smaller files.

But the output we get stops at "R", so either one of the "R" lines (or
possibly the start of the "S" data) is malformed.

> What's the error? What version of perl are you using? What's your
> operating system?

$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi

$ uname -sro
Linux 2.4.27-2-386 GNU/Linux

> Your chances of receiving a helpful reply are even better if you can
> provide input that causes the problem. Yes, transmitting non-printable
> characters on Usenet is a pain, so uuencode the input or write a Perl
> program that can recreate it!

Getting to the exact line with the problem has been surprisingly
difficult: the input file is 14 GB in size, which is too big for the
hex editor we use (shed).

I've also tried split to break up the file into smaller chunks, so I
can load the "R" or "S" chunk into shed and look at the line, but
split suffers the same problem, i.e. it only gets so far through the
original file before it quits, leaving the "S" to "Z" range unsplit.

I'd also thought it might have to do with the $. variable (perhaps at
14 GB, it exceeds perl's ability to count that high?), but removing
that logic from my script didn't change the result.
 

John W. Krahn

> I have a script which reads a plain text (dos) file line-by-line and
> splits it into several smaller files, based on a single attribute.
>
> The code (below) works, except when a line is malformed (i.e., the
> line contains binary or control characters), and the script just exits
> with an error:
>
> open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!

perldoc -q quoting

Also, you should get into the habit of using the three argument form of open:

open IN, '<', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";

> \n"; ;
> binmode(IN);

You can also incorporate that into the open statement:

open IN, '<:raw', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";

> while( $ln=<IN> ) {
> if( $ln =~ m/\r\n$/ ) {
> $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF

You don't need to match the same pattern twice:

if ( $ln =~ s/\r\n$/\n/ ) {

Or more portable and correct:

if ( $ln =~ s/\015\012\z/\n/ ) {

> if( $. > 0 ) { # skip the header line

$. starts out at 1 so it is *always* greater than 0 (unless you explicitly
change it.)

> $sym = substr($ln, 10, 16);
> $sym =~ s/ //g;

Use the three argument open() so you won't have to worry about whitespace in
the file name. However there are other characters that are not valid in a
file name that you should remove, such as "\0" and '/':

$sym =~ tr!\0/!!d;

> if( $prior_sym ne $sym ) {
> if( $prior_sym ne '' ) { close(OUT); }
> $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
> open(OUT, ">$sym_file") or die "\n\terror: Could not write to
> $sym_file $!\n";
> binmode(OUT);

open OUT, '>:raw', $sym_file or die "\n\terror: Could not write to $sym_file $!\n";

> }
> print OUT $ln;
> $prior_sym = $sym ;
> }
> }
> }
> close(IN);
>
> What I'd like it to do, instead, is if it hits a bad line, write a
> warning and keep going to the end of the file.
>
> I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
> that doesn't trap the error; even with eval/warn, a bad line will
> cause the script to exit.
>
> Is there a better way of doing this?
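Putting those suggestions together, a sketch might look like the following (the substr(10, 16) offsets are taken from the original script; the sample record is made up):

```perl
#!/usr/bin/perl
# Sketch combining the suggestions above: a single substitution that
# both tests and normalizes the CR LF ending, plus tr/// to strip
# characters that are unsafe in a file name. The substr(10, 16)
# offsets come from the original script; the sample record is made up.
use strict;
use warnings;

sub extract_sym {
    my ($ln) = @_;
    return undef unless $ln =~ s/\015\012\z/\n/;   # reject non-CRLF lines
    my $sym = substr($ln, 10, 16);
    $sym =~ tr!\0/ !!d;    # delete NUL, '/', and spaces
    return $sym;
}

my $rec = "1349503551" . "TRIG" . (" " x 12) . ("0" x 40) . "\015\012";
print extract_sym($rec), "\n";    # prints "TRIG"
```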


John
 

denis.papathanasiou

> perldoc -q quoting
>
> Also, you should get into the habit of using the three argument form of open:
>
> open IN, '<', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";
>
> You can also incorporate that into the open statement:
>
> open IN, '<:raw', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";

Thanks for the suggestion; I've been working with an old template, and
since it was functional, I never bothered to make it more idiomatic.

> You don't need to match the same pattern twice:
>
> if ( $ln =~ s/\r\n$/\n/ ) {
>
> Or more portable and correct:
>
> if ( $ln =~ s/\015\012\z/\n/ ) {

I'm guilty of some spaghetti there: the dos2unix line was added later,
and I just stuck it in there without thinking about the statement
before it.

> $. starts out at 1 so it is *always* greater than 0 (unless you explicitly
> change it.)

Really? If I leave that statement out, it winds up processing the
first line, but when it's there, it skips the first line.

> Use the three argument open() so you won't have to worry about whitespace in
> the file name. However there are other characters that are not valid in a
> file name that you should remove such as "\0" and '/'.
>
> $sym =~ tr!\0/!!d
>
> open OUT, '>:raw', $sym_file or die "\n\terror: Could not write to $sym_file $!\n";

These are all great comments, but they don't help with the original
problem: any thoughts on why the block terminates before processing
every line of the original input file?
 

Greg Bacon

: > You say your program exits with an error, but you didn't say what
: > the error is.
:
: My fault, I should have been more precise.

Yes, precision helps in diagnosing technical problems!

Is your program exiting silently, i.e., with no error message?

You wrote that you expected files named A-Z but R is the last
file created. Looking at your logic, your code skips input lines
that don't have CR NL. Is this your intent? Could the lines with
symbols in S-Z be "hidden" in the sense that they fail the test
in the following line?

if( $ln =~ m/\r\n$/ ) {

Debugging output will help you find the problem input. I'd add
at least two warnings:

while( $ln = <IN> ) {
    if( $ln =~ s/\r\n\z/\n/ ) {
        if( $. > 1 ) {   # skip the header line
            # the rest of your code...
        }
    }
    else {
        warn "$0: $IN_FILE:$.: skipping...\n";
    }
}

warn "$0: $IN_FILE:$.: exiting...\n";

Hope this helps,
Greg
 

Martijn Lievaart

> These are all great comments, but they don't help with the original
> problem: any thoughts on why the block terminates before processing
> every line of the original input file?

Maybe go back to the good old ways of debugging: add print statements
that tell you what the program is doing. Tee this so you save it to a file
as well for later reference, or print to a logfile in the first place.

This will not tell you what is wrong, but may pinpoint the location in
the 14GB file where your program goes wrong.

HTH,
M4
 

denis.papathanasiou

> Is your program exiting silently, i.e., with no error message?

Yes, $? is 0

> You wrote that you expected files named A-Z but R is the last
> file created. Looking at your logic, your code skips input lines
> that don't have CR NL. Is this your intent? Could the lines with
> symbols in S-Z be "hidden" in the sense that they fail the test
> in the following line?
>
> if( $ln =~ m/\r\n$/ ) {

Yes, that's the intent, because if a line doesn't end in CR LF, it is
malformed and cannot be parsed further.

While it's likely that there is at least one line that fits that
description (and hence fails the $ln =~ m/\r\n$/ test), the bulk of
the S-Z data *does* end in CR LF (I verified this by running tail on
the input file).

So those lines, i.e. the S-Z lines which do end in CR LF, should not
be skipped.

> Debugging output will help you find the problem input. I'd add
> at least two warnings:
>
> while( $ln=<IN> ) {
>     if( $ln =~ s/\r\n\z/\n/ ) {
>         if( $. > 1 ) { # skip the header line
>             # the rest of your code...
>         }
>     }
>     else {
>         warn "$0: $IN_FILE:$.: skipping...\n";
>     }
> }
>
> warn "$0: $IN_FILE:$.: exiting...\n";

Thanks, I'll try that.

In the meantime, I also tried doing a head of the first 120761073
lines (split exits after processing 120761072 lines in total, which is
not the full size of the file), and it gave me an interesting error:

$ head -120761073 qte20070430 > xy.1
head: error reading `qte20070430': Input/output error
$ echo $?
1
$ tail -2 xy.1
134950345PRIG 000008192000000028000008197000000003R
PP000000001715724200 C
134950355TRIG 000008192000000052000008197000000014$

So the last line there has the problem (well-formed lines are 90 bytes
long), but my hex editor doesn't show anything unusual after the "4"
character:

offs asc hex dec oct bin
0135: 0 30 048 060 00110000
0136: 0 30 048 060 00110000
0137: 0 30 048 060 00110000
0138: 0 30 048 060 00110000
0139: 0 30 048 060 00110000
0140: 8 38 056 070 00111000
0141: 1 31 049 061 00110001
0142: 9 39 057 071 00111001
0143: 7 37 055 067 00110111
0144: 0 30 048 060 00110000
0145: 0 30 048 060 00110000
0146: 0 30 048 060 00110000
0147: 0 30 048 060 00110000
0148: 0 30 048 060 00110000
0149: 0 30 048 060 00110000
0150: 0 30 048 060 00110000
0151: 1 31 049 061 00110001
0152: 4 34 052 064 00110100

(end)
152/153 (dec)
 

denis.papathanasiou

Using the extra warnings gave me this:

$ ./split-file.pl qte20070330
../split-file.pl: qte20070330:120761073: skipping...
134950355TRIG 000008192000000052000008197000000014
$ echo $?
0

Looking at the tail end of the problem line gave me this:

offs asc hex dec oct bin
0119: 0 30 048 060 00110000
0120: 0 30 048 060 00110000
0121: 1 31 049 061 00110001
0122: 4 34 052 064 00110100
0123: 0A 010 012 00001010

The difference is that the malformed line contains a single linefeed
character (hex 0a) at the 63rd byte, whereas a normal, well-formed
line is 90 bytes long, ending in carriage return (hex 0d) plus
linefeed (hex 0a).

So it seems that the single linefeed (0a character) fools perl into
thinking that it's come to EOF, terminating the "while( $ln=<IN> )
{ }" loop.

So if that's true, how can I guard against this condition?
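One possible guard, sketched here under the assumption of the fixed 90-byte CR-LF-terminated records described above (the data is made up): validate each line's length as well as its ending, since a stray LF makes the read return two short fragments rather than one bad record.

```perl
#!/usr/bin/perl
# Sketch (assumes the fixed 90-byte CR-LF-terminated records described
# above; the data here is made up). A stray LF inside a record makes
# <$in> return two short fragments, so checking the length as well as
# the ending flags both instead of silently mis-parsing them.
use strict;
use warnings;

my $good = ("A" x 88) . "\015\012";                        # well-formed record
my $bad  = ("B" x 30) . "\012" . ("B" x 56) . "\015\012";  # stray LF inside
my $data = $good . $bad;

open(my $in, '<:raw', \$data) or die "open: $!";   # in-memory handle (perl 5.8+)
my @flagged;
while (my $ln = <$in>) {
    if (length($ln) != 90 or $ln !~ /\015\012\z/) {
        push @flagged, $.;    # record the fragment's line number
        next;
    }
    # ... process the good record here ...
}
close($in);
print "flagged fragments at lines: @flagged\n";   # prints "flagged fragments at lines: 2 3"
```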
 

Greg Bacon

: > You wrote that you expected files named A-Z but R is the last
: > file created. Looking at your logic, your code skips input lines
: > that don't have CR NL. Is this your intent? Could the lines with
: > symbols in S-Z be "hidden" in the sense that they fail the test
: > in the following line?
: >
: > if( $ln =~ m/\r\n$/ ) {
:
: Yes, that's the intent, because if a line doesn't end in CR, it is
: malformed and cannot be parsed further.

Assuming you haven't changed the value of $/ (documented in the
perlvar manpage), $ln contains newline-terminated records, so
control wouldn't reach the above conditional without a newline
at the end.

Note that your regular expression tests for a carriage return
followed by a newline at the end of $ln. Looking at the output
in a followup farther downthread, there's at least one record
that's being ignored because it doesn't have a carriage return.

You report that head(1) is failing with an I/O error. Can anyone
read the entire input? Does the following command succeed?

wc -l qte20070430

Greg
 

denis.papathanasiou

> Assuming you haven't changed the value of $/ (documented in the
> perlvar manpage), $ln contains newline-terminated records, so
> control wouldn't reach the above conditional without a newline
> at the end.
>
> Note that your regular expression tests for a carriage return
> followed by a newline at the end of $ln. Looking at the output
> in a followup farther downthread, there's at least one record
> that's being ignored because it doesn't have a carriage return.

Right, what should happen is: that line fails the regex test, so I
should see the warning.

But, and here's what I don't understand, the "while( $ln=<IN> )
{ }" loop should continue because the end of file has not been
reached.

So if the lone 0a character isn't triggering the end of that loop,
what is?

BTW, I haven't touched the value of $/ -- in fact the only code prior
to the block I pasted in the original post is just this:

#!/usr/bin/perl


#
#
# definition of necessary
# command-line arguments
#

die "\nUsage\n\tperl split-file.pl [Input file name ({file}YYYYMMDD)] [Output file path] [Output suffix]\n" unless @ARGV;

$IN_FILE = $ARGV[0];
$OUT_PATH = $ARGV[1];
$OUT_SUFFIX = $ARGV[2];

$prior_sym = '';

> You report that head(1) is failing with an I/O error. Can anyone
> read the entire input? Does the following command succeed?
>
> wc -l qte20070430

Yes, I'd tried that earlier, before using split, and here's what
happened:

$ wc -l qte20070430
wc: qte20070430: Input/output error
120781227 qte20070430
 

denis.papathanasiou

> Read 2k, analyze it, write 2k.
> Try that. There are only 2 ways it can go. Either it's not corrupt or it is.
> There are no other options. Take Perl out of the conversation; it has nothing
> to do with it apparently.

You're correct in that the file is probably corrupt, and that I'd be
better off using a simple read() or even perhaps an mmap() over chunks
of the file, and finding out what the bad byte sequence is.

However, one of the reasons to use perl for these types of tasks is
that the "while( $ln=<IN> ) { }" construct is so convenient: unlike
read() or mmap(), you don't have any additional overhead or work to
break up the data into lines.

So here's a case where the "while( $ln=<IN> ) { }" construct breaks
down.

What I'm curious to know (and that's why I posted it to a perl group)
is how to exception handle such that the "while( $ln=<IN> ) { }"
construct does not break down, and continues to EOF?

I'd thought that using "eval { }; warn $@ if $@;" would do that, but
since it didn't I'm asking here.
 

denis.papathanasiou

> <snip>
>
> Btw, you sound like a person with some experience with data.
> Why haven't you thought of this? You really think Perl is
> going to help you with this problem?
>
> You couldn't solve this problem in a thousand years
> in a thousand different languages.
>
> Find another profession ..... trash collector

LOL... someone just got their email address added to a troll list.
 

Josef Moellers

> I have a script which reads a plain text (dos) file line-by-line and
> splits it into several smaller files, based on a single attribute.
>
> The code (below) works, except when a line is malformed (i.e., the
> line contains binary or control characters), and the script just exits
> with an error:
>
> open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!
> \n"; ;
> binmode(IN);
> while( $ln=<IN> ) {
> if( $ln =~ m/\r\n$/ ) {
> $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF

I'd set $/ to "\r\n":

open(my $in, '<', $IN_FILE) or die "\n\terror: Could not open $IN_FILE: $!";
# I like that part after the \n\ ;-)
$/ = "\r\n";
my $throwawayfirstline = <$in>;  # skip the header line
# Here you could check if the header line looks like what you'd expect
while (<$in>) {
    # Process rest of data
}
close $in;

If I can't get the processing to succeed, I usually print out the first
line and stop:

while (<$in>) {
    print "$_\n"; last;
    # Process rest of data
}

Once I've figured out what's going on, I comment or delete that line.

Josef
 

Greg Bacon

: I'd thought that using "eval { }; warn $@ if $@;" would do that, but
: since it didn't I'm asking here.

Which lines did you wrap in an eval?

Greg
 

denis.papathanasiou

> : I'd thought that using "eval { }; warn $@ if $@;" would do that, but
> : since it didn't I'm asking here.
>
> Which lines did you wrap in an eval?

Initially, the entire file-handling block, i.e. from "open(IN,...) ...
close(IN);".

Since neither opening nor closing the file handle was the problem, I
tried putting the "while( $ln=<IN> ) { }" loop inside "eval{}; warn"
as well.

But the problem seems to be perl's <> construct: regardless of how I
define $/ (I used both the default and Josef's suggestion), if there's
an i/o error of any kind, the "$ln=<IN>" evaluates to false and the
loop ends.

So the way I solved the problem was to use a different file read
strategy: instead of using a line-by-line loader like <>, I load n
bytes at a time into a vector.

Since the file is fixed-width, I can treat the vector conceptually as
a 2d array and pull out any "line" I need.

Also, I can wrap the byte reads inside a condition-handler, so that
when I see the i/o error ("when", not "if" because the file *is*
corrupted), I can log the error lines, yet keep going all the way to
the end.

I wound up coding this in CL, not perl, though, because I couldn't
find any references to file reads in perl that did not involve the <>
construct, and also because the CL condition/exception handling logic
seems more robust than perl's.

If there's a way to do the same thing -- i.e., read byte blocks into a
vector, allowing for the possibility of an i/o error without stopping
-- in perl (and I'm sure there is), I'd be interested in learning how.
 

denis.papathanasiou

> open(my $in, '<', $IN_FILE) or die "\n\terror: Could not open $IN_FILE: $!";
> # I like that part after the \n\ ;-)

Ha! The unintended consequences of labeling and visible formatting
combinations!

> $/ = "\r\n";
> my $throwawayfirstline = <$in>; # skip the header line
> # Here you could check if the header line looks like what you'd expect
> while (<$in>) {
>     # Process rest of data
> }
> close $in;
>
> If I can't get the processing to succeed, I usually print out the first
> line and stop:
>
> while (<$in>) {
>     print "$_\n"; last;
>     # Process rest of data
> }
>
> Once I've figured out what's going on, I comment or delete that line.

The issue seems to be that perl's <> construct stops (i.e.,
"while (<$in>)" evaluates to false) in the event of an i/o error,
regardless of how $/ is defined.

And that's exactly what I *don't* want it to do (my last reply to Greg
has more details).
 

Bart Lateur

> But the problem seems to be perl's <> construct: regardless of how I
> define $/ (I used both the default and Josef's suggestion), if there's
> an i/o error of any kind, the "$ln=<IN>" evaluates to false and the
> loop ends.

Check to see if that last line contains a chr(26). If that's the case
and you're on Windows, use binmode on the handle. Text mode treats a
chr(26) (AKA ctrl-Z, "\cZ") as an end-of-file marker, while binary mode
does not.

Of course, then, you'll have to convert the line ends to "\n" by hand...
but you're already doing that.
 

Greg Bacon

: Check to see if that last line contains a chr(26). If that's the case
: and you're on Windows, use binmode on the handle. Text mode treats a
: chr(26) (AKA ctrl-Z, "\cZ") as an end-of-file marker, while binary mode
: does not.

Denis said Linux is the operating system.

Greg
 

Greg Bacon

: So the way I solved the problem was to use a different file read
: strategy: instead of using a line-by-line loader like <>, I load n
: bytes at a time into a vector.
:
: Since the file is fixed-width, I can treat the vector conceptually as
: a 2d array and pull out any "line" I need.

But what about the records that aren't properly terminated? Won't
that throw off your count?

: Also, I can wrap the byte reads inside a condition-handler, so that
: when I see the i/o error ("when", not "if" because the file *is*
: corrupted), I can log the error lines, yet keep going all the way to
: the end.

Are you able to actually continue reading after the I/O error?

: I wound up coding this in CL, not perl, though, because I couldn't
: find any references to file reads in perl that did not involve the <>
: construct, and also because the CL condition/exception handling logic
: seems more robust than perl's.

There are alternatives:

perldoc -f read
perldoc -f sysread

: If there's a way to do the same thing -- i.e., read byte blocks into a
: vector, allowing for the possibility of an i/o error without stopping
: -- in perl (and I'm sure there is), I'd be interested in learning how.

You might try something along the following lines:

#! /usr/bin/perl

use strict;
use warnings;

use Fcntl qw/ SEEK_SET /;

my $RECORDSZ = 20;

my $IN_FILE = $0;

open IN, "<:raw", $IN_FILE or die "$0: open: $!";

my $nrec = 0;
while (sysseek IN, $nrec * $RECORDSZ, SEEK_SET) {
    my $nread = sysread IN, my($buf), $RECORDSZ;

    if (defined $nread) {
        if ($nread == 0) {
            exit 0; # eof
        }
        else {
            $buf =~ s{([^[:graph:] ])} {
                "<" . sprintf("%02X", ord $1) . ">"
            }ge;

            print "$nrec: $buf\n";
        }
    }
    else {
        warn "$0: $IN_FILE:$nrec: sysread: $!";
    }

    ++$nrec;
}

die "$0: sysseek: $!";

When run (against itself, which you'll need to change), it gives

0: #! /usr/bin/perl<0A><0A>us
1: e strict;<0A>use warnin
2: gs;<0A><0A>use Fcntl qw/ S
3: EEK_SET /;<0A><0A>my $RECO
4: RDSZ = 20;<0A><0A>my $IN_F
5: ILE = $0;<0A><0A>open IN,
6: "<:raw", $IN_FILE or
7: die "$0: open: $!";
8: <0A><0A>my $nrec = 0;<0A>whil
9: e (sysseek IN, $nrec
10: * $RECORDSZ, SEEK_S
11: ET) {<0A> my $nread =
12: sysread IN, my($buf)
13: , $RECORDSZ;<0A><0A> if (
14: defined $nread) {<0A>
15: if ($nread == 0) {
16: <0A> exit 0; # eof
17: <0A> }<0A> else {<0A>
18: $buf =~ s{([^[:g
19: raph:] ])} {<0A>
20: "<" . sprintf("%02X
21: ", ord $1) . ">"<0A>
22: }ge;<0A><0A> print
23: "$nrec: $buf\n";<0A>
24: }<0A> }<0A> else {<0A>
25: warn "$0: $IN_FILE:
26: $nrec: sysread: $!";
27: <0A> }<0A><0A> ++$nrec;<0A>}<0A><0A>
28: die "$0: sysseek: $!
29: ";<0A>

I'm interested in knowing whether this approach allows processing
to continue after the I/O error.

Keep in mind that you're hitting a lower-level failure than malformed
data: the filesystem is failing to provide data.

Hope this helps,
Greg
 
