Parsing a text file line-by-line: skipping badly-formed lines?

  • Thread starter denis.papathanasiou

denis.papathanasiou

But what about the records that aren't properly terminated? Won't
that throw off your count?

If the read fails, the exception handling returns null for the entire
array of n bytes.

Likewise, if the read succeeds, the array is valid, i.e. all lines
within the data block are of the same size (and the routine that picks
out lines from the array does further checking).

So it brings up a trade-off in sizing the array for reads: too large
and you lose uncorrupted parts of the file (but traverse the file
quickly), versus too small and traversal takes forever (but you
minimize the loss of uncorrupted data).

It's not perfect, and I'll keep thinking up possible improvements over
time (fortunately, it's not a problem which happens very often).
You might try something along the following lines:

#! /usr/bin/perl

use strict;
use warnings;

use Fcntl qw/ SEEK_SET /;

my $RECORDSZ = 20;

my $IN_FILE = $0;    # for demonstration, read this script itself

open IN, "<:raw", $IN_FILE or die "$0: open: $!";

# Seek to each record by number, so one failed read doesn't stop the
# walk through the file.
my $nrec = 0;
while (sysseek IN, $nrec * $RECORDSZ, SEEK_SET) {
    my $nread = sysread IN, my($buf), $RECORDSZ;

    if (defined $nread) {
        if ($nread == 0) {
            exit 0;    # eof
        }
        else {
            # escape non-printable bytes as <XX> hex
            $buf =~ s{([^[:graph:] ])} {
                "<" . sprintf("%02X", ord $1) . ">"
            }ge;

            print "$nrec: $buf\n";
        }
    }
    else {
        warn "$0: $IN_FILE:$nrec: sysread: $!";
    }

    ++$nrec;
}

die "$0: sysseek: $!";

Thanks for suggesting it; I'll definitely give it a try tomorrow.

My first (quick) impression is that the while loop should not be tied
to the file handle, because of how perl seems to close or invalidate
the file handle at the first sign of i/o trouble.

So it might be better to read the byte size of the file with stat(),
and use that value to iterate (read) n bytes at a time (that's what I
do in CL).

The other potential problem is catching exceptions when reading the
corrupted section; I think "eval { }; warn;" is supposed to do that
in perl, but I've not had success getting it to work like an
exception handler in CL.

So regardless of how I wind up iterating through the file, if I can't
handle the bad read and maintain control, it won't work.

But that's just a guess based on a quick read; I'll experiment with it
and find out what really happens.
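A sketch of that stat()-driven approach (hedged: the sub name, block
size, skip distance, and callback are my own illustrative choices, and
I haven't run this against an actually failing disk). The loop is
bounded by the file size rather than by the handle, so a failed
sysread just warns, jumps ahead, and reopens instead of ending the
iteration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Fcntl qw/ SEEK_SET /;

# read_with_skips($file, $blocksz, $skip, $cb) -- iterate over $file in
# $blocksz-byte chunks, calling $cb->($buf, $offset) for each chunk.
# The loop is driven by the file size from stat(), not by the handle,
# so a failed sysread warns and jumps $skip bytes ahead (reopening the
# handle) rather than stopping.  Returns the final offset.
sub read_with_skips {
    my ($file, $blocksz, $skip, $cb) = @_;

    my $size = (stat $file)[7];
    defined $size or die "$0: stat $file: $!";

    my $offset = 0;
    while ($offset < $size) {
        open my $in, '<:raw', $file or die "$0: open $file: $!";
        sysseek $in, $offset, SEEK_SET or die "$0: sysseek: $!";

        my $nread = sysread $in, my($buf), $blocksz;
        if (defined $nread) {
            last if $nread == 0;    # eof
            $cb->($buf, $offset);
            $offset += $nread;
        }
        else {
            warn "$0: $file: read error at offset $offset: $!";
            $offset += $skip;       # jump past the (presumed) bad region
        }
        close $in;
    }
    return $offset;
}
```

Reopening on every pass is the belt-and-braces version of "store the
offset, reopen, seek past it"; whether it's actually necessary depends
on how perl flags the handle after the error.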
 

Joe Smith

Yes, I'd tried that earlier, before using split, and here's what
happened:

$ wc -l qte20070430
wc: qte20070430: Input/output error
120781227 qte20070430

When using "wc qte20070430", is the character count either
2147483647 or 4294967295 ?

If not, try:
cat qte20070430 >/dev/null || echo "disk file error"
sed 's/a/a/' qte20070430 >/dev/null || echo "disk file error"
dd if=qte20070430 of=/dev/null || echo "disk file error"
dmesg | tail; tail /var/log/messages
 

Joe Smith

The issue seems to be that perl's <> construct stops (i.e.,
"while (<$in>)" evaluates to false) in the event of an i/o error,
regardless of how $/ is defined.

And that's exactly what I *don't* want it to do.

What does eof($in) return after the while() stops?

If you attempt to read one more line from <$in>, does it
return undef, or a line from the file, or does it switch
to reading from STDIN?

-Joe
 

denis.papathanasiou

When using "wc qte20070430", is the character count either
2147483647 or 4294967295 ?

Neither:

$ wc qte20070430
wc: qte20070430: Input/output error
120760792 518542177 10989232128 qte20070430
If not, try:
cat qte20070430 >/dev/null || echo "disk file error"
sed 's/a/a/' qte20070430 >/dev/null || echo "disk file error"
dd if=qte20070430 of=/dev/null || echo "disk file error"
dmesg | tail; tail /var/log/messages

Here they are; except for the dmesg output, there's nothing new to
report:

$ cat qte20070430 >/dev/null || echo "disk file error"
cat: qte20070430: Input/output error
disk file error

$ sed 's/a/a/' qte20070430 >/dev/null || echo "disk file error"
sed: read error on qte20070430: Input/output error
disk file error

$ dd if=qte20070430 of=/dev/null || echo "disk file error"
dd: reading `qte20070430': Input/output error
21463392+0 records in
21463392+0 records out
10989256704 bytes transferred in 400.547484 seconds (27435590 bytes/sec)
disk file error

$ dmesg | tail; tail /var/log/messages
end_request: I/O error, dev 03:01 (hda), sector 6690920
hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest
Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6690987,
high=0, low=6690987, sector=6690832
end_request: I/O error, dev 03:01 (hda), sector 6690832
hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest
Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6690987,
high=0, low=6690987, sector=6690840
end_request: I/O error, dev 03:01 (hda), sector 6690840
hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest
Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6690987,
high=0, low=6690987, sector=6690920
end_request: I/O error, dev 03:01 (hda), sector 6690920
May 18 08:25:39 localhost kernel: end_request: I/O error, dev 03:01
(hda), sector 6690920
May 18 08:45:50 localhost kernel: hda: dma_intr: status=0x59
{ DriveReady SeekComplete DataRequest Error }
May 18 08:45:52 localhost kernel: hda: dma_intr: error=0x40
{ UncorrectableError }, LBAsect=6690987, high=0, low=6690987,
sector=6690832
May 18 08:45:52 localhost kernel: end_request: I/O error, dev 03:01
(hda), sector 6690832
May 18 08:45:52 localhost kernel: hda: dma_intr: status=0x59
{ DriveReady SeekComplete DataRequest Error }
May 18 08:45:52 localhost kernel: hda: dma_intr: error=0x40
{ UncorrectableError }, LBAsect=6690987, high=0, low=6690987,
sector=6690840
May 18 08:45:52 localhost kernel: end_request: I/O error, dev 03:01
(hda), sector 6690840
May 18 08:52:31 localhost kernel: hda: dma_intr: status=0x59
{ DriveReady SeekComplete DataRequest Error }
May 18 08:52:33 localhost kernel: hda: dma_intr: error=0x40
{ UncorrectableError }, LBAsect=6690987, high=0, low=6690987,
sector=6690920
May 18 08:52:33 localhost kernel: end_request: I/O error, dev 03:01
(hda), sector 6690920

I'm not sure if this means the error is related to the file or to the
(physical) disk itself.

But either way, it's a problem when trying to do a read at that point
in perl.
 

denis.papathanasiou

What does eof($in) return after the while() stops?

If you attempt to read one more line from <$in>, does it
return undef, or a line from the file, or does it switch
to reading from STDIN?

It's undef; basically, in perl, the moment it reaches the file
corruption point (regardless of whether I'm using read or <> for
line-at-a-time input), the file handle is gone.

So unlike CL (which lets me trap the exception while keeping the file
handle open so I can skip past it), there doesn't seem to be a way to
get beyond the point of the corruption.
 

Greg Bacon

: So unlike CL (which lets me trap the exception while keeping the file
: handle open so I can skip past it), there doesn't seem to be a way to
: get beyond the point of the corruption.

Did you run my code that used seek calls?

Greg
 

denis.papathanasiou

Did you run my code that used seek calls?

Yes; just like using <>, the moment the seek got to the corrupted
part, the file handle was lost.

And I couldn't seek past that point: because the file handle would
close, I tried storing the offset, reopening the file handle, and
seeking past the offset, but that didn't work either.

I suspect it's because I don't know the size of the corrupted area, so
just seeking to offset+n, unless n is large enough that it puts me in
a clear patch, would just get me an undefined file handle again.

It also occurred to me that I could increment offset by 1, and keep
reopening the file handle until I was clear of the corruption, but it
didn't seem like a good way of doing it.
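Rather than reopening the file at offset+1 over and over, one thing
that may be worth trying (an untested assumption on my part -- I
haven't run it against a failing disk): IO::Handle's clearerr method
resets a handle's error and eof flags, which might let you sysseek
past the bad region on the same handle. A sketch, with the skip
distance an illustrative guess:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use IO::Handle;              # for clearerr()
use Fcntl qw/ SEEK_SET /;

# read_skipping($in, $blocksz, $skip, $cb) -- read $in in $blocksz-byte
# chunks, passing each chunk to $cb; on a failed sysread, clear the
# handle's error/eof flags and seek $skip bytes past the failure point
# instead of giving up.  Returns the final offset.
sub read_skipping {
    my ($in, $blocksz, $skip, $cb) = @_;

    my $offset = 0;
    while (1) {
        my $nread = sysread $in, my($buf), $blocksz;
        if (defined $nread) {
            last if $nread == 0;            # genuine eof
            $cb->($buf);
            $offset += $nread;
        }
        else {
            warn "$0: read error at offset $offset: $!";
            $in->clearerr;                  # reset error/eof flags
            $offset += $skip;               # guess at the bad region's width
            sysseek $in, $offset, SEEK_SET or die "$0: sysseek: $!";
        }
    }
    return $offset;
}
```

On a clean file the error branch is never taken; whether clearerr
actually revives the handle after a hard EIO is exactly the thing to
experiment with.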
 

Peter J. Holzer

$ wc qte20070430
wc: qte20070430: Input/output error [...]

$ cat qte20070430 >/dev/null || echo "disk file error"
cat: qte20070430: Input/output error

If you get I/O errors, you don't have "badly-formed lines", you have a
hardware problem. Buy a new hard disk and restore your last working
backup (You do have a backup, right?)

You may be able to use dd_rescue to salvage parts of the file, but a 14
GB file with some random parts replaced with zeros probably isn't that
useful.
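If dd_rescue isn't to hand, plain dd can also be told to keep going
past read errors: conv=noerror continues after a failure and conv=sync
pads each short or failed block with NULs, so the surviving data stays
block-aligned in the copy. (The demo below runs on a scratch file; on
the real disk the input would be qte20070430, and the output filename
is just an example.)

```shell
# Demo on a scratch file; for the real salvage: if=qte20070430.
printf 'some recoverable data' > demo.in

# conv=noerror keeps going after read errors; conv=sync pads each
# failed or short 512-byte block with NULs so offsets stay aligned.
dd if=demo.in of=demo.salvaged bs=512 conv=noerror,sync 2>/dev/null

wc -c < demo.salvaged    # 512: one padded block
```

With fixed-size records, the zero padding at least keeps later records
at predictable offsets instead of shifting everything after the bad
sectors.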

hp
 
