Parsing a text file line-by-line: skipping badly-formed lines?

Discussion in 'Perl Misc' started by denis.papathanasiou@gmail.com, May 14, 2007.

  1. Guest

    I have a script which reads a plain text (dos) file line-by-line and
    splits it into several smaller files, based on a single attribute.

    The code (below) works, except when a line is malformed (i.e., the
    line contains binary or control characters), and the script just exits
    with an error:

    open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!\n"; ;
    binmode(IN);
    while( $ln=<IN> ) {
        if( $ln =~ m/\r\n$/ ) {
            $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF
            if( $. > 0 ) { # skip the header line
                $sym = substr($ln, 10, 16);
                $sym =~ s/ //g;
                if( $prior_sym ne $sym ) {
                    if( $prior_sym ne '' ) { close(OUT); }
                    $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
                    open(OUT, ">$sym_file") or die "\n\terror: Could not write to $sym_file $!\n";
                    binmode(OUT);
                }
                print OUT $ln;
                $prior_sym = $sym ;
            }
        }
    }
    close(IN);

    What I'd like it to do, instead, is if it hits a bad line, write a
    warning and keep going to the end of the file.

    I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
    that doesn't trap the error; even with eval/warn, a bad line will
    cause the script to exit.

    Is there a better way of doing this?
     
    , May 14, 2007
    #1

  2. Greg Bacon Guest

    In article <>,
    <> wrote:

    : I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
    : that doesn't trap the error; even with eval/warn, a bad line will
    : cause the script to exit.

    You say your program exits with an error, but you didn't say what
    the error is.

    What's the error? What version of perl are you using? What's your
    operating system?

    Your chances of receiving a helpful reply are even better if you can
    provide input that causes the problem. Yes, transmitting non-printable
    characters on Usenet is a pain, so uuencode the input or write a Perl
    program that can recreate it!

    Greg
    --
    When buying and selling are controlled by legislation, the first
    things to be bought and sold are legislators.
    -- P. J. O'Rourke
     
    Greg Bacon, May 14, 2007
    #2

  3. Guest


    > You say your program exits with an error, but you didn't say what
    > the error is.


    My fault, I should have been more precise.

    $? actually returns 0 but I know that is incorrect because the output
    is not as expected.

    The large text file contains data from "A" to "Z", so a successful run
    would result in 26 smaller files.

    But the output we get stops at "R", so either one of the "R" lines (or
    possibly the start of the "S" data) is malformed.

    > What's the error? What version of perl are you using? What's your
    > operating system?


    $ perl -v
    This is perl, v5.8.4 built for i386-linux-thread-multi

    $ uname -sro
    Linux 2.4.27-2-386 GNU/Linux

    > Your chances of receiving a helpful reply are even better if you can
    > provide input that causes the problem. Yes, transmitting non-printable
    > characters on Usenet is a pain, so uuencode the input or write a Perl
    > program that can recreate it!


    Getting to the exact line with the problem has been surprisingly
    difficult: the input file is 14 gb in size, which is too big for the
    hex editor we use (shed).

    I've also tried split to break up the file into smaller chunks, so I
    can load the "R" or "S" chunk into shed and look at the line, but
    split suffers the same problem, i.e. it only gets so far through the
    original file before it quits, leaving the "S" to "Z" range unsplit.

    I'd also thought it might have to do with the $. variable (perhaps at
    14 gb, it exceeds perl's ability to count that high?), but removing
    that logic in my script didn't change the result.
     
    , May 14, 2007
    #3
  4. wrote:
    > I have a script which reads a plain text (dos) file line-by-line and
    > splits it into several smaller files, based on a single attribute.
    >
    > The code (below) works, except when a line is malformed (i.e., the
    > line contains binary or control characters), and the script just exits
    > with an error:
    >
    > open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!


    perldoc -q quoting

    Also, you should get into the habit of using the three argument form of open:

    open IN, '<', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";


    > \n"; ;
    > binmode(IN);


    You can also incorporate that into the open statement:

    open IN, '<:raw', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";


    > while( $ln=<IN> ) {
    > if( $ln =~ m/\r\n$/ ) {
    > $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF


    You don't need to match the same pattern twice:

    if ( $ln =~ s/\r\n$/\n/ ) {

    Or more portable and correct:

    if ( $ln =~ s/\015\012\z/\n/ ) {


    > if( $. > 0 ) { # skip the header line


    $. is 1 once the first line has been read, so inside the loop it is
    *always* greater than 0 (unless you explicitly change it). To skip the
    header line, the test would have to be $. > 1.
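
    A quick way to check this, sketched with an in-memory filehandle. The
    sample data and the helper name lines_passing are made up for
    illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data: one header line followed by two records.
my $data = "HEADER\r\nrow one\r\nrow two\r\n";

# Return the lines for which the given test on $. (current line number) is true.
sub lines_passing {
    my ($test) = @_;
    open(my $in, '<', \$data) or die "open: $!";
    my @kept;
    while (my $ln = <$in>) {
        push @kept, $ln if $test->($.);
    }
    close($in);
    return @kept;
}

my @gt0 = lines_passing(sub { $_[0] > 0 });   # keeps every line, header included
my @gt1 = lines_passing(sub { $_[0] > 1 });   # drops only the first line

printf "\$. > 0 keeps %d lines, \$. > 1 keeps %d lines\n",
    scalar @gt0, scalar @gt1;
```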


    > $sym = substr($ln, 10, 16);
    > $sym =~ s/ //g;


    Use the three argument open() so you won't have to worry about whitespace in
    the file name. However there are other characters that are not valid in a
    file name that you should remove such as "\0" and '/'.

    $sym =~ tr!\0/!!d


    > if( $prior_sym ne $sym ) {
    > if( $prior_sym ne '' ) { close(OUT); }
    > $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
    > open(OUT, ">$sym_file") or die "\n\terror: Could not write to


    open OUT, '>:raw', $sym_file or die "\n\terror: Could not write to
    $sym_file $!\n";


    > $sym_file $!\n";
    > binmode(OUT);
    > }
    > print OUT $ln;
    > $prior_sym = $sym ;
    > }
    > }
    > }
    > close(IN);
    >
    > What I'd like it to do, instead, is if it hits a bad line, write a
    > warning and keep going to the end of the file.
    >
    > I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
    > that doesn't trap the error; even with eval/warn, a bad line will
    > cause the script to exit.
    >
    > Is there a better way of doing this?



    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, May 14, 2007
    #4
  5. Guest


    > perldoc -q quoting
    >
    > Also, you should get into the habit of using the three argument form of open:
    >
    > open IN, '<', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";
    >
    > > \n"; ;
    > > binmode(IN);

    >
    > You can also incorporate that into the open statement:
    >
    > open IN, '<:raw', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";


    Thanks for the suggestion; I've been working with an old template, and
    since it was functional, I never bothered to make it more idiomatic.

    > > while( $ln=<IN> ) {
    > > if( $ln =~ m/\r\n$/ ) {
    > > $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF

    >
    > You don't need to match the same pattern twice:
    >
    > if ( $ln =~ s/\r\n$/\n/ ) {
    >
    > Or more portable and correct:
    >
    > if ( $ln =~ s/\015\012\z/\n/ ) {


    I'm guilty of some spaghetti there: the dos2unix line was added later,
    and I just stuck it in there w/o thinking about the statement before
    it.

    > > if( $. > 0 ) { # skip the header line

    >
    > $. starts out at 1 so it is *always* greater than 0 (unless you explicitly
    > change it.)


    Really? If I leave that statement out, it winds up processing the
    first line, but when it's there, it skips the first line.

    > > $sym = substr($ln, 10, 16);
    > > $sym =~ s/ //g;

    >
    > Use the three argument open() so you won't have to worry about whitespace in
    > the file name. However there are other characters that are not valid in a
    > file name that you should remove such as "\0" and '/'.
    >
    > $sym =~ tr!\0/!!d
    >
    > > if( $prior_sym ne $sym ) {
    > > if( $prior_sym ne '' ) { close(OUT); }
    > > $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
    > > open(OUT, ">$sym_file") or die "\n\terror: Could not write to

    >
    > open OUT, '>:raw', $sym_file or die "\n\terror: Could not write to
    > $sym_file $!\n";


    These are all great comments, but they don't help with the original
    problem: any thoughts on why the block terminates before processing
    every line of the original input file?
     
    , May 14, 2007
    #5
  6. Greg Bacon Guest

    In article <>,
    <> wrote:

    : > You say your program exits with an error, but you didn't say what
    : > the error is.
    :
    : My fault, I should have been more precise.

    Yes, precision helps in diagnosing technical problems!

    Is your program exiting silently, i.e., with no error message?

    You wrote that you expected files named A-Z but R is the last
    file created. Looking at your logic, your code skips input lines
    that don't have CR NL. Is this your intent? Could the lines with
    symbols in S-Z be "hidden" in the sense that they fail the test
    in the following line?

    if( $ln =~ m/\r\n$/ ) {

    Debugging output will help you find the problem input. I'd add
    at least two warnings:

    while( $ln=<IN> ) {
        if( $ln =~ s/\r\n\z/\n/ ) {
            if( $. > 1 ) { # skip the header line
                # the rest of your code...
            }
        }
        else {
            warn "$0: $IN_FILE:$.: skipping...\n";
        }
    }

    warn "$0: $IN_FILE:$.: exiting...\n";

    Hope this helps,
    Greg
    --
    (As far as I can see, it is always a man who makes the [Faustian] agreement.
    A woman is more likely to be the contract's benefit than its negotiator.
    The assumption is that Old Slewfoot fully controls her. Obviously, the
    story is literature.) -- Gary North
     
    Greg Bacon, May 14, 2007
    #6
  7. On Mon, 14 May 2007 12:42:00 -0700, denis.papathanasiou wrote:

    > These are all great comments, but they don't help with the original
    > problem: any thoughts on why the block terminates before processing
    > every line of the original input file?


    Maybe go back to the good old ways of debugging: add print statements
    that tell what the program is doing. Tee this so you save it to a file as
    well for later reference, or print to a logfile in the first place.

    This will not tell you what is wrong, but may pinpoint the location in
    the 14GB file where your program goes wrong.

    HTH,
    M4
     
    Martijn Lievaart, May 14, 2007
    #7
  8. Guest


    > Is your program exiting silently, i.e., with no error message?


    Yes, $? is 0

    > You wrote that you expected files named A-Z but R is the last
    > file created. Looking at your logic, your code skips input lines
    > that don't have CR NL. Is this your intent? Could the lines with
    > symbols in S-Z be "hidden" in the sense that they fail the test
    > in the following line?
    >
    > if( $ln =~ m/\r\n$/ ) {


    Yes, that's the intent, because if a line doesn't end in CR, it is
    malformed and cannot be parsed further.

    While it's likely that there is at least one line that fits that
    description (and hence fails the $ln =~ m/\r\n$/ test), the bulk of
    the S-Z data *does* end in CR (I verified this by doing a tail on the
    input file).

    So those lines, i.e. the S-Z lines which do end in CR should not be
    skipped.

    > Debugging output will help you find the problem input. I'd add
    > at least two warnings:
    >
    > while( $ln=<IN> ) {
    > if( $ln =~ s/\r\n\z/\n/ ) {
    > if( $. > 1 ) { # skip the header line
    > # the rest of your code...
    > }
    > else {
    > warn "$0: $IN_FILE:$.: skipping...\n";
    > }
    > }
    >
    > warn "$0: $IN_FILE:$.: exiting...\n";
    >


    Thanks, I'll try that.

    In the meantime, I also tried doing a head of the first 120761073
    lines (split exits after processing 120761072 lines in total, which is
    not the full size of the file), and it gave me an interesting error:

    $ head -120761073 qte20070430 > xy.1
    head: error reading `qte20070430': Input/output error
    $ echo $?
    1
    $ tail -2 xy.1
    134950345PRIG 000008192000000028000008197000000003R
    PP000000001715724200 C
    134950355TRIG 000008192000000052000008197000000014$

    So the last line there has the problem (well-formed lines are 90 bytes
    long), but my hex editor doesn't show anything unusual after the "4"
    character:

    offs asc hex dec oct bin
    0135: 0 30 048 060 00110000
    0136: 0 30 048 060 00110000
    0137: 0 30 048 060 00110000
    0138: 0 30 048 060 00110000
    0139: 0 30 048 060 00110000
    0140: 8 38 056 070 00111000
    0141: 1 31 049 061 00110001
    0142: 9 39 057 071 00111001
    0143: 7 37 055 067 00110111
    0144: 0 30 048 060 00110000
    0145: 0 30 048 060 00110000
    0146: 0 30 048 060 00110000
    0147: 0 30 048 060 00110000
    0148: 0 30 048 060 00110000
    0149: 0 30 048 060 00110000
    0150: 0 30 048 060 00110000
    0151: 1 31 049 061 00110001
    0152: 4 34 052 064 00110100

    (end)
    152/153 (dec)
     
    , May 14, 2007
    #8
  9. Guest

    Using the extra warnings gave me this:

    $ ./split-file.pl qte20070330
    ./split-file.pl: qte20070330:120761073: skipping...
    134950355TRIG 000008192000000052000008197000000014
    $ echo $?
    0

    Looking at the tail end of the problem line gave me this:

    offs asc hex dec oct bin
    0119: 0 30 048 060 00110000
    0120: 0 30 048 060 00110000
    0121: 1 31 049 061 00110001
    0122: 4 34 052 064 00110100
    0123: 0A 010 012 00001010

    The difference is that the malformed line ends in a single linefeed
    character (hex 0a) at the 63rd byte, whereas a normal/well-formed line
    is 90 bytes long, ending in carriage return (hex 0d) plus linefeed
    (hex 0a).

    So it seems that the single linefeed (0a character) fools perl into
    thinking that it's come to EOF, terminating the "while( $ln=<IN> )
    { }" loop.

    So if that's true, how can I guard against this condition?
     
    , May 14, 2007
    #9
  10. Greg Bacon Guest

    In article <>,
    <> wrote:

    : > You wrote that you expected files named A-Z but R is the last
    : > file created. Looking at your logic, your code skips input lines
    : > that don't have CR NL. Is this your intent? Could the lines with
    : > symbols in S-Z be "hidden" in the sense that they fail the test
    : > in the following line?
    : >
    : > if( $ln =~ m/\r\n$/ ) {
    :
    : Yes, that's the intent, because if a line doesn't end in CR, it is
    : malformed and cannot be parsed further.

    Assuming you haven't changed the value of $/ (documented in the
    perlvar manpage), $ln contains newline-terminated records, so
    control wouldn't reach the above conditional without a newline
    at the end.

    Note that your regular expression tests for a carriage return
    followed by a newline at the end of $ln. Looking at the output
    in a followup farther downthread, there's at least one record
    that's being ignored because it doesn't have a carriage return.
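
    To illustrate with made-up data (a minimal sketch using an in-memory
    filehandle, not your real file): a record ending in a bare LF is still
    returned by <>, fails the CR LF test, and the loop carries on to the
    next record.

```perl
use strict;
use warnings;

# Made-up records: the second one ends in a bare LF instead of CR LF.
my $data = "good one\r\nbad\ngood two\r\n";

open(my $in, '<', \$data) or die "open: $!";

my (@good, @bad);
while (my $ln = <$in>) {
    if ($ln =~ s/\015\012\z/\n/) {
        push @good, $ln;     # well-formed: CR LF converted to LF
    }
    else {
        push @bad, $.;       # malformed: remember the line number
        warn "line $.: no CR LF terminator, skipping\n";
    }
}
close($in);

print "good: ", scalar(@good), ", bad at line(s): @bad\n";
```

    The lone LF only shortens that one record; it does not look like end
    of file to <>.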

    You report that head(1) is failing with an I/O error. Can anyone
    read the entire input? Does the following command succeed?

    wc -l qte20070430

    Greg
    --
    "Unsustainable," say economists.
    "Bubble," say the sourpusses.
    "Buy," say the lumpeninvestoriat.
    -- Bill Bonner
     
    Greg Bacon, May 15, 2007
    #10
  11. Guest


    > Assuming you haven't changed the value of $/ (documented in the
    > perlvar manpage), $ln contains newline-terminated records, so
    > control wouldn't reach the above conditional without a newline
    > at the end.
    >
    > Note that your regular expression tests for a carriage return
    > followed by a newline at the end of $ln. Looking at the output
    > in a followup farther downthread, there's at least one record
    > that's being ignored because it doesn't have a carriage return.


    Right, what should happen is: that line fails the regex test, so I
    should see the warning.

    But, and here's what I don't understand, the "while( $ln=<IN> )
    { }" loop should continue because the end of file has not been
    reached.

    So if the lone 0a character isn't triggering the end of that loop,
    what is?

    BTW, I haven't touched the value of $/ -- in fact the only code prior
    to the block I pasted in the original post is just this:

    #!/usr/bin/perl


    #
    #
    # definition of necessary
    # command-line arguments
    #

    die "\nUsage\n\tperl split-file.pl [Input file name ({file}YYYYMMDD)]
    [Output file path] [Output suffix]\n" unless @ARGV ;

    $IN_FILE = $ARGV[0];
    $OUT_PATH = $ARGV[1];
    $OUT_SUFFIX = $ARGV[2];

    $prior_sym = '';


    > You report that head(1) is failing with an I/O error. Can anyone
    > read the entire input? Does the following command succeed?
    >
    > wc -l qte20070430


    Yes, I'd tried that earlier, before using split, and here's what
    happened:

    $ wc -l qte20070430
    wc: qte20070430: Input/output error
    120781227 qte20070430
     
    , May 15, 2007
    #11
  12. Guest


    > Read 2k, analyze it, write 2k.
    > Try that. There is only 2 ways it can go. Either its not corrupt or it is.
    > There are no other options. Take Perl out of the conversation, it has nothing to
    > do with it apparently.


    You're correct in that the file is probably corrupt, and that I'd be
    better off using a simple read() or even perhaps an mmap() over chunks
    of the file, and finding out what the bad byte sequence is.

    However, one of the reasons to use perl for these types of tasks is
    that the "while( $ln=<IN> ) { }" construct is so convenient: unlike
    read() or mmap(), you don't have any additional overhead or work to
    break up the data into lines.

    So here's a case where the "while( $ln=<IN> ) { }" construct breaks
    down.

    What I'm curious to know (and that's why I posted it to a perl group)
    is how to handle the error so that the "while( $ln=<IN> ) { }"
    construct does not break down and continues to EOF.

    I'd thought that using "eval { }; warn $@ if $@;" would do that, but
    since it didn't I'm asking here.
     
    , May 15, 2007
    #12
  13. Guest

    On May 15, 12:37 am, wrote:
    > <snip>
    >
    > Btw, you sound like a person with some experience with data.
    > Why haven't you thought of this? You really think Perl is
    > going to help you with this problem?
    >
    > You couldn't solve this problem in a thousand years
    > in a thousand different languages.
    >
    > Find another profession ..... trash collector


    LOL... someone just got their email address added to a troll list.
     
    , May 15, 2007
    #13
  14. wrote:
    > I have a script which reads a plain text (dos) file line-by-line and
    > splits it into several smaller files, based on a single attribute.
    >
    > The code (below) works, except when a line is malformed (i.e., the
    > line contains binary or control characters), and the script just exits
    > with an error:
    >
    > open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!
    > \n"; ;
    > binmode(IN);
    > while( $ln=<IN> ) {
    > if( $ln =~ m/\r\n$/ ) {
    > $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF


    I'd set $/ to "\r\n":

    open(my $in, '<', $IN_FILE) or die "\n\terror: Could not open $IN_FILE: $!";
    # I like that part after the \n\ ;-)
    $/ = "\r\n";
    my $thowawayfirstline = <$in>; # skip the header line
    # Here you could check if the header line looks like what you'd expect
    while (<$in>) {
        # Process rest of data
    }
    close $in;

    If I can't get the processing to succeed, I usually print out the first
    line and stop:

    while (<$in>) {
        print "$_\n"; last
        # Process rest of data
    }

    Once I've figured out what's going on, I comment or delete that line.
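
    One thing to watch for with this approach, sketched on made-up data: a
    record that ends in a bare LF is no longer split off on its own but is
    glued to the record after it, so a length check is one way to spot it.
    The 8-byte record length and the sample records here are hypothetical.

```perl
use strict;
use warnings;

my $RECLEN = 8;   # hypothetical fixed record length, excluding the CR LF

# Made-up data: header, one good record, one record cut short by a bare LF
# (and therefore glued to the record after it), then another good record.
my $data = "HEADER..\r\nAAAAAAAA\r\nBBB\nCCCCCCCC\r\nDDDDDDDD\r\n";

open(my $in, '<', \$data) or die "open: $!";

my (@ok, @suspect);
{
    local $/ = "\r\n";            # read CR LF terminated records
    my $header = <$in>;           # skip the header line
    while (my $rec = <$in>) {
        chomp $rec;               # chomp removes the current $/, i.e. CR LF
        if (length($rec) == $RECLEN) {
            push @ok, $rec;
        }
        else {
            push @suspect, $.;    # wrong length: log it and keep going
        }
    }
}
close($in);

print "ok: @ok; suspect record number(s): @suspect\n";
```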

    Josef
    --
    These are my personal views and not those of Fujitsu Siemens Computers!
    Josef Möllers (Pinguinpfleger bei FSC)
    If failure had no penalty success would not be a prize (T. Pratchett)
    Company Details: http://www.fujitsu-siemens.com/imprint.html
     
    Josef Moellers, May 15, 2007
    #14
  15. Greg Bacon Guest

    In article <>,
    <> wrote:

    : I'd thought that using "eval { }; warn $@ if $@;" would do that, but
    : since it didn't I'm asking here.

    Which lines did you wrap in an eval?

    Greg
    --
    Until these powers are restored -- and the Fed, the income tax, and
    the Seventeenth Amendment abolished -- Americans have no hope of ever
    returning to a regime of constitutional liberty.
    -- Thomas DiLorenzo
     
    Greg Bacon, May 15, 2007
    #15
  16. Guest


    > : I'd thought that using "eval { }; warn $@ if $@;" would do that, but
    > : since it didn't I'm asking here.
    >
    > Which lines did you wrap in an eval?


    Initially, the entire file-handling block, i.e. from "open(IN,...) ...
    close(IN);".

    Since neither opening nor closing the file handle was the problem, I
    tried putting the "while( $ln=<IN> ) { }" loop inside "eval{}; warn"
    as well.

    But the problem seems to be perl's <> construct: regardless of how I
    define $/ (I used both the default and Josef's suggestion), if there's
    an i/o error of any kind, the "$ln=<IN>" evaluates to false and the
    loop ends.
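
    That would also explain why the eval did not help: <> does not die on
    a read error, it returns undef, just as it does at end of file, so
    there is nothing for eval to catch. A minimal sketch of telling the two
    cases apart after the loop, assuming the failed read sets the handle's
    error flag as reported by IO::Handle's error() method (read_all_lines
    and the temporary file are made up for illustration):

```perl
use strict;
use warnings;
use IO::Handle;                 # provides the error() method on filehandles
use File::Temp qw(tempfile);

# Read every line from $path and report whether the loop ended at a real
# EOF or because a read failed.
# Returns (lines_read, error_message_or_undef).
sub read_all_lines {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "open $path: $!";

    my $count = 0;
    $! = 0;                     # clear any stale error before reading
    while (defined(my $ln = <$in>)) {
        $count++;
    }
    # <> gives undef both at EOF and on a read error,
    # so check the handle's error flag to tell them apart.
    my $err = $in->error ? "read failed after line $count: $!" : undef;
    close($in);
    return ($count, $err);
}

# Usage on a small stand-in file (three CR LF terminated lines).
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
print {$tmp} "a\r\nb\r\nc\r\n";
close($tmp);

my ($n, $err) = read_all_lines($tmpname);
warn "$err\n" if defined $err;
print "read $n line(s)\n";
```

    This only reports where the read stopped; it does not by itself get
    past the unreadable part of the file.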

    So the way I solved the problem was to use a different file read
    strategy: instead of using a line-by-line loader like <>, I load n
    bytes at a time into a vector.

    Since the file is fixed-width, I can treat the vector conceptually as
    a 2d array and pull out any "line" I need.

    Also, I can wrap the byte reads inside a condition-handler, so that
    when I see the i/o error ("when", not "if" because the file *is*
    corrupted), I can log the error lines, yet keep going all the way to
    the end.

    I wound up coding this in CL, not perl, though, because I couldn't
    find any references to file reads in perl that did not involve the <>
    construct, and also because the CL condition/exception handling logic
    seems more robust than perl's.

    If there's a way to do the same thing -- i.e., read byte blocks into a
    vector, allowing for the possibility of an i/o error without stopping
    -- in perl (and I'm sure there is), I'd be interested in learning how.
     
    , May 15, 2007
    #16
  17. Guest


    > open(my $in, '<', $IN_FILE) or die "\n\terror: Could not open $IN_FILE: $!";
    > # I like that part after the \n\ ;-)


    Ha! the unintended consequences of labeling and visible formatting
    combinations!

    > $/ = "\r\n";
    > my $thowawayfirstline = <$in>; # skip the header line
    > # Here you could check if the header line looks like what you'd expect
    > while (<$in>) {
    > # Process rest of data}
    >
    > close $in;
    >
    > If I can't get the processing to succeed, I usually print out the first
    > line and stop:
    >
    > while (<$in>) {
    > print "$_\n"; last
    > # Process rest of data
    >
    > }
    >
    > Once I've figured out what's going on, I comment or delete that line.


    The issue seems to be that perl's <> construct stops (i.e.,
    "while (<$in>)" evaluates to false) in the event of an i/o error,
    regardless of how $/ is defined.

    And that's exactly what I *don't* want it to do (my last reply to Greg
    has more details).
     
    , May 15, 2007
    #17
  18. Bart Lateur Guest

    wrote:

    >But the problem seems to be perl's <> construct: regardless of how I
    >define $/ (I used both the default and Josef's suggestion), if there's
    >an i/o error of any kind, the "$ln=<IN>" evaluates to false and the
    >loop ends.


    Check to see if that last line contains a chr(26). If that's the case
    and you're on Windows, use binmode on the handle. Text mode treats a
    chr(26) (AKA ctrl-Z, "\cZ") as an end of file marker, while binary mode
    does not.

    Of course, then, you'll have to convert the line ends to "\n" by hand...
    but you're already doing that.

    --
    Bart.
     
    Bart Lateur, May 16, 2007
    #18
  19. Greg Bacon Guest

    In article <>,
    Bart Lateur <> wrote:

    : Check to see if that last line contains a chr(26). If that's the case
    : and you're on Windows, use binmode on the handle. Text mode treats a
    : chr(26) (AKA ctrl-Z, "\cZ") as an end of file marker, while binary mode
    : does not.

    Denis said Linux is the operating system.

    Greg
    --
    I have always found it remarkable that so many men and women are prepared to
    distrust any and all businessmen -- whose appeals, in a free market, they are
    free to ignore -- while trusting even the most corrupt or cruel politician --
    whose demands they fail to meet at their peril. -- Butler Shaffer
     
    Greg Bacon, May 16, 2007
    #19
  20. Greg Bacon Guest

    In article <>,
    <> wrote:

    : So the way I solved the problem was to use a different file read
    : strategy: instead of using a line-by-line loader like <>, I load n
    : bytes at a time into a vector.
    :
    : Since the file is fixed-width, I can treat the vector conceptually as
    : a 2d array and pull out any "line" I need.

    But what about the records that aren't properly terminated? Won't
    that throw off your count?

    : Also, I can wrap the byte reads inside a condition-handler, so that
    : when I see the i/o error ("when", not "if" because the file *is*
    : corrupted), I can log the error lines, yet keep going all the way to
    : the end.

    Are you able to actually continue reading after the I/O error?

    : I wound up coding this in CL, not perl, though, because I couldn't
    : find any references to file reads in perl that did not involve the <>
    : construct, and also because the CL condition/exception handling logic
    : seems more robust than perl's.

    There are alternatives:

    perldoc -f read
    perldoc -f sysread

    : If there's a way to do the same thing -- i.e., read byte blocks into a
    : vector, allowing for the possibility of an i/o error without stopping
    : -- in perl (and I'm sure there is), I'd be interested in learning how.

    You might try something along the following lines:

    #! /usr/bin/perl

    use strict;
    use warnings;

    use Fcntl qw/ SEEK_SET /;

    my $RECORDSZ = 20;

    my $IN_FILE = $0;

    open IN, "<:raw", $IN_FILE or die "$0: open: $!";

    my $nrec = 0;
    while (sysseek IN, $nrec * $RECORDSZ, SEEK_SET) {
      my $nread = sysread IN, my($buf), $RECORDSZ;

      if (defined $nread) {
        if ($nread == 0) {
          exit 0; # eof
        }
        else {
          $buf =~ s{([^[:graph:] ])} {
            "<" . sprintf("%02X", ord $1) . ">"
          }ge;

          print "$nrec: $buf\n";
        }
      }
      else {
        warn "$0: $IN_FILE:$nrec: sysread: $!";
      }

      ++$nrec;
    }

    die "$0: sysseek: $!";

    When run (against itself, which you'll need to change), it gives

    0: #! /usr/bin/perl<0A><0A>us
    1: e strict;<0A>use warnin
    2: gs;<0A><0A>use Fcntl qw/ S
    3: EEK_SET /;<0A><0A>my $RECO
    4: RDSZ = 20;<0A><0A>my $IN_F
    5: ILE = $0;<0A><0A>open IN,
    6: "<:raw", $IN_FILE or
    7: die "$0: open: $!";
    8: <0A><0A>my $nrec = 0;<0A>whil
    9: e (sysseek IN, $nrec
    10: * $RECORDSZ, SEEK_S
    11: ET) {<0A> my $nread =
    12: sysread IN, my($buf)
    13: , $RECORDSZ;<0A><0A> if (
    14: defined $nread) {<0A>
    15: if ($nread == 0) {
    16: <0A> exit 0; # eof
    17: <0A> }<0A> else {<0A>
    18: $buf =~ s{([^[:g
    19: raph:] ])} {<0A>
    20: "<" . sprintf("%02X
    21: ", ord $1) . ">"<0A>
    22: }ge;<0A><0A> print
    23: "$nrec: $buf\n";<0A>
    24: }<0A> }<0A> else {<0A>
    25: warn "$0: $IN_FILE:
    26: $nrec: sysread: $!";
    27: <0A> }<0A><0A> ++$nrec;<0A>}<0A><0A>
    28: die "$0: sysseek: $!
    29: ";<0A>

    I'm interested in knowing whether this approach allows processing
    to continue after the I/O error.

    Keep in mind that you're hitting a lower-level failure than malformed
    data: the filesystem is failing to provide data.

    Hope this helps,
    Greg
    --
    When buying and selling are controlled by legislation, the first
    things to be bought and sold are legislators.
    -- P. J. O'Rourke
     
    Greg Bacon, May 16, 2007
    #20