Reading Mac / Unix / DOS text files

Discussion in 'Perl Misc' started by January Weiner, Feb 21, 2006.

  1. Hi, I'm sure this is a common problem:

    I'd like my script to treat text files coming from various systems alike.
    More specifically, I'd like to recognize ends of line as one of: \r, \l,
    \r\l. Is there a more elegant way than doing the obvious?:

    while(<IF>) {
    s/\r?\l?$// ; # is this correct anyway? will an end of line be
    # recognized with a Mac file?
    #...
    }

    I would expect that there is some weird variable out there (like the $/)
    that changes the behaviour of chomp to be more promiscuous.

    The problem, of course, is, that this cannot be set platform- or
    scriptwide. One file might contain DOS eols, another one would come from
    Mac.

    j.

    --
    January Weiner, Feb 21, 2006
    #1

  2. January Weiner <> wrote in news:dtfd12$m4k$1
    @sagnix.uni-muenster.de:

    > I'd like my script to treat text files coming from various systems
    > alike. More specifically, I'd like to recognize ends of line as one
    > of: \r, \l, \r\l. Is there a more elegant way than doing the obvious?:


    You should use the codes for those characters rather than the escapes.

    > while(<IF>) {


    I stared at this for a long time trying to figure out what

    while(<IF>) {

    meant. I guess IF is short for Input File?

    Here, an appropriate amount of whitespace, not using bareword
    filehandles, and using an appropriate variable name would have helped
    immensely with readability.

    while ( <$input> ) {

    > s/\r?\l?$// ; # is this correct anyway? will an end of line be
    > # recognized with a Mac file?



    This information is readily available by doing a cursory Google search.
    Are you that lazy?

    s{ \012 | (?: \015\012? ) }{\n}x

    should convert any line ending convention to the one supported by your
    platform.

    > I would expect that there is some weird variable out there
    > (like the $/)



    $/ is not a weird variable. It is documented in perldoc perlvar.

    Sinan
    A. Sinan Unur, Feb 21, 2006
    #2

  3. January Weiner

    thrill5 Guest

    I don't know why you had to stare at "while (<IF>)" for anything longer than
    about a tenth of a second. Pretty obvious what the code does to me.
    Whitespace and using barewords for file handles is a matter of programming
    style. Just because that's not the way you do it does not mean that it is
    incorrect or wrong.

    Scott

    "A. Sinan Unur" <> wrote in message
    news:Xns97718FB4BB1ECasu1cornelledu@132.236.56.8...
    > January Weiner <> wrote in news:dtfd12$m4k$1
    > @sagnix.uni-muenster.de:
    >
    >> I'd like my script to treat text files coming from various systems

    > alike.
    >> More specifically, I'd like to recognize ends of line as one of: \r,

    > \l,
    >> \r\l. Is there a more elegant way than doing the obvious?:

    >
    > You should use the codes for those characters rather than the escapes.
    >
    >> while(<IF>) {

    >
    > I stared at this for a long time trying to figure out what
    >
    > while(<IF>) {
    >
    > meant. I guess IF is short for Input File?
    >
    > Here, an appropriate amount of whitespace, not using bareword
    > filehandles, and using an appropriate variable name would have helped
    > immensely with readability.
    >
    > while ( <$input> ) {
    >
    >> s/\r?\l?$// ; # is this correct anyway? will an end of line be
    >> # recognized with a Mac file?

    >
    >
    > This information is readily available by doing a cursory Google search.
    > Are you that lazy?
    >
    > s{ \012 | (?: \015\012? ) }{\n}x
    >
    > should convert any line ending convention to the one supported by your
    > platform.
    >
    >> I would expect that there is some weird variable out there
    >> (like the $/)

    >
    >
    > $/ is not a weird variable. It is documented in perldoc perlvar.
    >
    > Sinan
    thrill5, Feb 22, 2006
    #3
  4. January Weiner

    Rick Scott Guest

    (thrill5 <> uttered:)
    > I don't know why you had to stare at "while (<IF>)" for anything
    > longer than about a tenth of second. Pretty obvious what the code
    > does to me.


    Your filehandle `IF' collides with Perl's conditional `if' in the
    mental hash-bucket of the programmers who have to read your code.
    Given that it reduces the comprehensibility of your program and that
    you could have used any number of more legible identifiers, I'd call
    use of `IF' poor style, if not an outright error.


    > Whitespace and using barewords for file handles is a matter of
    > programming style. Just because that's not the way you do it does
    > not mean that it is incorrect or wrong.


    On the contrary -- I would posit that any coding practice that makes
    it easier to inadvertently introduce bugs into your program is a
    poor one. By using a bareword filehandle, you're essentially using
    a package variable (*IF). If some other piece of code in the same
    package namespace touches that bareword while you're using it to read
    from a file, your filehandle will get stomped and your code will break
    without you even having changed it. That's why it's a bad idea.

    Pick up a copy of Damian Conway's "Perl Best Practices" -- he explains
    why this and about a thousand other `harmless style preferences' aren't
    harmless, aren't stylish, and definitely aren't preferable.




    Rick
    --
    key CF8F8A75 / print C5C1 F87D 5056 D2C0 D5CE D58F 970F 04D1 CF8F 8A75
    A: Because the response should come after the question.
    Q: Why is top-posting so annoying?
    :Mike Andrews
    Rick Scott, Feb 22, 2006
    #4
  5. A. Sinan Unur <> wrote:

    > You should use the codes for those characters rather than the escapes.


    Hmmm, OK, and why?

    > Here, an appropriate amount of whitespace, not using bareword
    > filehandles, and using an appropriate variable name would have helped
    > immensely with readability.


    > while ( <$input> ) {


    Sorry. I learned the rudiments of Perl some ten years ago, when, as
    far as I can remember, bareword filehandles were something to be found
    frequently in the code I learned Perl from. Thanks for the suggestions.
    I should have written <STDIN> and everyone would be happy (unless, of
    course, the modern style recommends something instead of the bareword
    STDIN).

    > > s/\r?\l?$// ; # is this correct anyway? will an end of line be
    > > # recognized with a Mac file?



    > This information is readily available by doing a cursory Google search.
    > Are you that lazy?


    I think that I am rather that stupid, because I did go through both, FAQ
    and a dozen hits Google returned (not to mention Perl documentation on my
    system), but I found mostly references to modifications of the
    $INPUT_RECORD_SEPARATOR, which does not really do the job for me. Do you think
    I am really so eager to expose myself to ruddy remarks of Perl gurus by
    asking a novice question? :)

    > s{ \012 | (?: \015\012? ) }{\n}x


    > should convert any line ending convention to the one supported by your
    > platform.


    Thank you for giving me the answer nonetheless :)

    > > I would expect that there is some weird variable out there
    > > (like the $/)


    > $/ is not a weird variable. It is documented in perldoc perlvar.


    Sorry. I know it is and I know the docs. Would you have been happier if I
    had written "shorthand" instead of "weird"?

    Thanks for your answer!

    j.

    --
    January Weiner, Feb 22, 2006
    #5
  6. Rick Scott <> wrote:
    > Your filehandle `IF' collides with Perl's conditional `if' in the
    > mental hash-bucket of the programmers who have to read your code.


    That's an argument. I will refrain from using <IF> in public. It causes
    too much stir and excites people.

    > Given that it reduces the comprehensibility of your program and that
    > you could have used any number of more legible identifiers, I'd call
    > use of `IF' poor style, if not an outright error.


    Why an error? (yeah, it can lead to errors, agree, but then, of course, a
    colleague of mine says the same about using Perl)

    > On the contrary -- I would posit that any coding practice that makes
    > it easier to inadvertently introduce bugs into your program is a
    > poor one. By using a bareword filehandle, you're essentially using
    > a package variable (*IF). If some other piece of code in the same
    > package namespace touches that bareword while you're using it to read
    > from a file, your filehandle will get stomped and your code will break
    > without you even having changed it. That's why it's a bad idea.


    I think this depends a little on what you are using Perl for, and just
    some basic common sense is sufficient to say when using shorthand is OK and
    when it is a bad idea. I do use the <IF> construct in few-liners that do
    not read more than one file. I do think it is important to define
    variables if you are writing a larger piece of code. I do not understand
    the whole stir about this issue.

    > Pick up a copy of Damian Conway's "Perl Best Practices" -- he explains
    > why this and about a thousand other `harmless style preferences' aren't
    > harmless, aren't stylish, and definitely aren't preferable.


    So what is the problem with whitespace? Why is while(<STDIN>) more harmful
    than while ( <STDIN> ) ?

    j.

    --
    January Weiner, Feb 22, 2006
    #6
  7. January Weiner <> wrote:


    > I'd like my script to treat text files coming from various systems alike.
    > More specifically, I'd like to recognize ends of line as one of: \r, \l,
    > \r\l.



    In Perl, \l lowercases the following character.

    I think you must have meant \n instead?

    Furthermore, sometimes \n means CR rather than LF.

    We better get our terminology precise if we are to avoid
    confusing ourselves.

    I will use "carriage return" (CR) and "linefeed" (LF) to
    avoid further confusion.
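
    To make the point about \l concrete, here is a minimal sketch. The
    variable names are only illustrative. It shows that \l changes the case
    of the next character, while the octal codes always name the same bytes:

```perl
use strict;
use warnings;

# \l is a case-modifying escape, not a control character:
# it lowercases the character that follows it.
my $lowered = "\lABC";

# Octal codes name the bytes unambiguously.
my $cr = "\015";    # carriage return (CR), decimal 13
my $lf = "\012";    # linefeed (LF), decimal 10

printf "%s %d %d\n", $lowered, ord $cr, ord $lf;    # prints "aBC 13 10"
```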


    > Is there a more elegant way than doing the obvious?:



    The obvious will not work, so I wouldn't characterize it as "obvious".

    You need a "correct way" before exploring for a "more elegant way".


    > while(<IF>) {



    Too late.

    At this point, you have *already done* an operation that depends
    on the definition of line-ending.

    If the file is Mac-style and the program is running on *nix,
    then the loop executes 1 time, and the entire file will be
    in $_ already...


    > s/\r?\l?$// ; # is this correct anyway? will an end of line be
    > # recognized with a Mac file?



    .... so this will delete the final CR but leave all the rest untouched.


    > I would expect that there is some weird variable out there (like the $/)
    > that changes the behaviour of chomp to be more promiscous.



    You would be disappointed then. :)


    > The problem, of course, is, that this cannot be set platform- or
    > scriptwide. One file might contain DOS eols, another one would come from
    > Mac.



    Then you should "normalize" the data before doing any line-oriented
    processing.

    In other words, you must treat these "text files" as if they
    were "binary" files. That is, use read() or sysread()
    rather than readline().
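
    A rough sketch of that normalization step, assuming the file fits
    comfortably in memory. The helper name read_normalized_lines is made up
    for this illustration. The file is slurped as raw bytes, every line
    ending is rewritten to LF, and only then is the text split into lines:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a file, treating CRLF,
# lone CR, and lone LF all as a single line ending.
sub read_normalized_lines {
    my ($path) = @_;

    # :raw keeps the bytes exactly as they are stored on disk.
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    my $content = do { local $/; <$in> };    # slurp; no line logic yet
    close $in;
    $content = '' unless defined $content;

    # CRLF must be tried before lone CR, or DOS files gain blank lines.
    $content =~ s/\015\012|\015|\012/\012/g;

    # split drops trailing empty fields, so a final line ending does
    # not produce an extra empty line (trailing blank lines are lost too).
    return split /\012/, $content;
}
```

    Calling read_normalized_lines('test.mul') would then give the same
    list of lines for a DOS, Mac, or Unix copy of the file.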


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Feb 22, 2006
    #7
  8. January Weiner <> wrote in
    news:dth5nn$3ip$-muenster.de:

    > A. Sinan Unur <> wrote:
    >
    >> You should use the codes for those characters rather than the
    >> escapes.

    >
    > Hmmm, OK, and why?


    Because ...

    >> > s/\r?\l?$// ; # is this correct anyway? will an end of line be


    it is easy to get confused (from perldoc perlre):

    \l lowercase next char (think vi)

    That is, \l is not linefeed.

    In any case, these escapes could potentially mean different things on
    different systems. Why not be very specific in what you really are
    looking for?


    >> This information is readily available by doing a cursory Google
    >> search. Are you that lazy?

    >
    > I think that I am rather that stupid, because I did go through both,
    > FAQ and a dozen hits Google returned


    http://www.google.com/search?q=perl eol

    http://www.google.com/search?q=newline

    In any case, I should probably have put a smiley there, because I had not
    intended it to come across that harshly.

    > Thanks for your answer!


    You are welcome.

    Sinan
    A. Sinan Unur, Feb 22, 2006
    #8
  9. A. Sinan Unur <> wrote:
    > >> > s/\r?\l?$// ; # is this correct anyway? will an end of line be


    > it is easy to get confused (from perldoc perlre):


    > \l lowercase next char (think vi)


    > That is, \l is not linefeed.


    :))) nice demonstration of the problem. But this

    s/(?:\r\n?|\n)/

    should work correctly? (except for the fact that one should use the codes)

    > In any case, these escapes could potentially mean different things on
    > different systems. Why not be very specific in what you really are
    > looking for?


    Hmmmm, I assumed that I should rather use what Perl thinks is a linefeed
    than the ASCII code I think it is. But this is really a minor issue.

    > > I think that I am rather that stupid, because I did go through both,
    > > FAQ and a dozen hits Google returned


    > http://www.google.com/search?q=perl eol
    > http://www.google.com/search?q=newline


    OK. However, I was not looking for a solution with string substitution, as
    you have seen (demonstrated on my faulty code snippet) I came up with that
    one myself. I was rather thinking along the following lines: isn't there a
    general way to tell Perl "Hey, treat all the text files alike, wherever
    they come from: DOS, Mac or Unix".

    The point is: (i) I have written a handful of various scripts, some of them
    quite large. All of them work on text files. Recently I have discovered
    problems due to the fact that some of the files that I work on recently
    come from the DOS world. Now, I'd rather insert _one_ command or variable
    assignment somewhere at the beginning of the script that would change the
    behaviour of chomp than to go through all that code and substitute each
    chomp by a substitution. (ii) A substitution takes more time by orders of
    magnitude:

    :~ $ head -100000 /db/prodom/prodom.mul | (time perl -p -e 'chomp ;' > /dev/null ; )

    real 0m0.157s
    user 0m0.123s
    sys 0m0.034s
    :~ $ head -100000 /db/prodom/prodom.mul | (time perl -p -e 's{ \012 | (?: \015\012? ) }{\n}x ;' > /dev/null ; )

    real 0m2.012s
    user 0m1.990s
    sys 0m0.024s

    And, surprise, the files can be quite large:
    :~ $ wc -l /db/prodom/prodom.mul
    7900570 /db/prodom/prodom.mul

    I simply thought there might be a better solution than to use
    substitutions, like assigning $/ in a special way or using a module that
    adds a layer to the file open() or redefines chomp. What do I know. I
    thought that the problem was common enough to be addressed in a better way.

    I think that I will find some way to determine the file type (possibly by
    looking at the ending of the first line), redefine $/ and continue reading.
    Some untested code follows:


    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $DFNTNTF = myopen("<test.mul") ;

    die "Cannot open file: $!\n" unless($DFNTNTF) ;

    while( <$DFNTNTF> ) {
    chomp ;
    print "line $.:$_\n" ;
    }

    close $DFNTNTF ;

    exit 0 ;

    # open a file and set the input record separator
    sub myopen {

    my $file_mode = shift ;
    my $definitelynotif ;

    open ( $definitelynotif, $file_mode ) or return ;
    my $line = <$definitelynotif> ;

    if($line =~ m/(\015\012|\012|\015)/) {
    $/ = $1 ;
    }

    seek $definitelynotif, 0, 0 ;
    return $definitelynotif ;
    }

    > In any case, I should probably have put a smiley there, because I had not
    > intended it to come across that harshly.


    No offence taken.

    Cheers,
    January

    --
    January Weiner, Feb 22, 2006
    #9
  10. A. Sinan Unur <> wrote:
    > >> > s/\r?\l?$// ; # is this correct anyway? will an end of line be


    > it is easy to get confused (from perldoc perlre):


    > \l lowercase next char (think vi)


    > That is, \l is not linefeed.


    :))) nice demonstration of the problem. But this

    s/(?:\r\n?|\n)/

    should work correctly? (except for the fact that one should use the codes)

    > In any case, these escapes could potentially mean different things on
    > different systems. Why not be very specific in what you really are
    > looking for?


    Hmmmm, I assumed that I should rather use what Perl thinks is a linefeed
    than the ASCII code I think it is. But this is really a minor issue.

    > > I think that I am rather that stupid, because I did go through both,
    > > FAQ and a dozen hits Google returned


    > http://www.google.com/search?q=perl eol
    > http://www.google.com/search?q=newline


    OK. However, I was not looking for a solution with string substitution, as
    you have seen (demonstrated on my faulty code snippet) I came up with that
    one myself. I was rather thinking along the following lines: isn't there a
    general way to tell Perl "Hey, treat all the text files alike, wherever
    they come from: DOS, Mac or Unix".

    The point is: (i) I have written a handful of various scripts, some of them
    quite large. All of them work on text files. Recently I have discovered
    problems due to the fact that some of the files that I work on recently
    come from the DOS world. Now, I'd rather insert _one_ command or variable
    assignment somewhere at the beginning of the script that would change the
    behaviour of chomp than to go through all that code and substitute each
    chomp by a substitution. (ii) A substitution takes more time by orders of
    magnitude:

    :~ $ head -100000 /db/prodom/prodom.mul | (time perl -p -e 'chomp ;' > /dev/null ; )

    real 0m0.157s
    user 0m0.123s
    sys 0m0.034s
    :~ $ head -100000 /db/prodom/prodom.mul | (time perl -p -e 's{ \012 | (?: \015\012? ) }{\n}x ;' > /dev/null ; )

    real 0m2.012s
    user 0m1.990s
    sys 0m0.024s

    And, surprise, the files can be quite large:
    :~ $ wc -l /db/prodom/prodom.mul
    7900570 /db/prodom/prodom.mul

    I simply thought there might be a better solution than to use
    substitutions, like assigning $/ in a special way or using a module that
    adds a layer to the file open() or redefines chomp. What do I know. I
    thought that the problem was common enough to be addressed in a better way.

    I think that I will find some way to determine the file type (possibly by
    looking at the ending of the first line), redefine $/ and continue reading.
    Some untested code follows:


    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $DFNTNTF = myopen("<test.mul") ;

    die "Cannot open file: $!\n" unless($DFNTNTF) ;

    while( <$DFNTNTF> ) {
    chomp ;
    print "line $.:$_\n" ;
    }

    close $DFNTNTF ;

    exit 0 ;

    # open a file and set the input record separator
    sub myopen {

    my $file_mode = shift ;
    my $definitelynotif ;

    open ( $definitelynotif, $file_mode ) or return ;
    my $line = <$definitelynotif> ;

    if($line =~ m/(\015\012|\012|\015)/) {
    $/ = $1 ;
    }

    close $definitelynotif ;
    open ( $definitelynotif, $file_mode ) or return ;

    return $definitelynotif ;
    }

    > In any case, I should probably have put a smiley there, because I had not
    > intended it to come across that harshly.


    No offence taken.

    Cheers,
    January

    --
    January Weiner, Feb 22, 2006
    #10
  11. January Weiner <> wrote in
    news:dti479$eti$-muenster.de:

    > A. Sinan Unur <> wrote:
    >> >> > s/\r?\l?$// ; # is this correct anyway? will an end of line
    >> >> > be

    >
    >> it is easy to get confused (from perldoc perlre):

    >
    >> \l lowercase next char (think vi)

    >
    >> That is, \l is not linefeed.

    >
    >:))) nice demonstration of the problem. But this
    >
    > s/(?:\r\n?|\n)/
    >
    > should work correctly?


    Have you tried that on a DOS file on Unix? Take a look at it in a hex
    editor.

    > OK. However, I was not looking for a solution with string
    > substitution, as you have seen (demonstrated on my faulty code
    > snippet) I came up with that one myself. I was rather thinking along
    > the following lines: isn't there a general way to tell Perl "Hey,
    > treat all the text files alike, wherever they come from: DOS, Mac or
    > Unix".


    Open in binmode, don't use \n to match eol.
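
    For files too large to slurp, the same advice can be sketched as a
    chunked reader. This is only an illustration: each_line, its callback
    interface, and the default chunk size are assumptions, not an existing
    API. It reads raw bytes with read() and matches line endings by their
    codes rather than by \n:

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per line, whatever the line
# ending (CRLF, lone CR, or lone LF). $chunk_size is optional.
sub each_line {
    my ($fh, $callback, $chunk_size) = @_;
    $chunk_size ||= 65536;
    binmode $fh;    # no platform newline translation

    my $buffer = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # A CR at the very end of the buffer is left alone: the next
        # chunk may begin with the LF that completes a CRLF pair.
        while ($buffer =~ s/\A([^\015\012]*)(?:\015\012|\012|\015(?!\z))//) {
            my $line = $1;
            $callback->($line);
        }
    }

    # At end of file a trailing CR is a complete line ending after all,
    # and any remaining text is a final line without a terminator.
    if ($buffer =~ s/\015\z// or length $buffer) {
        $callback->($buffer);
    }
    return;
}
```

    A caller would pass an open filehandle and a code reference, for
    example each_line($input, sub { print "$_[0]\n" }).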

    Sinan
    A. Sinan Unur, Feb 22, 2006
    #11
  12. January Weiner

    Rick Scott Guest

    (January Weiner <> uttered:)
    > Rick Scott <> wrote:
    > > Your filehandle `IF' collides with Perl's conditional `if' in
    > > the mental hash-bucket of the programmers who have to read
    > > your code.

    >
    > That's an argument. I will refrain from using <IF> in public. It
    > causes too much stir and excites people.
    >
    > > Given that it reduces the comprehensibility of your program and
    > > that you could have used any number of more legible identifiers,
    > > I'd call use of `IF' poor style, if not an outright error.

    >
    > Why an error? (yeah, it can lead to errors, agree, but then, of
    > course, a collegue of mine says the same about using Perl)


    Why cause a greater propensity to introduce errors than one has to?


    > > On the contrary -- I would posit that any coding practice that
    > > makes it easier to inadvertently introduce bugs into your
    > > program is a poor one. By using a bareword filehandle, you're
    > > essentially using a package variable (*IF). If some other piece
    > > of code in the same package namespace touches that bareword
    > > while you're using it to read from a file, your filehandle will
    > > get stomped and your code will break without you even having
    > > changed it. That's why it's a bad idea.

    >
    > I think this depends a little on what you are using Perl for, and
    > just some basic common sense is sufficient to say when using
    > shorthand is OK and when it is a bad idea. I do use the <IF>
    > construct in few-liners that do not read more than one file. I do
    > think it is important to define variables if you are writing a
    > larger piece of code. I do not understand the whole stir about
    > this issue.


    Since there's a way to do things that lets you do everything you can
    do with a bareword filehandle without incurring the disadvantages,
    why not make a habit of using it? As of Perl 5.6, you can do this:

    open my $FILE, '<', $filename or die "Can't open file: $!";

    Then the filehandle is stored in the lexical variable $FILE (where it
    can't be stomped on by someone else's code) instead of in the package
    variable *FILE (where it can).
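
    A small sketch of the difference, using in-memory files so that it
    runs without touching the disk. The helper peek_helper_data and the
    sample data are invented for this illustration:

```perl
use strict;
use warnings;

my $main_data   = "first\nsecond\n";
my $helper_data = "helper 1\nhelper 2\n";

# Hypothetical helper that happens to reuse the bareword handle IN.
sub peek_helper_data {
    open IN, '<', \$helper_data or die "Cannot open in-memory file: $!";
    return scalar <IN>;
}

# Bareword handle: the helper silently re-points the global IN.
open IN, '<', \$main_data or die "Cannot open in-memory file: $!";
my $bare_first = <IN>;
peek_helper_data();
my $bare_second = <IN>;    # now comes from $helper_data
close IN;

# Lexical handle: the helper has no way to reach $in.
open my $in, '<', \$main_data or die "Cannot open in-memory file: $!";
my $lex_first = <$in>;
peek_helper_data();
my $lex_second = <$in>;    # still the second line of $main_data
close $in;

print "bareword: $bare_second";
print "lexical:  $lex_second";
```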


    > > Pick up a copy of Damian Conway's "Perl Best Practices" -- he
    > > explains why this and about a thousand other `harmless style
    > > preferences' aren't harmless, aren't stylish, and definitely
    > > aren't preferable.

    >
    > So what is the problem with whitespace? Why is while(<STDIN>) more
    > harmful than while ( <STDIN> ) ?


    Actually, I'd go with the first of these two. About whitespace in
    general -- there's not necessarily too many hard and fast rules;
    you just want to make best use of it to increase the readability of
    your code. To take an extreme example,

    LINE:
    foreach my $line (@lines) {
    my ($registry, $cc, $type, $start, $length, $date, $status) =
    split qr{\|}, $line;

    next LINE unless $status;
    next LINE unless ($type eq 'ipv4');

    print $start;
    }

    is obviously much better than

    LINE:foreach my $line(@lines){my($registry,$cc,$type,$start,$length,$date,$status)=split qr{\|},$line;next LINE unless $status;next LINE unless($type eq 'ipv4');print$start;}

    or

    LINE:
    foreach
    my
    $line
    (
    @lines
    )
    {
    my
    ($registry,
    $cc,
    $type,
    $start,
    $length,
    $date,
    $status)
    =
    split
    qr{\|}
    ,
    $line
    ;
    ....

    or some other godawful thing.




    Rick
    --
    key CF8F8A75 / print C5C1 F87D 5056 D2C0 D5CE D58F 970F 04D1 CF8F 8A75
    Humankind cannot bear very much reality.
    :Karen Armstrong
    Rick Scott, Feb 23, 2006
    #12
  13. January Weiner

    Samwyse Guest

    error-prone coding styles

    Rick Scott wrote:
    > (January Weiner <> uttered:)
    >
    >>Rick Scott <> wrote:
    >>
    >>>Given that it reduces the comprehensibility of your program and
    >>>that you could have used any number of more legible identifiers,
    >>>I'd call use of `IF' poor style, if not an outright error.

    >>
    >>Why an error? (yeah, it can lead to errors, agree, but then, of
    >>course, a collegue of mine says the same about using Perl)

    >
    > Why cause a greater propensity to introduce errors than one has to?


    One of the very first programs (non-Perl) that I had to maintain (as
    opposed to create) was written by a joker who decided to use O as a
    variable name. Distinguishing between
    X = 0;
    and
    X = O;
    was a source of much merriment for the rest of the staff. Fortunately
    for his nose, he was quick at ducking whenever clenched fists were in
    his vicinity.
    Samwyse, Feb 23, 2006
    #13
  14. January Weiner <> wrote:
    > A. Sinan Unur <> wrote:
    >
    >> You should use the codes for those characters rather than the escapes.

    >
    > Hmmm, OK, and why?



    So you will *know* what character you will get.

    print "\n";

    outputs different characters when perl is run on different systems.

    If you use the codes, everybody on every system sees what you
    want them to see.
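
    As a sketch of what using the codes can look like in practice, the
    constants and the join_lines helper below are illustrative names, not
    part of any module:

```perl
use strict;
use warnings;

# Explicit byte sequences for the three common conventions.
use constant {
    CR   => "\015",
    LF   => "\012",
    CRLF => "\015\012",
};

# Hypothetical helper: terminate every line with the ending you ask for.
sub join_lines {
    my ($eol, @lines) = @_;
    return join '', map { $_ . $eol } @lines;
}

my $dos_text  = join_lines(CRLF, 'alpha', 'beta');
my $unix_text = join_lines(LF,   'alpha', 'beta');

printf "%d %d\n", length $dos_text, length $unix_text;    # prints "13 11"
```

    When such a string is written out, the output handle should be put
    into binmode so that no I/O layer rewrites the bytes on the way.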


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Feb 23, 2006
    #14
  15. Rick Scott <> wrote:
    > > Why an error? (yeah, it can lead to errors, agree, but then, of
    > > course, a collegue of mine says the same about using Perl)


    > Why cause a greater propensity to introduce errors than one has to?


    Yeah, I guess I _could_ switch to Python, which is verrry orrrdentlich,
    syntax imposes coding style, etc., etc... sorry. I use Python when
    teaching. I use Perl when doing my bloody job. Seriously: Perl is
    notorious for shorthands, quick dirty code snippets etc. I would not have
    been programming in Perl if I was ana^W careful.

    > > just some basic common sense is sufficient to say when using
    > > shorthand is OK and when it is a bad idea. I do use the <IF>
    > > construct in few-liners that do not read more than one file. I do
    > > think it is important to define variables if you are writing a
    > > larger piece of code. I do not understand the whole stir about
    > > this issue.


    > Since there's a way to do things that lets you do everything you can
    > do with a bareword filehandle without incurring the disadvantages,
    > why not make a habit of using it? As of Perl 5.6, you can do this:


    > open my $FILE, '<', $filename or die "Can't open file: $!";


    > Then the filehandle is stored in the lexical variable $FILE (where it
    > can't be stomped on by someone else's code) instead of in the package
    > variable *FILE (where it can).


    Thanks for this explanation, but read what I have written above. Yeah, I
    know that, I use it in larger projects. I don't care about that in the
    five liners. Frankly, do you use "use strict ; use warnings ;" with
    one-liners run with "perl -e"? Will you try to convince me that omitting
    these is a serious danger and causes a greater propensity to introduce
    errors? ...with "perl -e"? Well -- I agree. Of course it does - so what?

    > > > Pick up a copy of Damian Conway's "Perl Best Practices" -- he
    > > > explains why this and about a thousand other `harmless style
    > > > preferences' aren't harmless, aren't stylish, and definitely
    > > > aren't preferable.

    > >
    > > So what is the problem with whitespace? Why is while(<STDIN>) more
    > > harmful than while ( <STDIN> ) ?


    > Actually, I'd go with the first of these two. About whitespace in
    > general -- there's not necessarily too many hard and fast rules;
    > you just want to make best use of it to increase the readability of
    > your code. To take an extreme example,


    (snip example)

    Yes, I do agree with general formatting, but I was criticized for this
    particular thing -- not putting space in while(<BLAH>). By the way,

    > LINE:
    > foreach my $line (@lines) {
    > my ($registry, $cc, $type, $start, $length, $date, $status) =
    > split qr{\|}, $line;


    > next LINE unless $status;
    > next LINE unless ($type eq 'ipv4');
    > print $start;
    > }


    ....I would most probably write as

    my @info ;
    for(@lines) {
    @info = split /\|/ ; # info has now: registry, cc,
    # type, start, length, date, status
    next unless ($info[-1] && $info[2] eq 'ipv4') ;
    print $info[3] ;
    }

    which probably is for you

    > some other godawful thing.


    j.

    --
    January Weiner, Feb 23, 2006
    #15
  16. Re: error-prone coding styles

    Samwyse <> wrote:
    > One of the very first programs (non-Perl) that I had to maintain (as
    > opposed to create) was written by a joker who decided to use O as a
    > variable name. Distinguishing between
    > X = 0;
    > and
    > X = O;
    > was a source of much merriment for the rest of the staff. Fortunately
    > for his nose, he was quick at ducking whenever clenched fists were in
    > his vicinity.


    Yeah. Lack of reasonable editors or skills at using them (how do you do
    "s/\(\W\)O\(\W\)/\1BloodyStupidVariableName\2/g" in Notepad?) is always
    a problem.

    j.

    --
    January Weiner, Feb 23, 2006
    #16
  17. January Weiner

    Lukas Mai Guest

    January Weiner <> wrote:
    > Rick Scott <> wrote:
    >> [lexical filehandles vs. package/bareword fhs]

    >
    > Thanks for these explanation, but read what I have written above. Yeah, I
    > know that, I use it in larger projects. I don't care about that in the
    > five liners. Frankly, do you use "use strict ; use warnings ;" with
    > one-liners run with "perl -e"? Will you try to convince me that omitting
    > these is a serious danger and causes a greater propensity to introduce
    > errors? ...with "perl -e"? Well -- I agree. Of course it does - so what?


    I 'use warnings; use strict;' in every Perl script that's stored in a
    file. I use perl -w(l)e for one liners (except when
    golfing/obfuscating). Sometimes I add -Mstrict to quickly check how
    exactly a perl feature/bug works.

    Just my 2¢, Lukas
    Lukas Mai, Feb 23, 2006
    #17
  18. January Weiner

    Donald King Guest

    January Weiner wrote:
    > Hi, I'm sure this is a common problem:
    >
    > I'd like my script to treat text files coming from various systems alike.
    > More specifically, I'd like to recognize ends of line as one of: \r, \l,
    > \r\l. Is there a more elegant way than doing the obvious?:
    >
    > while(<IF>) {
    > s/\r?\l?$// ; # is this correct anyway? will an end of line be
    > # recognized with a Mac file?
    > #...
    > }
    >
    > I would expect that there is some weird variable out there (like the $/)
    > that changes the behaviour of chomp to be more promiscous.
    >
    > The problem, of course, is, that this cannot be set platform- or
    > scriptwide. One file might contain DOS eols, another one would come from
    > Mac.
    >
    > j.
    >


    Short, short version:

    binmode(IF);
    my $whole_file = do { local $/; <IF> };
    my @lines = split /(?:\r\n|\r|\n)/, $whole_file;
    foreach (@lines) {
        ...
    }

    Since regexps check alternatives from left to right, that splits as
    correctly as possible. If something horrid has happened, like inserting
    the contents of a Mac file into a Unix file, you'll get some funny
    behavior near the seams, of course, but for a file that's all one line
    ending, it works great, and it even handles the common case of files
    that mix Unix LFs with Windows CRLFs.
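
    To make the left-to-right point concrete, here is a minimal sketch of the same split applied to an in-memory string with mixed endings. The helper name split_lines is made up for illustration; the pattern is the one from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string on CRLF, lone CR, or lone LF.
# "\r\n" must come first in the alternation so that a CRLF pair is
# consumed as one terminator, not as a CR followed by an empty line.
sub split_lines {
    my ($text) = @_;
    return split /\r\n|\r|\n/, $text;
}

my @lines = split_lines("unix\nwindows\r\nmac\rlast\n");
print scalar(@lines), " lines: @lines\n";   # prints: 4 lines: unix windows mac last
```

    Note that split drops trailing empty fields, so the final terminator does not produce an extra empty line, while blank lines in the middle of the text are kept.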

    However, if you want to cut back on memory consumption (important for
    files bigger than a few hundred KB or so) and your files have consistent
    line endings, you might probe the end-of-line by sysreading the first
    2KB or so, sysseek back to the start of the file, then locally set $/ to
    the exact line ending that you probed.

    Something like this might work:

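    use Fcntl ':seek';
    ....
    binmode(IF);
    local $/ = "\n";    # fallback if no line ending is found
    while (1) {
        # Stop at EOF (0) or on a read error (undef).
        last unless sysread(IF, my $peek, 2048);
        # \r\n is tried first so CRLF is not mistaken for a lone CR.
        $/ = $1, last if $peek =~ /(\r\n|\r|\n)/;
    }
    sysseek(IF, 0, SEEK_SET);    # rewind before the buffered read below
    while (<IF>) {
        ...
    }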
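
    The same idea can be wrapped in a pair of subs. This is only a sketch under some assumptions: detect_eol and read_lines are hypothetical names, the file is opened twice by path instead of being rewound, and only the first chunk is probed (2048 bytes by default). It also reads one extra byte when the sample ends in CR, so that a CRLF pair straddling the probe boundary is not misread as a lone CR.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the line terminator of a file from its first
# $probe_size bytes. Falls back to "\n" if no terminator is seen.
sub detect_eol {
    my ($path, $probe_size) = @_;
    $probe_size ||= 2048;

    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    my $peek = '';
    my $got = read($in, $peek, $probe_size);
    die "Cannot read $path: $!" unless defined $got;

    # If the sample ends in CR, fetch one more byte so that a CRLF pair
    # split across the probe boundary is still recognised as CRLF.
    if ($peek =~ /\r\z/) {
        my $next = '';
        my $more = read($in, $next, 1);
        $peek .= $next if $more;
    }
    close $in;

    return $peek =~ /(\r\n|\r|\n)/ ? $1 : "\n";
}

# Hypothetical helper: read a file line by line using the detected terminator.
sub read_lines {
    my ($path) = @_;
    local $/ = detect_eol($path);
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    my @lines;
    while (my $line = <$in>) {
        chomp $line;          # chomp removes whatever $/ currently holds
        push @lines, $line;
    }
    close $in;
    return @lines;
}
```

    For gigabyte-sized input one would process each line inside the while loop rather than collecting them in @lines; the array is only there to keep the sketch easy to check.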

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
    Donald King, Feb 23, 2006
    #18
  19. Donald King wrote:
    > January Weiner wrote:
    >
    >> Hi, I'm sure this is a common problem:
    >>
    >> I'd like my script to treat text files coming from various systems alike.
    >> More specifically, I'd like to recognize ends of line as one of: \r, \l,
    >> \r\l. Is there a more elegant way than doing the obvious?:
    >>
    >> while(<IF>) {
    >> s/\r?\l?$// ; # is this correct anyway? will an end of line be
    >> # recognized with a Mac file?
    >> #...
    >> }
    >>
    >> I would expect that there is some weird variable out there (like the $/)
    >> that changes the behaviour of chomp to be more promiscous.
    >> The problem, of course, is, that this cannot be set platform- or
    >> scriptwide. One file might contain DOS eols, another one would come from
    >> Mac.
    >>
    >> j.
    >>

    >
    > Short, short version:
    >
    > binmode(IF);
    > my $whole_file = do { local $/; <IF> };
    > my @lines = split /(?:\r\n|\r|\n)/, $whole_file;
    > foreach(@lines) {
    > ...
    > }
    >
    > Since regexps check alternatives from left to right, that splits as
    > correctly as possible. If something horrid has happened, like inserting
    > the contents of a Mac file into a Unix file, you'll get some funny
    > behavior near the seams, of course, but for a file that's all one line
    > ending, it works great, and it even handles the common case of files
    > that mix Unix LFs with Windows CRLFs.
    >
    > However, if you want to cut back on memory consumption (important for
    > files bigger than a few hundred KB or so) and your files have consistent
    > line endings, you might probe the end-of-line by sysreading the first
    > 2KB or so, sysseek back to the start of the file, then locally set $/ to
    > the exact line ending that you probed.
    >
    > Something like this might work:
    >
    > use Fcntl ':seek';
    > ...
    > binmode(IF);
    > local $/ = "\n";
    > while(1) {
    > last if sysread(IF, my $peek, 2048) == 0;
    > $/ = $1, last if $peek =~ /(\r\n|\r|\n)/;
    > }
    > sysseek(IF, 0, SEEK_SET);
    > while(<IF>) {
    > ...
    > }
    >


    Oh, and if your perl code is running or might run on any of a small
    handful of screwy systems (IIRC, EBCDIC systems and pre-OSX Macs are the
    main offenders), you might need to change \r => \x0D and \n => \x0A just
    to be specific. (If you're on an EBCDIC system and handling an ASCII
    file, though, you've got bigger problems.)
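
    As a small illustration of spelling the bytes out, here is a chomp-like sub written with \x0D and \x0A only. The name strip_eol is made up for this sketch; it removes a single trailing CRLF, CR, or LF.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing line terminator, whatever its
# convention. \x0D and \x0A are fixed byte values (13 and 10, ASCII CR
# and LF), so the pattern does not depend on what \r and \n mean locally.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\x0D\x0A?|\x0A)\z//;
    return $line;
}

# prints: dos|mac|unix
print join('|', strip_eol("dos\x0D\x0A"), strip_eol("mac\x0D"), strip_eol("unix\x0A")), "\n";
```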

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
    Donald King, Feb 23, 2006
    #19
  20. Hi,

    Donald King <> wrote:

    (snip)

    > However, if you want to cut back on memory consumption (important for
    > files bigger than a few hundred KB or so) and your files have consistent
    > line endings, you might probe the end-of-line by sysreading the first


    Exactly. Sometimes I need to run my programs on files of gigabyte size.

    > 2KB or so, sysseek back to the start of the file, then locally set $/ to
    > the exact line ending that you probed.


    > Something like this might work:


    > use Fcntl ':seek';
    > ...
    > binmode(IF);
    > local $/ = "\n";
    > while(1) {
    > last if sysread(IF, my $peek, 2048) == 0;
    > $/ = $1, last if $peek =~ /(\r\n|\r|\n)/;
    > }
    > sysseek(IF, 0, SEEK_SET);
    > while(<IF>) {
    > ...
    > }


    Thanks! This gets me much further. Actually, it would even be nice to
    have a Perl module implementing the file(1) functionality... The above
    subroutine plus a hacked magic file plus some clever string searching.
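
    A very rough sketch of the line-ending part of that idea, loosely in the spirit of how file(1) reports things like "with CRLF line terminators". The sub name describe_eol and its output strings are invented here; nothing below touches a real magic file.

```perl
use strict;
use warnings;

# Hypothetical helper: describe which line terminators occur in a sample
# string. Counts CRLF, lone CR and lone LF separately.
sub describe_eol {
    my ($sample) = @_;
    my %count = (CRLF => 0, CR => 0, LF => 0);
    while ($sample =~ /(\x0D\x0A|\x0D|\x0A)/g) {
        my $name = $1 eq "\x0D\x0A" ? 'CRLF'
                 : $1 eq "\x0D"     ? 'CR'
                 :                    'LF';
        $count{$name}++;
    }
    my @seen = grep { $count{$_} } qw(CRLF CR LF);
    return 'no line terminators' unless @seen;
    return "$seen[0] line terminators" if @seen == 1;
    return 'mixed line terminators (' . join(', ', @seen) . ')';
}

print describe_eol("a\x0D\x0Ab\x0D\x0A"), "\n";   # prints: CRLF line terminators
```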

    j.

    --
    January Weiner, Feb 24, 2006
    #20