end-of-line conventions

Discussion in 'Perl Misc' started by kj, Aug 13, 2009.

  1. kj

    kj Guest

    There are three major conventions for the end-of-line marker:
    "\n", "\r\n", and "\r".

    In a variety of situations, Perl must split strings into "lines",
    and must therefore follow a particular convention to identify line
    boundaries. There are three situations that interest me in
    particular: 1. the splitting into lines that happens when one
    iterates over a file using the <> operator; 2. the meaning of the
    operation performed by chomp; and 3. the meaning of the $ anchor
    in regular expressions.

    These three issues are tested by the following simple script:

    my $lines = my $matches = 0;
    while (<>) {
        $lines++;
        if (/z$/) {
            $matches++;
            chomp;
            print ">$_<";
        }
    }

    print "$/$matches matches out of $lines lines$/";
    __END__

    I have three files, unix.txt, dos.txt, and mac.txt, each containing
    four lines. Disregarding the end-of-line character(s) these lines
    are "foo", "bar", "baz", "frobozz".

    The file unix.txt uses "\n" to separate the lines. The output that
    I get when I pass it as the argument to the script is this:

    % demo.pl unix.txt
    >baz<>frobozz<

    2 matches out of 4 lines

    The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
    uses "\r". Here's the output I get when I pass these files to the
    script:

    % demo.pl dos.txt

    0 matches out of 4 lines
    % demo.pl mac.txt

    0 matches out of 1 lines

    How can I change the script so that the output for unix.txt, dos.txt,
    and mac.txt will be the same as the one shown above for unix.txt?

    (Mucking with the value of $/ I was able to get <> to split the
    input stream at the right places, but it had no impact on the result
    of the regular expression match.)

    TIA!

    kynn
     
    kj, Aug 13, 2009
    #1

  2. kj

    kj Guest

    In <> Tad J McClellan <> writes:

    >kj <> wrote:
    >>
    >>
    >> Subject: end-of-line conventions



    >Have you read the "Newlines" section in


    > perldoc perlport


    >??



    >> There are three major conventions for the end-of-line marker:
    >> "\n", "\r\n", and "\r".
    >>
    >> In a variety of situations, Perl must split strings into "lines",
    >> and must therefore follow a particular convention to identify line
    >> boundaries.



    >perl detects its platform when it is *compiled*.


    >That is, perl decides what line ending to use when it is built.



    >> The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
    >> uses "\r".


    >> How can I change the script so that the output for unix.txt, dos.txt,
    >> and mac.txt will be the same as the one shown above for unix.txt?



    >You can't.



    Mind-blowing, to say the least...

    Oh, well. Live and lurn. Thanks. And to Ben too.

    kynn
     
    kj, Aug 13, 2009
    #2

  3. kj wrote:

    > There are three major conventions for the end-of-line marker:
    > "\n", "\r\n", and "\r".


    These notations are not unambiguous! See perlport documentation section
    newlines for details.

    > In a variety of situations, Perl must split strings into "lines",
    > and must therefore follow a particular convention to identify line
    > boundaries. There are three situations that interest me in
    > particular: 1. the splitting into lines that happens when one
    > iterates over a file using the <> operator; 2. the meaning of the
    > operation performed by chomp; and 3. the meaning of the $ anchor
    > in regular expressions.


    <> and chomp use the $/ variable for line endings. Since $/ does not
    support regular expressions, you cannot use this mechanism for all
    types of line endings.

    The $ anchor normally is just the end of the string (with or without a
    line ending).
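
A minimal sketch of the point about the $ anchor, assuming an ASCII platform where "\n" is "\012". The helper name ends_in_z and the sample strings are made up for illustration. A plain /z$/ can match before a final "\n" but not before "\r" or "\r\n", so a tolerant check has to spell out the terminators itself:

```perl
use strict;
use warnings;

# Hypothetical helper: does the line end in "z" once any one of the
# three line terminators is ignored? \z anchors at the very end of
# the string, with no special treatment of a trailing newline.
sub ends_in_z {
    my ($line) = @_;
    return $line =~ /z(?:\015\012|\015|\012)?\z/ ? 1 : 0;
}

for my $line ("baz\012", "baz\015\012", "baz\015", "baz") {
    my $plain    = $line =~ /z$/       ? "yes" : "no";
    my $tolerant = ends_in_z($line)    ? "yes" : "no";

    # Make the control characters visible for printing.
    (my $shown = $line) =~ s/\015/\\r/g;
    $shown =~ s/\012/\\n/g;

    printf "%-8s plain: %-3s tolerant: %s\n", $shown, $plain, $tolerant;
}
```

Only the "\n"-terminated and the unterminated string pass the plain test; all four pass the tolerant one.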

    > How can I change the script so that the output for unix.txt, dos.txt,
    > and mac.txt will be the same as the one shown above for unix.txt?


    use strict;
    use warnings;

    my $lines = my $matches = 0;
    {
        local $/ = undef;
        for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {
            $lines++;
            if (/z$/) {
                $matches++;
                print ">$_<";
            }
        }
    }
    print "\n$matches matches out of $lines lines\n";
    __END__

    This uses <> with no line end definition, and iterates with a regular
    expression suitable for three types of line endings. The line ending is
    not included in $_, so chomp is omitted.
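
As a hedged alternative to the slurp-and-match idea, the same splitting can be sketched with split. One thing to watch with a pattern in which every part is optional is that a list-context //g match can return one extra empty element once the end of the input is reached; split without a limit drops trailing empty fields instead. The helper name split_lines and the in-memory sample data are assumptions made for this example, not part of the posted script:

```perl
use strict;
use warnings;

# Hypothetical helper: split a slurped string into lines, accepting
# LF, CRLF, or CR as the terminator. CRLF is listed first so that it
# is consumed as one terminator rather than as CR followed by LF.
sub split_lines {
    my ($text) = @_;
    return split /\015\012|\015|\012/, $text;
}

# Stand-ins for unix.txt, dos.txt and mac.txt.
my %sample = (
    unix => "foo\012bar\012baz\012frobozz\012",
    dos  => "foo\015\012bar\015\012baz\015\012frobozz\015\012",
    mac  => "foo\015bar\015baz\015frobozz\015",
);

for my $name (sort keys %sample) {
    my @lines   = split_lines($sample{$name});
    my @matches = grep { /z\z/ } @lines;    # terminators are already gone
    print "$name: ", join('', map { ">$_<" } @matches), "\n";
    print scalar(@matches), " matches out of ", scalar(@lines), " lines\n";
}
```

The trade-off is that dropping trailing empty fields also discards blank lines at the very end of a file, which may or may not matter for a line count.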

    If you need the line endings in $_ use the following lines.
    for (<> =~ m{\G([^\012\015]* \015?\012?)}xmsg) {
    $lines++;
    if (/z\s*$/) {
    $matches++;
    s{[\015\012][\015\012]?}{}xms; # chomp replacement

    Hope that helps, heiko
     
    Heiko Eißfeldt, Aug 13, 2009
    #3
  4. kj

    Steve C Guest

    kj wrote:
    > There are three major conventions for the end-of-line marker:
    > "\n", "\r\n", and "\r".
    >
    > In a variety of situations, Perl must split strings into "lines",
    > and must therefore follow a particular convention to identify line
    > boundaries. There are three situations that interest me in
    > particular: 1. the splitting into lines that happens when one
    > iterates over a file using the <> operator; 2. the meaning of the
    > operation performed by chomp; and 3. the meaning of the $ anchor
    > in regular expressions.
    >
    > These three issues are tested by the following simple script:
    >
    > my $lines = my $matches = 0;
    > while (<>) {
    > $lines++;
    > if (/z$/) {
    > $matches++;
    > chomp;
    > print ">$_<";
    > }
    > }
    >
    > print "$/$matches matches out of $lines lines$/";
    > __END__
    >
    > I have three files, unix.txt, dos.txt, and mac.txt, each containing
    > four lines. Disregarding the end-of-line character(s) these lines
    > are "foo", "bar", "baz", "frobozz".
    >
    > The file unix.txt uses "\n" to separate the lines. The output that
    > I get when I pass it as the argument to the script is this:
    >
    > % demo.pl unix.txt
    > >baz<>frobozz<

    > 2 matches out of 4 lines
    >
    > The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
    > uses "\r". Here's the output I get when I pass these files to the
    > script:
    >
    > % demo.pl dos.txt
    >
    > 0 matches out of 4 lines
    > % demo.pl mac.txt
    >
    > 0 matches out of 1 lines
    >
    > How can I change the script so that the output for unix.txt, dos.txt,
    > and mac.txt will be the same as the one shown above for unix.txt?
    >


    Since "\n" eq "\012" on unix, you ought to be able to
    do something like this to be the same on all platforms:

    my $lines = my $matches = 0;

    $/ = "\012";
    binmode STDIN;
    binmode STDOUT;

    while (<>) {
        $lines++;
        if (/z\012/) {
            $matches++;
            s/\012//g;
            print ">$_<";
        }
    }

    print "$/$matches matches out of $lines lines$/";
    __END__
     
    Steve C, Aug 13, 2009
    #4
  5. kj

    Nathan Keel Guest

    kj wrote:

    >
    > Mind-blowing, to say the least...
    >
    > Oh, well. Live and lurn. Thanks. And to Ben too.
    >
    > kynn


    Don't worry, use a real OS (not Windows) and you'll not have to think
    about these things, though they are easily dealt with, and you'll have
    a lot more benefits as well.
     
    Nathan Keel, Aug 14, 2009
    #5
  6. kj

    chris Guest

    kj wrote:
    > There are three major conventions for the end-of-line marker:
    > "\n", "\r\n", and "\r".
    >
    > In a variety of situations, Perl must split strings into "lines",
    > and must therefore follow a particular convention to identify line
    > boundaries. There are three situations that interest me in
    > particular: 1. the splitting into lines that happens when one
    > iterates over a file using the <> operator; 2. the meaning of the
    > operation performed by chomp; and 3. the meaning of the $ anchor
    > in regular expressions.
    >
    > These three issues are tested by the following simple script:
    >
    > my $lines = my $matches = 0;
    > while (<>) {
    > $lines++;
    > if (/z$/) {
    > $matches++;
    > chomp;
    > print ">$_<";
    > }
    > }
    >
    > print "$/$matches matches out of $lines lines$/";
    > __END__
    >
    > I have three files, unix.txt, dos.txt, and mac.txt, each containing
    > four lines. Disregarding the end-of-line character(s) these lines
    > are "foo", "bar", "baz", "frobozz".


    If you're on linux (it seems you are) I would pass any files of dubious
    origin through 'mac2unix' and 'dos2unix' first to ensure that your perl
    will parse them correctly.
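
If converting the files beforehand is not an option, the same normalization can be sketched inside Perl. This is only an illustration of the idea, not a description of what dos2unix or mac2unix do internally; the helper names normalize_eol and count_matches are made up, and the sketch assumes an ASCII platform where "\n" is "\012":

```perl
use strict;
use warnings;

# Hypothetical helper: turn CRLF and lone CR terminators into LF.
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\015\012?/\012/g;
    return $text;
}

# Run the loop from the original question over an in-memory copy of
# the normalized text, so <>-style reading, chomp and $ all see LF.
sub count_matches {
    my ($text) = @_;
    my $normalized = normalize_eol($text);
    open my $fh, '<', \$normalized or die "cannot open in-memory file: $!";
    local $/ = "\012";
    my ($lines, $matches, $out) = (0, 0, '');
    while (my $line = <$fh>) {
        $lines++;
        if ($line =~ /z$/) {
            $matches++;
            chomp $line;
            $out .= ">$line<";
        }
    }
    close $fh;
    return ($out, $matches, $lines);
}

# Stand-in for the contents of mac.txt.
my ($out, $matches, $lines) = count_matches("foo\015bar\015baz\015frobozz\015");
print "$out\n$matches matches out of $lines lines\n";
```

The whole input is held in memory here, so this suits small files better than large ones.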
     
    chris, Aug 14, 2009
    #6
  7. kj

    Steve C Guest

    Ben Morrow wrote:
    > Quoth Steve C <>:
    >> Since "\n" eq "\012" on unix, you ought to be able to
    >> do something like this to be the same on all platforms:
    >>
    >> my $lines = my $matches = 0;
    >>
    >> $/ = "\012";
    >> binmode STDIN;
    >> binmode STDOUT;
    >>
    >> while (<>) {
    >> $lines++;
    >> if (/z\012/) {
    >> $matches++;
    >> s/\012//g;
    >> print ">$_<";
    >> }
    >> }
    >>
    >> print "$/$matches matches out of $lines lines$/";
    >> __END__

    >
    > Did you try it? This completely fails with "\r"-separated files, and
    > fails to match any lines with "\r\n"-separated files.
    >
    > Ben
    >


    I misread the question.
     
    Steve C, Aug 14, 2009
    #7
  8. kj <> wrote:
    >There are three major conventions for the end-of-line marker:


    Yes.

    >"\n", "\r\n", and "\r".


    No. The end-of-line markers are "\010", "\013\010", and "\013".

    "\n" is Perl's short-hand notation for whatever end-of-line marker
    combination is used on the current platform, thus it can be any of the
    three.

    >How can I change the script so that the output for unix.txt, dos.txt,
    >and mac.txt will be the same as the one shown above for unix.txt?


    If you have to deal with cross-platform files then your best bet is to
    explicitly check for each combination individually and not to use the
    short-hand "\n".

    jue
     
    Jürgen Exner, Aug 15, 2009
    #8
  9. kj

    Guest

    On Sat, 15 Aug 2009 23:39:45 +0100, Ben Morrow <> wrote:

    >
    >Quoth Jürgen Exner <>:
    >> kj <> wrote:
    >> >There are three major conventions for the end-of-line marker:

    >>
    >> Yes.
    >>
    >> >"\n", "\r\n", and "\r".

    >>
    >> No. The end-of-line markers are "\010", "\013\010", and "\013".

    >
    >ITYM \012 and \015 there. \0-escapes are in octal.
    >

    <snip>
    >Ben


    He meant 10/13 respectively.
    Let's get this table going just for grins:

          lf     crlf      cr
    dec   10     13,10     13
    hex   0a     0d,0a     0d
    oct   012    015,012   015
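
The table can be checked from Perl itself. The following is a small sketch with a made-up helper name, describe, that prints each character of a string in decimal, hex and octal. It also shows what the mistyped escapes "\010" and "\013" actually are (backspace and vertical tab in ASCII):

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as
# "decimal/hex/octal" so the three notations can be compared.
sub describe {
    my ($string) = @_;
    return join ',',
        map { sprintf '%d/%02x/%03o', ord($_), ord($_), ord($_) }
        split //, $string;
}

print "lf   ", describe("\012"),     "\n";   # 10/0a/012
print "crlf ", describe("\015\012"), "\n";   # 13/0d/015,10/0a/012
print "cr   ", describe("\015"),     "\n";   # 13/0d/015

# The octal escapes from the earlier post are different characters.
print "\\010 ", describe("\010"), "\n";      # 8/08/010
print "\\013 ", describe("\013"), "\n";      # 11/0b/013
```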

    But how should binary intended be interpreted if opened for translation?
    Even if ascii and invalidness.

    The recovery of a applies to all regexp valid regex cannot create a mixed
    mode platform with append. Either all is converted OR invalid, or
    none is converted.

    No 0a0a0d0d0a0a. Naw, invalid. At best, recover what is possible,
    rewrite file, right the ship, destroy old. Don't tell anybody about it.
    Delete file, exit with success, or reformat hd, send it to deep magnetic
    disk recovery for partial recovery, tracks wiped clean.

    -sln
     
    , Aug 16, 2009
    #9
  10. Ben Morrow <> wrote:
    >Quoth Jürgen Exner <>:
    >> No. The end-of-line markers are "\010", "\013\010", and "\013".

    >
    >ITYM \012 and \015 there. \0-escapes are in octal.


    Yes, sorry.

    >> "\n" is Perl's short-hand notation for whatever end-of-line marker
    >> combination is used on the current platform, thus it can be any of the
    >> three.

    >
    >"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
    >Unix.


    But then how come that the file created by this little program

    open FOO, ">" , "foo";
    print FOO "k\n" x 20;
    close FOO;

    is 60 bytes long instead of 40 as would be expected if the 'k' and
    the "\n" each were only one byte long?

    C:\tmp>dir foo
    15-Aug-09 21:13 60 foo

    jue
     
    Jürgen Exner, Aug 16, 2009
    #10
  11. kj

    Guest

    On Sat, 15 Aug 2009 21:16:32 -0700, Jürgen Exner <> wrote:

    >Ben Morrow <> wrote:
    >>Quoth Jürgen Exner <>:
    >>> No. The end-of-line markers are "\010", "\013\010", and "\013".

    >>
    >>ITYM \012 and \015 there. \0-escapes are in octal.

    >
    >Yes, sorry.
    >
    >>> "\n" is Perl's short-hand notation for whatever end-of-line marker
    >>> combination is used on the current platform, thus it can be any of the
    >>> three.

    >>
    >>"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
    >>Unix.

    >
    >But then how come that the file created by this little program
    >
    >open FOO, ">" , "foo";
    >print FOO "k\n" x 20;
    >close FOO;
    >
    >is 60 bytes long instead of 40 as would be expected if the 'k' and
    >the "\n" each were only one byte long?
    >
    >C:\tmp>dir foo
    >15-Aug-09 21:13 60 foo
    >
    >jue


    Depends on what has edited it and how it is written out.
    Open a file with 0d-only EOLs in Word on Windows and it shows each
    line as if it ended in 0d0a. Modify and save it, and I think it keeps
    only 0d. But Word jacks a lot of stuff, especially encoding.
    -sln
     
    , Aug 16, 2009
    #11
  12. kj

    Willem Guest

    Jürgen Exner wrote:
    ) But then how come that the file created by this little program
    )
    ) open FOO, ">" , "foo";
    ) print FOO "k\n" x 20;
    ) close FOO;
    )
    ) is 60 bytes long instead of 40 as would be expected if the 'k' and
    ) the "\n" each were only one byte long?

    Because the I/O routine translates the newlines. Just like in C.
    Perl probably even uses the C I/O library to write to the file.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Aug 16, 2009
    #12
  13. kj

    Guest

    On Sun, 16 Aug 2009 07:59:33 +0000 (UTC), Willem <> wrote:

    >Jürgen Exner wrote:
    >) But then how come that the file created by this little program
    >)
    >) open FOO, ">" , "foo";
    >) print FOO "k\n" x 20;
    >) close FOO;
    >)
    >) is 60 bytes long instead of 40 as would be expected if the 'k' and
    >) the "\n" each were only one byte long?
    >
    >Because the I/O routine translates the newlines. Just like in C.
    >Perl probably even uses the C I/O library to write to the file.
    >
    >
    >SaSW, Willem


    There are heuristics in Windows programs. Just look at Word, a Microsoft
    offering.

    -sln
     
    , Aug 16, 2009
    #13
  14. Willem <> wrote:
    >Jürgen Exner wrote:
    >) But then how come that the file created by this little program
    >)
    >) open FOO, ">" , "foo";
    >) print FOO "k\n" x 20;
    >) close FOO;
    >)
    >) is 60 bytes long instead of 40 as would be expected if the 'k' and
    >) the "\n" each were only one byte long?
    >
    >Because the I/O routine translates the newlines.


    So, I guess you are saying that there is a context where "\n" does mean
    two characters, contrary to Ben's statement:
    "\n" can *never* mean "\015\012"

    jue
     
    Jürgen Exner, Aug 16, 2009
    #14
  15. On 2009-08-16 16:24, Jürgen Exner <> wrote:
    > Willem <> wrote:
    >>Jürgen Exner wrote:
    >>) But then how come that the file created by this little program
    >>)
    >>) open FOO, ">" , "foo";
    >>) print FOO "k\n" x 20;
    >>) close FOO;
    >>)
    >>) is 60 bytes long instead of 40 as would be expected if the 'k' and
    >>) the "\n" each were only one byte long?
    >>
    >>Because the I/O routine translates the newlines.

    >
    > So, I guess you are saying that there is a context where "\n" does mean
    > two characters, contrary to Ben's statement:
    > "\n" can *never* mean "\015\012"


    "\n" is *always* a string containing one character (\x{000A} on most
    platforms including Windows). However, when this character is written to
    a file handle, an I/O layer may convert this in any way it pleases. It
    may just pass it through unchanged, it may convert it into a sequence of
    two bytes (e.g. "\x0D\x0A"), or it might even pad all lines to a fixed
    length with spaces and not write any new line characters at all.

    On input the reverse transformation should be performed.
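
The distinction between the string and what the I/O layer writes can be sketched on any platform by asking for the :crlf layer explicitly. The helper name bytes_written is made up for this example, and the temporary file comes from the core File::Temp module:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given PerlIO layer and
# return the number of bytes that end up in the file.
sub bytes_written {
    my ($layer, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $out, ">$layer", $path or die "cannot write $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";
    return -s $path;
}

my $text = "k\n" x 20;

# The string itself always has 40 characters: "\n" is one character.
print "characters in the string: ", length($text), "\n";

# No translation: one byte per character.
print "bytes written with :raw:  ", bytes_written(':raw', $text), "\n";

# Each "\n" is expanded to CR LF on the way out, giving 60 bytes.
print "bytes written with :crlf: ", bytes_written(':crlf', $text), "\n";
```

On Windows the :crlf behaviour is what a plain open in text mode gives by default, which would account for the 60-byte file in the earlier post.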

    hp
     
    Peter J. Holzer, Aug 16, 2009
    #15
  16. kj

    Guest

    On 13 Aug 2009 21:13:17 GMT, Heiko Eißfeldt <> wrote:

    >kj wrote:
    >
    >> There are three major conventions for the end-of-line marker:
    >> "\n", "\r\n", and "\r".

    >
    >These notations are not unambiguous! See perlport documentation section
    >newlines for details.
    >
    >> In a variety of situations, Perl must split strings into "lines",
    >> and must therefore follow a particular convention to identify line
    >> boundaries. There are three situations that interest me in
    >> particular: 1. the splitting into lines that happens when one
    >> iterates over a file using the <> operator; 2. the meaning of the
    >> operation performed by chomp; and 3. the meaning of the $ anchor
    >> in regular expressions.

    >
    ><> and chomp use the $/ variable for line endings. Since $/ does not
    >support regular expressions, you cannot use this mechanism for all
    >types of line endings.
    >
    >The $ anchor normally is just the end of the string (with or without a
    >line ending).
    >
    >> How can I change the script so that the output for unix.txt, dos.txt,
    >> and mac.txt will be the same as the one shown above for unix.txt?

    >
    >use strict;
    >use warnings;
    >
    >my $lines = my $matches = 0;
    >{
    > local $/ = undef;
    > for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {

    ^^^^^^^^
    This won't work, depending on the translation mode opened or
    appended to before, opened now, etc.., 0d 0d 0a could be one, two
    or 3 eol's.
    In fact you don't even have, or couldn't create a reference anchor
    to tell the difference.
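
The ambiguity can be made concrete with a short sketch. The byte sequence 0d 0d 0a is counted under three different splitting rules; the rule names and the helper count_eols are invented for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count the terminators found under a given rule.
# The limit of -1 keeps empty fields, so the number of terminators is
# the number of fields minus one.
sub count_eols {
    my ($rule, $data) = @_;
    my @fields = split $rule, $data, -1;
    return @fields - 1;
}

my $data = "a\015\015\012b";    # a, CR, CR, LF, b

my %rules = (
    'CRLF preferred'  => qr/\015\012|\015|\012/,   # CR, then CRLF
    'every character' => qr/[\015\012]/,           # CR, CR, LF
    'whole run'       => qr/[\015\012]+/,          # CR CR LF as one
);

for my $name (sort keys %rules) {
    print "$name: ", count_eols($rules{$name}, $data), " end-of-line marker(s)\n";
}
```

The three rules report 2, 3 and 1 markers for the same bytes, and nothing in the data says which reading was intended.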

    -sln
     
    , Aug 17, 2009
    #16
  17. kj

    Guest

    On Sun, 16 Aug 2009 18:24:22 +0100, Ben Morrow <> wrote:

    >
    >Quoth Shmuel (Seymour J.) Metz <>:
    >> In <>, on 08/15/2009
    >> at 11:39 PM, Ben Morrow <> said:
    >>
    >> >"\n" can *never* mean "\015\012":

    >>
    >> Are you a betting man.

    >
    >Not usually :).
    >
    >> >"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
    >> >Unix. If "\n" was more than one byte/character (take your pick) long,
    >> >practically everything would break.

    >>
    >> Wrong; \n is two bytes on DOS and OS/2, and AFAIK nothing breaks except
    >> cgi.pm.

    >
    >No it's not, at least not as far as Perl is concerned. Files have CRLF
    >line endings, but they are (by default) translated into LF line endings
    >when the file is read. If you have a file containing
    >
    > fooCRLF
    >
    >and you read a line with
    >
    > open my $FOO, "<", "foo";
    > my $foo = <$FOO>;
    >
    >then $foo will be four bytes long, not five.
    >
    >> >AFAIK the only platforms where "\n" ne "\012" are Mac OS Classic and
    >> >the EBCDIC platforms, both of which are obsolete as far as Perl is
    >> >concerned.

    >>
    >> When there's ongoing maintenance then it's a rather lively corpse.

    >
    >Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.
    >I did not mean to imply they were obsolete for other purposes.
    >
    >Ben


    Yes, these are fairly standard ANSI translations.

    This phrase from the fopen API documentation sums it up:
    "Carriage return–line feed (CR-LF) combinations are translated
    into a single line feed character on input.
    Line feed characters are translated into CR-LF combinations on output. "


    When a file is opened in text mode, translation is the default, and these things happen:

    - On reads: CRLF are converted to LF's
    - On writes: LF is converted to CRLF's
    - EOL character is the LF
    - binmode(STDOUT,':raw') is not good for viewing because the console does real \r and \n

    Finally, I don't believe there is a clear-cut solution to the OP's problem.

    If one platform can append CRs and another LFs as EOLs, then it can't be
    determined that these are separate EOLs. Of course, another comes along and
    adds the CRLF pair as EOL.

    Either way, opening a file in ':raw' mode and doing your own eol
    translations, would make this, by definition: if /\015?\012?/ ++$linecnt,
    invalid.

    I guess there is the C fmode and setmode to read, turn on/off translations.
    Unless there is a convergence of platform meanings that don't step on each
    other when files are appended in translated mode (if it is supported), opening
    a file untranslated and doing your own EOL translation would not seem to be
    100% reliable.

    -sln

    Raw Data:
    (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
    (18) = T..W....Xedf..Y..Z

    Writing translated text file
    --------------------

    Reading translated text file
    --------------------
    tran (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
    (18) = T..W....Xedf..Y..Z
    ( 3) = T ( d a )
    ( 2) = W ( a )
    ( 1) = ( a )
    (11) = ( d d ) Xedf ( d d ) Y ( d a )
    ( 1) = Z

    Reading un-translated text file
    --------------------
    raw (22) = 54 d d a 57 d a d a d d 58 65 64 66 d d 59 d d a 5a
    (22) = T ( d d a ) W ( d a d a d d ) Xedf ( d d ) Y ( d d a ) Z
    (22) = T...W......Xedf..Y...Z
    ( 4) = T ( d d a )
    ( 3) = W ( d a )
    ( 2) = ( d a )
    (12) = ( d d ) Xedf ( d d ) Y ( d d a )
    ( 1) = Z

    =============================================

    Writing RAW text file
    --------------------

    Reading translated text file
    --------------------
    tran (16) = 54 a 57 a a d d 58 65 64 66 d d 59 a 5a
    (16) = T.W....Xedf..Y.Z
    ( 2) = T ( a )
    ( 2) = W ( a )
    ( 1) = ( a )
    (10) = ( d d ) Xedf ( d d ) Y ( a )
    ( 1) = Z

    Reading un-translated text file
    --------------------
    raw (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
    (18) = T ( d a ) W ( a a d d ) Xedf ( d d ) Y ( d a ) Z
    (18) = T..W....Xedf..Y..Z
    ( 3) = T ( d a )
    ( 2) = W ( a )
    ( 1) = ( a )
    (11) = ( d d ) Xedf ( d d ) Y ( d a )
    ( 1) = Z
     
    , Aug 17, 2009
    #17
