end-of-line conventions

kj

There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

These three issues are tested by the following simple script:

my $lines = my $matches = 0;
while (<>) {
    $lines++;
    if (/z$/) {
        $matches++;
        chomp;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__

I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

The file unix.txt uses "\n" to separate the lines. The output that
I get when I pass it as the argument to the script is this:

% demo.pl unix.txt
>baz<>frobozz<
2 matches out of 4 lines

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
script:

% demo.pl dos.txt

0 matches out of 4 lines
% demo.pl mac.txt

0 matches out of 1 lines

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

(Mucking with the value of $/ I was able to get <> to split the
input stream at the right places, but it had no impact on the result
of the regular expression match.)

TIA!

kynn
 
Heiko Eißfeldt

kj said:
There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.

kj said:
In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.

The $ anchor normally matches only at the end of the string, or just
before a newline that ends the string.
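A minimal sketch of why /z$/ stops matching once a carriage return is involved. The hash names and the "tolerant" pattern are illustrative assumptions, not part of the original script; octal escapes are used so the bytes are unambiguous.

```perl
use strict;
use warnings;

# The same word terminated in each of the three styles.
my %line = (
    unix => "baz\012",
    dos  => "baz\015\012",
    mac  => "baz\015",
);

my (%plain, %tolerant);
for my $kind (sort keys %line) {
    # Plain $ matches at the very end, or just before a final "\n".
    $plain{$kind} = $line{$kind} =~ /z$/ ? 1 : 0;

    # Spelling out an optional CR and an optional LF covers all three.
    $tolerant{$kind} = $line{$kind} =~ /z\015?\012?\z/ ? 1 : 0;

    printf "%-4s plain=%d tolerant=%d\n", $kind, $plain{$kind}, $tolerant{$kind};
}
```

On a platform where "\n" is "\012", only the unix string should satisfy the plain pattern, because in the other two a "\015" sits between the z and the place where $ can match.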

kj said:
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

use strict;
use warnings;

my $lines = my $matches = 0;
{
    local $/ = undef;
    for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {
        $lines++;
        if (/z$/) {
            $matches++;
            print ">$_<";
        }
    }
}
print "\n$matches matches out of $lines lines\n";
__END__

This uses <> with no line end definition, and iterates with a regular
expression suitable for three types of line endings. The line ending is
not included in $_, so chomp is omitted.
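The same slurp-and-split idea can also be sketched with split, under the assumption that trailing blank lines do not need to be counted (split drops trailing empty fields). The helper name split_lines and the inline sample strings are hypothetical. One reason to consider this variant: as far as I can tell, the m//g pattern above can also succeed with an empty match at the very end of the input, which would add one to the line count.

```perl
use strict;
use warnings;

# Split a whole buffer into lines, accepting CRLF, lone CR, or lone LF.
# CRLF is listed first so that it is consumed as a single terminator.
sub split_lines {
    my ($text) = @_;
    return split /\015\012|\015|\012/, $text;
}

my @samples = (
    "foo\012bar\012baz\012frobozz\012",                   # unix style
    "foo\015\012bar\015\012baz\015\012frobozz\015\012",   # dos style
    "foo\015bar\015baz\015frobozz\015",                   # mac style
);

for my $sample (@samples) {
    my @lines   = split_lines($sample);
    my @matches = grep { /z\z/ } @lines;   # terminators are already gone
    print ">$_<" for @matches;
    printf "\n%d matches out of %d lines\n", scalar @matches, scalar @lines;
}
```

In a real script the sample strings would be replaced by the slurped contents of the file.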

If you need the line endings in $_, replace the corresponding lines
with these:

    for (<> =~ m{\G([^\012\015]* \015?\012?)}xmsg) {
        $lines++;
        if (/z\s*$/) {
            $matches++;
            s{[\015\012][\015\012]?}{}xms; # chomp replacement

Hope that helps, heiko
 
Steve C

kj said:
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

Since "\n" eq "\012" on unix, you ought to be able to
do something like this to be the same on all platforms:

my $lines = my $matches = 0;

$/ = "\012";
binmode STDIN;
binmode STDOUT;

while (<>) {
    $lines++;
    if (/z\012/) {
        $matches++;
        s/\012//g;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__
 
Nathan Keel

kj said:
Mind-blowing, to say the least...

Oh, well. Live and lurn. Thanks. And to Ben too.

kynn

Don't worry, use a real OS (not Windows) and you'll not have to think
about these things, though they are easily dealt with, and you'll have
a lot more benefits as well.
 
chris

kj said:
I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

If you're on Linux (it seems you are), I would pass any files of dubious
origin through 'mac2unix' and 'dos2unix' first to ensure that your perl
will parse them correctly.
 
Steve C

Ben said:
Did you try it? This completely fails with "\r"-separated files, and
fails to match any lines with "\r\n"-separated files.

Ben

I misread the question.
 
Jürgen Exner

kj said:
There are three major conventions for the end-of-line marker:

Yes.

kj said:
"\n", "\r\n", and "\r".

No. The end-of-line markers are "\010", "\013\010", and "\013".

"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

kj said:
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

If you have to deal with cross-platform files then your best bet is to
explicitly check for each combination individually and not to use the
short-hand "\n".
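One way to check each combination explicitly is a small chomp replacement along these lines. This is only a sketch: the helper name strip_eol and its return convention are assumptions made for illustration.

```perl
use strict;
use warnings;

# Remove one trailing terminator and report which convention it used.
# The byte values are written in octal so that nothing depends on what
# "\n" or "\r" mean on the current platform.
sub strip_eol {
    my ($line) = @_;
    return ($1, "crlf") if $line =~ /\A(.*)\015\012\z/s;
    return ($1, "lf")   if $line =~ /\A(.*)\012\z/s;
    return ($1, "cr")   if $line =~ /\A(.*)\015\z/s;
    return ($line, "none");
}

for my $raw ("baz\012", "baz\015\012", "baz\015", "baz") {
    my ($text, $kind) = strip_eol($raw);
    printf "%-4s >%s<\n", $kind, $text;
}
```

The CRLF test comes first; otherwise a "\015\012" ending would be reported as a lone LF with a stray CR left in the text.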

jue
 
sln

Ben Morrow said:
ITYM \012 and \015 there. \0-escapes are in octal.

He meant 10/13 respectively.
Let's get this table going just for grins:

       lf     crlf      cr
dec    10     13,10     13
hex    0a     0d,0a     0d
oct    012    015,012   015
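The table can be reproduced from the octal escapes themselves with a few lines of Perl. This is a sketch; the @rows structure is an assumption made for illustration.

```perl
use strict;
use warnings;

my @rows;
for my $eol (["lf", "\012"], ["crlf", "\015\012"], ["cr", "\015"]) {
    my ($name, $bytes) = @$eol;
    my @codes = unpack "C*", $bytes;    # numeric value of each byte
    push @rows, {
        name => $name,
        dec  => join(",", @codes),
        hex  => join(",", map { sprintf "%02x", $_ } @codes),
        oct  => join(",", map { sprintf "%03o", $_ } @codes),
    };
}

printf "%-5s dec %-6s hex %-6s oct %s\n", @{$_}{qw(name dec hex oct)}
    for @rows;
```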

But how should binary intended be interpreted if opened for translation?
Even if ascii and invalidness.

The recovery of a applies to all regexp valid regex cannot create a mixed
mode platform with append. Either all is converted OR invalid, or
none is converted.

No 0a0a0d0d0a0a. Naw, invalid. At best, recover what is possible,
rewrite file, right the ship, destroy old. Don't tell anybody about it.
Delete file, exit with success, or reformat hd, send it to deep magnetic
disk recovery for partial recovery, tracks wiped clean.

-sln
 
Jürgen Exner

Ben Morrow said:
ITYM \012 and \015 there. \0-escapes are in octal.

Yes, sorry.

Ben Morrow said:
"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix.

But then how come that the file created by this little program

open FOO, ">" , "foo";
print FOO "k\n" x 20;
close FOO;

is 60 bytes long instead of 40, as would be expected if the 'k' and
the "\n" were each only one byte long?

C:\tmp>dir foo
15-Aug-09 21:13 60 foo

jue
 
sln

Depends on what has edited it and how it is written out.
Open a file with 0d-only EOLs in Word on Windows and it edits each line
as if it ended in 0d0a. Modify and save it, and I think it keeps only 0d.
But Word jacks a lot of stuff, especially encoding.
-sln
 
Willem

Jürgen Exner wrote:
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40, as would be expected if the 'k' and
) the "\n" were each only one byte long?

Because the I/O routine translates the newlines. Just like in C.
Perl probably even uses the C I/O library to write to the file.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
sln

There are heuristics in Windows programs. Just look at Word, a Microsoft
offering.

-sln
 
Jürgen Exner

Willem said:
Jürgen Exner wrote:
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40, as would be expected if the 'k' and
) the "\n" were each only one byte long?

Because the I/O routine translates the newlines.

So, I guess you are saying that there is a context where "\n" does mean
two characters, contrary to Ben's statement:
"\n" can *never* mean "\015\012"

jue
 
Peter J. Holzer

Jürgen Exner said:
So, I guess you are saying that there is a context where "\n" does mean
two characters, contrary to Ben's statement:
"\n" can *never* mean "\015\012"

"\n" is *always* a string containing one character (\x{000A} on most
platforms including Windows). However, when this character is written to
a file handle, an I/O layer may convert this in any way it pleases. It
may just pass it through unchanged, it may convert it into a sequence of
two bytes (e.g. "\x0D\x0A"), or it might even pad all lines to a fixed
length with spaces and not write any new line characters at all.

On input the reverse transformation should be performed.
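The 60-versus-40 observation can be reproduced on any platform by asking for the :crlf layer explicitly. This is a sketch that assumes a Unix-like system (where no translation happens by default) and uses File::Temp for a scratch file; the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write twenty two-character strings through an explicit :crlf layer.
open my $out, '>:crlf', $path or die "open: $!";
print {$out} "k\n" x 20;
close $out or die "close: $!";

my $in_memory = length("k\n" x 20);   # characters in the Perl string
my $on_disk   = -s $path;             # bytes after the layer has run

# Read the first line back twice: untranslated, then through :crlf.
open my $raw, '<:raw', $path or die "open: $!";
my $raw_line = <$raw>;
close $raw;

open my $txt, '<:crlf', $path or die "open: $!";
my $txt_line = <$txt>;
close $txt;

printf "in memory: %d, on disk: %d\n", $in_memory, $on_disk;
printf "raw line: %d bytes, translated line: %d characters\n",
    length($raw_line), length($txt_line);
```

The string itself never changes length; only the bytes that the layer writes to, or reads from, the file differ.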

hp
 
sln

kj said:
There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.
In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.

The $ anchor normally matches only at the end of the string, or just
before a newline that ends the string.
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

use strict;
use warnings;

my $lines = my $matches = 0;
{
local $/ = undef;
for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {
^^^^^^^^
This won't work in general. Depending on the translation mode in which
the file was written or appended to before, and the mode in which it is
opened now, 0d 0d 0a could be one, two, or three EOLs.
In fact, you don't have, and couldn't create, a reference anchor
to tell the difference.

-sln
 
sln

Ben said:
Not usually :).


No it's not, at least not as far as Perl is concerned. Files have CRLF
line endings, but they are (by default) translated into LF line endings
when the file is read. If you have a file containing

fooCRLF

and you read a line with

open my $FOO, "<", "foo";
my $foo = <$FOO>;

then $foo will be four bytes long, not five.


Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.
I did not mean to imply they were obsolete for other purposes.

Ben

Yes, these are fairly standard ANSI translations.

This phrase from the fopen API documentation sums it up:
"Carriage return–line feed (CR-LF) combinations are translated
into a single line feed character on input.
Line feed characters are translated into CR-LF combinations on output. "


Opening in text (translated) mode is the default, and these things happen:

- On reads: CRLF are converted to LF's
- On writes: LF is converted to CRLF's
- EOL character is the LF
- binmode(STDOUT,':raw') is not good for viewing because the console does real \r and \n

Finally, I don't believe there is a clear-cut solution to the OP's problem.

If one platform can append CRs and another LFs as EOLs, then it can't be
determined whether these are separate EOLs. Of course, another one comes
along and adds the CRLF pair as an EOL.

Either way, opening a file in ':raw' mode and doing your own EOL
translation would make this, by definition: if /\015?\012?/ ++$linecnt,
invalid.

I guess there is the C fmode and setmode to read and to turn translation
on or off. Unless there is a convergence of platform meanings that don't
step on each other when files are appended to in translated mode (if that
is supported), opening a file untranslated and doing your own EOL
translation would not seem to be 100% reliable.

-sln

Raw Data:
(18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T..W....Xedf..Y..Z

Writing translated text file
--------------------

Reading translated text file
--------------------
tran (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T..W....Xedf..Y..Z
( 3) = T ( d a )
( 2) = W ( a )
( 1) = ( a )
(11) = ( d d ) Xedf ( d d ) Y ( d a )
( 1) = Z

Reading un-translated text file
--------------------
raw (22) = 54 d d a 57 d a d a d d 58 65 64 66 d d 59 d d a 5a
(22) = T ( d d a ) W ( d a d a d d ) Xedf ( d d ) Y ( d d a ) Z
(22) = T...W......Xedf..Y...Z
( 4) = T ( d d a )
( 3) = W ( d a )
( 2) = ( d a )
(12) = ( d d ) Xedf ( d d ) Y ( d d a )
( 1) = Z

=============================================

Writing RAW text file
--------------------

Reading translated text file
--------------------
tran (16) = 54 a 57 a a d d 58 65 64 66 d d 59 a 5a
(16) = T.W....Xedf..Y.Z
( 2) = T ( a )
( 2) = W ( a )
( 1) = ( a )
(10) = ( d d ) Xedf ( d d ) Y ( a )
( 1) = Z

Reading un-translated text file
--------------------
raw (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T ( d a ) W ( a a d d ) Xedf ( d d ) Y ( d a ) Z
(18) = T..W....Xedf..Y..Z
( 3) = T ( d a )
( 2) = W ( a )
( 1) = ( a )
(11) = ( d d ) Xedf ( d d ) Y ( d a )
( 1) = Z
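Whether a given file mixes conventions, as the 18-byte sample in the dump above does, can at least be detected before trying to split it. The following is a sketch: the hex string is the "Raw Data" line from the dump, and the %seen counter is an assumption made for illustration.

```perl
use strict;
use warnings;

# The 18 bytes from the "Raw Data" line, written as hex digits.
my $data = pack "H*", "540d0a570a0a0d0d586564660d0d590d0a5a";

# Count each kind of terminator; CRLF is tried first so that a pair is
# never counted as a lone CR followed by a lone LF.
my %seen = (crlf => 0, cr => 0, lf => 0);
while ($data =~ /(\015\012|\015|\012)/g) {
    my $eol  = $1;
    my $kind = $eol eq "\015\012" ? "crlf"
             : $eol eq "\015"     ? "cr"
             :                      "lf";
    $seen{$kind}++;
}

my $kinds = grep { $_ > 0 } values %seen;   # how many conventions occur
my $mixed = $kinds > 1;

printf "crlf=%d cr=%d lf=%d mixed=%s\n",
    @seen{qw(crlf cr lf)}, $mixed ? "yes" : "no";
```

This only reports what is in the file. It does not settle the ambiguity raised above, since a CR directly followed by an LF is always counted as one CRLF, whatever the writer intended.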