end-of-line conventions

kj · Aug 13, 2009

There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

These three issues are tested by the following simple script:

my $lines = my $matches = 0;
while (<>) {
    $lines++;
    if (/z$/) {
        $matches++;
        chomp;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__

I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

The file unix.txt uses "\n" to separate the lines. The output that
I get when I pass it as the argument to the script is this:

% demo.pl unix.txt
>baz<>frobozz<
2 matches out of 4 lines

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
script:

% demo.pl dos.txt

0 matches out of 4 lines
% demo.pl mac.txt

0 matches out of 1 lines

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

(Mucking with the value of $/ I was able to get <> to split the
input stream at the right places, but it had no impact on the result
of the regular expression match.)

TIA!

kynn

kj · Aug 13, 2009

In said:
Have you read the "Newlines" section in

perldoc perlport

perl detects its platform when it is *compiled*.

That is, perl decides what line ending to use when it is built.

You can't.

Mind-blowing, to say the least...

Oh, well. Live and lurn. Thanks. And to Ben too.

kynn

Heiko Eißfeldt · Aug 13, 2009

kj said:
There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.

The $ anchor normally is just the end of the string (with or without a
line ending).
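
To make the two points above concrete, here is a minimal sketch. It assumes a platform where "\n" is "\012" (Unix or Windows), and the sample string is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: a DOS-style line as it looks when the file is read
# without CRLF translation (for example on Unix).
my $dos_line = "baz\015\012";

# $ matches at the very end of the string or just before a final "\n",
# so the "\015" sitting in front of the "\012" defeats /z$/.
my $plain_match    = $dos_line =~ /z$/      ? 1 : 0;   # 0
my $tolerant_match = $dos_line =~ /z\015?$/ ? 1 : 0;   # 1

# chomp removes one trailing copy of $/ (a literal string, "\n" by default),
# so it strips the "\012" but leaves the "\015" behind.
my $chomped = $dos_line;
chomp $chomped;

printf "plain=%d tolerant=%d chomped_length=%d\n",
    $plain_match, $tolerant_match, length $chomped;
```

This prints plain=0 tolerant=1 chomped_length=4, which is why the original script finds no matches in dos.txt.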

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

use strict;
use warnings;

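my $lines = my $matches = 0;
{
    local $/ = undef;    # slurp mode: <> returns the whole file at once
    # (?!\z) keeps the pattern from matching one extra empty "line" at end of input
    for (<> =~ m{\G(?!\z)([^\012\015]*) \015?\012?}xmsg) {
        $lines++;
        if (/z$/) {
            $matches++;
            print ">$_<";
        }
    }
}
print "\n$matches matches out of $lines lines\n";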
__END__

This uses <> with no line end definition, and iterates with a regular
expression suitable for three types of line endings. The line ending is
not included in $_, so chomp is omitted.
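
As a rough alternative sketch of the same idea, the slurped text can also be cut up with split. The helper name split_lines and the three sample strings are made up for illustration; they mimic the contents of unix.txt, dos.txt, and mac.txt:

```perl
use strict;
use warnings;

# split_lines is a hypothetical helper: it accepts CRLF, a lone CR, or a
# lone LF as a line ending. CRLF is tried first so it counts as one ending.
sub split_lines {
    my ($text) = @_;
    return split /\015\012|\015|\012/, $text;
}

for my $sample ("foo\012bar\012baz\012frobozz\012",
                "foo\015\012bar\015\012baz\015\012frobozz\015\012",
                "foo\015bar\015baz\015frobozz\015") {
    my @lines   = split_lines($sample);
    my @matches = grep { /z$/ } @lines;
    printf "%d matches out of %d lines\n", scalar @matches, scalar @lines;
}
```

All three samples report 2 matches out of 4 lines. Note that split drops trailing empty fields, so blank lines at the very end of a file are not counted with this approach.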

If you need the line endings in $_ use the following lines.
for (<> =~ m{\G(?!\z)([^\012\015]* \015?\012?)}xmsg) {
    $lines++;
    if (/z\s*$/) {
        $matches++;
        s{[\015\012][\015\012]?}{}xms;    # chomp replacement
        print ">$_<";
    }
}

Hope that helps, heiko

Steve C · Aug 13, 2009

kj said:
There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

These three issues are tested by the following simple script:

my $lines = my $matches = 0;
while (<>) {
$lines++;
if (/z$/) {
$matches++;
chomp;
print ">$_<";
}
}

print "$/$matches matches out of $lines lines$/";
__END__

I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

The file unix.txt uses "\n" to separate the lines. The output that
I get when I pass it as the argument to the script is this:

% demo.pl unix.txt
2 matches out of 4 lines

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
script:

% demo.pl dos.txt

0 matches out of 4 lines
% demo.pl mac.txt

0 matches out of 1 lines

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

Since "\n" eq "\012" on unix, you ought to be able to
do something like this to be the same on all platforms:

my $lines = my $matches = 0;

$/ = "\012";
binmode STDIN;
binmode STDOUT;

while (<>) {
    $lines++;
    if (/z\012/) {
        $matches++;
        s/\012//g;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__

Nathan Keel · Aug 14, 2009

kj said:
Mind-blowing, to say the least...

Oh, well. Live and lurn. Thanks. And to Ben too.

kynn

Don't worry, use a real OS (not Windows) and you'll not have to think
about these things, though they are easily dealt with, and you'll have
a lot more benefits as well.

chris · Aug 14, 2009

kj said:
There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

These three issues are tested by the following simple script:

my $lines = my $matches = 0;
while (<>) {
$lines++;
if (/z$/) {
$matches++;
chomp;
print ">$_<";
}
}

print "$/$matches matches out of $lines lines$/";
__END__

I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

If you're on linux (it seems you are) I would pass any files of dubious
origin through 'mac2unix' and 'dos2unix' first to ensure that your perl
will parse them correctly.
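
Where those tools are not installed, the same normalization can be sketched in Perl itself. The helper name to_unix_eol is made up for illustration, and the text is assumed to have been read without any CRLF translation (for example through a ':raw' layer):

```perl
use strict;
use warnings;

# to_unix_eol is a hypothetical helper: it rewrites CRLF and lone CR
# endings to a single LF and leaves existing LF endings alone.
sub to_unix_eol {
    my ($text) = @_;
    $text =~ s/\015\012?/\012/g;
    return $text;
}

my $normalized = to_unix_eol("foo\015\012bar\015baz\012");
my @lines      = split /\012/, $normalized;
printf "%d lines: %s\n", scalar @lines, join ',', @lines;
```

This prints "3 lines: foo,bar,baz". After normalization the original script's <>, chomp, and /z$/ logic can be applied to the text unchanged.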

Steve C · Aug 14, 2009

Ben said:
Did you try it? This completely fails with "\r"-separated files, and
fails to match any lines with "\r\n"-separated files.

Ben

I misread the question.

Jürgen Exner · Aug 15, 2009

kj said:
There are three major conventions for the end-of-line marker:
Yes.

"\n", "\r\n", and "\r".

No. The end-of-line markers are "\010", "\013\010", and "\013".

"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

If you have to deal with cross-platform files then your best bet is to
explicitly check for each combination individually and not to use the
short-hand "\n".
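
A minimal sketch of such an explicit check, using the octal values \012 (LF) and \015 (CR) that are given later in the thread. The helper name count_eol_styles is made up for illustration:

```perl
use strict;
use warnings;

# count_eol_styles is a hypothetical helper: it tallies how many CRLF,
# lone CR, and lone LF endings occur in a string.
sub count_eol_styles {
    my ($text) = @_;
    my %count = (crlf => 0, cr => 0, lf => 0);
    # CRLF is listed first so the pair is not counted as a CR plus an LF.
    while ($text =~ /(\015\012|\015|\012)/g) {
        my $eol = $1;
        if    ($eol eq "\015\012") { $count{crlf}++ }
        elsif ($eol eq "\015")     { $count{cr}++ }
        else                       { $count{lf}++ }
    }
    return \%count;
}

my $counts = count_eol_styles("foo\015\012bar\015baz\012");
printf "crlf=%d cr=%d lf=%d\n", @{$counts}{qw(crlf cr lf)};
```

For the sample string this prints crlf=1 cr=1 lf=1. A file whose tallies are non-zero in more than one column has mixed endings and needs a policy decision before it is split into lines.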

jue

sln · Aug 16, 2009

ITYM \012 and \015 there. \0-escapes are in octal.

Ben

He meant 10/13 respectively.
Let's get this table going just for grins:

       lf     crlf      cr
dec    10     13,10     13
hex    0a     0d,0a     0d
oct    012    015,012   015
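
The table can be regenerated from Perl's own escapes, which is a quick way to double-check the values. This is only a sketch; the helper name codes is made up for illustration:

```perl
use strict;
use warnings;

my %eol = (lf => "\012", crlf => "\015\012", cr => "\015");

# codes is a hypothetical helper: it formats every character of a string
# with the given sprintf format and joins the results with commas.
sub codes {
    my ($format, $string) = @_;
    return join ',', map { sprintf($format, ord($_)) } split //, $string;
}

printf "%-4s %-8s %-8s %-8s\n", '', qw(lf crlf cr);
for my $base ([dec => '%d'], [hex => '%02x'], [oct => '%03o']) {
    my ($label, $format) = @$base;
    printf "%-4s %-8s %-8s %-8s\n",
        $label, map { codes($format, $eol{$_}) } qw(lf crlf cr);
}
```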

But how should binary intended be interpreted if opened for translation?
Even if ascii and invalidness.

The recovery of a applies to all regexp valid regex cannot create a mixed
mode platform with append. Either all is converted OR invalid, or
none is converted.

No 0a0a0d0d0a0a. Naw, invalid. At best, recover what is possible,
rewrite file, right the ship, destroy old. Don't tell anybody about it.
Delete file, exit with success, or reformat hd, send it to deep magnetic
disk recovery for partial recovery, tracks wiped clean.

-sln

Jürgen Exner · Aug 16, 2009

Ben Morrow said:
ITYM \012 and \015 there. \0-escapes are in octal.

Yes, sorry.

"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix.

But then how come that the file created by this little program

open FOO, ">" , "foo";
print FOO "k\n" x 20;
close FOO;

is 60 bytes long instead of 40, as would be expected if the 'k' and
the "\n" each were only one byte long?

C:\tmp>dir foo
15-Aug-09 21:13 60 foo

jue

sln · Aug 16, 2009

Yes, sorry.

But then how come that the file created by this little program

open FOO, ">" , "foo";
print FOO "k\n" x 20;
close FOO;

is 60 bytes long instead of 40, as would be expected if the 'k' and
the "\n" each were only one byte long?

C:\tmp>dir foo
15-Aug-09 21:13 60 foo

jue

Depends on what has edited it and how it is written out.
Open a file with 0d-only EOLs in Word on Windows and it edits each line
as if it ended in 0d0a. Modify and save it, and I think it keeps only the 0d.
But Word jacks a lot of stuff, especially encoding.
-sln

Willem · Aug 16, 2009

Jürgen Exner wrote:
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40, as would be expected if the 'k' and
) the "\n" each were only one byte long?

Because the I/O routine translates the newlines. Just like in C.
Perl probably even uses the C I/O library to write to the file.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

sln · Aug 16, 2009

Jürgen Exner wrote:
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40, as would be expected if the 'k' and
) the "\n" each were only one byte long?

Because the I/O routine translates the newlines. Just like in C.
Perl probably even uses the C I/O library to write to the file.

SaSW, Willem

There are heuristics in Windows programs. Just look at Word, a Microsoft
offering.

-sln

Jürgen Exner · Aug 16, 2009

Willem said:
Jürgen Exner wrote:
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40, as would be expected if the 'k' and
) the "\n" each were only one byte long?

Because the I/O routine translates the newlines.

So, I guess you are saying that there is a context where "\n" does mean
two characters, contrary to Ben's statement:
"\n" can *never* mean "\015\012"

jue

Peter J. Holzer · Aug 16, 2009

So, I guess you are saying that there is a context where "\n" does mean
two characters, contrary to Ben's statement:
"\n" can *never* mean "\015\012"

"\n" is *always* a string containing one character (\x{000A} on most
platforms including Windows). However, when this character is written to
a file handle, an I/O layer may convert this in any way it pleases. It
may just pass it through unchanged, it may convert it into a sequence of
two bytes (e.g. "\x0D\x0A"), or it might even pad all lines to a fixed
length with spaces and not write any new line characters at all.

On input the reverse transformation should be performed.
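
The 60-byte file can be reproduced on any platform by asking for the translating layer explicitly. This is a sketch under the assumption that a Win32 perl applies a ':crlf' layer to text files by default; the scratch file comes from File::Temp:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is removed automatically when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

# Write the same data as the little program above, through a :crlf layer.
open my $out, '>:crlf', $path or die "open: $!";
print {$out} "k\n" x 20;
close $out or die "close: $!";

my $string_length = length("k\n" x 20);   # characters in the Perl string
my $file_size     = -s $path;             # bytes that reached the disk

printf "string: %d characters, file: %d bytes\n", $string_length, $file_size;
```

This prints "string: 40 characters, file: 60 bytes": the string still holds one character per "\n", and the extra 20 bytes are added by the layer on the way out.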

hp

sln · Aug 17, 2009

kj said:
kj said:

There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.

The $ anchor normally is just the end of the string (with or without a
line ending).

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

use strict;
use warnings;

my $lines = my $matches = 0;
{
local $/ = undef;
for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {

^^^^^^^^
This won't work, depending on the translation mode opened or
appended to before, opened now, etc.., 0d 0d 0a could be one, two
or 3 eol's.
In fact you don't even have, or couldn't create a reference anchor
to tell the difference.
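
A small sketch of that ambiguity: the same bytes, 0d 0d 0a between two letters, give one, two, or three line endings depending on the policy chosen. The three policy names are made up for illustration:

```perl
use strict;
use warnings;

my $bytes = "T\015\015\012W";

# Each policy splits the same bytes; the -1 limit keeps empty fields, so
# the number of line endings found is the number of fields minus one.
my %fields = (
    pair_or_single => [ split /\015\012|\015|\012/, $bytes, -1 ],  # CR, then CRLF
    every_char     => [ split /[\015\012]/,         $bytes, -1 ],  # CR, CR, LF
    whole_run      => [ split /[\015\012]+/,        $bytes, -1 ],  # one run
);

my %eol_count = map { $_ => scalar(@{ $fields{$_} }) - 1 } keys %fields;

printf "%-15s %d\n", $_, $eol_count{$_} for sort keys %eol_count;
```

The counts come out as 3 for every_char, 2 for pair_or_single, and 1 for whole_run. Nothing in the bytes themselves says which reading the writer intended.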

-sln

sln · Aug 17, 2009

Not usually.

No it's not, at least not as far as Perl is concerned. Files have CRLF
line endings, but they are (by default) translated into LF line endings
when the file is read. If you have a file containing

fooCRLF

and you read a line with

open my $FOO, "<", "foo";
my $foo = <$FOO>;

then $foo will be four bytes long, not five.
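
The four-versus-five difference can be checked on any platform by choosing the layer explicitly. This is a sketch, assuming that a Win32 perl's default text mode behaves like the ':crlf' layer; the helper name first_line_length is made up for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding exactly the five bytes "foo" CR LF.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                      # write the bytes exactly as given
print {$tmp_fh} "foo\015\012";
close $tmp_fh or die "close: $!";

# first_line_length is a hypothetical helper: it reads one line through
# the given PerlIO layer and returns its length.
sub first_line_length {
    my ($layer) = @_;
    open my $fh, "<$layer", $path or die "open: $!";
    my $line = <$fh>;
    close $fh or die "close: $!";
    return length $line;
}

printf "translated: %d, raw: %d\n",
    first_line_length(':crlf'), first_line_length(':raw');
```

This prints "translated: 4, raw: 5": the ':crlf' layer hands back "foo\n", while ':raw' leaves the CR in place.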

Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.
I did not mean to imply they were obsolete for other purposes.

Ben

Yes, these are fairly standard ANSI translations.

This phrase from the fopen API documentation sums it up:
"Carriage return–line feed (CR-LF) combinations are translated
into a single line feed character on input.
Line feed characters are translated into CR-LF combinations on output. "

Text (translated) mode is the default when opening a file, and then these things happen:

- On reads: each CRLF is converted to an LF
- On writes: each LF is converted to a CRLF
- The EOL character is the LF
- binmode(STDOUT,':raw') is not good for viewing because the console then performs a real \r and \n

Finally, I don't believe there is a clear-cut solution to the OP's problem.

If one platform can append CRs and another LFs as EOLs, then it can't be
determined that these are separate EOLs. Of course, another comes along and
adds the CRLF pair as EOL.

Either way, opening a file in ':raw' mode and doing your own eol
translations, would make this, by definition: if /\015?\012?/ ++$linecnt,
invalid.

I guess there are the C fmode and setmode to read and to turn translations on and off.
Unless there is a convergence of platform meanings that don't step on each
other when files are appended in translated mode (if it is supported), opening
a file untranslated and doing your own EOL translation would not seem to be
100% reliable.

-sln

Raw Data:
(18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T..W....Xedf..Y..Z

Writing translated text file
--------------------

Reading translated text file
--------------------
tran (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T..W....Xedf..Y..Z
( 3) = T ( d a )
( 2) = W ( a )
( 1) = ( a )
(11) = ( d d ) Xedf ( d d ) Y ( d a )
( 1) = Z

Reading un-translated text file
--------------------
raw (22) = 54 d d a 57 d a d a d d 58 65 64 66 d d 59 d d a 5a
(22) = T ( d d a ) W ( d a d a d d ) Xedf ( d d ) Y ( d d a ) Z
(22) = T...W......Xedf..Y...Z
( 4) = T ( d d a )
( 3) = W ( d a )
( 2) = ( d a )
(12) = ( d d ) Xedf ( d d ) Y ( d d a )
( 1) = Z

=============================================

Writing RAW text file
--------------------

Reading translated text file
--------------------
tran (16) = 54 a 57 a a d d 58 65 64 66 d d 59 a 5a
(16) = T.W....Xedf..Y.Z
( 2) = T ( a )
( 2) = W ( a )
( 1) = ( a )
(10) = ( d d ) Xedf ( d d ) Y ( a )
( 1) = Z

Reading un-translated text file
--------------------
raw (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T ( d a ) W ( a a d d ) Xedf ( d d ) Y ( d a ) Z
(18) = T..W....Xedf..Y..Z
( 3) = T ( d a )
( 2) = W ( a )
( 1) = ( a )
(11) = ( d d ) Xedf ( d d ) Y ( d a )
( 1) = Z
