regexp: \x0a => \x0d\x0a

  • Thread starter Sébastien Cottalorda

Sébastien Cottalorda

Hi,

In a file, I have \x0a characters and I'd like to replace them with the pair
\x0d\x0a.

How can I do this?

Note: If I already have \x0d\x0a, I don't want to replace the \x0a, of course.

Thanks in advance.

Sébastien
 
Brian McCauley

Sébastien Cottalorda said:
In a file, I have \x0a characters and I'd like to replace them with the pair
\x0d\x0a.

How can I do this?

What happened when you tried the obvious s/// ?

s/\x0a/\x0d\x0a/g;

(If you've not heard of s/// then you need to go back and do some
basic Perl tutorials).

Sébastien Cottalorda said:
Note: If I already have \x0d\x0a, I don't want to replace the \x0a, of course.

You could use a negative look-behind.

s/(?<!\x0d)\x0a/\x0d\x0a/g;
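
A minimal sketch of the look-behind substitution applied to a string with mixed line endings. The sample data and variable names are made up for illustration; only the s/// itself comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical sample: a bare LF, an existing CRLF pair, then another bare LF.
my $text = "one\x0atwo\x0d\x0athree\x0a";

# Insert \x0d only before a \x0a that is not already preceded by \x0d.
(my $fixed = $text) =~ s/(?<!\x0d)\x0a/\x0d\x0a/g;

# Print the result byte by byte so the line endings are visible.
print join(' ', map { sprintf '%02x', ord } split //, $fixed), "\n";
```

When this is applied to a file rather than a string, the filehandles probably need binmode on platforms that translate line endings by default.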

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 
Ben Morrow

Sébastien Cottalorda said:
In a file, I have \x0a characters and I'd like to replace them with the pair
\x0d\x0a.

How can I do this?

Note: If I already have \x0d\x0a, I don't want to replace the \x0a, of course.

If you have 5.8, you can use

perl -Mopen=IN,:raw,OUT,:crlf -pi -e1 <file>

You may need 5.8.1 to avoid double "\r\r\n"s... IIRC this was one of
the bugfixes.
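
For comparison, a rough sketch of the layer approach inside a script, assuming a reasonably recent perl on which :crlf collapses \x0d\x0a to "\n" on read and expands "\n" on write. Note that it uses :crlf on both handles, not :raw on input as in the one-liner, so that pairs already in the file are not expanded a second time. The helper name to_crlf and the sample data are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy $src to $dst so that every line ends in CRLF.
sub to_crlf {
    my ($src, $dst) = @_;
    open my $in,  '<:crlf', $src or die "read $src: $!";
    open my $out, '>:crlf', $dst or die "write $dst: $!";
    print {$out} $_ while <$in>;
    close $out or die "close $dst: $!";
    close $in;
}

# Build a sample file with mixed endings, written without any translation.
my ($fh, $src) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "one\x0atwo\x0d\x0athree\x0a";
close $fh or die "close $src: $!";

my (undef, $dst) = tempfile(UNLINK => 1);
to_crlf($src, $dst);

# Read the result back raw to inspect the bytes actually written.
open my $check, '<:raw', $dst or die "read $dst: $!";
my $result = do { local $/; <$check> };
close $check;

print join(' ', map { sprintf '%02x', ord } split //, $result), "\n";
```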

Ben
 
Alan J. Flavell

Ben Morrow said:
You may need 5.8.1 to avoid double "\r\r\n"s... IIRC this was one of
the bugfixes.

Yes, it's mentioned in the perldelta.

Apropos of which, I suppose I ought at some point to repeat with
5.8.1 the tests that I had reported for 5.8.0 in
http://www.google.com/groups?selm=Pine.LNX.4.53.0308170139110.6451%40lxplus005.cern.ch
and related thread, about apparently broken newlines handling with
utf-16LE

Or could you perhaps throw any light, if you're interested, on what I
was seeing there and the subsequent followup?

I don't see anything clearly mentioned in the perldelta for 5.8.1
about *this* particular issue.

cheers
 
Malcolm Dew-Jones

Sébastien Cottalorda wrote:
: Hi,

: In a file, I have \x0a characters and I'd like to replace them with the pair
: \x0d\x0a.

: How can I do this?

: Note: If I already have \x0d\x0a, I don't want to replace the \x0a, of course.

What would you do with \x0d\x0a\x0a?

In addition to other techniques, you could

s/\x0d\x0a/\x0a/g; # reduce pairs to singles
s/\x0a/\x0d\x0a/g; # expand singles to pairs
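
A two-pass approach, first collapsing every \x0d\x0a pair to \x0a and then expanding every \x0a, can be sketched as follows. It also shows one possible answer for the \x0d\x0a\x0a case: it comes out as two CRLF pairs. The sample string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input containing the awkward \x0d\x0a\x0a sequence.
my $text = "one\x0d\x0a\x0atwo\x0a";

(my $fixed = $text) =~ s/\x0d\x0a/\x0a/g;   # pass 1: collapse CRLF pairs to LF
$fixed =~ s/\x0a/\x0d\x0a/g;                # pass 2: expand every LF to CRLF

# Print the result byte by byte so the line endings are visible.
print join(' ', map { sprintf '%02x', ord } split //, $fixed), "\n";
```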
 
Ben Morrow

Alan J. Flavell said:
Apropos of which, I suppose I ought at some point to repeat with
5.8.1 the tests that I had reported for 5.8.0 in
http://www.google.com/groups?selm=Pine.LNX.4.53.0308170139110.6451%40lxplus005.cern.ch
and related thread, about apparently broken newlines handling with
utf-16LE

Or could you perhaps throw any light, if you're interested, on what I
was seeing there and the subsequent followup?

Right... I've done some testing on this, and I would say it's
definitely a bug... Also that it has nothing to do with utf16le,
specifically; rather that it is a problem with the :crlf layer.

Please excuse the rather long post.

All the tests below have exactly the same results with 5.8.0 and
5.8.2. All tests have been run on i686-linux-thread-multi, but as of
5.8 they ought to give the same results on all platforms, given that
all filehandles are explicitly binmode()d. (I could be wrong: if Win32
systems have :crlf pushed by default then it's *definitely* worth
pushing :raw before you do anything else if you're dealing with utf16)

First, input. This is a modified version of your script/test file from
the above post. The output has been line-wrapped for posting.

% od -x utf16
0000000 feff 004e 004f 0054 0045 0053 0020 0046
        ^^^^ BOM (le)
0000020 004f 0052 4120 0041 0044 0044 0049 0054
                  ^^^^ a char >FF
0000040 0049 004f 004e 0041 004c 0020 0041 00a0
                            a char >7F <FF ^^^^
0000060 0055 004e 0044 002e 000d 000a 000d 000a
            DOSish newlines ^^^^-^^^^
0000100

% cat read
#!/usr/bin/perl

use strict;
use warnings;
use Encode qw/:fallbacks is_utf8 _utf8_on/;
use PerlIO::encoding;

my $bom = "\x{feff}";

# just so we know what's what
$PerlIO::encoding::fallback = FB_PERLQQ;
binmode STDOUT, ":encoding(ascii)";

# the first argument is the list of layers to use
open my $IN, "<$ARGV[0]", "utf16" or die $!;

$\ = "\n"; $, = " ";
$_ = <$IN>;

print "utf8 flag is", is_utf8($_) ? "on" : "off";

# force utf8 flag on if we were given two arguments
$ARGV[1] and _utf8_on($_), print "forcing utf8";

s/^$bom// and print "snipped BOM";
chomp;

# this is a slightly clearer display format
print map {sprintf "%04x", $_} unpack '(U)*', $_;
print;

__END__

% ./read ":encoding(utf16le)"
utf8 flag is on
snipped BOM
004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e 000d
                                     DOSish newline not stripped ^^^^
NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.

% ./read ":encoding(utf16le):crlf"
utf8 flag is off
00ef 00bb 00bf 004e 004f 0054 0045 0053 0020 0046 004f 0052
^^^^-^^^^-^^^^ this is \x{feff} in utf8
00e4 0084 00a0 0041 0044 0044 0049 0054 0049 004f 004e 0041
^^^^-^^^^-^^^^ ditto \x{4120}
004c 0020 0041 00c2 00a0 0055 004e 0044 002e
         DOSish newline is stripped, however ^^
\x{00ef}\x{00bb}\x{00bf}NOTES FOR\x{00e4}\x{0084}\x{00a0}ADDITIONAL
A\x{00c2}\x{00a0}UND.

% ./read ":encoding(utf16le):crlf" 1
utf8 flag is off
forcing utf8
snipped BOM
004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e
NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.

So the problem here is that :crlf fails to set the utf8 flag on the
data when it should. Now, output.

% perl -e'binmode STDOUT, ":encoding(utf16le)";
print "\xa0hello\n\n"' > out
% od -x out
0000000 00a0 0068 0065 006c 006c 006f 000a 000a
0000020

% perl -e'binmode STDOUT, ":crlf:encoding(utf16le)";
print "\xa0hello\n\n"' > out
% od -x out
0000000 00a0 0068 0065 006c 006c 006f 0a0d 0d00
0000020 000a
0000022

This is not actually quite such nonsense as it seems: because 'od -x'
byteswaps everything, the file actually ends '6f 00 0d 0a 00 0d 0a 00',
which is the perfectly reasonable result of treating the binary
UTF16 data as text. So we do the :crlf before the UTF16:

% perl -e'binmode STDOUT, ":encoding(utf16le):crlf";
print "\xa0hello\n\n"' > out
Malformed UTF-8 character (unexpected continuation byte 0xa0, with no
preceding start byte) in null operation.
% od -x out
0000000 0000 0068 0065 006c 006c 006f 000d 000a
0000020 000d 000a
0000024

This last would give the desired result, but seems to have the
converse problem from above: that it is trying to treat as utf8 data
that should be treated as bytes.

Having a look at perlio.c suggests to me (though I can't entirely
follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
fact it should check the state of the layer below and set itself
accordingly. Having a think about the issues involved suggests to me
that Microsoft should *really* have taken the opportunity of changing
to utf16 to ditch using \r\n... but there we go.

I would seriously consider not using :crlf at all, but instead writing
a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
\n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
general. I guess it would probably be slower.
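
The mapping such a layer would perform can be sketched with two plain functions. This is only an illustration of the translation rules on whole strings, not a PerlIO layer; the names nl_in and nl_out are made up. A real layer would also have to cope with a \r that falls at the end of one buffer with its \n at the start of the next.

```perl
use strict;
use warnings;

# Hypothetical helpers. \x0d\x0a is listed first in the alternation so that
# a CRLF pair is matched as one unit rather than as two separate newlines.
sub nl_in {
    my ($s) = @_;
    $s =~ s/\x0d\x0a|\x0d|\x0a/\x0a/g;       # any newline form -> \n
    return $s;
}

sub nl_out {
    my ($s) = @_;
    $s =~ s/\x0d\x0a|\x0d|\x0a/\x0d\x0a/g;   # any newline form -> \r\n
    return $s;
}

# Unix, old Mac and DOS newlines mixed in one string.
my $mixed = "a\x0ab\x0dc\x0d\x0a";
my $read  = nl_in($mixed);
my $wrote = nl_out($mixed);

print join(' ', map { sprintf '%02x', ord } split //, $wrote), "\n";
```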

Ben
 
John W. Krahn

Sébastien Cottalorda said:
In a file, I have \x0a characters and I'd like to replace them with the pair
\x0d\x0a.

How can I do this?

Note: If I already have \x0d\x0a, I don't want to replace the \x0a, of course.

perl -i~ -lpe'BEGIN{$/=$\="\x0d\x0a"}s/(?=\x0a)/\x0d/g' yourfile



John
 
Alan J. Flavell

Ben Morrow said:
Please excuse the rather long post.

Speaking for myself (and who else is going to do that if I don't? ;-)
I'm extremely grateful to have your input on this, as I had been
beginning to think I was doing something seriously wrong with the
layers. Anyway, some technical detail makes a pleasant change from
the interminable arguments from crabby newbies who want to impose
their TOFU-posting and FAQ-ignorant demands around here.

Incidentally I've found that for utf-8 data the "od -t x1" format is
handy, rather than "od -x".
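
The difference between the two views can be reproduced in Perl. This sketch assumes a little-endian machine for the "od -x" side (hence the 'v' template) and uses the eight bytes quoted from the earlier post as sample data.

```perl
use strict;
use warnings;

# The bytes the file ends with, according to the earlier post.
my $bytes = pack 'H*', '6f000d0a000d0a00';

# "od -x" style: 16-bit words, read little-endian, so byte pairs look swapped.
my $od_x  = join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;

# "od -t x1" style: one byte at a time, in file order.
my $od_x1 = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

print "$od_x\n";    # 006f 0a0d 0d00 000a
print "$od_x1\n";   # 6f 00 0d 0a 00 0d 0a 00
```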

This is only a partial response. I'll be looking at this some more
yet. (Just for interest's sake, actually. I don't actually play
with the Microsoft train sim myself[1], which is what lay behind the
originally posted problem.)

Ben Morrow said:
So the problem here is that :crlf fails to set the utf8 flag on the
data when it should.

Aha, looks like a key observation...

Ben Morrow said:
This is not actually quite such nonsense as it seems: because 'od -x'
byteswaps everything,

(that's why I recommend od -t x1 instead...)

Ben Morrow said:
the file actually ends '6f 00 0d 0a 00 0d 0a 00',
which is the perfectly reasonable result of treating the binary
UTF16 data as text.

Good point.

Ben Morrow said:
Having a look at perlio.c suggests to me (though I can't entirely
follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
fact it should check the state of the layer below and set itself
accordingly.

Sounds right to me. Is one of us expected to call this in as a bug,
or do we have developers lurking who would be willing to take this on?

Ben Morrow said:
Having a think about the issues involved suggests to me
that Microsoft should *really* have taken the opportunity of changing
to utf16 to ditch using \r\n... but there we go.

I like that idea, but as you say, it's a bit late for them to do that
now.

Ben Morrow said:
I would seriously consider not using :crlf at all, but instead writing
a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
\n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
general. I guess it would probably be slower.

If it was part of the infrastructure, I doubt that the difference in
speed would be noticeable.

Whenever this topic comes up, there's usually someone who offers
anomalous data and asks what we'd do with it (mixed unix/mac/dos
newlines...), but that's just as much a problem for :crlf as it would
be for your hypothetical :nl, so I don't see it as a show-stopper.

thanks for the observations, anyway. In fact you're clearly ahead of
me. all the best


[1] I will admit to playing with BVE, http://mackoy.cool.ne.jp/
but that's entirely off-topic here!
 
Ben Morrow

Alan J. Flavell said:
Incidentally I've found that for utf-8 data the "od -t x1" format is
handy, rather than "od -x".

Yes, I found that too. -x is good for little-endian stuff, though.
http://morrow.me.uk/PerlIO-nline-0.01.tar.gz

Alan J. Flavell said:
If it was part of the infrastructure, I doubt that the difference in
speed would be noticeable.

#!/usr/bin/perl

use Benchmark qw/cmpthese/;
use Fcntl qw/:seek/;

my $teststr = "a\cJb\cMc\cM\cJ";
$/ = undef;

print "Writing mixed:\n";

{
    open my $CRLF,  ">:crlf",  "one";
    open my $NLINE, ">:nline", "two";

    select((select($CRLF ),$|=1)[0]);
    select((select($NLINE),$|=1)[0]);

    cmpthese -5, { crlf  => sub { print $CRLF  $teststr },
                   nline => sub { print $NLINE $teststr }
                 };
}

print "Writing just \\n:\n";

{
    open my $CRLF,  ">:crlf",  "one";
    open my $NLINE, ">:nline", "two";

    select((select($CRLF ),$|=1)[0]);
    select((select($NLINE),$|=1)[0]);

    cmpthese -5, { crlf  => sub { print $CRLF  "a\n" },
                   nline => sub { print $NLINE "a\n" }
                 };
}

{
    open my $RAW, ">:raw", "three";
    print $RAW $teststr;
}

print "Reading:\n";

{
    open my $CRLF,  "<:crlf",  "three";
    open my $NLINE, "<:nline", "three";

    cmpthese -5, { crlf  => sub { <$CRLF>;  seek $CRLF,  0, SEEK_SET },
                   nline => sub { <$NLINE>; seek $NLINE, 0, SEEK_SET }
                 };
}

__END__

Writing mixed:
          Rate nline  crlf
nline 190612/s    --  -23%
crlf  247892/s   30%    --
Writing just \n:
          Rate nline  crlf
nline 229302/s    --   -9%
crlf  252560/s   10%    --
Reading:
         Rate  crlf nline
crlf  58405/s    --   -0%
nline 58519/s    0%    --


Hmmm... not that bad, I suppose, 'specially if you don't use the extra
flexibility.

Alan J. Flavell said:
Whenever this topic comes up, there's usually someone who offers
anomalous data and asks what we'd do with it (mixed unix/mac/dos
newlines...), but that's just as much a problem for :crlf as it would
be for your hypothetical :nl, so I don't see it as a show-stopper.

I'm /pretty/ sure this layer does the Right Thing in all situations.

Ben
 
Alan J. Flavell

Ben Morrow said:
Yes, I found that too. -x is good for little-endian stuff, though.

Agreed.

Thanks for the interesting posting! Just to make my meaning clear, I
meant "I doubt that the difference would be noticeable within the
scope of a realistic application". The benchmarking is interesting,
all the same.

Your approach is clearly more versatile. But the :crlf layer ought to
do what it says on the tin, shouldn't it? - and from the previous
discussion, it rather looks as if it isn't doing. Or else I was using
it wrong, but I tried several interpretations - and all the others
seemed to be even worse.

cheers
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to Alan J. Flavell]

Alan J. Flavell said:
Your approach is clearly more versatile. But the :crlf layer ought to
do what it says on the tin, shouldn't it? - and from the previous
discussion, it rather looks as if it isn't doing. Or else I was using
it wrong, but I tried several interpretations - and all the others
seemed to be even worse.

Given that the layers architecture is absolutely broken (especially
:crlf stuff), I do not see any reason why anything using layers should
do any particular thing...

Hope this helps,
Ilya
 
Ben Morrow

Ilya Zakharevich said:
Given that the layers architecture is absolutely broken (especially
:crlf stuff), I do not see any reason why anything using layers should
do any particular thing...

I was wondering about saying something along these lines but decided
it probably wasn't my place to... the idea is a good one, but I would
say it needs a fairly fundamental re-working. *Especially* :crlf.

Ben
 
Alan J. Flavell

Ben Morrow said:
I was wondering about saying something along these lines but decided
it probably wasn't my place to... the idea is a good one, but I would
say it needs a fairly fundamental re-working. *Especially* :crlf.

Well, those comments brought me to earth with a bit of a bump. Have I
been blundering around with my eyes shut? It seems so, but all of the
simpler things I've been using the utf8 stuff for have worked fine:
and I've been recommending it to others in good faith, and have had
quite a number of positive responses.

It was specifically this utf-16LE with crlf incident that had proven
to be a problem. Ho hum, back to the drawing board.
 
Ben Morrow

Alan J. Flavell said:
Well, those comments brought me to earth with a bit of a bump. Have I
been blundering around with my eyes shut? It seems so, but all of the
simpler things I've been using the utf8 stuff for have worked fine:
and I've been recommending it to others in good faith, and have had
quite a number of positive responses.

Simple things like pushing :utf8 or :encoding(*)[1] onto a filehandle
work fine. Anything more complicated than that gets tricky,
particularly with :crlf since it (like :utf8, but much more so) isn't
really a layer at all but instead searches down the stack 'till it
finds a layer that declares it can do CR:LF translation and tells it
to start... hence my preference for a straightforward layer.

For instance, one thing that I would want to be able to do is, without
losing the contents of any buffers, change a filehandle from being its
default of :unix:perlio to :stdio so I could pass a FILE* to some
library that wanted one, and then change it back afterwards. This
is... very dodgy at the moment. If you have 5.8.2 try some more
complicated pushings and poppings and see what PerlIO::get_layers says
ends up actually there. Or have a poke around in perlio.c :).
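
A small sketch of that kind of experiment, assuming a Unix-like perl where :crlf is not part of the default stack. It pushes and pops :crlf on a temporary file and prints what PerlIO::get_layers reports at each step. The exact layer names printed depend on platform and build, so no output is shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch filehandle to experiment on (removed at exit).
my ($fh, $name) = tempfile(UNLINK => 1);

my @before = PerlIO::get_layers($fh);   # the platform's default stack
binmode $fh, ':crlf' or die "push :crlf: $!";
my @pushed = PerlIO::get_layers($fh);   # after pushing :crlf
binmode $fh, ':pop'  or die "pop: $!";
my @popped = PerlIO::get_layers($fh);   # after popping the top-most layer

print "default: @before\n";
print "pushed:  @pushed\n";
print "popped:  @popped\n";
```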

[1] As a separate issue, I would *always* be inclined to push
:encoding(utf8) rather than :utf8, despite the probable performance
hit, because :utf8 doesn't actually check the data is valid utf8: it
just marks it as such and passes it along. :encoding not only checks,
it also gives you some chance of decent fallback.
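
The checking side of this can be illustrated with Encode directly, outside of any layer. A minimal sketch: is_valid_utf8 is a made-up helper that uses Encode::decode with FB_CROAK to reject malformed input; the sample byte strings are also made up.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $chars = eval {
        # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
        # decode from modifying $bytes in place.
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $chars ? 1 : 0;
}

my $good = "na\xc3\xafve";   # UTF-8 bytes for "naive" with a diaeresis
my $bad  = "na\xefve";       # the same word in Latin-1: not valid UTF-8

printf "%s: %s\n", 'good', is_valid_utf8($good) ? 'valid' : 'rejected';
printf "%s: %s\n", 'bad',  is_valid_utf8($bad)  ? 'valid' : 'rejected';
```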

Ben
 
