regexp: \x0a => \x0d\x0a

Discussion in 'Perl Misc' started by Sébastien Cottalorda, Nov 27, 2003.

  1. Hi,

    In a file, I have \x0a characters and I'd like to replace them by the couple
    \x0d\x0a

    How can I do ?

    Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.

    Thanks in advance.

    Sébastien
    --
    [ remove NOSPAM to reply directly ]
     
    Sébastien Cottalorda, Nov 27, 2003
    #1

  2. Sébastien Cottalorda <> writes:

    > In a file, I have \x0a characters and I'd like to replace them by the couple
    > \x0d\x0a
    >
    > How can I do ?


    What happened when you tried the obvious s/// ?

    s/\x0a/\x0d\x0a/g;

    (If you've not heard of s/// then you need to go back and do some
    basic Perl tutorials).

    > Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.


    You could use a negative look-behind.

    s/(?<!\x0d)\x0a/\x0d\x0a/g;
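For anyone who wants to try the look-behind version on known data, here is a minimal sketch. The sub name `lf_to_crlf` and the sample string are made up for illustration; only the substitution itself comes from the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert \x0d before every \x0a that is not already preceded by one.
sub lf_to_crlf {
    my ($text) = @_;
    $text =~ s/(?<!\x0d)\x0a/\x0d\x0a/g;
    return $text;
}

# Sample with one bare LF, one existing CRLF, and a trailing bare LF.
my $mixed = "one\x0atwo\x0d\x0athree\x0a";
my $fixed = lf_to_crlf($mixed);

# Print the result as hex bytes so the line endings are visible.
print join(' ', unpack '(H2)*', $fixed), "\n";
```

Because an existing \x0d\x0a never matches the look-behind, running the substitution a second time should leave the text unchanged.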

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
     
    Brian McCauley, Nov 27, 2003
    #2

  3. Sébastien Cottalorda <> wrote:
    > In a file, I have \x0a characters and I'd like to replace them by the couple
    > \x0d\x0a
    >
    > How can I do ?
    >
    > Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.


    If you have 5.8, you can use

    perl -Mopen=IN,:raw,OUT,:crlf -pi -e1 <file>

    You may need 5.8.1 to avoid double "\r\r\n"s... IIRC this was one of
    the bugfixes.
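A script-form sketch of the layer approach, for comparison. Note one deliberate difference from the one-liner above: this version reads through :crlf as well, on the assumption that the input may already contain some \x0d\x0a pairs that must not be doubled. The sub name `copy_as_crlf` and the temporary files are illustrative only, and the sketch assumes a Perl whose :crlf layer behaves as documented (CR LF to "\n" on read, "\n" to CR LF on write).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Copy $in_path to $out_path, reading and writing through :crlf.
# On input :crlf turns each CR LF pair into "\n" and leaves a bare LF alone;
# on output it writes every "\n" as CR LF.
sub copy_as_crlf {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:crlf', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:crlf', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Demonstration on a temporary file with mixed line endings.
my ($src_fh, $src) = tempfile(UNLINK => 1);
binmode $src_fh;                      # write the sample bytes untouched
print {$src_fh} "one\x0atwo\x0d\x0athree\x0a";
close $src_fh or die "Cannot close $src: $!";

my (undef, $dst) = tempfile(UNLINK => 1);
copy_as_crlf($src, $dst);

# Read the result back raw and show it as hex bytes.
open my $check, '<:raw', $dst or die "Cannot read $dst: $!";
my $result = do { local $/; <$check> };
close $check;
print join(' ', unpack '(H2)*', $result), "\n";
```

A lone \x0d (old Mac style) is passed through untouched by this sketch.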

    Ben

    --
    . | .
    \ / The clueometer is reading zero.
    . .
    __ <-----@ __
     
    Ben Morrow, Nov 27, 2003
    #3
  4. On Thu, 27 Nov 2003, Ben Morrow wrote:

    > You may need 5.8.1 to avoid double "\r\r\n"s... IIRC this was one of
    > the bugfixes.


    Yes, it's mentioned in the perldelta

    Apropos of which, I suppose I ought at some point to repeat with
    5.8.1 the tests that I had reported for 5.8.0 in
    http://www.google.com/groups?selm=
    (message )
    and related thread, about apparently broken newlines handling with
    utf-16LE

    Or could you perhaps throw any light, if you're interested, on what I
    was seeing there and the subsequent followup?

    I don't see anything clearly mentioned in the perldelta for 5.8.1
    about *this* particular issue.

    cheers
     
    Alan J. Flavell, Nov 27, 2003
    #4
  5. Sébastien Cottalorda () wrote:
    : Hi,

    : In a file, I have \x0a characters and I'd like to replace them by the couple
    : \x0d\x0a

    : How can I do ?

    : Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.

    What would you do with \x0d\x0a\x0a?

    In addition to other techniques, you could

    s/\x0d\x0a/\x0a/g; # reduce pairs to singles
    s/\x0a/\x0d\x0a/g; # expand singles to pairs
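A two-pass normalisation of this kind (first collapse every \x0d\x0a pair to a single \x0a, then expand every \x0a to \x0d\x0a) can be sketched as below. The sub name `normalise_two_pass` is made up for the example. It also gives one possible answer to the \x0d\x0a\x0a question: that input comes out as two CRLF pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two passes: collapse CRLF pairs to LF, then expand every LF to CRLF.
sub normalise_two_pass {
    my ($text) = @_;
    $text =~ s/\x0d\x0a/\x0a/g;    # reduce pairs to singles
    $text =~ s/\x0a/\x0d\x0a/g;    # expand singles to pairs
    return $text;
}

# The awkward case: an existing pair followed by a bare LF.
my $out = normalise_two_pass("a\x0d\x0a\x0ab");
print join(' ', unpack '(H2)*', $out), "\n";
```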
     
    Malcolm Dew-Jones, Nov 27, 2003
    #5
  6. "Alan J. Flavell" <> wrote:
    > Apropos of which, I suppose I ought at some point to repeat with
    > 5.8.1 the tests that I had reported for 5.8.0 in
    > http://www.google.com/groups?selm=Pine.LNX.4.53.0308170139110.
    > 6451%40lxplus005.cern.ch
    > (message )
    > and related thread, about apparently broken newlines handling with
    > utf-16LE
    >
    > Or could you perhaps throw any light, if you're interested, on what I
    > was seeing there and the subsequent followup?


    Right... I've done some testing on this, and I would say it's
    definitely a bug... Also that it has nothing to do with utf16le,
    specifically; rather that it is a problem with the :crlf layer.

    Please excuse the rather long post.

    All the tests below have exactly the same results with 5.8.0 and
    5.8.2. All tests have been run on i686-linux-thread-multi, but as of
    5.8 they ought to give the same results on all platforms, given that
    all filehandles are explicitly binmode()d. (I could be wrong: if Win32
    systems have :crlf pushed by default then it's *definitely* worth
    pushing :raw before you do anything else if you're dealing with utf16)

    First, input. This is a modified version of your script/test file from
    the above post. The output has been line-wrapped for posting.

    % od -x utf16
    0000000 feff 004e 004f 0054 0045 0053 0020 0046
            ^^^^ BOM (le)
    0000020 004f 0052 4120 0041 0044 0044 0049 0054
                      ^^^^ a char >FF
    0000040 0049 004f 004e 0041 004c 0020 0041 00a0
                                a char >7F <FF ^^^^
    0000060 0055 004e 0044 002e 000d 000a 000d 000a
                          DOSish newlines ^^^^-^^^^
    0000100

    % cat read
    #!/usr/bin/perl

    use strict;
    use warnings;
    use Encode qw/:fallbacks is_utf8 _utf8_on/;
    use PerlIO::encoding;

    my $bom = "\x{feff}";

    # just so we know what's what
    $PerlIO::encoding::fallback = FB_PERLQQ;
    binmode STDOUT, ":encoding(ascii)";

    # the first argument is the list of layers to use
    open my $IN, "<$ARGV[0]", "utf16" or die $!;

    $\ = "\n"; $, = " ";
    $_ = <$IN>;

    print "utf8 flag is", is_utf8($_) ? "on" : "off";

    # force utf8 flag on if we were given two arguments
    $ARGV[1] and _utf8_on($_), print "forcing utf8";

    s/^$bom// and print "snipped BOM";
    chomp;

    # this is a slightly clearer display format
    print map {sprintf "%04x", $_} unpack '(U)*', $_;
    print;

    __END__

    % ./read ":encoding(utf16le)"
    utf8 flag is on
    snipped BOM
    004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
    0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e 000d
                                         DOSish newline not stripped ^^^^
    NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.

    % ./read ":encoding(utf16le):crlf"
    utf8 flag is off
    00ef 00bb 00bf 004e 004f 0054 0045 0053 0020 0046 004f 0052
    ^^^^-^^^^-^^^^ this is \x{feff} in utf8
    00e4 0084 00a0 0041 0044 0044 0049 0054 0049 004f 004e 0041
    ^^^^-^^^^-^^^^ ditto \x{4120}
    004c 0020 0041 00c2 00a0 0055 004e 0044 002e
             DOSish newline is stripped, however ^^
    \x{00ef}\x{00bb}\x{00bf}NOTES FOR\x{00e4}\x{0084}\x{00a0}ADDITIONAL
    A\x{00c2}\x{00a0}UND.

    % ./read ":encoding(utf16le):crlf" 1
    utf8 flag is off
    forcing utf8
    snipped BOM
    004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
    0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e
    NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.

    So the problem here is that :crlf fails to set the utf8 flag on the
    data when it should. Now, output.

    % perl -e'binmode STDOUT, ":encoding(utf16le)";
    print "\xa0hello\n\n"' > out
    % od -x out
    0000000 00a0 0068 0065 006c 006c 006f 000a 000a
    0000020

    % perl -e'binmode STDOUT, ":crlf:encoding(utf16le)";
    print "\xa0hello\n\n"' > out
    % od -x out
    0000000 00a0 0068 0065 006c 006c 006f 0a0d 0d00
    0000020 000a
    0000022

    This is not actually quite such nonsense as it seems: because 'od -x'
    byteswaps everything, the file actually ends '6f 00 0d 0a 00 0d 0a 00',
    which is the perfectly reasonable result of treating the binary
    UTF16 data as text. So we do the :crlf before the UTF16:

    % perl -e'binmode STDOUT, ":encoding(utf16le):crlf";
    print "\xa0hello\n\n"' > out
    Malformed UTF-8 character (unexpected continuation byte 0xa0, with no
    preceding start byte) in null operation.
    % od -x out
    0000000 0000 0068 0065 006c 006c 006f 000d 000a
    0000020 000d 000a
    0000024

    This last would give the desired result, but seems to have the
    converse problem from above: that it is trying to treat as utf8 data
    that should be treated as bytes.
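One way to get those bytes without stacking :crlf on an encoding layer at all is to do both steps in user code: translate the newlines while the data is still characters, then encode to UTF-16LE and write the bytes to a handle opened with ">:raw". This is only a sketch of that workaround, not a fix for the layer; the sub name `utf16le_crlf_bytes` is made up, and `encode` is the ordinary function from the Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Translate newlines on the character string, then turn it into bytes.
sub utf16le_crlf_bytes {
    my ($text) = @_;
    $text =~ s/\x0a/\x0d\x0a/g;            # LF to CR LF, still characters
    return encode('UTF-16LE', $text);      # characters to UTF-16LE bytes
}

# Same test string as in the one-liners above, shown byte by byte.
my $bytes = utf16le_crlf_bytes("\xa0hello\n\n");
print join(' ', unpack '(H2)*', $bytes), "\n";
```

The output is in file order, one byte per column, so each character shows up as its low byte followed by 00.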

    Having a look at perlio.c suggests to me (though I can't entirely
    follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
    fact it should check the state of the layer below and set itself
    accordingly. Having a think about the issues involved suggests to me
    that Microsoft should *really* have taken the opportunity of changing
    to utf16 to ditch using \r\n... but there we go.

    I would seriously consider not using :crlf at all, but instead writing
    a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
    \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
    general. I guess it would probably be slower.
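The mapping described for such a layer can be tried out on plain strings before writing any PerlIO code. The sketch below only illustrates the mapping itself; `newlines_in` and `newlines_out` are hypothetical helper names, and nothing here is an actual layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Input direction: any of CR LF, lone CR, or lone LF becomes LF.
sub newlines_in {
    my ($text) = @_;
    $text =~ s/\x0d\x0a?|\x0a/\x0a/g;
    return $text;
}

# Output direction: any of CR LF, lone CR, or lone LF becomes CR LF.
sub newlines_out {
    my ($text) = @_;
    $text =~ s/\x0d\x0a?|\x0a/\x0d\x0a/g;
    return $text;
}

# One of each convention: LF after "a", CR after "b", CR LF after "c".
my $mixed = "a\x0ab\x0dc\x0d\x0a";
print join(' ', unpack '(H2)*', newlines_in($mixed)),  "\n";
print join(' ', unpack '(H2)*', newlines_out($mixed)), "\n";
```

The optional \x0a after \x0d is what makes a CR LF pair count as one line ending rather than two.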

    Ben

    --
    Musica Dei donum optimi, trahit homines, trahit deos. |
    Musica truces mollit animos, tristesque mentes erigit. |
    Musica vel ipsas arbores et horridas movet feras. |
     
    Ben Morrow, Nov 28, 2003
    #6
  7. Sébastien Cottalorda wrote:
    >
    > In a file, I have \x0a characters and I'd like to replace them by the couple
    > \x0d\x0a
    >
    > How can I do ?
    >
    > Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.


    perl -i~ -lpe'BEGIN{$/=$\="\x0d\x0a"}s/(?=\x0a)/\x0d/g' yourfile



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Nov 28, 2003
    #7
  8. Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    On Fri, 28 Nov 2003, Ben Morrow wrote:

    > Please excuse the rather long post.


    Speaking for myself (and who else is going to do that if I don't? ;-)
    I'm extremely grateful to have your input on this, as I had been
    beginning to think I was doing something seriously wrong with the
    layers. Anyway, some technical detail makes a pleasant change from
    the interminable arguments from crabby newbies who want to impose
    their TOFU-posting and FAQ-ignorant demands around here.

    Incidentally I've found that for utf-8 data the "od -t x1" format is
    handy, rather than "od -x".
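The same two views can be produced from Perl itself, which is handy when od is not available. A small sketch, with made-up helper names `hex_bytes` and `hex_words_le`: the first prints bytes in file order (like od -t x1), the second reads them as 16-bit little-endian words (what od -x shows on a little-endian machine).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One column per byte, in file order.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

# One column per 16-bit word, read as little-endian.
sub hex_words_le {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;
}

# UTF-16LE BOM, "N", CR, LF as raw bytes.
my $sample = "\xff\xfe\x4e\x00\x0d\x00\x0a\x00";
print hex_bytes($sample),    "\n";
print hex_words_le($sample), "\n";
```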

    This is only a partial response. I'll be looking at this some more
    yet. (Just for interest's sake, actually. I don't actually play
    with the Microsoft train sim myself[1], which is what lay behind the
    originally posted problem.)

    > So the problem here is that :crlf fails to set the utf8 flag on the
    > data when it should.


    Aha, looks like a key observation...

    > This is not actually quite such nonsense as it seems: because 'od -x'
    > byteswaps everything,


    (that's why I recommend od -t x1 instead...)

    > the file actually ends '6f 00 0d 0a 00 0d 0a 00',
    > which is the perfectly reasonable result of treating the binary
    > UTF16 data as text.


    Good point.

    > Having a look at perlio.c suggests to me (though I can't entirely
    > follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
    > fact it should check the state of the layer below and set itself
    > accordingly.


    Sounds right to me. Is one of us expected to call this in as a bug,
    or do we have developers lurking who would be willing to take this on?

    > Having a think about the issues involved suggests to me
    > that Microsoft should *really* have taken the opportunity of changing
    > to utf16 to ditch using \r\n... but there we go.


    I like that idea, but as you say, it's a bit late for them to do that
    now.

    > I would seriously consider not using :crlf at all, but instead writing
    > a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
    > \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
    > general. I guess it would probably be slower.


    If it was part of the infrastructure, I doubt that the difference in
    speed would be noticeable.

    Whenever this topic comes up, there's usually someone who offers
    anomalous data and asks what we'd do with it (mixed unix/mac/dos
    newlines...), but that's just as much a problem for :crlf as it would
    be for your hypothetical :nl, so I don't see it as a show-stopper.

    thanks for the observations, anyway. In fact you're clearly ahead of
    me. all the best


    [1] I will admit to playing with BVE, http://mackoy.cool.ne.jp/
    but that's entirely off-topic here!
     
    Alan J. Flavell, Nov 28, 2003
    #8
  9. Re: Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    "Alan J. Flavell" <> wrote:
    > Incidentally I've found that for utf-8 data the "od -t x1" format is
    > handy, rather than "od -x".


    Yes, I found that too. -x is good for little-endian stuff, though.

    > > I would seriously consider not using :crlf at all, but instead writing
    > > a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
    > > \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
    > > general. I guess it would probably be slower.


    http://morrow.me.uk/PerlIO-nline-0.01.tar.gz

    > If it was part of the infrastructure, I doubt that the difference in
    > speed would be noticeable.


    #!/usr/bin/perl

    use Benchmark qw/cmpthese/;
    use Fcntl qw/:seek/;

    my $teststr = "a\cJb\cMc\cM\cJ";
    $/ = undef;

    print "Writing mixed:\n";

    {
        open my $CRLF,  ">:crlf",  "one";
        open my $NLINE, ">:nline", "two";

        select((select($CRLF ),$|=1)[0]);
        select((select($NLINE),$|=1)[0]);

        cmpthese -5, { crlf  => sub { print $CRLF  $teststr },
                       nline => sub { print $NLINE $teststr }
                     };
    }

    print "Writing just \\n:\n";

    {
        open my $CRLF,  ">:crlf",  "one";
        open my $NLINE, ">:nline", "two";

        select((select($CRLF ),$|=1)[0]);
        select((select($NLINE),$|=1)[0]);

        cmpthese -5, { crlf  => sub { print $CRLF  "a\n" },
                       nline => sub { print $NLINE "a\n" }
                     };
    }

    {
        open my $RAW, ">:raw", "three";
        print $RAW $teststr;
    }

    print "Reading:\n";

    {
        open my $CRLF,  "<:crlf",  "three";
        open my $NLINE, "<:nline", "three";

        cmpthese -5, { crlf  => sub { <$CRLF>;  seek $CRLF,  0, SEEK_SET },
                       nline => sub { <$NLINE>; seek $NLINE, 0, SEEK_SET }
                     };
    }

    __END__

    Writing mixed:
    Rate nline crlf
    nline 190612/s -- -23%
    crlf 247892/s 30% --
    Writing just \n:
    Rate nline crlf
    nline 229302/s -- -9%
    crlf 252560/s 10% --
    Reading:
    Rate crlf nline
    crlf 58405/s -- -0%
    nline 58519/s 0% --


    Hmmm... not that bad, I suppose, 'specially if you don't use the extra
    flexibility.

    > Whenever this topic comes up, there's usually someone who offers
    > anomalous data and asks what we'd do with it (mixed unix/mac/dos
    > newlines...), but that's just as much a problem for :crlf as it would
    > be for your hypothetical :nl, so I don't see it as a show-stopper.


    I'm /pretty/ sure this layer does the Right Thing in all situations.

    Ben

    --
    For the last month, a large number of PSNs in the Arpa[Inter-]net have been
    reporting symptoms of congestion ... These reports have been accompanied by an
    increasing number of user complaints ... As of June,... the Arpanet contained
    47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt] *
     
    Ben Morrow, Nov 29, 2003
    #9
  10. Re: Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    On Sat, 29 Nov 2003, Ben Morrow wrote:

    > "Alan J. Flavell" <> wrote:
    > > Incidentally I've found that for utf-8 data the "od -t x1" format is
    > > handy, rather than "od -x".

    >
    > Yes, I found that too. -x is good for little-endian stuff, though.


    Agreed.

    > > > I would seriously consider not using :crlf at all, but instead writing
    > > > a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
    > > > \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
    > > > general. I guess it would probably be slower.

    >
    > http://morrow.me.uk/PerlIO-nline-0.01.tar.gz
    >
    > > If it was part of the infrastructure, I doubt that the difference in
    > > speed would be noticeable.


    Thanks for the interesting posting! Just to make my meaning clear, I
    meant "I doubt that the difference would be noticeable within the
    scope of a realistic application". The benchmarking is interesting,
    all the same.

    Your approach is clearly more versatile. But the :crlf layer ought to
    do what it says on the tin, shouldn't it? - and from the previous
    discussion, it rather looks as if it isn't doing. Or else I was using
    it wrong, but I tried several interpretations - and all the others
    seemed to be even worse.

    cheers
     
    Alan J. Flavell, Nov 30, 2003
    #10
  11. Re: Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    [A complimentary Cc of this posting was sent to
    Alan J. Flavell
    <>], who wrote in article <>:

    > Your approach is clearly more versatile. But the :crlf layer ought to
    > do what it says on the tin, shouldn't it? - and from the previous
    > discussion, it rather looks as if it isn't doing. Or else I was using
    > it wrong, but I tried several interpretations - and all the others
    > seemed to be even worse.


    Given that the layers architecture is absolutely broken (especially
    :crlf stuff), I do not see any reason why anything using layers should
    do any particular thing...

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Dec 1, 2003
    #11
  12. Re: Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    Ilya Zakharevich <> wrote:
    > Given that the layers architecture is absolutely broken (especially
    > :crlf stuff), I do not see any reason why anything using layers should
    > do any particular thing...


    I was wondering about saying something along these lines but decided
    it probably wasn't my place to... the idea is a good one, but I would
    say it needs a fairly fundamental re-working. *Especially* :crlf.

    Ben

    --
    Musica Dei donum optimi, trahit homines, trahit deos. |
    Musica truces mollit animos, tristesque mentes erigit. |
    Musica vel ipsas arbores et horridas movet feras. |
     
    Ben Morrow, Dec 1, 2003
    #12
  13. Re: Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    On Mon, 1 Dec 2003, Ben Morrow wrote:

    > Ilya Zakharevich <> wrote:
    > > Given that the layers architecture is absolutely broken (especially
    > > :crlf stuff), I do not see any reason why anything using layers should
    > > do any particular thing...

    >
    > I was wondering about saying something along these lines but decided
    > it probably wasn't my place to... the idea is a good one, but I would
    > say it needs a fairly fundamental re-working. *Especially* :crlf.


    Well, those comments brought me to earth with a bit of a bump. Have I
    been blundering around with my eyes shut? It seems so, but all of the
    simpler things I've been using the utf8 stuff for have worked fine:
    and I've been recommending it to others in good faith, and have had
    quite a number of positive responses.

    It was specifically this utf-16LE with crlf incident that had proven
    to be a problem. Ho hum, back to the drawing board.
     
    Alan J. Flavell, Dec 1, 2003
    #13
  14. Re: Perl unicode and :crlf, was Re: regexp: \x0a => \x0d\x0a

    "Alan J. Flavell" <> wrote:
    > On Mon, 1 Dec 2003, Ben Morrow wrote:
    >
    > > Ilya Zakharevich <> wrote:
    > > > Given that the layers architecture is absolutely broken (especially
    > > > :crlf stuff), I do not see any reason why anything using layers should
    > > > do any particular thing...

    > >
    > > I was wondering about saying something along these lines but decided
    > > it probably wasn't my place to... the idea is a good one, but I would
    > > say it needs a fairly fundamental re-working. *Especially* :crlf.

    >
    > Well, those comments brought me to earth with a bit of a bump. Have I
    > been blundering around with my eyes shut? It seems so, but all of the
    > simpler things I've been using the utf8 stuff for have worked fine:
    > and I've been recommending it to others in good faith, and have had
    > quite a number of positive responses.


    Simple things like pushing :utf8 or :encoding(*)[1] onto a filehandle
    work fine. Anything more complicated than that gets tricky,
    particularly with :crlf since it (like :utf8, but much more so) isn't
    really a layer at all but instead searches down the stack 'till it
    finds a layer that declares it can do CR:LF translation and tells it
    to start... hence my preference for a straightforward layer.

    For instance, one thing that I would want to be able to do is, without
    losing the contents of any buffers, change a filehandle from being its
    default of :unix:perlio to :stdio so I could pass a FILE* to some
    library that wanted one, and then change it back afterwards. This
    is... very dodgy at the moment. If you have 5.8.2 try some more
    complicated pushings and poppings and see what PerlIO::get_layers says
    ends up actually there. Or have a poke around in perlio.c :).
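A starting point for that kind of poking around might look like the sketch below. It opens a scratch file with a few different mode strings and prints what PerlIO::get_layers reports for each. The sub name `layers_for_mode` and the choice of modes are illustrative; the actual layer lists depend on platform and Perl version, so no particular output is claimed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Open a scratch file with the given mode string and return the layer
# names that PerlIO::get_layers reports for the resulting handle.
sub layers_for_mode {
    my ($mode) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh or die "Cannot close $path: $!";
    open my $fh, $mode, $path or die "Cannot open $path with $mode: $!";
    my @layers = PerlIO::get_layers($fh);
    close $fh;
    return @layers;
}

# Compare the default stack with a few explicit ones.
for my $mode ('<', '<:crlf', '<:raw', '<:encoding(UTF-16LE)') {
    printf "%-24s %s\n", $mode, join(' ', layers_for_mode($mode));
}
```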

    [1] As a separate issue, I would *always* be inclined to push
    :encoding(utf8) rather than :utf8, despite the probable performance
    hit, because :utf8 doesn't actually check the data is valid utf8: it
    just marks it as such and passes it along. :encoding not only checks,
    it also gives you some chance of decent fallback.
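The kind of check meant here can be seen in isolation with the Encode module, outside any layer. This is a sketch only: `is_strict_utf8` is a made-up helper, and it uses `decode` with `FB_CROAK` so that malformed input raises an exception instead of being passed along.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Return 1 if $bytes decodes cleanly as UTF-8, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;                  # private copy; decode may modify it
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# A valid two-byte sequence, then a lone continuation byte like the
# one in the "Malformed UTF-8 character" warning earlier in the thread.
for my $sample ("caf\xc3\xa9", "\xa0hello") {
    printf "%-24s %s\n",
        join(' ', unpack '(H2)*', $sample),
        is_strict_utf8($sample) ? 'valid' : 'invalid';
}
```

Swapping `FB_CROAK` for one of the other fallback constants (for example `FB_PERLQQ`, as in the read script earlier) is how you would get a substitution rather than a failure.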

    Ben

    --
    If you put all the prophets, | You'd have so much more reason
    Mystics and saints | Than ever was born
    In one room together, | Out of all of the conflicts of time.
    |----------------+---------------| The Levellers, 'Believers'
     
    Ben Morrow, Dec 2, 2003
    #14