tr/// broken?

Discussion in 'Perl Misc' started by Ilya Zakharevich, Apr 11, 2006.

  1. I'm trying to use tr/// operator (instead of RExen), and do not think
    it works... The simplified example is

    >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

    UTF-16 surrogate 0xdfff at -e line 1.
    Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
    abcdefg

    The original code contained something like

    perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
    tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
    Unicode character 0x1fffff is illegal at -e line 1.
    ________

    That spurious warning can be worked around, but I think the behaviour
    is not up to documentation; is it?

    Thanks,
    Ilya
     
    Ilya Zakharevich, Apr 11, 2006
    #1

  2. Guest

    Ilya Zakharevich wrote on Tue, 11 Apr 2006 02:53:58 +0000 (UTC):
    >I'm trying to use tr/// operator (instead of RExen), and do not think
    >it works... The simplified example is
    >
    > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

    > UTF-16 surrogate 0xdfff at -e line 1.
    > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
    > abcdefg
    >

    [...]
    >That spurious warning can be worked around, but I think the behaviour
    >is not up to documentation; is it?


    It's in the perldiag manpage:

    UTF-16 surrogate %s
    (W utf8) You tried to generate half of an UTF-16 surrogate by requesting a
    Unicode character between the code points 0xD800 and 0xDFFF (inclusive). That
    range is reserved exclusively for the use of UTF-16 encoding (by having two
    16-bit UCS-2 characters); but Perl encodes its characters in UTF-8, so what you
    got is a very illegal character. If you really know what you are doing you can
    turn off this warning by "no warnings 'utf8';".
     
    Guest, Apr 11, 2006
    #2

  3. [A complimentary Cc of this posting was sent to

    <>], who wrote in article <443b8741$0$5170$>:
    > > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

    > > UTF-16 surrogate 0xdfff at -e line 1.
    > > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
    > > abcdefg


    > >That spurious warning can be worked around, but I think the behaviour
    > >is not up to documentation; is it?


    > It's in the perldiag manpage:
    >
    > UTF-16 surrogate %s
    > (W utf8) You tried to generate half ...


    First of all, I assume that "it" is this broken warning (actually,
    one of two [duplicate] warnings). Since it does not apply to the
    situation I discuss, I hardly see how finding this message in the
    list of warnings is relevant.

    Second, what I was discussing was not the warning, but the ACTION. Do
    you think the RESULT ('abcdefg') is "correct"?

    Thanks anyway,
    Ilya

    P.S. Actually, the text in perldiag is also wrong:

    > of an UTF-16 surrogate by requesting a Unicode character between the
    > code points 0xD800 and 0xDFFF (inclusive). That range is reserved
    > exclusively for the use of UTF-16 encoding (by having two 16-bit
    > UCS-2 characters); but Perl encodes its characters in UTF-8, so what
    > you got is a very illegal character. If you really know what you
    > are doing you can turn off this warning by "no warnings 'utf8';".


    Perl (the language) does not encode its characters in UTF-8.
    Characters are not encoded in any way, they just "are". And, if you
    consider the implementation, the internal encoding is not UTF-8 either
    (in the Perl world it is called "utf8", and it is a proper superset). Sigh...
     
    Ilya Zakharevich, Apr 11, 2006
    #3
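
For reference, the result the first post expected can be spelled out without tr/// at all. What follows is only a minimal sketch, not Perl's implementation: the helper name complement_to_space is hypothetical, and it assumes the documented meaning of tr/RANGE/ /c, namely that every character outside RANGE is replaced with a space.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): replace every character whose
# code point lies OUTSIDE [$lo, $hi] with a space, which is what a
# complemented tr/// with a single-space replacement list is documented to do.
sub complement_to_space {
    my ($str, $lo, $hi) = @_;
    my $out = '';
    for my $ch (split //, $str) {
        my $cp = ord $ch;                              # code point of this character
        $out .= ($cp >= $lo && $cp <= $hi) ? $ch : ' ';
    }
    return $out;
}

# Same input and range as the simplified example in the first post.
my $result = complement_to_space('abcdefg', 0xE000, 0xE0FF);
print "[$result]\n";
```

None of the letters a..g falls in 0xE000..0xE0FF, so this prints seven spaces between the brackets rather than abcdefg.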
  4. Dr.Ruud

    Ilya Zakharevich wrote:

    > I'm trying to use tr/// operator (instead of RExen), and do not think
    > it works... The simplified example is
    >
    > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

    > UTF-16 surrogate 0xdfff at -e line 1.
    > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
    > abcdefg
    >
    > The original code contained something like
    >
    > perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
    > tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
    > Unicode character 0x1fffff is illegal at -e line 1.
    > ________
    >
    > That spurious warning can be worked around,


    Is it a "spurious warning"?

    perl -MO=Deparse -e '$_ = qq(\x{d7ff}\x{d800})'

    perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


    > but I think the behaviour
    > is not up to documentation; is it?


    It isn't.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 11, 2006
    #4
  5. thundergnat

    Ilya Zakharevich wrote:
    > I'm trying to use tr/// operator (instead of RExen), and do not think
    > it works... The simplified example is
    >
    > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

    > UTF-16 surrogate 0xdfff at -e line 1.
    > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
    > abcdefg
    >
    > The original code contained something like
    >
    > perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
    > tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
    > Unicode character 0x1fffff is illegal at -e line 1.
    > ________
    >
    > That spurious warning can be worked around, but I think the behaviour
    > is not up to documentation; is it?
    >


    It /does/ appear to be a bug in tr. Not in that it has a problem with
    characters in the range D800–DFFF, that doesn't surprise me much. Those
    /aren't/ legal utf-8 character codes. The thing that DOES surprise me is
    that tr considers \x{e000} (and \x{d7ff}!) to be in the range
    \x{d800}-\x{dfff}. Seems like tr is confused about the surrogates range.


    no error:
    perl -wle "$_ = q(abcdefg); tr/\x{e001}-\x{e0ff}/ /c; print"


    error
    perl -wle "$_ = q(abcdefg); tr/\x{e000}/ /c; print"


    error
    perl -wle "$_ = q(abcdefg); tr/\x{d7ff}/ /c; print"


    no error
    perl -wle "$_ = q(abcdefg); tr/\x{d7fe}/ /c; print"
     
    thundergnat, Apr 11, 2006
    #5
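
The boundary cases tried in the post above can be compared against the definition of the surrogate block using plain integer comparisons. This is a small sketch only: is_surrogate is a hypothetical helper, not a Perl built-in. It restates the range 0xD800..0xDFFF quoted from perldiag and does not exercise tr/// itself.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point lies in the UTF-16
# surrogate block 0xD800..0xDFFF (inclusive), false otherwise.
sub is_surrogate {
    my ($cp) = @_;
    return ($cp >= 0xD800 && $cp <= 0xDFFF) ? 1 : 0;
}

# The code points used in the tests above, plus both ends of the block.
for my $cp (0xD7FE, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xE001) {
    printf "U+%04X %s\n", $cp,
        is_surrogate($cp) ? "surrogate" : "not a surrogate";
}
```

By this definition 0xD7FF and 0xE000 are both outside the block, so a surrogate warning for them cannot be explained by the range alone.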
  6. [A complimentary Cc of this posting was sent to
    Dr.Ruud
    <>], who wrote in article <>:
    > > The original code contained something like
    > >
    > > perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
    > > tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
    > > Unicode character 0x1fffff is illegal at -e line 1.
    > > ________
    > >
    > > That spurious warning can be worked around,

    >
    > Is it a "spurious warning"?


    Looks so. What makes you doubt it? I'm working with Perl characters,
    not Unicode characters; and IIRC, even Unicode goes up to 0x1fffff...
    Or is it 0x10ffff?

    > perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


    What is your point? I do not see which output makes you think this is
    relevant... Did you try

    perl -MO=Deparse -e 'tr/\x{7ff}\x{800}//'

    Thanks,
    Ilya
     
    Ilya Zakharevich, Apr 11, 2006
    #6
  7. [A complimentary Cc of this posting was sent to
    Dr.Ruud
    <>], who wrote in article <>:
    > Is it a "spurious warning"?


    > perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


    Oops, ignore my preceding message; I was using the wrong quotes... So I
    see now where the Perl bug is:

    >perl -MO=Deparse -e "tr/\x{0000}-\x{ffff}//"

    Malformed UTF-8 character (character 0xffff) at -e line 1.
    Malformed UTF-8 character (character 0xffff) at -e line 1.
    use utf8 ();
    tr/\000//;
    -e syntax OK

    >perl -MO=Deparse -e "tr/\x{0000}-\x{fff0}//"

    use utf8 ();
    tr/\000-\x{fff0}//;
    -e syntax OK

    So some Perl developer assumed that Perl characters == Unicode
    characters, and the code mangles the pattern without reporting errors...

    A lot of thanks,
    Ilya
     
    Ilya Zakharevich, Apr 11, 2006
    #7
  8. [A complimentary Cc of this posting was sent to
    thundergnat
    <>], who wrote in article <>:
    > It /does/ appear to be a bug in tr. Not in that it has a problem with
    > characters in the range D800–DFFF, that doesn't surprise me much. Those
    > /aren't/ legal utf-8 character codes.


    Let me disagree. First, I know of no such thing as utf-8. Second, if
    you mean utf8, legal codes are 0..MAX_UV (since the size of UV is
    specific to Perl build, this depends on the build of Perl executable).

    Some codes would not appear in Unicode strings; but one should be able
    to treat "binary" data freely (including 0..31 and 0x80..0x9F ranges,
    and other characters which have no Unicode-consortium-assigned
    cultural information).

    Thanks,
    Ilya
     
    Ilya Zakharevich, Apr 11, 2006
    #8
  9. Guest

    Ilya Zakharevich wrote on Tue, 11 Apr 2006 16:17:49 +0000 (UTC):
    > Since it does not apply to the
    >situation I discuss, I can hardly find your finding this message in
    >the list of warnings relevant.
    >
    >Second, what I was discussing was not the warning, but the ACTION. Do
    >you think the RESULT ('abcdefg') is "correct"?


    The warning seems relevant, as avoiding the 0xD800-0xDFFF range seems to give a
    good result:


    $ perl -wle '$_ = q(abcdefg); tr/\x{d7ff}-\x{e0ff}/ /c; print'
     
    Guest, Apr 12, 2006
    #9
  10. Ben Bacarisse, Apr 13, 2006
    #10
  11. Dr.Ruud

    Ben Bacarisse wrote:
    > Ilya Zakharevich:


    >> Let me disagree. First, I know of no such thing as utf-8. Second,
    >> if you mean utf8

    >
    > The proper form is UTF-8 (i.e. with caps) so your correction (further
    > from the accepted form) seems rather harsh!


    Please read

    perldoc Encode
    perldoc utf8


    In a Perl context, 'utf8' is commonly read as the proper subset of
    'UTF-8' currently used by Perl.
    See also Ilya's news:e1gkrd$2hr$

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 13, 2006
    #11
  12. On Thu, 13 Apr 2006 15:07:48 +0200, Dr.Ruud wrote:

    > Ben Bacarisse wrote:
    >> Ilya Zakharevich:

    >
    >>> Let me disagree. First, I know of no such thing as utf-8. Second,
    >>> if you mean utf8

    >>
    >> The proper form is UTF-8 (i.e. with caps) so your correction (further
    >> from the accepted form) seems rather harsh!

    >
    > Please read
    >
    > perldoc Encode
    > perldoc utf8
    >
    >
    > In a Perl context, 'utf8' is commonly read as the proper subset of
    > 'UTF-8' currently used by Perl.


    I was rather glib, sorry. It was the (understandably) irritable "I know
    of no such thing as utf-8" when the author almost certainly knows about
    utf8, utf-8, UTF-8 and their meanings in and out of Perl that caused me to
    post too rapidly.

    --
    Ben.
     
    Ben Bacarisse, Apr 13, 2006
    #12
  13. [A complimentary Cc of this posting was sent to
    Dr.Ruud
    <>], who wrote in article <>:
    > In a Perl context, 'utf8' is commonly read as the proper subset of
    > 'UTF-8' currently used by Perl.


    utf8 is a proper SUPERSET of UTF-8. The former is not restricted to
    any particular range of non-negative integers; the current
    implementation goes 0..0xFFFFFFFFFFFFFFFF (i.e., maximal range of
    native unsigned integers currently used in Perl), and there are "free"
    bits to extend it to, e.g., 128 bits, if Perl is used on an architecture
    with sizeof(UV) = 128 bits.

    UTF-8 is "legally" restricted to 0..0x1FFFFF, although technically, it
    can cover up to, IIRC, 0..0x1FFFFFFF.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 14, 2006
    #13
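
The size limits mentioned in this exchange can be worked out from the bit layout. The sketch below assumes the original UTF-8 layout with sequences of up to 6 bytes, where the lead byte of an n-byte sequence (n >= 2) carries 7-n payload bits and each continuation byte carries 6. The function name max_for_length is hypothetical, and Perl's internal utf8 extension beyond this layout is not modelled.

```perl
use strict;
use warnings;

# Hypothetical helper: largest value an n-byte sequence can hold under the
# classic UTF-8 bit layout (n = 1..6).
sub max_for_length {
    my ($n) = @_;
    # 1 byte: 7 payload bits; otherwise (7 - n) bits in the lead byte
    # plus 6 bits in each of the (n - 1) continuation bytes.
    my $bits = $n == 1 ? 7 : (7 - $n) + 6 * ($n - 1);
    return (1 << $bits) - 1;
}

printf "%d byte(s): up to 0x%X\n", $_, max_for_length($_) for 1 .. 6;
```

Under this layout the 4-byte form reaches 0x1FFFFF and the 6-byte form 0x7FFFFFFF. RFC 3629 restricts UTF-8 proper to 0..0x10FFFF, which is the Unicode code space.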
  14. Dr.Ruud

    Ilya Zakharevich wrote:
    > [A complimentary Cc of this posting was sent to
    > Dr.Ruud


    Please don't do that. This is a newsgroup. Even with mailing lists I
    wouldn't do that, unless it is specifically requested somehow.

    > rvtol:


    >> In a Perl context, 'utf8' is commonly read as the proper subset of
    >> 'UTF-8' currently used by Perl.

    >
    > utf8 is a proper SUPERSET of UTF-8.


    Yes, sorry. When I wrote that I had a huge headache, which has just left
    together with one of my wisdom teeth.


    > The former is not restricted to
    > any particular range of non-negative integers; the current
    > implementation goes 0..0xFFFFFFFFFFFFFFFF (i.e., maximal range of
    > native unsigned integers currently used in Perl), and there are "free"
    > bits to extend it to, e.g., 128bit - if Perl is used on architecture
    > with sizeof(UV) = 128bits.
    >
    > UTF-8 is "legally" restricted to 0..0x1FFFFF, although technically, it
    > can cover up to, IIRC, 0..0x1FFFFFFF.


    OK, thanks.


    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 14, 2006
    #14