tr/// broken?

Ilya Zakharevich · Apr 11, 2006

I'm trying to use the tr/// operator (instead of RExen), and I do not think
it works... The simplified example is

perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

Thanks,
Ilya

Guest · Apr 11, 2006

Ilya Zakharevich wrote on Tue, 11 Apr 2006 02:53:58 +0000 (UTC):

I'm trying to use tr/// operator (instead of RExen), and do not think
it works... The simplified example is

perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"


UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg
[...]
That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

It's in the perldiag manpage:

UTF-16 surrogate %s
(W utf8) You tried to generate half of an UTF-16 surrogate by requesting a
Unicode character between the code points 0xD800 and 0xDFFF (inclusive). That
range is reserved exclusively for the use of UTF-16 encoding (by having two
16-bit UCS-2 characters); but Perl encodes its characters in UTF-8, so what you
got is a very illegal character. If you really know what you are doing you can
turn off this warning by "no warnings 'utf8';".

Ilya Zakharevich · Apr 11, 2006

[A complimentary Cc of this posting was sent to

Its in the perldiag manpage :

UTF-16 surrogate %s
(W utf8) You tried to generate half ...

First of all, I assume that "it" is this broken warning (actually,
one of two [duplicate] warnings). Since it does not apply to the
situation I discuss, I can hardly see how finding this message in
the list of warnings is relevant.

Second, what I was discussing was not the warning, but the ACTION. Do
you think the RESULT ('abcdefg') is "correct"?

Thanks anyway,
Ilya

P.S. Actually, the text in perldiag is also wrong:

of an UTF-16 surrogate by requesting a Unicode character between the
code points 0xD800 and 0xDFFF (inclusive). That range is reserved
exclusively for the use of UTF-16 encoding (by having two 16-bit
UCS-2 characters); but Perl encodes its characters in UTF-8, so what
you got is a very illegal character. If you really know what you
are doing you can turn off this warning by "no warnings 'utf8';".

Perl (the language) does not encode its characters in UTF-8.
Characters are not encoded in any way, they just "are". And, if you
consider the implementation, the internal encoding is not UTF-8 either (it
is known in the Perl world as "utf8", and is a proper superset). Sigh...

Dr.Ruud · Apr 11, 2006

Ilya Zakharevich wrote:

I'm trying to use tr/// operator (instead of RExen), and do not think
it works... The simplified example is

UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around,

Is it a "spurious warning"?

perl -MO=Deparse -e '$_ = qq(\x{d7ff}\x{d800})'

perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

but I think the behaviour
is not up to documentation; is it?

It isn't.

thundergnat · Apr 11, 2006

Ilya said:
I'm trying to use tr/// operator (instead of RExen), and do not think
it works... The simplified example is

UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

It /does/ appear to be a bug in tr. Not in that it has a problem with
characters in the range D800–DFFF, that doesn't surprise me much. Those
/aren't/ legal utf-8 character codes. The thing that DOES surprise me is
that tr considers \x{e000} (and \x{d7ff}!) to be in the range
\x{d800}-\x{dfff}. Seems like tr is confused about the surrogates range.

no error:
perl -wle "$_ = q(abcdefg); tr/\x{e001}-\x{e0ff}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{e000}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{d7ff}/ /c; print"

no error:
perl -wle "$_ = q(abcdefg); tr/\x{d7fe}/ /c; print"

Ilya Zakharevich · Apr 11, 2006

[A complimentary Cc of this posting was sent to
Dr.Ruud

Is it a "spurious warning"?

Looks so. What makes you doubt it? I'm working with Perl characters,
not Unicode characters; and IIRC, even Unicode goes up to 0x1fffff...
Or is it 0x10ffff?

perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

What is your point? I do not see which output makes you think this is
relevant... Did you try

perl -MO=Deparse -e 'tr/\x{7ff}\x{800}//'

Thanks,
Ilya

Ilya Zakharevich · Apr 11, 2006

[A complimentary Cc of this posting was sent to
Dr.Ruud

Is it a "spurious warning"?

perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

Oops, ignore my preceding message; I was using the wrong quotes... So I
see now where the Perl bug is:

perl -MO=Deparse -e "tr/\x{0000}-\x{ffff}//"

Malformed UTF-8 character (character 0xffff) at -e line 1.
Malformed UTF-8 character (character 0xffff) at -e line 1.
use utf8 ();
tr/\000//;
-e syntax OK

perl -MO=Deparse -e "tr/\x{0000}-\x{fff0}//"

use utf8 ();
tr/\000-\x{fff0}//;
-e syntax OK

So some Perl developer thought that Perl characters == Unicode
characters, and the pattern gets mangled without any error being reported...

A lot of thanks,
Ilya

Ilya Zakharevich · Apr 11, 2006

[A complimentary Cc of this posting was sent to
thundergnat

It /does/ appear to be a bug in tr. Not in that it has a problem with
characters in the range D800–DFFF, that doesn't surprise me much. Those
/aren't/ legal utf-8 character codes.

Let me disagree. First, I know of no such thing as utf-8. Second, if
you mean utf8, legal codes are 0..MAX_UV (since the size of UV is
specific to Perl build, this depends on the build of Perl executable).

Some codes would not appear in Unicode strings; but one should be able
to treat "binary" data freely (including 0..31 and 0x80..0x9F ranges,
and other characters which have no Unicode-consortium-assigned
cultural information).

Thanks,
Ilya

Guest · Apr 12, 2006

Ilya Zakharevich wrote on Tue, 11 Apr 2006 16:17:49 +0000 (UTC):

Since it does not apply to the
situation I discuss, I can hardly find your finding this message in
the list of warnings relevant.

Second, what I was discussing was not the warning, but the ACTION. Do
you think the RESULT ('abcdefg') is "correct"?

The warning seems relevant, as avoiding the 0xD800-0xDFFF range seems to give a
good result:

$ perl -wle '$_ = q(abcdefg); tr/\x{d7ff}-\x{e0ff}/ /c; print'

Ben Bacarisse · Apr 13, 2006

Let me disagree. First, I know of no such thing as utf-8. Second, if
you mean utf8

The proper form is UTF-8 (i.e. with caps) so your correction (further from
the accepted form) seems rather harsh!

Refs:
http://www.unicode.org/versions/Unicode3.0.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Dr.Ruud · Apr 13, 2006

Ben Bacarisse wrote:

Ilya Zakharevich:

The proper form is UTF-8 (i.e. with caps) so your correction (further
from the accepted form) seems rather harsh!

Please read

perldoc Encode
perldoc utf8

In a Perl context, 'utf8' is commonly read as the proper subset of
'UTF-8' currently used by Perl.
See also Ilya's

Ben Bacarisse · Apr 13, 2006

Ben Bacarisse wrote:

Please read

perldoc Encode
perldoc utf8

In a Perl context, 'utf8' is commonly read as the proper subset of
'UTF-8' currently used by Perl.

I was rather glib, sorry. It was the (understandably) irritable "I know
of no such thing as utf-8" when the author almost certainly knows about
utf8, utf-8, UTF-8 and their meanings in and out of Perl that caused me to
post too rapidly.

Ilya Zakharevich · Apr 14, 2006

[A complimentary Cc of this posting was sent to
Dr.Ruud

In a Perl context, 'utf8' is commonly read as the proper subset of
'UTF-8' currently used by Perl.

utf8 is a proper SUPERSET of UTF-8. The former is not restricted to
any particular range of non-negative integers; the current
implementation goes 0..0xFFFFFFFFFFFFFFFF (i.e., maximal range of
native unsigned integers currently used in Perl), and there are "free"
bits to extend it to, e.g., 128bit - if Perl is used on architecture
with sizeof(UV) = 128bits.

UTF-8 is "legally" restricted to 0..0x1FFFFF, although technically, it
can cover up to, IIRC, 0..0x1FFFFFFF.

Hope this helps,
Ilya

Dr.Ruud · Apr 14, 2006

Ilya Zakharevich wrote:

[A complimentary Cc of this posting was sent to
Dr.Ruud

Please don't do that. This is a newsgroup. Even with mailing lists I
wouldn't do that, unless it is specifically requested somehow.

rvtol:

utf8 is a proper SUPERSET of UTF-8.

Yes, sorry. When I wrote that I had a huge headache, that has just left
together with one of my wisdom teeth.

The former is not restricted to
any particular range of non-negative integers; the current
implementation goes 0..0xFFFFFFFFFFFFFFFF (i.e., maximal range of
native unsigned integers currently used in Perl), and there are "free"
bits to extend it to, e.g., 128bit - if Perl is used on architecture
with sizeof(UV) = 128bits.

UTF-8 is "legally" restricted to 0..0x1FFFFF, although technically, it
can cover up to, IIRC, 0..0x1FFFFFFF.

OK, thanks.
