tr/// broken?

Ilya Zakharevich

I'm trying to use the tr/// operator (instead of RExen), and I do not think
it works... The simplified example is
perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

Thanks,
Ilya
 
Guest

Ilya Zakharevich wrote on Tue, 11 Apr 2006 02:53:58 +0000 (UTC):
I'm trying to use the tr/// operator (instead of RExen), and I do not think
it works... The simplified example is
perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg
[...]
That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

Its in the perldiag manpage :

UTF-16 surrogate %s
(W utf8) You tried to generate half of an UTF-16 surrogate by requesting a
Unicode character between the code points 0xD800 and 0xDFFF (inclusive). That
range is reserved exclusively for the use of UTF-16 encoding (by having two
16-bit UCS-2 characters); but Perl encodes its characters in UTF-8, so what you
got is a very illegal character. If you really know what you are doing you can
turn off this warning by "no warnings 'utf8';".
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to

Its in the perldiag manpage :

UTF-16 surrogate %s
(W utf8) You tried to generate half ...

First of all, I assume that "its" is this broken warning (actually,
one of two [duplicate] warnings). Since it does not apply to the
situation I discuss, I can hardly see how your finding this message in
the list of warnings is relevant.

Second, what I was discussing was not the warning, but the ACTION. Do
you think the RESULT ('abcdefg') is "correct"?

Thanks anyway,
Ilya

P.S. Actually, the text in perldiag is also wrong:
(W utf8) You tried to generate half of an UTF-16 surrogate by requesting
a Unicode character between the code points 0xD800 and 0xDFFF
(inclusive). That range is reserved exclusively for the use of UTF-16
encoding (by having two 16-bit UCS-2 characters); but Perl encodes its
characters in UTF-8, so what you got is a very illegal character. If you
really know what you are doing you can turn off this warning by
"no warnings 'utf8';".

Perl (the language) does not encode its characters in UTF-8.
Characters are not encoded in any way, they just "are". And, if you
consider the implementation, the internal encoding is not UTF-8 either (in
the Perl world it is called "utf8", and it is a proper superset of UTF-8). Sigh...
 
Dr.Ruud

Ilya Zakharevich wrote:
I'm trying to use the tr/// operator (instead of RExen), and I do not think
it works... The simplified example is
perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around,

Is it a "spurious warning"?

perl -MO=Deparse -e '$_ = qq(\x{d7ff}\x{d800})'

perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

but I think the behaviour
is not up to documentation; is it?

It isn't.
 
thundergnat

Ilya said:
I'm trying to use the tr/// operator (instead of RExen), and I do not think
it works... The simplified example is
perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"
UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

It /does/ appear to be a bug in tr. Not in that it has a problem with
characters in the range D800–DFFF, that doesn't surprise me much. Those
/aren't/ legal utf-8 character codes. The thing that DOES surprise me is
that tr considers \x{e000} (and \x{d7ff}!) to be in the range
\x{d800}-\x{dfff}. Seems like tr is confused about the surrogates range.


no error:
perl -wle "$_ = q(abcdefg); tr/\x{e001}-\x{e0ff}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{e000}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{d7ff}/ /c; print"

no error:
perl -wle "$_ = q(abcdefg); tr/\x{d7fe}/ /c; print"
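
One possible reading of the four one-liners above, offered only as a guess: if /c builds the complement as a list of ranges, and the warning fires only when a range endpoint lands in D800..DFFF, then the observed pattern falls out. The Python sketch below models that hypothesis; it is not taken from the Perl source, and the names complement, surrogate_endpoints, and MAX_CP are made up for illustration (MAX_CP is a stand-in, since Perl's real upper bound is larger).

```python
# Hypothetical model of tr///c, NOT Perl's actual implementation.
SURROGATES = range(0xD800, 0xE000)   # D800..DFFF inclusive
MAX_CP = 0x10FFFF                    # assumed stand-in for the top of the range


def complement(ranges, max_cp=MAX_CP):
    """Return the inclusive ranges in 0..max_cp not covered by `ranges`."""
    out = []
    nxt = 0
    for lo, hi in sorted(ranges):
        if lo > nxt:
            out.append((nxt, lo - 1))
        nxt = max(nxt, hi + 1)
    if nxt <= max_cp:
        out.append((nxt, max_cp))
    return out


def surrogate_endpoints(ranges):
    """List the range endpoints that fall inside the surrogate block."""
    return [cp for lo, hi in ranges for cp in (lo, hi) if cp in SURROGATES]


if __name__ == "__main__":
    cases = {
        "e001-e0ff": [(0xE001, 0xE0FF)],
        "e000": [(0xE000, 0xE000)],
        "d7ff": [(0xD7FF, 0xD7FF)],
        "d7fe": [(0xD7FE, 0xD7FE)],
    }
    for label, ranges in cases.items():
        bad = surrogate_endpoints(complement(ranges))
        print(label, [hex(cp) for cp in bad] or "no surrogate endpoint")
```

Under this model the complement of \x{e000} ends a range at 0xDFFF and the complement of \x{d7ff} starts one at 0xD800, while the two "no error" cases have no endpoint inside the surrogate block.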
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Dr.Ruud
Is it a "spurious warning"?

Looks so. What makes you doubt it? I'm working with Perl characters,
not Unicode characters; and IIRC, even Unicode goes up to 0x1fffff...
Or is it 0x10ffff?
perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

What is your point? I do not see which output makes you think this is
relevant... Did you try

perl -MO=Deparse -e 'tr/\x{7ff}\x{800}//'

Thanks,
Ilya
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Dr.Ruud
Is it a "spurious warning"?
perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'

Oops, ignore my preceding message; I was using the wrong quotes... So I
see now where the Perl bug is:
perl -MO=Deparse -e "tr/\x{0000}-\x{ffff}//"
Malformed UTF-8 character (character 0xffff) at -e line 1.
Malformed UTF-8 character (character 0xffff) at -e line 1.
use utf8 ();
tr/\000//;
-e syntax OK
perl -MO=Deparse -e "tr/\x{0000}-\x{fff0}//"
use utf8 ();
tr/\000-\x{fff0}//;
-e syntax OK

So some Perl developer thought that Perl characters == Unicode
characters, and mangled the pattern without reporting errors...

A lot of thanks,
Ilya
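
For reference, the code points that triggered complaints in this thread each fall into a category that strict Unicode treats specially. The Python sketch below only classifies them; it says nothing certain about what Perl does internally, and the function name classify and its labels are made up for illustration.

```python
def classify(cp):
    """Label a code point by how strict Unicode treats it (labels are ad hoc)."""
    if cp > 0x10FFFF:
        return "beyond Unicode"      # Unicode stops at U+10FFFF
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"           # reserved for UTF-16 pairs
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"        # U+FDD0..U+FDEF and the last two of each plane
    return "ordinary"                # includes private-use code points such as U+E155


if __name__ == "__main__":
    for cp in (0xFFF0, 0xFFFF, 0xDFFF, 0x1FFFFF, 0xE155):
        print(hex(cp), classify(cp))
```

This lines up with the Deparse output above: 0xfff0 is ordinary, while 0xffff is a noncharacter.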
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
thundergnat
It /does/ appear to be a bug in tr. Not in that it has a problem with
characters in the range D800–DFFF, that doesn't surprise me much. Those
/aren't/ legal utf-8 character codes.

Let me disagree. First, I know of no such thing as utf-8. Second, if
you mean utf8, legal codes are 0..MAX_UV (since the size of UV is
specific to Perl build, this depends on the build of Perl executable).

Some codes would not appear in Unicode strings; but one should be able
to treat "binary" data freely (including 0..31 and 0x80..0x9F ranges,
and other characters which have no Unicode-consortium-assigned
cultural information).

Thanks,
Ilya
 
Guest

Ilya Zakharevich wrote on Tue, 11 Apr 2006 16:17:49 +0000 (UTC):
Since it does not apply to the
situation I discuss, I can hardly see how your finding this message in
the list of warnings is relevant.

Second, what I was discussing was not the warning, but the ACTION. Do
you think the RESULT ('abcdefg') is "correct"?

The warning seems relevant, as avoiding the 0xD800-0xDFFF range seems to give a
good result:


$ perl -wle '$_ = q(abcdefg); tr/\x{d7ff}-\x{e0ff}/ /c; print'
 
Dr.Ruud

Ben Bacarisse wrote:
Ilya Zakharevich:

The proper form is UTF-8 (i.e. with caps) so your correction (further
from the accepted form) seems rather harsh!

Please read

perldoc Encode
perldoc utf8


In a Perl context, 'utf8' is commonly read as the proper subset of
'UTF-8' currently used by Perl.
See also Ilya's
 
Ben Bacarisse

Ben Bacarisse wrote:

Please read

perldoc Encode
perldoc utf8


In a Perl context, 'utf8' is commonly read as the proper subset of
'UTF-8' currently used by Perl.

I was rather glib, sorry. It was the (understandably) irritable "I know
of no such thing as utf-8" when the author almost certainly knows about
utf8, utf-8, UTF-8 and their meanings in and out of Perl that caused me to
post too rapidly.
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Dr.Ruud
In a Perl context, 'utf8' is commonly read as the proper subset of
'UTF-8' currently used by Perl.

utf8 is a proper SUPERSET of UTF-8. The former is not restricted to
any particular range of non-negative integers; the current
implementation goes 0..0xFFFFFFFFFFFFFFFF (i.e., maximal range of
native unsigned integers currently used in Perl), and there are "free"
bits to extend it to, e.g., 128 bits, if Perl is used on an architecture
with sizeof(UV) = 128 bits.

UTF-8 is "legally" restricted to 0..0x1FFFFF, although technically, it
can cover up to, IIRC, 0..0x1FFFFFFF.

Hope this helps,
Ilya
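
The limits mentioned above can be checked from the byte layout. As a small Python sketch (the helper name utf8_max_for_length is made up for illustration): a lead byte of an n-byte sequence carries 7 - n payload bits and each continuation byte carries 6, so the 4-byte form tops out at 0x1FFFFF and the original 6-byte scheme at 0x7FFFFFFF. The Unicode standard itself stops at 0x10FFFF, which still fits in 4 bytes.

```python
def utf8_max_for_length(n):
    """Largest value an n-byte UTF-8-style sequence can carry (1 <= n <= 6)."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if n == 1:
        return 0x7F                          # 0xxxxxxx: 7 payload bits
    payload_bits = (7 - n) + 6 * (n - 1)     # lead byte bits + continuation bits
    return (1 << payload_bits) - 1


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, hex(utf8_max_for_length(n)))
    # Python's own str type enforces the Unicode ceiling of 0x10FFFF.
    print(len(chr(0x10FFFF).encode("utf-8")))
```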
 
Dr.Ruud

Ilya Zakharevich wrote:
[A complimentary Cc of this posting was sent to
Dr.Ruud

Please don't do that. This is a newsgroup. Even with mailing lists I
wouldn't do that, unless it is specifically requested somehow.
rvtol:

utf8 is a proper SUPERSET of UTF-8.

Yes, sorry. When I wrote that I had a huge headache, that has just left
together with one of my wisdom teeth.

The former is not restricted to
any particular range of non-negative integers; the current
implementation goes 0..0xFFFFFFFFFFFFFFFF (i.e., maximal range of
native unsigned integers currently used in Perl), and there are "free"
bits to extend it to, e.g., 128 bits, if Perl is used on an architecture
with sizeof(UV) = 128 bits.

UTF-8 is "legally" restricted to 0..0x1FFFFF, although technically, it
can cover up to, IIRC, 0..0x1FFFFFFF.

OK, thanks.
 
