use binary operator on ascii text string

Peter J. Holzer

David said:
I don't think the docs are that unclear. In perlfunc#pack it says:

"C An unsigned char value. Only does bytes. See U for Unicode."

Yup. But that's for pack, and there is no ambiguity for pack.
pack('C*', 0xFC) always returns "\x{FC}". But the reverse operation is
ambiguous: unpack('C*', "\x{FC}") may return (0xFC) or (0xC3, 0xBC),
depending on whether the string happens to have the UTF-8 flag set or
not. I find this surprising and I find no mention that this is the
intended behaviour (rather than a side-effect of the implementation).
"Only does bytes" in the description of pack IMHO means "pack takes only
values from 0 to 255 and returns a byte string". It doesn't explicitly
say anything about the behaviour of unpack when fed a UTF-8 string, and
I'd like to have this explicitly spelled out (even if it is only "the
behaviour is undefined").
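
The two internal representations described above can be produced on purpose. The following is a minimal sketch, not taken from the Perl documentation: the variable names are illustrative, and it relies only on the built-in utf8::upgrade and utf8::is_utf8 functions. The unpack('C*', ...) lines are printed without an expected result because that result is exactly what is in question here: on the 5.8 series discussed in this thread it follows the UTF-8 flag, while releases from 5.10 on changed pack/unpack to work on characters by default.

```perl
use strict;
use warnings;

# The same one-character string, U+00FC, in two internal representations.
my $plain    = "\x{FC}";   # stored as the single byte 0xFC, UTF-8 flag off
my $upgraded = "\x{FC}";
utf8::upgrade($upgraded);  # same character, stored internally as 0xC3 0xBC

my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# As strings the two are indistinguishable.
my $equal = $plain eq $upgraded ? 1 : 0;

print "UTF-8 flag: plain=$plain_flag upgraded=$upgraded_flag\n";
print "plain eq upgraded: $equal\n";

# What unpack('C*') reports for each string depends on the Perl version,
# so no expected output is given for these two lines.
for my $s ($plain, $upgraded) {
    print join(' ', map { sprintf '0x%02X', $_ } unpack('C*', $s)), "\n";
}
```

Running this on the Perl you actually deploy on is the quickest way to see which of the two behaviours you get.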

hp
 
Peter J. Holzer

Peter said:
Yup. But that's for pack, and there is no ambiguity for pack.
pack('C*', 0xFC) always returns "\x{FC}". But the reverse operation is
ambiguous: unpack('C*', "\x{FC}") may return (0xFC) or (0xC3, 0xBC),
depending on whether the string happens to have the UTF-8 flag set or
not. I find this surprising and I find no mention that this is the
intended behaviour (rather than a side-effect of the implementation).
"Only does bytes" in the description of pack IMHO means "pack takes only
values from 0 to 255 and returns a byte string". It doesn't explicitly
say anything about the behaviour of unpack when fed a UTF-8 string, and
I'd like to have this explicitly spelled out (even if it is only "the
behaviour is undefined").

Actually, it is specified, just not where I expected it.

perldoc perlunicode:

· The "chr()" and "ord()" functions work on characters, similar to
"pack("U")" and "unpack("U")", not "pack("C")" and "unpack("C")".
"pack("C")" and "unpack("C")" are methods for emulating byte-oriented
"chr()" and "ord()" on Unicode strings. While these methods reveal the
internal encoding of Unicode strings, that is not something one normally
needs to care about at all.
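
To illustrate the character-oriented side of that paragraph, here is a small sketch with illustrative variable names. It deliberately avoids relying on how unpack("C") treats a flagged string: the bytes are made explicit with utf8::encode first, so the final unpack only ever sees a plain byte string.

```perl
use strict;
use warnings;

# chr() and ord() work on characters, whatever the internal storage.
my $char = chr(0xFC);        # one character, U+00FC
my $code = ord($char);       # 0xFC

my $wide      = chr(0x20AC); # EURO SIGN, does not fit in a single byte
my $wide_code = ord($wide);  # 0x20AC

# To look at bytes, encode a copy explicitly instead of peeking at the
# internal representation.
my $octets = $wide;
utf8::encode($octets);       # $octets is now a byte string holding UTF-8
my @bytes = unpack 'C*', $octets;

printf "ord: 0x%X, 0x%X\n", $code, $wide_code;
printf "characters: %d, bytes: %d\n", length($wide), length($octets);
print join(' ', map { sprintf '0x%02X', $_ } @bytes), "\n";  # 0xE2 0x82 0xAC
```

Encoding a copy keeps the original character string untouched, so the byte view does not depend on whether the UTF-8 flag happened to be set.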

BTW, there is an open bug about a similar matter:
http://rt.perl.org/rt3/Ticket/Display.html?id=33734

Looks like the behaviour of unpack will change for Unicode strings in
5.8.9 and 5.10 (although probably not for "C").

hp
 
