Why is utf8::upgrade needed?

Petr Pajas

Hi,
I'm using Perl 5.8.3 and want it to be 100% UTF-8. However, I'm having
trouble with Latin-1 characters in strings: they seem to remain
byte-encoded unless I explicitly call utf8::upgrade, which is very
annoying.

In the example below, \x{e1} is the Latin-1 small a-acute and \x{168} is a
non-Latin-1 character (U with tilde). The code shows that \x{e1}
stays non-UTF8 until it either meets a non-Latin-1 character or
utf8::upgrade is called. Can anyone explain why (and possibly
how to avoid that)?

$ perl -e '
use utf8;
use Devel::Peek;
$a="\x{e1}";
$b="\x{e1}\x{168}";
Dump($a);
Dump($b);
utf8::upgrade($a);
Dump($a)'

SV = PV(0x8150000) at 0x816a488
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x8163af8 "\341"\0
  CUR = 1
  LEN = 2
SV = PV(0x8150090) at 0x816a4c4
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x8162530 "\303\241\305\250"\0 [UTF8 "\x{e1}\x{168}"]
  CUR = 4
  LEN = 5
SV = PV(0x8150000) at 0x816a488
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x81701a8 "\303\241"\0 [UTF8 "\x{e1}"]
  CUR = 2
  LEN = 3

Thanks,

-- Petr
 
Tassilo v. Parseval

Also sprach Petr Pajas:
In the example below, \x{e1} is the Latin-1 small a-acute and \x{168} is a
non-Latin-1 character (U with tilde). The code shows that \x{e1}
stays non-UTF8 until it either meets a non-Latin-1 character or
utf8::upgrade is called.

As long as the numerical value of each character in the string fits into
one byte, actually. Latin-1 is such a one-byte encoding, so perl does
not yet upgrade the string to UTF-8.

Can anyone explain why (and possibly how to avoid that)?

Turn that around. Why do you want everything to be unicode? In all but
the most pathological cases you can trust perl to do the right thing
with your strings, upgrading when necessary etc.
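
To make that concrete, here is a minimal sketch (the variable names are illustrative and not from the original posts) suggesting that an upgraded and a non-upgraded copy differ only in their internal storage, while behaving as the same string at the Perl level:

```perl
use strict;
use warnings;

# Two copies of the same one-character string (a-acute, code point 0xE1).
my $plain    = "\x{e1}";
my $upgraded = "\x{e1}";
utf8::upgrade($upgraded);    # changes the internal storage only

# The internal UTF8 flag differs between the two copies...
my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# ...but at the Perl level they are the same string.
my $same_string = ($plain eq $upgraded)                 ? 1 : 0;
my $same_length = (length($plain) == length($upgraded)) ? 1 : 0;

print "plain flagged: $plain_flag, upgraded flagged: $upgraded_flag\n";
print "equal: $same_string, same length: $same_length\n";
```

The difference only becomes visible to code that inspects the flag or the raw buffer, which is something an XS extension can do.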

Tassilo
 
Petr Pajas

Tassilo said:
Turn that around. Why do you want everything to be unicode? In all but
the most pathological cases you can trust perl to do the right thing
with your strings, upgrading when necessary etc.

Well, I'm passing the strings to an XS module for XML.
If the module finds the UTF8 flag on a string, it knows what to do.
If not, it assumes I'm passing a string in the encoding of the
XML document (not necessarily Latin-1), and that causes problems:
"\x{e1}" isn't UTF8-flagged, so while Perl treats it as Latin-1,
the XML module may interpret it quite differently. So I have to call
utf8::upgrade to make sure the string is converted to UTF-8 and gets
the UTF8 flag.

-- Petr
 
Alan J. Flavell

\x{168} is a non-Latin-1 character (U with tilde). The code shows that
\x{e1} stays non-UTF8 until it either meets a non-Latin-1 character or
utf8::upgrade is called. Can anyone explain why (and possibly
how to avoid that)?

To try to answer the question "why", the documentation explains this
in terms of transparent compatibility with older 8-bit handling.

http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Byte-and-Character-Semantics

For how to deal with that in practice,

http://www.perldoc.com/perl5.8.4/po...nicode-in-Perl-(Or-Unforcing-Unicode-in-Perl)

(and the following heading) seem to be particularly relevant.

Maybe I misunderstood what you were saying, but you can't just mark an
iso-8859-1 string as utf8; it's necessary to cause Perl to genuinely
create the utf8 version from the 8-bit-coded version. As I understand
it, once the utf8 version has been created it won't be quietly
destroyed; so if a character > 255 is appended to a string (causing
upgrade to utf8) and then taken off again, the string will still be
held in utf8 form, unless one explicitly down-converts it. I'd
suggest

http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Interaction-with-Extensions

in relation to your specific interest.
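
The "won't be quietly destroyed" behaviour described above can be checked with a short sketch like the following. It is only an illustration with made-up variable names: it appends a character above 255, removes it again, and then down-converts explicitly with utf8::downgrade.

```perl
use strict;
use warnings;

my $s = "\x{e1}";                        # fits in one byte, not upgraded yet
my $before = utf8::is_utf8($s) ? 1 : 0;

$s .= "\x{168}";                         # a code point > 255 forces an upgrade
chop $s;                                 # take that character off again
my $after_chop = utf8::is_utf8($s) ? 1 : 0;

# utf8::downgrade converts back in place; it succeeds here because every
# remaining character fits into a single byte.
utf8::downgrade($s);
my $after_downgrade = utf8::is_utf8($s) ? 1 : 0;

print "before=$before after_chop=$after_chop after_downgrade=$after_downgrade\n";
```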

hope this helps
 
Tassilo v. Parseval

Ah, that's indeed a legitimate reason. This module you're talking about,
is that under your control? In this case, you could have the module do a
sv_utf8_upgrade() on its arguments which might already be enough to make
it all work.

Otherwise, maybe contacting the author would be in order.

Tassilo
 
Petr Pajas

Alan said:
To try to answer the question "why", the documentation explains this
in terms of transparent compatibility with older 8-bit handling.
http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Byte-and-Character-Semantics

I see, the answer seems to be here:
"For operations where this determination cannot be made without additional
information from the user, Perl decides in favor of compatibility and
chooses to use byte semantics.
....

"Such data may come from filehandles, from calls to external programs, from
information provided by the system (such as %ENV), or from literals and
constants in the source text."

"\x{e1}" is a literal, right? Perl can't decide between bytes and
characters for it, so it chooses byte semantics and I have to upgrade it
myself.

For how to deal with that in practice,

http://www.perldoc.com/perl5.8.4/po...nicode-in-Perl-(Or-Unforcing-Unicode-in-Perl)

(and the following heading) seem to be particularly relevant.

Maybe I misunderstood what you were saying, but you can't just mark an
iso-8859-1 string as utf8; it's necessary to cause Perl to genuinely
create the utf8 version from the 8-bit-coded version.

I know. The problem was that I thought there must be some way to
state that all non-ASCII data should be treated with character semantics
(something a little more forceful than use utf8). I wanted literals like
"\x{e1}" to be treated as Unicode (character semantics) automatically,
since they are non-ASCII. (This works for \x{161}, but that is > 255, so
there's no doubt it needs character semantics.)

Without going into boring details, my situation is as follows: in my
program, the user provides an arbitrary Perl expression, which I parse
using Text::Balanced. The expression is expected to result in an ASCII or
UTF-8 string (or maybe some other Perl object). Due to reported (and
already fixed) bugs in substr in Perl <= 5.8.3, this module fails to handle
UTF-8 code correctly, so the users are forced to write ASCII code. To insert
literal UTF-8 data into ASCII code, the user has to use \x{...}. After I
evaluate the expression, I pass the result to an XS module, which is UTF-8
aware but treats non-UTF8-flagged non-ASCII strings in a specific way. On
the other hand, having a blood-signed treaty with the user on my desk :), I
know that when he says "\x{e1}", he means characters, not bytes. But since
"\x{e1}" evaluates to a non-ASCII, non-UTF8-flagged string, the module
behaves incorrectly. So, in order to resolve it, I have to manually force an
upgrade at all entry points to the library (hundreds). Another solution
would be to remove the "special treatment" of non-UTF8 non-ASCII data from
the XS module (being one of the developers, I could try to establish that),
but unfortunately, lots of users rely on that behavior.
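
One way to avoid touching hundreds of entry points by hand might be to wrap them in a loop. The following is a rough sketch under stated assumptions, not the actual module: `My::XML::Stub`, `set_text` and `wrap_with_upgrade` are hypothetical names, and the stub merely reports how it would treat its argument based on the UTF8 flag.

```perl
use strict;
use warnings;

# Stand-in for the XS module (hypothetical): it only reports how it would
# treat its argument, based on the UTF8 flag.
package My::XML::Stub;

sub set_text {
    my ($text) = @_;
    return utf8::is_utf8($text) ? 'characters' : 'document-encoding bytes';
}

package main;

# Replace each named function with a wrapper that upgrades copies of all
# plain (defined, non-reference) arguments before calling the original.
sub wrap_with_upgrade {
    my ($package, @names) = @_;
    no strict 'refs';
    no warnings 'redefine';
    for my $name (@names) {
        my $orig = \&{"${package}::${name}"};
        *{"${package}::${name}"} = sub {
            my @args = @_;    # copy, so the caller's scalars stay untouched
            for my $arg (@args) {
                utf8::upgrade($arg) if defined $arg && !ref $arg;
            }
            return $orig->(@args);
        };
    }
    return;
}

my $before = My::XML::Stub::set_text("\x{e1}");
wrap_with_upgrade('My::XML::Stub', 'set_text');
my $after = My::XML::Stub::set_text("\x{e1}");

print "before: $before\nafter: $after\n";
```

Because the wrappers replace the functions for every caller in the process, this would only be appropriate where all callers agree that such strings mean characters; users who rely on the byte-oriented behavior would need the unwrapped functions.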

hope this helps

Yes it does, although the findings didn't make me any happier:-(
Thanks a lot, anyway.

Cheers,
-- Petr
 
Anno Siegel

Let me just throw in a reminder that the behavior of literals can be
overloaded. If the problem can be solved by changing the way string
literals are interpreted, this may help:

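use overload;
sub make_utf8 {
    my ($orig, $perl, $mode) = @_;
    # upgrade (not encode): keep the characters, set the UTF8 flag
    utf8::upgrade($perl) if $perl =~ /[^\x00-\x7F]/;
    return $perl;
}
# must run at compile time to affect the literals that follow
BEGIN { overload::constant(q => \&make_utf8) }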

That would enforce UTF-8 interpretation of any string literal containing a
character in the 128 - 255 range. If the code is put in a library, the call
to overload::constant() should go in the import() routine.
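
As a rough illustration of the import() variant, the sketch below inlines a hypothetical module named `ForceUTF8Literals` (not an existing CPAN module) in a single file. The BEGIN block stands in for what `use ForceUTF8Literals;` would do if the package lived in its own file.

```perl
use strict;
use warnings;

# Hypothetical pragma-like module, inlined so the sketch is a single file.
package ForceUTF8Literals;
use overload ();

sub import {
    # Register a handler for string constants in the scope being compiled.
    overload::constant(q => sub {
        my ($source, $value, $context) = @_;
        utf8::upgrade($value) if $value =~ /[^\x00-\x7F]/;
        return $value;
    });
    return;
}

package main;

my $early = "\x{e1}";    # compiled before the handler is installed

# In a real program this would be "use ForceUTF8Literals;".
BEGIN { ForceUTF8Literals->import }

my $late = "\x{e1}";     # compiled with the handler active

my $early_flag = utf8::is_utf8($early) ? 1 : 0;
my $late_flag  = utf8::is_utf8($late)  ? 1 : 0;
print "early flagged: $early_flag, late flagged: $late_flag\n";
```

The handler only affects literals compiled in the lexical scope where import() ran, so code elsewhere keeps the default behavior.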

Then again, I may be entirely on the wrong track...

Anno
 
Petr Pajas

This looks promising. Thanks a lot,

-- Petr
 
