Why is utf8::upgrade needed?

Discussion in 'Perl Misc' started by Petr Pajas, Jul 10, 2004.

  1. Petr Pajas


    Hi,
    I'm using Perl 5.8.3 and want it to be 100% UTF-8. However, I'm having
    trouble with Latin-1 characters in strings: they seem to remain
    byte-encoded unless I explicitly call utf8::upgrade, which is very
    annoying.

    In the example below, \x{e1} is Latin-1 small a-acute and \x{168} is
    a non-Latin-1 character (capital U with tilde). The code shows that
    \x{e1} stays non-UTF8 until it either meets a non-Latin-1 character
    or utf8::upgrade is called on it. Can anyone explain why (and possibly
    how to avoid that)?

    $ perl -e '
    use utf8;
    use Devel::Peek;
    $a="\x{e1}";
    $b="\x{e1}\x{168}";
    Dump($a);
    Dump($b);
    utf8::upgrade($a);
    Dump($a)'

    SV = PV(0x8150000) at 0x816a488
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x8163af8 "\341"\0
    CUR = 1
    LEN = 2
    SV = PV(0x8150090) at 0x816a4c4
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x8162530 "\303\241\305\250"\0 [UTF8 "\x{e1}\x{168}"]
    CUR = 4
    LEN = 5
    SV = PV(0x8150000) at 0x816a488
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x81701a8 "\303\241"\0 [UTF8 "\x{e1}"]
    CUR = 2
    LEN = 3

    Thanks,

    -- Petr
    Petr Pajas, Jul 10, 2004
    #1

  2. Also sprach Petr Pajas:

    > Hi,
    > I'm using Perl 5.8.3 and want it to be 100% UTF-8. I'm however having
    > troubles with latin-1 characters in strings, since they seem to remain
    > byte encoded, unless I explicitly call utf8::upgrade, which is very
    > annoying.
    >
    > In the example below, \x{e1} is latin1 small aacute,
    > \x{168} is non-latin1 Scaron. The code shows, that \x{e1}
    > remains non-UTF8 as long as it meets a non-latin1 character, or
    > utf8::upgrade is called.


    As long as the numerical value of each character in the string fits into
    one byte, actually. Latin1 is such a one-byte encoding and so perl will
    not yet utf8ify the string.
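    A minimal sketch of that behaviour, using only the core utf8::
    functions (the variable names are illustrative). It suggests that the
    upgrade changes the internal storage and the flag, not the string as
    Perl code sees it:

```perl
use strict;
use warnings;

# Two strings holding the same single character, U+00E1.
my $native   = "\x{e1}";      # fits in one byte: UTF8 flag stays off
my $upgraded = "\x{e1}";
utf8::upgrade($upgraded);     # same character, now stored as UTF-8 with the flag on

printf "native:   flag=%d length=%d\n",
    ( utf8::is_utf8($native)   ? 1 : 0 ), length $native;
printf "upgraded: flag=%d length=%d\n",
    ( utf8::is_utf8($upgraded) ? 1 : 0 ), length $upgraded;

# Perl compares characters, not internal bytes, so the two are equal.
print "equal as strings\n" if $native eq $upgraded;
```

    Both strings report a length of 1 and compare equal; only the flag
    differs, which is exactly what code peeking at the internals will
    notice.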

    >Can anyone explain why (and possibly how to avoid that)?


    Turn that around. Why do you want everything to be unicode? In all but
    the most pathological cases you can trust perl to do the right thing
    with your strings, upgrading when necessary etc.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Jul 10, 2004
    #2

  3. Petr Pajas


    Tassilo v. Parseval wrote:

    > Also sprach Petr Pajas:
    >
    >> Hi,
    >> I'm using Perl 5.8.3 and want it to be 100% UTF-8. I'm however having
    >> troubles with latin-1 characters in strings, since they seem to remain
    >> byte encoded, unless I explicitly call utf8::upgrade, which is very
    >> annoying.
    >>
    >> In the example below, \x{e1} is latin1 small aacute,
    >> \x{168} is non-latin1 Scaron. The code shows, that \x{e1}
    >> remains non-UTF8 as long as it meets a non-latin1 character, or
    >> utf8::upgrade is called.

    >
    > As long as the numerical value of each character in the string fits into
    > one byte, actually. Latin1 is such a one-byte encoding and so perl will
    > not yet utf8ify the string.
    >
    >>Can anyone explain why (and possibly how to avoid that)?

    >
    > Turn that around. Why do you want everything to be unicode? In all but
    > the most pathological cases you can trust perl to do the right thing
    > with your strings, upgrading when necessary etc.
    >
    > Tassilo


    Well, I'm passing the strings to some XS module for XML.
    If this module finds UTF8 flag on the string, it knows what to do.
    If not, it assumes I'm passing it a string in the encoding of the
    XML document (not necessarily Latin1) and that causes problems,
    since "\x{e1}" isn't UTF8 flagged and while Perl keeps it Latin1,
    the XML module may interpret it quite differently. So I have to do
    utf8::upgrade to make sure the string gets converted to utf8 and is
    UTF8 flagged.

    -- Petr
    Petr Pajas, Jul 10, 2004
    #3
  4. On Sat, 10 Jul 2004, Petr Pajas wrote:

    > \x{168} is non-latin1 Scaron. The code shows, that \x{e1}
    > remains non-UTF8 as long as it meets a non-latin1 character, or
    > utf8::upgrade is called. Can anyone explain why (and possibly
    > how to avoid that)?


    To try to answer the question "why", the documentation explains this
    in terms of transparent compatibility with older 8-bit handling.

    http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Byte-and-Character-Semantics

    For how to deal with that in practice,

    http://www.perldoc.com/perl5.8.4/po...Unicode-in-Perl-(Or-Unforcing-Unicode-in-Perl)

    (and the following heading) seem to be particularly relevant.

    Maybe I misunderstood what you were saying, but you can't just mark an
    iso-8859-1 string as utf8; it's necessary to cause Perl to genuinely
    create the utf8 version from the 8-bit-coded version. As I understand
    it, once the utf8 version has been created it won't be quietly
    destroyed; so if a character > 255 is appended to a string (causing
    upgrade to utf8) and then taken off again, the string will still be
    held in utf8 form, unless one explicitly down-converts it. I'd
    suggest

    http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Interaction-with-Extensions

    in relation to your specific interest.
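    The "won't be quietly destroyed" point can be checked with a short
    sketch like the following (variable names are illustrative; only core
    utf8:: functions are used):

```perl
use strict;
use warnings;

my $s = "\x{e1}";                      # Latin-1 range only: UTF8 flag off
my $before = utf8::is_utf8($s) ? 1 : 0;

$s .= "\x{168}";                       # a character > 255 forces an upgrade
chop $s;                               # take it off again; $s is "\x{e1}" once more
my $after = utf8::is_utf8($s) ? 1 : 0;

# Explicit down-conversion; the true second argument makes downgrade
# return false instead of dying if the string cannot be represented.
my $ok = utf8::downgrade( $s, 1 );
my $final = utf8::is_utf8($s) ? 1 : 0;

print "before=$before after=$after downgraded=$final\n";
```

    This prints before=0 after=1 downgraded=0: the string stays in utf8
    form after the wide character is removed, until it is explicitly
    downgraded.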

    hope this helps
    Alan J. Flavell, Jul 10, 2004
    #4
  5. Also sprach Petr Pajas:

    > Tassilo v. Parseval wrote:
    >
    >> Also sprach Petr Pajas:
    >>
    >>> Hi,
    >>> I'm using Perl 5.8.3 and want it to be 100% UTF-8. I'm however having
    >>> troubles with latin-1 characters in strings, since they seem to remain
    >>> byte encoded, unless I explicitly call utf8::upgrade, which is very
    >>> annoying.
    >>>
    >>> In the example below, \x{e1} is latin1 small aacute,
    >>> \x{168} is non-latin1 Scaron. The code shows, that \x{e1}
    >>> remains non-UTF8 as long as it meets a non-latin1 character, or
    >>> utf8::upgrade is called.

    >>
    >> As long as the numerical value of each character in the string fits into
    >> one byte, actually. Latin1 is such a one-byte encoding and so perl will
    >> not yet utf8ify the string.
    >>
    >>>Can anyone explain why (and possibly how to avoid that)?

    >>
    >> Turn that around. Why do you want everything to be unicode? In all but
    >> the most pathological cases you can trust perl to do the right thing
    >> with your strings, upgrading when necessary etc.


    > Well, I'm passing the strings to some XS module for XML.
    > If this module finds UTF8 flag on the string, it knows what to do.
    > If not, it assumes I'm passing it a string in the encoding of the
    > XML document (not necessarily Latin1) and that causes problems,
    > since "\x{e1}" isn't UTF8 flagged and while Perl keeps it Latin1,
    > the XML module may interpret it quite differently. So I have to do
    > utf8::upgrade to make sure the string gets converted to utf8 and is
    > UTF8 flagged.


    Ah, that's indeed a legitimate reason. This module you're talking about,
    is that under your control? In this case, you could have the module do a
    sv_utf8_upgrade() on its arguments which might already be enough to make
    it all work.

    Otherwise, maybe contacting the author would be in order.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Jul 10, 2004
    #5
  6. Petr Pajas


    Alan J. Flavell wrote:

    > On Sat, 10 Jul 2004, Petr Pajas wrote:
    >
    >> \x{168} is non-latin1 Scaron. The code shows, that \x{e1}
    >> remains non-UTF8 as long as it meets a non-latin1 character, or
    >> utf8::upgrade is called. Can anyone explain why (and possibly
    >> how to avoid that)?

    >
    > To try to answer the question "why", the documentation explains this
    > in terms of transparent compatibility with older 8-bit handling.
    >
    > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Byte-and-Character-Semantics

    I see, the answer seems to be here:
    "For operations where this determination cannot be made without additional
    information from the user, Perl decides in favor of compatibility and
    chooses to use byte semantics.
    ....

    "Such data may come from filehandles, from calls to external programs, from
    information provided by the system (such as %ENV), or from literals and
    constants in the source text."

    "\x{e1}" is a literal, right? And since Perl can't decide between
    bytes and characters for it, I have to upgrade it myself.

    >
    > For how to deal with that in practice,
    >
    > http://www.perldoc.com/perl5.8.4/po...Unicode-in-Perl-(Or-Unforcing-Unicode-in-Perl)
    >
    > (and the following heading) seem to be particularly relevant.
    >
    > Maybe I misunderstood what you were saying, but you can't just mark an
    > iso-8859-1 string as utf8; it's necessary to cause Perl to genuinely
    > create the utf8 version from the 8-bit-coded version.


    I know. The problem was that I thought there must be some way to
    state that all non-ASCII data should be treated using character
    semantics (something a little more forceful than use utf8). I wanted
    literals like "\x{e1}" to be treated as Unicode (character semantics)
    automatically, since they are non-ASCII (this works for \x{161}, but
    that one is > 255, so there's no doubt it has character semantics).

    Without going into boring details, my situation is as follows: in my
    program, the user provides an arbitrary Perl expression, which I parse
    using Text::Balanced. The expression is expected to result in an ASCII
    or UTF-8 string (or maybe some other Perl object). Due to reported (and
    already fixed) bugs in substr in Perl <= 5.8.3, this module fails to
    handle UTF-8 code correctly, so the users are forced to write ASCII
    code. To insert literal UTF-8 data into ASCII code, the user has to use
    \x{...}. After I evaluate the expression, I pass the result to an XS
    module, which is UTF-8 aware but treats non-UTF8-flagged non-ASCII
    strings in a specific way. On the other hand, having a blood-signed
    treaty with the user on my desk :), I know that when he says "\x{e1}",
    he means characters, not bytes. But since "\x{e1}" evaluates to a
    non-ASCII, non-UTF8-flagged string, the module behaves incorrectly. So,
    to resolve it, I have to force an upgrade manually at all entry points
    to the library (hundreds). Another solution would be to remove the
    "special treatment" of non-UTF8 non-ASCII data from the XS module
    (being one of the developers, I could try to arrange that), but
    unfortunately lots of users rely on that behavior.
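    One way to avoid touching every entry point by hand might be a generic
    wrapper, sketched below. Everything named here is hypothetical:
    xs_entry_point stands in for a real XS function, and upgrading() is an
    illustrative helper, not part of any module. It would only be safe on
    paths where the "characters, not bytes" treaty holds, since users who
    rely on the special treatment of unflagged strings must not be
    wrapped.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an XS entry point: it only reports whether
# its first argument arrived with the UTF8 flag set.
sub xs_entry_point {
    return utf8::is_utf8( $_[0] ) ? 'flagged' : 'unflagged';
}

# Wrap a code ref so that plain string arguments are upgraded first.
sub upgrading {
    my ($code) = @_;
    return sub {
        my @args = @_;                  # copy, so the caller's data is untouched
        for my $arg (@args) {
            next if !defined $arg || ref $arg;   # skip undef and references
            utf8::upgrade($arg);
        }
        return $code->(@args);
    };
}

my $wrapped = upgrading( \&xs_entry_point );

my $direct  = xs_entry_point("\x{e1}");
my $via     = $wrapped->("\x{e1}");
print "direct call: $direct\n";
print "wrapped call: $via\n";
```

    The direct call sees an unflagged string and the wrapped call a
    flagged one, while the caller's own variables keep whatever
    representation they had.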

    > As I understand
    > it, once the utf8 version has been created it won't be quietly
    > destroyed; so if a character > 255 is appended to a string (causing
    > upgrade to utf8) and then taken off again, the string will still be
    > held in utf8 form, unless one explicitly down-converts it. I'd
    > suggest
    >
    > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Interaction-with-Extensions
    >
    > in relation to your specific interest.
    >
    > hope this helps


    Yes it does, although the findings didn't make me any happier:-(
    Thanks a lot, anyway.

    Cheers,
    -- Petr
    Petr Pajas, Jul 10, 2004
    #6
  7. Anno Siegel

    Petr Pajas <> wrote in comp.lang.perl.misc:

    [...]

    > Without going into boring details, my situation is as follows: in my
    > program, the user provides arbitrary Perl expression which I parse using
    > Text::Balanced. The expression is expected to result in a ascii or UTF8
    > string (or maybe some other perl object). Due to a reported (and already
    > fixed) bugs in substr of Perl<=5.8.3, this module fails to handle utf8 code
    > correctly, so the users are forced to use ASCII code. To insert literal
    > utf8 data into ascii code, the user has to use \x{...}. After I evaluate
    > the expression, I'm passing it to a XS module, which is utf8 aware, but
    > treats non-utf8-flagged non-ascii strings in a specific way. On the other
    > hand, having a blood-signed treaty with the user on my desk:), I know that
    > when he says "\x{e1}", he means characters, not bytes. But, since "\x{e1}"
    > evaluates as to a non-ascii non-UTF8-flagged string, the modules behaves
    > incorrectly. So, in order to resolve it, I have to manually force upgrade
    > at all entry points to the library (hundreds). Other solution would be to
    > remove the "special treatment" of non-utf8 non-ascii data from the XS
    > module (being one of the developers I could try to establish that), but
    > unfortunately, lots of users rely on that behavior.


    Let me just throw in a reminder that the behavior of literals can be
    overloaded. If the problem can be solved by changing the way string
    literals are interpreted, this may help:

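    use overload;
    sub make_utf8 {
        my ( $orig, $perl, $mode ) = @_;
        # upgrade, not encode: keeps the characters and sets the UTF8 flag
        utf8::upgrade( $perl ) if $perl =~ /[\x80-\xff]/;
        return $perl;
    }
    # constant handlers take effect at compile time, hence the BEGIN block
    BEGIN { overload::constant( q => \&make_utf8 ) }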

    That would enforce utf8 interpretation of any string containing a character
    in the 128 - 255 range. If the code is put in a library, the call to
    overload::constant() should go in the import() routine.
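    A rough sketch of the library variant follows. The package name
    Literal::Upgrade is made up for illustration, and the handler uses
    utf8::upgrade on the assumption that the goal is a UTF8-flagged
    string with unchanged characters. In a real library the package would
    sit in its own .pm file and be loaded with use; here the BEGIN block
    imitates that inside a single file.

```perl
use strict;
use warnings;

# Hypothetical package name, for illustration only.
package Literal::Upgrade;
use overload ();

sub import {
    # Installed for the scope that is being compiled when import() runs.
    overload::constant(
        q => sub {
            my ( $orig, $perl, $mode ) = @_;
            # Upgrade literals containing a character in the 128-255 range.
            utf8::upgrade($perl) if $perl =~ /[\x80-\xff]/;
            return $perl;
        }
    );
    return;
}

package main;

# Stands in for "use Literal::Upgrade;" when the package is in the same file.
BEGIN { Literal::Upgrade->import }

my $s = "\x{e1}";
print utf8::is_utf8($s) ? "flagged\n" : "unflagged\n";
```

    Because the handler is attached to the compilation scope, only
    literals compiled after the import (and in the same lexical scope)
    are affected; strings coming from files or other modules are not.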

    Then again, I may be entirely on the wrong track...

    Anno
    Anno Siegel, Jul 10, 2004
    #7
  8. Petr Pajas


    Anno Siegel wrote:

    > Petr Pajas <> wrote in comp.lang.perl.misc:
    >
    > [...]
    >
    >> Without going into boring details, my situation is as follows: in my
    >> program, the user provides arbitrary Perl expression which I parse using
    >> Text::Balanced. The expression is expected to result in a ascii or UTF8
    >> string (or maybe some other perl object). Due to a reported (and already
    >> fixed) bugs in substr of Perl<=5.8.3, this module fails to handle utf8
    >> code correctly, so the users are forced to use ASCII code. To insert
    >> literal utf8 data into ascii code, the user has to use \x{...}. After I
    >> evaluate the expression, I'm passing it to a XS module, which is utf8
    >> aware, but treats non-utf8-flagged non-ascii strings in a specific way.
    >> On the other hand, having a blood-signed treaty with the user on my
    >> desk:), I know that when he says "\x{e1}", he means characters, not
    >> bytes. But, since "\x{e1}" evaluates as to a non-ascii non-UTF8-flagged
    >> string, the modules behaves incorrectly. So, in order to resolve it, I
    >> have to manually force upgrade at all entry points to the library
    >> (hundreds). Other solution would be to remove the "special treatment" of
    >> non-utf8 non-ascii data from the XS module (being one of the developers I
    >> could try to establish that), but unfortunately, lots of users rely on
    >> that behavior.

    >
    > Let me just throw in a reminder that the behavior of literals can be
    > overloaded. If the problem can be solved by changing the way string
    > literals are interpreted, this may help:
    >
    > use overload;
    > overload::constant( q => \ &make_utf8);
    > sub make_utf8 {
    > my ( $orig, $perl, $mode) = @_;
    > utf8::encode( $perl) if grep ord() >= 128, split //, $perl;
    > $perl;
    > }
    >
    > That would enforce utf8 interpretation of any string containing a
    > character in the 128 - 255 range. If the code is put in a library,
    > the call to overload::constant() should go in the import() routine.
    >
    > Then again, I may be entirely on the wrong track...


    This looks promising. Thanks a lot,

    -- Petr
    Petr Pajas, Jul 11, 2004
    #8