Does unpack() support higher-order Unicode strings for hex conversion?

Discussion in 'Perl Misc' started by fhscobey, Nov 3, 2005.

  1. fhscobey

    fhscobey Guest

    I've been playing around with the unpack() function to create literal
    wide-hex strings in the form of '\x{}', to represent UTF-8 strings in
    an application I support. I'm basically following the advice outlined
    in the following perldoc:

    But, one thing I noticed is that for higher order values (>0xFF I
    guess), unpack does not return the proper hex representation for the
    Unicode code point for the character provided. Here is an example.

    Example 1:
    I have a flat file called 'utf8.string', which contains the following
    Polish string "Pozostałe" (which means "Other" in English). If I run
    the following, I get the output you see below:

    $ cat utf8.string | perl -e 'binmode(STDIN,":utf8");
    binmode(STDOUT,":utf8"); while(<STDIN>){$line=$_; chomp($line);
    @raw_chars=split(//,$line); foreach $ch
    (@raw_chars){$unpacked_char=unpack("H*",$ch);
    push(@unpacked_chars,$unpacked_char);} foreach $ch
    (@unpacked_chars){print("unpacked char = " . $ch,"\n");}}'
    unpacked char = 50
    unpacked char = 6f
    unpacked char = 7a
    unpacked char = 6f
    unpacked char = 73
    unpacked char = 74
    unpacked char = 61
    unpacked char = c582
    unpacked char = 65

    Notice that all hex values for all chars look OK, except for the second
    to last. The 'ł' character is getting converted to 0xc582, which is
    incorrect. I know from referencing the Unicode documentation at:
    .... the correct code point is 0x142.

    Is this what is supposed to happen? I haven't seen anything in the
    documentation that says the 'H' template for unpack cannot be used for
    higher-order unicode characters. Did I miss something?

    I can work around this by using the "U" template to unpack the chars,
    and then putting the decimal values through sprintf("%X", $dec_value)
    to get the correct code point hex value, but I was under the
    impression that unpack() was supposed to be able to do that by itself.

    Here is a sample of how I get the correct hex value:

    $ cat test_utf8_string.3.utf8 | perl -e 'binmode(STDIN,":utf8");
    binmode(STDOUT,":utf8"); while(<STDIN>){$line=$_; chomp($line);
    @unpacked_chars=unpack("U*",$line); foreach $ch
    (@unpacked_chars){print("unpacked char decimal = " . $ch, " / converted
    to hex = " . sprintf("%X",$ch),"\n");}}'
    unpacked char decimal = 80 / converted to hex = 50
    unpacked char decimal = 111 / converted to hex = 6F
    unpacked char decimal = 122 / converted to hex = 7A
    unpacked char decimal = 111 / converted to hex = 6F
    unpacked char decimal = 115 / converted to hex = 73
    unpacked char decimal = 116 / converted to hex = 74
    unpacked char decimal = 97 / converted to hex = 61
    unpacked char decimal = 322 / converted to hex = 142
    unpacked char decimal = 101 / converted to hex = 65

    Just wondering whether I'm using unpack() incorrectly, or whether I'm
    wrong to expect it to handle higher-order Unicode characters when
    converting to hex format.

    I'm on RedHat Linux 7.2, Perl 5.8.1.

    Thanks for any assistance you can offer.
    - Jeff
    fhscobey, Nov 3, 2005