Thank you both for the quick response. I followed your examples and
got the conversion working.
Qianqian
One last question. If you can't decode the URL with that module
(because of the embedded 4-byte UTF-8), what are you using?
-sln
-----------------------------------
use strict;
use warnings;
use Encode;
binmode STDOUT, ':utf8';
my $urlchrs = qr/[^\w.!~*'()-]/; # anything that is NOT a URL unreserved char (just a guess)
my $url_encoded_sample = '
http://target/getdata.php?data=<script src="http://
www.badplace.com%2fnasty%F0%A1%BF%AA.js%22%3e%3c%2fscript%3e';
# 4-byte UTF-8:         ^^^^^^^^^^^^
# test prints of URL encoded sample ...
print "\nSample:\n$url_encoded_sample\n",'-'x20,"\n";
print "\n", decodeURL( $url_encoded_sample ),"\n";
print "\n", encodeURL( decodeURL( $url_encoded_sample ) ),"\n";
print "\n", decodeURL( encodeURL( decodeURL( $url_encoded_sample ) ) ),"\n\n";
# decode the octets as UTF-8. code points > 255 are shown as generated \x{...} hex strings (useless) ...
for (split //, decode ('UTF-8', decodeURL( $url_encoded_sample ) )) {
    my $uchar = ord;
    if ($uchar > 255) {
        printf ("\\x{%X}", $uchar);
    } else {
        print;
    }
}
print "\n";
exit 0;
# subs ...
sub decodeURL {
    my $bytes = shift;
    # turn each %XX escape into a single octet ('C' = unsigned char)
    $bytes =~ s/%([0-9a-fA-F]{2})/ pack ("C", hex($1)) /ge;
    return $bytes;
}
sub encodeURL {
    my $text = shift;
    # percent-encode every char matched by $urlchrs
    $text =~ s/($urlchrs)/ sprintf ("%%%02X", ord($1)) /ge;
    return $text;
}
__END__
Output:
Sample:
http://target/getdata.php?data=<script src="http://
www.badplace.com%2fnasty%F0%A1%BF%AA.js%22%3e%3c%2fscript%3e
--------------------
http://target/getdata.php?data=<script src="http://
www.badplace.com/nasty+¦-í-+-¬.js"></script>
http%3A%2F%2Ftarget%2Fgetdata.php%3Fdata%3D%3Cscript%20src%3D%22http%3A%2F%2F%0A
www.badplace.com%2Fnasty%F0%A1%BF%AA.js%22%3E%3C%2Fscript%3E
http://target/getdata.php?data=<script src="http://
www.badplace.com/nasty+¦-í-+-¬.js"></script>
http://target/getdata.php?data=<script src="http://
www.badplace.com/nasty\x{21FEA}.js"></script>
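
The garbage after "nasty" in the second and fourth results is there because decodeURL returns raw octets, and the ':utf8' layer on STDOUT then encodes each octet again as if it were a Latin-1 character. A minimal sketch of one way around it, assuming the percent-decoded octets really are UTF-8 (decode_url_text is a made-up helper name, not part of the script above, and the FB_CROAK check is my choice, not a requirement):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: percent-decode to octets, then decode the octets as UTF-8.
sub decode_url_text {
    my $octets = shift;
    # each %XX escape becomes one octet
    $octets =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of substituting U+FFFD
    return decode('UTF-8', $octets, Encode::FB_CROAK);
}

my $text = decode_url_text('nasty%F0%A1%BF%AA.js');

# the four octets F0 A1 BF AA collapse into one character
printf "U+%04X\n", ord(substr($text, 5, 1));   # prints U+21FEA

# character strings can now go through an encoding layer exactly once
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

The result is a character string, so it should only be printed through an encoding layer, and it has to be encoded back to octets before being fed to something like encodeURL, which works on ord values below 256.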