CGI query string encoding issue...


howa

Hello, consider my simple cgi program below:

#=======
#!/usr/bin/perl
use strict;
use warnings;

use CGI;
my $q = CGI->new;
my $s = $q->param("s");
print $q->header( -type => "text/html" );

# Expected to print 1 only when $s is valid UTF-8
print utf8::valid($s);
#=======

Then I call, e.g.

http://www.example.com/cgi-bin/test.cgi?s=abc (prints 1, OK)
http://www.example.com/cgi-bin/test.cgi?s=$BCfJ8(B (also prints 1, but my
parameter s is in BIG5 Traditional Chinese encoding, not UTF-8!)

So now I am really confused by the encoding stuff... Can anyone
modify my program above so that it does not print 1 if $s contains
non-UTF-8 characters?

Thanks.
 
howa

Hi,

Not really, because there's no way to look at an arbitrary bit string and
know that it's BIG5, or utf8, or whatever. You said parameter s is a BIG5
string in the second case--but it's also a utf8 string: "^[$BCfJ8^[(B".
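
The ambiguity described here can be made concrete. The following is only a sketch: the helper name encodings_accepting and the list of candidate encodings are assumptions chosen for illustration. It tries a strict decode under each candidate and reports which ones accept the bytes.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: return the candidate encodings that accept $bytes.
sub encodings_accepting {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $name (@candidates) {
        my $copy = $bytes;    # decode() may modify its input when CHECK is set
        push @ok, $name if eval { decode($name, $copy, FB_CROAK); 1 };
    }
    return @ok;
}

my @candidates = ('UTF-8', 'Big5', 'ISO-8859-1');

# Plain ASCII is acceptable to every candidate.
print join(', ', encodings_accepting("abc", @candidates)), "\n";

# The Big5 bytes A4 A4 are rejected as UTF-8 but accepted by the other two.
print join(', ', encodings_accepting("\xA4\xA4", @candidates)), "\n";
```

More than one encoding usually accepts a given byte string, so a check like this can rule encodings out but cannot tell you which one the sender meant.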


Here is a simpler example, using percent-encoded strings, for the URL:
http://www.example.com/cgi-bin/test.cgi?s=$BCf(B

http://www.example.com/cgi-bin/test.cgi?s=%A4%A4 (BIG-5, 0xa440 to
0xc67e, http://en.wikipedia.org/wiki/Big5)
http://www.example.com/cgi-bin/test.cgi?s=中 (UTF-8, see
variable $valid_utf8_regexp at http://cpansearch.perl.org/src/MARKF/Test-utf8-0.02/lib/Test/utf8.pm)


As you can see, the BIG5 character %A4%A4 is definitely not valid UTF-8,
but utf8::valid() still returns 1.
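
That result is expected: utf8::valid() only reports whether the scalar's internal state is consistent, and a plain byte string always is. A minimal sketch of a stricter check follows. It assumes the parameter arrives as raw bytes (which is what CGI.pm hands back unless told otherwise), and the helper name is_wellformed_utf8 is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 1 only if $bytes is a well-formed UTF-8
# byte sequence, 0 otherwise.
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    my $copy = $bytes;    # decode() may modify its input when CHECK is set
    # A strict decode dies on malformed input; eval turns that into false.
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

print is_wellformed_utf8("abc"),          "\n";    # ASCII
print is_wellformed_utf8("\xA4\xA4"),     "\n";    # Big5 bytes from %A4%A4
print is_wellformed_utf8("\xE4\xB8\xAD"), "\n";    # UTF-8 bytes from %E4%B8%AD
```

In the CGI program above, printing is_wellformed_utf8($s) instead of utf8::valid($s) would give 0 for s=%A4%A4. Note that passing this check only means the bytes could be UTF-8, not that the sender intended them to be.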
 
Gunnar Hjalmarsson

howa said:
Hello, consider my simple cgi program below:

#=======
#!/usr/bin/perl
use strict;

use CGI;
my $q = new CGI;
my $s = $q->param("s");
print $q->header( -type => "text/html" );

print utf8::valid ($s);
#=======

Then I call, e.g.

http://www.example.com/cgi-bin/test.cgi?s=abc (prints 1, OK)
http://www.example.com/cgi-bin/test.cgi?s=$BCfJ8(B (also prints 1, but my
parameter s is in BIG5 Traditional Chinese encoding, not UTF-8!)

I'm not sure about the meaning of utf8::valid(), but the docs
recommend the use of utf8::is_utf8().

Does the below code make sense to you?

$ cat test.pl
use strict;
use warnings;
use Encode;

# Percent-decode the URI-escaped value into raw Big5 bytes.
my $big5_uriencoded = '%A4%A4';
( my $big5_bytes = $big5_uriencoded ) =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/eg;
print '$big5_bytes ', utf8::is_utf8($big5_bytes) ? 'is' : 'is not',
  " in UTF-8 internally.\n";

# Decode the Big5 bytes into a Perl character string.
my $string = decode('Big5', $big5_bytes);
print '$string ', utf8::is_utf8($string) ? 'is' : 'is not',
  " in UTF-8 internally.\n\n";

$ perl test.pl
$big5_bytes is not in UTF-8 internally.
$string is in UTF-8 internally.

I believe it tells us that $big5_bytes is still a raw byte string, so it
can't be treated as UTF-8 directly, while $string has been decoded into
Perl's internal character representation and can then be encoded to UTF-8.
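
The whole round trip can be sketched as follows, assuming the sender really did use Big5 (the program has to know that from somewhere else; it can't be read off the bytes). The helper name big5_uri_to_utf8_bytes is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: percent-encoded Big5 value in, UTF-8 bytes out.
sub big5_uri_to_utf8_bytes {
    my ($uri_encoded) = @_;
    # Percent-decode into raw bytes.
    ( my $bytes = $uri_encoded ) =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/eg;
    my $chars = decode('Big5', $bytes);    # Big5 bytes -> characters
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

my $utf8 = big5_uri_to_utf8_bytes('%A4%A4');

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";    # E4 B8 AD
```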
 
Eric Pozharski

On 2009-03-05, Gunnar Hjalmarsson said:
I'm not sure about the meaning of utf8::valid(), but the docs
recommend the use of utf8::is_utf8().

Those two just perform different tests (or are supposed to). But see
below; there are some "smart defaults" along the way:

perl -wle '
#use encoding 'utf8';
@x = ( qq|\x{DF}\x{0100}|, q|a|, qq|\x{DF}|, qq|\x{0100}| );
foreach my $y (@x) {
    printf qq|valid (%i) - is (%i) - |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    utf8::encode($y);
    printf qq|valid (%i) + is (%i) + |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    utf8::decode($y);
    printf qq|valid (%i) / is (%i) / |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
}
'
Wide character in print at -e line 6.
valid (1) - is (1) - ßĀ
valid (1) + is (0) + ßĀ
Wide character in print at -e line 12.
valid (1) / is (1) / ßĀ
valid (1) - is (0) - a
valid (1) + is (0) + a
valid (1) / is (0) / a
valid (1) - is (0) - �
valid (1) + is (0) + ß
valid (1) / is (1) / �
Wide character in print at -e line 6.
valid (1) - is (1) - Ā
valid (1) + is (0) + Ā
Wide character in print at -e line 12.
valid (1) / is (1) / Ā

While with C<use encoding> uncommented (output only):

valid (1) - is (1) - ßĀ
valid (1) + is (0) + ÃÄ
valid (1) / is (1) / ßĀ
valid (1) - is (1) - a
valid (1) + is (0) + a
valid (1) / is (0) / a
valid (1) - is (1) - �
valid (1) + is (0) + �
valid (1) / is (1) / �
valid (1) - is (1) - Ā
valid (1) + is (0) + Ä
valid (1) / is (1) / Ā
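
A smaller sketch of the same distinction, not taken from the run above: utf8::is_utf8() reports how Perl stores a string internally, not what the string contains. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $bytes    = "\xDF";     # one character, stored as a single byte
my $upgraded = $bytes;
utf8::upgrade($upgraded);  # same character, now stored internally as UTF-8

printf "flag before: %d, flag after: %d, same string: %d, both valid: %d\n",
    utf8::is_utf8($bytes)    ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0,
    $bytes eq $upgraded      ? 1 : 0,
    ( utf8::valid($bytes) && utf8::valid($upgraded) ) ? 1 : 0;
```

The two scalars compare equal even though only one carries the flag, and utf8::valid() is true for both, which is why neither function answers the question of what encoding a query parameter was sent in.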

*CUT*

p.s. I'm not sure how all of that will come out of slrn.

p.p.s. Would some kind perlist look at the B<utf8::valid> code, please?
 
