CGI query string encoding issue...

howa · Mar 4, 2009

Hello, consider my simple cgi program below:

#=======
#!/usr/bin/perl
use strict;

use CGI;
my $q = new CGI;
my $s = $q->param("s");
print $q->header( -type => "text/html" );

print utf8::valid ($s);
#=======

Then I call, e.g.

http://www.example.com/cgi-bin/test.cgi?s=abc (prints 1, ok)
http://www.example.com/cgi-bin/test.cgi?s=中文 (also prints 1, but my
parameter s is in BIG5 traditional Chinese encoding, not utf8!)

So now I am really confused by the encoding stuff... Can anyone
modify my program above so that it does not print 1 if my $s contains
non-UTF8 characters?

Thanks.

howa · Mar 5, 2009

Hi,

Not really, because there's no way to look at an arbitrary bit string and
know that it's BIG5, or utf8, or whatever. You said parameter s is a BIG5
string in the second case--but it's also a utf8 string: "^[$BCfJ8^[(B"
(the ISO-2022-JP escape sequence for 中文, which is plain ASCII).
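
The ambiguity described above can be checked directly. This is only a minimal sketch, assuming the core Encode module is available; the helper name encodings_accepting is hypothetical and not part of any library. It tries a strict decode under each candidate encoding and reports which ones accept the bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the candidate encodings that accept $bytes.
sub encodings_accepting {
    my ($bytes, @candidates) = @_;
    my @accepted;
    for my $name (@candidates) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its input
        my $ok = eval { decode($name, $copy, Encode::FB_CROAK); 1 };
        push @accepted, $name if $ok;
    }
    return @accepted;
}

# The escape sequence quoted in the post consists of ASCII bytes only.
my $escaped = "\e\$BCfJ8\e(B";
print join(', ', encodings_accepting($escaped, 'UTF-8', 'iso-2022-jp')), "\n";

# The bytes 0xA4 0xA4: UTF-8 rejects them, Big5 and ISO-8859-1 accept them.
my $raw = "\xA4\xA4";
print join(', ', encodings_accepting($raw, 'UTF-8', 'Big5', 'iso-8859-1')), "\n";
```

More than one encoding can accept the same bytes, so a successful decode never proves which encoding the sender actually used. It can only rule candidates out.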

A simpler example, using an encoded string, for the URL
http://www.example.com/cgi-bin/test.cgi?s=中

http://www.example.com/cgi-bin/test.cgi?s=%A4%A4 (BIG-5, range 0xa440 to
0xc67e, see http://en.wikipedia.org/wiki/Big5)
http://www.example.com/cgi-bin/test.cgi?s=%E4%B8%AD (UTF-8, see the
variable $valid_utf8_regexp at http://cpansearch.perl.org/src/MARKF/Test-utf8-0.02/lib/Test/utf8.pm)

As you can see, the BIG5 character %A4%A4 is definitely not valid UTF-8, but
utf8::valid() returns 1.
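
For what it's worth, utf8::valid() only reports whether Perl's internal representation of the string is consistent, and that is always the case for a plain byte string, so it returns 1 here. One way to test whether the bytes themselves form well-formed UTF-8 is a strict decode. The sketch below assumes the parameter arrives as raw bytes (the CGI.pm default) and that the core Encode module is available; is_valid_utf8_bytes is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1 if $bytes is a well-formed UTF-8 byte sequence, else 0.
sub is_valid_utf8_bytes {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $escaped ('abc', '%A4%A4', '%E4%B8%AD') {
    # Undo the URI escaping to get the raw bytes a CGI script would see.
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/eg;
    printf "%-10s %s\n", $escaped,
        is_valid_utf8_bytes($bytes) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

In the test program, print is_valid_utf8_bytes($s) could take the place of print utf8::valid($s). A pass means the bytes are well-formed UTF-8, not that the sender intended UTF-8.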

Gunnar Hjalmarsson · Mar 5, 2009

howa said:
Hello, consider my simple cgi program below:

#=======
#!/usr/bin/perl
use strict;

use CGI;
my $q = new CGI;
my $s = $q->param("s");
print $q->header( -type => "text/html" );

print utf8::valid ($s);
#=======

Then I call, e.g.

http://www.example.com/cgi-bin/test.cgi?s=abc (prints 1, ok)
http://www.example.com/cgi-bin/test.cgi?s=中文 (also prints 1, but my
parameter s is in BIG5 traditional Chinese encoding, not utf8!)

I'm not sure about the meaning of utf8::valid(), but the docs
recommend the use of utf8::is_utf8().

Does the below code make sense to you?

$ cat test.pl
use strict;
use warnings;
use Encode;
my $big5_uriencoded = '%A4%A4';
# Undo the URI escaping to get the raw Big5 bytes.
( my $big5_bytes = $big5_uriencoded ) =~ s/%(..)/chr(hex $1)/eg;
print '$big5_bytes ', utf8::is_utf8($big5_bytes) ? 'is' : 'is not',
" in UTF-8 internally.\n";
# Decode the Big5 bytes into a Perl character string.
my $string = decode('Big5', $big5_bytes);
print '$string ', utf8::is_utf8($string) ? 'is' : 'is not',
" in UTF-8 internally.\n\n";

$ perl test.pl
$big5_bytes is not in UTF-8 internally.
$string is in UTF-8 internally.

I believe it tells us that it's not possible to encode $big5_bytes
directly to UTF-8, while that's possible with $string.
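
To make that concrete, here is a minimal sketch of the two-step conversion, assuming the input really is Big5 and that the core Encode module is available. The helper name big5_to_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert Big5 bytes to UTF-8 bytes via a character string.
sub big5_to_utf8_bytes {
    my ($big5_bytes) = @_;
    my $chars = decode('Big5', $big5_bytes);    # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

my $utf8_bytes = big5_to_utf8_bytes("\xA4\xA4");

# Show the resulting bytes in hex, then the state of the internal flag.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_bytes), "\n";
print utf8::is_utf8($utf8_bytes) ? "flag on\n" : "flag off\n";
```

This prints E4 B8 AD followed by "flag off". The encoded result is a byte string again, so utf8::is_utf8() describes Perl's internal storage rather than the encoding of the data.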

Eric Pozharski · Mar 6, 2009

On 2009-03-05, Gunnar Hjalmarsson said:
I'm not sure about the meaning of utf8::valid(), but the docs
recommend the use of utf8::is_utf8().

Those two just perform different tests (or are supposed to). But see
below (there are some "smart defaults" along the way):

perl -wle '
#use encoding 'utf8';
@x = ( qq|\x{DF}\x{0100}|, q|a|, qq|\x{DF}|, qq|\x{0100}| );
foreach my $y (@x) {
    printf qq|valid (%i) - is (%i) - |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    utf8::encode($y);
    printf qq|valid (%i) + is (%i) + |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    utf8::decode($y);
    printf qq|valid (%i) / is (%i) / |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
}
'
Wide character in print at -e line 6.
valid (1) - is (1) - ßĀ
valid (1) + is (0) + ßĀ
Wide character in print at -e line 12.
valid (1) / is (1) / ßĀ
valid (1) - is (0) - a
valid (1) + is (0) + a
valid (1) / is (0) / a
valid (1) - is (0) - �
valid (1) + is (0) + ß
valid (1) / is (1) / �
Wide character in print at -e line 6.
valid (1) - is (1) - Ā
valid (1) + is (0) + Ā
Wide character in print at -e line 12.
valid (1) / is (1) / Ā

While with C<use encoding> uncommented (output only):

valid (1) - is (1) - ßĀ
valid (1) + is (0) + ÃÄ
valid (1) / is (1) / ßĀ
valid (1) - is (1) - a
valid (1) + is (0) + a
valid (1) / is (0) / a
valid (1) - is (1) - �
valid (1) + is (0) + ï¿½
valid (1) / is (1) / �
valid (1) - is (1) - Ā
valid (1) + is (0) + Ä
valid (1) / is (1) / Ā

*CUT*

p.s. I'm not sure how all that would go out of slrn.

p.p.s. Would some kind perlist take a look at the B<utf8::valid> code, please?
