CGI query string encoding issue...

Discussion in 'Perl Misc' started by howa, Mar 4, 2009.

  1. howa

    howa Guest

    Hello, consider my simple cgi program below:

    #=======
    #!/usr/bin/perl
    use strict;

    use CGI;
    my $q = new CGI;
    my $s = $q->param("s");
    print $q->header( -type => "text/html" );

    print utf8::valid ($s);
    #=======

    Then I call, e.g.

    http://www.example.com/cgi-bin/test.cgi?s=abc (prints 1, OK)
    http://www.example.com/cgi-bin/test.cgi?s=中文 (also prints 1, but my
    parameter s is in BIG5 traditional Chinese encoding, not UTF-8!)

    So now I am really confused by the encoding stuff... Can anyone
    modify my program above so that it does not print 1 if $s contains
    non-UTF8 characters?

    Thanks.
     
    howa, Mar 4, 2009
    #1

  2. howa

    howa Guest

    Hi,

    On Mar 4, 11:41 pm, Chris Mattern wrote:
    > Not really, because there's no way to look at an arbitrary bit string and
    > know that it's BIG5, or utf8, or whatever. You said parameter s is a BIG5
    > string in the second case--but it's also a utf8 string: "^[$BCfJ8^[(B".
    >



    A simpler example, using a percent-encoded string, for the URL:
    http://www.example.com/cgi-bin/test.cgi?s=中

    http://www.example.com/cgi-bin/test.cgi?s=%A4%A4 (BIG-5, 0xa440 to
    0xc67e, http://en.wikipedia.org/wiki/Big5)
    http://www.example.com/cgi-bin/test.cgi?s= (UTF-8, see
    variable $valid_utf8_regexp at http://cpansearch.perl.org/src/MARKF/Test-utf8-0.02/lib/Test/utf8.pm)


    As you can see, the BIG5 char %A4%A4 is definitely not a valid UTF-8
    byte sequence, yet utf8::valid() returns 1
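
One way to get the check asked for here is to attempt a strict decode of the raw bytes. The sketch below assumes the parameter value arrives as already percent-decoded octets (as CGI.pm's param() normally returns them when its -utf8 pragma is not in use). The helper name is_wellformed_utf8 is made up for illustration. Note that utf8::valid() tests whether a scalar's internal representation is consistent, and it returns true for any plain byte string, which would explain the 1 seen above.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $bytes is a well-formed UTF-8
# octet sequence, 0 otherwise.
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place. 'UTF-8' (with the hyphen)
    # selects Encode's strict decoder.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Labels show the percent-encoded form; the second element is the octets.
for my $case ( [ 'abc',       'abc' ],
               [ '%E4%B8%AD', "\xE4\xB8\xAD" ],
               [ '%A4%A4',    "\xA4\xA4" ] ) {
    my ( $label, $bytes ) = @$case;
    printf "%-10s %s\n", $label,
        is_wellformed_utf8($bytes) ? 'well-formed UTF-8' : 'not UTF-8';
}
```

A successful decode only shows that the bytes are well-formed UTF-8, not that the sender intended UTF-8, which is the ambiguity pointed out in the quoted reply.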
     
    howa, Mar 5, 2009
    #2

  3. howa wrote:
    > Hello, consider my simple cgi program below:
    >
    > #=======
    > #!/usr/bin/perl
    > use strict;
    >
    > use CGI;
    > my $q = new CGI;
    > my $s = $q->param("s");
    > print $q->header( -type => "text/html" );
    >
    > print utf8::valid ($s);
    > #=======
    >
    > Then I call, e.g.
    >
    > http://www.example.com/cgi-bin/test.cgi?s=abc (prints 1, OK)
    > http://www.example.com/cgi-bin/test.cgi?s=中文 (also prints 1, but my
    > parameter s is in BIG5 traditional Chinese encoding, not UTF-8!)


    I'm not sure what utf8::valid() means, but the docs
    recommend the use of utf8::is_utf8().

    Does the code below make sense to you?

    $ cat test.pl
    use strict;
    use warnings;
    use Encode;

    # Percent-decode the query value into raw bytes.
    my $big5_uriencoded = '%A4%A4';
    ( my $big5_bytes = $big5_uriencoded ) =~ s/%(..)/chr(hex $1)/eg;
    print '$big5_bytes ', utf8::is_utf8($big5_bytes) ? 'is' : 'is not',
        " in UTF-8 internally.\n";

    # Decode the Big5 bytes into a Perl character string.
    my $string = decode('Big5', $big5_bytes);
    print '$string ', utf8::is_utf8($string) ? 'is' : 'is not',
        " in UTF-8 internally.\n\n";

    $ perl test.pl
    $big5_bytes is not in UTF-8 internally.
    $string is in UTF-8 internally.

    I believe it tells us that it's not possible to encode $big5_bytes
    directly to UTF-8, while that's possible with $string.
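
As a hedged aside, utf8::is_utf8() reports how Perl stores a scalar internally rather than which encoding the original bytes were in. A minimal sketch of the conversion path, assuming the input is known in advance to be Big5 (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $big5_bytes = "\xA4\xA4";                    # Big5 octets from the query string
my $string     = decode( 'Big5', $big5_bytes ); # Perl character string
my $utf8_bytes = encode( 'UTF-8', $string );    # UTF-8 octets for output

printf "characters: %d, code point: U+%04X\n", length($string), ord($string);
printf "UTF-8 octets: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;

# encode() returns octets, so the internal UTF-8 flag is off here even
# though the content is valid UTF-8.
printf "is_utf8(\$utf8_bytes): %d\n", utf8::is_utf8($utf8_bytes) ? 1 : 0;
```

The last line is the point of the sketch: the flag is off for the encoded result, so is_utf8() is not a test of whether data is valid UTF-8.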

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Mar 5, 2009
    #3
  4. On 2009-03-05, Gunnar Hjalmarsson wrote:
    *SKIP*
    > I'm not sure about the meaning of utf8::valid (), but the docs
    > recommends the use of utf8::is_utf8().


    Those two just perform different tests (or are supposed to). But see
    below (there are some "smart defaults" along the way):

    perl -wle '
    #use encoding 'utf8';
    @x = ( qq|\x{DF}\x{0100}|, q|a|, qq|\x{DF}|, qq|\x{0100}| );
    foreach my $y (@x) {
    printf qq|valid (%i) - is (%i) - |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    utf8::encode($y);
    printf qq|valid (%i) + is (%i) + |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    utf8::decode($y);
    printf qq|valid (%i) / is (%i) / |, utf8::valid($y), utf8::is_utf8($y);
    print $y;
    }
    '
    Wide character in print at -e line 6.
    valid (1) - is (1) -
    valid (1) + is (0) +
    Wide character in print at -e line 12.
    valid (1) / is (1) /
    valid (1) - is (0) - a
    valid (1) + is (0) + a
    valid (1) / is (0) / a
    valid (1) - is (0) -
    valid (1) + is (0) +
    valid (1) / is (1) /
    Wide character in print at -e line 6.
    valid (1) - is (1) -
    valid (1) + is (0) +
    Wide character in print at -e line 12.
    valid (1) / is (1) /

    While with C<use encoding> uncommented (output only):

    valid (1) - is (1) -
    valid (1) + is (0) +
    valid (1) / is (1) /
    valid (1) - is (1) - a
    valid (1) + is (0) + a
    valid (1) / is (0) / a
    valid (1) - is (1) -
    valid (1) + is (0) +
    valid (1) / is (1) /
    valid (1) - is (1) -
    valid (1) + is (0) +
    valid (1) / is (1) /

    *CUT*

    p.s. I'm not sure how all that would go out of slrn.

    p.p.s. Would some kind Perlist look at the B<utf8::valid> code, please?

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Mar 6, 2009
    #4