How to get UTF-8 from an urlencoded web form?

Discussion in 'Perl Misc' started by Yohan N. Leder, Jul 15, 2006.

  1. Hello.

    All my tests are done using ActivePerl 5.8.8.817 under Win2K FR and
    Apache2.

    I'm trying to obtain (and display) user data that comes from a web form
    with enctype 'application/x-www-form-urlencoded', without success. It
    works if the form is 'multipart/form-data', but not if it is
    'application/x-www-form-urlencoded'.

    Here is a script to show the difference :

    ---- BEGIN ----
    #!/usr/bin/perl -w
    my $this = "utf8_and_webform.pl";

    require 5.8.0;
    use utf8;
    binmode(STDOUT, ':utf8');
    print "Content-type: text/html; charset=UTF-8\n\n";
    if (defined $ENV{'QUERY_STRING'} && length($ENV{'QUERY_STRING'}) > 0)
    {&see;}
    else {&ask;}
    exit 0;

    sub ask
    { # provide web forms for user to enter data
    print <<PAGE
    <html><head><title>Test about UTF-8 and web form</title></head><body>
    Use the form you want and see the resulting data.
    <p>
    FORM with enctype as 'application/x-www-form-urlencoded' :<br>
    <form action='$this?x' method='post' accept-charset='UTF-8'
    enctype='application/x-www-form-urlencoded'>
    <textarea name='msg' rows='4' cols='30' wrap='virtual'></textarea>
    <input type='submit' value='send'>
    </form></p>
    <p>
    FORM with enctype as 'multipart/form-data' :<br>
    <form action='$this?x' method='post' accept-charset='UTF-8'
    enctype='multipart/form-data'>
    <textarea name='msg' rows='4' cols='30' wrap='virtual'></textarea>
    <input type='submit' value='send'>
    </form></p></body></html>
    PAGE

    }

    sub see
    { # display data which come from user form
    my $data='';

    binmode(STDIN, ':utf8'); # or ':encoding('UTF-8')'
    read(STDIN, $data, $ENV{'CONTENT_LENGTH'});

    # OR
    #use Encode qw(decode);
    #read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
    #$data = decode('UTF-8', $data);

    print $data;

    }
    ----- END ----

    For example, if I submit the 'urlencoded' form (the first one, at the top
    of the generated web page, if you run the script without any URL
    parameter) with the letter 'é' (accented e) inside the textarea, I get
    'msg=%C3%A9' displayed in the browser (knowing this has been processed
    through the see() sub).

    Whereas, if I submit the same 'é' from the 'multipart/form-data' form (the
    second one, at the bottom of the generated web page), I get a correctly
    interpreted UTF-8 'é' as expected.

    How do I get this same UTF-8 'é' when the form uses the
    'application/x-www-form-urlencoded' enctype? How should the see() sub be
    modified for this urlencoded form case?
     
    Yohan N. Leder, Jul 15, 2006
    #1

  2. Yohan N. Leder wrote:
    > if I submit the 'urlencoded' form (the first one, at the top
    > of the generated web page, if you run the script without any URL
    > parameter) with the letter 'é' (accented e) inside the textarea, I get
    > 'msg=%C3%A9' displayed in the browser (knowing this has been processed
    > through the see() sub).
    >
    > Whereas, if I submit the same 'é' from the 'multipart/form-data' form (the
    > second one, at the bottom of the generated web page), I get a correctly
    > interpreted UTF-8 'é' as expected.
    >
    > How do I get this same UTF-8 'é' when the form uses the
    > 'application/x-www-form-urlencoded' enctype?


    The problem is covered by this FAQ entry:
    http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jul 15, 2006
    #2

  3. Yohan N. Leder, Jul 15, 2006
    #3
  4. Yohan N. Leder wrote:
    > In article <>, says...
    >>The problem is covered by this FAQ entry:
    >>http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG

    >
    > It doesn't explain the problem, but remove the problem using CGI.pm, and
    > I would like to understand the problem.


    Excellent learning approach.

    The browser automatically URI escapes 'unsafe' characters when you make
    a GET or an x-www-form-urlencoded POST request. Hence those characters
    need to be unescaped by the web server. CGI.pm as well as other modules
    for parsing CGI data takes care of that.

    You can study the docs for the Perl module URI::Escape for a better
    explanation.
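
    As a rough illustration of that unescaping step, here is a minimal
    sketch using only the core Encode module. The helper name
    parse_urlencoded is hypothetical, and the sketch assumes the browser
    sent UTF-8 bytes (as the form's accept-charset='UTF-8' requests). It is
    not a substitute for CGI.pm, which handles many more cases.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: parse an application/x-www-form-urlencoded body
# whose percent-escapes carry UTF-8 bytes. Returns name => value pairs
# as Perl character strings.
sub parse_urlencoded {
    my ($body) = @_;
    my %field;
    for my $pair (split /[&;]/, $body) {
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX -> one raw byte
            $_ = decode('UTF-8', $_);             # bytes -> characters
        }
        $field{$name} = $value;
    }
    return %field;
}

my %form = parse_urlencoded('msg=%C3%A9t%C3%A9+2006');
binmode(STDOUT, ':encoding(UTF-8)');
print "$form{msg}\n";    # prints: été 2006
```

    In a CGI script the argument would be the request body read from STDIN
    (without a ':utf8' layer on STDIN, since the body itself is plain ASCII
    until it has been unescaped).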

    I suppose you should also read up on the HTTP protocol.

    HTH

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jul 15, 2006
    #4
  5. Yohan N. Leder wrote:

    > All my tests are done using ActivePerl 5.8.8.817 under Win2K FR and
    > Apache2.
    >
    > I'm trying to obtain (and display) user data that comes from a web form
    > with enctype 'application/x-www-form-urlencoded', without success. It
    > works if the form is 'multipart/form-data', but not if it is
    > 'application/x-www-form-urlencoded'.


    [snip code ]

    > For example, if I submit the 'urlencoded' form (the first one, at the top
    > of the generated web page, if you run the script without any URL
    > parameter) with the letter 'é' (accented e) inside the textarea, I get
    > 'msg=%C3%A9' displayed in the browser (knowing this has been processed
    > through the see() sub).
    >
    > Whereas, if I submit the same 'é' from the 'multipart/form-data' form (the
    > second one, at the bottom of the generated web page), I get a correctly
    > interpreted UTF-8 'é' as expected.
    >
    > How do I get this same UTF-8 'é' when the form uses the
    > 'application/x-www-form-urlencoded' enctype? How should the see() sub be
    > modified for this urlencoded form case?


    That shouldn't be particularly mysterious. You're specifying the page's
    charset as UTF-8 in its header (where you say "Content-type: text/html;
    charset=UTF-8"), causing the 'é'- character to be sent as Unicode's
    literal 'Ã©', that is, the two UTF-8 bytes that encode é (U+00E9, dec
    233/hex E9/eacute/LATIN SMALL LETTER E WITH ACUTE). Read as ISO-8859-1,
    the code point for Ã is C3, and for © it's A9, thus the expected
    value becomes %C3%A9.

    Encoding é -> Ã© -> %C3%A9 :

    #!/usr/bin/perl -w
    my $posteddata = <STDIN>;
    print <<PAGE
    Content-type: text/html; charset=UTF-8

    <html><body>
    Posted data: $posteddata<hr>
    <form action='f.pl' method='post'>
    <textarea name='msg'></textarea>
    <input type='submit'>
    </form></body></html>
    PAGE

    Whereas the "normal" form encoding would be é -> %E9:

    #!/usr/bin/perl -w
    my $posteddata = <STDIN>;
    print <<PAGE
    Content-type: text/html

    <html><body>
    Posted data: $posteddata<hr>
    <form action='f.pl' method='post'>
    <textarea name='msg'></textarea>
    <input type='submit'>
    </form></body></html>
    PAGE
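
    The byte values mentioned above can be checked from the command line. A
    small sketch, assuming only the core Encode module: it encodes é as
    UTF-8 and then deliberately misreads the result as ISO-8859-1, which is
    where the two characters Ã and © come from.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $e_acute = "\x{E9}";                      # é, U+00E9
my $bytes   = encode('UTF-8', $e_acute);     # the two bytes C3 A9
my $hex     = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Misreading those bytes as ISO-8859-1 yields two characters,
# U+00C3 (Ã) and U+00A9 (©).
my $misread = decode('ISO-8859-1', $bytes);
my @points  = map { sprintf 'U+%04X', ord } split //, $misread;

print "UTF-8 bytes: $hex\n";       # UTF-8 bytes: C3 A9
print "as Latin-1:  @points\n";    # as Latin-1:  U+00C3 U+00A9
```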

    P.S. 'application/x-www-form-urlencoded' is the default form encoding
    type anyhow, so there is actually no need to set this as a form
    argument.

    Recommended literature:
    http://home.tiscali.nl/t876506/utf8tbl.html (search for string C3A9 on
    that page)
    Table CPs < 256: http://en.wikipedia.org/wiki/ISO_8859-1
    And of course Perl FAQ/docs, as Gunnar pointed out.

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #5
  6. In article <>, says...
    > Excellent learning approach.


    Thanks. Better than taking everything as an eternal mysterious box in my
    mind.

    > The browser automatically URI escapes 'unsafe' characters when you make
    > a GET or an x-www-form-urlencoded POST request. Hence those characters
    > need to be unescaped by the web server. CGI.pm as well as other modules
    > for parsing CGI data takes care of that.
    >


    Hm, understood !

    > You can study the docs for the Perl module URI::Escape for a better
    > explanation.


    I'll do it for sure ;-)
     
    Yohan N. Leder, Jul 16, 2006
    #6
  7. In article <>,
    says...
    > That shouldn't be particularly mysterious. You're specifying the page's
    > charset as UTF-8 in its header (where you say "Content-type: text/html;
    > charset=UTF-8"), causing the 'é'- character to be sent as Unicode's
    > literal 'Ã©'
    >


    That is indeed what I want. However, Gunnar's explanation shows the key to
    the problem: URI escaping when the form enctype is *urlencoded*.
     
    Yohan N. Leder, Jul 16, 2006
    #7
  8. Yohan N. Leder wrote:

    > In article <>,
    > says...
    > > That shouldn't be particularly mysterious. You're specifying the page's
    > > charset as UTF-8 in its header (where you say "Content-type: text/html;
    > > charset=UTF-8"), causing the 'é'- character to be sent as Unicode's
    > > literal 'Ã©'

    >
    > That is indeed what I want. However, Gunnar's explanation shows the key to
    > the problem: URI escaping when the form enctype is *urlencoded*.


    Yes, the URL encoding is done at the browser's side by default, before
    and apart from the sendout of the name/value pairs. This behaviour can
    be altered by adding enctype="multipart/form-data" as an extra argument
    to <form method="post">. The main reason for this feature to exist, is
    the transfer of (binary) files to the gateway software on the server.
    Thus, if you want to send 'é', the browser will pass it as "%E9" by
    default. It's up to your Perl script to decode it back to 'é'. In the
    multipart/form-data encoding type, 'é' is just passed as 'é'. In
    UTF-8 sets, the browser looks for the literal equivalent of 'é', and
    then passes the URL-encoded value of that literal equivalent.
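
    To make the two cases concrete, here is a small sketch of what the
    browser roughly does on its side, written in Perl with the core Encode
    module. The helper name form_escape is hypothetical, and the set of
    characters left unescaped is a simplifying assumption; real browsers
    differ in the details.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a character string roughly the way
# a browser does for a form submitted in the given charset.
sub form_escape {
    my ($text, $charset) = @_;
    my $bytes = encode($charset, $text);            # characters -> bytes
    # Escape every byte outside a small safe set as %XX.
    $bytes =~ s/([^A-Za-z0-9*._ -])/sprintf('%%%02X', ord($1))/ge;
    $bytes =~ tr/ /+/;                              # spaces become '+'
    return $bytes;
}

print form_escape("\x{E9}", 'ISO-8859-1'), "\n";    # %E9
print form_escape("\x{E9}", 'UTF-8'), "\n";         # %C3%A9
```

    The same character therefore arrives as one escaped byte or as two,
    depending only on the charset the browser used for the submission.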

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #8
  9. Gunnar Hjalmarsson wrote:

    > [...]
    > The browser automatically URI escapes 'unsafe' characters when you make
    > a GET or an x-www-form-urlencoded POST request. Hence those characters
    > need to be unescaped by the web server. CGI.pm as well as other modules
    > for parsing CGI data takes care of that.


    #!/usr/bin/pedant
    I think the correct terminology is actually URL-encoding here (or
    percent-encoding) instead of URI-escaping
    (http://en.wikipedia.org/wiki/URL_encoding).

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #9
  10. A. Sinan Unur wrote:

    > [...]
    > Escaping is a general method of changing the meaning of the characters
    > following a designated special character. In this case, % is the special
    > character, and it changes the meaning of the characters following it.
    > Characters not allowed in URIs are replaced with these escape sequences.


    Yes, but escaping would then only refer to the %-sign, not to what
    follows. In '%E9', '%' is the escape character and 'E9' the encoded
    value of 'é'. E9 has nothing to do with escaping; otherwise it would
    have been %é (or \é).

    So I think we're both 50% right here :)

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #10
  11. In article <>,
    says...
    > Yes, the URL encoding is done at the browser's side by default, before
    > and apart from the sendout of the name/value pairs. This behaviour can
    > be altered by adding enctype="multipart/form-data" as an extra argument
    > to <form method="post">. The main reason for this feature to exist, is
    > the transfer of (binary) files to the gateway software on the server.
    > Thus, if you want to send 'é', the browser will pass it as "%E9" by
    > default. It's up to your Perl script to decode it back to 'é'. In the
    > multipart/form-data encoding type, 'é' is just passed as 'é'. In
    > UTF-8 sets, the browser looks for the literal equivalent of 'é', and
    > then passes the URL-encoded value of that literal equivalent.
    >


    Well understood, Bart. Thanks
     
    Yohan N. Leder, Jul 16, 2006
    #11
  12. A. Sinan Unur wrote:

    > "Bart Van der Donck" <> wrote in
    > news::
    >
    > > Yes, but escaping would then only refer to the %-sign, not to what
    > > follows. In '%E9', '%' is the escape character and 'E9' the encoded
    > > value of 'é'. E9 has nothing to do with escaping; otherwise it would
    > > have been %é (or \é).

    >
    > Not really. In Perl, \n is the "escape sequence" for the platform
    > dependent end-of-line character.


    While that is absolutely true, E9 is still an encoded value of é. It
    might or might not serve inside a notation that uses a designated
    escape character.

    > Clearly, 'n' in that escape sequence is just like the E9 above.


    There is no encoding involved in \n, but there is when you write é as
    %E9

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #12
  13. A. Sinan Unur wrote:

    > > There is no encoding involved in \n, but there is when you write é as
    > > %E9

    >
    > That does not make sense. 'n' all by itself is not the end of line
    > character anywhere. In the realm of Perl's interpolated strings, the
    > letter 'n' that follows '\' is the encoding of the EOL.


    You can't compare (n vs. \n) to (é vs. %E9). There is simply no
    relation between the characters that are represented by "n" and "\n".
    There is no encoding or conversion or whatever.

    The idea is totally different when _going_from_ é to %E9. You have a
    clear encoding algorithm there that takes its data from some code
    table. There is no "going from" involved in "n" versus "\n".

    é to %E9 consists of 2 parts:
    (1) Encode from é to E9
    (2) Put a % before E9 to make clear it's an escape sequence

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #13
  14. Bart Van der Donck wrote:
    > Gunnar Hjalmarsson wrote:
    >>The browser automatically URI escapes 'unsafe' characters when you make
    >>a GET or an x-www-form-urlencoded POST request. Hence those characters
    >>need to be unescaped by the web server. CGI.pm as well as other modules
    >>for parsing CGI data takes care of that.

    >
    > #!/usr/bin/pedant
    > I think the correct terminology is actually URL-encoding here (or
    > percent-encoding) instead of URI-escaping
    > (http://en.wikipedia.org/wiki/URL_encoding).


    I simply chose to use the term from URI::Escape. My English isn't good
    enough for arguing about it. ;-)

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jul 16, 2006
    #14
  15. Dr.Ruud Guest

    Bart Van der Donck schreef:

    > You can't compare (n vs. \n) to (é vs. %E9).


    You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
    translations use a table in memory.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jul 16, 2006
    #15
  16. Dr.Ruud wrote:

    > Bart Van der Donck schreef:
    >
    > > You can't compare (n vs. \n) to (é vs. %E9).

    >
    > You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
    > translations use a table in memory.


    Yes, exactly, like:
    LF -> %0A
    é -> %E9

    <LF> refers to hex 0A by definition, but I'm not sure whether "\n"
    always refers to hex 0A on various operating systems.

    --
    Bart
     
    Bart Van der Donck, Jul 16, 2006
    #16
  17. Dr.Ruud Guest

    Bart Van der Donck schreef:
    > Dr.Ruud:
    >> Bart Van der Donck:


    >>> You can't compare (n vs. \n) to (é vs. %E9).

    >>
    >> You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
    >> translations use a table in memory.

    >
    > Yes, exactly, like:
    > LF -> %0A
    > é -> %E9
    >
    > <LF> refers to hex 0A by definition,


    Yes, but when going the other way around, "%E9" can be translated to <é>
    (if that's what the character is at position 0xE9 in the current
    charset), and "\n" to LF (or CR or CRLF or whatever, depending on the
    platform). That the "%E9" and 0xE9 look a lot alike, and "\n" and 0x0A
    don't, doesn't really matter.

    If the current charset is UTF-8, the "%E9" is translated to a specific
    multibyte sequence. In this context of escape characters, you could say
    that UTF-8 has an escape bit.
    If the current charset is ASCII, the "%E9" might be translated to "?" or
    "e", or "'e" or "e'" or whatever is feasible.


    > but I'm not sure whether "\n"
    > always refers to hex 0A on various operating systems.


    It doesn't, but that doesn't matter. The escape-character, at the start
    of the escape sequence, brings up a special mode, that just eats the
    escape-sequence and inserts/returns an equally or more specific
    translation.
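
    The charset dependence described above can be sketched with the core
    Encode module. The assumption here is that the sender used ISO-8859-1,
    so the unescaped byte 0xE9 means é; the character is then re-encoded
    into two other target charsets.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Undo the percent-escape: "%E9" becomes one raw byte, 0xE9.
(my $byte = '%E9') =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Assumption: the sender's charset was ISO-8859-1, so 0xE9 is é (U+00E9).
my $char = decode('ISO-8859-1', $byte);

my $as_utf8  = encode('UTF-8', $char);   # multibyte sequence C3 A9
my $as_ascii = encode('ascii', $char);   # ASCII has no é; Encode substitutes '?'

printf "UTF-8: %s\n", join ' ', map { sprintf '%02X', ord } split //, $as_utf8;
print  "ASCII: $as_ascii\n";
```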

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jul 16, 2006
    #17
  18. robic0 Guest

    On Sun, 16 Jul 2006 20:56:31 GMT, "A. Sinan Unur" <> wrote:

    >"Bart Van der Donck" <> wrote in
    >news::
    >
    >[ You should not snip attributions. You create the impression that I
    >wrote something I did not. That is not nice. ]
    >
    >> A. Sinan Unur wrote:
    >>
    >>> > There is no encoding involved in \n, but there is when you write é
    >>> > as %E9

    >
    >[ You made the statement above. I did not. ]
    >
    >>> That does not make sense. 'n' all by itself is not the end of line
    >>> character anywhere. In the realm of Perl's interpolated strings, the
    >>> letter 'n' that follows '\' is the encoding of the EOL.

    >>
    >> You can't compare (n vs. \n) to (é vs. %E9). There is simply no
    >> relation between the characters that are represented by "n" and "\n".
    >> There is no encoding or conversion or whatever.

    >
    >This is absurd. "\n" is an encoding of EOL.
    >
    >Since you like quoting Wikipedia:
    ><http://en.wikipedia.org/wiki/Encoding>
    >
    ><blockquote>
    >Encoding is the process of transforming information from one format into
    >another. The opposite operation is called decoding.
    ></blockquote>
    >
    >Any mapping of one thing on to another is an encoding.
    >
    >This is getting tedious. I am out of this thread.
    >
    >> table. There is no "going from" involved in "n" versus "\n".

    >
    >Sinan


    The definition of "encoding" from Wikipedia is a broad definition.
    For example, encoding/decoding digital media, say MPEG-2, involves a
    linear language to, on the encoding side, take out redundancy in
    adjacent frames, then reconstruct the full frames on the decoding side.
    The jpeg compression layer is on top of the mpeg layer. Finally a full
    bitmap frame. This is by default the Wikipedia definition. This is macro.

    In the case of a single document character, it's an either/or argument.
    Either the character is escaped, in which case it loses its binary form,
    or it's not, in which case it retains it.

    There's a "reason" why certain Unicode characters are reserved as control
    codes. Here is the reason: THE DATA TRANSFERRED ON ALL COMPUTERS ARE BINARY.
    There is no spoon Neo, there is no spoon.

    I guess in that sense, all files read/written are encoded/decoded.
    The formula for encoding is the same as decoding. Think of it as a train
    of boxcars. The decoder waits for the marker cars, then grabs the next
    series of cars as its formula requires. Those cars grabbed are then sent
    on to the next decoder (possibly the same one) where finite smaller cars
    are extracted. The iteration continues (possibly several times).

    All for what? The first thing done was to standardize on how many bytes
    in a Unicode char (there may be small/big). Then the encoding, but what
    is an encoding statement? Why, it's nothing more than an offset into a
    binary blob of data that contains the "character bitmap" to display. Of
    course, displaying a different encoding's bitmapped characters doesn't
    translate into a language translator. The only thing the Rosetta Stone
    did was to connect alphabet characters across languages, i.e. it didn't
    do language translation.

    Back to the larger issue of html/xml encoding/decoding, and I don't know
    what the OP had in mind (looks like html/xml embedded URLs), but he would
    seem to have to know ahead of time how many times a parser will iterate
    over his data, and how many times and with what he needs to escape it.

    The idea of an argument on encoding/decoding as it relates to character
    sets is just absurd! Encoding/decoding was not only done on the very first
    computer display device but is a concept over 4,000 years old!

    Don't make this extremely simple concept out to be something just invented.

    robic0
    -get ur act together



     
    robic0, Jul 17, 2006
    #18
  19. Bart Van der Donck wrote:
    > Dr.Ruud wrote:
    >
    >> Bart Van der Donck schreef:
    >>
    >>> You can't compare (n vs. \n) to (é vs. %E9).

    >> You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
    >> translations use a table in memory.

    >
    > Yes, exactly, like:
    > LF -> %0A
    > é -> %E9
    >
    > <LF> refers to hex 0A by definition, but I'm not sure whether "\n"
    > always refers to hex 0A on various operating systems.


    "\n" is 0x0A on all systems.

    However, when writing a file in text mode, 0x0A may be translated on the
    file into an 0x0D, an 0x0D0A, or (on an IBM mainframe) a logical record end.
    Similarly, an 0x0D, an 0x0D0A, or a logical record end can be translated
    into an 0x0A when reading a file in text mode.
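
    The write-side translation can be observed from Perl itself. A minimal
    sketch, assuming an ASCII-based platform and using the ':crlf' PerlIO
    layer to stand in for what text mode does on Windows; the temporary file
    comes from the core File::Temp module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

# Write "a\n" through the :crlf layer (what text mode does on Windows).
open my $out, '>:crlf', $path or die "open for write: $!";
print {$out} "a\n";
close $out;

# Read the file back untranslated to see the bytes actually stored.
open my $in, '<:raw', $path or die "open for read: $!";
my $stored = do { local $/; <$in> };
close $in;

print join(' ', map { sprintf '%02X', ord } split //, $stored), "\n";   # 61 0D 0A
```

    Inside the program the string still holds a single newline character;
    only the bytes in the file differ.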

    Add the extra zeroes for Unicode as needed.

    --
    John W. Kennedy
    "The blind rulers of Logres
    Nourished the land on a fallacy of rational virtue."
    -- Charles Williams. "Taliessin through Logres: Prelude"
     
    John W. Kennedy, Jul 17, 2006
    #19
  20. John Bokma Guest

    "John W. Kennedy" <> wrote:

    > Bart Van der Donck wrote:
    >> Dr.Ruud wrote:
    >>
    >>> Bart Van der Donck schreef:
    >>>
    >>>> You can't compare (n vs. \n) to (é vs. %E9).
    >>> You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
    >>> translations use a table in memory.

    >>
    >> Yes, exactly, like:
    >> LF -> %0A
    >> é -> %E9
    >>
    >> <LF> refers to hex 0A by definition, but I'm not sure whether "\n"
    >> always refers to hex 0A on various operating systems.

    >
    > "\n" is 0x0A on all systems.


    According to the table on p 161-162 of Programming Perl:

    "Match the newline character (usually NL, but CR on Macs)."

    I doubt that the book talks about file level here, but I have no Mac to
    test this :-D.

    --
    John Bokma Freelance software developer
    &
    Experienced Perl programmer: http://castleamber.com/
     
    John Bokma, Jul 17, 2006
    #20
