Re: utf8 encoding problem

Discussion in 'Python' started by Wichert Akkerman, Jan 22, 2004.

  1. Previously Denis S. Otkidach wrote:
    > You have to pass 8-bit string, but not unicode. The following
    > code works as expected:
    >
    > >>> urllib.unquote('t%C3%A9st').decode('utf-8')
    > u't\xe9st'


    Ah, that does work indeed, thanks.
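
For readers on Python 3, where `urllib.unquote` no longer exists under that name, the same unquote-then-decode order can be sketched roughly as follows. This is only an illustration; UTF-8 is assumed as the submission charset, as in the example quoted above.

```python
from urllib.parse import unquote_to_bytes

# Step 1: undo the %HH escaping. The result is raw bytes, not text.
raw = unquote_to_bytes('t%C3%A9st')

# Step 2: decode those bytes with the charset the form was submitted in
# (UTF-8 is an assumption here, matching the quoted example).
text = raw.decode('utf-8')

print(repr(raw))
print(repr(text))
```

Keeping the two steps separate makes it obvious that the charset only comes into play after the percent-escapes are gone.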

    > P.S. According to HTML standard, with
    > application/x-www-form-urlencoded content type form data are
    > restricted to ASCII codes:
    > http://www.w3.org/TR/html4/interact/forms.html#form-data-set
    > http://www.w3.org/TR/html4/interact/forms.html#submit-format


    Luckily that is not true, otherwise it would be completely impossible to
    have websites using non-ascii input. To be specific, the encoding used
    for HTML forms is determined by:

    1. accept-charset attribute of the form element if present. This is
    not handled by all browsers though.
    2. the encoding used for the html page containing the form
    3. ascii otherwise

    This is specified in section 17.3 of the HTML 4.01 standard you are
    referring to.
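
The three-step fallback listed above could be sketched like this. `form_charset` and its argument names are hypothetical, made up for illustration; taking the first entry of a multi-valued accept-charset is also an assumption, since a real browser may choose differently or ignore the attribute altogether.

```python
def form_charset(accept_charset=None, page_encoding=None):
    """Guess the charset a browser would use for a form submission.

    Hypothetical helper: names and the "take the first entry" rule are
    assumptions made for this sketch, not part of any standard API.
    """
    # 1. accept-charset attribute of the form element, if present.
    #    HTML 4.01 allows a space- and/or comma-delimited list.
    if accept_charset:
        candidates = accept_charset.replace(',', ' ').split()
        if candidates:
            return candidates[0].lower()
    # 2. Otherwise, the encoding of the HTML page containing the form.
    if page_encoding:
        return page_encoding.lower()
    # 3. Otherwise, ASCII.
    return 'ascii'


print(form_charset('UTF-8, ISO-8859-1', 'ISO-8859-1'))
print(form_charset(None, 'ISO-8859-1'))
print(form_charset())
```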

    Wichert.

    --
    Wichert Akkerman <> It is simple to make things.
    http://www.wiggy.net/ It is hard to make things simple.
    Wichert Akkerman, Jan 22, 2004
    #1

  2. Wichert Akkerman wrote:


    >>P.S. According to HTML standard, with
    >>application/x-www-form-urlencoded content type form data are
    >>restricted to ASCII codes:

    [...]
    > Luckily that is not true, otherwise it would be completely impossible to
    > have websites using non-ascii input. To be specific, the encoding used
    > for HTML forms is determined by: [algorithm omitted]


    As Denis explains, it is true. See 17.13.4

    application/x-www-form-urlencoded
    .... Non-alphanumeric characters are replaced by `%HH', a percent sign
    and two hexadecimal digits representing the ASCII code of the character.

    So this form is restricted only to characters which have an ASCII code,
    i.e. ASCII characters.
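
To make the `%HH` rule concrete, here is a small Python 3 sketch. An ASCII character maps straight to its ASCII code, while a non-ASCII character has no such code and has to be turned into bytes in some charset first. The two charsets below are arbitrary choices for illustration.

```python
from urllib.parse import quote

# '&' is ASCII code 0x26, so the escape is unambiguous.
ampersand = quote('&', safe='')

# U+00E9 (e with acute accent) has no ASCII code. What ends up on the
# wire depends entirely on which charset is used to encode it first.
e_acute = '\u00e9'
as_utf8 = quote(e_acute.encode('utf-8'), safe='')
as_latin1 = quote(e_acute.encode('iso-8859-1'), safe='')

print(ampersand, as_utf8, as_latin1)
```

The receiving side sees only the escaped bytes, which is why it has to know (or guess) the charset separately.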

    To have non-ASCII input, use multipart/form-data:

    multipart/form-data
    ....
    The content type "multipart/form-data" should be used for submitting
    forms that contain files, non-ASCII data, and binary data.

    This reconfirms that you should use it for non-ASCII.

    Regards,
    Martin
    "Martin v. Löwis", Jan 24, 2004
    #2


  3. >> Luckily that is not true, otherwise it would be completely impossible
    >> to have websites using non-ascii input. To be specific, the encoding
    >> used for HTML forms is determined by: [algorithm omitted]


    Martin> As Denis explains, it is true. See 17.13.4

    Sorry, but I'm coming to this discussion late. See "17.13.4" of what
    document?

    Thx,

    Skip
    Skip Montanaro, Jan 24, 2004
    #3
  4. Martin v. Loewis <> wrote:

    > As Denis explains, it is true. See 17.13.4


    Indeed. [Skip: http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4 ]

    > To have non-ASCII input, use multipart/form-data:


    Quite so, in theory. Of course in reality, no browser today includes a
    Content-Type header in the subparts of a multipart/form-data submission,
    so there's nowhere to specify a charset here either! argh.

    multipart/form-data as implemented in current UAs is just as encoding-unaware
    as application/x-www-form-urlencoded, sadly. In practical terms it does not
    really matter much which is used.
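
As a rough sketch of what a server is left to do, the snippet below looks for a charset in the headers of one multipart/form-data subpart and falls back to a caller-supplied guess when none is given. `part_charset` and the `fallback` argument are hypothetical names for this illustration, and the sample headers are made up.

```python
from email import message_from_bytes


def part_charset(part_headers, fallback):
    """Return the charset declared for one form-data subpart.

    Hypothetical helper: `fallback` stands for whatever the application
    assumes, e.g. the encoding of the page that contained the form.
    """
    msg = message_from_bytes(part_headers)
    # get_content_charset() returns None when there is no Content-Type
    # header or when it carries no charset parameter.
    return msg.get_content_charset() or fallback


# A subpart as described above: no Content-Type header at all.
bare = b'Content-Disposition: form-data; name="q"\r\n\r\n'
# A subpart that does declare its charset.
labelled = (b'Content-Disposition: form-data; name="q"\r\n'
            b'Content-Type: text/plain; charset=ISO-8859-1\r\n\r\n')

print(part_charset(bare, 'utf-8'))
print(part_charset(labelled, 'utf-8'))
```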

    [...waiting for the glorious day when UTF-8 and UCS-4 are the only acceptable
    encodings; and on that day, Shift-JIS will be the first against the wall oh
    let me blummin' well tell you my brother...]

    --
    Andrew Clover
    mailto:
    http://www.doxdesk.com/
    Andrew Clover, Jan 24, 2004
    #4
  5. Andrew Clover wrote:
    > Quite so, in theory. Of course in reality, no browser today includes a
    > Content-Type header in the subparts of a multipart/form-data submission,
    > so there's nowhere to specify a charset here either! argh.


    Right. In this case, the algorithm Wichert quotes should apply.

    I once tried to study why browsers won't send Content-Type headers.
    Actually, they *do* send Content-Type headers, but omit the charset=
    parameter. I submitted various bug reports, and the Mozilla people
    replied that they tried to, and found that various CGI scripts would
    break when confronted with the standards-conforming request, but
    work when they get the deprecated form.

    So it looks like this situation will extend indefinitely.

    > multipart/form-data as implemented in current UAs is just as encoding-unaware
    > as application/x-www-form-urlencoded, sadly. In practical terms it does not
    > really matter much which is used.


    Right - for practical terms, standards don't matter much. As this thread
    shows, the form used *does* matter in practical terms though: Users
    of application/x-www-form-urlencoded are now confronted with the
    unescaping-then-decoding issue, which apparently is a challenge.
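
For a whole urlencoded form body, Python 3's `urllib.parse.parse_qs` performs the unescape and the decode in that order when given an encoding. The sketch below assumes the body was submitted as UTF-8; the field names and values are invented. The second call shows what happens when the assumed charset is wrong.

```python
from urllib.parse import parse_qs

# A made-up urlencoded body whose values were UTF-8 before escaping.
body = 'name=t%C3%A9st&city=K%C3%B6ln'

# Unescape %HH to bytes, then decode as UTF-8; fail loudly on bad input.
fields = parse_qs(body, encoding='utf-8', errors='strict')

# Same bytes decoded with the wrong charset: no error, just mojibake.
wrong = parse_qs(body, encoding='iso-8859-1')

print(fields)
print(wrong)
```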

    Regards,
    Martin
    "Martin v. Löwis", Jan 25, 2004
    #5
