utf8 encoding problem

Discussion in 'Python' started by Wichert Akkerman, Jan 22, 2004.

  1. I'm struggling with what should be a trivial problem but I can't seem to
    come up with a proper solution: I am working on a CGI that takes utf-8
    input from a browser. The input is nicely encoded so you get something
    like this:

    firstname=t%C3%A9s

    where %C3%A9 is a single character in utf-8 encoding. Passing this
    through urllib.unquote does not help:

    >>> urllib.unquote(u't%C3%A9st')
    u't%C3%A9st'

    The problem turned out to be that urllib.unquote() processes its
    input character by character, which breaks when it calls chr() on
    an escaped byte: the result is neither valid ascii (it is outside
    the legal range) nor valid unicode (it is only half of a utf-8
    character), and as a result it fails:

    >>> chr(195) + u""
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
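
The failure mode described above can be reproduced in isolation. The sketch below assumes Python 3 (the thread itself uses Python 2), and `decode_bytewise` is a hypothetical helper written for illustration, not part of urllib. Decoding the UTF-8 bytes one at a time fails on the lead byte 0xC3, while decoding the whole sequence at once works.

```python
# UTF-8 bytes for the unquoted form of 't%C3%A9s'
data = b"t\xc3\xa9s"


def decode_bytewise(raw):
    """Hypothetical helper: decode each byte on its own.

    This mimics treating every escaped byte as a complete character,
    which cannot work for multi-byte UTF-8 sequences.
    """
    return "".join(bytes([b]).decode("utf-8") for b in raw)


try:
    decode_bytewise(data)
    bytewise_failed = False
except UnicodeDecodeError:
    # 0xC3 is only the first half of a two-byte sequence.
    bytewise_failed = True

# Decoding the complete byte string succeeds.
whole = data.decode("utf-8")
```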


    I can't seem to find a working method to do this conversion correctly.
    Can someone point me in the right direction? (and please cc me on
    replies since I'm not currently subscribed to this list/newsgroup).

    Wichert.

    --
    Wichert Akkerman <> It is simple to make things.
    http://www.wiggy.net/ It is hard to make things simple.
    Wichert Akkerman, Jan 22, 2004

  2. Wichert Akkerman wrote:

    > I'm struggling with what should be a trivial problem but I can't seem
    > to come up with a proper solution: I am working on a CGI that takes utf-8
    > input from a browser. The input is nicely encoded so you get something
    > like this:
    >
    > firstname=t%C3%A9s
    >
    > where %C3%A9 is a single character in utf-8 encoding. Passing this
    > through urllib.unquote does not help:
    >
    > >>> urllib.unquote(u't%C3%A9st')
    > u't%C3%A9st'


    Unquote it as a normal (byte) string first, then convert the result to Unicode.

    >>> import urllib
    >>> x = 't%C3%A9s'
    >>> y = urllib.unquote(x)
    >>> y
    't\xc3\xa9s'
    >>> z = unicode(y, 'utf-8')
    >>> z
    u't\xe9s'
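
For readers on Python 3, a rough equivalent of the two-step approach above might look like the following. This is a sketch under assumptions: `decode_field` is a hypothetical helper name, while `urllib.parse.unquote_to_bytes` and `urllib.parse.parse_qs` are standard library functions in Python 3.

```python
from urllib.parse import parse_qs, unquote_to_bytes


def decode_field(quoted, encoding="utf-8"):
    """Hypothetical helper: unquote to raw bytes, then decode once."""
    raw = unquote_to_bytes(quoted)   # percent-escapes become raw bytes
    return raw.decode(encoding)      # bytes become text in a single step


# The value from the example above.
name = decode_field("t%C3%A9s")

# For a whole query string, parse_qs performs both steps itself
# and assumes UTF-8 unless told otherwise.
fields = parse_qs("firstname=t%C3%A9s")
```

In Python 3, `urllib.parse.unquote` itself decodes as UTF-8 by default, so the explicit two-step form is mainly useful when the input uses another encoding or when invalid bytes should raise an error instead of being replaced.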

    --
    __ Erik Max Francis && && http://www.alcyone.com/max/
    / \ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    \__/ I do not promise to consider race or religion in my appointments.
    I promise only that I will not consider them. -- John F. Kennedy
    Erik Max Francis, Jan 22, 2004
