Encoding Questions

Discussion in 'Python' started by jalil@feghhi.com, Apr 19, 2005.

  1. Guest

    1. I download a page in Python using urllib and now want to convert it
    to utf-8 and keep it that way. I already know the original encoding of
    the page. What calls should I make to convert the page to utf-8?
    For example, let's say the page is encoded in gb2312 (Simplified
    Chinese) and I want to keep it in utf-8.

    2. Is this a good approach? Can I keep any pages in any languages in
    this way and return them when requested using utf-8 encoding?

    3. Does python 2.4 support all encodings?

    By the way, I have set my default encoding in Python to utf8.

    I appreciate any help.

    -JF
    , Apr 19, 2005
    #1

  2. Kent Johnson Guest

    wrote:
    > 1. I download a page in Python using urllib and now want to convert it
    > to utf-8 and keep it that way. I already know the original encoding of
    > the page. What calls should I make to convert the page to utf-8?
    > For example, let's say the page is encoded in gb2312 (Simplified
    > Chinese) and I want to keep it in utf-8.


    Something like
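    import urllib
    data = urllib.urlopen(...).read()        # raw bytes in the page's encoding
    unicodeData = data.decode('gb2312')      # byte string -> unicode object
    utf8Data = unicodeData.encode('utf-8')   # unicode object -> utf-8 byte string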

    You may want to supply the errors parameter to decode() or encode(); see the docs for details.
    http://docs.python.org/lib/string-methods.html

    > 2. Is this a good approach? Can I keep any pages in any languages in
    > this way and return them when requested using utf-8 encoding?


    Yes, as long as you know reliably what the encoding is for the source pages.

    > 3. Does python 2.4 support all encodings?


    I doubt it :) but it supports many encodings. The list is at
    http://docs.python.org/lib/standard-encodings.html

    Kent
    Kent Johnson, Apr 19, 2005
    #2
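The decode/encode round trip and the errors parameter mentioned above can be sketched as follows. This is a minimal sketch, not Kent's code: it assumes Python 3, where byte strings are `bytes` and Unicode objects are `str` (in the Python 2.4 of this thread the same methods live on `str` and `unicode`). The helper name `to_utf8` and the sample data are made up for illustration, and a literal byte string stands in for the downloaded page.

```python
def to_utf8(raw, source_encoding, errors="strict"):
    """Re-encode a byte string from source_encoding to UTF-8."""
    text = raw.decode(source_encoding, errors)   # bytes -> Unicode text
    return text.encode("utf-8")                  # Unicode text -> UTF-8 bytes


# Stand-in for the downloaded page: two Simplified Chinese characters as gb2312 bytes.
page = "\u4e2d\u6587".encode("gb2312")
utf8_page = to_utf8(page, "gb2312")

# Append a byte that is not valid gb2312. With errors="strict" (the default)
# decoding raises UnicodeDecodeError; with errors="replace" the bad byte
# becomes U+FFFD and the rest of the page survives.
damaged = page + b"\xff"
lenient = to_utf8(damaged, "gb2312", errors="replace")
```

Whether "strict", "replace", or "ignore" is the right choice depends on whether a damaged page should be rejected or stored with visible gaps.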

  3. <> wrote in message
    news:...
    | 1. I download a page in Python using urllib and now want to convert it
    | to utf-8 and keep it that way. I already know the original encoding of
    | the page. What calls should I make to convert the page to utf-8?
    | For example, let's say the page is encoded in gb2312 (Simplified
    | Chinese) and I want to keep it in utf-8.

    Something like:

    utf8_s = s.decode('gb2312').encode('utf-8')

    (with s being the gb2312-encoded Simplified Chinese string) should work.

    |
    | 2. Is this a good approach? Can I keep any pages in any languages in
    | this way and return them when requested using utf-8 encoding?
    |
    | 3. Does python 2.4 support all encodings?

    See http://docs.python.org/lib/standard-encodings.html for an overview.

    |
    | By the way, I have set my default encoding in Python to utf8.
    |

    Why would you want to do that?

    --

    Vincent Wehren

    |
    | I appreciate any help.
    |
    | -JF
    |
    vincent wehren, Apr 19, 2005
    #3
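Besides reading the list of standard encodings, one way to check whether a particular encoding name is known to the running interpreter is to ask the codec registry. A minimal sketch, assuming Python 3 (`codecs.lookup` also exists in Python 2); the helper name `is_supported` is made up for illustration.

```python
import codecs


def is_supported(encoding_name):
    """Return True if this Python installation has a codec for encoding_name."""
    try:
        codecs.lookup(encoding_name)   # raises LookupError for unknown names
    except LookupError:
        return False
    return True


# Check the encodings discussed in this thread before relying on them.
can_convert = is_supported("gb2312") and is_supported("utf-8")
```

This only says that a codec is registered under that name; it says nothing about whether a given page really uses that encoding.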
  4. Kent Johnson wrote:
    > Something like
    > data = urllib.urlopen(...).read()
    > unicodeData = data.decode('gb2312')
    > utf8Data = unicodeData.encode('utf-8')
    >
    > You may want to supply the errors parameter to decode() or encode(); see
    > the docs for details.
    > http://docs.python.org/lib/string-methods.html


    In addition, for an HTML page, you might need to update the META element
    for the content-type HTTP header. For an XHTML page, you might need to
    update/remove the XML declaration.

    Regards,
    Martin
    "Martin v. Löwis", Apr 19, 2005
    #4
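Martin's point about the META element and the XML declaration could be handled roughly like this once the page text has been decoded. It is a naive regular-expression sketch under stated assumptions, not a robust solution: it assumes Python 3, assumes the declaration appears in one of the common forms shown in the comments, and the function name `declare_utf8` is made up for illustration. A real HTML or XML parser would cope with more variations.

```python
import re

# Matches e.g. <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
# and <meta charset="gb2312">; group 1 is everything up to the charset value.
_META_CHARSET = re.compile(
    r'(<meta[^>]*charset\s*=\s*["\']?)([\w.:-]+)', re.IGNORECASE)

# Matches e.g. <?xml version="1.0" encoding="gb2312"?>
_XML_ENCODING = re.compile(
    r'(<\?xml[^>]*encoding\s*=\s*["\'])([\w.:-]+)', re.IGNORECASE)


def declare_utf8(markup):
    """Rewrite charset declarations in already-decoded markup to say utf-8."""
    markup = _META_CHARSET.sub(r'\g<1>utf-8', markup)
    markup = _XML_ENCODING.sub(r'\g<1>utf-8', markup)
    return markup
```

The rewrite is done on the decoded text, before encoding to utf-8, so that the stored bytes and the declaration inside them agree.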
  5. Guest

    Thanks for the replies. As for why I set my default encoding to utf-8
    in Python: I did it a while ago, I think because reading some utf-8
    strings from a database raised errors; there were some characters it
    couldn't recognize in the standard encoding. After I made the change,
    the errors no longer happened.

    Does it make sense?

    -JF
    , Apr 19, 2005
    #5
  6. wrote:
    > Thanks for the replies. As for why I set my default encoding to utf-8
    > in Python: I did it a while ago, I think because reading some utf-8
    > strings from a database raised errors; there were some characters it
    > couldn't recognize in the standard encoding. After I made the change,
    > the errors no longer happened.
    >
    > Does it make sense?


    No. If reading the strings from the database already gives an exception
    (i.e. without any processing of these strings), that is a bug in the
    database. It is also unlikely that this is what actually happened.

    More likely, you are reading the strings from the database, and then
    combining them explicitly with Unicode strings. Instead of changing
    the default encoding, you should tell your database adapter to return
    the strings as Unicode objects; if this is not supported, you should
    convert them to Unicode objects in the process of reading them.

    Regards,
    Martin
    "Martin v. Löwis", Apr 20, 2005
    #6
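The last suggestion above, converting to Unicode objects in the process of reading, might look like the sketch below when the database adapter hands back byte strings. It assumes Python 3 (`bytes` and `str` rather than the `str` and `unicode` of Python 2.4), assumes the column data is utf-8 as described in the thread, and uses a hand-written tuple in place of a real database row; `decode_row` is a made-up helper name.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of row with every byte string decoded to Unicode text."""
    return tuple(
        value.decode(encoding) if isinstance(value, bytes) else value
        for value in row
    )


# Hypothetical row as an adapter might return it: utf-8 bytes plus an integer.
raw_row = ("L\u00f6wis".encode("utf-8"), 42)
name, number = decode_row(raw_row)

# After decoding at the boundary, combining with other Unicode text needs
# no implicit conversion, so the default encoding never comes into play.
greeting = "Hello, " + name
```

Decoding once, right where the data enters the program, keeps the encoding explicit instead of hiding it in a process-wide default.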