Encoding Questions

J

jalil

1. I download a page in python using urllib and now want to convert and
keep it as utf-8? I already know the original encoding of the page.
What calls should I make to convert the encoding of the page to utf8?
For example, let's say the page is encoded in gb2312 (simple chinese)
and I want to keep it in utf-8?

2. Is this a good approach? Can I keep any pages in any languages in
this way and return them when requested using utf-8 encoding?

3. Does python 2.4 support all encodings?

By the way, I have set my default encoding in Python to utf8.

I appreciate any help.

-JF
 
K

Kent Johnson

1. I download a page in python using urllib and now want to convert and
keep it as utf-8? I already know the original encoding of the page.
What calls should I make to convert the encoding of the page to utf8?
For example, let's say the page is encoded in gb2312 (simple chinese)
and I want to keep it in utf-8?

Something like
data = urllib.url_open(...).read()
unicodeData = data.decode('gb2312')
utf8Data = unicodeData.encode('utf-8')

You may want to supply the errors parameter to decode() or encode(); see the docs for details.
http://docs.python.org/lib/string-methods.html
2. Is this a good approach? Can I keep any pages in any languages in
this way and return them when requested using utf-8 encoding?

Yes, as long as you know reliably what the encoding is for the source pages.
3. Does python 2.4 support all encodings?

I doubt it :) but it supports many encodings. The list is at
http://docs.python.org/lib/standard-encodings.html

Kent
 
V

vincent wehren

| 1. I download a page in python using urllib and now want to convert and
| keep it as utf-8? I already know the original encoding of the page.
| What calls should I make to convert the encoding of the page to utf8?
| For example, let's say the page is encoded in gb2312 (simple chinese)
| and I want to keep it in utf-8?

Something like:

utf8_s = s.decode('gb2312').encode('utf-8')

- with s being the simplified chinese string - should work.

|
| 2. Is this a good approach? Can I keep any pages in any languages in
| this way and return them when requested using utf-8 encoding?
|
| 3. Does python 2.4 support all encodings?

See http://docs.python.org/lib/standard-encodings.html for an overview.

|
| By the way, I have set my default encoding in Python to utf8.
|

Why would you want to do that?

--

Vincent Wehren

|
| I appreciate any help.
|
| -JF
|
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Kent said:
Something like
data = urllib.url_open(...).read()
unicodeData = data.decode('gb2312')
utf8Data = unicodeData.encode('utf-8')

You may want to supply the errors parameter to decode() or encode(); see
the docs for details.
http://docs.python.org/lib/string-methods.html

In addition, for an HTML page, you might need to update the META element
for the content-type HTTP header. For an XHTML page, you might need to
update/remove the XML declaration.

Regards,
Martin
 
J

jalil

thanks for the replies. As for why I set my default encoding to utf-8
in python, I did it a while ago and I think I did it because when I was
reading some strings from database in utf-8 it raised errors b/c there
were some chars it could recongnize in standard encoding. When I made
the change, the error didn't happen anymore.

Does it make sense?

-JF
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

thanks for the replies. As for why I set my default encoding to utf-8
in python, I did it a while ago and I think I did it because when I was
reading some strings from database in utf-8 it raised errors b/c there
were some chars it could recongnize in standard encoding. When I made
the change, the error didn't happen anymore.

Does it make sense?

No. If reading the strings from the database already gives an exception
(i.e. without any processing of these strings), that is a bug in the
database. It is also unlikely that this is what actually happened.

More likely, you are reading the strings from the database, and then
combining them explicitly with Unicode strings. Instead of changing
the default encoding, you should tell your database adapter to return
the strings as Unicode objects; if this is not supported, you should
convert them to Unicode objects in the process of reading them.

Regards,
Martin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,754
Messages
2,569,526
Members
44,997
Latest member
mileyka

Latest Threads

Top