Encoding Questions

jalil · Apr 19, 2005

1. I download a page in python using urllib and now want to convert and
keep it as utf-8? I already know the original encoding of the page.
What calls should I make to convert the encoding of the page to utf8?
For example, let's say the page is encoded in gb2312 (Simplified Chinese)
and I want to keep it in utf-8?

2. Is this a good approach? Can I keep any pages in any languages in
this way and return them when requested using utf-8 encoding?

3. Does python 2.4 support all encodings?

By the way, I have set my default encoding in Python to utf8.

I appreciate any help.

-JF

Kent Johnson · Apr 19, 2005

| 1. I download a page in python using urllib and now want to convert and
| keep it as utf-8? I already know the original encoding of the page.
| What calls should I make to convert the encoding of the page to utf8?
| For example, let's say the page is encoded in gb2312 (Simplified Chinese)
| and I want to keep it in utf-8?

Something like
data = urllib.urlopen(...).read()
unicodeData = data.decode('gb2312')
utf8Data = unicodeData.encode('utf-8')

You may want to supply the errors parameter to decode() or encode(); see the docs for details.
http://docs.python.org/lib/string-methods.html
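
To illustrate the errors parameter, here is a minimal sketch of the same decode/encode round trip. It is written so that it also runs on current Python 3, where the downloaded data is a bytes object. The helper name to_utf8 and the sample bytes are made up for illustration; the bytes stand in for what read() would return.

```python
def to_utf8(data, source_encoding, errors="strict"):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    text = data.decode(source_encoding, errors)
    return text.encode("utf-8")

# Stand-in for downloaded data: two Chinese characters encoded as GB2312.
gb_bytes = b"\xd6\xd0\xce\xc4"
utf8_bytes = to_utf8(gb_bytes, "gb2312")

# A truncated multi-byte sequence raises UnicodeDecodeError under "strict";
# "replace" substitutes U+FFFD for the undecodable part instead.
damaged = gb_bytes + b"\xd6"
repaired = to_utf8(damaged, "gb2312", errors="replace")
```

Whether "replace", "ignore", or a hard failure is right depends on how much you trust the declared encoding of the page.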

| 2. Is this a good approach? Can I keep any pages in any languages in
| this way and return them when requested using utf-8 encoding?

Yes, as long as you know reliably what the encoding is for the source pages.

| 3. Does python 2.4 support all encodings?

I doubt it, but it supports many encodings. The list is at
http://docs.python.org/lib/standard-encodings.html
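
If you would rather ask the interpreter than read the list, a small sketch along these lines should tell you whether a given name is known to your installation. The helper name is_supported is hypothetical; codecs.lookup() is the standard-library call doing the work.

```python
import codecs

def is_supported(encoding_name):
    """Return True if this Python installation has a codec for the name."""
    try:
        codecs.lookup(encoding_name)
        return True
    except LookupError:
        return False

for name in ("gb2312", "utf-8", "no-such-encoding"):
    print(name, is_supported(name))
```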

Kent

vincent wehren · Apr 19, 2005

| 1. I download a page in python using urllib and now want to convert and
| keep it as utf-8? I already know the original encoding of the page.
| What calls should I make to convert the encoding of the page to utf8?
| For example, let's say the page is encoded in gb2312 (Simplified Chinese)
| and I want to keep it in utf-8?

Something like:

utf8_s = s.decode('gb2312').encode('utf-8')

- with s being the simplified chinese string - should work.

|
| 2. Is this a good approach? Can I keep any pages in any languages in
| this way and return them when requested using utf-8 encoding?
|
| 3. Does python 2.4 support all encodings?

See http://docs.python.org/lib/standard-encodings.html for an overview.

|
| By the way, I have set my default encoding in Python to utf8.
|

Why would you want to do that?

--

Vincent Wehren

|
| I appreciate any help.
|
| -JF
|

Martin v. Löwis · Apr 19, 2005

Kent said:
| Something like
| data = urllib.urlopen(...).read()
| unicodeData = data.decode('gb2312')
| utf8Data = unicodeData.encode('utf-8')
|
| You may want to supply the errors parameter to decode() or encode(); see
| the docs for details.
| http://docs.python.org/lib/string-methods.html

In addition, for an HTML page, you might need to update the META element
for the content-type HTTP header. For an XHTML page, you might need to
update/remove the XML declaration.
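
As a rough sketch of the META update, assuming the page declares its charset in the usual content attribute: the regular expression below is a crude stand-in for a real HTML parser, the name declare_utf8 is made up for illustration, and an XML declaration would need separate handling.

```python
import re

# Matches the charset value inside a declaration such as
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
CHARSET_RE = re.compile(r"(charset\s*=\s*[\"']?)([\w.:-]+)", re.IGNORECASE)

def declare_utf8(html_text):
    """Rewrite the first charset declaration in the text to say utf-8."""
    return CHARSET_RE.sub(lambda m: m.group(1) + "utf-8", html_text, count=1)

page = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
updated = declare_utf8(page)
```

Apply this to the decoded text, before encoding it as UTF-8, so the declaration and the actual bytes agree.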

Regards,
Martin

jalil · Apr 19, 2005

Thanks for the replies. As for why I set my default encoding to utf-8
in Python: I did it a while ago, and I think I did it because when I was
reading some utf-8 strings from a database it raised errors because there
were some chars it couldn't recognize in the standard encoding. When I made
the change, the error didn't happen anymore.

Does it make sense?

-JF

Martin v. Löwis · Apr 20, 2005

| Thanks for the replies. As for why I set my default encoding to utf-8
| in Python: I did it a while ago, and I think I did it because when I was
| reading some utf-8 strings from a database it raised errors because there
| were some chars it couldn't recognize in the standard encoding. When I made
| the change, the error didn't happen anymore.
|
| Does it make sense?

No. If reading the strings from the database already gives an exception
(i.e. without any processing of these strings), that is a bug in the
database. It is also unlikely that this is what actually happened.

More likely, you are reading the strings from the database, and then
combining them explicitly with Unicode strings. Instead of changing
the default encoding, you should tell your database adapter to return
the strings as Unicode objects; if this is not supported, you should
convert them to Unicode objects in the process of reading them.
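
A minimal sketch of converting at the point of reading, assuming the adapter hands back UTF-8 byte strings. fetch_raw_rows is a hypothetical stand-in for the real database call, and the sample values are invented.

```python
def fetch_raw_rows():
    """Hypothetical stand-in for an adapter that returns UTF-8 byte strings."""
    return [b"caf\xc3\xa9", b"na\xc3\xafve"]

def fetch_rows(encoding="utf-8"):
    """Decode each value once, at the boundary, so callers only see Unicode."""
    return [raw.decode(encoding) for raw in fetch_raw_rows()]

# Everything past this point is Unicode text, so combining it with other
# Unicode strings needs no change to the default encoding.
labels = [u"item: " + value for value in fetch_rows()]
```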

Regards,
Martin
