convert strings to utf-8

Niclas · Feb 25, 2007

Hi

I'm having trouble working with the special characters in Swedish (Å Ä Ö
å ä ö). The script parses a webpage and extracts information from it.
This works fine and I get all the data correctly. The information is
then added to an RSS file (using xml.dom.minidom.Document() to create the
file), and this is where it goes wrong. Letters like Å ä ö get messed up and
the RSS file does not validate. How can I convert the data to UTF-8
without losing the special letters?

Thanks in advance

Diez B. Roggisch · Feb 25, 2007

Niclas said:
Hi

I'm having trouble working with the special characters in Swedish (Å Ä Ö
å ä ö). The script parses a webpage and extracts information from it.
This works fine and I get all the data correctly. The information is
then added to an RSS file (using xml.dom.minidom.Document() to create the
file), and this is where it goes wrong. Letters like Å ä ö get messed up and
the RSS file does not validate. How can I convert the data to UTF-8
without losing the special letters?

Show us the code and some example text (although I know it is difficult
to get that right over news/mail).

The basic idea is this:

# 'website-encoding' stands for whatever encoding the page really uses
scraped_byte_string = scrape_the_website()

output = scraped_byte_string.decode('website-encoding').encode('utf-8')
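
A minimal sketch of that decode/encode round trip, written for Python 3 (where `bytes` is the byte string and `str` plays the role of Python 2's `unicode`). The helper name `to_utf8` and the sample bytes are made up for illustration, and Latin-1 is only an assumed source encoding; the real page may use something else.

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from the page's encoding, then re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)  # bytes -> text
    return text.encode("utf-8")               # text -> UTF-8 bytes


# Hypothetical sample: "Å ä ö" as a Latin-1 page would deliver it.
latin1_page = b"\xc5 \xe4 \xf6"
utf8_bytes = to_utf8(latin1_page, "latin-1")
```

Each of the three letters takes one byte in Latin-1 and two bytes in UTF-8, which is why data that skips the decode step looks garbled to a UTF-8 reader.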

Diez

Martin v. Löwis · Feb 25, 2007

Niclas said:
I'm having trouble working with the special characters in Swedish (Å Ä Ö
å ä ö). The script parses a webpage and extracts information from it.
This works fine and I get all the data correctly. The information is
then added to an RSS file (using xml.dom.minidom.Document() to create the
file), and this is where it goes wrong. Letters like Å ä ö get messed up and
the RSS file does not validate. How can I convert the data to UTF-8
without losing the special letters?

You should convert the strings from the webpage to Unicode strings.
You can tell that a string is Unicode if

print isinstance(s,unicode)

prints True. Make sure *every* string you put into the Document
actually is a Unicode string. Then it will just work fine.
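
As a rough illustration of that advice, here is a Python 3 sketch in which the scraped bytes are decoded before they reach the DOM. The sample bytes and the element names are invented for the example, and Latin-1 is an assumed source encoding. In Python 3 the check corresponding to the one above is `isinstance(s, str)`.

```python
from xml.dom import minidom

raw_title = b"Sm\xf6rg\xe5sbord"          # hypothetical Latin-1 bytes from the page
title_text = raw_title.decode("latin-1")  # decode before touching the DOM

doc = minidom.Document()
item = doc.createElement("item")
title = doc.createElement("title")
title.appendChild(doc.createTextNode(title_text))  # only text goes into the tree
item.appendChild(title)
doc.appendChild(item)

xml_text = doc.toxml()                    # text, no encoding chosen yet
xml_bytes = doc.toxml(encoding="utf-8")   # bytes, with a matching XML declaration
```

Decoding once at the boundary keeps the tree free of mixed byte and text strings, so the encoding is chosen in exactly one place when the document is serialized.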

Regards,
Martin

Niclas · Feb 25, 2007

Thank you!

solved it with this:
unicode( data.decode('latin_1') )
and when I write it to the file...
f = codecs.open(path, encoding='utf-8', mode='w+')
f.write(self.__rssDoc.toxml())

Diez B. Roggisch · Feb 25, 2007

Niclas said:
Thank you!

solved it with this:
unicode( data.decode('latin_1') )

The unicode around this is superfluous. Either do

unicode(bytestring, encoding)

or

bytestring.decode(encoding)
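
A short Python 3 sketch of why the extra wrapper adds nothing. Here `str` stands in for Python 2's `unicode`; the sample bytes and the Latin-1 encoding are assumptions made for the example.

```python
bytestring = b"\xc5\xe4\xf6"  # hypothetical Latin-1 bytes for "Åäö"
encoding = "latin-1"

via_constructor = str(bytestring, encoding)  # Python 2: unicode(bytestring, encoding)
via_method = bytestring.decode(encoding)     # same spelling in Python 2

# Wrapping an already-decoded string a second time changes nothing.
wrapped = str(via_method)                    # Python 2: unicode(bytestring.decode(encoding))
```

Both spellings give the same text, and the outer call simply returns an equal string.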

and when I write it to the file...
f = codecs.open(path, encoding='utf-8', mode='w+')
f.write(self.__rssDoc.toxml())

Looks good, yes.
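
For reference, a Python 3 sketch of the same write step. It uses the built-in `open` with `encoding=`, which is the Python 3 counterpart of the `codecs.open` call above; the temporary path and the one-element document are made up for the example.

```python
import os
import tempfile
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("title")
root.appendChild(doc.createTextNode("\u00c5 \u00e4 \u00f6"))  # "Å ä ö"
doc.appendChild(root)

path = os.path.join(tempfile.mkdtemp(), "feed.xml")  # hypothetical output path

# The file object takes text and encodes it to UTF-8 on the way out.
with open(path, mode="w", encoding="utf-8") as f:
    f.write(doc.toxml())

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    written = f.read()
```

Called without arguments, `toxml()` emits an XML declaration with no encoding attribute; XML parsers treat such a file as UTF-8 by default, so the bytes and the declaration agree.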

Diez

John Nagle · Feb 26, 2007

Diez said:
The unicode around this is superfluous.

Worse, it's an error. UTF-8 needs to go into a stream
of 8-bit bytes, not a Unicode string.

John Nagle
