convert strings to utf-8

Niclas

Hi

I'm having trouble working with the special characters in Swedish (Å Ä Ö
å ä ö). The script is parsing and extracting information from a webpage.
This works fine and I get all the data correctly. The information is
then added to an RSS file (using xml.dom.minidom.Document() to create the
file), and this is where it goes wrong. Letters like Å ä ö get messed up and
the RSS file does not validate. How can I convert the data to UTF-8
without losing the special letters?

Thanks in advance
 
Diez B. Roggisch

Niclas said:
Hi

I'm having trouble working with the special characters in Swedish (Å Ä Ö
å ä ö). The script is parsing and extracting information from a webpage.
This works fine and I get all the data correctly. The information is
then added to an RSS file (using xml.dom.minidom.Document() to create the
file), and this is where it goes wrong. Letters like Å ä ö get messed up and
the RSS file does not validate. How can I convert the data to UTF-8
without losing the special letters?

Show us code, and example text (albeit I know it is difficult to get
that right using news/mail)

The basic idea is this:

scraped_byte_string = scrape_the_website()

output = scraped_byte_string.decode('website-encoding').encode('utf-8')
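
A minimal sketch of that decode-then-encode step, written for Python 3 (where byte strings are `bytes` and Unicode strings are `str`). The sample text and the assumption that the page is served as Latin-1 are illustrative, not taken from Niclas's script:

```python
# Hypothetical stand-in for the scraped page: Swedish text as Latin-1 bytes.
scraped_bytes = "Å Ä Ö å ä ö".encode("latin-1")

# Decode with the encoding the website actually uses (assumed Latin-1 here) ...
text = scraped_bytes.decode("latin-1")

# ... then encode the resulting Unicode text to UTF-8 for output.
utf8_bytes = text.encode("utf-8")

# The byte counts differ because these letters need more bytes in UTF-8.
print(len(scraped_bytes), len(utf8_bytes))
```

Decoding with the wrong source encoding is the usual cause of garbled letters, so the encoding name should come from the page's HTTP headers or meta tag rather than a guess.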



Diez
 
Martin v. Löwis

Niclas said:
I'm having trouble working with the special characters in Swedish (Å Ä Ö
å ä ö). The script is parsing and extracting information from a webpage.
This works fine and I get all the data correctly. The information is
then added to an RSS file (using xml.dom.minidom.Document() to create the
file), and this is where it goes wrong. Letters like Å ä ö get messed up and
the RSS file does not validate. How can I convert the data to UTF-8
without losing the special letters?

You should convert the strings from the webpage to Unicode strings.
You can tell that a string is Unicode if

print isinstance(s,unicode)

prints True. Make sure *every* string you put into the Document
actually is a Unicode string. Then it will just work fine.
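
As a rough illustration of that advice, here is a sketch for Python 3, where `str` plays the role of Python 2's `unicode`. The element names and the sample title are made up for the example:

```python
from xml.dom import minidom

# Illustrative data: Latin-1 bytes decoded to a Unicode string before use.
title_text = b"Sm\xf6rg\xe5sbord".decode("latin-1")  # "Smörgåsbord"

doc = minidom.Document()
rss = doc.createElement("rss")
doc.appendChild(rss)
title = doc.createElement("title")
# Only Unicode text goes into the document, never raw encoded bytes.
title.appendChild(doc.createTextNode(title_text))
rss.appendChild(title)

# Passing an encoding makes toxml() return bytes with a matching XML declaration.
xml_bytes = doc.toxml(encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

The point is that the encoding decision is made once, at serialization time, instead of on every string that goes into the tree.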

Regards,
Martin
 
Niclas

Thank you!

solved it with this:
unicode( data.decode('latin_1') )
and when I write it to the file...
f = codecs.open(path, encoding='utf-8', mode='w+')
f.write(self.__rssDoc.toxml())

 
Diez B. Roggisch

Niclas said:
Thank you!

solved it with this:
unicode( data.decode('latin_1') )

The unicode around this is superfluous. Either do

unicode(bytestring, encoding)

or

bytestring.decode(encoding)
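
The two spellings can be compared directly. This sketch uses Python 3, where `str(bytestring, encoding)` corresponds to the Python 2 call `unicode(bytestring, encoding)`; the sample bytes are illustrative:

```python
bytestring = b"\xc5 \xe4 \xf6"  # "Å ä ö" encoded as Latin-1 (illustrative sample)
encoding = "latin-1"

# Constructor form (Python 2 equivalent: unicode(bytestring, encoding)).
via_constructor = str(bytestring, encoding)

# Method form.
via_method = bytestring.decode(encoding)

# Both produce the same Unicode string, so wrapping one in the other adds nothing.
print(via_constructor == via_method)
```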

and when I write it to the file...
f = codecs.open(path, encoding='utf-8', mode='w+')
f.write(self.__rssDoc.toxml())


Looks good, yes.
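
For readers on Python 3, a comparable sketch of the write step follows. It uses `io.open` with an `encoding` argument in place of `codecs.open`, and the document contents and temporary path are placeholders, not Niclas's actual feed:

```python
import io
import os
import tempfile
from xml.dom import minidom

doc = minidom.Document()
item = doc.createElement("item")
item.appendChild(doc.createTextNode("Å ä ö"))
doc.appendChild(item)

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "feed.xml")

# The file object encodes the Unicode text to UTF-8 as it is written.
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(doc.toxml())

# Read the raw bytes back to confirm the letters were stored as UTF-8.
with open(path, "rb") as f:
    raw = f.read()
print("Å ä ö".encode("utf-8") in raw)
```

Called without an encoding, `toxml()` returns text whose XML declaration names no encoding, which XML parsers treat as UTF-8 by default, so it matches the file as written here.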

Diez
 
John Nagle

Diez said:
The unicode around this is superfluous.

Worse, it's an error. utf-8 needs to go into a stream
of 8-bit bytes, not a Unicode string.

John Nagle
 
