UTF-8 to unicode or latin-1 (and yes, I read the FAQ)

NoelByron

Hi!

I'm struggling with the conversion of a UTF-8 string to latin-1. As far
as I know the way to go is to decode the UTF-8 string to unicode and
then encode it back again to latin-1?

So I tried:

'K\xc3\xb6ni'.decode('utf-8')  # 'K\xc3\xb6ni' should be 'König', contains a german 'umlaut'

but failed since python assumes every string to decode to be ASCII?

How can I convert this string to latin-1?

How would you write a function like:

def encode_string(string, from_encoding, to_encoding):
    # ????

Best regards,
Noel
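
One possible answer to the encode_string question, as a minimal sketch rather than a definitive implementation: decode the bytes with the source encoding, then encode the resulting text with the target encoding. The parameter name `data` and the sample values are illustrative assumptions. The thread itself uses Python 2, but written with b'' literals the sketch should behave the same under Python 2.6+ and Python 3.

```python
def encode_string(data, from_encoding, to_encoding):
    """Re-encode a byte string from one character encoding to another."""
    # Step 1: bytes -> unicode text, interpreting the bytes as from_encoding.
    text = data.decode(from_encoding)
    # Step 2: unicode text -> bytes in the target encoding.
    return text.encode(to_encoding)


# 'König' as UTF-8 bytes: the o-umlaut takes two bytes, \xc3\xb6.
utf8_bytes = b'K\xc3\xb6nig'
latin1_bytes = encode_string(utf8_bytes, 'utf-8', 'latin-1')
```

In latin-1 the same character is the single byte \xf6, so the result is one byte shorter than the input. Characters that do not exist in the target encoding make the encode step raise UnicodeEncodeError unless an error handler such as 'replace' is passed.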
 
Fredrik Lundh

> I'm struggling with the conversion of a UTF-8 string to latin-1. As far
> as I know the way to go is to decode the UTF-8 string to unicode and
> then encode it back again to latin-1?
>
> So I tried:
>
> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',

"Köni", to be precise.

> contains a german 'umlaut'
>
> but failed since python assumes every string to decode to be ASCII?

that should work, and it sure works for me:
Köni

what did you do, and how did it fail?

</F>
 
Duncan Booth

> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
> contains a german 'umlaut'
>
> but failed since python assumes every string to decode to be ASCII?

No, Python would assume the string to be utf-8 encoded in this case:
'K\xf6ni'

Your code must have failed somewhere else. Try posting actual failing code
and actual traceback.
 
NoelByron

"Köni", to be precise.

Äh, yes.
;o)
that should work, and it sure works for me:

Köni

what did you do, and how did it fail?

First, thank you so much for answering so fast. I proposed Python for a
project and it would be very embarrassing for me if I failed to
convert a UTF-8 string to latin-1.

I realized that my problem is not the decoding from UTF-8. The exception
is raised by print when I try to print the unicode string.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 1: ordinal not in range(128)

But that is not a problem at all since I can now turn my UTF-8 strings
to unicode! Once again the problem was sitting right in front of my
screen. Silly me...
;o)

Again, thank you for your reply!

Best regards,
Noel
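
The failure described here can be reproduced without print by doing the encoding step explicitly. This is only a sketch under stated assumptions: the variable names are illustrative, and the code is written to run on Python 3 as well as Python 2.6+, whereas the thread itself is about Python 2's implicit ASCII encoding on print.

```python
# Decoding succeeds: the bytes are valid UTF-8.
text = b'K\xc3\xb6ni'.decode('utf-8')

# Encoding the result to ASCII is the step that fails, because the
# o-umlaut (U+00F6) has no ASCII representation.
failure = None
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    failure = exc

# Choosing the target encoding (or an error handler) explicitly avoids
# depending on whatever the output stream defaults to.
replaced = text.encode('ascii', 'replace')  # unencodable characters become '?'
latin1 = text.encode('latin-1')             # o-umlaut becomes the byte \xf6
```

The `start` attribute of the caught exception is the index of the offending character, which corresponds to the "position 1" in the traceback above.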
 
NoelByron

Duncan said:
No, Python would assume the string to be utf-8 encoded in this case:

'K\xf6ni'

Your code must have failed somewhere else. Try posting actual failing code
and actual traceback.

You are right. My test code was:

print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeEncodeError. I didn't realize that
the exception was actually raised by print and thought it was the
decode. That also explains why passing 'ignore' to decode had no
effect at all.

Thank you for helping!

Best regards,
Noel
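
A small sketch of why the 'ignore' handler made no difference: the handler only matters when the decode step meets invalid input, and these bytes are valid UTF-8. Splitting the work into separate statements also makes it obvious which step raises. The variable names and the malformed sample are illustrative assumptions.

```python
raw = b'K\xc3\xb6ni'

# Step 1: decode. For valid UTF-8 the error handler is never consulted,
# so strict and 'ignore' decoding give the same text.
strict = raw.decode('utf-8')
ignored = raw.decode('utf-8', 'ignore')
same_result = (strict == ignored)

# The handler only changes the outcome for malformed input, for example
# a lead byte \xc3 that is not followed by a continuation byte.
broken = b'K\xc3ni'
salvaged = broken.decode('utf-8', 'ignore')

# Step 2 (separate statement): encode for output. An error raised here
# cannot be mistaken for a decode error.
encoded = strict.encode('latin-1')
```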
 
Michael Ströder

print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeEncodeError.

Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
print this is implicitly converted to string. The char set used depends
on your console.

Ciao, Michael.
 
NoelByron

Michael said:
Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
print this is implicitly converted to string. The char set used depends
on your console

And that was the problem. I'm developing with Eclipse (PyDev). The
console is integrated in the development environment. When I print a
unicode string, Python tries to encode it to ASCII, and since the string
contains non-ASCII characters, it fails. That is no problem if you are
aware of this.

My mistake was that I thought the exception was raised by my call to
decode('UTF-8') because print and decode were on the same line and I
thought print could never raise an exception. Seems like I've learned
something today.

Best regards,
Noel
 
Neil Cerutti

Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
object. With print this is implicitly converted to string. The
char set used depends on your console

No, the setting of the console encoding (sys.stdout.encoding) is
ignored. It's a good thing, too, since it's pretty flaky. It uses
sys.getdefaultencoding(), which is always 'ascii' as far as I
know.
 
Neil Cerutti

No, the setting of the console encoding (sys.stdout.encoding) is
ignored.

Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'

Interesting! Thanks for the correction.
 
Marc 'BlackJack' Rintsch

No, the setting of the console encoding (sys.stdout.encoding) is
ignored.

Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'


Ciao,
Marc 'BlackJack' Rintsch
 
Neil Cerutti

No, the setting of the console encoding (sys.stdout.encoding) is
ignored.

Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'

OK, I was thinking of the behavior of file.write(s). Thanks again
for the correction.
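
Since the thread ends on the difference between print (which consults the stream's encoding) and file.write (which does not), one way to avoid depending on either behaviour is to encode explicitly before writing. The helper below is a hypothetical sketch, not part of any library: the name `encode_for_stream`, the 'ascii' fallback and the 'replace' error handler are all assumptions made for illustration.

```python
def encode_for_stream(text, stream, fallback='ascii'):
    """Encode text using the stream's declared encoding, if it has one.

    Hypothetical helper: falls back to `fallback` when the stream has no
    `encoding` attribute or it is None (as can happen with pipes or
    IDE-integrated consoles), and replaces unencodable characters with
    '?' instead of raising UnicodeEncodeError.
    """
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')


class FakeStream(object):
    """Stand-in for a file-like object, used only for illustration."""

    def __init__(self, encoding):
        self.encoding = encoding


text = b'K\xc3\xb6nig'.decode('utf-8')
for_latin1_console = encode_for_stream(text, FakeStream('latin-1'))
for_unknown_console = encode_for_stream(text, FakeStream(None))
```

With a real stream one would pass sys.stdout instead of FakeStream; whether the resulting bytes can be written to it directly depends on the Python version and on how the stream was opened.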
 
