Unicode drives me crazy...

fowlertrainer

Hi !

I want to collect WMI information from Windows machines.
I use Python with the Hungarian (iso-8859-2) charset.

So I wrote a small utility for it, because I want to write the data to an XML file.

def ToHU(s,NoneStr='-'):
    if s==None: s=NoneStr
    if not (type(s) in [type(''),type(u'')]):
        s=str(s)
    if type(s)<>type(u''):
        s=unicode(s)
    s=s.replace(chr(0),' ');
    s=s.encode('iso-8859-2')
    return s

This function works in general, but I got an error with this value:
'Kommunik\xe1ci\xf3s port (COM1)'

This routine demonstrates the problem:

s='Kommunik\xe1ci\xf3s port (COM1)'
print s
print type(s)
print type(u'aaa')
s=unicode(s) # error !

This is driving me mad.
How do I convert any object to a string, and encode it as
iso-8859-2 (if needed)?

Please help me !

Thanks for the help,
ft
 
Sybren Stuvel

(e-mail address removed) enlightened us with:
I want to get the WMI infos from Windows machines.
I use Py from HU (iso-8859-2) charset.

Why not use Unicode for everything?

Sybren
 
Fuzzyman

s is a 'byte string': a series of characters encoded as bytes. (As is
every string on some level.) To convert that to a unicode
object, Python needs to know which encoding was used. In other words, it
needs to know which character each byte represents.

See this :

>>> t = s.decode('iso-8859-2')
>>> t
u'Kommunik\xe1ci\xf3s port (COM1)'
>>> print t
Kommunikációs port (COM1)
>>> print type(s)
<type 'str'>
>>> print type(t)
<type 'unicode'>

The decode call converts s into a unicode string, where Python
knows what every character is. If you call unicode() with no encoding
specified, Python falls back to the system default, which is *probably*
'ascii'. Your string contains bytes (\xe1 and \xf3) that are outside the
ASCII range, so the ascii codec cannot decode them and Python reports an
error.
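
To make the failure concrete, here is a minimal sketch of the same two conversions. It assumes a current Python 3 interpreter, where `bytes` plays the role of this thread's `str` and `str` plays the role of `unicode`; the variable names are only illustrative.

```python
# Assumption: Python 3, where bytes ~ the thread's `str` and str ~ `unicode`.
raw = b'Kommunik\xe1ci\xf3s port (COM1)'

# Decoding with the codec the data was written in succeeds.
text = raw.decode('iso-8859-2')

# Decoding as ASCII fails: 0xe1 and 0xf3 are outside the 7-bit range.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(text))
print(type(ascii_error).__name__)
```

The ASCII attempt is the explicit form of what unicode(s) does implicitly in the original routine when the default encoding is 'ascii'.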

Does this help ?

Once you 'get unicode', Python support for it is pretty easy. It's a
slightly complicated subject though. Basically you need to *know* what
encoding is being used, and whenever you convert between unicode and
byte-strings you need to specify it.
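
As a rough illustration of "always specify the encoding", here is one way the ToHU helper could be reworked. This is only a sketch under assumptions: it targets Python 3 (`bytes`/`str` instead of `str`/`unicode`), the name `to_hu` and the `source_encoding` parameter are hypothetical, and it assumes incoming byte strings really are iso-8859-2.

```python
def to_hu(value, none_str='-', source_encoding='iso-8859-2'):
    """Return `value` as iso-8859-2 encoded bytes (hypothetical helper)."""
    if value is None:
        value = none_str
    if isinstance(value, bytes):
        # Byte strings are decoded with a stated encoding, never a guessed one.
        text = value.decode(source_encoding)
    elif isinstance(value, str):
        text = value
    else:
        # Numbers and other objects: use their text representation.
        text = str(value)
    # Replace NUL characters with spaces, as the original ToHU did.
    text = text.replace('\x00', ' ')
    return text.encode('iso-8859-2')


print(to_hu(b'Kommunik\xe1ci\xf3s port (COM1)'))
print(to_hu(None))
print(to_hu(42))
```

Note that the final encode step can still raise UnicodeEncodeError for characters that iso-8859-2 cannot represent; passing an error handler such as 'replace' to encode() is one option if that matters.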

What can complicate matters is that there are lots of times an
*implicit* conversion can take place. Adding strings to unicode
objects, printing strings, or writing them to a file are the usual
times implicit conversion can happen. If you haven't specified an
encoding, then Python has to use the system default or the file object
default (sys.stdout often has a different default encoding than the one
returned by sys.getdefaultencoding()). It is these implicit conversions
that often cause the 'UnicodeDecodeError's and 'UnicodeEncodeError's.
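
If you want to see which defaults are in play on your own machine, a small check like the following may help. It is a sketch only; the values differ between installations and Python versions, and `sys.stdout` may have no usable encoding at all when output is redirected.

```python
import sys

# The interpreter-wide default used for implicit conversions.
default_encoding = sys.getdefaultencoding()

# The encoding attached to standard output; may be None or missing
# when stdout is redirected or replaced by another object.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

print('default:', default_encoding)
print('stdout: ', stdout_encoding)
```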

HTH

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python
 
Fuzzyman

At some point you have to convert - esp. when writing data out to file.
If you receive data as a byte string and have to store it as a byte
string, it is sometimes convenient to *not* convert in the middle.

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python
 
John Roth

As Tim Golden already explained, you're getting a unicode
object from the WMI interface. The best design help I can
give is to either convert it to iso-8859-2 at the point you
get the object and do your entire program with iso-8859-2
encoded strings, or do your entire program with unicode
objects and encode them as iso-8859-2 strings whenever
you want to write them out. Trying to do your conversion
in the middle will lead to excessive complexity, with the
resulting debugging problems.
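
The second design (unicode everywhere, encode only on output) might look roughly like this. It is a sketch, not a drop-in solution: it assumes Python 3, the function name `write_device_names` and the `devices`/`device` element names are made up for illustration, and the sample value stands in for whatever the WMI interface returns.

```python
import io
import os
import tempfile
from xml.sax.saxutils import escape


def write_device_names(path, names, encoding='iso-8859-2'):
    """Write text names into a tiny XML file, encoding only on output."""
    # The file object does the encoding; the program only handles text.
    with io.open(path, 'w', encoding=encoding) as out:
        out.write(u'<?xml version="1.0" encoding="%s"?>\n' % encoding)
        out.write(u'<devices>\n')
        for name in names:
            out.write(u'  <device>%s</device>\n' % escape(name))
        out.write(u'</devices>\n')


# Input boundary: decode bytes once, as soon as they arrive.
raw = b'Kommunik\xe1ci\xf3s port (COM1)'
name = raw.decode('iso-8859-2')

path = os.path.join(tempfile.mkdtemp(), 'devices.xml')
write_device_names(path, [name])

# Read the file back as bytes to confirm what was written to disk.
with io.open(path, 'rb') as check:
    data = check.read()
print(raw in data)
```

Everything between the decode at the top and the file write at the bottom works on text only, which is the point of keeping the conversions at the edges.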

If you do go the unicode route, you must remember that
any method or function that's defined to return a byte string will
most likely throw an exception when it is handed a unicode object
containing non-ASCII characters. This includes str()! Whether
or not the print statement will work depends on a number
of factors in how your Python installation was set up.

HTH

John Roth
 
