os.popen encoding!

SMALLp

Hi.
I'm playing with the os.popen function.
a = os.popen("somecmd").read()

If one of the lines contains characters like "è", "æ" or any other, it looks
like this: "velja\xe8a 2009", with that "\xe8". It prints fine if I do:

for i in a:
    print i

How do I solve this, and where exactly is the problem: with print or with read?
Windows XP, Python 2.5.4

Thanks!
 
Gabriel Genellina

SMALLp said:
Hi.
I'm playing with the os.popen function.
a = os.popen("somecmd").read()

If one of the lines contains characters like "è", "æ" or any other, it looks
like this: "velja\xe8a 2009", with that "\xe8". It prints fine if I do:

for i in a:
    print i

'\xe8' is a *single* byte (not four). It is the 'LATIN SMALL LETTER E WITH
GRAVE' Unicode code point u'è' encoded in the Windows-1252 encoding (and
latin-1, and others too). This is the usual Windows encoding (in "Western
Europe" but seems to cover a much larger part of the world... most of
America, if not all).
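
To make the "single byte" point concrete, here is a minimal sketch. The byte string is the one from the original post; the variable names are only illustrative. It is written to run on Python 2.6+ and Python 3 (on 2.5, drop the b prefix from the literal).

```python
# The bytes as they might arrive from os.popen("somecmd").read()
raw = b"velja\xe8a 2009"

# '\xe8' is one byte, not four characters: the whole string is 12 bytes long
byte_count = len(raw)

# Decoding turns bytes into text; the codec decides which character 0xE8 is.
# Windows-1252 and latin-1 agree on this particular byte (U+00E8).
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")
```

Both decodes give the same Unicode string here, because the two encodings share the code for this character.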

When you *look* at some string in the interpreter, you see its repr()
(note the surrounding quotes). When you *print* some string, you get its
contents:

py> s = "ma mère"
py> s
'ma m\x8are'
py> print s
ma mère
py> print repr(s)
'ma m\x8are'

SMALLp said:
How do I solve this, and where exactly is the problem: with print or with read?
Windows XP, Python 2.5.4

There is *no* problem. You should read the Unicode howto:
<http://docs.python.org/howto/unicode.html>
If you still think there is a problem, please provide more details.
 
Hrvoje Niksic

Gabriel Genellina said:
'\xe8' is a *single* byte (not four). It is the 'LATIN SMALL LETTER E
WITH GRAVE' Unicode code point u'è' encoded in the Windows-1252
encoding (and latin-1, and others too).

Note that it is also 'LATIN SMALL LETTER C WITH CARON' (U+010D or
u'č'), encoded in Windows-1250, which is what the OP is likely using.

The rest of your message stands regardless: there is no problem, at
least as long as the OP only prints out the character received from
somecmd to something else that also expects Windows-1250. The problem
would arise if the OP wanted to store the string in a PyGTK label
(which expects UTF-8) or send it to a web browser (which expects an
explicit encoding, probably defaulting to UTF-8), in which case he'd
have to disambiguate whether '\xe8' refers to U+010D or to U+00E8 or
something else entirely.
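
As a rough illustration of that ambiguity, the sketch below decodes the same bytes with two different codecs and re-encodes each result as UTF-8. Which codec is right for the OP's command is an assumption only he can confirm; the variable names are made up for the example.

```python
raw = b"velja\xe8a 2009"

# Same byte, two possible meanings depending on the assumed source encoding
western = raw.decode("cp1252")   # 0xE8 becomes U+00E8 (e with grave)
central = raw.decode("cp1250")   # 0xE8 becomes U+010D (c with caron)

# Once the text is Unicode, encoding for a UTF-8 consumer is unambiguous
utf8_western = western.encode("utf-8")
utf8_central = central.encode("utf-8")
```

The two UTF-8 results differ, which is why the choice has to be made explicitly before handing the data to something that expects UTF-8.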

That is the problem that Python 3 solves by requiring (or strongly
suggesting) that such disambiguation be performed as early in the
program as possible, preferably while the characters are being read
from the outside source. A similar approach is possible using Python
2 and its unicode type, but since the OP never specified exactly which
problem he had (except for the repr/str confusion), it's hard to tell
if using the unicode type would help.
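
A minimal sketch of "decode as early as possible", assuming the command's output is Windows-1250. The helper name read_command_text is hypothetical, and a small Python child process stands in for the unspecified "somecmd". It uses subprocess rather than os.popen so that the same code runs on Python 2.6+ and Python 3.

```python
import subprocess
import sys


def read_command_text(args, encoding):
    """Run a command and return its stdout decoded to text (hypothetical helper)."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    raw, _ = proc.communicate()
    # Decode right at the boundary, so the rest of the program sees only text
    return raw.decode(encoding)


# Stand-in for "somecmd": a child process that writes the raw bytes to stdout
child = [sys.executable, "-c", "import os; os.write(1, b'velja\\xe8a 2009')"]

# "cp1250" is an assumption about what the command emits
text = read_command_text(child, "cp1250")
```

From here on, text is a Unicode string and can be encoded to whatever the destination expects, for example text.encode("utf-8").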
 
Gabriel Genellina

SMALLp said:
Thanks for the help!

My problem was actually:
a = ["velja\xe8a 2009"]
print a     # will print ['velja\xe8a 2009']
print a[0]  # will print veljaèa 2009

And why is that a problem?

Almost the only reason to print a list is when debugging a program. To
print a list, Python uses repr() on each of its elements. Otherwise, [5,
"5", u'5'] would be indistinguishable from [5, 5, 5], and you usually want
to know exactly *what* the list contains.
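
A small sketch of that behaviour, with made-up variable names: converting a list to a string uses repr() on each element, while joining the elements yourself gives the plain contents.

```python
items = [5, "5"]

# What printing the list shows: repr() of each element, so the quotes
# tell the number and the string apart
listing = str(items)

# For human-readable output, convert and join the elements yourself
readable = ", ".join(str(x) for x in items)
```

Here listing keeps the quotes around the string element, while readable shows the two elements identically.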

Perhaps if you tell us exactly what you want to do, someone can offer
more advice.
 
