Python newbie needs Unicode help

gheissenberger

HELP!
Guy who was here before me wrote a script to parse files in Python.

Includes line:
print u
where u is a line from a file we are parsing.
However, we have started receiving data from Brazil. If I open the file to
parse in vi, it looks like:

<Utt id="3" transcribe="yes" audioRoot="A1"
audio="313-20070102144528.wav" grammarSet="G3" rawText="não"
recValue="{data:CHOICE=NO;}" conf="970" rawText2="" conf2="0"
transcribedText="não" parsableText="não"/>

Clearly those "n&#227" are some non-ASCII characters, but how do I get
print to understand that?

I keep getting:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
position 40: ordinal not in range(128)"
 
Diez B. Roggisch


Does the error happen at the

print u

line? If yes, what happens is that you are trying to print a unicode object.
That means it has to be converted (the right term is actually
encoded) to a byte string. If you don't do that explicitly, it will be
done implicitly, using the default encoding, which is ASCII.

If the text contains non-ASCII characters, you end up with the error you see.

What to do? Use something like this:

print u.encode('utf-8')

instead.
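
To make the difference concrete, here is a minimal sketch of what the implicit conversion runs into and what the explicit one produces. The variable names are made up for illustration, and the sample text is only modelled on the line in the question. It avoids the print statement so that it runs on both old and current Python versions.

```python
# Text containing U+00E3 (a with tilde), similar to the data in the question.
line = u'rawText="n\xe3o"'

# Encoding to ASCII fails; this is the same failure the implicit
# conversion hits when the default encoding is ASCII.
try:
    line.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields a byte string.
utf8_bytes = line.encode('utf-8')
```

Whether UTF-8 is the right target depends on what reads the output (a terminal, a file, another program); it is only the choice used in this example.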

Diez
 
gheissenberger

Progress! You managed to change the error message.

File "./acc_test_script_generator.py", line 106, in loadData
print u.encode('utf-8')
AttributeError: Utterance instance has no attribute 'encode'

I'm missing something really obvious here, but I don't know what it
is...
 
Gabriel Genellina


Is that <Utt .../> line part of an XML document? If so, you should use a
true XML parser instead of doing that by hand.
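
As a rough sketch of what that could look like with the standard library's xml.etree.ElementTree: the sample element below is shortened from the one in the question, and it assumes the file stores the accented letter as the numeric character reference &#227; (the real file may use a different representation).

```python
import xml.etree.ElementTree as ET

# Assumed sample: one element shaped like the <Utt .../> line in the question,
# with the a-tilde stored as the numeric character reference &#227;.
sample = '<Utt id="3" rawText="n&#227;o" conf="970" parsableText="n&#227;o"/>'

utt = ET.fromstring(sample)
raw_text = utt.get('rawText')   # the parser decodes &#227; to U+00E3
conf = int(utt.get('conf'))     # attribute values arrive as text

# Encode only at the output boundary, e.g. just before printing or writing.
encoded = raw_text.encode('utf-8')
```

The parser takes care of character references and attribute quoting, so the script only deals with decoded text and decides on an encoding when it writes something out.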

Understanding how Unicode works may be very
useful: http://www.amk.ca/python/howto/unicode

py> u = u"áéíóú"
py> print u, repr(u)
áéíóú u'\xe1\xe9\xed\xf3\xfa'
py> print str(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
py> print u.encode('cp850')
áéíóú

(cp850 is my console encoding)
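
If you don't know your console encoding, one way to look it up is sys.stdout.encoding. The sketch below is only an illustration: the fallback to UTF-8 and the "replace" error handler are assumptions made for the example, not something the session above relies on.

```python
import sys

text = u"\xe1\xe9\xed\xf3\xfa"   # the same five accented vowels as above

# sys.stdout.encoding can be None (for example when output is piped),
# so fall back to UTF-8; that fallback is an assumption, not a rule.
console_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"

# "replace" substitutes "?" for characters the console encoding lacks
# instead of raising UnicodeEncodeError.
encoded = text.encode(console_encoding, "replace")
```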


--
Gabriel Genellina
Softlab SRL






 
Gabriel Genellina


Then you're not "printing a line from a file we are parsing", which
should be a string or unicode object. You're printing some
"Utterance" instance; it probably has a __str__ method, and in there
you're mixing unicode objects and byte strings.
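
The Utterance class itself is not shown in this thread, so the following is only a hypothetical stand-in to illustrate the idea: build the description entirely from unicode text inside the object, and encode once at the point where it is output. The attribute and method names (utt_id, raw_text, as_text) are invented for the example.

```python
class Utterance(object):
    """Hypothetical stand-in for the script's class, which the thread does not show."""

    def __init__(self, utt_id, raw_text):
        self.utt_id = utt_id
        self.raw_text = raw_text   # kept as text (unicode), never as encoded bytes

    def as_text(self):
        # A u"" format string keeps the whole result unicode, so no
        # implicit ASCII conversion happens while building it.
        return u"Utt %s: %s" % (self.utt_id, self.raw_text)


utt = Utterance(3, u"n\xe3o")

# Encode once, at the output boundary.
encoded = utt.as_text().encode("utf-8")
```

With something like this in place, the print line would output the encoded result of as_text() rather than the instance itself.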


--
Gabriel Genellina
Softlab SRL






 
