Python newbie needs Unicode help

Discussion in 'Python' started by gheissenberger@gmail.com, Jan 11, 2007.

  1. Guest

    HELP!
    Guy who was here before me wrote a script to parse files in Python.

    Includes line:
    print u
    where u is a line from a file we are parsing.
    However, we have started receiving data from Brazil. If I open the file to
    parse in vi, it looks like:

    <Utt id="3" transcribe="yes" audioRoot="A1"
    audio="313-20070102144528.wav" grammarSet="G3" rawText="não"
    recValue="{data:CHOICE=NO;}" conf="970" rawText2="" conf2="0"
    transcribedText="não" parsableText="não"/

    Clearly those "n&#227" are some non-ASCII characters, but how do I get
    print to understand that?

    I keep getting:
    "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
    position 40:
    ordinal not in range(128)"
     
    , Jan 11, 2007
    #1

  2. wrote:
    > I keep getting:
    > "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
    > position 40:
    > ordinal not in range(128)"
    >


    Does the error happen at the

    print u

    line? If so, you are trying to print a unicode object, which first has
    to be converted (the correct term is encoded) to a byte string. If you
    don't do that explicitly, it is done implicitly using the default
    encoding, which is ASCII.

    If the text contains non-ASCII characters, you end up with the error you see.

    What to do? Use something like this:

    print u.encode('utf-8')

    instead.
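
A minimal sketch of the failure and the fix described above. It is written for Python 3 (an assumption; the thread itself uses Python 2, where the ASCII step happens implicitly), so the encoding is spelled out by hand. The name `line` is a hypothetical stand-in for the parsed text.

```python
# Hypothetical stand-in for one line of the parsed file.
line = 'rawText="n\u00e3o"'

# Encoding to ASCII fails because U+00E3 (a with tilde) is outside range(128).
try:
    line.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding to UTF-8 succeeds: the character becomes the two bytes C3 A3.
utf8_bytes = line.encode("utf-8")

print(ascii_error)
print(utf8_bytes)
```

Whether UTF-8 is the right target depends on what the terminal or the downstream consumer expects; it is used here only because the reply suggests it.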

    Diez
     
    Diez B. Roggisch, Jan 11, 2007
    #2

  3. Guest

    Progress! You managed to change the error message.

    File "./acc_test_script_generator.py", line 106, in loadData
    print u.encode('utf-8')
    AttributeError: Utterance instance has no attribute 'encode'

    I'm missing something really obvious here, but I don't know what it
    is...


     
    , Jan 11, 2007
    #3
  4. At Thursday 11/1/2007 18:27, wrote:

    >HELP!
    >Guy who was here before me wrote a script to parse files in Python.
    >
    >Includes line:
    >print u
    >where u is a line from a file we are parsing.
    >However, we have started receiving data from Brazil. If I open file to
    >parse in VI, looks like:
    >
    ><Utt id="3" transcribe="yes" audioRoot="A1"
    >audio="313-20070102144528.wav" grammarSet="G3" rawText="não"
    >recValue="{data:CHOICE=NO;}" conf="970" rawText2="" conf2="0"
    >transcribedText="não" parsableText="não"/


    Is this part of an XML document? You should use a
    real XML parser rather than parsing it by hand.
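
As a rough illustration of the parser suggestion, here is a sketch using the standard library's `xml.etree.ElementTree` in Python 3 (an assumption; the thread uses Python 2). The sample element is a hypothetical, shortened and well-formed variant of the one in the post, not the real file contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened and well-formed version of the element in the post,
# encoded as UTF-8 bytes the way it might be read from disk.
data = '<Utt id="3" rawText="n\u00e3o" conf="970"/>'.encode("utf-8")

# With no XML declaration the parser assumes UTF-8 and returns attribute
# values as decoded text.
element = ET.fromstring(data)
raw_text = element.get("rawText")
conf = int(element.get("conf"))

# A numeric character reference such as &#227; is resolved by the parser too.
from_reference = ET.fromstring('<Utt rawText="n&#227;o"/>').get("rawText")

print(element.tag, repr(raw_text), conf, repr(from_reference))
```

Either way the attribute value comes back as text, so the only place an encoding has to be chosen is when the result is written out.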

    >Clearly those "n&#227" are some non-Ascii characters, but how do I get
    >print to understand that?


    Understanding how Unicode works may be very
    useful: http://www.amk.ca/python/howto/unicode

    >I keep getting:
    >"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
    >position 40:
    > ordinal not in range(128)"


    py> u = u"áéíóú"
    py> print u, repr(u)
    áéíóú u'\xe1\xe9\xed\xf3\xfa'
    py> print str(u)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
    py> print u.encode('cp850')
    áéíóú

    (cp850 is my console encoding)
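
The session above hard-codes cp850 because that happens to be the poster's console encoding. A hedged Python 3 sketch of looking the encoding up instead follows; the fallback to "ascii" and the use of `errors="replace"` are choices made for this illustration, not something the thread prescribes.

```python
import sys

text = "\u00e1\u00e9\u00ed\u00f3\u00fa"  # the same five accented vowels

# The encoding Python uses when writing text to this stream; it may be None
# for some replaced or redirected streams, hence the fallback.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

# errors="replace" substitutes "?" for anything the target encoding cannot
# represent instead of raising UnicodeEncodeError.
safe_bytes = text.encode(console_encoding, errors="replace")

# cp850 can represent all five characters, so this round trip is lossless.
cp850_bytes = text.encode("cp850")

print(console_encoding, safe_bytes, cp850_bytes)
```

Replacing unencodable characters loses information, so it suits diagnostic output better than data that will be parsed again.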


    --
    Gabriel Genellina
    Softlab SRL
     
    Gabriel Genellina, Jan 12, 2007
    #4
  5. At Thursday 11/1/2007 20:42, wrote:

    > Progress! You managed to change the error message.
    >
    > File "./acc_test_script_generator.py", line 106, in loadData
    > print u.encode('utf-8')
    > AttributeError: Utterance instance has no attribute 'encode'
    >
    > I'm missing something really obvious here, but I don't know what it
    > is...


    Then you're not "printing a line from a file we are parsing", which
    would be a string or unicode object. You're printing an "Utterance"
    instance; it probably has a __str__ method, and that is where unicode
    and byte strings are being mixed.
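
To make the diagnosis concrete, here is a sketch in Python 3 (an assumption; the thread uses Python 2). The `Utterance` class below is a hypothetical stand-in, since the real class from the script is not shown anywhere in the thread.

```python
class Utterance:
    """Hypothetical stand-in for the script's Utterance class."""

    def __init__(self, raw_text):
        self.raw_text = raw_text

    def __str__(self):
        # Build the text form from text only; never mix in encoded bytes.
        return "Utterance(rawText=%s)" % self.raw_text


u = Utterance("n\u00e3o")

# The instance is not a string, so it has no encode() method of its own.
has_encode = hasattr(u, "encode")

# Convert to text first, then encode the resulting string explicitly.
encoded = str(u).encode("utf-8")

print(has_encode, encoded)
```

In Python 2 the comparable approach would presumably be to give the class a `__unicode__` method and print `unicode(u).encode('utf-8')`, though that depends on how the real class builds its string.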


    --
    Gabriel Genellina
    Softlab SRL
     
    Gabriel Genellina, Jan 12, 2007
    #5