Conditional for...in failing with utf-8, Spanish book translation

Hunter

Hi all,

This is my first Usenet post!
I've run into a wall with my first Python program. I'm writing some
simple code that takes a UTF-8 text file in Spanish and uses online
translation tools to convert it, word by word, into English. Then I
generate a PDF with both languages.

Most of this is working great, but I get intermittent errors of the form:

---

Translating coche(coche)...
Already cached!
English: car
Translating ahora(ahora)...
tw returned now
English: now
Translating mismo?(mismo)...
Already cached!
English: same
Translating ¡A(�a)...
iconv: illegal input sequence at position 0
tw returned error: the required parameter "srctext" is missing
English: error: the required parameter "srctext" is missing

---

The output should look like:
Translating Raw_Text(lowercaserawtextwithoutpunctuation)...
tw returned englishtranslation
English: englishtranslation

I've narrowed the problem down to a simple test program. Check this out:

---

# -*- coding: utf-8 -*-

acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
#wtf?

word = "¡A"
word_key = ''.join([c for c in word.lower() if c in acceptable])
print "word_key = " + word_key

---

Any ideas? I'm really stumped!

Thanks,
Hunter
 
Marc 'BlackJack' Rintsch

Hunter said:
I've narrowed the problem down to a simple test program. Check this out:

---

# -*- coding: utf-8 -*-

acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
#wtf?

word = "¡A"
word_key = ''.join([c for c in word.lower() if c in acceptable])
print "word_key = " + word_key

You are not working with unicode strings but with UTF-8 encoded byte
strings. Those hold bytes, not letters/characters. Your `word`, for
example, contains three bytes and not the two characters you think it
contains:

In [43]: word = "¡A"

In [44]: len(word)
Out[44]: 3

In [45]: for c in word: print repr(c)
....:
'\xc2'
'\xa1'
'A'

So you are *not* testing whether ¡ is in `acceptable`, but whether each of
the two byte values that make up the UTF-8 representation of that
character is.
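This also explains why adding á makes the difference. ¡ is encoded as the bytes C2 A1 and á as C3 A1, so once á is in `acceptable`, the second byte of ¡ passes the filter on its own. Here is a minimal sketch that redoes the test on explicit byte values; the helper `utf8_bytes` and the variable names are made up for illustration and are not part of the original program:

```python
# -*- coding: utf-8 -*-
# Sketch: redo the membership test on explicit byte values.

def utf8_bytes(text):
    """Return the UTF-8 encoding of a unicode string as a list of ints."""
    return list(bytearray(text.encode("utf-8")))

letters = u"abcdefghijklmnopqrstuvwxyz"
word = utf8_bytes(u"\u00a1a")                                     # "¡a"
first = utf8_bytes(letters + u"\u00f3\u00ed\u00f1\u00fa")         # ...óíñú
second = utf8_bytes(letters + u"\u00f3\u00ed\u00f1\u00fa\u00e1")  # ...óíñúá

kept_first = [b for b in word if b in first]
kept_second = [b for b in word if b in second]

print([hex(b) for b in word])         # ['0xc2', '0xa1', '0x61']
print([hex(b) for b in kept_first])   # ['0x61']
print([hex(b) for b in kept_second])  # ['0xa1', '0x61']

# The second result starts with a stray continuation byte, so it is no
# longer valid UTF-8.
try:
    bytes(bytearray(kept_second)).decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)                  # False
```

A key that begins with a lone A1 byte is not valid UTF-8, which would fit the "illegal input sequence at position 0" message from iconv in the log above.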

Ciao,
Marc 'BlackJack' Rintsch
 
Stefan Behnel

Hunter said:
I've narrowed the problem down to a simple test program. Check this out:

---

# -*- coding: utf-8 -*-

acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't

[bad words stripped]

this should read

acceptable = u"abcdefghijklmnopqrstuvwxyzóíñú"
acceptable = u"abcdefghijklmnopqrstuvwxyzóíñúá"

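Mind the little "u" before the string, which makes it a unicode string instead
of an encoded byte string. The same goes for `word`: it has to be a unicode
string too (a u"..." literal, or bytes decoded with .decode("utf-8")),
otherwise the `in` test mixes byte strings and unicode strings.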

http://docs.python.org/tut/node5.html#SECTION005130000000000000000
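Put together, the filtering could look roughly like the sketch below. It assumes the words arrive as UTF-8 bytes (for example, read from the book file) and decodes them once before doing anything else. The function name `make_word_key` and the constant `ACCEPTABLE` are invented here for illustration:

```python
# -*- coding: utf-8 -*-
# Sketch: filter on unicode strings, decoding bytes once at the boundary.

# Same alphabet as in the test program, written with escapes: ...óíñúá
ACCEPTABLE = u"abcdefghijklmnopqrstuvwxyz\u00f3\u00ed\u00f1\u00fa\u00e1"

def make_word_key(raw_bytes, encoding="utf-8"):
    """Decode raw bytes, lowercase, and keep only acceptable characters."""
    text = raw_bytes.decode(encoding)
    return u"".join(c for c in text.lower() if c in ACCEPTABLE)

raw = u"\u00a1A".encode("utf-8")   # the bytes of "¡A", as read from a file
word_key = make_word_key(raw)
print(repr(word_key))
```

With unicode strings, `c` is a whole character on every iteration, so ¡ is dropped as one unit and only "a" is left in `word_key`.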

Stefan
 
