Conditional for...in failing with utf-8, Spanish book translation

Hunter · Apr 21, 2008

Hi all,

This is my first Usenet post!
I've run into a wall with my first Python program. I'm writing some
simple code that takes a UTF-8 text file in Spanish and uses online
translation tools to convert it, word by word, into English. Then
I generate a PDF with both languages.

Most of this is working great, but I get intermittent errors of the form:

---

Translating coche(coche)...
Already cached!
English: car
Translating ahora(ahora)...
tw returned now
English: now
Translating mismo?(mismo)...
Already cached!
English: same
Translating Â¡A(ï¿½a)...
iconv: illegal input sequence at position 0
tw returned error: the required parameter "srctext" is missing
English: error: the required parameter "srctext" is missing

---

The output should look like:
Translating Raw_Text(lowercaserawtextwithoutpunctuation)...
tw returned englishtranslation
English: englishtranslation

I've narrowed the problem down to a simple test program. Check this out:

---

# -*- coding: utf-8 -*-

acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
#wtf?

word = "¡A"
word_key = ''.join([c for c in word.lower() if c in acceptable])
print "word_key = " + word_key

---

Any ideas? I'm really stumped!

Thanks,
Hunter

Marc 'BlackJack' Rintsch · Apr 21, 2008

Hunter said:
I've narrowed the problem down to a simple test program. Check this out:

---

# -*- coding: utf-8 -*-

acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
#wtf?

word = "¡A"
word_key = ''.join([c for c in word.lower() if c in acceptable])
print "word_key = " + word_key

You are not working with unicode strings but with UTF-8 encoded byte
strings. Those are bytes, not letters/characters. Your `word`, for example,
contains three bytes and not the two characters you think it contains:

In [43]: word = "¡A"

In [44]: len(word)
Out[44]: 3

In [45]: for c in word: print repr(c)
....:
'\xc2'
'\xa1'
'A'

So you are *not* testing if ¡ is in `acceptable` but the two byte values
that are the UTF-8 representation of that character.

Ciao,
Marc 'BlackJack' Rintsch
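
For readers who want to see why adding "á" in particular breaks the test program, here is a small sketch that is not from the original posts. It assumes the source file is saved as UTF-8, and it uses `bytearray` so that the byte values compare the same way on Python 2 and Python 3. The names `word_bytes`, `shared`, and `key` are made up for illustration. The point is that "¡" and "á" share their second UTF-8 byte (0xA1), so a byte-wise filter lets half of "¡" through.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: reproduces the byte-wise filtering from the test program.

word = u"¡A"
word_bytes = word.encode("utf-8")

print(len(word))        # 2 characters
print(len(word_bytes))  # 3 bytes: 0xC2 0xA1 0x41

# "¡" encodes to C2 A1 and "á" encodes to C3 A1: the trailing byte is shared.
shared = set(bytearray(u"¡".encode("utf-8"))) & set(bytearray(u"á".encode("utf-8")))
print([hex(b) for b in shared])  # ['0xa1']

# Filter byte by byte, as the original program effectively does.
acceptable = bytearray(u"abcdefghijklmnopqrstuvwxyzóíñúá".encode("utf-8"))
key = bytearray(b for b in bytearray(word_bytes.lower()) if b in acceptable)
print(key == bytearray(b"\xa1a"))  # True: 0xC2 was dropped, 0xA1 survived

# The leftover bytes are no longer valid UTF-8, which is presumably what
# iconv complains about in the error output.
try:
    bytes(key).decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)  # False
```

Without "á" in `acceptable`, both bytes of "¡" are filtered out and the key is just "a", which is why the first version of the line appears to work.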

Stefan Behnel · Apr 21, 2008

Hunter said:
I've narrowed the problem down to a simple test program. Check this out:

---

# -*- coding: utf-8 -*-

acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't

[bad words stripped]

This should read:

acceptable = u"abcdefghijklmnopqrstuvwxyzóíñú"
acceptable = u"abcdefghijklmnopqrstuvwxyzóíñúá"

Mind the little "u" before the string, which makes it a unicode string instead
of an encoded byte string.

http://docs.python.org/tut/node5.html#SECTION005130000000000000000

Stefan
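
Putting the two replies together, one possible shape for the fix is sketched below. This is an assumption about how the pieces fit, not code from the thread: the helper name `make_word_key` is hypothetical, and it assumes each word arrives as UTF-8 encoded bytes (for example, read from the book's text file) and is decoded before any per-character test.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: decode to unicode first, then filter character by character.

acceptable = u"abcdefghijklmnopqrstuvwxyzóíñúá"


def make_word_key(raw_bytes):
    """Decode UTF-8 bytes, lowercase the text, keep only acceptable characters."""
    text = raw_bytes.decode("utf-8")
    return u"".join(c for c in text.lower() if c in acceptable)


print(make_word_key(u"¡A".encode("utf-8")))      # a
print(make_word_key(u"mismo?".encode("utf-8")))  # mismo
print(make_word_key(u"Está".encode("utf-8")))    # está
```

Because both `acceptable` and the decoded word are unicode strings, `c in acceptable` compares whole characters, so "¡" is dropped as a unit and "á" is kept as a unit. The key would need to be encoded back to UTF-8 before being handed to anything that expects bytes, such as a command-line translation tool.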
