Ascii to Unicode.

Joe Goldthwaite

Hi,

I've got an Ascii file with some latin characters. Specifically \xe1 and
\xfc. I'm trying to import it into a Postgresql database that's running in
Unicode mode. The Unicode converter chokes on those two characters.

I could just manually replace those two characters with something valid, but
if any other invalid characters show up in later versions of the file, I'd
like to handle them correctly.


I've been playing with the Unicode stuff and I found out that I could
convert both those characters correctly using the latin1 encoder, like this:


import unicodedata

s = '\xe1\xfc'
print unicode(s,'latin1')


The above works. When I try to convert my file, however, I still get an
error:

import unicodedata

input = file('ascii.csv', 'r')
output = file('unicode.csv','w')

for line in input.xreadlines():
    output.write(unicode(line,'latin1'))

input.close()
output.close()

Traceback (most recent call last):
File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
output.write(unicode(line,'latin1'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
295: ordinal not in range(128)
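For reference, the same failure can be reproduced without any file involved. The sketch below assumes Python 3 syntax so it runs as-is, but the mechanism is the one the traceback reports: the write triggers an implicit encode with the default 'ascii' codec, which cannot represent \xe1 or \xfc.

```python
# Decoding the Latin-1 bytes succeeds; the error only appears when the
# resulting unicode string is encoded back with the default 'ascii'
# codec -- which is what writing to a plain byte-mode file does.
line = b'\xe1\xfc'.decode('latin1')    # works, gives 'áü'

try:
    line.encode('ascii')               # the implicit step behind output.write()
except UnicodeEncodeError as exc:
    print(exc.reason)                  # ordinal not in range(128)
```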

I'm stuck using Python 2.4.4, which may be handling the strings differently
depending on whether they're in the program or coming from the file. I just
haven't been able to figure out how to get the Unicode conversion working
from the file data.

Can anyone explain what is going on?
 
John Nagle

input = file('ascii.csv', 'r')
output = file('unicode.csv','w')

for line in input.xreadlines():
    output.write(unicode(line,'latin1'))

input.close()
output.close()
Try this, which will get you a UTF-8 file, the usual standard for
Unicode in a file.

for rawline in input :
unicodeline = unicode(line,'latin1') # Latin-1 to Unicode
output.write(unicodeline.encode('utf-8')) # Unicode to UTF-8
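For comparison, later Pythons can do the same transcode with encoding-aware file objects, which decode on read and encode on write. This is a sketch with placeholder file names and data, assuming Python 3's open():

```python
import os
import tempfile

# Placeholder input: a few Latin-1 encoded bytes standing in for ascii.csv.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'ascii.csv')
dst_path = os.path.join(workdir, 'unicode.csv')
with open(src_path, 'wb') as f:
    f.write(b'Pe\xe1rl City\n')        # \xe1 is 'a' with acute accent

# The transcode: decode Latin-1 on read, encode UTF-8 on write.
with open(src_path, encoding='latin1') as src, \
        open(dst_path, 'w', encoding='utf-8') as dst:
    for rawline in src:
        dst.write(rawline)
```

Opening the source with 'cp1252' instead of 'latin1' would behave identically for these particular bytes.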


John Nagle
 
Thomas Jollans

for rawline in input :
unicodeline = unicode(line,'latin1') # Latin-1 to Unicode
output.write(unicodeline.encode('utf-8')) # Unicode to UTF-8

You got your blocks wrong.
 
John Machin

Traceback (most recent call last):
  File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
    output.write(unicode(line,'latin1'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
295: ordinal not in range(128)


Hello hello ... you are running on Windows; the likelihood that you
actually have data encoded in latin1 is very very small. Follow MRAB's
answer but replace "latin1" by "cp1252".
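The two codecs agree on bytes 0xA0-0xFF but diverge on 0x80-0x9F, where latin1 has invisible C1 control characters and cp1252 has the Windows punctuation. A quick check, Python 3 syntax assumed:

```python
# 0x93 is the Windows "left double quotation mark" in cp1252 but an
# invisible C1 control character in latin1 -- the kind of byte that
# passes latin1 decoding silently and corrupts the data downstream.
smart_quote = b'\x93'

assert smart_quote.decode('cp1252') == '\u201c'   # a printable quote mark
assert smart_quote.decode('latin1') == '\x93'     # a control character
```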
 
Joe Goldthwaite

Hello hello ... you are running on Windows; the likelihood that you
actually have data encoded in latin1 is very very small. Follow MRAB's
answer but replace "latin1" by "cp1252".

I think you're right. The database I'm working with is a US zip code
database. It gets updated monthly. The problem fields are some city names
in Puerto Rico. I thought I had tried the cp1252 codec and that it didn't
work. I tried it again and it works now so I was doing something else wrong.

I agree that's probably what I should be using. Both latin1 and cp1252
produce the same output for the two characters I'm having trouble with,
but I changed it to cp1252 anyway. I think it will avoid problems in the
future.
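That observation is easy to verify: both codecs map \xe1 and \xfc to the same characters (á and ü, as in a name like Mayagüez). A sketch, assuming Python 3 syntax:

```python
# The two problem bytes decode identically under either codec, so the
# switch to cp1252 changes nothing for this data -- it only adds
# coverage for the 0x80-0x9F punctuation range.
problem_bytes = b'\xe1\xfc'

assert problem_bytes.decode('latin1') == problem_bytes.decode('cp1252')
assert problem_bytes.decode('cp1252') == '\xe1\xfc'   # 'áü'
```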

Thanks John.
 
