Ascii to Unicode.

Joe Goldthwaite

Hi,

I've got an Ascii file with some latin characters. Specifically \xe1 and
\xfc. I'm trying to import it into a Postgresql database that's running in
Unicode mode. The Unicode converter chokes on those two characters.

I could just manually replace those two characters with something valid, but
if any other invalid characters show up in later versions of the file, I'd
like to handle them correctly.


I've been playing with the Unicode stuff and I found out that I could
convert both those characters correctly using the latin1 encoder, like this:


import unicodedata

s = '\xe1\xfc'
print unicode(s,'latin1')


The above works. When I try to convert my file, however, I still get an
error:

import unicodedata

input = file('ascii.csv', 'r')
output = file('unicode.csv','w')

for line in input.xreadlines():
    output.write(unicode(line,'latin1'))

input.close()
output.close()

Traceback (most recent call last):
File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
output.write(unicode(line,'latin1'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
295: ordinal not in range(128)
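For reference, the same failure can be reproduced without any file involved. The sketch below assumes Python 3 syntax so it runs as-is, but the mechanism is the one the traceback reports: the write triggers an implicit encode with the default 'ascii' codec, which cannot represent \xe1 or \xfc.

```python
# Decoding the Latin-1 bytes succeeds; the error only appears when the
# resulting unicode string is encoded back with the default 'ascii'
# codec -- which is what writing to a plain byte-mode file does.
line = b'\xe1\xfc'.decode('latin1')    # works, gives 'áü'

try:
    line.encode('ascii')               # the implicit step behind output.write()
except UnicodeEncodeError as exc:
    print(exc.reason)                  # ordinal not in range(128)
```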

I'm stuck using Python 2.4.4, which may be handling the strings differently
depending on whether they're in the program or coming from the file. I just
haven't been able to figure out how to get the Unicode conversion working
from the file data.

Can anyone explain what is going on?
 
John Nagle

input = file('ascii.csv', 'r')
output = file('unicode.csv','w')

for line in input.xreadlines():
    output.write(unicode(line,'latin1'))

input.close()
output.close()
Try this, which will get you a UTF-8 file, the usual standard for
Unicode in a file.

for rawline in input :
unicodeline = unicode(line,'latin1') # Latin-1 to Unicode
output.write(unicodeline.encode('utf-8')) # Unicode to UTF-8
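For comparison, later Pythons can do the same transcode with encoding-aware file objects, which decode on read and encode on write. This is a sketch with placeholder file names and data, assuming Python 3's open():

```python
import os
import tempfile

# Placeholder input: a few Latin-1 encoded bytes standing in for ascii.csv.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'ascii.csv')
dst_path = os.path.join(workdir, 'unicode.csv')
with open(src_path, 'wb') as f:
    f.write(b'Pe\xe1rl City\n')        # \xe1 is 'a' with acute accent

# The transcode: decode Latin-1 on read, encode UTF-8 on write.
with open(src_path, encoding='latin1') as src, \
        open(dst_path, 'w', encoding='utf-8') as dst:
    for rawline in src:
        dst.write(rawline)
```

Opening the source with 'cp1252' instead of 'latin1' would behave identically for these particular bytes.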


John Nagle
 
Thomas Jollans

for rawline in input :
unicodeline = unicode(line,'latin1') # Latin-1 to Unicode
output.write(unicodeline.encode('utf-8')) # Unicode to UTF-8

You got your blocks wrong.
 
John Machin

Traceback (most recent call last):
  File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
    output.write(unicode(line,'latin1'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
295: ordinal not in range(128)


Hello hello ... you are running on Windows; the likelihood that you
actually have data encoded in latin1 is very very small. Follow MRAB's
answer but replace "latin1" by "cp1252".
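The two codecs agree on bytes 0xA0-0xFF but diverge on 0x80-0x9F, where latin1 has invisible C1 control characters and cp1252 has the Windows punctuation. A quick check, Python 3 syntax assumed:

```python
# 0x93 is the Windows "left double quotation mark" in cp1252 but an
# invisible C1 control character in latin1 -- the kind of byte that
# passes latin1 decoding silently and corrupts the data downstream.
smart_quote = b'\x93'

assert smart_quote.decode('cp1252') == '\u201c'   # a printable quote mark
assert smart_quote.decode('latin1') == '\x93'     # a control character
```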
 
Joe Goldthwaite

Hello hello ... you are running on Windows; the likelihood that you
actually have data encoded in latin1 is very very small. Follow MRAB's
answer but replace "latin1" by "cp1252".

I think you're right. The database I'm working with is a US zip code
database. It gets updated monthly. The problem fields are some city names
in Puerto Rico. I thought I had tried the cp1252 codec and that it didn't
work. I tried it again and it works now so I was doing something else wrong.

I agree that's probably what I should be using. Both latin1 and cp1252
produce the same output for the two characters I'm having trouble with,
but I changed it to cp1252 anyway. I think it will avoid problems in the
future.
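That observation is easy to verify: both codecs map \xe1 and \xfc to the same characters (á and ü, as in a name like Mayagüez). A sketch, assuming Python 3 syntax:

```python
# The two problem bytes decode identically under either codec, so the
# switch to cp1252 changes nothing for this data -- it only adds
# coverage for the 0x80-0x9F punctuation range.
problem_bytes = b'\xe1\xfc'

assert problem_bytes.decode('latin1') == problem_bytes.decode('cp1252')
assert problem_bytes.decode('cp1252') == '\xe1\xfc'   # 'áü'
```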

Thanks John.
 
