unicode compare errors

Ross

I've a character encoding issue that has stumped me (not that hard to
do). I am parsing a small text file that may contain amounts in various
currencies, and want to handle them without messing up.

Initially I was simply doing:

currs = [u'$', u'£', u'€', u'¥']
aFile = open(thisFile, 'r')
for mline in aFile:  # mline might be "£5.50"
    if item[0] in currs:
        item = item[1:]

But the problem was:
SyntaxError: Non-ASCII character '\xa3' in file

The remedy was of course to declare the file encoding for my Python
module. At the start of the file I added:

# -*- coding: UTF-8 -*-

That allowed me to progress. But now, when I come to a line item in a
non-$ currency, I get this warning:

views.py:3364: UnicodeWarning: Unicode equal comparison failed to
convert both arguments to Unicode - interpreting them as being
unequal.

…which I think means Python is unable to convert the characters in the
file I'm reading from into unicode to compare with the items in the
list currs.

I think this is saying that u'£' == '£' is false.
(I hope those chars show up okay in my post here)

Since I can't control the encoding of the input file that users
submit, how do I get past this? How do I make such comparisons be
True?

Thanks in advance for any suggestions
Ross.
 
Ross

> Initially I was simply doing:
>
>     currs = [u'$', u'£', u'€', u'¥']
>     aFile = open(thisFile, 'r')
>     for mline in aFile:  # mline might be "£5.50"
>         if item[0] in currs:
>             item = item[1:]

Don't you love it when someone solves their own problem? Posting a
reply here so that other poor chumps like me can get around this...

I found I could import the codecs module, which lets me read the file
with my desired encoding. Huzzah!

Instead of opening the file with a standard
aFile = open(thisFile, 'r')

I first make sure I've imported codecs:

import codecs

... and then specify the encoding when opening the file:

aFile = codecs.open(thisFile, encoding='utf-8')

Then all my compares seem to work fine.
If I'm off-base and kludgey here and should be doing something
differently, please give me a poke.
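
For anyone who finds this later, here's roughly how the pieces fit together. Treat it as a sketch rather than gospel: it assumes the file really is UTF-8, it uses io.open (which accepts an encoding argument much as codecs.open does), and the helper names strip_currency and read_amounts, along with the sample file, are made up for illustration.

```python
import io
import os
import tempfile

CURRENCY_SYMBOLS = (u'$', u'£', u'€', u'¥')


def strip_currency(item):
    """Return item without a leading currency symbol, if it has one."""
    if item and item[0] in CURRENCY_SYMBOLS:
        return item[1:]
    return item


def read_amounts(path, encoding='utf-8'):
    """Decode each line while reading, then strip any currency symbol."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return [strip_currency(line.strip()) for line in handle]


# Hypothetical sample file, written as UTF-8 so the example is self-contained.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write(u'£5.50\n€12.00\n$3.25\n')

amounts = read_amounts(sample_path)
os.remove(sample_path)
print(amounts)
```

Because the lines are decoded as they are read, both sides of the "in" test are unicode strings and the comparison no longer trips the warning.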

Regards,
Ross.
 
Nobody

> Since I can't control the encoding of the input file that users
> submit, how do I get past this? How do I make such comparisons be
> True?

> I found I could import the codecs module, which lets me read the file
> with my desired encoding. Huzzah!
> If I'm off-base and kludgey here and should be doing something

Er, do you know the file's encoding or don't you? Using:

aFile = codecs.open(thisFile, encoding='utf-8')

is telling Python that the file /is/ in utf-8. If it isn't in utf-8,
you'll get decoding errors.
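
To make that concrete, here is a small sketch. The sample bytes are my own assumption, not anything from your file: a pound sign saved as Latin-1 is the single byte 0xA3, which is not valid UTF-8 on its own, so a strict decode refuses it.

```python
# Hypothetical sample: u'£5.50' as a Latin-1 editor would store it.
latin1_bytes = u'£5.50'.encode('latin-1')

try:
    latin1_bytes.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# error now holds the UnicodeDecodeError raised by the strict UTF-8 decode.
print(error)
```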

If you are given a file with no known encoding, then you can't reliably
determine what /characters/ it contains, and thus can't reliably compare
the contents of the file against strings of characters, only against
strings of bytes.
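
A byte-level comparison could look something like the sketch below. The list of candidate encodings is purely an assumption on my part, and a match only tells you the bytes look like a currency symbol in one of those encodings, not that the file actually uses it.

```python
SYMBOLS = (u'$', u'£', u'€', u'¥')
CANDIDATE_ENCODINGS = ('utf-8', 'cp1252')  # assumed; adjust to your users


def currency_byte_prefixes():
    """Encode every symbol in every candidate encoding that can represent it."""
    prefixes = set()
    for symbol in SYMBOLS:
        for encoding in CANDIDATE_ENCODINGS:
            try:
                prefixes.add(symbol.encode(encoding))
            except UnicodeEncodeError:
                pass  # this encoding has no such character
    # Longest first, so a multi-byte prefix is tried before a shorter one.
    return sorted(prefixes, key=len, reverse=True)


PREFIXES = currency_byte_prefixes()


def strip_currency_bytes(raw):
    """Drop a leading currency symbol from a byte string, if one matches."""
    for prefix in PREFIXES:
        if raw.startswith(prefix):
            return raw[len(prefix):]
    return raw
```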

About the best you can do is to use an autodetection library such as:

http://chardet.feedparser.org/
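
If pulling in a library is not an option, a much cruder stand-in (this is not what chardet does, just a common fallback trick) is to try strict UTF-8 first and then fall back to a legacy encoding. The function name and the choice of cp1252 as the fallback are assumptions for the sake of the example.

```python
def decode_best_effort(raw, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this never fails,
    # though the result may not be what the author typed.
    return raw.decode('latin-1'), 'latin-1'


# Hypothetical input: a line saved by a Windows editor using cp1252.
text, guessed = decode_best_effort(u'£5.50'.encode('cp1252'))
```

UTF-8 goes first because legacy-encoded text with non-ASCII characters rarely happens to be valid UTF-8, whereas cp1252 accepts almost any byte sequence. It is still a guess, so treat the result with suspicion.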
 
Ross

> Er, do you know the file's encoding or don't you? Using:
>
>     aFile = codecs.open(thisFile, encoding='utf-8')
>
> is telling Python that the file /is/ in utf-8. If it isn't in utf-8,
> you'll get decoding errors.
>
> If you are given a file with no known encoding, then you can't reliably
> determine what /characters/ it contains, and thus can't reliably compare
> the contents of the file against strings of characters, only against
> strings of bytes.
>
> About the best you can do is to use an autodetection library such as:
>
>     http://chardet.feedparser.org/

That's right, I don't know what encoding the user will have used. The
use of autodetection sounds good - I'll look into that. Thx.

R.
 
