Changing the default text codec

Fuzzyman

Sorry if my terminology is wrong..... but I'm having intermittent
problems dealing with accented characters in python. (Only from the 8
bit latin-1 character set I think..)

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

It's particularly difficult for me to understand what is happening -
because python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
  File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in prepare
    if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the user's input to remove invalid
characters (like "-" and "'")... It crashes when it comes to the
e-acute.


*However* - if I run it by double clicking on the file then it appears
to work fine (e.g. if I ask it to find anagrams of 'degré hello ma' then
it strips out the e-acute, thinking it's an invalid character, and
finds anagrams of the rest) :

gleam holder
hallo merged

What I'd like to do is switch by default to an 8 bit codec (latin-1 I
think ?) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e for
example) *or* treating them as separate characters.


I can't work out how to change the default codec (no matter what the
locale) ?

Anyone able to help - or point me to a useful resource ?? (I've tried
google - b4 u suggest it )



Fuzzy
 
Peter Otten

Fuzzyman said:
[snip..]
I can't work out how to change the default codec (no matter what the
locale) ?

Anyone able to help - or point me to a useful resource ?? (I've tried
google - b4 u suggest it )

You can either explicitly convert your unicode strings:

unicodeword.encode("latin-1")

or try to modify your site.py from the default

encoding = "ascii"

to

encoding = "latin-1"
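The first option, explicit encoding, can be sketched as follows. This is only an illustration, written for Python 3, where `str` is the Unicode type and `bytes` is the 8-bit type (the thread itself predates that split). The name `unicodeword` comes from the post above; the sample word and everything else are assumptions made for the example.

```python
# Explicitly encode a Unicode string to 8-bit latin-1 bytes.
unicodeword = "degr\u00e9"            # 'degré', e-acute is U+00E9

latin1_bytes = unicodeword.encode("latin-1")
print(latin1_bytes)                   # b'degr\xe9'

# Decoding with the same codec gives back the original text.
assert latin1_bytes.decode("latin-1") == unicodeword

# Characters outside latin-1 cannot be encoded; the errors argument
# controls what happens instead of raising UnicodeEncodeError.
print("\u20ac".encode("latin-1", errors="replace"))   # b'?'
```

Encoding explicitly at the point of use keeps the choice of codec visible in the program instead of depending on an interpreter-wide setting.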

Peter
 
Paul Prescod

Fuzzyman said:
Sorry if my terminology is wrong..... but I'm having intermittent
problems dealing with accented characters in python. (Only from the 8
bit latin-1 character set I think..)

I would say that if you get a 100% failure rate in IDLE and a 100%
success rate from a console program then your problem is not
intermittent but environment specific.

> For example - if I run my program from IDLE and give it the word
> 'degré' (containing e-acute) then I get the error :

What do you mean "give it the word". Through raw_input()? Through a file?

However you are getting this information, it seems to me that in IDLE
you are getting a Unicode object rather than an 8-bit string object.
Convert it to an 8-bit string:

mydata.encode("latin-1")
> if letter in self.valid_letters:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
> 26: ordinal not in range(128)

Something looks suspicious here. I wouldn't expect self.valid_letters to
have a 0x83 character in it because I would expect it to be hard-coded
to ASCII in your program like:

valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."

On the other hand I wouldn't expect "letter" to have more than one
character so how could it have a problem at position 26?

> What I'd like to do is switch by default to an 8 bit codec (latin-1 I
> think ?) and then offer the user the choice of either mapping the
> accented characters to their nearest equivalent (e-acute to e for
> example) *or* treating them as separate characters.

Why change the default codec rather than explicitly using the codec you
care about? If you want to work in the 8-bit world rather than the
Unicode world, just use the "encode" function on the Unicode object. If
you want to work in the Unicode world.

> I can't work out how to change the default codec (no matter what the
> locale) ?

I'd advise against fixing the problem in that way. Convert data
appropriately when you bring it from the outside world into the Python
program and ignore the default codec.
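
A minimal sketch of that advice, assuming Python 3 and assuming the input arrives as latin-1 bytes: decode once at the boundary, then either fold accented letters to their base letter or keep them as letters in their own right. The function names `fold_accents` and `prepare` and the `fold` flag are hypothetical, chosen for this example; they are not taken from the Nanagram source.

```python
import unicodedata


def fold_accents(text):
    """Map accented letters to their unaccented base letter (e-acute -> e)."""
    # NFKD splits 'é' into 'e' plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prepare(raw_bytes, encoding="latin-1", fold=True):
    """Decode outside-world bytes once, then keep only letters."""
    text = raw_bytes.decode(encoding).lower()
    if fold:
        text = fold_accents(text)
    # str.isalpha() is Unicode-aware, so 'é' counts as a letter here.
    return "".join(ch for ch in text if ch.isalpha())


print(prepare(b"degr\xe9 hello-ma"))               # degrehelloma
print(prepare(b"degr\xe9 hello-ma", fold=False))   # degréhelloma
```

With the text decoded up front, the membership test never mixes Unicode and 8-bit strings, so the default codec does not come into play.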

Paul Prescod
 
Fuzzyman

Paul Prescod said:
I would say that if you get a 100% failure rate in IDLE and a 100%
success rate from a console program then your problem is not
intermittent but environment specific.

If that was the case then I'm sure you'd be right... good not to
quibble about terminology eh ;-)

(in a few other test cases the success-fail pattern was the opposite
way round)

> What do you mean "give it the word". Through raw_input()? Through a file?

Right - it is fetching the words from a Tkinter entry box using the
get() method.

> However you are getting this information, it seems to me that in IDLE
> you are getting a Unicode object rather than an 8-bit string object.
> Convert it to an 8-bit string:
>
> mydata.encode("latin-1")

Great - that might do the job.
I'll try it.
Thanks.

> Something looks suspicious here. I wouldn't expect self.valid_letters to
> have a 0x83 character in it because I would expect it to be hard-coded
> to ASCII in your program like:

self.valid_letters *in fact* is string.lowercase - which I thought
included the 8 bit latin-1 letters as well. (The letters are converted
to lowercase by using the .lower() string method.)
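
One possible reading of "position 26", sketched under the assumption (not confirmed in the thread) that string.lowercase under IDLE's locale held a non-ASCII byte right after the 26 ASCII letters. When a Unicode letter is tested against an 8-bit string, Python 2 first decodes the 8-bit string with the default ASCII codec, so the failing position would be an index into valid_letters rather than into the user's input. The snippet below is Python 3 and reproduces only that decoding step; `valid_letters` here is a hypothetical stand-in, not the program's real value.

```python
import string

# Hypothetical stand-in for a locale-dependent string.lowercase:
# the 26 ASCII letters followed by one non-ASCII byte (0x83).
valid_letters = string.ascii_lowercase.encode("ascii") + b"\x83"

try:
    # This is the implicit step Python 2 performs before comparing
    # a Unicode letter with an 8-bit string.
    valid_letters.decode("ascii")
except UnicodeDecodeError as err:
    print(err)
    failing_position = err.start   # index of the offending byte
```

If that reading is right, a plain ASCII alphabet for valid_letters, or decoding it explicitly, would avoid the error regardless of what the user types.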

> valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."
>
> On the other hand I wouldn't expect "letter" to have more than one
> character so how could it have a problem at position 26?

I'm iterating over the string.


> Why change the default codec rather than explicitly using the codec you
> care about? If you want to work in the 8-bit world rather than the
> Unicode world, just use the "encode" function on the Unicode object. If
> you want to work in the Unicode world.


Great - sounds good.

> I'd advise against fixing the problem in that way. Convert data
> appropriately when you bring it from the outside world into the Python
> program and ignore the default codec.
>
> Paul Prescod

Thanks for your help.



Fuzzyman

http://www.voidspace.org.uk/atlantibots/pythonutils.html
 
Fuzzyman

Peter Otten said:
> Fuzzyman wrote:
> [snip..]
>> I can't work out how to change the default codec (no matter what the
>> locale) ?
>>
>> Anyone able to help - or point me to a useful resource ?? (I've tried
>> google - b4 u suggest it )
>
> You can either explicitly convert your unicode strings:
>
> unicodeword.encode("latin-1")

I'll try this.
Some of the errors said (something to the effect of) 'character not in
range(128)' which sounds like some standard 'methods' (or functions)
are only prepared to deal with the default 7-bit ascii. That could be
a bugger.

> or try to modify your site.py from the default
>
> encoding = "ascii"
>
> to
>
> encoding = "latin-1"

Short of me actually looking... where is site.py :)
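
For locating site.py, the interpreter can report where it loaded the module from. This is a small sketch that should work on any standard CPython install; the variable name `site_path` is just for illustration. Note that it runs under Python 3, where the default encoding is fixed at UTF-8, so the `encoding = "ascii"` setting discussed above applies only to the older Python 2 site.py.

```python
import site
import sys

# The site module records the path it was imported from.
site_path = site.__file__
print(site_path)

# The interpreter-wide default codec currently in effect.
print(sys.getdefaultencoding())
```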


Thanks


Fuzzyman


http://www.voidspace.org.uk/atlantibots/pythonutils.html
 
