Changing the default text codec

Fuzzyman · Feb 23, 2004

Sorry if my terminology is wrong..... but I'm having intermittent
problems dealing with accented characters in python. (Only from the 8
bit latin-1 character set I think..)

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

It's particularly difficult for me to understand what is happening -
because python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in
prepare
if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the users input to remove invalid
characters (like "-" and "'")... It crashes when it comes tot he
e-acute.

*However* - If I run it by double clicking on the file then it appears
to work fine (e.g. if I ask it find anagrams of 'degré hello ma' then
it strips out the e-acute (thinking it's an invalid character) and
finds anagrams of the rest :

gleam holder
hallo merged

What I'd like to do is switch by default to an 8 bit codec (latin-1 I
think ?????) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e for
example) *or* treating them as seperate characters.............

I can't work out how to change the default codec (no matter what the
locale) ?

Anyone able to help - or point me to a useful resource ?? (I've tried
google - b4 u suggest it )

Fuzzy

Peter Otten · Feb 23, 2004

Fuzzyman said:
Sorry if my terminology is wrong..... but I'm having intermittent
problems dealing with accented characters in python. (Only from the 8
bit latin-1 character set I think..)

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

It's particularly difficult for me to understand what is happening -
because python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in
prepare
if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the users input to remove invalid
characters (like "-" and "'")... It crashes when it comes tot he
e-acute.

*However* - If I run it by double clicking on the file then it appears
to work fine (e.g. if I ask it find anagrams of 'degré hello ma' then
it strips out the e-acute (thinking it's an invalid character) and
finds anagrams of the rest :

gleam holder
hallo merged

What I'd like to do is switch by default to an 8 bit codec (latin-1 I
think ?????) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e for
example) *or* treating them as seperate characters.............

I can't work out how to change the default codec (no matter what the
locale) ?

Anyone able to help - or point me to a useful resource ?? (I've tried
google - b4 u suggest it )

You can either explicitly convert your unicode strings:

unicodeword.encode("latin-1")

or try to modify your site.py from the default

encoding = "ascii"

to

encoding = "latin-1"

Peter

Paul Prescod · Feb 23, 2004

Fuzzyman said:
Sorry if my terminology is wrong..... but I'm having intermittent
problems dealing with accented characters in python. (Only from the 8
bit latin-1 character set I think..)

I would say that if you get a 100% failure rate in IDLE and a 100%
success rate from a console program then your problem is not
intermittent but environment specific.

For example - if I run my program from IDLE and give it the word
'degri' (containing e-acute) then I get the error :

What do you mean "give it the word". Through raw_input()? Through a file?

However you are getting this information, it seems to me that in IDLE
you are getting a Unicode object rather than an 8-bit string object.
Convert it to an 8-bit string:

mydata.encode("latin-1")

> if letter in self.valid_letters:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
> 26: ordinal not in range(128)

Something looks suspicious here. I wouldn't expect self.valid_letters to
have a 0x83 character in it because I would expect it to be hard-coded
to ASCII in your program like:

valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."

On the other hand I wouldn't expect "letter" to have more than one
character so how could it have a problem at position 26?

What I'd like to do is switch by default to an 8 bit codec (latin-1 I
think ?????) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e for
example) *or* treating them as seperate characters.............

Why change the default codec rather than explicitly using the codec you
care about? If you want to work in the 8-bit world rather than the
Unicode world, just use the "encode" function on the Unicode object. If
you want to work in the Unicode world.

I can't work out how to change the default codec (no matter what the
locale) ?

I'd advise against fixing the problem in that way. Convert data
appropriately when you bring it from the outside world into the Python
program and ignore the default codec.

Paul Prescod

Fuzzyman · Feb 23, 2004

Paul Prescod said:
I would say that if you get a 100% failure rate in IDLE and a 100%
success rate from a console program then your problem is not
intermittent but environment specific.

If that was the case then I'm sure you'd be right... good not to
quibble about terminology eh ;-)

(in a few other test cases the success-fail pattern was the opposite
way round)

What do you mean "give it the word". Through raw_input()? Through a file?

Right - it is fetching the words from a Tkinter entry box using the
get() method.

However you are getting this information, it seems to me that in IDLE
you are getting a Unicode object rather than an 8-bit string object.
Convert it to an 8-bit string:

mydata.encode("latin-1")

Great - that might do the job.
I'll try it.
Thanks.

Something looks suspicious here. I wouldn't expect self.valid_letters to
have a 0x83 character in it because I would expect it to be hard-coded
to ASCII in your program like:

Self.valid_letters *in fact* is string.lowercase - which I thought
included the 8 bit latin-1 letters as well. (the letters are converted
to lowercase by using the .lower() string method )

valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."

On the other hand I wouldn't expect "letter" to have more than one
character so how could it have a problem at position 26?

I'm iterating over the string.

Why change the default codec rather than explicitly using the codec you
care about? If you want to work in the 8-bit world rather than the
Unicode world, just use the "encode" function on the Unicode object. If
you want to work in the Unicode world.

Great - sounds good.

I'd advise against fixing the problem in that way. Convert data
appropriately when you bring it from the outside world into the Python
program and ignore the default codec.

Paul Prescod

Thanks for your help.

Fuzzyman

http://www.voidspace.org.uk/atlantibots/pythonutils.html

Fuzzyman · Feb 23, 2004

Peter Otten said:
Fuzzyman wrote:
[snip..]

I can't work out how to change the default codec (no matter what the
locale) ?

Anyone able to help - or point me to a useful resource ?? (I've tried
google - b4 u suggest it )

Click to expand...

You can either explicitly convert your unicode strings:

unicodeword.encode("latin-1")

I'll try this.
Some of the errors said (something to the effect of) 'character not in
range(128)' which sounds like some standard 'methods' (or functions)
are only prepared to deal with the default 7-bit ascii. That could be
a bugger.

or try to modify your site.py from the default

encoding = "ascii"

to

encoding = "latin-1"

Short of me actually looking... where is site.py

Thanks

Fuzzyman

http://www.voidspace.org.uk/atlantibots/pythonutils.html

Peter Otten · Feb 23, 2004

Fuzzyman said:
Short of me actually looking... where is site.py

'/usr/local/lib/python2.3/site.py'

Unicode again ... default codec ...	0	Oct 20, 2009
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position	4	Dec 6, 2012
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position	58	Sep 29, 2013
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position	67	Jul 4, 2013
modifying a codec	1	Nov 5, 2008
Changing the (codec) error handler for the stdout/stderr streams in Python 3.0	3	Sep 2, 2008
Unicode conversion problem (codec can't decode)	2	Apr 4, 2008
'ascii' codec can't encode character u'\u2013'	3	Sep 30, 2005

Changing the default text codec

Fuzzyman

Peter Otten

Paul Prescod

Fuzzyman

Fuzzyman

Peter Otten

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads