different encodings for unicode() and u''.encode(), bug?


mario

Hello!

I stumbled on this situation: if I decode some string, below
just the empty string, using the mcbs encoding, it succeeds, but if I
try to encode it back with the same encoding, it surprisingly fails
with a LookupError. This seems like something that should be corrected?

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mcbs')
u''
>>> u''.encode('mcbs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mcbs


Best wishes to everyone for 2008!

mario
 

Martin v. Löwis

I stumbled on this situation: if I decode some string, below
just the empty string, using the mcbs encoding, it succeeds, but if I
try to encode it back with the same encoding, it surprisingly fails
with a LookupError. This seems like something that should be corrected?

Indeed - in your code. It's not the same encoding.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mcbs

Use "mbcs" in the second call, not "mcbs".

HTH,
Martin
 

mario

Use "mbcs" in the second call, not "mcbs".

Oops, sorry about that: when I switched to test it in the interpreter
I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
I.e. it was still the same encoding, even if maybe a non-existent one?

If I try again using "mbcs" consistently, I still get the same error:


$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> u''.encode('mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs

mario
 

John Machin

Use "mbcs" in the second call, not "mcbs".

Oops, sorry about that: when I switched to test it in the interpreter
I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
I.e. it was still the same encoding, even if maybe a non-existent one?

If I try again using "mbcs" consistently, I still get the same error:

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> u''.encode('mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs

Two things for you to do:

(1) Try these at the Python interactive prompt:

unicode('', 'latin1')
unicode('', 'mbcs')
unicode('', 'raboof')
unicode('abc', 'latin1')
unicode('abc', 'mbcs')
unicode('abc', 'raboof')

(2) Read what the manual (Library Reference -> codecs module ->
standard encodings) has to say about mbcs.
 

John Machin

(1) Try these at the Python interactive prompt:

unicode('', 'latin1')

Also use those 6 cases to check out the difference in behaviour
between unicode(x, y) and x.decode(y)
 

mario

Two things for you to do:

(1) Try these at the Python interactive prompt:

unicode('', 'latin1')
unicode('', 'mbcs')
unicode('', 'raboof')
unicode('abc', 'latin1')
unicode('abc', 'mbcs')
unicode('abc', 'raboof')

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('abc', 'mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs

Hmmn, strange. Same behaviour for "raboof".

(2) Read what the manual (Library Reference -> codecs module ->
standard encodings) has to say about mbcs.

The page at http://docs.python.org/lib/standard-encodings.html gives
the "purpose" of mbcs as:
Windows only: Encode operand according to the ANSI codepage (CP_ACP)

I don't know what the implications of encoding according to the "ANSI
codepage (CP_ACP)" are. "Windows only" seems clear, but why does it only
complain when decoding a non-empty string (or when encoding the empty
unicode string)?

mario
 

John Machin

Two things for you to do:
(1) Try these at the Python interactive prompt:
unicode('', 'latin1')
unicode('', 'mbcs')
unicode('', 'raboof')
unicode('abc', 'latin1')
unicode('abc', 'mbcs')
unicode('abc', 'raboof')

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('abc', 'mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs



Hmmn, strange. Same behaviour for "raboof".
(2) Read what the manual (Library Reference -> codecs module ->
standard encodings) has to say about mbcs.

Page at http://docs.python.org/lib/standard-encodings.html says that
mbcs "purpose":
Windows only: Encode operand according to the ANSI codepage (CP_ACP)

I don't know what the implications of encoding according to the "ANSI
codepage (CP_ACP)" are.

Neither do I. YAGNI (especially on darwin) so don't lose any sleep
over it.
Windows only seems clear, but why does it only
complain when decoding a non-empty string (or when encoding the empty
unicode string) ?

My presumption: because it doesn't need a codec to decode '' into u'';
no failed codec look-up, so no complaint. Any realistic app will try
to decode a non-empty string sooner or later.
 

mario

Neither do I. YAGNI (especially on darwin) so don't lose any sleep
over it.


My presumption: because it doesn't need a codec to decode '' into u'';
no failed codec look-up, so no complaint. Any realistic app will try
to decode a non-empty string sooner or later.

Yes, I suspect I will never need it ;)

Incidentally, the situation is that in a script that tries to guess a
file's encoding, it bombed on the file ".svn/empty-file" -- but why it
was going so far with an empty string was really due to a bug
elsewhere in the script, trivially fixed. Still, I was curious about
this non-symmetric behaviour for the empty string by some encodings.

Anyhow, thanks a lot to both of you for the great feedback!

mario
 

Piet van Oostrum

mario said:
M> $ python
M> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
M> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
M> Type "help", "copyright", "credits" or "license" for more information.
M> Traceback (most recent call last):
M> Hmmn, strange. Same behaviour for "raboof".

Apparently for the empty string the encoding is irrelevant as it will not
be used. I guess there is an early check for this special case in the code.
 

Martin v. Löwis

I don't know what the implications of encoding according to the "ANSI
codepage (CP_ACP)" are. "Windows only" seems clear, but why does it only
complain when decoding a non-empty string (or when encoding the empty
unicode string)?

It has no implications for this issue here. CP_ACP is a Microsoft
invention of a specific encoding alias - the "ANSI code page"
(as Microsoft calls it) is not a specific encoding where I could
specify a mapping from bytes to characters, but instead a
system-global indirection based on a language default. For example,
in the Western-European/U.S. version of Windows, the default for
CP_ACP is cp1252 (local installation may change that default,
system-wide).

The issue likely has the cause that Piet also guessed: If the
input is an empty string, no attempt to actually perform an
encoding is done, but the output is assumed to be an empty
string again. This is correct behavior for all codecs that Python
supports in its default installation, at least for the direction
bytes->unicode. For the reverse direction, such an optimization
would be incorrect; consider u"".encode("utf-16").
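Martin's utf-16 point is easy to verify; for instance, in a current Python 3 interpreter (where str plays the role of Python 2's unicode):

```python
# UTF-16 output starts with a byte order mark (BOM) even for empty
# input, so the encoder must really run: "" does not encode to b"".
bom = "".encode("utf-16")
print(len(bom))                  # the 2-byte BOM (b'\xff\xfe' on little-endian builds)

# UTF-8 needs no BOM, so there the empty string does round-trip to b"".
print(repr("".encode("utf-8")))  # b''
```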

HTH,
Martin
 

mario

Apparently for the empty string the encoding is irrelevant as it will not
be used. I guess there is an early check for this special case in the code.

In the module I am working on [*], I remember a failed encoding to
allow me, if necessary, to later re-process fewer encodings. In the
case of an empty string AND an unknown encoding, this strategy
failed...

Anyhow, the question is, should the behaviour be the same for these
operations, and if so what should it be:

u"".encode("non-existent")
unicode("", "non-existent")

mario

[*] a module to decode heuristically, that imho is actually starting
to look quite good, it is at http://gizmojo.org/code/decodeh/ and any
comments very welcome.
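For reference, the core of such a heuristic decoder can be sketched in a few lines; this is an illustrative Python 3 sketch, and `decode_heuristically` is a hypothetical name, not the actual decodeh API:

```python
def decode_heuristically(data, encodings=("ascii", "utf-8", "latin-1")):
    """Try candidate encodings in order; return (text, encoding) for the
    first one that both exists and decodes `data` without error."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec, or bytes invalid in this encoding
    # latin-1 assigns a character to every byte, so this fallback is only
    # reached if it was left out of the candidate list.
    return data.decode("latin-1", "replace"), "latin-1"
```

Note that decoding b'' succeeds under the very first candidate tried, so an empty file never exercises the later fallback logic, which matches the situation described above.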
 

John Machin

Apparently for the empty string the encoding is irrelevant as it will not
be used. I guess there is an early check for this special case in the code.

In the module I am working on [*] I am remembering a failed encoding
to allow me, if necessary, to later re-process fewer encodings.

If you were in fact doing that, you would not have had a problem. What
you appear to have been doing is (a) remembering a NON-failing
encoding, and assuming that it would continue not to fail (b) not
differentiating between failure reasons (codec doesn't exist, input
not consistent with specified encoding).

A good strategy when dealing with encodings that are unknown (in the
sense that they come from user input, or from a list of encodings you
got out of the manual, or are constructed on the fly (e.g. encoding =
'cp' + str(code_page_number) # old MS Excel files)) is to try to decode
some vanilla ASCII alphabetic text, so that you can give an immediate
in-context error message.
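A minimal version of that check might look like the following (Python 3 sketch; `codec_exists` is a hypothetical helper name). Using codecs.lookup means even an empty input cannot mask a bad codec name:

```python
import codecs

def codec_exists(name):
    """Return True if `name` is a codec this Python knows about.

    codecs.lookup always consults the codec registry, so unlike
    decoding an empty string it cannot be fooled by empty input.
    """
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True

# "mbcs" exists only on Windows; "raboof" exists nowhere.
for name in ("latin1", "mbcs", "raboof"):
    print(name, codec_exists(name))
```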
In the
case of an empty string AND an unknown encoding this strategy
failed...
Anyhow, the question is, should the behaviour be the same for these
operations, and if so what should it be:

u"".encode("non-existent")
unicode("", "non-existent")

Perhaps you should make TWO comparisons:
(1)
unistrg = strg.decode(encoding)
with
unistrg = unicode(strg, encoding)
[the latter "optimises" the case where strg is ''; the former can't,
because its output may be '', not u'', depending on the encoding, so
it must do the lookup]
(2)
unistrg = strg.decode(encoding)
with
strg = unistrg.encode(encoding)
[both always do the lookup]

In any case, a pointless question (IMHO); the behaviour is extremely
unlikely to change, as the chance of breaking existing code outvotes
any desire to clean up a minor inconsistency that is easily worked
around.
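As an aside, the asymmetry has in fact survived into Python 3 on the CPython builds I am aware of (bytes.decode taking the old str.decode role); this is an implementation detail, not documented behaviour:

```python
# Decoding empty bytes short-circuits before the codec lookup on
# CPython (outside of "dev mode"), so a bogus name passes silently...
print(repr(b"".decode("raboof")))   # ''

# ...while str.encode always looks the codec up, even for "".
try:
    "".encode("raboof")
except LookupError as exc:
    print(exc)                      # unknown encoding: raboof
```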
 

mario

In the module I am working on [*] I am remembering a failed encoding
to allow me, if necessary, to later re-process fewer encodings.

If you were in fact doing that, you would not have had a problem. What
you appear to have been doing is (a) remembering a NON-failing
encoding, and assuming that it would continue not to fail

Yes, exactly. But it makes no difference which ones I remember, as the
two subsets will always add up to the same thing. In this special case
(empty string!) the unicode() call does not fail...
(b) not
differentiating between failure reasons (codec doesn't exist, input
not consistent with specified encoding).

There is no failure in the first pass in this case... If I do as you
suggest further down, that is, use s.decode(encoding) instead of
unicode(s, encoding) to force the lookup, then I could remember the
failure reason and use it to decide how to proceed. However, I am
aiming at an automatic decision, so an in-context error message would
need to be replaced with more rigorous information about how the
guessing should proceed. I am also trying to keep this simple ;)

In any case, a pointless question (IMHO); the behaviour is extremely
unlikely to change, as the chance of breaking existing code outvotes
any desire to clean up a minor inconsistency that is easily worked
around.

Yes, I would agree. The workaround may not even be worth it, though:
what I really want is a unicode object, so changing from calling
unicode() to s.decode() is not quite right, and would anyway require a
further check. Less clear code, and a little unnecessary performance
hit for the 99.9% majority of cases... Anyhow, I have improved the
"post guess" checking/refining logic of the algorithm [*] a little
further.

What I'd like to understand better is the "compatibility hierarchy"
of known encodings: in the positive sense, that if a string decodes
successfully with encoding A, then it may also decode with encodings
B, C; and in the negative sense, that if a string fails to decode with
encoding A, then for sure it will also fail to decode with encodings
B, C. Does anyone know if such an analysis of the relationships
between encodings exists?

Thanks! mario

[*] http://gizmojo.org/code/decodeh/
 

Martin v. Löwis

What I'd like to understand better is the "compatibility hierarchy"
of known encodings: in the positive sense, that if a string decodes
successfully with encoding A, then it may also decode with encodings
B, C; and in the negative sense, that if a string fails to decode with
encoding A, then for sure it will also fail to decode with encodings
B, C. Does anyone know if such an analysis of the relationships
between encodings exists?

Most certainly. You'll have to learn a lot about many encodings though
to really understand the relationships.

Many encodings X are "ASCII supersets", in the sense that if you have
only characters in the ASCII set, the encoding of the string in ASCII
is the same as the encoding of the string in X. ISO-8859-X, ISO-2022-X,
koi8-x, and UTF-8 fall in this category.

Other encodings are "ASCII supersets" only in the sense that they
include all characters of ASCII, but encode them differently. EBCDIC
and UCS-2/4, UTF-16/32 fall in that category.

Some encodings are 7-bit, so that they decode as ASCII (producing
mojibake if the input wasn't ASCII). ISO-2022-X is an example.

Some encodings are 8-bit, so that they can decode arbitrary bytes
(again producing mojibake if the input wasn't that encoding).
ISO-8859-X are examples, as are some of the EBCDIC encodings, and
koi8-x. Also, things will successfully (but meaninglessly) decode
as UTF-16 if the number of bytes in the input is even (likewise
for UTF-32).
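Some of these relationships can be spot-checked directly in a Python 3 interpreter (a sketch; utf-16-le is used below to sidestep BOM and byte-order questions):

```python
# ASCII supersets: pure-ASCII bytes decode the same way under all of them.
sample = b"plain ascii"
assert sample.decode("ascii") == sample.decode("utf-8") == sample.decode("koi8-r")

# UTF-16 includes ASCII's characters but encodes them differently.
assert "A".encode("utf-16-le") == b"A\x00"

# latin-1 assigns a character to every byte value, so any input decodes.
assert len(bytes(range(256)).decode("latin-1")) == 256

# Even-length byte strings usually "decode" as UTF-16, meaningfully or not
# (lone surrogates are the exception).
print(repr(b"\x01\x02\x03\x04".decode("utf-16-le")))
```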

HTH,
Martin
 
