Issues with `codecs.register` and `codecs.CodecInfo` objects

Karl Knechtel

Hello all,

While attempting to make a wrapper for opening multiple types of
UTF-encoded files (more on that later, in a separate post, I guess), I
ran into some oddities with the `codecs` module, specifically to do
with `.register`ing `CodecInfo` objects. I'd like to report a bug or
something, but there are several intertangled issues here and I'm not
really sure how to report it so I thought I'd open the discussion.
Apologies in advance if I get a bit rant-y, and a warning that this is
fairly long.

Observe what happens when you `register` the wrong function:
... # Very obviously wrong, just for demonstration purposes
... if name == 'spam': return 'eggs'
...
Already there is a problem in that there is no error... there is no
realistic way to catch this, of course, but IMHO it points to an issue
with the interface. I don't want to register a codec lookup function;
I want to register *a codec*. The built-in lookup process would be
just fine if I could just somehow tell it about this one new codec I
have... I really don't see the use case for the added flexibility of
the current interface, and it means that every time I have a new
codec, I need to either create a new lookup function as well (to
register it), or hook into an existing one that's still of my own
creation.
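
For what it's worth, the "register a codec, not a lookup function" idea can be approximated with a small wrapper around the existing interface. A minimal sketch, assuming a hypothetical helper `register_codec` and a toy codec named `spamdemo` (neither is part of the `codecs` module):

```python
import codecs

def register_codec(info):
    """Hypothetical helper: register a single CodecInfo under info.name."""
    def search(name):
        # codecs.lookup() hands the search function a normalized (lowercased) name.
        return info if name == info.name else None
    codecs.register(search)
    return search  # returned so the caller can keep a reference to it

# Toy codec for illustration only: upper-cases on encode, lower-cases on decode.
def spam_encode(text, errors="strict"):
    return text.upper().encode("ascii", errors), len(text)

def spam_decode(data, errors="strict"):
    return bytes(data).decode("ascii", errors).lower(), len(data)

spam_info = codecs.CodecInfo(spam_encode, spam_decode, name="spamdemo")
spam_search = register_codec(spam_info)

encoded = codecs.encode("eggs", "spamdemo")
decoded = codecs.decode(encoded, "spamdemo")
```

Each codec still gets its own search function under the hood, so this only hides the indirection rather than removing it. Later Python releases (3.10 onward, as far as I know) also provide `codecs.unregister`, which would accept the search function returned here.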

Anyway, moving on, let's see what happens when we try to use the faulty codec:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 939, in getencoder
    return lookup(encoding).encode
TypeError: codec search functions must return 4-tuples

Ehh?! That's odd. I thought I was supposed to return a `CodecInfo`
object, not a 4-tuple! Although as an aside, AFAICT the documentation
*doesn't actually document the CodecInfo class*, it just says what
attributes CodecInfo objects are supposed to have.

A bit of digging around with Google and existing old bugs on the
tracker suggests that this comes about due to backwards-compatibility:
in 2.4 and below, they *were* 4-tuples. But now CodecInfo objects are
expected to provide 6 functions (and a name), not 4. Clearly that
won't fit in a 4-tuple, and anyway I thought we had gotten rid of all
this deprecated stuff.

Regardless, let's see what happens if we do try to register a 4-tuple-lookup-er:
... # As long as we return a 4-tuple, it doesn't really matter what the functions are;
... # errors shouldn't happen until we actually attempt to encode/decode. Right?
... if name == 'spam': return (spam, spam, spam, spam)

Oops, we need to restart the interpreter, or otherwise reset global
state somehow, because the old lookup function has priority over this
one, and *there is no way to unregister it*. But once that's fixed:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 939, in getencoder
    return lookup(encoding).encode
AttributeError: 'tuple' object has no attribute 'encode'

That's quite odd indeed. We can't actually trust the error message we
got before! 4-tuples don't work any more like they used to, so our
backwards-compatibility concession doesn't even work. Meanwhile, we're
left wondering how CodecInfo objects work at all. Is the error message
wrong?

Nope, well, not really. Let's grab a known-good CodecInfo object and
see what we can find out...
(<built-in function utf_8_encode>, <function decode at
That long ago...)

... and if we try `help` (or look at examples in the standard library
or find them with Google - but I sure don't see any in the webpage
docs), we can at least find out how to construct a CodecInfo object
properly - although, curiously, it's implemented using `__new__`
rather than `__init__`.
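
To make the 4-tuple business concrete: in the Python 3 versions I have looked at, `CodecInfo` is a `tuple` subclass that holds the four legacy items and carries everything else as attributes. A small sketch using the built-in UTF-8 codec (the variable names are mine, chosen for illustration):

```python
import codecs

info = codecs.lookup("utf-8")

# Still a 4-tuple underneath, which is what the lookup machinery checks for.
is_tuple = isinstance(info, tuple)
encode, decode, reader, writer = info

# The same four objects are also available as attributes...
same_objects = (
    encode is info.encode
    and decode is info.decode
    and reader is info.streamreader
    and writer is info.streamwriter
)

# ...while the newer pieces exist only as attributes.
extras = (info.name, info.incrementalencoder, info.incrementaldecoder)

# Stateless encoders return a (bytes, length consumed) pair.
data, consumed = info.encode("spam")
```

That would explain why the "must return 4-tuples" message is technically accurate and yet a plain tuple still fails later: the length check passes, but the attribute access does not.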

You *can* hack around with `collections.namedtuple` and create
something that basically works:

# restarting again...
... if name == 'spam': return my_codecinfo(spam, spam, spam, spam)

And now the error correctly doesn't occur until we actually attempt to
encode or decode something. Except we still don't have an incremental
decoder/encoder, and in fact those are missing attributes rather than
`None` as they're defaulted to by the `CodecInfo` class. (Of course,
we can subclass `collections.namedtuple` to fix this, but then we're
basically reverse-engineering the `codecs.CodecInfo` class
wholesale...)
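
The difference between the two can be seen without registering anything. A rough sketch, assuming a hypothetical namedtuple `MyCodecInfo` that mirrors the hack above and a placeholder function `noop` standing in for the real encode/decode callables:

```python
import codecs
from collections import namedtuple

# Hypothetical stand-in for the namedtuple hack described above.
MyCodecInfo = namedtuple("MyCodecInfo", "encode decode streamreader streamwriter")

def noop(*args, **kwargs):
    # Placeholder only; a real codec would encode or decode here.
    raise NotImplementedError

hacked = MyCodecInfo(noop, noop, noop, noop)
real = codecs.CodecInfo(noop, noop)

# The namedtuple has no incremental attributes at all...
hacked_has_attr = hasattr(hacked, "incrementaldecoder")

# ...whereas CodecInfo defines them and defaults them to None.
real_value = real.incrementaldecoder
```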

Speaking of which, one last thing:
... if name == 'spam': return codecs.CodecInfo(spam, spam)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 976, in getincrementaldecoder
    raise LookupError(encoding)
LookupError: spam

That seems wrong to me too: the codec is certainly *there*, it just
doesn't support incremental decoding. I would expect the error message
to be more specific.
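
Until the message improves, calling code can tell the two cases apart itself. A sketch, assuming a hypothetical helper `describe_incremental_support` and a throwaway codec name `spamnoinc`: look the codec up first, then check the attribute, rather than relying on the `LookupError` from `getincrementaldecoder`.

```python
import codecs

def _encode(text, errors="strict"):
    return text.encode("ascii", errors), len(text)

def _decode(data, errors="strict"):
    return bytes(data).decode("ascii", errors), len(data)

def _search(name):
    # Only encode/decode are supplied; the incremental classes default to None.
    if name == "spamnoinc":
        return codecs.CodecInfo(_encode, _decode, name="spamnoinc")
    return None

codecs.register(_search)

def describe_incremental_support(encoding):
    """Hypothetical helper: separate 'no such codec' from 'no incremental decoder'."""
    try:
        info = codecs.lookup(encoding)
    except LookupError:
        return "unknown codec"
    if info.incrementaldecoder is None:
        return "codec found, but no incremental decoder"
    return "incremental decoding supported"
```

For example, `describe_incremental_support("spamnoinc")` reports that the codec exists but lacks an incremental decoder, while `codecs.getincrementaldecoder("spamnoinc")` raises the same bare `LookupError` as for a name that was never registered.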
 
Steven D'Aprano

Yes, it's a strangely indirect API, and yes it looks like you have
identified a whole bucket full of problems with it. And no, I don't know
why that API was chosen.

Changing to a cleaner, more direct (sensible?) API would be a fairly big
step. If you want to pursue this, the steps I recommend you take are:

1) understanding the reason for the old API (search the Internet
and particularly the (e-mail address removed) archives);

2) have a plan for how to avoid breaking code that relies on the
existing API;

3) raise the issue on (e-mail address removed) to gather feedback
and see how much opposition or support it is likely to get;
they'll suggest whether a bug report is sufficient or if you'll
need a PEP;

http://www.python.org/dev/peps/


If you can provide a patch and a test suite, you will have a much better
chance of pushing it through. If not, you are reliant on somebody else
who can being interested enough to do the work.

And one last thing: any new functionality will simply *not* be considered
for Python 2.x. Aim for Python 3.4, since the 2.x series is now in
bug-fix only maintenance mode and the 3.3 beta is no longer accepting new
functionality, only bug fixes.
 
Walter Dörwald

> Yes, it's a strangely indirect API, and yes it looks like you have
> identified a whole bucket full of problems with it. And no, I don't know
> why that API was chosen.

This API was chosen for backwards compatibility reasons when incremental
encoders/decoders were introduced (in 2006).

And yes: We missed the opportunity to clean that up to always use CodecInfo.

> Changing to a cleaner, more direct (sensible?) API would be a fairly big
> step. If you want to pursue this, the steps I recommend you take are:
>
> 1) understanding the reason for the old API (search the Internet
> and particularly the (e-mail address removed) archives);

See e.g. http://mail.python.org/pipermail/patches/2006-March/019122.html

> 2) have a plan for how to avoid breaking code that relies on the
> existing API;
>
> 3) raise the issue on (e-mail address removed) to gather feedback
> and see how much opposition or support it is likely to get;
> they'll suggest whether a bug report is sufficient or if you'll
> need a PEP;
>
> http://www.python.org/dev/peps/
>
> If you can provide a patch and a test suite, you will have a much better
> chance of pushing it through. If not, you are reliant on somebody else
> who can being interested enough to do the work.
>
> And one last thing: any new functionality will simply *not* be considered
> for Python 2.x. Aim for Python 3.4, since the 2.x series is now in
> bug-fix only maintenance mode and the 3.3 beta is no longer accepting new
> functionality, only bug fixes.

Servus,
Walter
 
