Proper use of the codecs module.

Andrew

I have a mixed binary/text file[0], and the text portions use a radically
nonstandard character set. I want to read them easily given information
about the character encoding and an offset for the beginning of a string.

The descriptions of the codecs module and codecs.register() in particular
seem to suggest that this is already supported in the standard library.
However, I can't find any examples of its proper use. Most people who use
the module seem to want to read UTF files in Python 2.x.[1] I would like to
know how to correctly set up a new codec for reading files that have
nonstandard encodings.

I have two other related questions:

How does seek() work on a file opened in text mode? Does it seek to a
character offset or to a byte offset? I need the latter behavior. If I
can't get it I will have to find a different approach.

The files I'm working with use a nonstandard end-of-string character in the
same fashion as C null-terminated strings. Is there a builtin function that
will read a file "from seek position until seeing EOS character X"? The
methods I see for this online seem to amount to reading one character at a
time and checking manually, which seems nonoptimal to me.


[0] The file is an SNES ROM dump, but I don't think that matters.
[1] I'm using Python 3, if it's relevant.
 
Steven D'Aprano

I have a mixed binary/text file[0], and the text portions use a
radically nonstandard character set. I want to read them easily given
information about the character encoding and an offset for the beginning
of a string.

"Mixed binary/text" is not a helpful model to use. You are better off
thinking of the file as "binary", where some of the fields happen to
contain text encoded with some custom codec.

If you try opening the file in text mode, you'll very likely break the
binary parts (e.g. converting the two bytes 0x0D0A to a single byte
0x0A). So best to stick to binary only, extract the "text" portions of
the file, then explicitly decode them.
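
A minimal sketch of that approach, assuming the offset and length of the text field are already known. The helper name read_text_field, the sample bytes, and the use of latin-1 as the encoding are illustrative assumptions, not details of the actual file:

```python
import io

# Hypothetical stand-in for the real file: four bytes of binary data,
# a five-byte text field, then more binary data (including 0x0D 0x0A).
blob = b"\x00\x0d\x0a\xff" + "hello".encode("latin-1") + b"\x0d\x0a\x01"

def read_text_field(f, offset, length, encoding):
    """Seek to a byte offset, read `length` bytes, and decode them."""
    f.seek(offset)            # in binary mode this is always a byte offset
    raw = f.read(length)      # bytes, with no newline translation applied
    return raw.decode(encoding)

with io.BytesIO(blob) as f:   # with a real file: open(path, "rb")
    text = read_text_field(f, 4, 5, "latin-1")
    f.seek(0)
    untouched = f.read()      # the binary parts come back byte for byte
```

Because nothing is decoded until the explicit decode() call, the surrounding binary data is never touched by newline or encoding handling.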

The descriptions of the codecs module and codecs.register() in
particular seem to suggest that this is already supported in the
standard library. However, I can't find any examples of its proper use.
Most people who use the module seem to want to read UTF files in Python
2.x.[1] I would like to know how to correctly set up a new codec for
reading files that have nonstandard encodings.

I suggest you look at the source code for the dozens of codecs in the
standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

(Adjust for your installation location as required.)
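
The table-driven codecs in the encodings package follow broadly the same pattern, which can be sketched as below. The codec name "snes_demo" and the character table are invented for illustration; a real ROM would need its own 256-entry table. The calls used (codecs.charmap_build, codecs.charmap_decode, codecs.charmap_encode, codecs.CodecInfo, codecs.register) are the ones the standard library codecs rely on:

```python
import codecs

# Hypothetical character table: byte 0x00 -> 'A', 0x01 -> 'B', ...,
# 0x19 -> 'Z', 0x1A -> ' '. Every other byte maps to U+FFFE, which the
# charmap functions treat as "undefined".
_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
decoding_table = _chars + "\ufffe" * (256 - len(_chars))
encoding_table = codecs.charmap_build(decoding_table)

def _decode(data, errors="strict"):
    # Returns a (str, number_of_bytes_consumed) tuple.
    return codecs.charmap_decode(data, errors, decoding_table)

def _encode(text, errors="strict"):
    # Returns a (bytes, number_of_characters_consumed) tuple.
    return codecs.charmap_encode(text, errors, encoding_table)

def _search(name):
    # The codec machinery calls this with a normalized (lowercased) name.
    if name == "snes_demo":
        return codecs.CodecInfo(name="snes_demo", encode=_encode, decode=_decode)
    return None

codecs.register(_search)

decoded = bytes([7, 4, 11, 11, 14]).decode("snes_demo")
encoded = "HELLO".encode("snes_demo")
```

Once the search function is registered, the name works anywhere an encoding name is accepted, for example raw.decode("snes_demo") on bytes read from a file opened in binary mode.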

I have two other related questions:

How does seek() work on a file opened in text mode? Does it seek to a
character offset or to a byte offset? I need the latter behavior. If I
can't get it I will have to find a different approach.

For text files, seek() is only legal for offsets that tell() can return,
but this is not enforced, so you can get nasty rubbish like this:

py> f = open('/tmp/t', 'w', encoding='utf-32')
py> f.write('hello world')
11
py> f.close()
py> f = open('/tmp/t', 'r', encoding='utf-32')
py> f.read(1)
'h'
py> f.tell()
8
py> f.seek(3)
3
py> f.read(1)
'栀'


So I prefer not to seek in text files if I can help it.

The files I'm working with use a nonstandard end-of-string character in
the same fashion as C null-terminated strings. Is there a builtin
function that will read a file "from seek position until seeing EOS
character X"? The methods I see for this online seem to amount to
reading one character at a time and checking manually, which seems
nonoptimal to me.

How do you think such a built-in function would work, if not inspect each
character until the EOS character is seen? :)

There is no such built-in function though. By default, Python files are
buffered, so it won't literally read one character from disk at a time.
The actual disk IO will read a bunch of bytes into a memory buffer, and
then read from the buffer.
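
One way to avoid calling read(1) in a Python loop is to read larger chunks and let bytes.find() do the scanning. This is a rough sketch, assuming a single-byte terminator and a file opened in binary mode; the function name read_until, the 0xFF terminator, and the sample data are made up for the example:

```python
import io

def read_until(f, offset, terminator, chunk_size=4096):
    """Return the bytes from `offset` up to (not including) `terminator`.

    `f` must be opened in binary mode and `terminator` must be a single
    byte. If the terminator never appears, everything up to end of file
    is returned.
    """
    f.seek(offset)
    pieces = []
    while True:
        chunk = f.read(chunk_size)
        if not chunk:                  # end of file, no terminator found
            break
        end = chunk.find(terminator)   # the scan runs in C, not in Python
        if end != -1:
            pieces.append(chunk[:end])
            break
        pieces.append(chunk)
    return b"".join(pieces)

# Hypothetical data: 0xFF plays the role of the end-of-string byte.
demo = io.BytesIO(b"\x01\x02HELLO\xffWORLD\xff")
first = read_until(demo, 2, b"\xff", chunk_size=4)
second = read_until(demo, 8, b"\xff", chunk_size=4)
```

The tiny chunk_size in the demo is only there to exercise the case where a string spans several chunks. A multi-byte terminator would need extra care, since it could straddle a chunk boundary.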
 
Andrew

If you try opening the file in text mode, you'll very likely break the
binary parts (e.g. converting the two bytes 0x0D0A to a single byte
0x0A). So best to stick to binary only, extract the "text" portions of
the file, then explicitly decode them.

Okay, I'll do that. Given what you said about seek() and text mode below, I
have no choice anyway.

I suggest you look at the source code for the dozens of codecs in the
standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

I'll do that too. My thanks for the pointer.

For text files, seek() is only legal for offsets that tell() can return,
but this is not enforced, so you can get nasty rubbish like this:

<snip evil>

So I prefer not to seek in text files if I can help it.

If I'm understanding the above right, it seeks to a byte offset but the
behavior is undocumented, not guaranteed, shouldn't be used, etc. That
would actually work for me in theory (because I have exact byte offsets to
work with) but I think I'll avoid it anyway, on the grounds that relying on
undocumented behavior is bad.

How do you think such a built-in function would work, if not inspect each
character until the EOS character is seen? :)

I don't know, but I'm assuming it wouldn't involve a function call to
file.read(1) for each character, and that's what Google keeps handing me.
Such an approach fills me with horror. :) I suppose there's nothing
stopping me from reading some educated guess at the length of the string
and then stepping through the result. Or I'll look at the readline() source
and see how it does its thing.

There is no such built-in function though. By default, Python files are
buffered, so it won't literally read one character from disk at a time.
The actual disk IO will read a bunch of bytes into a memory buffer, and
then read from the buffer.

I'd guessed as much, but assumed there was still ridiculous function call
overhead involved in the repeated read(1) method above. Of course, trying
to avoid said overhead is premature optimization; my interest in doing so
is more aesthetic than anything else.

Thanks for the help.
 
Chris Angelico

I have a mixed binary/text file[0], and the text portions use a radically
nonstandard character set. I want to read them easily given information
about the character encoding and an offset for the beginning of a string.

To add to all the information already given: Is the file small enough
to comfortably fit into memory? If so, you'll find it a LOT easier to
play with strings in RAM than files on disk. Even if not, you may find
a lot of tasks simplified by just reading a kilobyte or a megabyte in and then
working within that. That spares you the fiddliness of read(1) all the
time, at the expense of potentially reading more than you need.
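
As a rough illustration of the in-memory approach, assuming the whole file fits in RAM and the strings end with a single terminator byte (0xFF here is a made-up example, as are the function name and the data):

```python
def extract_string(data, offset, terminator):
    """Slice one terminated string out of an in-memory bytes object."""
    end = data.find(terminator, offset)
    if end == -1:          # no terminator: take everything to the end
        end = len(data)
    return data[offset:end]

# With a real file: data = open(path, "rb").read()
data = b"\x01\x02HELLO\xffWORLD\xff"
strings = [extract_string(data, off, b"\xff") for off in (2, 8)]
```

Each result is still bytes and can be passed straight to decode() with whatever codec matches the file.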

ChrisA
 
