Tony Nelson
Is there a faster way to decode from charmaps to utf-8 than unicode()?
I'm writing a small card-file program. As a test, I use a 53 MB MBox
file, in mac-roman encoding. My program reads and parses the file into
messages in about 3.5 seconds, but takes about 13.5 seconds to iterate
over the cards and convert them to utf-8:
    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')
The time is nearly all in the unicode() call. It's not so much how much
time it takes, but that it takes 4 times as long as the real work, just
to do table lookups.
Looking at the source (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
dictionary lookup for each character. I would have thought that it
would make and cache a LUT the size of the charmap (and hook the
relevant dictionary stuff to delete the cached LUT if the dictionary is
changed).
I thought of using U"".translate(), but the unicode version is defined
to be slow. Is there some similar approach? I'm almost (but not quite)
ready to try it in Pyrex.
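For reference, the translate()-style approach I have in mind looks roughly like this. It is a sketch in Python 3 syntax, assuming a single-byte encoding in which every byte value is defined (true of mac-roman in Python's codec); make_translate_table and recode_to_utf8 are hypothetical names. Whether it is actually faster than a plain decode would have to be measured.

```python
def make_translate_table(encoding):
    # 256-character string: table[b] is the character that byte b decodes to.
    # Assumes every byte value is defined in the (single-byte) encoding.
    return bytes(range(256)).decode(encoding)


MAC_ROMAN_TABLE = make_translate_table("mac_roman")


def recode_to_utf8(data, table=MAC_ROMAN_TABLE):
    # latin-1 maps byte b to code point b, so translate() can index the
    # table by the original byte value.
    text = data.decode("latin-1").translate(table)
    return text.encode("utf-8")
```

The loop over the cards would then become cards[i] = recode_to_utf8(cards[i]).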
I'm new to Python. I didn't find anything relevant by googling
python.org or the groups.
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>