Unicode charmap decoders slow

Tony Nelson · Oct 2, 2005

Is there a faster way to decode from charmaps to utf-8 than unicode()?

I'm writing a small card-file program. As a test, I use a 53 MB MBox
file, in mac-roman encoding. My program reads and parses the file into
messages in about 3 to 5 seconds, but takes about 13.5 seconds to iterate
over the cards and convert them to utf-8:

    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')

The time is nearly all in the unicode() call. It's not so much how much
time it takes, but that it takes 4 times as long as the real work, just
to do table lookups.

Looking at the source (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
dictionary lookup for each character. I would have thought that it
would make and cache a LUT the size of the charmap (and hook the
relevant dictionary stuff to delete the cached LUT if the dictionary is
changed).
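A minimal sketch of the cached lookup table idea described above, done in pure Python rather than inside the codec. The names `_lut_cache`, `get_lut` and `decode_with_lut` are hypothetical and not part of Python's codec machinery. It assumes a single-byte encoding and is written to run on both Python 2.6+ and Python 3. It makes no claim about speed.

```python
# Hypothetical helpers illustrating a cached 256-entry lookup table (LUT).
# Assumes `encoding` names a single-byte charmap codec such as "mac-roman".
_lut_cache = {}

def get_lut(encoding):
    """Return a cached list mapping each byte value 0..255 to its character."""
    lut = _lut_cache.get(encoding)
    if lut is None:
        # Decode every possible byte once; "replace" yields U+FFFD for
        # byte values the charmap leaves undefined.
        lut = [bytes(bytearray([i])).decode(encoding, "replace")
               for i in range(256)]
        _lut_cache[encoding] = lut
    return lut

def decode_with_lut(data, encoding):
    """Decode a byte string by indexing the LUT once per byte."""
    lut = get_lut(encoding)
    # bytearray() yields integers on both Python 2 and Python 3.
    return u"".join([lut[b] for b in bytearray(data)])
```

The table is built once per encoding and reused on later calls, which is the caching part of the idea. Invalidating the table when the underlying charmap dictionary changes is not handled here.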

I thought of using U"".translate(), but the unicode version is defined
to be slow. Is there some similar approach? I'm almost (but not quite)
ready to try it in Pyrex.
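For illustration, one way a translate()-based approach could look: decode the bytes as latin-1, which maps byte value n to code point n, then remap the code points with the unicode translate() method. The function names `make_translate_table` and `decode_via_translate` are hypothetical. This is a sketch assuming a single-byte encoding, it should run on Python 2.6+ and Python 3, and whether it beats the standard codec would have to be measured.

```python
def make_translate_table(encoding):
    """Build a dict mapping latin-1 code points to the target code points."""
    table = {}
    for i in range(256):
        ch = bytes(bytearray([i])).decode(encoding, "replace")
        if ord(ch) != i:
            # Only store entries that differ from the latin-1 identity mapping.
            table[i] = ord(ch)
    return table

def decode_via_translate(data, table):
    """Decode bytes as latin-1, then remap each character through the table."""
    return data.decode("latin-1").translate(table)

mac_roman_table = make_translate_table("mac-roman")
```

Usage would be `decode_via_translate(raw_bytes, mac_roman_table)`, with the table built once and reused for every card.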

I'm new to Python. I didn't google anything relevant on python.org or
in groups.
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>

Martin v. Löwis · Oct 3, 2005

Tony said:
Is there a faster way to decode from charmaps to utf-8 than unicode()?

You could try the iconv codec, if your system supports iconv:

http://cvs.sourceforge.net/viewcvs.py/python-codecs/practicecodecs/iconv/

Regards,
Martin

Tony Nelson · Oct 3, 2005

Martin v. Löwis said:
You could try the iconv codec, if your system supports iconv:

http://cvs.sourceforge.net/viewcvs.py/python-codecs/practicecodecs/iconv/

I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().

Martin v. Löwis · Oct 3, 2005

Tony said:
I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().

Well, did you try a pure-Python version yourself?

    table = [chr(i).decode("mac-roman","replace") for i in range(256)]

    def decode_mac_roman(s):
        result = [table[ord(c)] for c in s]
        return u"".join(result)

How much faster than the standard codec is that?
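One way to answer that question is a small timing harness. The sketch below is an assumption-laden illustration, not a measurement from this thread: `sample` is synthetic data rather than a real MBox file, `time_decoders` is a hypothetical helper, and the table decoder is restated with bytearray so that it runs on both Python 2.6+ and Python 3.

```python
import timeit

# Table-based decoder, restated so it runs on Python 2.6+ and Python 3.
table = [bytes(bytearray([i])).decode("mac-roman", "replace")
         for i in range(256)]

def decode_mac_roman(s):
    # bytearray() yields integers on both Python 2 and Python 3.
    return u"".join([table[b] for b in bytearray(s)])

# Synthetic sample of about 100 KB covering every byte value (assumption:
# real mail data would have a different byte distribution).
sample = bytes(bytearray(range(256))) * 400

def time_decoders(data, repeat=5):
    """Return the best-of-`repeat` time in seconds for each decoder."""
    codec_time = min(timeit.repeat(
        lambda: data.decode("mac-roman", "replace"), number=1, repeat=repeat))
    table_time = min(timeit.repeat(
        lambda: decode_mac_roman(data), number=1, repeat=repeat))
    return codec_time, table_time

codec_time, table_time = time_decoders(sample)
print("standard codec: %.4f s" % codec_time)
print("table decoder:  %.4f s" % table_time)
```

The numbers depend heavily on the Python version and build, so results on a modern interpreter say little about the 2005 codec discussed here.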

Regards,
Martin

Tony Nelson · Oct 3, 2005

Martin v. Löwis said:
Tony said:

I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().

Well, did you try a pure-Python version yourself?

    table = [chr(i).decode("mac-roman","replace") for i in range(256)]

    def decode_mac_roman(s):
        result = [table[ord(c)] for c in s]
        return u"".join(result)

How much faster than the standard codec is that?

It's .18x faster.
