Unicode charmap decoders slow

Discussion in 'Python' started by Tony Nelson, Oct 3, 2005.

  1. Tony Nelson

    Tony Nelson Guest

    Is there a faster way to decode from charmaps to utf-8 than unicode()?

    I'm writing a small card-file program. As a test, I use a 53 MB MBox
    file, in mac-roman encoding. My program reads and parses the file into
    messages in about 3.5 seconds, but takes about 13.5 seconds to iterate
    over the cards and convert them to utf-8:

    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')

    The time is nearly all in the unicode() call. It's not so much how much
    time it takes, but that it takes 4 times as long as the real work, just
    to do table lookups.

    Looking at the source (which, if I have it right, is
    PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
    dictionary lookup for each character. I would have thought that it
    would make and cache a LUT the size of the charmap (and hook the
    relevant dictionary stuff to delete the cached LUT if the dictionary is
    changed).

    I thought of using U"".translate(), but the unicode version is defined
    to be slow. Is there some similar approach? I'm almost (but not quite)
    ready to try it in Pyrex.
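
One translate-based approach can be sketched as follows. This is written for Python 3 rather than the Python 2 used in this thread, and it assumes a single-byte charmap such as mac-roman, where every byte maps to exactly one character. The names `make_translate_table` and `decode_via_translate` are made up for illustration. Whether it beats the built-in codec depends on the interpreter version, so treat it as something to measure, not as a known win.

```python
# Sketch (Python 3). Assumption: a single-byte charmap, one character per byte.

def make_translate_table(encoding):
    """Return a dict mapping each byte value 0..255 to its decoded character."""
    # errors="replace" turns any undefined byte into U+FFFD, one char per byte.
    decoded = bytes(range(256)).decode(encoding, "replace")
    assert len(decoded) == 256
    return {i: ch for i, ch in enumerate(decoded)}

MAC_ROMAN_TABLE = make_translate_table("mac_roman")

def decode_via_translate(data, table=MAC_ROMAN_TABLE):
    # latin-1 maps byte N to code point N, so it works as an identity decode;
    # str.translate() then looks each code point up in the table.
    return data.decode("latin-1").translate(table)

# Hypothetical sample: mac-roman bytes containing a few non-ASCII characters.
sample = b"caf\x8e \xa5 na\x95ve"
utf8 = decode_via_translate(sample).encode("utf-8")
```

The table is built once per encoding and reused, which is the "make and cache a LUT" idea described above, just done at the Python level instead of inside the decoder.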

    I'm new to Python. I didn't google anything relevant on python.org or
    in groups.
    ________________________________________________________________________
    TonyN.:' *firstname*nlsnews@georgea*lastname*.com
    ' <http://www.georgeanelson.com/>
     
    Tony Nelson, Oct 3, 2005
    #1

  2. "Martin v. Löwis", Oct 3, 2005
    #2

  3. Tony Nelson

    Tony Nelson Guest

    In article <43410f1b$0$7019$>,
    "Martin v. Löwis" wrote:

    > Tony Nelson wrote:
    > > Is there a faster way to decode from charmaps to utf-8 than unicode()?

    >
    > You could try the iconv codec, if your system supports iconv:
    >
    > http://cvs.sourceforge.net/viewcvs.py/python-codecs/practicecodecs/iconv/


    I had seen iconv. Even if my system supports it and it is faster than
    Python's charmap decoder, it might not be available on other systems.
    Requiring something unusual in order to do a trivial LUT task isn't an
    acceptable solution. If I write a charmap decoder as an extension
    module in Pyrex I can include it with the program. I would prefer a
    solution that doesn't even need that, preferably in pure Python. Since
    Python does all the hard work so fast it certainly could do it, and it
    can almost do it with "".translate().
     
    Tony Nelson, Oct 3, 2005
    #3
  4. Tony Nelson wrote:
    > I had seen iconv. Even if my system supports it and it is faster than
    > Python's charmap decoder, it might not be available on other systems.
    > Requiring something unusual in order to do a trivial LUT task isn't an
    > acceptable solution. If I write a charmap decoder as an extension
    > module in Pyrex I can include it with the program. I would prefer a
    > solution that doesn't even need that, preferably in pure Python. Since
    > Python does all the hard work so fast it certainly could do it, and it
    > can almost do it with "".translate().


    Well, did you try a pure-Python version yourself?

    table = [chr(i).decode("mac-roman","replace") for i in range(256)]

    def decode_mac_roman(s):
        result = [table[ord(c)] for c in s]
        return u"".join(result)

    How much faster than the standard codec is that?
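
A rough way to answer that question is a small `timeit` harness like the one below. It is a sketch in Python 3, not the Python 2 of the code above, so iterating over `bytes` yields integers and no `ord()` is needed. The test data and the names `decode_with_table`, `decode_with_codec`, and `time_decoders` are hypothetical; real timings depend on the interpreter version and the input.

```python
import timeit

ENCODING = "mac_roman"

# 256-entry lookup table: index is the byte value, value is the decoded character.
TABLE = [bytes([i]).decode(ENCODING, "replace") for i in range(256)]

def decode_with_table(data):
    # Pure-Python decode: one list lookup per byte, then a single join.
    return "".join([TABLE[b] for b in data])

def decode_with_codec(data):
    # Built-in charmap codec, for comparison.
    return data.decode(ENCODING, "replace")

# Hypothetical test data: every byte value, repeated to about 100 KB.
data = bytes(range(256)) * 400

# Both decoders must agree before their timings are worth comparing.
assert decode_with_table(data) == decode_with_codec(data)

def time_decoders(number=20):
    """Return (table_seconds, codec_seconds) for `number` runs of each decoder."""
    t_table = timeit.timeit(lambda: decode_with_table(data), number=number)
    t_codec = timeit.timeit(lambda: decode_with_codec(data), number=number)
    return t_table, t_codec

if __name__ == "__main__":
    t_table, t_codec = time_decoders()
    print("table: %.4fs  codec: %.4fs" % (t_table, t_codec))
```

Checking that both functions return the same text first keeps the comparison honest: a faster decoder that produces different output is not a replacement.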

    Regards,
    Martin
     
    "Martin v. Löwis", Oct 3, 2005
    #4
  5. Tony Nelson

    Tony Nelson Guest

    In article <43419076$0$25143$>,
    "Martin v. Löwis" wrote:

    > Tony Nelson wrote:
    > > I had seen iconv. Even if my system supports it and it is faster than
    > > Python's charmap decoder, it might not be available on other systems.
    > > Requiring something unusual in order to do a trivial LUT task isn't an
    > > acceptable solution. If I write a charmap decoder as an extension
    > > module in Pyrex I can include it with the program. I would prefer a
    > > solution that doesn't even need that, preferably in pure Python. Since
    > > Python does all the hard work so fast it certainly could do it, and it
    > > can almost do it with "".translate().

    >
    > Well, did you try a pure-Python version yourself?
    >
    > table = [chr(i).decode("mac-roman","replace") for i in range(256)]
    >
    > def decode_mac_roman(s):
    >     result = [table[ord(c)] for c in s]
    >     return u"".join(result)
    >
    > How much faster than the standard codec is that?


    It's .18x faster.
     
    Tony Nelson, Oct 3, 2005
    #5
