modifying a codec

Discussion in 'Python' started by Tim Arnold, Nov 5, 2008.

  1. Tim Arnold

    Tim Arnold Guest

    Hi, I'm using the codecs module to read in utf8 and write out cp1252
    encodings. For some characters I'd like to override the default behavior.
    For example, the mdash character comes out as the code point \227 and I'd
    like to translate it as &mdash; instead.
    Example: the file myutf8.txt contains this string:
    'factor one — initially'
    ====================
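    import codecs

    # Read the whole file, decoding the UTF-8 bytes into a unicode string.
    fd0 = codecs.open('myutf8.txt', 'rb', encoding='utf8')
    line = fd0.read()
    fd0.close()

    # Write it back out; the codec encodes the string as cp1252 bytes.
    fd1 = codecs.open('my1252.txt', 'wb', encoding='cp1252')
    fd1.write(line)
    fd1.close()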
    ====================

    The codec is doing its job, but I want to override the codepoint for this
    character (plus others) to use the html entity instead (from \227 to
    &mdash; in this case).

    I see hints about writing your own codec and updating the decoding_map,
    but I could use some more detail.

    Is that the best way to solve the problem?

    thanks,
    --Tim Arnold
     
    Tim Arnold, Nov 5, 2008
    #1

  2. > The codec is doing its job, but I want to override the codepoint for this
    > character (plus others) to use the html entity instead (from \227 to
    > &mdash; in this case).
    >
    > I see hints about writing your own codec and updating the decoding_map,
    > but I could use some more detail.
    >
    > Is that the best way to solve the problem?


    I would say so, yes. Look at the source code of cp1252, and it should be
    fairly obvious how a charmap codec works. Make a copy of it, and remove
    the EM DASH line. This will give you a codec that just won't encode the
    character at all anymore.
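
One possible way to get such a codec without editing a copy of the file by hand is sketched below. It assumes Python 3 and relies on the `decoding_table` string that the standard library's `encodings.cp1252` module exposes. The codec name `cp1252_no_mdash` and the helper names are invented for this sketch, and the stream and incremental classes that `codecs.open` would need are left out.

```python
import codecs
import encodings.cp1252 as cp1252

# Hypothetical codec name, chosen for this sketch only.
CODEC_NAME = "cp1252_no_mdash"

# Copy of the cp1252 table with U+2014 EM DASH (byte 0x97) marked as
# undefined. U+FFFE is the marker the stdlib charmap tables use for
# "no mapping", so charmap_build() skips it.
_decoding_table = cp1252.decoding_table.replace("\u2014", "\ufffe")
_encoding_table = codecs.charmap_build(_decoding_table)


def _encode(text, errors="strict"):
    # Returns (bytes, number_of_characters_consumed).
    return codecs.charmap_encode(text, errors, _encoding_table)


def _decode(data, errors="strict"):
    # Returns (str, number_of_bytes_consumed).
    return codecs.charmap_decode(data, errors, _decoding_table)


def _search(name):
    # Codec search function: answer only for our own name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

With this in place, `"\u2014".encode(CODEC_NAME)` should raise `UnicodeEncodeError` under the default `strict` error handling, while every other cp1252 character encodes as before.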

    Then write an error handler that returns u"&mdash;" for the EM DASH
    (u"\u2014", which cp1252 encodes as \227), but otherwise continues to
    raise errors. See PEP 293 for code examples of error handlers.
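
A minimal sketch of that error handler, assuming Python 3, could look like the following. The handler name `entity-sketch`, the `ENTITIES` table and the `encode_with_entities` helper are all made up for illustration; only `codecs.register_error`, `codecs.charmap_build` and `codecs.charmap_encode` come from the standard library.

```python
import codecs
import encodings.cp1252 as cp1252

# Characters to emit as HTML entities (extend with the "plus others").
ENTITIES = {"\u2014": "&mdash;"}


def entity_errors(exc):
    """Replace known characters with entities; re-raise everything else."""
    if isinstance(exc, UnicodeEncodeError):
        ch = exc.object[exc.start]
        if ch in ENTITIES:
            # Handle one character and resume encoding right after it.
            return (ENTITIES[ch], exc.start + 1)
    raise exc


# Hypothetical handler name, used only in this sketch.
codecs.register_error("entity-sketch", entity_errors)

# cp1252 table with the entity characters marked as unmapped (U+FFFE),
# so that encoding them triggers the error handler.
table = cp1252.decoding_table
for ch in ENTITIES:
    table = table.replace(ch, "\ufffe")
encoding_table = codecs.charmap_build(table)


def encode_with_entities(text):
    data, _ = codecs.charmap_encode(text, "entity-sketch", encoding_table)
    return data
```

Calling `encode_with_entities` on the example string from the question should give the cp1252 bytes with `&mdash;` in place of byte \227. Characters that cp1252 cannot represent at all still raise `UnicodeEncodeError`, because the handler re-raises for anything not listed in `ENTITIES`.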

    Notice that this approach only works for encoding; for decoding, your
    scheme can't work, because you would need to specify how &mdash;
    occurring in the input should get decoded: as u"&mdash;" or as
    u"\u2014"? Most likely, decoding that output is of no concern to you,
    in which case the approach with the error handler is the best way (IMO).
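
The ambiguity on the decoding side can be illustrated with a short Python 3 sketch. The sample bytes are an assumption about what the encoded output would look like; `html.unescape` is the standard-library function that expands entities.

```python
import html

# Bytes as the entity-emitting encoder might have written them.
encoded = b"factor one &mdash; initially"

# A plain cp1252 decode keeps the seven characters "&mdash;" as text.
literal = encoded.decode("cp1252")

# Expanding entities afterwards turns them into U+2014, but it would do
# the same to an "&mdash;" that was literal text in the original input.
expanded = html.unescape(literal)
```

Both readings are valid, so the byte stream alone cannot say which one the writer intended.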

    Regards,
    Martin
     
    Martin v. Löwis, Nov 6, 2008
    #2
