Why are some unicode error handlers "encode only"?

Discussion in 'Python' started by Steven D'Aprano, Mar 11, 2012.

  1. At least two standard error handlers are documented as working for
    encoding only:

    xmlcharrefreplace
    backslashreplace

    See http://docs.python.org/library/codecs.html#codec-base-classes

    and http://docs.python.org/py3k/library/codecs.html

    Why is this? I don't see why they shouldn't work for decoding as well.
    Consider this example using Python 3.2:

    >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
    illegal multibyte sequence

    The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
    known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
    or can't be supported?

    # This doesn't actually work.
    b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
    => r'aaa--騷--\xe9\x21--bbb'

    and similarly for xmlcharrefreplace.
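
For reference, the behaviour asked for here can be approximated with a custom handler registered through codecs.register_error. This is only a minimal sketch, not a stdlib feature: the handler name "bytes_backslashreplace" and the function below are made up for illustration. How many bytes the codec reports as undecodable (just \xe9, or \xe9 together with the byte after it) may differ between Python versions, so no exact output is shown.

```python
import codecs

def bytes_backslashreplace(exc):
    """Hypothetical decode-time handler: escape each undecodable byte as \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]            # the offending bytes
    escaped = "".join("\\x%02x" % byte for byte in bad)
    return escaped, exc.end                        # replacement text, resume position

# The name is an assumption chosen for this example.
codecs.register_error("bytes_backslashreplace", bytes_backslashreplace)

data = b"aaa--\xe9z--\xe9!--bbb"
text = data.decode("cp932", "bytes_backslashreplace")
print(ascii(text))  # ascii() keeps the result printable on any terminal
```

The valid pair b'\xe9z' still decodes normally; only the bytes the codec rejects are turned into escapes.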



    --
    Steven
    Steven D'Aprano, Mar 11, 2012

  2. On 11.03.12 15:37, Steven D'Aprano wrote:

    > At least two standard error handlers are documented as working for
    > encoding only:
    >
    > xmlcharrefreplace
    > backslashreplace
    >
    > See http://docs.python.org/library/codecs.html#codec-base-classes
    >
    > and http://docs.python.org/py3k/library/codecs.html
    >
    > Why is this? I don't see why they shouldn't work for decoding as well.


    Because xmlcharrefreplace and backslashreplace are *error* handlers.
    However the bytes sequence b'&#12345;' does *not* contain any bytes that
    are not decodable for e.g. the ASCII codec. So there are no errors to
    handle.

    > Consider this example using Python 3.2:
    >
    > >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
    >
    > Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
    > illegal multibyte sequence
    >
    > The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
    > known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
    > or can't be supported?


    The byte sequence b'\xe9!' however is not something that would have been
    produced by the backslashreplace error handler. b'\\xe9!' (a sequence
    containing 5 bytes) would have been (and this probably would decode
    without any problems with the cp932 codec).
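
The point above can be checked directly. A small sketch, using the ASCII codec as an arbitrary example of an encoding that cannot represent 'é':

```python
# Encoding with backslashreplace turns the unencodable character into text.
encoded = "\xe9!".encode("ascii", "backslashreplace")

print(encoded)       # b'\\xe9!'
print(len(encoded))  # 5  (backslash, 'x', 'e', '9', '!')

# Those five bytes are all ASCII, so cp932 decodes them without complaint.
decoded = encoded.decode("cp932")
print(decoded)       # \xe9!
```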

    > # This doesn't actually work.
    > b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
    > => r'aaa--騷--\xe9\x21--bbb'
    >
    > and similarly for xmlcharrefreplace.


    This would require a postprocess step *after* the bytes have been
    decoded. This is IMHO out of scope for Python's codec machinery.
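
To illustrate the point about post-processing: a character reference such as &#12345; is plain ASCII, so decoding it succeeds and no error handler is ever invoked. Turning the reference into the character it names is a separate step on the decoded text. A minimal sketch using html.unescape from the standard library (the sample value is only an example):

```python
import html

raw = b"&#12345;"
decoded = raw.decode("ascii")      # succeeds: every byte is ASCII
resolved = html.unescape(decoded)  # separate step on str, not part of decoding

print(decoded)                     # &#12345;
print(len(resolved), hex(ord(resolved)))  # 1 0x3039
```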

    Servus,
    Walter
    Walter Dörwald, Mar 11, 2012

  3. On 3/11/2012 10:37 AM, Steven D'Aprano wrote:

    > At least two standard error handlers are documented as working for
    > encoding only:
    >
    > xmlcharrefreplace
    > backslashreplace
    >
    > See http://docs.python.org/library/codecs.html#codec-base-classes
    >
    > and http://docs.python.org/py3k/library/codecs.html
    >
    > Why is this?


    I presume the purpose of both is to let Unicode text travel through a
    byte encoding that cannot represent every character: characters that do
    not fit in the given encoding are replaced with an ASCII byte sequence
    that does fit.
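
A short sketch of that encode-side use; the sample string and the choice of the ASCII codec are just examples:

```python
text = "caf\u00e9 \u20ac5"   # 'café €5': two characters ASCII cannot represent

as_xml = text.encode("ascii", "xmlcharrefreplace")
as_escapes = text.encode("ascii", "backslashreplace")

print(as_xml)      # b'caf&#233; &#8364;5'
print(as_escapes)  # b'caf\\xe9 \\u20ac5'

# Without a handler, the same call raises UnicodeEncodeError.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)
else:
    strict_failed = False
```

In both cases the result is pure ASCII, so it survives any ASCII-compatible transport.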

    > I don't see why they shouldn't work for decoding as well.
    > Consider this example using Python 3.2:
    >
    > >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
    >
    > Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
    > illegal multibyte sequence
    >
    > The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
    > known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
    > or can't be supported?
    >
    > # This doesn't actually work.
    > b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
    > => r'aaa--騷--\xe9\x21--bbb'


    This output does not round-trip and would be a bit of a fib since it
    somewhat misrepresents what the encoded bytes were:

    >>> r'aaa--騷--\xe9\x21--bbb'.encode("cp932")
    b'aaa--\xe9z--\\xe9\\x21--bbb'
    >>> b'aaa--\xe9z--\\xe9\\x21--bbb'.decode("cp932")
    'aaa--騷--\\xe9\\x21--bbb'

    Python 3.1 added the surrogateescape error handler (PEP 383) to solve this problem.
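
A minimal sketch of the round trip that surrogateescape provides. It uses UTF-8 rather than cp932, purely as an assumption to keep the example simple:

```python
raw = b"caf\xe9"   # Latin-1 bytes; \xe9 on its own is not valid UTF-8

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))          # 'caf\udce9'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)      # True
```

Unlike a backslash escape, the surrogate cannot be confused with text that was legitimately present in the input, which is why the round trip is lossless.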

    > and similarly for xmlcharrefreplace.


    Since xml character references are representations of unicode chars, and
    not bytes, I do not see how that would work. By analogy, perhaps you
    mean to have '' in your output instead of '\xe9\x21', but
    those would not properly be xml numeric character references.

    --
    Terry Jan Reedy
    Terry Reedy, Mar 11, 2012
