harmful str(bytes)

Discussion in 'Python' started by Hallvard B Furuseth, Oct 7, 2010.

  1. I've been playing a bit with Python3.2a2, and frankly its charset
    handling looks _less_ safe than in Python 2.

    The offender is bytes.__str__: str(b'foo') == "b'foo'".
    It's often not clear from looking at a piece of code whether
    some data is treated as strings or bytes, particularly when
    translating from old code. Which means one cannot see from
    context if str(s) or "%s" % s will produce garbage.

    With 2.<late>, the equivalent Unicode <-> string conversion did not
    silently produce garbage: it raised UnicodeError instead. With old
    raw Python strings that was not a problem in applications which did not
    need to convert any charsets; with Python 3 they can break.

    I really wish bytes.__str__ would at least fail by default.

    --
    Hallvard
    Hallvard B Furuseth, Oct 7, 2010
    #1

  2. Hallvard B Furuseth <> writes:

    > I've been playing a bit with Python3.2a2, and frankly its charset
    > handling looks _less_ safe than in Python 2.
    >
    > The offender is bytes.__str__: str(b'foo') == "b'foo'".
    > It's often not clear from looking at a piece of code whether
    > some data is treated as strings or bytes, particularly when
    > translating from old code. Which means one cannot see from
    > context if str(s) or "%s" % s will produce garbage.
    >
    > With 2.<late> conversion Unicode <-> string the equivalent operation did
    > not silently produce garbage: it raised UnicodeError instead. With old
    > raw Python strings that was not a problem in applications which did not
    > need to convert any charsets, with python3 they can break.
    >
    > I really wish bytes.__str__ would at least by default fail.


    I think you misunderstand the purpose of str(). It is to provide a
    (unicode) string representation of an object and has nothing to do with
    converting it to unicode:

    >>> b = b"\xc2\xa3"
    >>> str(b)
    "b'\\xc2\\xa3'"


    If you want to *decode* a bytes string, use its decode method and you
    get a unicode string (if your bytes string is a valid encoding):

    >>> b = b"\xc2\xa3"
    >>> b.decode('utf8')
    '£'
    >>> b.decode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)


    If you want to *encode* a (unicode) string, use its encode method and you
    get a bytes string (provided your string can be encoded using the given
    encoding):

    >>> s = "€"
    >>> s.encode('utf8')
    b'\xe2\x82\xac'
    >>> s.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)

    --
    Arnaud
    Arnaud Delobelle, Oct 7, 2010
    #2
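
The three operations distinguished in the post above can be put side by side. This is only a minimal sketch for Python 3; the variable names are illustrative and not taken from the thread.

```python
# Minimal sketch (Python 3): repr-style str() versus real decoding/encoding.
data = b"\xc2\xa3"                     # UTF-8 encoding of the pound sign

as_repr = str(data)                    # no decoding: text of the bytes literal
as_text = data.decode("utf-8")         # bytes -> str, codec given explicitly
round_trip = as_text.encode("utf-8")   # str -> bytes, codec given explicitly

print(as_repr)                         # the characters b'\xc2\xa3'
print(ascii(as_text))                  # ASCII-safe view of the decoded string
print(round_trip == data)

# A codec that cannot represent the data fails loudly instead of guessing.
try:
    data.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

Only the first call, str(data), succeeds without being told a codec, and that is the call the thread is arguing about.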

  3. On Thu, 07 Oct 2010 23:33:35 +0200
    Hallvard B Furuseth <> wrote:
    >
    > The offender is bytes.__str__: str(b'foo') == "b'foo'".
    > It's often not clear from looking at a piece of code whether
    > some data is treated as strings or bytes, particularly when
    > translating from old code. Which means one cannot see from
    > context if str(s) or "%s" % s will produce garbage.


    This probably comes from overuse of str(s) and "%s". They can be useful
    to produce human-readable messages, but you shouldn't have to use them
    very often.

    > I really wish bytes.__str__ would at least by default fail.


    Actually, the implicit contract of __str__ is that it never fails, so
    that everything can be printed out (for debugging purposes, etc.).

    Regards

    Antoine.
    Antoine Pitrou, Oct 8, 2010
    #3
  4. Arnaud Delobelle writes:
    >Hallvard B Furuseth <> writes:
    >> I've been playing a bit with Python3.2a2, and frankly its charset
    >> handling looks _less_ safe than in Python 2.
    >> (...)
    >> With 2.<late> conversion Unicode <-> string the equivalent operation did
    >> not silently produce garbage: it raised UnicodeError instead. With old
    >> raw Python strings that was not a problem in applications which did not
    >> need to convert any charsets, with python3 they can break.
    >>
    >> I really wish bytes.__str__ would at least by default fail.

    >
    > I think you misunderstand the purpose of str(). It is to provide a
    > (unicode) string representation of an object and has nothing to do with
    > converting it to unicode:


    That's not the point - the point is that for 2.* code which _uses_ str
    vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the
    same way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str. So
    upgraded old code will have to expect both str and bytes.

    In 2.*, str<->unicode conversion failed or produced the equivalent
    character/byte data. Yes, there could be charset problems if the
    defaults were set up wrong, but that's a smaller problem than in 3.*.
    In 3.*, the bytes->str conversion always _silently_ produces garbage.

    And lots of code uses both, and needs to convert back and forth. In
    particular 3.* code converted from 2.*, or using modules converted
    from 2.*. There's a lot of such code, and will be for a long time.

    --
    Hallvard
    Hallvard B Furuseth, Oct 8, 2010
    #4
  5. Antoine Pitrou writes:
    >Hallvard B Furuseth <> wrote:
    >> The offender is bytes.__str__: str(b'foo') == "b'foo'".
    >> It's often not clear from looking at a piece of code whether
    >> some data is treated as strings or bytes, particularly when
    >> translating from old code. Which means one cannot see from
    >> context if str(s) or "%s" % s will produce garbage.

    >
    > This probably comes from overuse of str(s) and "%s". They can be useful
    > to produce human-readable messages, but you shouldn't have to use them
    > very often.


    Maybe Python 3 has something better, but they could be hard to avoid in
    Python 2. And certainly our site has plenty of code using them, whether
    we should have avoided them or not.

    >> I really wish bytes.__str__ would at least by default fail.

    >
    > Actually, the implicit contract of __str__ is that it never fails, so
    > that everything can be printed out (for debugging purposes, etc.).


    Nope:

    $ python2 -c 'str(u"\u1000")'
    Traceback (most recent call last):
      File "<string>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)

    And the equivalent:

    $ python2 -c 'unicode("\xA0")'
    Traceback (most recent call last):
      File "<string>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

    In Python 2, these two UnicodeErrors made our data safe from code
    which used str and unicode objects without checking too carefully which
    was which. Code which did not sort the types out carefully enough would
    fail.

    In Python 3, that safety only exists for bytes(str), not str(bytes).

    --
    Hallvard
    Hallvard B Furuseth, Oct 8, 2010
    #5
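
The asymmetry claimed in the last sentence of the post above can be checked directly. A small sketch, assuming a Python 3 interpreter started without the -b option; the variable names are illustrative.

```python
# Sketch (Python 3): the two conversions behave differently without a codec.
try:
    bytes("abc")                       # str -> bytes: an encoding is required
    bytes_from_str_failed = False
except TypeError:
    bytes_from_str_failed = True

text_from_bytes = str(b"abc")          # bytes -> str: returns the repr, no error
decoded = str(b"abc", "ascii")         # with a codec, str() really decodes

print(bytes_from_str_failed)
print(text_from_bytes)
print(decoded)
```

So bytes(str) refuses to guess, while str(bytes) falls back to the repr unless an encoding is passed.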
  6. On Fri, 08 Oct 2010 15:31:27 +0200, Hallvard B Furuseth wrote:

    > Arnaud Delobelle writes:
    >>Hallvard B Furuseth <> writes:
    >>> I've been playing a bit with Python3.2a2, and frankly its charset
    >>> handling looks _less_ safe than in Python 2. (...)
    >>> With 2.<late> conversion Unicode <-> string the equivalent operation
    >>> did not silently produce garbage: it raised UnicodeError instead.
    >>> With old raw Python strings that was not a problem in applications
    >>> which did not need to convert any charsets, with python3 they can
    >>> break.
    >>>
    >>> I really wish bytes.__str__ would at least by default fail.

    >>
    >> I think you misunderstand the purpose of str(). It is to provide a
    >> (unicode) string representation of an object and has nothing to do with
    >> converting it to unicode:

    >
    > That's not the point - the point is that for 2.* code which _uses_ str
    > vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the same
    > way - a 2.* 'str' will sometimes be 3.* bytes, sometime str. So
    > upgraded old code will have to expect both str and bytes.


    I'm sorry, this makes no sense to me. I've read it repeatedly, and I
    still don't understand what you're trying to say.


    > In 2.*, str<->unicode conversion failed or produced the equivalent
    > character/byte data. Yes, there could be charset problems if the
    > defaults were set up wrong, but that's a smaller problem than in 3.*. In
    > 3.*, the bytes->str conversion always _silently_ produces garbage.


    So you say, but I don't see it. Why is this garbage?

    >>> b = b'abc\xff'
    >>> str(b)
    "b'abc\\xff'"

    That's what I would expect from the str() function called with a bytes
    argument. Since decoding bytes requires a codec, which you haven't given,
    it can only return a string representation of the bytes.

    If you want to decode bytes into a string, you need to specify a codec:

    >>> str(b, 'latin-1')
    'abcÿ'
    >>> b.decode('latin-1')
    'abcÿ'




    --
    Steven
    Steven D'Aprano, Oct 8, 2010
    #6
  7. Steven D'Aprano writes:
    >On Fri, 08 Oct 2010 15:31:27 +0200, Hallvard B Furuseth wrote:
    >> That's not the point - the point is that for 2.* code which _uses_ str
    >> vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the same
    >> way - a 2.* 'str' will sometimes be 3.* bytes, sometime str. So
    >> upgraded old code will have to expect both str and bytes.

    >
    > I'm sorry, this makes no sense to me. I've read it repeatedly, and I
    > still don't understand what you're trying to say.


    OK, here is a simplified example after 2to3:

    try:
        from urlparse import urlparse, urlunparse      # Python 2.6
    except ImportError:
        from urllib.parse import urlparse, urlunparse  # Python 3.2a

    foo, bar = b"/foo", b"bar"  # Data from network, bar normally empty

    # Statement inserted for 2.3 when urlparse below said TypeError
    if isinstance(foo, bytes):
        foo = foo.decode("ASCII")

    p = list(urlparse(foo))
    if bar:
        p[3] = bar  # still bytes: this is the value that ends up as "b'bar'"
    print(urlunparse(p))

    2.6 prints "/foo;bar", 3.2a prints "/foo;b'bar'"

    You have a module which receives some strings/bytes, maybe data which
    originates on the net or in a database. The module _and its callers_
    may date back to before the 'bytes' type, maybe before 'unicode'.
    The module is supposed to work with this data and produce some 'str's
    or bytes to output. _Not_ a Python representation like "b'bar'".

    The module doesn't always know which input is 'bytes' and which is
    'str'. Or the callers don't know what it expects, or haven't kept
    track. Maybe the input originated as bytes and was converted to
    str at some point, maybe not.

    Look at urllib/parse.py and its isinstance(<data>, <str or bytes>)
    calls. urlencode() looks particularly gross, though that one has code
    which could be factored out. They didn't catch everything either: I
    posted this when a 2to3'ed module of mine produced URLs with "b'bar'".

    In the pre-'unicode type' Python (was that early Python 2, or should
    I have said Python 1?) that was a non-issue - it Just Worked, sans
    possible charset issues.

    In Python 2 with unicode, the module would get it right or raise an
    exception. Which helps the programmer fix any charset issues.

    In Python 3, the module does not raise an exception, it produces
    "b'bar'" when it was supposed to produce "bar".

    >> In 2.*, str<->unicode conversion failed or produced the equivalent
    >> character/byte data. Yes, there could be charset problems if the
    >> defaults were set up wrong, but that's a smaller problem than in 3.*. In
    >> 3.*, the bytes->str conversion always _silently_ produces garbage.

    >
    > So you say, but I don't see it. Why is this garbage?


    To the user of the module, stuff with Python syntax is garbage. It
    was supposed to be text/string data.

    >>>> b = b'abc\xff'
    >>>> str(b)

    > "b'abc\\xff'"
    >
    > That's what I would expect from the str() function called with a bytes
    > argument. Since decoding bytes requires a codec, which you haven't given,
    > it can only return a string representation of the bytes.
    >
    > If you want to decode bytes into a string, you need to specify a codec:


    Except I didn't intend to decode anything - I just intended to output
    the contents of the string - which was stored in a 'bytes' object.
    But __str__ got called because a lot of code does that. It wasn't
    even my code which did it.

    There's often no obvious place to decide when to consider a stream of
    data as raw bytes and when to consider it text, and no obvious time
    to convert between bytes and str. When writing a program, one simply
    has to decide. Such as network data (bytes) vs urllib URLs (str)
    in my program. And the decision is different from what one would
    decide for when to use str and when to use unicode in Python 2.

    In this case I'll bugreport urlunparse to python.org, but there'll be
    a _lot_ of such code around. And without an Exception getting raised,
    it'll take time to find it. So it looks like it'll be a long time
    before I dare entrust my data to Python 3, except maybe with modules
    written from scratch.

    --
    Hallvard
    Hallvard B Furuseth, Oct 8, 2010
    #7
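
One way to avoid handing mixed str/bytes components to urlunparse is to normalise every component to str first. The helper below, to_text, is hypothetical (it is not part of urllib or of the thread), and the ASCII default is an assumption for this example.

```python
from urllib.parse import urlunparse


def to_text(value, encoding="ascii"):
    """Hypothetical helper: return str, decoding bytes and rejecting other types."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Undecodable data raises UnicodeDecodeError instead of leaking a repr.
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Mixed types, as in the simplified example above: params arrives as bytes.
parts = ["", "", "/foo", b"bar", "", ""]
url = urlunparse([to_text(p) for p in parts])
print(url)
```

Anything that is neither str nor bytes (None, an int, a tuple) raises TypeError here rather than being turned into text through its __str__.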
  8. On Fri, 08 Oct 2010 15:45:58 +0200
    Hallvard B Furuseth <> wrote:
    > Antoine Pitrou writes:
    > >Hallvard B Furuseth <> wrote:
    > >> The offender is bytes.__str__: str(b'foo') == "b'foo'".
    > >> It's often not clear from looking at a piece of code whether
    > >> some data is treated as strings or bytes, particularly when
    > >> translating from old code. Which means one cannot see from
    > >> context if str(s) or "%s" % s will produce garbage.

    > >
    > > This probably comes from overuse of str(s) and "%s". They can be useful
    > > to produce human-readable messages, but you shouldn't have to use them
    > > very often.

    >
    > Maybe Python 3 has something better, but they could be hard to avoid in
    > Python 2. And certainly our site has plenty of code using them, whether
    > we should have avoided them or not.


    It's difficult to answer more precisely without knowing what you're
    doing precisely. But if you already have str objects, you don't have to
    call str() or format them using "%s", so implicit __str__ calls are
    avoided.

    > > Actually, the implicit contract of __str__ is that it never fails, so
    > > that everything can be printed out (for debugging purposes, etc.).

    >
    > Nope:
    >
    > $ python2 -c 'str(u"\u1000")'
    > Traceback (most recent call last):

    [...]
    > $ python2 -c 'unicode("\xA0")'
    > Traceback (most recent call last):


    Sure, but so what? This mainly shows that unicode support was broken in
    Python 2, because:
    1) it tried to do implicit bytes<->unicode coercion by using some
    process-wide default encoding
    2) some unicode objects didn't have a successful str()

    Python 3 fixes both these issues. Fixing 1) means there's no automatic
    coercion when trying to mix bytes and unicode. Try for example:

    [Python 2] >>> u"a" + "b"
    u'ab'
    [Python 3] >>> "a" + b"b"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'bytes' object to str implicitly


    And fixing 2) means bytes objects get a meaningful str() in all
    circumstances, which is much better for debug output.

    If you don't think that 2) is important, then perhaps you don't deal
    with non-ASCII data a lot. Failure to print out exception messages (or
    log entries, etc.) containing non-ASCII characters is a big annoyance
    with Python 2 for many people (including me).


    > In Python 2, these two UnicodeEncodeErrors made our data safe from code
    > which used str and unicode objects without checking too carefully which
    > was which.


    That's false, since implicit coercion can actually happen everywhere.
    And it only fails when there's non-ASCII data involved, meaning the
    unsuspecting Anglo-Saxon developer doesn't understand why his/her users
    complain.

    Regards

    Antoine.
    Antoine Pitrou, Oct 8, 2010
    #8
  9. On 10/8/2010 9:45 AM, Hallvard B Furuseth wrote:

    >> Actually, the implicit contract of __str__ is that it never fails, so
    >> that everything can be printed out (for debugging purposes, etc.).

    >
    > Nope:
    >
    > $ python2 -c 'str(u"\u1000")'
    > Traceback (most recent call last):
    > File "<string>", line 1, in ?
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)


    This could be considered a design bug due to 'str' being used both to
    produce readable string representations of objects (perhaps one that
    could be eval'ed) and to convert unicode objects to equivalent string
    objects, which is not the same operation!

    The above really should have produced '\u1000'! (the equivalent of what
    str(bytes) does today). The 'conversion to equivalent str object' option
    should have required an explicit encoding arg rather than defaulting to
    the ascii codec. This mistake has been corrected in 3.x, so Yep.

    > And the equivalent:
    >
    > $ python2 -c 'unicode("\xA0")'
    > Traceback (most recent call last):
    > File "<string>", line 1, in ?
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)


    This is an application bug: either bad string or missing decoding arg.

    > In Python 2, these two UnicodeEncodeErrors made our data safe from code
    > which used str and unicode objects without checking too carefully which
    > was which. Code which sort the types out carefully enough would fail.
    >
    > In Python 3, that safety only exists for bytes(str), not str(bytes).


    If you prefer the buggy 2.x design (and there are *many* tracker bug
    reports that were fixed by the 3.x change), stick with it.

    --
    Terry Jan Reedy
    Terry Reedy, Oct 8, 2010
    #9
  10. On 10/8/2010 9:31 AM, Hallvard B Furuseth wrote:

    > That's not the point - the point is that for 2.* code which _uses_ str
    > vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the
    > same way - a 2.* 'str' will sometimes be 3.* bytes, sometime str. So
    > upgraded old code will have to expect both str and bytes.


    If you want to interconvert code between 2.6/7 and 3.x, use unicode and
    bytes in the 2.x code. Bytes was added to 2.6/7 as a synonym for str
    explicitly and only for conversion purposes.

    --
    Terry Jan Reedy
    Terry Reedy, Oct 8, 2010
    #10
  11. Terry Reedy writes:
    >On 10/8/2010 9:31 AM, Hallvard B Furuseth wrote:
    >> That's not the point - the point is that for 2.* code which _uses_ str
    >> vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the
    >> same way - a 2.* 'str' will sometimes be 3.* bytes, sometime str. So
    >> upgraded old code will have to expect both str and bytes.

    >
    > If you want to interconvert code between 2.6/7 and 3.x, use unicode and
    > bytes in the 2.x code. Bytes was added to 2.6/7 as a synonym for str
    > explicitly and only for conversion purposes.


    That's what I did, see article <>.
    And that's exactly what broke as described, because bytes.__str__
    has different meanings in 2.x and 3.x: the raw contents vs the repr.
    So a library function which did %s output a different result.

    --
    Hallvard
    Hallvard B Furuseth, Oct 11, 2010
    #11
  12. Antoine Pitrou writes:
    >Hallvard B Furuseth <> wrote:
    >>Antoine Pitrou writes:
    >>>Hallvard B Furuseth <> wrote:
    >>>> The offender is bytes.__str__: str(b'foo') == "b'foo'".
    >>>> It's often not clear from looking at a piece of code whether
    >>>> some data is treated as strings or bytes, particularly when
    >>>> translating from old code. Which means one cannot see from
    >>>> context if str(s) or "%s" % s will produce garbage.
    >>>
    >>> This probably comes from overuse of str(s) and "%s". They can be useful
    >>> to produce human-readable messages, but you shouldn't have to use them
    >>> very often.

    >>
    >> Maybe Python 3 has something better, but they could be hard to avoid in
    >> Python 2. And certainly our site has plenty of code using them, whether
    >> we should have avoided them or not.

    >
    > It's difficult to answer more precisely without knowing what you're
    > doing precisely.


    I'd just posted an example in article <>:

    urllib.parse.urlunparse(('', '', '/foo', b'bar', '', '')) returns
    "/foo;b'bar'" instead of raising an exception or returning 2.6's correct
    "/foo;bar".

    > But if you already have str objects, you don't have to
    > call str() or format them using "%s", so implicit __str__ calls are
    > avoided.


    Except it's quite normal to output strings with %s. Above, a library
    did it for me. Maybe also to depend on the fact that str.__str__() is a
    noop, so one can call str() just in case some variable needs to be
    unpacked to a plain string. urllib.parse is an example of that too.

    >>> Actually, the implicit contract of __str__ is that it never fails, so
    >>> that everything can be printed out (for debugging purposes, etc.).

    >>
    >> Nope:
    >>
    >> $ python2 -c 'str(u"\u1000")'
    >> Traceback (most recent call last):

    > [...]
    >> $ python2 -c 'unicode("\xA0")'
    >> Traceback (most recent call last):

    >
    > Sure, but so what?


    So your statement above was wrong, which you made in response to my
    suggested solution.

    > This mainly shows that unicode support was broken in
    > Python 2, because:


    ...because Python 2 was designed so there was no way to avoid poor
    unicode support one way or another. Python 3 has not fixed this, it has
    just moved the problem elsewhere.

    > 1) it tried to do implicit bytes<->unicode coercion by using some
    > process-wide default encoding


    I had completely forgotten that. I've been lucky (with my sysadmins
    maybe:) and lived with ASCII default encoding. Checking around I see
    now Python2 site.py used my locale for the encoding, as if that had any
    relevance for my data...

    > 2) some unicode objects didn't have a successful str()
    >
    > Python 3 fixes both these issues. Fixing 1) means there's no automatic
    > coercion when trying to mix bytes and unicode.


    Fine, so programs will have to do it themselves...

    > (...)
    > And fixing 2) means bytes object get a meaningful str() in all
    > circumstances, which is much better for debug output.


    Except str() on such data has a different meaning than it did before, so
    equivalent programs *silently* produce different results. Which is why
    I started this thread.

    > If you don't think that 2) is important, then perhaps you don't deal
    > with non-ASCII data a lot. Failure to print out exception messages (or
    > log entries, etc.) containing non-ASCII characters is a big annoyance
    > with Python 2 for many people (including me).


    I'm Norwegian. I do deal with non-ASCII and I agree failures in error
    messages are annoying.

    OTOH if the same bug that previously caused an error in an error,
    instead quietly munges my data, that's worse than annoying. I've dealt
    with that too, and the fix is to use another tool. (Ironically, in one
    case it meant moving from Perl to Python, and now Python has followed
    Perl...)

    >> In Python 2, these two UnicodeEncodeErrors made our data safe from code
    >> which used str and unicode objects without checking too carefully which
    >> was which.

    >
    > That's false, since implicit coercion can actually happen everywhere.


    Right, it was true as long as my encoding was ASCII.

    > And it only fails when there's non-ASCII data involved, meaning the
    > unsuspecting Anglo-saxon developer doesn't understand why his/her users
    > complain.


    --
    Hallvard
    Hallvard B Furuseth, Oct 11, 2010
    #12
  13. Hallvard B Furuseth, 11.10.2010 21:50:
    > Antoine Pitrou writes:
    >> 2) some unicode objects didn't have a successful str()
    >>
    >> Python 3 fixes both these issues. Fixing 1) means there's no automatic
    >> coercion when trying to mix bytes and unicode.

    >
    > Fine, so programs will have to do it themselves...


    Yes, they can finally handle bytes and Unicode data correctly and safely.
    Having byte data turn into Unicode strings unexpectedly makes the behaviour
    of your code hardly predictable and fairly error prone. In Python 3, it's
    now possible to do the conversion safely at well defined points in your
    code and rely on the runtime to bark at you when something slips through or
    is mistreated. Detecting errors early makes your code better.

    That's a huge improvement. It didn't come for free and the current Python 3
    releases still have their rough edges. But there are few left and the
    situation is constantly improving. You can help out if you want.

    Stefan
    Stefan Behnel, Oct 11, 2010
    #13
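
The "well defined points" mentioned in the post above are usually the program's input and output boundaries. A minimal sketch of that pattern follows; read_lines and write_lines are hypothetical names, and UTF-8 is an assumption made for the example.

```python
import io


def read_lines(raw_stream, encoding="utf-8"):
    """Decode once, at the input boundary; everything downstream sees str."""
    return [line.decode(encoding).rstrip("\n") for line in raw_stream]


def write_lines(raw_stream, lines, encoding="utf-8"):
    """Encode once, at the output boundary."""
    for line in lines:
        raw_stream.write((line + "\n").encode(encoding))


source = io.BytesIO("caf\u00e9\nna\u00efve\n".encode("utf-8"))  # stand-in for a socket or file
sink = io.BytesIO()

texts = read_lines(source)                       # bytes -> str
write_lines(sink, [t.upper() for t in texts])    # str -> bytes

# Mixing the two types in between is rejected by the runtime.
try:
    texts[0] + b"!"
    mixing_failed = False
except TypeError:
    mixing_failed = True
print(mixing_failed)
```

Note that this runtime check covers operators such as +, but not str() or "%s", which is the gap the original poster is complaining about.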
  14. Terry Reedy writes:
    >On 10/8/2010 9:45 AM, Hallvard B Furuseth wrote:
    >>> Actually, the implicit contract of __str__ is that it never fails, so
    >>> that everything can be printed out (for debugging purposes, etc.).

    >>
    >> Nope:
    >>
    >> $ python2 -c 'str(u"\u1000")'
    >> Traceback (most recent call last):
    >> File "<string>", line 1, in ?
    >> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)

    >
    > This could be considered a design bug due to 'str' being used both to
    > produce readable string representations of objects (perhaps one that
    > could be eval'ed) and to convert unicode objects to equivalent string
    > objects. which is not the same operation!


    Indeed, the eager str() and the lack of a narrower str function is
    one root of the problem. I'd put it more generally: Converting an
    object which represents a string, to an actual str. *And* __str__ may
    be intended for Python-independent representations like 23 -> "23".

    I expect that's why quite a bit of code calls str() just in case, which
    is another root of the problem. E.g. urlencode(), as I said. The code
    might not need to, but str('string') is a noop so it doesn't hurt.
    Maybe that's why %s does too, instead of demanding that the user calls
    str() if needed.

    > The above really should have produced '\u1000'! (the equivalent of what
    > str(bytes) does today). The 'conversion to equivalent str object' option
    > should have required an explicit encoding arg rather than defaulting to
    > the ascii codec. This mistake has been corrected in 3.x, so Yep.


    If there were a __plain_str__() method which was supposed to fail rather
    than start to babble Python syntax, and if there were not plenty of
    Python code around which invoked __str__, I'd agree.

    As it is, this "correction" instead is causing code which previously
    produced the expected non-Python-related string output, to instead
    produce Pythonesque repr() stuff. See below.

    >> And the equivalent:
    >>
    >> $ python2 -c 'unicode("\xA0")'
    >> Traceback (most recent call last):
    >> File "<string>", line 1, in ?
    >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

    >
    > This is an application bug: either bad string or missing decoding arg.


    Exactly. And Python 2 caught the bug. (Since I had Ascii default
    decoding, I'd forgotten Python could pick another default.)

    For an app which handles Unicode vs. raw bytes, the equivalent Python 3
    code is str(b"\xA0"). That's the *same* application bug, in equivalent
    application code, and Python 3 does not catch it. This time the bug is
    spelled str() instead, which is much more likely than old unicode() to
    happen somewhere thanks to the str()-related misdesign discussed above.

    Article <> in this thread has an example.


    And that's the third root of the problem above. Technically it's the
    same problem that an application bug can do str(None) where it should be
    using a string, and produce garbage text. The difference is that Python
    forces programs to deal with these two different character/octet string
    types, sometimes swapping back and forth between them. And it's not
    necessarily obvious from the code which type is in use where. Python 3
    has not changed that, it has strengthened it by removing the default
    conversion.

    Yet while the programmer now needs to be _more_ careful about this
    than before, Python 3 has removed the exception which caught this particular
    bug instead of doing something to make it easier to find such bugs.

    That's why I suggested making bytes.__str__ fail by default, annoying
    as it would be. But I don't know how annoying it'd be. Maybe there
    could be an option to disable it.

    >> In Python 2, these two UnicodeEncodeErrors made our data safe from code
    >> which used str and unicode objects without checking too carefully which
    >> was which. Code which sort the types out carefully enough would fail.
    >>
    >> In Python 3, that safety only exists for bytes(str), not str(bytes).

    >
    > If you prefer the buggy 2.x design (and there are *many* tracker bug
    > reports that were fixed by the 3.x change), stick with it.


    Bugs even with ASCII default encoding? Looking closer at setencoding()
    in site.py, it doesn't seem to do anything, it's "if 0"ed out.

    As I think I've made clear, I certainly don't feel like entrusting
    Python 3 with my raw string data just yet.

    --
    Hallvard
    Hallvard B Furuseth, Oct 11, 2010
    #14
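
On the wish for an option that makes str(bytes) fail: CPython 3 has the -b command-line option, which warns about str() on a bytes instance and about bytes/str comparisons, and -bb, which turns those warnings into errors. The sketch below runs the same one-liner in both modes through subprocess; it assumes Python 3.7 or later for the capture_output argument.

```python
import subprocess
import sys

code = "str(b'foo')"

# Default mode: str(bytes) quietly returns the repr and the process exits cleanly.
default = subprocess.run(
    [sys.executable, "-c", code], capture_output=True, text=True
)

# -bb: the same call is reported as a BytesWarning raised as an error.
strict = subprocess.run(
    [sys.executable, "-bb", "-c", code], capture_output=True, text=True
)

print(default.returncode)
print(strict.returncode)
print("BytesWarning" in strict.stderr)
```

Running a test suite under -bb is one way to find the silent "b'bar'" conversions described above, including those that happen inside library code.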
  15. Stefan Behnel writes:
    >Hallvard B Furuseth, 11.10.2010 21:50:
    >> Fine, so programs will have to do it themselves...

    >
    > Yes, they can finally handle bytes and Unicode data correctly and
    > safely. Having byte data turn into Unicode strings unexpectedly makes
    > the behaviour of your code hardly predictable and fairly error prone. In
    > Python 3, it's now possible to do the conversion safely at well defined
    > points in your code and rely on the runtime to bark at you when
    > something slips through or is mistreated. Detecting errors early makes
    > your code better.
    >
    > That's a huge improvement. It didn't come for free and the current
    > Python 3 releases still have their rough edges. But there are few left
    > and the situation is constantly improving. You can help out if you want.


    I quite agree with most of that - just not about it being safe, see my
    reply to Terry Reedy. Hence my suggestion to change or disable
    bytes.__str__. And yes, I'll be submitting some fixes or bug reports.

    --
    Hallvard
    Hallvard B Furuseth, Oct 11, 2010
    #15
  16. On Mon, 11 Oct 2010 21:50:32 +0200
    Hallvard B Furuseth <> wrote:
    >
    > I'd just posted an example in article <>:
    >
    > urllib.parse.urlunparse(('', '', '/foo', b'bar', '', '')) returns
    > "/foo;b'bar'" instead of raising an exception or returning 2.6's correct
    > "/foo;bar".


    Oh, this looks like a bug in urlparse. Could you report it at
    http://bugs.python.org ? Thanks.

    > > But if you already have str objects, you don't have to
    > > call str() or format them using "%s", so implicit __str__ calls are
    > > avoided.

    >
    > Except it's quite normal to output strings with %s.


    "%s" will take the string representation of anything you give it:
    bytes, but also files, sockets, dicts, tuples, etc. So, if you're
    using "%s" somewhere, it's your job to ensure that you give it the
    desired type.
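
    A minimal sketch of the behaviour described above, assuming a plain
    Python 3 interpreter started without the -b/-bb options. The variable
    names are only illustrative, and ASCII is an assumed encoding for the
    example data:

```python
# "%s" falls back to the object's string representation, so a bytes
# value is rendered with its b'...' literal syntax instead of its content.
path = "/foo"
params = b"bar"

formatted = "%s;%s" % (path, params)
print(formatted)  # /foo;b'bar'

# Decoding first (ASCII is an assumption about this data) gives the
# text that was presumably intended.
fixed = "%s;%s" % (path, params.decode("ascii"))
print(fixed)  # /foo;bar
```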

    > Maybe also to depend on the fact that str.__str__() is a
    > noop, so one can call str() just in case some variable needs to be
    > unpacked to a plain string.


    Well, if you don't know what types you are currently handling and
    convert them to strings "just in case", chances are you're doing
    something wrong.

    > > 2) some unicode objects didn't have a successful str()
    > >
    > > Python 3 fixes both these issues. Fixing 1) means there's no automatic
    > > coercion when trying to mix bytes and unicode.

    >
    > Fine, so programs will have to do it themselves...


    That's exactly the point, yes :) It's not Python's job to guess how some
    bytes you got e.g. on a socket should be decoded.
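
    As a rough illustration of doing the conversion explicitly, here is a
    sketch with a hypothetical helper named decode_payload. The UTF-8
    default is an assumption; the right codec depends on the protocol or
    data source:

```python
def decode_payload(raw, encoding="utf-8"):
    """Decode bytes from an external source into text.

    bytes.decode() uses errors="strict" by default, so data that does
    not match the codec raises UnicodeDecodeError.
    """
    return raw.decode(encoding)


raw = "caf\u00e9".encode("utf-8")  # stand-in for bytes read from a socket
text = decode_payload(raw)

# Python 3 refuses to mix undecoded bytes with text implicitly.
mixed_error = None
try:
    "prefix: " + raw
except TypeError as exc:
    mixed_error = exc

# Decoding with the wrong codec fails instead of guessing.
decode_error = None
try:
    decode_payload(raw, "ascii")
except UnicodeDecodeError as exc:
    decode_error = exc
```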

    > > (...)
    > > And fixing 2) means bytes object get a meaningful str() in all
    > > circumstances, which is much better for debug output.

    >
    > Except str() on such data has a different meaning than it did before,


    Yes, it's Python 3 and it's incompatible with Python 2... !

    Regards

    Antoine.
    Antoine Pitrou, Oct 11, 2010
    #16
  17. Hallvard B Furuseth, 11.10.2010 23:45:
    > If there were a __plain_str__() method which was supposed to fail rather
    > than start to babble Python syntax, and if there were not plenty of
    > Python code around which invoked __str__, I'd agree.


    Yes, calling str() "just in case" has a clear code smell. I think that's
    one of the reasons why b'abc' was chosen as output of bytes.__str__, to
    make it clearly visible a) what the type of the value is, e.g. in an
    interactive session, and b) that this wasn't the intended operation if it
    happened during string interpolation etc. and that the user code needs
    fixing. After all, you were complaining about a clearly visible problem (in
    urlunparse) that was easy to find given the incorrect output.

    I think raising an exception in bytes.__str__ would be a horrible thing to
    do. That would make it really hard and dangerous to look at bytes objects
    in a debugger or interpreter session. I think the current way bytes.__str__
    behaves is a good tradeoff between safety and usability, and the output is
    also very clear and readable.
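
    For code that does want a conversion which fails on bytes, something
    along the lines of the __plain_str__ idea quoted above can be written
    as an ordinary function. This is only a sketch: plain_str is a
    hypothetical name, not part of Python, and the set of rejected types
    is an assumption:

```python
def plain_str(value):
    """Return value as text, but refuse bytes-like objects.

    Hypothetical helper: str() would turn b'bar' into "b'bar'", so this
    raises TypeError and leaves the decoding to the caller.
    """
    if isinstance(value, (bytes, bytearray, memoryview)):
        raise TypeError(
            "refusing to stringify %s; decode it explicitly"
            % type(value).__name__
        )
    return str(value)


print(plain_str("bar"))  # bar
print(plain_str(42))     # 42

rejected = None
try:
    plain_str(b"bar")
except TypeError as exc:
    rejected = exc
```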

    Stefan
    Stefan Behnel, Oct 12, 2010
    #17
  18. Stefan Behnel <> writes:

    > Hallvard B Furuseth, 11.10.2010 23:45:
    >> If there were a __plain_str__() method which was supposed to fail rather
    >> than start to babble Python syntax, and if there were not plenty of
    >> Python code around which invoked __str__, I'd agree.

    >
    > Yes, calling str() "just in case" has a clear code smell. I think
    > that's one of the reasons why b'abc' was chosen as output of
    > bytes.__str__, to make it clearly visible a) what the type of the
    > value is, e.g. in an interactive session


    Isn't that the point of repr()?

    > I think raising an exception in bytes.__str__ would be a horrible
    > thing to do. That would make it really hard and dangerous to look at
    > bytes objects in a debugger or interpreter session.


    Again, the interactive interpreter prints out the repr, and so should
    debuggers, etc. In fact, when the object is embedded in a container,
    all you get is the repr anyway.
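
    A small sketch of the point about containers, assuming an interpreter
    started without the -b/-bb options:

```python
data = b"abc"
text = "abc"

# For bytes, str() and repr() give the same Python-literal form.
bytes_str = str(data)
bytes_repr = repr(data)

# For str, str() is the text itself and only repr() adds quotes.
text_str = str(text)
text_repr = repr(text)

# Containers always format their elements with repr(), so the
# element's own __str__ is never consulted here.
list_of_bytes = str([data])
list_of_text = str([text])

print(bytes_str, bytes_repr)        # b'abc' b'abc'
print(text_str, text_repr)          # abc 'abc'
print(list_of_bytes, list_of_text)  # [b'abc'] ['abc']
```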
    Hrvoje Niksic, Oct 12, 2010
    #18