python3 urlopen(...).read() returns bytes

Discussion in 'Python' started by Glenn G. Chappell, Dec 22, 2008.

  1. I just ran 2to3 on a py2.5 script that does pattern matching on the
    text of a web page. The resulting script crashed, because when I did

    f = urllib.request.urlopen(url)
    text = f.read()

    then "text" is a bytes object, not a string, and so I can't do a
    regexp on it.

    Of course, this is easy to patch: just do "f.read().decode()".
    However, it strikes me as an obvious bug, which ought to be fixed.
    That is, read() should return a string, as it did in py2.5.
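
    [Editor's note] A minimal sketch of the failure and of the decode() patch described above. The sample page, the helper name find_title, and the UTF-8 default are assumptions made for illustration; a real page may use another encoding. It also shows that a bytes pattern can search bytes directly, although the match then comes back as bytes.

```python
import re

# Stand-in for what urlopen(url).read() returns in Python 3: raw bytes.
raw = "<title>Caf\u00e9 menu</title>".encode("utf-8")

def find_title(page_bytes, encoding="utf-8"):
    """Decode the page, then search it with an ordinary str pattern."""
    text = page_bytes.decode(encoding)
    match = re.search(r"<title>(.*?)</title>", text)
    return match.group(1) if match else None

# A str pattern applied to a bytes object raises TypeError.
try:
    re.search(r"<title>", raw)
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# A bytes pattern works on bytes, but the captured group is bytes too.
bytes_match = re.search(rb"<title>(.*?)</title>", raw)

title = find_title(raw)
```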

    But apparently others disagree? This was mentioned in issue 3930
    ( http://bugs.python.org/issue3930 ) back in September '08, but that
    issue is now closed, apparently because consistent behavior was
    achieved. But I figure consistently bad behavior is still bad.

    This change breaks pretty much every Python program that opens a
    webpage, doesn't it? 2to3 doesn't catch it, and, in any case, why
    should read() return bytes, not string? Am I missing something?

    By the way, I'm running Ubuntu 8.10. Doing "python3 --version" prints
    "Python 3.0rc1+".
    Glenn G. Chappell, Dec 22, 2008
    #1

  2. Carl Banks

    On Dec 22, 3:41 pm, "Glenn G. Chappell" <>
    wrote:
    > I just ran 2to3 on a py2.5 script that does pattern matching on the
    > text of a web page. The resulting script crashed, because when I did
    >
    >     f = urllib.request.urlopen(url)
    >     text = f.read()
    >
    > then "text" is a bytes object, not a string, and so I can't do a
    > regexp on it.
    >
    > Of course, this is easy to patch: just do "f.read().decode()".
    > However, it strikes me as an obvious bug, which ought to be fixed.
    > That is, read() should return a string, as it did in py2.5.


    Well, I can't agree that it's an obvious bug (in Python 3). It might
    be something worth raising a warning over in 2to3. It would also be a
    reasonable wishlist item for automatic encoding detection and
    conversion to a string (see below). But it's not a bug.


    > But apparently others disagree? This was mentioned in issue 3930
    > (http://bugs.python.org/issue3930) back in September '08, but that
    > issue is now closed, apparently because consistent behavior was
    > achieved. But I figure consistently bad behavior is still bad.
    >
    > This change breaks pretty much every Python program that opens a
    > webpage, doesn't it?


    No. What if someone is using urllib to retrieve (say) a JPEG image? A
    bytes object is what they'd want in Python 3. Also, many people were
    already explicitly dealing with encodings in Python 2.5; the change
    wouldn't affect them.


    > 2to3 doesn't catch it, and, in any case, why
    > should read() return bytes, not string? Am I missing something?


    It returns bytes because it doesn't know what encoding to use. This
    is the appropriate behavior.


    HOWEVER... a web page request often does know what encoding to use,
    since it ostensibly has to parse the header. It's reasonable that IF
    a url request's "Content-type" is text, and/or the "Content-encoding"
    is given, for urllib to have an option to automatically decode and
    return a string instead of bytes. (For all I know, it already can do
    that.)
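
    [Editor's note] A sketch of the header-based decoding suggested above, assuming the charset arrives as a parameter of the Content-Type header, which is where HTTP normally carries it. The object in response.headers is an email.message.Message subclass, so a plain Message stands in for it here and the sketch runs without a network. The helper name decode_body and the UTF-8 fallback are hypothetical choices, not urllib behavior.

```python
from email.message import Message

def decode_body(body, headers, default="utf-8"):
    """Decode body using the charset from Content-Type, else `default`."""
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header or the parameter is missing.
    charset = headers.get_content_charset() or default
    return body.decode(charset)

# Stand-in for response.headers from urllib.request.urlopen().
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
body = "na\u00efve".encode("iso-8859-1")

text = decode_body(body, headers)
```

    With a live response the call would look like decode_body(f.read(), f.headers). A binary resource such as a JPEG should skip this step and stay as bytes.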


    Carl Banks
    Carl Banks, Dec 22, 2008
    #2

  3. ajaksu

    On Dec 22, 8:25 pm, Christian Heimes <> wrote:
    > It's not possible unless you know the encoding of the bytes. Network I/O
    > only returns bytes, and you must decode them explicitly.

    [...]
    > There is no generic and simple way to detect the encoding of a remote
    > site. Sometimes the encoding is mentioned in the HTTP header, sometimes
    > it's embedded in the <head> section of the HTML document.


    That said, a "decode to declared HTTP header encoding" version of
    urlopen could be useful to give some users the output they want (text
    from network io) or to make it clear why bytes is the safe way.
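
    [Editor's note] When the HTTP header is silent, the other place mentioned above is the <head> of the document itself. A rough sketch of sniffing it with a bytes regex follows. The helper name sniff_meta_charset, the pattern, and the 1024-byte scan window are assumptions for illustration. A robust implementation would use a real HTML parser and handle many more edge cases.

```python
import re

# Matches <meta charset="..."> and the older http-equiv form, where
# charset=... sits inside the content attribute.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE
)

def sniff_meta_charset(page_bytes, scan_limit=1024):
    """Return the charset named in a <meta> tag near the top, or None."""
    match = _META_CHARSET.search(page_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None

html5 = b'<html><head><meta charset="utf-8"></head>'
legacy = (b'<meta http-equiv="Content-Type" '
          b'content="text/html; charset=ISO-8859-1">')
html5_charset = sniff_meta_charset(html5)
legacy_charset = sniff_meta_charset(legacy)
```

    The search has to run on bytes because the text cannot be decoded until the encoding is known. This works only because the declaration is written in ASCII-compatible characters.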

    Daniel
    ajaksu, Dec 22, 2008
    #3
  4. Okay, so I guess I didn't really *get* the whole unicode/text/binary
    thing. Maybe I still don't, but I think I'm getting closer. Thanks to
    everyone who replied.

    On Dec 22, 1:41 pm, ajaksu <> wrote:
    > On Dec 22, 8:25 pm, Christian Heimes <> wrote:
    > That said, a "decode to declared HTTP header encoding" version of
    > urlopen could be useful to give some users the output they want (text
    > from network io) or to make it clear why bytes is the safe way.


    Sounds like a great idea. More to the point, it sounds like it's
    pretty much a necessary idea.

    Consider: reading a web page is an easy one-liner. Now, no one is
    going to write that one-liner, and then spend 20 lines trying to get
    the Content-Type and encoding figured out. Instead we're all going to
    do it the short, easy, *wrong* way. So every program in the world that
    uses urlopen gets to have the same bug. Not good. The *right* way
    needs to be the *easy* way.

    -GGC-
    Glenn G. Chappell, Dec 23, 2008
    #4
  5. ajaksu

    On Dec 22, 9:05 pm, Christian Heimes <> wrote:
    > ajaksu schrieb:
    >
    > > That said, a "decode to declared HTTP header encoding" version of
    > > urlopen could be useful to give some users the output they want (text
    > > from network io) or to make it clear why bytes is the safe way.

    >
    > Yeah, your idea sounds both useful and feasible. A patch is welcome! :)


    Would monkeypatching what urlopen returns be good enough or should we
    aim at a cleaner implementation?

    Glenn, does this sketch work for you?

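    import socket
    from urllib.request import urlopen

    def urlopen_text(url, data=None,
                     timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        """Open url; make read/readline/readlines return str, not bytes."""
        response = urlopen(url, data, timeout)
        _readline = response.readline
        _readlines = response.readlines
        _read = response.read
        # Charset from the Content-Type header; the utf-8 fallback is an
        # assumption for when the server declares none.
        charset = response.headers.get_charsets()[0] or "utf-8"
        # Note: a size-limited read can split a multi-byte character.
        def readline(limit=-1):
            content = _readline(limit)
            return str(content, encoding=charset)
        response.readline = readline
        def readlines(hint=-1):
            content = _readlines(hint)
            return [str(line, encoding=charset) for line in content]
        response.readlines = readlines
        def read(n=None):
            content = _read(n)
            return str(content, encoding=charset)
        response.read = read
        return response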

    Any comments/suggestions are very welcome. I could use some help from
    people that know urllib on the best way to get the charset. Maybe
    after some sleep I can code it in a less awful way :)
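
    [Editor's note] One possible "cleaner" alternative to monkeypatching is to wrap the binary stream in io.TextIOWrapper, which decodes incrementally and so copes with multi-byte characters that straddle read boundaries. This is a sketch, not what the thread or the eventual patch did. The helper name as_text is hypothetical, and io.BytesIO stands in for the urlopen() response so that it runs offline.

```python
import io

def as_text(binary_stream, charset="utf-8"):
    """Wrap a binary file-like object so that reads return str."""
    return io.TextIOWrapper(binary_stream, encoding=charset)

# io.BytesIO stands in for the object returned by urlopen().
fake_response = io.BytesIO(
    "premi\u00e8re ligne\nseconde ligne\n".encode("utf-8")
)
stream = as_text(fake_response)
first_line = stream.readline()
rest = stream.read()
```

    With a real response the charset argument would come from the response headers, and closing the wrapper also closes the underlying stream.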

    Daniel
    ajaksu, Dec 23, 2008
    #5
  6. ajaksu

    On Dec 23, 12:51 pm, Christian Heimes <> wrote:

    > If you want to do it right ... It should be a clean patch against the
    > py3k svn branch

    Done

    > including documentation

    This thread is a good start :)

    > and a unit test.

    Doing this now.

    Daniel
    ajaksu, Dec 23, 2008
    #6
  7. ajaksu

    On Dec 23, 12:51 pm, Christian Heimes <> wrote:
    > If you want to do it right ... It should be a clean patch against the
    > py3k svn branch including documentation and a unit test.


    Got all three at http://bugs.python.org/issue4733 . Probably got all
    three wrong too, so any feedback is very welcome :)

    I think a neat improvement, besides better docs, names, code, etc.,
    would be to implement Carl Banks's idea of checking Content-Encoding.
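
    [Editor's note] In HTTP, Content-Encoding names a compression coding such as gzip, while the character set travels in Content-Type. A helper that returns text may therefore need to check both. A minimal sketch, assuming only gzip needs handling and using a hypothetical helper name body_to_text; it is not taken from the patch on the tracker.

```python
import gzip
from email.message import Message

def body_to_text(body, headers, default_charset="utf-8"):
    """Undo gzip Content-Encoding if present, then decode to str."""
    if headers.get("Content-Encoding", "").strip().lower() == "gzip":
        body = gzip.decompress(body)
    charset = headers.get_content_charset() or default_charset
    return body.decode(charset)

# A plain Message stands in for response.headers.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
headers["Content-Encoding"] = "gzip"
compressed = gzip.compress("<p>\u00fcber</p>".encode("utf-8"))

text = body_to_text(compressed, headers)
```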

    Oh, and there's a 'print("Using default charset '%s'" %
    response.charset)' in there that seemed educational when I wrote it,
    but now sounds lame :)

    Glenn, can you test the patch[1] there?

    Thanks for the encouragement, Chris! I hope we don't regret it :D

    Daniel

    [1]http://bugs.python.org/file12437/urlopen_text.diff
    ajaksu, Dec 23, 2008
    #7