python3 urlopen(...).read() returns bytes

Glenn G. Chappell · Dec 22, 2008

I just ran 2to3 on a py2.5 script that does pattern matching on the
text of a web page. The resulting script crashed, because when I did

f = urllib.request.urlopen(url)
text = f.read()

then "text" is a bytes object, not a string, and so I can't do a
regexp on it.

Of course, this is easy to patch: just do "f.read().decode()".
However, it strikes me as an obvious bug, which ought to be fixed.
That is, read() should return a string, as it did in py2.5.
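
For reference, a minimal sketch of the failure and the two usual ways around it. The sample bytes and the title pattern are invented for illustration, and the page is assumed to be UTF-8:

```python
import re

# Made-up sample of what read() might return: UTF-8 encoded bytes.
page = b"<html><title>Caf\xc3\xa9</title></html>"

# A str pattern cannot be applied to a bytes object in Python 3.
try:
    re.search(r"<title>(.*?)</title>", page)
    str_pattern_ok = True
except TypeError:
    str_pattern_ok = False

# Option 1: decode first (encoding assumed to be UTF-8), then use a str pattern.
title_text = re.search(r"<title>(.*?)</title>", page.decode("utf-8")).group(1)

# Option 2: keep the bytes and use a bytes pattern instead.
title_bytes = re.search(rb"<title>(.*?)</title>", page).group(1)
```

Option 1 only gives correct text if the assumed encoding matches what the server actually sent.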

But apparently others disagree? This was mentioned in issue 3930
( http://bugs.python.org/issue3930 ) back in September '08, but that
issue is now closed, apparently because consistent behavior was
achieved. But I figure consistently bad behavior is still bad.

This change breaks pretty much every Python program that opens a
webpage, doesn't it? 2to3 doesn't catch it, and, in any case, why
should read() return bytes, not string? Am I missing something?

By the way, I'm running Ubuntu 8.10. Doing "python3 --version" prints
"Python 3.0rc1+".

Carl Banks · Dec 22, 2008

I just ran 2to3 on a py2.5 script that does pattern matching on the
text of a web page. The resulting script crashed, because when I did

f = urllib.request.urlopen(url)
text = f.read()

then "text" is a bytes object, not a string, and so I can't do a
regexp on it.

Of course, this is easy to patch: just do "f.read().decode()".
However, it strikes me as an obvious bug, which ought to be fixed.
That is, read() should return a string, as it did in py2.5.

Well, I can't agree that it's an obvious bug (in Python 3). It might
be something worth raising a warning over in 2to3. It would also be a
reasonable wishlist item for automatic encoding detection and
conversion to a string (see below). But it's not a bug.

But apparently others disagree? This was mentioned in issue 3930
(http://bugs.python.org/issue3930) back in September '08, but that
issue is now closed, apparently because consistent behavior was
achieved. But I figure consistently bad behavior is still bad.

This change breaks pretty much every Python program that opens a
webpage, doesn't it?

No. What if someone is using urllib to retrieve (say) a JPEG image? A
bytes object is what they'd want in Python 3. Also, many people were
already explicitly dealing with encodings in Python 2.5; the change
wouldn't affect them.

2to3 doesn't catch it, and, in any case, why
should read() return bytes, not string? Am I missing something?

It returns bytes because it doesn't know what encoding to use. This
is the appropriate behavior.

HOWEVER... a web page request often does know what encoding to use,
since it ostensibly has to parse the header. It's reasonable that IF
a url request's "Content-type" is text, and/or the "Content-encoding"
is given, for urllib to have an option to automatically decode and
return a string instead of bytes. (For all I know, it already can do
that.)
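
A rough sketch of what such an option might do, assuming the charset is taken from the charset parameter of the Content-Type header. The helper name decode_body and the UTF-8 fallback are hypothetical, and the headers are simulated with email.message.Message (the class that urllib's response.headers is built on) so that no network access is needed:

```python
from email.message import Message

def decode_body(body, headers, fallback="utf-8"):
    """Return str for text/* responses, the original bytes otherwise.

    Hypothetical helper: the fallback encoding is an assumption.
    """
    # A missing Content-Type is treated as text/plain by Message.
    if headers.get_content_maintype() != "text":
        return body
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)

# Simulated headers for a Latin-1 HTML page.
html_headers = Message()
html_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
text = decode_body(b"caf\xe9", html_headers)

# Simulated headers for an image: the bytes come back untouched.
jpeg_headers = Message()
jpeg_headers["Content-Type"] = "image/jpeg"
image = decode_body(b"\xff\xd8\xff", jpeg_headers)
```

With a real response f from urlopen, the call would presumably be decode_body(f.read(), f.headers).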

Carl Banks

ajaksu · Dec 22, 2008

It's not possible unless you know the encoding of the bytes. Network io
only returns bytes and you must decode them explicitly. [...]
There is no generic and simple way to detect the encoding of a remote
site. Sometimes the encoding is mentioned in the HTTP header, sometimes
it's embedded in the <head> section of the HTML document.

That said, a "decode to declared HTTP header encoding" version of
urlopen could be useful to give some users the output they want (text
from network io) or to make it clear why bytes is the safe way.
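
To make the two places concrete, here is a small sketch that looks at the HTTP header first and then at the start of the document. The function name guess_charset, the regular expression, the 1024-byte window and the UTF-8 default are all assumptions for illustration; this is not how any browser or library actually does detection:

```python
import re
from email.message import Message

# Crude pattern for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def guess_charset(body, headers, default="utf-8"):
    """Return a charset name from the header, else the markup, else default."""
    declared = headers.get_content_charset()
    if declared:
        return declared
    # Only scan the beginning of the document (window size is arbitrary).
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default

# Simulated response: the header has no charset, the <head> section does.
headers = Message()
headers["Content-Type"] = "text/html"
body = b'<html><head><meta charset="koi8-r"></head><body></body></html>'
charset = guess_charset(body, headers)
```

Neither source is guaranteed to be present or truthful, which is the reason bytes remain the safe default.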

Daniel

Glenn G. Chappell · Dec 23, 2008

Okay, so I guess I didn't really *get* the whole unicode/text/binary
thing. Maybe I still don't, but I think I'm getting closer. Thanks to
everyone who replied.

That said, a "decode to declared HTTP header encoding" version of
urlopen could be useful to give some users the output they want (text
from network io) or to make it clear why bytes is the safe way.

Sounds like a great idea. More to the point, it sounds like it's
pretty much a necessary idea.

Consider: reading a web page is an easy one-liner. Now, no one is
going to write that one-liner, and then spend 20 lines trying to get
the Content-Type and encoding figured out. Instead we're all going to
do it the short, easy, *wrong* way. So every program in the world that
uses urlopen gets to have the same bug. Not good. The *right* way
needs to be the *easy* way.

-GGC-

ajaksu · Dec 23, 2008

Yeah, your idea sounds both useful and feasible. A patch is welcome!

Would monkeypatching what urlopen returns be good enough or should we
aim at a cleaner implementation?

Glenn, does this sketch work for you?

import socket
from urllib.request import urlopen

def urlopen_text(url, data=None,
                 timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
    """Like urlopen(), but the read methods return str instead of bytes."""
    response = urlopen(url, data, timeout)
    _readline = response.readline
    _readlines = response.readlines
    _read = response.read
    # Charset from the Content-Type header; None if the server sent none.
    # The UTF-8 fallback is an assumption to avoid str(..., encoding=None).
    charset = response.headers.get_charsets()[0] or "utf-8"

    def readline(limit=-1):
        content = _readline(limit)
        return str(content, encoding=charset)
    response.readline = readline

    def readlines(hint=-1):
        content = _readlines(hint)
        return [str(line, encoding=charset) for line in content]
    response.readlines = readlines

    def read(n=None):
        content = _read(n)
        return str(content, encoding=charset)
    response.read = read

    return response

Any comments/suggestions are very welcome. I could use some help from
people that know urllib on the best way to get the charset. Maybe
after some sleep I can code it in a less awful way

Daniel

ajaksu · Dec 23, 2008

If you want to do it right ... It should be a clean patch against the
py3k svn branch

Done

including documentation

This thread is a good start

and a unit test.

Doing this now.

Daniel

ajaksu · Dec 23, 2008

If you want to do it right ... It should be a clean patch against the
py3k svn branch including documentation and a unit test.

Got all three at http://bugs.python.org/issue4733 . Probably got all
three wrong too, so any feedback is very welcome

I think a neat improvement, besides better docs, names, code, etc.,
would be to implement Carl Banks's idea of checking Content-Encoding.

Oh, and there's a 'print("Using default charset '%s'" %
response.charset)' in there that seemed educational when I wrote it,
but now sounds lame

Glenn, can you test the patch[1] there?

Thanks for the encouragement, Chris! I hope we don't regret it

Daniel

[1] http://bugs.python.org/file12437/urlopen_text.diff
