Urllib vs. FireFox

Gilles Ganault · Oct 24, 2008

Hello

After scratching my head over why I couldn't find data in a web page
using the "re" module, I discovered that the page as downloaded by
urllib doesn't match what is displayed when viewing the page source in
FireFox.

For instance, when searching Amazon for "Wargames":

URLLIB:
<a href="http://www.amazon.fr/Wargames-Matthew-Broderick/dp/B00004RJ7H">Wargames</a>
~ Matthew Broderick, Dabney Coleman, John Wood, et Ally Sheedy
(Cassette vidéo - 2000)</td></tr>

FIREFOX:
<div class="productTitle"><a href="http://www.amazon.fr/Wargames-Matth...ef=sr_1_1?ie=UTF8&s=dvd&qid=1224872998&sr=8-1">Wargames</a>
~ Matthew Broderick, Dabney Coleman, John Wood, et Ally Sheedy
(Cassette vidéo - 2000)</div>

Why do they differ?

Thank you.

Stefan Behnel · Oct 24, 2008

Gilles said:
Why do they differ?

The browser sends a different client identifier (the User-Agent header)
than urllib, and the server sends back different page content depending
on which client is asking.
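
To see the identifier in question, here is a minimal sketch. It assumes Python 3, where the urllib2 functionality of this era lives in urllib.request, and it only inspects the default headers without contacting any server:

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers urllib adds to a request when the caller does not set them.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The key is stored as "User-agent" in this list.
default_agent = default_headers["User-agent"]

# Prints something like "Python-urllib/3.x", depending on the interpreter.
print(default_agent)
```

A server that keys its output on this header can tell such a request apart from one sent by a browser.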

Stefan

Rex · Oct 24, 2008

Right. If you want your Python script to get the same results that you
got with Firefox, you can set the HTTP request headers (the User-Agent
in particular) in your code.

Here's an example with urllib2:
http://vsbabu.org/mt/archives/2003/05/27/urllib2_setting_http_headers.html
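
In case that link is unreachable, the idea can be sketched roughly as follows. This assumes Python 3 (urllib.request rather than urllib2); the URL and the User-Agent string are placeholder values, and build_request and fetch are hypothetical helper names, not part of any library:

```python
import urllib.request

# Placeholder values for illustration only: substitute the page you are
# scraping and the User-Agent string your own browser reports.
URL = "http://www.amazon.fr/"
BROWSER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_AGENT):
    """Return a Request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes (this performs a network request)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Building the request does not touch the network, so the header can be
# checked before anything is sent. Request stores the key as "User-agent".
request = build_request(URL)
print(request.get_header("User-agent"))
```

Calling fetch(URL) would then download the page with the browser-like header. The server may still vary its output on other things (cookies, language headers), so an identical page is not guaranteed.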

By the way, if you're doing non-trivial web scraping, the mechanize
module might make your work much easier. You can install it with
easy_install.
http://wwwsearch.sourceforge.net/mechanize/

Mike Driscoll · Oct 24, 2008

Rex said:
By the way, if you're doing non-trivial web scraping, the mechanize
module might make your work much easier. You can install it with
easy_install.
http://wwwsearch.sourceforge.net/mechanize/

Or if you just need to query stuff on Amazon, then you might find this
module helpful:

http://pypi.python.org/pypi/Python-Amazon/

Lie Ryan · Oct 25, 2008

Gilles said:
After scratching my head as to why I failed finding data from a web page
using the "re" module, I discovered that a web page as downloaded by
urllib doesn't match what is displayed when viewing the source page in
FireFox.

Cookies?
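
If cookies do turn out to matter, a minimal sketch of a cookie-aware opener might look like this. It assumes Python 3's http.cookiejar and urllib.request; fetch_with_cookies is a hypothetical helper name, and nothing is sent until it is called:

```python
import http.cookiejar
import urllib.request

# The jar collects cookies set by responses and sends them back on later
# requests made through this same opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_with_cookies(url):
    """Download a page through the cookie-aware opener (network request)."""
    with opener.open(url) as response:
        return response.read()
```

After one or more calls to fetch_with_cookies, iterating over jar shows which cookies the server set.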

Tim Roberts · Oct 26, 2008

Lie Ryan said:
Cookies?

Yes, please. I'll take two. Chocolate chip. With milk.

Gilles Ganault · Oct 28, 2008

Mike Driscoll said:
Or if you just need to query stuff on Amazon, then you might find this
module helpful:

http://pypi.python.org/pypi/Python-Amazon/

Thanks a bunch. I didn't know about the AWS service.
