Urllib vs. FireFox

Gilles Ganault

Hello

After scratching my head over why I couldn't find data in a web page
using the "re" module, I discovered that the page as downloaded by
urllib doesn't match what is displayed when viewing the page source in
Firefox.

For instance, when searching Amazon for "Wargames":

URLLIB:
<a
href="http://www.amazon.fr/Wargames-Matthew-Broderick/dp/B00004RJ7H"><span
class="srTitle">Wargames</span></a>

~ Matthew Broderick, Dabney Coleman, John Wood, et Ally Sheedy
<span class="bindingBlock">(<span class="binding">Cassette
vidéo</span> - 2000)</span></td></tr>

FIREFOX:
<div class="productTitle"><a
href="http://www.amazon.fr/Wargames-Matth...ef=sr_1_1?ie=UTF8&s=dvd&qid=1224872998&sr=8-1">
Wargames</a> <span class="binding"> ~ Matthew Broderick, Dabney
Coleman, John Wood, et Ally Sheedy</span><span class="binding">
(<span class="format">Cassette vidéo</span> - 2000)</span></div>

Why do they differ?

Thank you.
 
Stefan Behnel

Gilles said:
Why do they differ?

The browser sends a different client identifier (the User-Agent request
header) than urllib does, and the server sends back different page content
depending on which client is asking.
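To see which identifier urllib announces by default, you can inspect a freshly built opener. This is a minimal sketch that assumes Python 3, where the relevant module is urllib.request (the thread predates it); it makes no network request.

```python
import urllib.request

# A default opener carries the headers urllib adds to every request.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The identifier starts with "Python-urllib/", followed by a version number,
# which is how a server can tell the script apart from a browser.
print(default_headers["User-agent"])
```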

Stefan
 
Mike Driscoll

Right. If you want your Python script to get the same results that you
got with Firefox, you can set the request headers (in particular
User-Agent) in your code so that they match the browser's.

Here's an example with urllib2:
http://vsbabu.org/mt/archives/2003/05/27/urllib2_setting_http_headers...
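For reference, the same idea can be sketched with Python 3's urllib.request, the successor to urllib2. The BROWSER_UA string and the build_request and fetch helpers are hypothetical names chosen for illustration; substitute the User-Agent your own browser actually sends. A server may also look at other headers or at cookies, so this alone may not reproduce the browser's page exactly.

```python
import urllib.request

# Hypothetical browser-like identifier; replace it with your browser's own string.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Create a request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes using the browser-like request."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()


# Building the request does not touch the network; only fetch() does.
req = build_request("http://www.amazon.fr/")
print(req.get_header("User-agent"))
```

Calling fetch("http://www.amazon.fr/") would then perform the actual download with the overridden header.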

By the way, if you're doing non-trivial web scraping, the mechanize
module might make your work much easier. You can install it with
easy_install.
http://wwwsearch.sourceforge.net/mechanize/

Or if you just need to query stuff on Amazon, then you might find this
module helpful:

http://pypi.python.org/pypi/Python-Amazon/
 
Lie Ryan

Cookies?
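If cookies do turn out to matter, a script can keep them across requests the way a browser does. The following is a minimal sketch assuming Python 3's http.cookiejar and urllib.request; the fetch_with_cookies helper is a hypothetical name, and whether Amazon's output actually depends on cookies is not established in this thread.

```python
import http.cookiejar
import urllib.request

# The jar stores cookies the server sets and replays them on later requests.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)


def fetch_with_cookies(url):
    """Download a page through the cookie-aware opener and return its bytes."""
    with opener.open(url, timeout=30) as response:
        return response.read()
```

After one or more calls to fetch_with_cookies, iterating over cookie_jar shows which cookies the server set.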
 
