practical limits of urlopen()

webcomm

Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs? There will be about 2KB of data on average at each
URL. I will probably run the script about twice per day. Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server.
Is it very uncommon to get data from so many URLs in a script? I
guess search spiders do it, so I should be able to as well?

Thank you,
Ryan
 
Steve Holden

webcomm said:
Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs? There will be about 2KB of data on average at each
URL. I will probably run the script about twice per day. Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server.
Is it very uncommon to get data from so many URLs in a script? I
guess search spiders do it, so I should be able to as well?

You shouldn't expect problems, though you might want to think about
using a more advanced technique like threading to get your results
more quickly.

This is Python, though. It shouldn't take long to write a test program
to verify that you can indeed spider 3,000 pages this way.
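
Something along these lines would do as a first test, assuming Python 3's
urllib.request; the URL list and save_to_db() below are just stand-ins for
your own list and database code:

import urllib.request

# Hypothetical inputs -- substitute your real URL list and database call.
urls = ["http://example.com/page%d" % i for i in range(3000)]

def save_to_db(url, data):
    pass  # stand-in for whatever your database layer does

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
    except OSError as exc:  # URLError and socket timeouts are OSErrors
        print("failed:", url, exc)
        continue
    save_to_db(url, data)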

With about 2KB per page, you could probably build up a memory structure
containing the whole content of every page without memory usage becoming
excessive on a modern system. If you are writing the data out to your
database as you go and not retaining page content, then there should be
no problems whatsoever.

Then look at a parallelized solution of some sort if you need it to work
more quickly.
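
If you do go that route, a rough sketch using concurrent.futures from the
standard library might look like this (fetch() and save_to_db() are again
placeholders for your own code):

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # download one page; at ~2KB each, holding a page in memory is fine
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def save_to_db(url, data):
    pass  # stand-in for the real database write

urls = ["http://example.com/page%d" % i for i in range(3000)]

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch, url): url for url in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            save_to_db(url, fut.result())
        except OSError as exc:
            print("failed:", url, exc)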

regards
Steve
 
Lie Ryan

webcomm said:
Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs? There will be about 2KB of data on average at each
URL. I will probably run the script about twice per day. Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server. Is
it very uncommon to get data from so many URLs in a script? I guess
search spiders do it, so I should be able to as well?

urllib itself doesn't impose any limit; what might limit your program is
your connection speed and the hardware that the server and the downloader
run on. Fetching 3,000 URLs at roughly 2KB each is about 6MB, a piece of
cake for any reasonably modern machine on a decent internet connection
(the real calculation isn't quite that simple, though, since there is also
some cost associated with sending and processing the HTTP headers for each
request).
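
If you want to measure that overhead rather than guess at it, a quick check
like the following (against a placeholder URL, using Python 3's
urllib.request) prints the body size next to a rough figure for the response
headers:

import urllib.request

url = "http://example.com/page1"  # placeholder -- substitute a real URL

with urllib.request.urlopen(url, timeout=30) as resp:
    body = resp.read()
    header_size = len(str(resp.headers))  # rough size of the response headers

print("body: %d bytes, headers: about %d bytes" % (len(body), header_size))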

Google indexes millions of pages per day, but they also have one of the
most advanced server farms in the world.
 
