practical limits of urlopen()

webcomm

Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs? There will be about 2KB of data on average at each
URL. I will probably run the script about twice per day. Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server.
Is it very uncommon to get data from so many URLs in a script? I
guess search spiders do it, so I should be able to as well?

Thank you,
Ryan
 
Steve Holden

webcomm said:
Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs? There will be about 2KB of data on average at each
URL. I will probably run the script about twice per day. Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server.
Is it very uncommon to get data from so many URLs in a script? I
guess search spiders do it, so I should be able to as well?

You shouldn't expect problems, though you might want to think about
using a more advanced technique like threading to get your results
more quickly.

This is Python, though. It shouldn't take long to write a test program
to verify that you can indeed spider 3,000 pages this way.
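
Something along these lines would do as a first test, assuming Python 3's
urllib.request; the URL list and save_to_db() below are just stand-ins for
your own list and database code:

import urllib.request

# Hypothetical inputs -- substitute your real URL list and database call.
urls = ["http://example.com/page%d" % i for i in range(3000)]

def save_to_db(url, data):
    pass  # stand-in for whatever your database layer does

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
    except OSError as exc:  # URLError and socket timeouts are OSErrors
        print("failed:", url, exc)
        continue
    save_to_db(url, data)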

With about 2KB per page, you could probably build up a memory structure
containing the whole content of every page without memory usage becoming
excessive on a modern system. If you are writing the data out to your
database as you go and not retaining page content, then there should be
no problems whatsoever.

Then look at a parallelized solution of some sort if you need it to work
more quickly.
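
If you do go that route, a rough sketch using concurrent.futures from the
standard library might look like this (fetch() and save_to_db() are again
placeholders for your own code):

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # download one page; at ~2KB each, holding a page in memory is fine
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def save_to_db(url, data):
    pass  # stand-in for the real database write

urls = ["http://example.com/page%d" % i for i in range(3000)]

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch, url): url for url in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            save_to_db(url, fut.result())
        except OSError as exc:
            print("failed:", url, exc)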

regards
Steve
 
Lie Ryan

webcomm said:
Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs? There will be about 2KB of data on average at each
URL. I will probably run the script about twice per day. Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server. Is
it very uncommon to get data from so many URLs in a script? I guess
search spiders do it, so I should be able to as well?

urllib itself doesn't impose any limit; what might limit your program is
your connection speed and the hardware that the server and the downloader
run on. Fetching 3,000 URLs at roughly 2KB each is about 6MB, a piece of
cake for any reasonably modern machine on a decent internet connection
(the real calculation isn't quite that simple, though, since there is also
some cost associated with sending and processing the HTTP headers for each
request).
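
If you want to measure that overhead rather than guess at it, a quick check
like the following (against a placeholder URL, using Python 3's
urllib.request) prints the body size next to a rough figure for the response
headers:

import urllib.request

url = "http://example.com/page1"  # placeholder -- substitute a real URL

with urllib.request.urlopen(url, timeout=30) as resp:
    body = resp.read()
    header_size = len(str(resp.headers))  # rough size of the response headers

print("body: %d bytes, headers: about %d bytes" % (len(body), header_size))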

Google indexes millions of pages per day, but they also have one of the
most advanced server farms in the world.
 
