Re: web crawler in python

Discussion in 'Python' started by Philip Semanchuk, Dec 10, 2009.

  1. On Dec 9, 2009, at 7:39 PM, my name wrote:

    > I'm currently planning on writing a web crawler in python but have a
    > question as far as how I should design it. My goal is speed and maximum
    > efficient use of the hardware\bandwidth I have available.
    > As of now I have a Dual 2.4ghz xeon box, 4gb ram, 500gb sata and a 20mbps
    > bandwidth cap (for now). Running FreeBSD.
    > What would be the best way to design the crawler? Using the thread module?
    > Would I be able to max out this connection with the hardware listed above
    > using python threads?

    I wrote a web crawler in Python (under FreeBSD, in fact) and I chose
    to do it using separate processes. Process A would download pages and
    write them to disk, process B would attempt to convert them to
    Unicode, process C would evaluate the content, etc. That worked well
    for me because the processes were very independent of one another so
    they had very little data to share. Each process had a work queue
    (Postgres database table); process A would feed B's queue, B would
    feed C & D's queues, etc.
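
    A rough sketch of the queue-per-stage idea described above, for anyone
    who wants something concrete to start from. This is not the original
    crawler's code: sqlite3 stands in for the Postgres table, and the
    schema, the stage names and the helpers (open_queue, enqueue,
    claim_next, run_stage) are all made up for illustration.

```python
import sqlite3

# Assumption: sqlite3 stands in for the Postgres work-queue tables mentioned
# in the post; the schema and every name below are hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS work_queue (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    stage   TEXT NOT NULL,              -- which process should pick this up
    payload TEXT NOT NULL,              -- e.g. a URL or a path to a saved page
    done    INTEGER NOT NULL DEFAULT 0  -- 1 once a worker has claimed it
)
"""


def open_queue(path=":memory:"):
    """Open (or create) the queue database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, stage, payload):
    """Add one work item to the queue of the given stage."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO work_queue (stage, payload) VALUES (?, ?)",
            (stage, payload),
        )


def claim_next(conn, stage):
    """Return the oldest unclaimed payload for a stage, or None if empty."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM work_queue "
            "WHERE stage = ? AND done = 0 ORDER BY id LIMIT 1",
            (stage,),
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE work_queue SET done = 1 WHERE id = ?", (row[0],))
        return row[1]


def run_stage(conn, stage, handler, next_stages):
    """Drain one stage's queue and feed each result to the downstream queues."""
    processed = 0
    while True:
        item = claim_next(conn, stage)
        if item is None:
            return processed
        result = handler(item)
        for nxt in next_stages:
            enqueue(conn, nxt, result)
        processed += 1
```

    In a real setup each stage would run as its own process with its own
    database connection, looping (or sleeping) when its queue is empty
    instead of returning. With more than one worker per stage, claiming
    a row would also need proper locking, which this sketch ignores.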

    I should point out that my crawler spidered one site at a time. As a
    result the downloading process spent a lot of time waiting (in order
    to be polite to the remote Web server). This sounds pretty different
    from what you want to do (and indeed from most crawlers).
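
    To illustrate the waiting: a minimal sketch of a per-site delay,
    assuming a fixed minimum gap between requests. The 2-second default
    and the names PoliteThrottle and fetch_all are hypothetical, not
    taken from the crawler described above. The clock and sleep
    functions are injectable so the logic can be tried without actually
    waiting or touching the network.

```python
import time
import urllib.request


class PoliteThrottle:
    """Enforce a minimum gap between consecutive requests to one site."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds; the default is an assumption
        self._clock = clock
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep until min_delay has passed since the previous request.

        Returns the number of seconds slept (0.0 for the first request).
        """
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_delay - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause


def fetch_all(urls, throttle, fetch=None):
    """Fetch URLs from a single site one at a time, pausing between requests."""
    if fetch is None:
        def fetch(url):
            # Default fetcher: a plain blocking download with a timeout.
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
    pages = {}
    for url in urls:
        throttle.wait()
        pages[url] = fetch(url)
    return pages
```

    With a delay like this, one downloader sits idle most of the time,
    so a single-site spider will not come close to filling a 20mbps
    link. A crawler that covers many sites at once could keep one such
    throttle per host and interleave the requests.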

    Figuring out the best design for your crawler depends on a host of
    factors that you haven't mentioned. (What are you doing with the
    pages you download? Is the box doing anything else? Are you storing
    the pages long term or discarding them? etc.) I don't think we can do
    it for you -- I know *I* can't; I have a day job. ;) But I encourage
    you to try something out. If you find your code isn't giving what you
    want, come back to the list with a specific problem. It's always
    easier to help with specific than with general problems.

    Good luck
    Philip Semanchuk, Dec 10, 2009
