Web Crawler - Python or Perl?

Discussion in 'Python' started by disappearedng@gmail.com, Jun 9, 2008.

  1. Guest

    Hi all,
    I am currently planning to write my own web crawler. I know Python but
    not Perl, and I am interested in knowing which of these two is the
    better choice given the following scenario:

    1) I/O issues: my biggest resource constraint will be the bandwidth
    bottleneck.
    2) Efficiency issues: The crawlers have to be fast, robust, and as
    memory-efficient as possible. I am running all of my crawlers on
    cheap PCs with about 500 MB of RAM and P3 to P4 processors.
    3) Compatibility issues: Most of these crawlers will run on Unix
    (FreeBSD), so there should exist a pretty good compiler that can
    optimize my code under these environments.

    What are your opinions?
    disappearedng@gmail.com, Jun 9, 2008
    #1

  2. subeen Guest

    On Jun 9, 11:48 pm, disappearedng@gmail.com wrote:
    > Hi all,
    > I am currently planning to write my own web crawler. I know Python but
    > not Perl, and I am interested in knowing which of these two is the
    > better choice given the following scenario:
    >
    > 1) I/O issues: my biggest resource constraint will be the bandwidth
    > bottleneck.
    > 2) Efficiency issues: The crawlers have to be fast, robust, and as
    > memory-efficient as possible. I am running all of my crawlers on
    > cheap PCs with about 500 MB of RAM and P3 to P4 processors.
    > 3) Compatibility issues: Most of these crawlers will run on Unix
    > (FreeBSD), so there should exist a pretty good compiler that can
    > optimize my code under these environments.
    >
    > What are your opinions?


    It really doesn't matter whether you use Perl or Python for writing
    web crawlers; I have used both. The scenarios you mention (I/O,
    efficiency, compatibility) don't differ much between the two
    languages. Both have fast I/O. In Python you can use the urllib2
    module and/or Beautiful Soup for developing a crawler; in Perl you can
    use the WWW::Mechanize or LWP modules. Both languages have good
    support for regular expressions. I have heard Perl is slightly faster,
    though I haven't noticed the difference myself. Both are compatible
    with *nix. For writing a good crawler, the language matters less than
    the technology and design behind it.
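
    For example, a bare-bones sketch along these lines (untested, and the
    URL is just a placeholder) shows the urllib2 + Beautiful Soup approach:

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 API

    def fetch_links(url):
        # Download the page and return every href found in an anchor tag.
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        return [a['href'] for a in soup.findAll('a', href=True)]

    print fetch_links('http://example.com/')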

    regards,
    Subeen.
    http://love-python.blogspot.com/
    subeen, Jun 9, 2008
    #2

  3. disappearedng@gmail.com wrote:
    > 1) I/O issues: my biggest resource constraint will be the bandwidth
    > bottleneck.
    > 2) Efficiency issues: The crawlers have to be fast, robust, and as
    > memory-efficient as possible. I am running all of my crawlers on
    > cheap PCs with about 500 MB of RAM and P3 to P4 processors.
    > 3) Compatibility issues: Most of these crawlers will run on Unix
    > (FreeBSD), so there should exist a pretty good compiler that can
    > optimize my code under these environments.


    You should rethink your requirements. You expect to be I/O bound, so why do
    you require a good "compiler"? Especially when asking about two interpreted
    languages...

    Consider using lxml (with Python); it has pretty much everything you need
    for a web crawler, supports threaded parsing directly from HTTP URLs, and
    is plenty fast and fairly memory efficient.

    http://codespeak.net/lxml/
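
    As a rough illustration (an untested sketch, with a placeholder URL),
    lxml.html can parse straight from an HTTP URL and hand you the links:

    import lxml.html

    # parse() accepts a URL directly; getroot() gives the document element.
    doc = lxml.html.parse('http://example.com/').getroot()
    doc.make_links_absolute('http://example.com/')

    # iterlinks() yields (element, attribute, link, pos) for every link.
    for element, attribute, link, pos in doc.iterlinks():
        print link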

    Stefan
    Stefan Behnel, Jun 9, 2008
    #3
  4. subeen wrote:
    > can use urllib2 module and/or beautiful soup for developing crawler


    Not if you care about a) speed and/or b) memory efficiency.

    http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    Stefan
    Stefan Behnel, Jun 9, 2008
    #4
  5. subeen Guest

    On Jun 10, 12:15 am, Stefan Behnel <> wrote:
    > Not if you care about a) speed and/or b) memory efficiency.
    >
    > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    ya, Beautiful Soup is slower, so it's better to use urllib2 for
    fetching data and regular expressions for parsing data.

    regards,
    Subeen.
    http://love-python.blogspot.com/
    subeen, Jun 9, 2008
    #5
  6. Ray Cote Guest

    At 11:21 AM -0700 6/9/08, subeen wrote:
    >On Jun 10, 12:15 am, Stefan Behnel <> wrote:
    >> subeen wrote:
    >> > can use urllib2 module and/or beautiful soup for developing crawler

    >>
    >> Not if you care about a) speed and/or b) memory efficiency.
    >>
    >> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
    >>
    >> Stefan

    >
    >ya, Beautiful Soup is slower, so it's better to use urllib2 for
    >fetching data and regular expressions for parsing data.
    >
    >regards,
    >Subeen.
    >http://love-python.blogspot.com/


    Beautiful Soup is a bit slower, but it will actually parse some of
    the bizarre HTML you'll download off the web. We've written a couple
    of crawlers to run over specific clients' sites (I note, we did _not_
    create the content on these sites).

    Expect to find HTML code that looks like this:

    <ul>
    <li>
    <form>
    </li>
    </form>
    </ul>
    [from a real example, and yes, it did indeed render in IE.]

    I don't know if some of the quicker parsers discussed require
    well-formed HTML since I've not used them. You may want to consider
    using one of the quicker HTML parsers and, when they throw a fit on
    the downloaded HTML, drop back to Beautiful Soup -- which usually
    gets _something_ useful off the page.
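
    Something like the following captures that idea (a rough, untested
    sketch with a made-up helper name): try the fast parser first and drop
    back to Beautiful Soup when it fails.

    import lxml.html
    from BeautifulSoup import BeautifulSoup

    def extract_links(html):
        try:
            # Fast path: lxml handles well-formed and most broken markup.
            doc = lxml.html.fromstring(html)
            return [link for _, _, link, _ in doc.iterlinks()]
        except Exception:
            # Fallback: Beautiful Soup usually salvages something useful.
            soup = BeautifulSoup(html)
            return [a['href'] for a in soup.findAll('a', href=True)]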

    --Ray

    --

    Raymond Cote
    Appropriate Solutions, Inc.
    PO Box 458 ~ Peterborough, NH 03458-0458
    Phone: 603.924.6079 ~ Fax: 603.924.8668
    rgacote(at)AppropriateSolutions.com
    www.AppropriateSolutions.com
    Ray Cote, Jun 9, 2008
    #6
  7. subeen <> wrote on Monday, 09 June 2008, 20:21:

    > On Jun 10, 12:15 am, Stefan Behnel <> wrote:
    >> subeen wrote:
    >> > can use urllib2 module and/or beautiful soup for developing crawler

    >>
    >> Not if you care about a) speed and/or b) memory efficiency.
    >>
    >> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
    >>
    >> Stefan

    >
    > ya, beautiful soup is slower. so it's better to use urllib2 for
    > fetching data and regular expressions for parsing data.


    BeautifulSoup is implemented on top of regular expressions. I doubt that
    you can achieve a great performance gain by using plain regular
    expressions, and even if you could, the gain is certainly not worth the
    effort. Parsing markup with regular expressions is hard, and the result
    will most likely not be as fast or as memory-efficient as lxml.html.

    I personally am absolutely happy with lxml.html. It's fast, memory
    efficient, yet powerful and easy to use.
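
    For comparison, a naive regex extractor of the kind suggested above
    might look like this (an illustrative sketch only); it already misses
    single-quoted, unquoted, and multi-line href attributes, which is
    exactly the problem with parsing markup this way:

    import re

    HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

    def extract_links(html):
        # Returns only the double-quoted href values it happens to match.
        return HREF_RE.findall(html)

    print extract_links('<a href="http://example.com/">example</a>')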

    --
    Freedom is always the freedom of dissenters.
    (Rosa Luxemburg)
    Sebastian \lunar\ Wiesner, Jun 9, 2008
    #7
  8. On Jun 9, 1:48 pm, disappearedng@gmail.com wrote:

    > Hi all,
    > I am currently planning to write my own web crawler. I know Python but
    > not Perl, and I am interested in knowing which of these two is the
    > better choice given the following scenario:
    >
    > 1) I/O issues: my biggest resource constraint will be the bandwidth
    > bottleneck.
    > 2) Efficiency issues: The crawlers have to be fast, robust, and as
    > memory-efficient as possible. I am running all of my crawlers on
    > cheap PCs with about 500 MB of RAM and P3 to P4 processors.
    > 3) Compatibility issues: Most of these crawlers will run on Unix
    > (FreeBSD), so there should exist a pretty good compiler that can
    > optimize my code under these environments.
    >
    > What are your opinions?


    You mentioned *what* you want but not *why*. If it's for a real-world
    production project, why reinvent a square wheel instead of using (or at
    least extending) an existing open-source crawler with years of
    development behind it? If it's a learning exercise, why bother about
    performance so early?

    In any case, since you said you know Python but not Perl, the choice
    is almost a no-brainer, unless you're looking for an excuse to learn
    Perl. In terms of performance they are comparable, and you can
    probably manage crawls on the order of 10-100K pages at best. For
    million-page or larger crawls, though, you'll have to resort to C/C++
    sooner or later.

    George
    George Sakkis, Jun 9, 2008
    #8
  9. Ray Cote wrote:
    > Beautiful Soup is a bit slower, but it will actually parse some of the
    > bizarre HTML you'll download off the web.

    [...]
    > I don't know if some of the quicker parsers discussed require
    > well-formed HTML since I've not used them. You may want to consider
    > using one of the quicker HTML parsers and, when they throw a fit on the
    > downloaded HTML, drop back to Beautiful Soup -- which usually gets
    > _something_ useful off the page.


    So does lxml.html. And if you still feel like needing BS once in a while,
    there's lxml.html.soupparser.

    http://codespeak.net/lxml/elementsoup.html
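
    A tiny, untested sketch of that combination, fed the kind of broken
    markup Ray quoted earlier:

    from lxml.html import soupparser, tostring

    broken = '<ul><li><form></li></form></ul>'
    # Parse with Beautiful Soup under the hood, but get lxml elements back.
    root = soupparser.fromstring(broken)
    print tostring(root)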

    Stefan
    Stefan Behnel, Jun 9, 2008
    #9
  10. Guest

    As to the why as opposed to the what: I am attempting to build a search
    engine right now that will crawl not just HTML but other content types
    too.

    I am open to learning, and I don't want to learn anything that doesn't
    really contribute to building my search engine for the moment. Hence I
    want to see whether learning Perl will be helpful for the later parts
    of my search engine.

    Victor
    disappearedng@gmail.com, Jun 10, 2008
    #10
  11. disappearedng@gmail.com wrote:
    > As to the why as opposed to the what: I am attempting to build a search
    > engine right now that will crawl not just HTML but other content types
    > too.
    >
    > I am open to learning, and I don't want to learn anything that doesn't
    > really contribute to building my search engine for the moment. Hence I
    > want to see whether learning Perl will be helpful for the later parts
    > of my search engine.


    I honestly don't think there's anything useful in Perl that you can't do
    in Python. There are tons of ugly ways to write unreadable code in Perl,
    though, so if that's what you prefer, it's harder to do in Python.

    Stefan
    Stefan Behnel, Jun 10, 2008
    #11
  12. subeen Guest

    On Jun 13, 1:26 am, Chuck Rhode <> wrote:
    > On Mon, 09 Jun 2008 10:48:03 -0700, disappearedng wrote:
    > > I know Python but not Perl, and I am interested in knowing which of
    > > these two is the better choice.

    >
    > I'm partial to *Python*, but, the last time I looked, *urllib2* didn't
    > provide a time-out mechanism that worked under all circumstances. My
    > client-side scripts would usually hang when the server quit
    > responding, which happened a lot.
    >


    You can avoid the problem by setting a default timeout for all new
    sockets, which urllib2 will then pick up:

    import socket

    # Any socket opened after this call (including those urllib2 creates)
    # will time out after this many seconds instead of hanging forever.
    timeout = 300  # seconds
    socket.setdefaulttimeout(timeout)
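
    With that default in place, a hung server raises an exception instead
    of blocking forever, so the crawler can catch it and move on. Roughly
    (untested, placeholder URL):

    import socket
    import urllib2

    socket.setdefaulttimeout(30)  # seconds; value is illustrative
    try:
        data = urllib2.urlopen('http://example.com/slow-page').read()
    except (urllib2.URLError, socket.timeout):
        # The connect or read took too long (or failed); skip this URL.
        print 'request timed out or failed'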

    regards,
    Subeen.
    http://love-python.blogspot.com/
    subeen, Jun 22, 2008
    #12
