web crawler in python or C?

Discussion in 'Python' started by abhinav, Feb 16, 2006.

  1. abhinav

    abhinav Guest

    Hi guys. I have to implement a topical crawler as part of my
    project. What language should I implement it in,
    C or Python? Python has a fast development cycle, but my concern is
    also speed. I want to strike a balance between development speed and
    crawler speed. Since Python is an interpreted language it is rather
    slow. The crawler, which will be working on a huge set of pages, should be
    as fast as possible. One possible implementation would be doing it
    partly in C and partly in Python so that I can have the best of both
    worlds. But I don't know how to approach that. Can anyone guide me on
    which parts should be implemented in C and which in Python?
     
    abhinav, Feb 16, 2006
    #1

  2. Paul Rubin

    Paul Rubin Guest

    "abhinav" <> writes:
    > The crawler, which will be working on a huge set of pages, should be
    > as fast as possible.


    What kind of network connection do you have, that's fast enough
    that even a fairly cpu-inefficient crawler won't saturate it?
     
    Paul Rubin, Feb 16, 2006
    #2

  3. abhinav

    abhinav Guest

    It is DSL broadband, 128 kbps. But that's not the point. What I am asking
    is whether Python would be fine for implementing fast crawler algorithms,
    or whether I should use C. Handling huge data, multithreading, file
    handling, heuristics for ranking, and maintaining huge data
    structures: which language should I choose so as not to compromise too
    much on speed? How does the performance of Python-based crawlers compare
    with C-based crawlers? Should I use both languages (partly C and partly
    Python)? How should I decide which parts to implement in C and which
    in Python?
    Please guide me. Thanks.
     
    abhinav, Feb 16, 2006
    #3
  4. Fuzzyman

    Fuzzyman Guest

    abhinav wrote:
    > It is DSL broadband, 128 kbps. But that's not the point. What I am asking
    > is whether Python would be fine for implementing fast crawler algorithms,
    > or whether I should use C.


    But a web crawler is going to be *mainly* I/O bound - so language
    efficiency won't be the main issue. There are several web crawlers
    implemented in Python.

    > Handling huge data, multithreading, file
    > handling, heuristics for ranking, and maintaining huge data
    > structures: which language should I choose so as not to compromise too
    > much on speed? How does the performance of Python-based crawlers compare
    > with C-based crawlers? Should I use both languages (partly C and partly Python)? How


    If your data processing requirements are fairly heavy you will
    *probably* get a speed advantage coding them in C and accessing them
    from Python.

    The usual advice (which seems to be applicable to you) is to
    prototype in Python (which will be much more fun than in C), then test.

    Profile to find your real bottlenecks (if the Python version isn't fast
    enough - which it may well be), and move those bottlenecks to C.
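
    A minimal sketch of that profile-first workflow (illustrative only; the
    crawl() function and its URL list below are placeholders), using the
    standard library's cProfile and pstats modules to show where a prototype
    actually spends its time:

    import cProfile
    import pstats

    def crawl(urls):
        # Placeholder for the real crawl loop: fetch, parse, rank, store.
        for url in urls:
            pass

    profiler = cProfile.Profile()
    profiler.runcall(crawl, ["http://example.com/"])
    # Show the 10 functions with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)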

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml

    > should I decide which parts to implement in C and which
    > in Python?
    > Please guide me. Thanks.
     
    Fuzzyman, Feb 16, 2006
    #4
  5. Paul Rubin

    Paul Rubin Guest

    "abhinav" <> writes:
    > It is DSL broadband, 128 kbps. But that's not the point.


    But it is the point.

    > What I am asking is whether Python would be fine for implementing fast
    > crawler algorithms, or whether I should use C. Handling huge
    > data, multithreading, file handling, heuristics for ranking, and
    > maintaining huge data structures: which language should I choose so as
    > not to compromise too much on speed? How does the performance of
    > Python-based crawlers compare with C-based crawlers? Should I use both
    > languages (partly C and partly Python)? How should I decide which parts
    > to implement in C and which in Python? Please guide
    > me. Thanks.


    I think if you don't know how to answer these questions for yourself,
    you're not ready to take on projects of that complexity. My advice
    is start in Python since development will be much easier. If and when
    you start hitting performance problems, you'll have to examine many
    combinations of tactics for dealing with them, and switching languages
    is just one such tactic.
     
    Paul Rubin, Feb 16, 2006
    #5
  6. gene tani

    gene tani Guest

    Paul Rubin wrote:
    > "abhinav" <> writes:


    > > maintaining huge data structures: which language should I choose so as
    > > not to compromise too much on speed? How does the performance of
    > > Python-based crawlers compare with C-based crawlers? Should I use both
    > > languages (partly C and partly Python)? How should I decide which parts
    > > to implement in C and which in Python? Please guide
    > > me. Thanks.

    >
    > I think if you don't know how to answer these questions for yourself,
    > you're not ready to take on projects of that complexity. My advice
    > is start in Python since development will be much easier. If and when
    > you start hitting performance problems, you'll have to examine many
    > combinations of tactics for dealing with them, and switching languages
    > is just one such tactic.


    There's another potential bottleneck: parsing HTML and extracting the
    text you want, especially when you hit pages that don't meet the HTML 4 or
    XHTML spec.
    http://sig.levillage.org/?p=599

    Paul's advice is very sound, given what little info you've provided.

    http://trific.ath.cx/resources/python/optimization/
    Also look at Psyco, Pyrex, Boost.Python, SWIG, and ctypes for bridging C
    and Python - you have a lot of options. And check out HarvestMan,
    mechanize, and other existing libraries.
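
    For the tag-soup problem, here is a minimal sketch (an illustration, not
    taken from any of the libraries above) of link extraction with the
    standard library's HTMLParser (html.parser in today's Python), which
    tolerates a fair amount of sloppy markup; badly broken pages may still
    need a more forgiving parser:

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # Collect href attributes from anchor tags.
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    extractor = LinkExtractor()
    extractor.feed('<p>unclosed paragraph <a href="http://example.com">link</a>')
    print(extractor.links)   # ['http://example.com']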
     
    gene tani, Feb 16, 2006
    #6
  7. gene tani

    gene tani Guest

    abhinav wrote:
    > Hi guys. I have to implement a topical crawler as part of my
    > project. What language should I implement it in


    Oh, and there are some really good books out there, besides the O'Reilly
    Spidering Hacks. Springer-Verlag has a couple of books on "Text Mining"
    and at least a couple with "web intelligence" in the title.
    Expensive but worth it.
     
    gene tani, Feb 16, 2006
    #7
  8. On 15 Feb 2006 21:56:52 -0800, abhinav <> wrote:
    > Hi guys. I have to implement a topical crawler as part of my
    > project. What language should I implement it in,
    > C or Python?


    Why does this keep coming up on here as of late? If you search the
    archives, you can find numerous posts about spiders. One interesting
    fact is that Google itself started with spiders written in Python.
    http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work
    for you.



    --
    Andrew Gwozdziewycz <>
    http://ihadagreatview.org
    http://plasticandroid.org
     
    Andrew Gwozdziewycz, Feb 16, 2006
    #8
  9. On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:

    > Hi guys. I have to implement a topical crawler as part of my
    > project. What language should I implement it in,
    > C or Python? Python has a fast development cycle, but my concern is
    > also speed. I want to strike a balance between development speed and
    > crawler speed. Since Python is an interpreted language it is rather
    > slow.


    Python is no more interpreted than Java. Like Java, it is compiled to
    byte-code. Unlike Java, it doesn't take three weeks to start the runtime
    environment. (Okay, maybe it just *seems* like three weeks.)

    The nice clean distinctions between "compiled" and "interpreted" languages
    haven't existed in most serious programming languages for a decade or
    more. In these days of tokenizers and byte-code compilers and processors
    emulating other processors, the difference is more of degree than kind.
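
    A two-line illustration of the byte-code point (added here, not part of
    the original post): the standard library's dis module shows the byte-code
    any Python function is compiled to.

    import dis

    def greet(name):
        return "Hello, " + name

    dis.dis(greet)   # prints the byte-code instructions for greet()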

    It is true that standard Python doesn't compile to platform-dependent
    machine code, but that is rarely an issue, since the bottleneck for most
    applications is I/O or human interaction, not language speed. And for
    those cases where it is a problem, there are solutions, like Psyco.

    After all, it is almost never true that your code must run as fast as
    physically possible. That's called "over-engineering". It just needs to
    run as fast as needed, that's all. And that's a much simpler problem to
    solve cheaply.



    > The crawler, which will be working on a huge set of pages, should be
    > as fast as possible.


    Web crawler performance is almost certainly going to be I/O bound. Sounds
    to me like you are guilty of trying to optimize your code before even
    writing a single line of code. What you call "huge" may not be huge to
    your computer. Have you tried? The great thing about Python is you can
    write a prototype in maybe a tenth the time it would take you to do the
    same thing in C. Instead of trying to guess what the performance
    bottlenecks will be, you can write your code and profile it and find the
    bottlenecks with accuracy.
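
    To make the "write a prototype quickly, then measure" point concrete,
    here is a bare-bones sketch (illustrative only; the seed URL and page
    limit are made up) of the kind of fetch loop Python lets you throw
    together in minutes:

    import urllib.request
    from collections import deque

    def fetch(url):
        # Return the page body, or None if the request fails.
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except OSError:
            return None

    def crawl(seed, max_pages=10):
        queue = deque([seed])
        seen = set()
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)
            if page is None:
                continue
            # A real crawler would parse the page and enqueue discovered links here.
        return seen

    print(crawl("http://example.com/"))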


    > One possible implementation would be doing it
    > partly in C and partly in Python so that I can have the best of both
    > worlds.


    Sure you can do that, if you need to.

    > But I don't know how to approach that. Can anyone guide me on
    > which parts should be implemented in C and which in Python?


    Yes. Write it all in Python. Test it, debug it, get it working.

    Once it is working, and not before, rigorously profile it. You may find it
    is fast enough.

    If it is not fast enough, find the bottlenecks. Replace them with better
    algorithms. We had an example on comp.lang.python just a day or two ago
    where a function which was taking hours to complete was re-written with a
    better algorithm which took only seconds. And still in Python.

    If it is still too slow after using better algorithms, or if there are no
    better algorithms, then and only then re-write those bottlenecks in C for
    speed.



    --
    Steven.
     
    Steven D'Aprano, Feb 16, 2006
    #9
  10. Steve Holden

    Steve Holden Guest

    abhinav wrote:
    > Hi guys. I have to implement a topical crawler as part of my
    > project. What language should I implement it in,
    > C or Python? Python has a fast development cycle, but my concern is
    > also speed. I want to strike a balance between development speed and
    > crawler speed. Since Python is an interpreted language it is rather
    > slow. The crawler, which will be working on a huge set of pages, should be
    > as fast as possible. One possible implementation would be doing it
    > partly in C and partly in Python so that I can have the best of both
    > worlds. But I don't know how to approach that. Can anyone guide me on
    > which parts should be implemented in C and which in Python?
    >

    Get real. Any web crawler is bound to spend huge amounts of its time
    waiting for data to come in over network pipes. Or do you have plans for
    massive parallelism previously unheard of in the Python world?

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
     
    Steve Holden, Feb 17, 2006
    #10
  11. Ravi Teja

    Ravi Teja Guest

    This is following the pattern of your previous post on language choice
    wrt. writing a mail server. It is very common for beginners to
    overemphasize performance requirements, size of the executable, etc. More
    is always good, right? Yes! But at what cost?

    The rule of thumb for all your Python vs. C questions is ...
    1.) Choose Python by default.
    2.) If your program is slow, it's your algorithm that you need to check
    first. Strictly speaking, Python will be slower because of its dynamism.
    However, most of what is performance critical in Python is already
    implemented in C, and the speed of well-written Python programs with
    properly chosen extensions and algorithms is not far off.
    3.) Remember that you can always drop back to C wherever you need to
    without throwing away all of your code (see the ctypes sketch below). And
    even if you had to, Python is very valuable as a prototyping tool since it
    is very agile. You will have figured out what you need to do by then, so
    rewriting it in C will only take a fraction of the time it would have
    taken to write it in C directly.
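
    A minimal ctypes sketch of point 3 (illustrative only; sqrt() from the
    system math library stands in for whatever hot spot you might move to C
    -- your own routine compiled into a shared library would be loaded the
    same way):

    import ctypes
    import ctypes.util

    # Load the C math library; a hand-written ranking routine compiled to a
    # shared library would be loaded in the same fashion.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double

    print(libm.sqrt(2.0))   # 1.4142135623730951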

    Don't even ask the question "is it fast enough?" until you have
    already written it in Python and it turns out that it is not
    running fast enough despite your code being correct. If that happens,
    you can fix it relatively easily. It is easy to write bad code in C,
    and poorly written C code performs worse than well-written Python
    code.

    Remember Donald Knuth's quote.
    "Premature optimization is the root of all evil in programming".

    C is a language intended to be used when you NEED tight control over
    memory allocation. It has few advantages in other scenarios. Don't
    abuse it by choosing it by default.
     
    Ravi Teja, Feb 17, 2006
    #11
  12. Ravi Teja <> wrote:
    ...
    > The rule of thumb for all your Python Vs C questions is ...
    > 1.) Choose Python by default.


    +1 QOTW!-)


    > 2.) If your program is slow, it's your algorithm that you need to check


    Seriously: yes, and (often even more importantly) data structure.

    However, often the most important tip, particularly for large-scale
    systems, is to consider your program's _architecture_ (algorithms are
    about details of computation; architecture is about partitioning systems
    into components, deciding where they are deployed, and so forth). At a
    generic and lowish level: are you, for example, creating a lot of
    threads, each for a small amount of work? Then consider reusing threads
    from a "worker threads" pool. Or maybe you could avoid threads and use
    event-driven programming; or, at the other extreme, have multiple
    processes communicating by TCP/IP so you can scale your system up to tens
    or hundreds of processors -- in the latter case, partitioning your system
    appropriately to minimize inter-process communication may be the key
    issue. Consider UDP, when you can afford missing a packet once in a
    while -- sometimes it may let you reduce overheads compared to TCP
    connections.
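
    One hedged sketch of the "worker threads pool" idea (an illustration, not
    from the post): a fixed number of threads pulling URLs off a shared
    queue, instead of one thread per URL.

    import queue
    import threading

    NUM_WORKERS = 8
    jobs = queue.Queue()

    def worker():
        while True:
            url = jobs.get()
            if url is None:        # sentinel: time to shut down
                break
            # fetching and processing of url would go here
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    for url in ["http://example.com/a", "http://example.com/b"]:
        jobs.put(url)

    jobs.join()                    # wait for all queued work to finish
    for _ in threads:
        jobs.put(None)             # one sentinel per worker
    for t in threads:
        t.join()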

    Database connections, and less importantly database cursors, are well
    worth reusing. What are you "caching", and what instead is getting
    recomputed over and over? It's possible to undercache (needless
    repeated computation) but also to overcache (tying up memory and causing
    paging). Are you making lots of system calls that you might be able to
    avoid? Each system call has a context-switching cost, after all...
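
    A small caching sketch along those lines (illustrative; hostname
    resolution stands in for any value that would otherwise be recomputed on
    every request), using functools.lru_cache so the cache cannot grow
    without bound:

    import socket
    from functools import lru_cache

    @lru_cache(maxsize=1024)
    def resolve(hostname):
        # Only the first call per hostname actually hits the resolver.
        return socket.gethostbyname(hostname)

    print(resolve("example.com"))
    print(resolve("example.com"))   # served from the cache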

    Any or all of these hints may be irrelevant to a specific category of
    applications, but then, so can the hint about algorithms be. One cool
    thing about Python is that it makes it easy and fast for you to try out
    different approaches (particularly to architecture, but to algorithms as
    well), even drastically different ones, when simple reasoning about the
    issues leaves you undecided and you need to settle them empirically.


    > Remember Donald Knuth's quote.
    > "Premature optimization is the root of all evil in programming".


    I believe Knuth himself said he was quoting Tony Hoare, and indeed
    referred to this as "Hoare's dictum".


    Alex
     
    Alex Martelli, Feb 17, 2006
    #12
  13. Magnus Lycka

    Magnus Lycka Guest

    abhinav wrote:
    > I want to strike a balance between development speed and crawler speed.


    "The best performance improvement is the transition from the
    nonworking state to the working state." - John Ousterhout

    Try to get there as soon as possible. You can figure out what
    that means. ;^)

    When you do all your programming in Python, most of the code that
    is relevant for speed *is* written in C already. If performance
    is slow, measure! Use the profiler to see if you are spending a
    lot of time in Python code. If that is your problem, take a close
    look at your algorithms and perhaps your data structures and see
    what you can improve with Python. In the long run, going from
    e.g. O(n^2) to O(n log n) might mean much more than going from
    Python to C. A poor algorithm in machine code still sucks when you
    have to handle enough data. Changing your code to improve its
    algorithms and structure is a lot easier in Python than in C.
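
    A toy measurement of that point (illustrative numbers only): checking
    "have I seen this URL?" against a list is O(n) per lookup, which turns a
    crawl into O(n^2), while a set lookup is roughly constant time.

    import timeit

    urls = ["http://example.com/page%d" % i for i in range(20000)]
    as_list = list(urls)
    as_set = set(urls)

    probe = "http://example.com/page19999"   # worst case for the list
    print(timeit.timeit(lambda: probe in as_list, number=1000))
    print(timeit.timeit(lambda: probe in as_set, number=1000))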

    If you've done all these things, still have performance problems,
    and have identified a bottleneck in your Python code, it might
    be time to get that piece rewritten in C. The easiest and least
    intrusive way to do that might be with Pyrex. You might also want
    to try Psyco before you do this.

    Even if you end up writing a whole program in C, it's not unlikely
    that you will get to your goal faster if your first version is
    written in Python.

    Good luck!

    P.S. Why someone would want to write yet another web crawler is
    a puzzle to me. Surely there are plenty of good ideas that haven't
    been properly implemented yet! It's probably very difficult to
    beat Google on their home turf now, but I'd really like to see
    a good tool to manage all that information I got from the net,
    or through mail or wrote myself. I don't think they've written that
    yet--although I'm sure they are trying.
     
    Magnus Lycka, Feb 20, 2006
    #13
  14. abhinav

    Guest

    I think something that may be even more important to consider than just
    the pure speed of your program is ease of design, as well as the
    overall stability of your code.

    My opinion would be that writing in Python would have many benefits
    over the speed gains of using C. For instance, your crawler will have to
    handle all types of input from all over the web. Who can say what kinds
    of malformed or poorly written data it will come across. I think it
    would be easier to create a system to handle this type of data in
    Python than in C.

    I don't want to pigeon-hole your project, but if it is for any use
    other than a commercial product, I would say speed would be a concern
    lower on the list than accuracy or time to develop. As others have
    pointed out, if you hit many performance barriers, chances are the
    problem is the algorithm and not Python itself.

    I wish you luck and hope you will experiment in Python first. If your
    crawler is still not up to par, at the very least you might come up
    with some ideas for how Python could be improved.
     
    , Feb 20, 2006
    #14
