multithreaded web crawler

Discussion in 'Perl Misc' started by xuqy@jlu.edu.cn, Sep 22, 2005.

  1. Guest

    I am interested in "focused crawling" (crawling web pages on some
    specific topic and ignoring all the others) and have recently written a
    "focused crawler". Perl is a reasonable choice for writing a web
    crawler because of its LWP module and CPAN. However, when I planned to
    implement a multithreaded crawling strategy, I was confused by perl
    5.8.4's "threads" module, especially threads::shared. How is an object
    reference shared by multiple threads? I want to use the
    "Cache::File::Heap" module to sort the urls in the "crawling frontier" by
    a heuristic prediction of their "harvest outcome". Below is the relevant
    part of the code:

    #!/usr/bin/perl -w
    use strict;
    use threads;
    use threads::shared;
    use Cache::File::Heap;

    my $heap = Cache::File::Heap->new('frontier');
    my $heap_lock : shared = 0;

    ...
    sub go {    # crawling thread's control flow
        ....
        # extract the most promising url
        my ($value, $url);    # declared here so they outlive the lock block
        {
            lock $heap_lock;
            ($value, $url) = $heap->extract_minimum;
        }
        ...
        # after downloading and extracting hyperlinks
        {
            lock $heap_lock;
            $heap->add($value, $url);
        }
        ...
    }
    my @threads;
    for (1..10) { push @threads, threads->new(\&go); }
    for (@threads) { $_->join; }


    All is fine until all the threads have been joined by the main thread
    and the main thread exits. Then the following error message appears:

    Scalar leaks : -1
    Segmentation fault.

    My question is: how can I share an object reference (such as a
    Cache::File::Heap object)? Cache::File::Heap is a wrapper around
    BerkeleyDB's BTREE. Is BerkeleyDB thread-safe?
     
    Sep 22, 2005
    #1

  2. Guest

    wrote:
    ....
    >
    > All is fine until all the threads have been joined by the main thread
    > and the main thread exits. Then the following error message appears:
    >
    > Scalar leaks : -1
    > Segmentation fault.
    >
    > My question is: how can I share an object reference (such as a
    > Cache::File::Heap object)?


    There is no absolutely general way to safely do that. It depends on the
    implementation of the objects.
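
A minimal sketch of why the implementation matters, assuming a Perl built with ithreads. Unless a variable is marked `shared`, each new thread works on its own copy of it, so a plain object reference is not one object seen by all threads. The hash `$plain` and the counter `$shared_count` are hypothetical stand-ins and are not part of the original crawler.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical stand-in for an ordinary (unshared) object.
my $plain = { count => 0 };

# A shared scalar: every thread sees the same storage.
my $shared_count : shared = 0;

sub worker {
    $plain->{count}++;       # changes this thread's private copy only
    lock($shared_count);     # lock is released when the sub returns
    $shared_count++;         # changes the single shared variable
    return $plain->{count};  # what this thread saw in its own copy
}

my @workers = map { threads->create(\&worker) } 1 .. 4;
my @seen    = map { $_->join } @workers;

print "each worker saw:      @seen\n";
print "parent's plain count: $plain->{count}\n";
print "shared count:         $shared_count\n";
```

With four workers, every worker reports 1, the parent's plain count stays 0, and the shared count reaches 4. An object that wraps a C-level database handle presumably raises further problems beyond this copying behaviour, which is why no general recipe exists.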


    > Cache::File::Heap is a wrapper around BerkeleyDB's BTREE. Is
    > BerkeleyDB thread-safe?


    I'm pretty sure DB_File is not thread safe. Each thread should have its
    own handle, rather than all of them sharing one. Even then, it is not
    safe unless you take pains to make it so (at which point it is probably
    not all that fast anymore). If I were doing this, I'd probably use a
    MySQL table to implement the queue (again, with each thread having a
    separate handle).

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    Sep 22, 2005
    #2

  3. Guest

    I do appreciate your reply, but in focused crawling a file-based heap,
    rather than a FIFO queue, is indispensable. I have to share the
    Berkeley BTREE between crawling threads, but how can I share an object
    reference between threads?
    For example, here is a heap object reference:

    my $heap=Cache::File::Heap->new('heap');

    You said "each thread should have its own handle, rather than all of
    them sharing one." Can you give me a concrete example using the
    above reference?

    Is it like the following?

    my $heap=&share(Cache::File::Heap->new('heap'));

    wrote:

    > wrote:
    > ...
    > Each thread should have its own handle, rather than all of them sharing one.
     
    Sep 23, 2005
    #3
  4. Guest

    wrote:

    Please don't top post.

    > I do appreciate your reply, but in focused crawling, a file based heap
    > is indispensable rather than FIFO queue.


    I'm rather confused about your reply. I didn't propose a FIFO queue.

    > I have to share Berkeley-BTREE
    > between crawling threads, but how can I share an object reference
    > between threads?
    > For example, such is a heap object reference:
    >
    > my $heap=Cache::File::Heap->new('heap');
    >
    > you said "each thread should have its own handle, rather than all of
    > them sharing one.". Can you give me a concrete example just using the
    > above reference?


    After looking into it more, I am no longer so convinced that it is not
    thread safe and that each thread does need its own handle. It might be
    fine the way you had it.

    Anyway, each one having their own handle would be:

    sub go {
        my $heap=Cache::File::Heap->new('heap');
        .....
    };

    In this case, the sharing would be accomplished because all of them would
    be using the same file ('heap') rather than all using the same handle. But
    as I said, I'm no longer convinced that this is necessary.

    The error you got at the end of your code may be unrelated to the actual
    running of the code. I'd write a simplified but large and intense test
    case with verifiable output, and if you always get the right output on each
    of a few hundred runs, then I'd assume the error upon exit is not a
    problem.
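
A test case of that kind might look like the sketch below. It is only an illustration under stated assumptions: a lock-protected shared array `@frontier` stands in for the heap, and the thread count and item count are made-up values. A real test would drive the Cache::File::Heap handle instead and check that every url added comes back out exactly once.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

# Stand-in for the frontier (hypothetical): a shared array that is
# locked around every modification.
my @frontier : shared;

my $n_threads  = 8;       # assumed sizes, tune to taste
my $per_thread = 1000;

sub producer {
    my ($id) = @_;
    for my $i (1 .. $per_thread) {
        lock(@frontier);              # released at the end of each iteration
        push @frontier, "$id:$i";     # every item is unique by construction
    }
    return;
}

my @workers = map { threads->create(\&producer, $_) } 1 .. $n_threads;
$_->join for @workers;

# Verifiable output: the right number of items, none duplicated.
my %count;
$count{$_}++ for @frontier;
my $expected   = $n_threads * $per_thread;
my $duplicates = grep { $_ > 1 } values %count;

print "items: ", scalar(@frontier), " (expected $expected)\n";
print "duplicates: $duplicates\n";
```

Running it in a shell loop a few hundred times and comparing the two printed lines against the expected values gives the kind of evidence described above.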

    Xho

    Sep 23, 2005
    #4
  5. Guest

    wrote:
    > >
    > > my $heap=Cache::File::Heap->new('heap');
    > >
    > > you said "each thread should have its own handle, rather than all of
    > > them sharing one.". Can you give me a concrete example just using the
    > > above reference?

    >
    > After looking into it more, I am no longer so convinced that it is not
    > thread safe and that each thread does need its own handle. It might be
    > fine the way you had it.


    Well, after looking into it even more, it most definitely is not
    thread safe. I don't know what I was thinking. And I don't know what you
    were thinking when you said "All is fine until all the threads have been
    joined by the main thread and the main thread exits." As far as I can
    tell, there is no way that all was fine until then.

    > Anyway, each one having their own handle would be:
    >
    > sub go {
    > my $heap=Cache::File::Heap->new('heap');
    > .....
    > };
    >
    > In this case, the sharing would be accomplished because all of them would
    > be using the same file ('heap') rather than all using the same handle.
    > But as I said, I'm no longer convinced that this is necessary.


    I am again convinced that it is necessary, but I am sure that just doing
    that isn't sufficient.

    Again, I'd recommend using something like MySQL to provide this service to
    your threads.
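
A database server works here because one process owns the data and serializes every request. The same idea can be sketched inside a single Perl program, which is a different technique from the MySQL table suggested above: one manager thread owns the priority structure and the workers talk to it through Thread::Queue. Everything in this sketch is an assumption made for illustration, including the tab-separated message format, the `manager` and `worker` subs, the example urls, and the sorted array that stands in for Cache::File::Heap.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $requests = Thread::Queue->new();
my @replies  = map { Thread::Queue->new() } 0 .. 1;   # one reply queue per worker

# Only this thread touches the priority structure, so the structure
# itself does not have to be thread safe.
sub manager {
    my @heap;    # entries are [priority, url]; lowest priority value first
    while (defined(my $msg = $requests->dequeue)) {
        my ($op, $arg, $url) = split /\t/, $msg;
        last if $op eq 'quit';
        if ($op eq 'add') {             # $arg is the priority
            @heap = sort { $a->[0] <=> $b->[0] } @heap, [$arg, $url];
        }
        elsif ($op eq 'get') {          # $arg is the worker id
            my $best = shift @heap;
            $replies[$arg]->enqueue($best ? $best->[1] : '');
        }
    }
    return;
}

sub worker {
    my ($id) = @_;
    $requests->enqueue("get\t$id");     # ask for the best url
    return $replies[$id]->dequeue;      # block until the manager answers
}

my $mgr = threads->create(\&manager);

# Seed the frontier; the queue is FIFO, so these are handled before any "get".
$requests->enqueue("add\t0.9\thttp://example.com/low");
$requests->enqueue("add\t0.1\thttp://example.com/high");

my @workers = map { threads->create(\&worker, $_) } 0 .. 1;
my @got     = map { $_->join } @workers;

$requests->enqueue("quit");
$mgr->join;

print "fetched: $_\n" for sort @got;
```

Which worker receives which url depends on scheduling, but each url is handed out exactly once. The cost is that every heap operation becomes a round trip through the manager thread.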

    Xho

    Sep 23, 2005
    #5
