multithreaded web crawler


xuqy

I am interested in "focused crawling" (crawling web pages on a
specific topic and ignoring all the others) and have recently written a
"focused crawler". Perl is a reasonable choice for writing a web
crawler because of its LWP module and CPAN. However, when I planned to
implement a multithreaded crawling strategy, I was confused by perl
5.8.4's "threads" module, especially threads::shared. How is an object
reference shared by multiple threads? I want to use the
"Cache::File::Heap" module to sort the URLs in the "crawl frontier" by
a heuristic prediction of their "harvest outcome". Below is the relevant
part of the code:

#!/usr/bin/perl -w
use strict;
use threads;
use threads::shared;
use Cache::File::Heap;

my $heap = Cache::File::Heap->new('frontier');
my $heap_lock : shared = 0;

...
sub go {    # crawling thread's control flow
    ...
    my ($value, $url);
    # extract the most promising URL
    {
        lock $heap_lock;
        ($value, $url) = $heap->extract_minimum;
    }
    ...
    # after downloading and extracting hyperlinks
    {
        lock $heap_lock;
        $heap->add($value, $url);
    }
    ...
}
my @threads;
for (1 .. 10) { push @threads, threads->new(\&go); }
for (@threads) { $_->join; }


Everything is fine until all the threads have been joined and the main
thread exits. Then the following error message appears:

Scalar leaks : -1
Segmentation fault.

My question is: how do I share an object reference (such as a
Cache::File::Heap) between threads? Cache::File::Heap is a wrapper
around BerkeleyDB's BTREE; is BerkeleyDB thread-safe?
 

xhoster

> Everything is fine until all the threads have been joined and the main
> thread exits. Then the following error message appears:
>
> Scalar leaks : -1
> Segmentation fault.
>
> My question is: how do I share an object reference (such as a
> Cache::File::Heap) between threads?

There is no fully general way to do that safely. It depends on the
implementation of the objects.

> Cache::File::Heap is a wrapper around BerkeleyDB's BTREE; is
> BerkeleyDB thread-safe?

I'm pretty sure DB_File is not thread-safe. Each thread should have its
own handle, rather than all of them sharing one. Even then, it is not
safe unless you take pains to make it so. (At which point, it is probably
not all that fast anymore.) If I were doing this, I'd probably use a
MySQL table to implement the queue. (Again, with each thread having a
separate handle.)
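[A minimal sketch of the usual core-module alternative, not from the thread itself: threads::shared can only share plain data, so rather than sharing the heap object, URLs can be passed between threads as plain strings through Thread::Queue, which is built for exactly this. Note that Thread::Queue is FIFO, so on its own it does not give the best-first ordering the heap provides.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# The queue is created before the threads, so every thread sees the
# same underlying shared structure; only plain scalars pass through it.
my $queue = Thread::Queue->new;

sub worker {
    # blocks until an item arrives; an undef item is the shutdown signal
    while (defined(my $url = $queue->dequeue)) {
        print "would fetch $url\n";    # fetch and process $url here
    }
}

my @workers = map { threads->new(\&worker) } 1 .. 4;

$queue->enqueue($_) for ('http://example.com/a', 'http://example.com/b');
$queue->enqueue(undef) for @workers;    # one sentinel per worker
$_->join for @workers;
```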

Xho
 

xuqy

I do appreciate your reply, but in focused crawling a file-based heap
is indispensable, rather than a FIFO queue. I have to share the Berkeley
BTREE between crawling threads, but how can I share an object reference
between threads?
For example, here is a heap object reference:

my $heap = Cache::File::Heap->new('heap');

You said "each thread should have its own handle, rather than all of
them sharing one." Can you give me a concrete example using the
above reference?

Is it the case below?

my $heap = &share(Cache::File::Heap->new('heap'));

(e-mail address removed) wrote:
 

xhoster

(e-mail address removed) wrote:

Please don't top post.

> I do appreciate your reply, but in focused crawling a file-based heap
> is indispensable, rather than a FIFO queue.

I'm rather confused by your reply. I didn't propose a FIFO queue.

> I have to share the Berkeley BTREE between crawling threads, but how
> can I share an object reference between threads?
> For example, here is a heap object reference:
>
> my $heap = Cache::File::Heap->new('heap');
>
> You said "each thread should have its own handle, rather than all of
> them sharing one." Can you give me a concrete example using the
> above reference?

After looking into it more, I am no longer so convinced that it is not
thread-safe and that each thread needs its own handle. It might be
fine the way you had it.

Anyway, each one having its own handle would be:

sub go {
    my $heap = Cache::File::Heap->new('heap');
    ...
}

In this case, the sharing would be accomplished because all of the
threads use the same file ('heap') rather than the same handle. But
as I said, I'm no longer convinced that this is necessary.
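[Putting the two pieces together, a fuller sketch of the per-thread-handle pattern might look like the following: the original post's shared lock scalar combined with each thread opening its own handle. This assumes Cache::File::Heap tolerates multiple handles on the same file, which, as the follow-up below concludes, is itself in doubt.]

```perl
#!/usr/bin/perl -w
use strict;
use threads;
use threads::shared;
use Cache::File::Heap;

my $heap_lock : shared = 0;   # a plain shared scalar; only this crosses threads

sub go {
    # each thread opens its own handle on the same underlying file,
    # so no object reference ever crosses a thread boundary
    my $heap = Cache::File::Heap->new('frontier');
    my ($value, $url);
    {
        lock $heap_lock;
        ($value, $url) = $heap->extract_minimum;
    }
    ...   # download $url, extract and score hyperlinks
    {
        lock $heap_lock;
        $heap->add($value, $url);
    }
}

my @threads = map { threads->new(\&go) } 1 .. 10;
$_->join for @threads;
```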

The error you got at the end may be incidental to the actual running of
the code. I'd write a simplified but large and intense test case with
verifiable output, and if you always get the right output on each of a
few hundred runs, then I'd assume the error on exit is not a problem.

Xho
 

xhoster

> After looking into it more, I am no longer so convinced that it is not
> thread-safe and that each thread needs its own handle. It might be
> fine the way you had it.

Well, after looking into it even more, it most definitely is not
thread-safe. I don't know what I was thinking. And I don't know what you
were thinking when you said "Everything is fine until all the threads
have been joined and the main thread exits." As far as I can tell, there
is no way that all was fine until then.
> Anyway, each one having its own handle would be:
>
> sub go {
>     my $heap = Cache::File::Heap->new('heap');
>     ...
> }
>
> In this case, the sharing would be accomplished because all of the
> threads use the same file ('heap') rather than the same handle. But
> as I said, I'm no longer convinced that this is necessary.

I am again convinced that it is necessary, but I am sure that just doing
that isn't sufficient.

Again, I'd recommend using something like MySQL to provide this service to
your threads.
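[A sketch of that suggestion, not from the thread itself: the table and column names below are made up, the connection details are placeholders, and an InnoDB table is assumed so that SELECT ... FOR UPDATE takes row locks. Each thread opens its own DBI handle (DBI handles must not be shared across threads) and the server does the locking.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use DBI;

# hypothetical schema:
#   CREATE TABLE frontier (url VARCHAR(255) PRIMARY KEY,
#                          value DOUBLE, INDEX (value)) ENGINE=InnoDB;

sub go {
    # one connection per thread
    my $dbh = DBI->connect('DBI:mysql:database=crawler', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });
    while (1) {
        # take the most promising URL; FOR UPDATE makes the pop atomic
        my ($url, $value) = $dbh->selectrow_array(
            'SELECT url, value FROM frontier ORDER BY value LIMIT 1 FOR UPDATE');
        last unless defined $url;
        $dbh->do('DELETE FROM frontier WHERE url = ?', undef, $url);
        $dbh->commit;
        ...   # download $url, score new links, INSERT them back
    }
    $dbh->disconnect;
}

my @threads = map { threads->new(\&go) } 1 .. 10;
$_->join for @threads;
```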

Xho
 
