multithreaded web crawler


xuqy

I am interested in "focused crawling" (crawling web pages on a
specific topic and ignoring all the others) and have recently written a
"focused crawler". Perl is a reasonable choice for writing a web
crawler because of its LWP module and CPAN. However, when I planned to
implement a multithreaded crawling strategy, I was confused by perl
5.8.4's "threads" module, especially threads::shared. How is an object
reference shared by multiple threads? I want to use the
"Cache::File::Heap" module to sort the URLs in the "crawl frontier" by
a heuristic prediction of their "harvest outcome". Below is the relevant
part of the code:

#!/usr/bin/perl -w
use strict;
use threads;
use threads::shared;
use Cache::File::Heap;

my $heap = Cache::File::Heap->new('frontier');
my $heap_lock : shared = 0;

...
sub go {    # crawling thread's control flow
    ...
    my ($value, $url);
    # extract the most promising URL
    {
        lock $heap_lock;
        ($value, $url) = $heap->extract_minimum;
    }
    ...
    # after downloading and extracting hyperlinks
    {
        lock $heap_lock;
        $heap->add($value, $url);
    }
    ...
}
my @threads;
for (1 .. 10) { push @threads, threads->new(\&go); }
for (@threads) { $_->join; }


Everything is fine until all the threads have been joined and the main
thread exits. Then the following error message appears:

Scalar leaks : -1
Segmentation fault.

My question is: how do I share an object reference (such as a
Cache::File::Heap) between threads? Cache::File::Heap is a wrapper
around BerkeleyDB's BTREE; is BerkeleyDB thread-safe?
 

xhoster

> Everything is fine until all the threads have been joined and the main
> thread exits. Then the following error message appears:
>
> Scalar leaks : -1
> Segmentation fault.
>
> My question is: how do I share an object reference (such as a
> Cache::File::Heap) between threads?

There is no fully general way to do that safely. It depends on the
implementation of the objects.

> Cache::File::Heap is a wrapper around BerkeleyDB's BTREE; is
> BerkeleyDB thread-safe?

I'm pretty sure DB_File is not thread-safe. Each thread should have its
own handle, rather than all of them sharing one. Even then, it is not
safe unless you take pains to make it so. (At which point, it is probably
not all that fast anymore.) If I were doing this, I'd probably use a
MySQL table to implement the queue. (Again, with each thread having a
separate handle.)
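[A minimal sketch of the usual core-module alternative, not from the thread itself: threads::shared can only share plain data, so rather than sharing the heap object, URLs can be passed between threads as plain strings through Thread::Queue, which is built for exactly this. Note that Thread::Queue is FIFO, so on its own it does not give the best-first ordering the heap provides.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# The queue is created before the threads, so every thread sees the
# same underlying shared structure; only plain scalars pass through it.
my $queue = Thread::Queue->new;

sub worker {
    # blocks until an item arrives; an undef item is the shutdown signal
    while (defined(my $url = $queue->dequeue)) {
        print "would fetch $url\n";    # fetch and process $url here
    }
}

my @workers = map { threads->new(\&worker) } 1 .. 4;

$queue->enqueue($_) for ('http://example.com/a', 'http://example.com/b');
$queue->enqueue(undef) for @workers;    # one sentinel per worker
$_->join for @workers;
```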

Xho
 

xuqy

I do appreciate your reply, but in focused crawling a file-based heap
is indispensable, rather than a FIFO queue. I have to share the Berkeley
BTREE between crawling threads, but how can I share an object reference
between threads?
For example, here is a heap object reference:

my $heap = Cache::File::Heap->new('heap');

You said "each thread should have its own handle, rather than all of
them sharing one." Can you give me a concrete example using the
above reference?

Is it the case below?

my $heap = &share(Cache::File::Heap->new('heap'));

(e-mail address removed) wrote:
 

xhoster

(e-mail address removed) wrote:

Please don't top post.

> I do appreciate your reply, but in focused crawling a file-based heap
> is indispensable, rather than a FIFO queue.

I'm rather confused by your reply. I didn't propose a FIFO queue.

> I have to share the Berkeley BTREE between crawling threads, but how
> can I share an object reference between threads?
> For example, here is a heap object reference:
>
> my $heap = Cache::File::Heap->new('heap');
>
> You said "each thread should have its own handle, rather than all of
> them sharing one." Can you give me a concrete example using the
> above reference?

After looking into it more, I am no longer so convinced that it is not
thread-safe and that each thread needs its own handle. It might be
fine the way you had it.

Anyway, each one having its own handle would be:

sub go {
    my $heap = Cache::File::Heap->new('heap');
    ...
}

In this case, the sharing would be accomplished because all of the
threads use the same file ('heap') rather than the same handle. But
as I said, I'm no longer convinced that this is necessary.
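[Putting the two pieces together, a fuller sketch of the per-thread-handle pattern might look like the following: the original post's shared lock scalar combined with each thread opening its own handle. This assumes Cache::File::Heap tolerates multiple handles on the same file, which, as the follow-up below concludes, is itself in doubt.]

```perl
#!/usr/bin/perl -w
use strict;
use threads;
use threads::shared;
use Cache::File::Heap;

my $heap_lock : shared = 0;   # a plain shared scalar; only this crosses threads

sub go {
    # each thread opens its own handle on the same underlying file,
    # so no object reference ever crosses a thread boundary
    my $heap = Cache::File::Heap->new('frontier');
    my ($value, $url);
    {
        lock $heap_lock;
        ($value, $url) = $heap->extract_minimum;
    }
    ...   # download $url, extract and score hyperlinks
    {
        lock $heap_lock;
        $heap->add($value, $url);
    }
}

my @threads = map { threads->new(\&go) } 1 .. 10;
$_->join for @threads;
```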

The error you got at the end may be incidental to the actual running of
the code. I'd write a simplified but large and intense test case with
verifiable output, and if you always get the right output on each of a
few hundred runs, then I'd assume the error on exit is not a problem.

Xho
 

xhoster

> After looking into it more, I am no longer so convinced that it is not
> thread-safe and that each thread needs its own handle. It might be
> fine the way you had it.

Well, after looking into it even more, it most definitely is not
thread-safe. I don't know what I was thinking. And I don't know what you
were thinking when you said "Everything is fine until all the threads
have been joined and the main thread exits." As far as I can tell, there
is no way that all was fine until then.
> Anyway, each one having its own handle would be:
>
> sub go {
>     my $heap = Cache::File::Heap->new('heap');
>     ...
> }
>
> In this case, the sharing would be accomplished because all of the
> threads use the same file ('heap') rather than the same handle. But
> as I said, I'm no longer convinced that this is necessary.

I am again convinced that it is necessary, but I am sure that just doing
that isn't sufficient.

Again, I'd recommend using something like MySQL to provide this service to
your threads.
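[A sketch of that suggestion, not from the thread itself: the table and column names below are made up, the connection details are placeholders, and an InnoDB table is assumed so that SELECT ... FOR UPDATE takes row locks. Each thread opens its own DBI handle (DBI handles must not be shared across threads) and the server does the locking.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use DBI;

# hypothetical schema:
#   CREATE TABLE frontier (url VARCHAR(255) PRIMARY KEY,
#                          value DOUBLE, INDEX (value)) ENGINE=InnoDB;

sub go {
    # one connection per thread
    my $dbh = DBI->connect('DBI:mysql:database=crawler', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });
    while (1) {
        # take the most promising URL; FOR UPDATE makes the pop atomic
        my ($url, $value) = $dbh->selectrow_array(
            'SELECT url, value FROM frontier ORDER BY value LIMIT 1 FOR UPDATE');
        last unless defined $url;
        $dbh->do('DELETE FROM frontier WHERE url = ?', undef, $url);
        $dbh->commit;
        ...   # download $url, score new links, INSERT them back
    }
    $dbh->disconnect;
}

my @threads = map { threads->new(\&go) } 1 .. 10;
$_->join for @threads;
```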

Xho
 
