only once in storage


a

Hi,

I am writing a script to download all the links of a whole web site. The link
structure of the site is not a simple tree; there may be duplicate links
pointing to the same location.

So I need to walk through the site, extract the URLs from each page, and push
them into a data structure. I don't want the duplicate links: every link
should appear only once in my storage.

So, is there an effective way to achieve this?

Thanks
 

Gunnar Hjalmarsson

a said:
I am writing a script to download all the links of a whole web site. The link
structure of the site is not a simple tree; there may be duplicate links
pointing to the same location.

So I need to walk through the site, extract the URLs from each page, and push
them into a data structure. I don't want the duplicate links: every link
should appear only once in my storage.

So, is there an effective way to achieve this?

Use a hash.
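
A minimal sketch of that idea, assuming the links have already been gathered
into a list (@all_links is just a made-up name): the hash records what has
been seen, so the grep keeps only the first occurrence of each URL.

my %seen;
my @unique_links = grep { !$seen{$_}++ } @all_links;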
 

Peter Makholm

a said:
So I need to walk through the site, extract the URLs from each page, and push
them into a data structure. I don't want the duplicate links: every link
should appear only once in my storage.

You can use a hash to check whether a URL has been seen before:

my %seen;
my $url;

while ($url = getnext()) {
    # $seen{$url}++ is 0 (false) the first time a URL turns up,
    # so each URL is processed only once
    process_url($url) unless $seen{$url}++;
}

//Makholm
 

Josef Moellers

Peter said:
You can use a hash to check whether a URL has been seen before:

my %seen;
my $url;

while ($url = getnext()) {
process_url($url) unless $seen{$url}++;
}

I'd split that:

1. collect all URLs:
   $urls{getnext()} = 1;
2. process all unique URLs:
   process_url($_) foreach (keys %urls);
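
A sketch of that two-phase version, assuming getnext() returns undef once
there are no more URLs (getnext() and process_url() are the same placeholder
names used above):

my %urls;

# Phase 1: collect every URL; duplicates simply end up as the same hash key.
while (defined(my $url = getnext())) {
    $urls{$url} = 1;
}

# Phase 2: process each unique URL exactly once.
process_url($_) foreach keys %urls;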
 

Peter Makholm

Josef Moellers said:
I'd split that:

1. collect all URLs:
   $urls{getnext()} = 1;
2. process all unique URLs:
   process_url($_) foreach (keys %urls);

That would not work if getnext() returned an element from a work queue
and process_url() inserted URLs into the work queue based on the content
fetched from the URL.

I would probably do the check while inserting into the work queue. So in
the above example, getnext() would be part of extracting URLs from the
page content, and process_url() would do the insertion into the work queue.
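
Something along these lines, as a rough sketch: fetch() and extract_urls()
are made-up placeholders for the actual HTTP and link-extraction code, and
the check happens in enqueue(), so no URL enters the work queue twice.

use strict;
use warnings;

my %seen;    # every URL that has ever been queued
my @queue;   # URLs still waiting to be fetched

# Add a URL to the work queue unless it has been queued before.
sub enqueue {
    my ($url) = @_;
    push @queue, $url unless $seen{$url}++;
}

enqueue('http://www.example.com/');    # start page (made up)

while (my $url = shift @queue) {
    my $content = fetch($url);              # placeholder: fetch the page
    enqueue($_) for extract_urls($content); # placeholder: extract links, queue the new ones
}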

/Makholm
 

Josef Moellers

Peter said:
That would not work if getnext() returned an element from a work queue
and process_url() inserted URLs into the work queue based on the content
fetched from the URL.

I would probably do the check while inserting into the work queue. So in
the above example, getnext() would be part of extracting URLs from the
page content, and process_url() would do the insertion into the work queue.

Yes, sorry, I ignored the "not a simple tree" part.
 
