only once in storage


a

Hi,

I am writing a script to download all the links of a whole web site. The link
structure of the site is not a simple tree; there may be duplicate links
pointing to the same location.

So I need to walk through the site, extract the URLs from each page, and push
them into a data structure. I don't want the duplicate links: every link
should appear only once in my storage.

So, is there an effective way to achieve this?

Thanks
 

Gunnar Hjalmarsson

a said:
I am writing a script to download all the links of a whole web site. The link
structure of the site is not a simple tree; there may be duplicate links
pointing to the same location.

So I need to walk through the site, extract the URLs from each page, and push
them into a data structure. I don't want the duplicate links: every link
should appear only once in my storage.

So, is there an effective way to achieve this?

Use a hash.
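
A minimal sketch of that idea, assuming the links have already been gathered
into a list (@all_links is just a made-up name): the hash records what has
been seen, so the grep keeps only the first occurrence of each URL.

my %seen;
my @unique_links = grep { !$seen{$_}++ } @all_links;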
 

Peter Makholm

a said:
So I need to walk through the site, extract the URLs from each page, and push
them into a data structure. I don't want the duplicate links: every link
should appear only once in my storage.

You can use a hash to check whether a URL has been seen before:

my %seen;
my $url;

while ($url = getnext()) {
    # $seen{$url}++ is 0 (false) the first time a URL turns up,
    # so each URL is processed only once
    process_url($url) unless $seen{$url}++;
}

//Makholm
 

Josef Moellers

Peter said:
You can use a hash to check whether a URL has been seen before:

my %seen;
my $url;

while ($url = getnext()) {
process_url($url) unless $seen{$url}++;
}

I'd split that:

1. collect all URLs:
   $urls{getnext()} = 1;
2. process all unique URLs:
   process_url($_) foreach (keys %urls);
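
A sketch of that two-phase version, assuming getnext() returns undef once
there are no more URLs (getnext() and process_url() are the same placeholder
names used above):

my %urls;

# Phase 1: collect every URL; duplicates simply end up as the same hash key.
while (defined(my $url = getnext())) {
    $urls{$url} = 1;
}

# Phase 2: process each unique URL exactly once.
process_url($_) foreach keys %urls;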
 

Peter Makholm

Josef Moellers said:
I'd split that:

1. collect all URLs:
   $urls{getnext()} = 1;
2. process all unique URLs:
   process_url($_) foreach (keys %urls);

That would not work if getnext() returned an element from a work queue
and process_url() inserted URLs into the work queue based on the content
fetched from the URL.

I would probably do the check while inserting into the work queue. So in
the above example, getnext() would be part of extracting URLs from the
page content, and process_url() would do the insertion into the work queue.
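
Something along these lines, as a rough sketch: fetch() and extract_urls()
are made-up placeholders for the actual HTTP and link-extraction code, and
the check happens in enqueue(), so no URL enters the work queue twice.

use strict;
use warnings;

my %seen;    # every URL that has ever been queued
my @queue;   # URLs still waiting to be fetched

# Add a URL to the work queue unless it has been queued before.
sub enqueue {
    my ($url) = @_;
    push @queue, $url unless $seen{$url}++;
}

enqueue('http://www.example.com/');    # start page (made up)

while (my $url = shift @queue) {
    my $content = fetch($url);              # placeholder: fetch the page
    enqueue($_) for extract_urls($content); # placeholder: extract links, queue the new ones
}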

/Makholm
 

Josef Moellers

Peter said:
That would not work if getnext() returned an element from a work queue
and process_url() inserted URLs into the work queue based on the content
fetched from the URL.

I would probably do the check while inserting into the work queue. So in
the above example, getnext() would be part of extracting URLs from the
page content, and process_url() would do the insertion into the work queue.

Yes, sorry, I ignored the "not a simple tree" part.
 
