Speeding up LWP::Simple

David Morel

Hi all,

I am looking to collect the HTML of approximately 30 million urls, in
as simple a manner as possible, perhaps using the LWP::Simple module.
If I choose to use LWP::Simple, how can I speed up the process?

It seems that it would be too time consuming to collect the HTML one
url at a time... is there any way to collect, say, 10 urls at a time?
20 urls at a time?
Any ideas on how I might implement this or how long it might take to
gather the information?

Thanks,
David Morel
 
Randal L. Schwartz

David> I am looking to collect the HTML of approximately 30 million urls, in
David> as simple a manner as possible, perhaps using the LWP::Simple module.

Are you the next google.com, or a spammer? You're using a hotmail
address, so I suspect the latter.

David> If I choose to use LWP::Simple, how can I speed up the process?

If you're a spammer, can you actually expect a lot of help?
 
2mb

David,
Why not just use one of the commercially available email harvesting
packages? Most are available for 19.99. If you are going to spam, it is better
to get your list built and operational as soon as possible, before more
legislation goes into effect.

Better yet, just kill the email servers by using a dictionary/shotgun approach.
You will either be killing web servers or email servers with your load.
May as well take the shortest distance. You can put some html banner in
there to populate your "good" list for the poor suckers that open it.

So much for opt-in eh?

Oh... that's right. The dictionary method could be
prosecuted as a DoS attack. The www site email harvest trick is much more
covert. Don't forget to set the User-agent request header so your script looks
like IE.

Got any Viagra? How about some pR0n? I could use some junk mail with more
of this. I only get 400+ a day. Fill it up, dude. Maybe you should
cross-post this question to 200-300 lists. That should get your spam-on for
at least a few minutes.

l8,
2mb
 
Trent Curry

John said:
Why is David named a spammer without any proof? Just for harvesting
30,000,000 webpages? There are so many legal things to use it for like
text analyses, language analyses, making an estimate of size, analyzing
HTML tag use, analysing use of scripting, stylesheets etc.

Also, there are websites where one can download entire email databases
for free. I recently saw one. There was one file of 250 (!) MB. Also
files sorted based on country etc.

While I don't know whether he is a spammer or not, you have pointed out a
rather ugly flaw in this group: too many people are too dang eager to jump to
conclusions and can easily be mistaken, as in this case (and others), and
end up condemning a person who was completely undeserving of such treatment.
In this particular case it's easy to say he could be harvesting emails, but,
as said above, there is no proof of that.

On the spam note, more and more legislation is either going into effect,
already in effect, or pending. Little by little, governments (plural, as
other countries have not been sitting idle either) are doing something about
it. It's still a big problem; it's not easy to get away once you've
been tagged by a spammer, short of changing your email address. Still, I
really hope one day we won't have to worry about them anymore, and hopefully
it's not just a dream...
 
David Morel

John Bokma said:
Why is David named a spammer without any proof? Just for harvesting
30,000,000 webpages? There are so many legal things to use it for like
text analyses, language analyses, making an estimate of size, analyzing
HTML tag use, analysing use of scripting, stylesheets etc.

Hi all,

No, I am not a spammer. I'm more likely to be the next google.com than
the next spammer. The mere accusation is offensive to me.

Let me take a moment to clarify some things. If I were a spammer, why
would I take the time to write my own harvester? I'm sure very
effective ones have already been written. Also, I use a hotmail
account here to prevent spam from reaching my school inbox. Don't
harvesters gather emails from Google groups? I hate spam as much as
you do.

Thank you John Bokma for giving me some suggestions. I thought that I
would have to learn how to do some programming with processes/threads,
and you confirmed this.
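
A minimal sketch of what the process-based approach could look like, assuming a fixed pool of forked workers that each take an interleaved share of the URL list. The name fetch_in_parallel, the "$out_dir/$i.html" file layout, and the optional $fetch argument are hypothetical and exist only for illustration; by default the sketch falls back to LWP::Simple's get, which returns undef when a download fails.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): download the URLs in @$urls using
# $workers forked child processes and save page $i to "$out_dir/$i.html".
# $fetch is an optional code ref taking a URL and returning its content;
# it defaults to LWP::Simple::get, which returns undef on failure.
sub fetch_in_parallel {
    my ($urls, $workers, $out_dir, $fetch) = @_;
    $fetch ||= do { require LWP::Simple; \&LWP::Simple::get };

    my @pids;
    for my $w (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Child: handle URLs $w, $w + $workers, $w + 2 * $workers, ...
            for (my $i = $w; $i < @$urls; $i += $workers) {
                my $html = $fetch->($urls->[$i]);
                next unless defined $html;    # skip failed downloads
                # Assumes $fetch returns decoded text, so encode on output.
                open my $fh, '>:encoding(UTF-8)', "$out_dir/$i.html" or next;
                print {$fh} $html;
                close $fh;
            }
            exit 0;
        }
        push @pids, $pid;    # parent: remember the child and keep forking
    }

    waitpid($_, 0) for @pids;    # wait for every worker to finish
    return scalar @pids;         # number of workers that were started
}
```

With the URLs in @urls and an existing directory named pages, a call such as fetch_in_parallel(\@urls, 10, 'pages') would keep ten downloads going at a time. The sketch leaves out retries, timeouts, and any rate limiting per host, all of which a crawl of this size would likely need.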

Why do I need to gather the HTML? Google already offers 900,000+ free
web pages to the general public for analysis purposes. See
http://www.google.com/programming-contest/index.html. I have used
these pages in the past, but there is one problem with them: the
900,000 pages are taken only from the .edu namespace... I need pages
from all of the namespaces. There are other uses for HTML data than
spamming, believe it or not. The Google programming contest lists a
few of them:
* Detecting common templates in pages, and separating out the common
structure from the individual content.
* Classifying links on a page.
* Detecting pages that are near-duplicates of one another.
* Clustering pages by topic or type.


Is there any company out there that sells big databases of web pages?
Perhaps I can avoid some work after all.

Thanks,
David Morel

P.S.
Let's be nicer to each other here :)
 
Trent Curry

John said:
I live in the Netherlands, and soon there will finally be a law that
makes spamming harder. I haven't read all the details, but opt-out
is forbidden. Not sure if opt-in without confirmation is OK, but it
shouldn't be. The only sound way is opt-in *with* confirmation.

Sounds like a good law, and kudos to the Dutch authorities. The
question with such a law is whether it will really be enforceable. I think
this has long been one of the largest roadblocks for the anti-spam world.

John said:
I recently read a mail I wrote, complaining like mad about 5 or 6 spam
mails a day... those were the days (1997). I now receive 200+
unwanted mails a day :-(. Yet I have never gone as far as changing my email
address or munging the ones in the headers.

Well, I use a false email address in my NNTP headers for just that reason.
If you don't advertise it, there is less chance of someone you don't want to
hear from getting hold of it.
 
Tintin

John Bokma said:
I recently read a mail I wrote, complaining like mad about 5 or 6 spam
mails a day... those were the days (1997). I now receive 200+ unwanted
mails a day :-(. Yet I have never gone as far as changing my email address
or munging the ones in the headers.

Time for you to use http://www.spamassassin.org/

Written in Perl of course :)
 
Trent Curry

Tintin said:
Time for you to use http://www.spamassassin.org/

Written in Perl of course :)

I recently read an article (somewhere in groups.google.com) claiming
that, while it can block some spam, in reality it will not block
everything and many spams can get through it. Though it's nice to know these
solutions are being attempted in Perl ;p

--
Trent Curry

perl -e
'($s=qq/e29716770256864702379602c6275605/)=~s!([0-9a-f]{2})!pack("h2",$1)!eg
;print(reverse("$s")."\n");'
 
Greg Schmidt

Trent Curry said:
I recently read an article (somewhere in groups.google.com) claiming
that, while it can block some spam, in reality it will not block
everything and many spams can get through it. Though it's nice to know these
solutions are being attempted in Perl ;p

I have it running on my server. It is currently correctly blocking over
1000 spams per week, and allowing only a small handful (I'd say single
digits, often low single digits) through in the same period.
 
Trent Curry

Greg said:
I have it running on my server. It is currently correctly blocking
over 1000 spams per week, and allowing only a small handful (I'd say
single digits, often low single digits) through in the same period.

Well, Greg, thanks for setting the record straight. I was hoping someone with
experience with it would shed some light. I will see if I can locate a copy
to test it out for myself. I assume it will work fine with sendmail?

--
Trent Curry

perl -e
'($s=qq/e29716770256864702379602c6275605/)=~s!([0-9a-f]{2})!pack("h2",$1)!eg
;print(reverse("$s")."\n");'
 
Greg Schmidt

Trent Curry said:
Well, Greg, thanks for setting the record straight. I was hoping someone with
experience with it would shed some light. I will see if I can locate a copy
to test it out for myself. I assume it will work fine with sendmail?

It can be made to work with Sendmail. It can be made to work much more
easily with Postfix, which I find to be superior to Sendmail in many,
very off-topic, ways.
 
Tintin

Trent Curry said:
I recently read an article (somewhere in groups.google.com) claiming
that, while it can block some spam, in reality it will not block
everything and many spams can get through it. Though it's nice to know these
solutions are being attempted in Perl ;p

Used "out of the box", SpamAssassin will stop about 95% of spam. With a
little tuning, you should be able to get that to >99%.

I've been using it for over a year, and it saves me a great deal of pain.
 
