PLEASE help me with this process

Jang

I want to do this.. Please show me the way to do it.

step 1. search a certain keyword in google.com
e.g. "United nations"
step 2. on the result page in step 1, extract links other than
advertisements and cache page.
step 3. scrape the html file (the target page) of each link
automatically

Does anybody know how to do these three steps? If you know of any sample
script that supports this process, or if you have any similar program..
please let me know. You can save a poor guy from deep frustration.

Thank you!

J.
 
Fabian Pilkowski

* Jang said:
I want to do this.. Please show me the way to do it.

step 1. search a certain keyword in google.com
e.g. "United nations"
step 2. on the result page in step 1, extract links other than
advertisements and cache page.
step 3. scrape the html file (the target page) of each link
automatically

Does anybody know how to do these three steps? If you know of any sample
script that supports this process, or if you have any similar program..
please let me know. You can save a poor guy from deep frustration.

Above you described the hard way of asking Google for its results. But
Google provides an easy-to-use interface to their data. They named it
"Google Web API".

http://www.google.com/apis/

First you have to create a Google account there (for free, I think).
That gets you a license key which entitles you to 1000 automated
queries per day (read the page for more details).

Second, search CPAN for a tool you want to use. I think
Net::Google is the easiest way to start with. Download this module here

http://search.cpan.org/~ascope/Net-Google-0.62/lib/Net/Google.pm

and read its documentation carefully (try out the examples).

regards,
fabian
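
For comparison, the "hard way" from steps 2 and 3 of the question can be sketched roughly as follows. This is a minimal Python sketch using only the standard library, not the Google Web API or Net::Google. The names LinkCollector, extract_result_links, fetch_page and the skip_hosts parameter are made up for illustration, and dropping every link that points back at the search engine's own host is only an assumed heuristic for skipping cache, ad and navigation links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_result_links(html, base_url, skip_hosts=("google.com",)):
    """Step 2: return absolute http(s) links found in a result page.

    Links whose host is in skip_hosts (or a subdomain of one) are dropped.
    That is a rough, assumed heuristic for ignoring cache, ad-redirect and
    navigation links; real result pages may need stricter filtering.
    """
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        host = parts.hostname or ""
        if any(host == h or host.endswith("." + h) for h in skip_hosts):
            continue
        if url not in links:  # keep first occurrence only
            links.append(url)
    return links


def fetch_page(url, timeout=10):
    """Step 3: download one target page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "link-scraper-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Given the HTML of a result page, something like `for url in extract_result_links(html, "http://www.google.com/search"): page = fetch_page(url)` would then cover step 3.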
 
Sherm Pendley

Jang said:
Thank you very much!

Who are you thanking, and for what?

I'm *seriously* thinking of killfiling every post from Google Groups, until
they start teaching their users how to post correctly. The amount of
unreadable gibberish being posted from GG is absurd, and growing. :-(

sherm--
 
Sherm Pendley

Paul Lalli said:
I would just like to point out, Sherm, that some of us forced to use
Google Groups *do* understand how to follow both Netiquette in general
and clpm's Posting Guidelines in particular....

Don't worry Paul - you earned an increased score long ago. Your posts are
still visible.

sherm--
 
Paul Lalli

Sherm said:
I'm *seriously* thinking of killfiling every post from Google Groups, until
they start teaching their users how to post correctly. The amount of
unreadable gibberish being posted from GG is absurd, and growing. :-(

I would just like to point out, Sherm, that some of us forced to use
Google Groups *do* understand how to follow both Netiquette in general
and clpm's Posting Guidelines in particular.... Seems a shame to kill
file all of us on the basis of those who don't like or know how to
read.

Just my $0.02.

Paul Lalli
 
Jang

If saying thank you is such a big violation of netiquette, I'd like to
apologize for it. Sorry.
 
John Bokma

Kevin Michael Vail said:
I did that already. It helps a lot. Google is never going to teach
their users to post correctly, obviously.

Did your ISP teach you how to post?

But it's a good idea, there are some people who promote kf-ing all posts
coming from google. Probably it will never be enough people, but it might
teach some a lesson.
 
Tad McClellan

Jang said:
If saying thank you is such a big violation of netiquette,


Saying thank you is NOT a violation of netiquette.

Not quoting context IS a violation of netiquette.

I'd like to
apologize for it.


I must admit that I doubt your sincerity.

Off you go to invisi-land.
 
Tad McClellan

John Bokma said:
Did your ISP teach you how to post?

But it's a good idea, there are some people who promote kf-ing all posts
coming from google. Probably it will never be enough people, but it might
teach some a lesson.


I don't think the point is to teach them a lesson.

I think the point is to conserve the time available for helping
other people by ignoring people that are too hard to help.
 
Tad McClellan

I had finally gone "all the way" on GG scoring about half an hour
before I saw this post. Bizarre.

I would just like to point out, Sherm, that some of us forced to use
Google Groups *do* understand how to follow both Netiquette in general
and clpm's Posting Guidelines in particular....


Exceptions can be made, once such a poster is noticed.

Seems a shame to kill
file all of us on the basis of those who don't like or know how to
read.


For me, it is all about probabilities (really about time).

Eliminating all GG posts increases the probability of my finding
a question to answer without raising my blood pressure.
 
Niall

Paul said:
I would just like to point out, Sherm, that some of us forced to use
Google Groups *do* understand how to follow both Netiquette in general
and clpm's Posting Guidelines in particular.... Seems a shame to kill
file all of us on the basis of those who don't like or know how to
read.

Just my $0.02.

Paul Lalli

I am a Google Groups user.

I have found most of the replies on clpm to be extremely helpful and
would feel that I was not getting the most out of the group if my
(occasional) questions did not reach all the experts.

However, I have found it difficult to establish what the best newsreader
interface would be for me to use. I am running Windows 2000. I am
leaning towards Xnews at the moment but have found it difficult to find
any one newsreader which stands out head and shoulders above the others.

Any recommendations?
 
Arndt Jonasson

Tad McClellan said:
I don't think the point is to teach them a lesson.

I think the point is to conserve the time available for helping
other people by ignoring people that are too hard to help.

If that is the only reason (and a very good reason it is), I don't see
the point of reacting at all to people's final context-free thank you
messages, which many do. The rest of us have to read (or at least make
an active decision to ignore) the ensuing debates. (I'm not arguing
with you, Tad, your article just provided an appropriate context.)
 
John Bokma

Kevin Michael Vail said:
No, I've been reading Usenet since it was carried over UUCP. I was
referring to this sentence in the post I responded to, from Sherm
Pendley:


That last part isn't going to happen.

It was a rhetorical question :-D.
Well, maybe, maybe not, but it makes my life easier, which is all I'm
after.

Yup, I don't like kill files, but I have been playing with the idea
myself for quite some time. And maybe killing every posting with
newbie, help, or question in the subject.
 
Paul Lalli

Tim said:
As someone who has *not* killfiled Google Groups posts (yet),
but has seen an increasing number of people do so,
I'm curious as to how you're forced?

By the corporate proxy and firewall in place at my job. No NNTP
programs are allowed access to the outside world. Using a
web-interface is the only way I can get to usenet. And frankly, I'm
rather surprised the proxy hasn't disallowed this yet. (It's already
killed other communication mediums such as webmail sites). So I
suppose you could say I'm not forced to use Google Groups per se, but
as long as I am forced to use a web client, I don't see any particular
reason not to use this one. Unless someone has a suggestion for a
better web interface to Usenet?

Paul Lalli
 
Tad McClellan

John Bokma said:
Yup, I don't like kill files, but I have been playing with the idea
myself for quite some time.


I went for about 4 years with no killfiling at all, 2 years with
killfiling based only on Subject, and the last 4-5 years as far
as killfiling individual posters.

It is a progressive attempt at keeping the cost/benefit ratio of
continued participation at a level that allows me to continue
rather than abandon the newsgroup.

And maybe killing every posting with
newbie, help, or question in the subject.


Here are the ones that I currently have:

% foolish subjects (no lower case letters)
Score: -9000
~Subject: \c[a-z]

% foolish subjects
Score:: -9000
Subject: ^perl$
Subject: ^help!?$
Subject: ^question!?$
Subject: ^perl question!?$
Subject: (none)
Subject: no subject
Subject: ^$
Subject: !!!
Subject: ###
Subject: ~~~
Subject: \?\?\?
Subject: \$\$\$
Content-Type: multipart/alternative
Content-Type: multipart/mixed
Content-Type: text/html

% red flag subjects
Score:: -5000
Subject: urgent
Subject: newbie
Subject: please read
Subject: ^looking for
Subject: perl (problem|question)
Subject: perl (script|program) (problem|question)

% probably off-topic
Score:: -1000
Subject: browser
Subject: 500 error
Subject: server error
Subject: redirect
Subject: banner
Subject: htaccess
Subject: cgi-lib
Subject: download
Subject: upload
Subject: guest *book
Subject: referr?er
Subject: apache
Subject: check *box
Subject: text *area
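
To see what rules like these do to a given subject line, here is a small Python sketch of the idea. It is a simplified model and not slrn's actual scoring code: it assumes a "Score::" block applies when any one of its patterns matches, that matching is case-insensitive, and that the scores of all matching blocks are added together. Only a handful of the patterns above are included, and the names RULES and score_subject are made up for illustration.

```python
import re

# (score, patterns) pairs modelled on a few of the rules above.
# Assumption: a block applies if ANY of its patterns matches the subject.
RULES = [
    (-9000, [r"^perl$", r"^help!?$", r"^question!?$",
             r"^perl question!?$", r"!!!", r"\?\?\?"]),
    (-5000, [r"urgent", r"newbie", r"^looking for",
             r"perl (problem|question)"]),
    (-1000, [r"guest *book", r"referr?er", r"htaccess"]),
]


def score_subject(subject):
    """Return the summed score of every rule block that matches."""
    total = 0
    # "no lower case letters" rule: case-sensitive and negated.
    if not re.search(r"[a-z]", subject):
        total -= 9000
    for score, patterns in RULES:
        if any(re.search(p, subject, re.IGNORECASE) for p in patterns):
            total += score
    return total


for subject in ("HELP!!!", "Perl question", "Sorting a hash by value"):
    print(subject, score_subject(subject))
```

Under those assumptions a subject can be hit by several blocks at once, so a shouted "HELP!!!" ends up far below a merely vague "Perl question", while an ordinary descriptive subject is left at zero.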
 
John Bokma

Tad McClellan said:

[ kill file ]
It is a progressive attempt at keeping the cost/benefit ratio of
continued participation at a level that allows me to continue
rather than abandon the newsgroup.

I hear you :)
Here are the ones that I currently have:

Thanks, I will see if I can copy paste them into the score file Xnews uses
(most probably, since the developer took the idea from slrn iirc).
 
Chris

Google by default (as you can see here) is set up in such a way that
netiquette is broken (as far as I know.) When one replies to a
posting, the original post is NOT copied into the edit buffer so
netiquette can be followed.

I use Google for the same reason as Paul. I *could* ssh to my server
at home from work and use NNTP from there, but I've disallowed Usenet
from home to filter myself and the children from you all know where...
So Google is the only place for me to use Usenet now.

If there is any port open from work (we have 22 open by default),
then there should be no reason you couldn't use NNTP through home from
work. Except in cases like my own.

-ceo
 
