extract data from web page

jobs239

I want to type a word into the search box of a Google-like website,
then extract the results from the results page and store them in an Excel file.
How can I programmatically do the search and extract the data?
 
Roedy Green

I want to type a word into the search box of a Google-like website,
then extract the results from the results page and store them in an Excel file.
How can I programmatically do the search and extract the data?

See http://mindprod.com/products.html#COMMON11, which includes code to GET or
POST to retrieve a web page.

You then want to convert the entities to Unicode. See
http://mindprod.com/products1.html#ENTITIES

Then you want to strip out the <tags>. Given the static import

import static com.mindprod.entities.StripEntities.*;

you can do that in a single line:

return stripNbsp( stripEntities( stripHTMLTags( key.trim() ) ) );
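
For readers without the mindprod classes, here is a rough JDK-only sketch of the same idea: strip the tags, then decode entities, then trim. The class HtmlText and its method names are made up for illustration and are not the mindprod API. It only knows a handful of named entities, and a regex is a crude substitute for a real HTML parser, so treat it as a starting point only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT com.mindprod.entities.StripEntities.
final class HtmlText {
    // Naive tag matcher: everything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Hex (&#xE9;), decimal (&#233;) or simple named (&amp;) entity references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decodeOne(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Decodes one entity body; returns the original text if it is unknown.
    private static String decodeOne(String name, String original) {
        try {
            if (name.startsWith("#x")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
            }
            if (name.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return original; // number overflow or invalid code point
        }
        switch (name) {
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            case "nbsp": return " "; // assumption: a plain space is good enough here
            default:     return original; // unknown named entity, leave as-is
        }
    }

    // Tags first, then entities, so that "&lt;b&gt;" survives as literal text.
    static String toPlainText(String html) {
        return decodeEntities(stripTags(html)).trim();
    }
}
```

Calling HtmlText.toPlainText(cellHtml) on each extracted fragment gives text you can write to a file. The order matters: decoding entities before stripping tags would turn an escaped "&lt;b&gt;" into something the tag stripper then eats.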
 
Andrew Thompson

I want to type a word into the search box of a Google-like website,
then extract the results from the results page and store them in an Excel file.

Is that allowed? Google has historically objected to
such access to the data they collect and present.
 
Twisted

Is that allowed? Google has historically objected to
such access to the data they collect and present.

How is it any different in principle from viewing the page manually
and making a mental note of what you saw there, or bookmarking all the
results with right clicks in the browser, or some such?

Anyway, what Google doesn't know can't hurt it. Just don't republish
it without their permission, or generate too heavy a load on their
servers with automated traffic. Make sure it accesses and downloads no
faster than a human user would, and use the results locally/privately
only. Then a) you're not doing anything morally wrong and b) Google
doesn't know you're doing this thing that isn't morally wrong, but
that they might decide they don't like.
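
The "no faster than a human user would" part can be sketched as a small throttle that enforces a minimum gap between requests. PoliteThrottle and its method names are invented for this example, and the right gap is your call (a few seconds is only my assumption about what counts as human speed).

```java
// Hypothetical helper: enforces a minimum pause between successive requests.
final class PoliteThrottle {
    private final long minGapMillis;
    private boolean hasRequested = false;
    private long lastRequestAt;

    PoliteThrottle(long minGapMillis) {
        if (minGapMillis < 0) {
            throw new IllegalArgumentException("gap must be >= 0");
        }
        this.minGapMillis = minGapMillis;
    }

    // How many milliseconds the caller would still have to wait at time 'now'.
    synchronized long millisToWait(long now) {
        if (!hasRequested) {
            return 0L; // the first request never waits
        }
        return Math.max(0L, minGapMillis - (now - lastRequestAt));
    }

    // Records that a request was issued at time 'now'.
    synchronized void markRequest(long now) {
        hasRequested = true;
        lastRequestAt = now;
    }

    // Blocks until the minimum gap has passed, then records the new request.
    synchronized void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

Create one instance, for example new PoliteThrottle(3000), and call acquire() immediately before every page fetch. The time-taking methods exist only so the waiting logic can be checked without actually sleeping.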
 
Stefan Ram

Roedy Green said:

Recently I started to code a class to read a web page,
and I used »java.net.HttpURLConnection«.
Is anything wrong with this approach?

The class »com.mindprod.http.Read« contains the string
»8859_1«. Isn't this a preconception, given that web pages
might use other encodings? Or maybe I have not understood the
intended use yet.
 
Roedy Green

The class »com.mindprod.http.Read« contains the string
»8859_1«. Isn't this a preconception, given that web pages
might use other encodings? Or maybe I have not understood the
intended use yet.

That should be improved. I will have a look. The header probably
contains info on what encoding to use.
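
As a sketch of what "look in the header" could mean: the Content-Type response header often carries a charset parameter, e.g. "text/html; charset=UTF-8". The helper below pulls it out and falls back to a default when it is missing or unknown. CharsetSniffer is a made-up name, this is not the current code of com.mindprod.http.Read, and it ignores charsets declared only in an HTML meta tag.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks the charset out of a Content-Type header value.
final class CharsetSniffer {
    // Matches charset=UTF-8 and charset="utf-8", case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // server sent no Content-Type at all
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback; // header present, but no charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

With a java.net.URLConnection conn, the header value comes from conn.getContentType(), so something like fromContentType(conn.getContentType(), StandardCharsets.ISO_8859_1) keeps the existing 8859_1 behaviour as the fallback while honouring whatever the server declares.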
 
Andrew Thompson

I want to type a word into the search box of a Google-like website,
then extract the results from the results page and store them in an Excel file.
How can I programmatically do the search and extract the data?

Google does not condone such programmatic access to
search results, to the best of my knowledge.
 
Twisted

Google does not condone such programmatic access to
search results, to the best of my knowledge.

I don't recall the OP asking for either your or Google's opinion on
that, but simply how to do it.

Or are we reaching the point now where there will be ubiquitous
enforcement of the wishes of all large corporations and a refusal by
most people to divulge any information that might enable someone to
act in any way contrary to same? If so, I'm packing my bags and moving
to someplace that is still sane. (Anyone know anywhere where society
still keeps big business in its place and supports the individual when
it comes down to choosing between an individual and a corporation, and
where the law isn't ludicrously business-centric and anti-consumer?)
 
Andrew Thompson

Twisted said:
[quoted text clipped - 3 lines]
Google does not condone such programmatic access to
search results, to the best of my knowledge.

I don't recall the OP asking for either your or Google's opinion on
that, but simply how to do it.

I don't recall asking for your opinion either, Twisted,
but given this is a discussion forum, it is not amazing
you would add it.

..Welcome to the c.l.j.p. *discussion* forum*.

(* This is not a help desk)

--
Andrew Thompson
http://www.athompson.info/andrew/
 
nebulous99

Twisted said:
(e-mail address removed) wrote:
I want to type a word in the search box of google like website and
[quoted text clipped - 3 lines]
Google does not condone such programmatic access to
search results, to the best of my knowledge.
I don't recall the OP asking for either your or Google's opinion on
that, but simply how to do it.

I don't recall asking for your opinion either, Twisted,

The point being that your response to the OP was unuseful to the OP,
and appears to be a case of you playing at being rent-a-cop instead of
attempting to be helpful to someone with a coding question.
 
Lew

Andrew said:
(e-mail address removed) wrote:
I want to type a word in the search box of google like website and
[quoted text clipped - 6 lines]
I don't recall asking for your opinion either, Twisted,
The point being that your response to the OP was ..

.blah, blah, blah. Try to get interesting.

Andrew's point could be /very/ helpful to the OP if it prevents jail time or a
massive judgment against them.
 
nebulous99

Andrew said:
(e-mail address removed) wrote:
I want to type a word in the search box of google like website and
[quoted text clipped - 6 lines]
I don't recall asking for your opinion either, Twisted,
The point being that your response to the OP was ..
.blah, blah, blah. Try to get interesting.

Pot, kettle, and all that.
Andrew's point could be /very/ helpful to the OP if it prevents jail time or a
massive judgment against them.

Unless the OP signed something, or does something dumb like scrape and
republish a huge amount of copyrighted stuff without permission, a
massive judgment seems unlikely, let alone jail time. Actual hacking
or commercial copyright infringement might lead to jail time. Merely
browsing a site with the browser software of his choice, without
either producing abnormally large traffic levels to the server (which
would get his usage noticed and might be treated as a DoS attack) or
republishing anything (which would get his usage noticed and might be
copyright infringement), certainly should do neither if the OP is in a
sane and just country. So unless he's in China or something...what
Google doesn't know won't hurt him. Or hurt Google.
 
anal_aviator

Andrew said:
(e-mail address removed) wrote:
(e-mail address removed) wrote:
I want to type a word in the search box of google like website and
[quoted text clipped - 6 lines]
I don't recall asking for your opinion either, Twisted,
The point being that your response to the OP was ..
.blah, blah, blah. Try to get interesting.

Pot, kettle, and all that.
Andrew's point could be /very/ helpful to the OP if it prevents jail time or a
massive judgment against them.

Unless the OP signed something, or does something dumb like scrape and
republish a huge amount of copyrighted stuff without permission, a
massive judgment seems unlikely, let alone jail time. Actual hacking
or commercial copyright infringement might lead to jail time. Merely
browsing a site with the browser software of his choice, without
either producing abnormally large traffic levels to the server (which
would get his usage noticed and might be treated as a DoS attack) or
republishing anything (which would get his usage noticed and might be
copyright infringement), certainly should do neither if the OP is in a
sane and just country. So unless he's in China or something...what
Google doesn't know won't hurt him. Or hurt Google.


Don't sweat it.

Andrew pops his ugly head up now and again, whenever there's anything
unhelpful to be said. He's also quite a lawyer, is our Andrew; I believe he
consulted on the OJ case.


Just consider him the newsgroup pet: feed him or kick him, it's up to you.
 
G. Garrett Campbell

Isn't a web browser a program?
Isn't the response just the page at a URL?

One would just need to construct a URL with an appropriate POST built from the
search string and read the response.

For personal use, how can that be different from a web browser?
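
A minimal sketch of that idea with plain java.net, under some assumptions of mine: it sends the search word as a GET query string rather than a POST (most search forms accept that, but check the form of the site you are targeting), the base URL and parameter name are placeholders you must supply, and the response is decoded as UTF-8 without checking what the server declares. SearchFetcher is an invented name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper: builds a search URL and reads the response body.
final class SearchFetcher {

    // Builds base?param=query with the parameter name and value URL-encoded.
    static String buildSearchUrl(String base, String paramName, String query) {
        try {
            return base + "?" + URLEncoder.encode(paramName, "UTF-8")
                    + "=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen in practice: every JVM must support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Performs a GET and returns the whole body as one string.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000); // milliseconds
        conn.setReadTimeout(10000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) { // assumed encoding
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

Usage would look like fetch(buildSearchUrl(base, "q", word)), where base and "q" are whatever the site's search form actually uses. What comes back is raw HTML, so you still have to pick the result entries out of it before writing them anywhere.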


 
