How do they do this?


Joe Snodgrass

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.

News aggregators also do this.

Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc? And what is the name of this general technique? (Not
including "hacking" of course.)

Thanks in advance.
 

projectmoon

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

[..]

News aggregators also do this.

This "application" is really a collection of different techniques and
technologies, each with their own specific uses. I guess the most general
term for it would be scraping, but even that isn't really suitable since
"screen scraping" generally refers to pulling raw unformatted data (In my
experience, it tends to be HTML) and then parsing it.

How do you do it? That depends on what you're trying to accomplish and
the language/environment you are using.
Assuming my computer has already requested a page from the server, what tool
do I use to intercept the content from that page, as it arrives on my
pc? And what is the name of this general technique? (Not including
"hacking" of course.)

That's a very general question. Need more details than this. My first
guess is that you're trying to do this from a web browser, in which case
the question isn't related to Java at all. But we all know what happens
when you assume...

So what is it, exactly, that you are trying to accomplish?
 

Joshua Cranmer

Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc? And what is the name of this general technique? (Not
including "hacking" of course.)

It depends on what the format of the content is. Most typically, it is
HTML in some form or fashion--in these cases, using an HTML library to
convert the webpage into a DOM that you can then manipulate is
definitely feasible, and I have done it before. The general technique is
often referred to as "scraping" (I personally use the term "webscraping"
more).
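
For example (just one possible library, not necessarily the one meant
above), jsoup will fetch a page and give you a DOM you can query with
CSS selectors. A minimal sketch, assuming jsoup is on the classpath and
using example.com purely as a placeholder URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a DOM in one step
        Document doc = Jsoup.connect("http://example.com/").get();

        // Query the DOM with CSS selectors and pull out what you want
        System.out.println("Title: " + doc.title());
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}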
 

Nigel Wade

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

The first thing you need to be aware of is copyright issues. Taking data
from someone else's website and making it available directly, rather
than via accredited and referenced hyperlinks, will almost certainly be
a breach of copyright. Even hot-linking is dubious.
This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.

News aggregators also do this.

Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc?

Normally you wouldn't. Capturing the content being directed to another
application would be fairly complicated. You may be able to do it with
an application such as Wireshark (which is a packet sniffer/traffic
analyser) and get it to save all the traffic in a file for later
analysis and processing.

A better option is to read the web content directly in your application,
by opening the desired URL, then parse the response. You can generally
only do this if the response is HTML in some form. Other responses, such
as JavaScript code or streamed/downloaded content, need other
techniques.
And what is the name of this general technique? (Not
including "hacking" of course.)

I think a generic term is "web-scraping", although content authors may
use other terms.

It's something I've only done once myself, to generate a vCalendar
calendar from a web page containing a fixture list for a sports league.
In this case all I did was to read from a URL by opening a Reader:
java.io.Reader reader;
reader = new InputStreamReader(new URL(args[0]).openStream());

then read from it using:
HtmlParserCallback callback = new HtmlParserCallback();
new ParserDelegator().parse(reader, callback, true);

My HtmlParserCallback was a class which extended
javax.swing.text.html.HTMLEditorKit.ParserCallback

with callbacks to handle the various bits of HTML I was interested in.
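
For a sense of what such a callback class looks like, here is a minimal
sketch (the table-cell handling below is only illustrative, not the
actual fixture-list code described above):

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Illustrative callback: prints the text of every table cell it sees.
class HtmlParserCallback extends HTMLEditorKit.ParserCallback {
    private boolean inCell = false;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TD) inCell = true;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TD) inCell = false;
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inCell) System.out.println(new String(data));
    }
}

public class ScrapePage {
    public static void main(String[] args) throws Exception {
        Reader reader = new InputStreamReader(new URL(args[0]).openStream());
        new ParserDelegator().parse(reader, new HtmlParserCallback(), true);
    }
}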

This is as much as I have ever done in this respect, using Java.
 

Roedy Green

Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc? And what is the name of this general technique? (Not
including "hacking" of course.)

see http://mindprod.com/jgloss/screenscaping.html

If you look at http://mindprod.com/applet/americantax.html and download
the source, you will see all kinds of little programs to extract sales
tax information from various websites.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Microsoft has a new version out, Windows XP, which according to everybody is the "most reliable Windows ever." To me, this is like saying that asparagus is "the most articulate vegetable ever."
~ Dave Barry
 

Roedy Green

The first thing you need to be aware of is copyright issues.

I got in trouble simply by using the exchange rates publicly posted
at Oanda. It seems odd that people would provide information you
could look at but not use, but that's copyright for you.

--
Roedy Green Canadian Mind Products
http://mindprod.com

Microsoft has a new version out, Windows XP, which according to everybody is the "most reliable Windows ever." To me, this is like saying that asparagus is "the most articulate vegetable ever."
~ Dave Barry
 

Arne Vajhøj

I got in trouble simply by using the exchange rates publicly posted
at Oanda. It seems odd that people would provide information you
could look at but not use, but that's copyright for you.

An author of a book also wrote the book for people to read,
but that does not imply that photocopying the book is OK.

Arne
 

Thufir Hawat

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

[..]

News aggregators also do this.

This "application" is really a collection of different techniques and
technologies, each with their own specific uses.

The OP's probably asking about something along the lines of:


HtmlUnit is a "GUI-Less browser for Java programs". It models HTML
documents and provides an API that allows you to invoke pages, fill out
forms, click links, etc... just like you do in your "normal" browser.

http://htmlunit.sourceforge.net/
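
A minimal sketch of that style of use, assuming HtmlUnit 2.x is on the
classpath (example.com is just a placeholder URL):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        try {
            // "Invoke" the page just as a browser would
            HtmlPage page = webClient.getPage("http://example.com/");
            System.out.println(page.getTitleText());

            // Walk the links on the page
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println(anchor.asText() + " -> "
                        + anchor.getHrefAttribute());
            }
        } finally {
            webClient.closeAllWindows();
        }
    }
}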
 

Joe Snodgrass

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

The first thing you need to be aware of is copyright issues. Taking data
from someone else's website and making it available directly, rather
than via accredited and referenced hyperlinks, will almost certainly be
a breach of copyright. Even hot-linking is dubious.


This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.
News aggregators also do this.
Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc?  

Normally you wouldn't. Capturing the content being directed to another
application would be fairly complicated. You may be able to do it with
an application such as Wireshark (which is a packet sniffer/traffic
analyser) and get it to save all the traffic in a file for later
analysis and processing.

A better option is to read the web content directly in your application,
by opening the desired URL, then parse the response. You can generally
only do this if the response is HTML in some form. Other responses, such
as JavaScript code or streamed/downloaded content, need other
techniques.
And what is the name of this general technique? (Not
including "hacking" of course.)

I think a generic term is "web-scraping", although content authors may
use other terms.

It's something I've only done once myself, to generate a vCalendar
calendar from a web page containing a fixture list for a sports league.
In this case all I did was to read from a URL by opening a Reader:
 java.io.Reader reader;
 reader = new InputStreamReader(new URL(args[0]).openStream());

then read from it using:
 HtmlParserCallback callback = new HtmlParserCallback();
 new ParserDelegator().parse(reader, callback, true);

My HtmlParserCallback was a class which extended
javax.swing.text.html.HTMLEditorKit.ParserCallback

Is this right?

Supposing it's html, I need a string processor, maybe perl, to
intercept the code as it arrives, methodically reading through the raw
html, as strings. As it comes in, the html format would be identical
to what I see when I give my browser the "show source code" command.

My code would have to "dig" its way down to the html that I care
about, skipping everything I don't care about, by finding opening
tags, then discarding everything until the closing tag. Little by
little, it would zero in on the part I want, also discarding non-data
html.

Did I get that right?
with callbacks to handle the various bits of HTML I was interested in.

I don't know what a "callback" is. :(
 

projectmoon

Supposing it's html, I need a string processor, maybe perl, to intercept
the code as it arrives, methodically reading through the raw html, as
strings. As it comes in, the html format would be identical to what I
see when I give my browser the "show source code" command.

My code would have to "dig" its way down to the html that I care about,
skipping everything I don't care about, by finding opening tags, then
discarding everything until the closing tag. Little by little, it would
zero in on the part I want, also discarding non-data html.

Did I get that right?

Basically, yes. There are HTML parsing libraries in many languages out
there. You will need to use one. If the structure of the HTML you are
parsing is always going to be the same though, you may be able to get
away with regular expressions.
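
As a toy illustration of the regular-expression route (in Java, and only
safe when the markup really is this predictable; the "price" cell is a
made-up example):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrapeSketch {
    public static void main(String[] args) {
        // Pretend this snippet came from the fetched page
        String html = "<td class=\"price\">19.95</td>";

        // Capture whatever sits between the known opening and closing tags
        Matcher m = Pattern.compile("<td class=\"price\">([^<]+)</td>")
                           .matcher(html);
        if (m.find()) {
            System.out.println("Price: " + m.group(1));
        }
    }
}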
I don't know what a "callback" is. :(

Then perhaps you should do a lot more reading before attempting screen
scraping.
 

Nigel Wade

One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.

The first thing you need to be aware of is copyright issues. Taking data
from someone else's website and making it available directly, rather
than via accredited and referenced hyperlinks, will almost certainly be
a breach of copyright. Even hot-linking is dubious.


This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.
News aggregators also do this.
Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc?

Normally you wouldn't. Capturing the content being directed to another
application would be fairly complicated. You may be able to do it with
an application such as Wireshark (which is a packet sniffer/traffic
analyser) and get it to save all the traffic in a file for later
analysis and processing.

A better option is to read the web content directly in your application,
by opening the desired URL, then parse the response. You can generally
only do this if the response is HTML in some form. Other responses, such
as JavaScript code or streamed/downloaded content, need other
techniques.
And what is the name of this general technique? (Not
including "hacking" of course.)

I think a generic term is "web-scraping", although content authors may
use other terms.

It's something I've only done once myself, to generate a vCalendar
calendar from a web page containing a fixture list for a sports league.
In this case all I did was to read from a URL by opening a Reader:
java.io.Reader reader;
reader = new InputStreamReader(new URL(args[0]).openStream());

then read from it using:
HtmlParserCallback callback = new HtmlParserCallback();
new ParserDelegator().parse(reader, callback, true);

My HtmlParserCallback was a class which extended
javax.swing.text.html.HTMLEditorKit.ParserCallback

Is this right?

Supposing it's html, I need a string processor, maybe perl, to
intercept the code as it arrives, methodically reading through the raw
html, as strings. As it comes in, the html format would be identical
to what I see when I give my browser the "show source code" command.

My code would have to "dig" its way down to the html that I care
about, skipping everything I don't care about, by finding opening
tags, then discarding everything until the closing tag. Little by
little, it would zero in on the part I want, also discarding non-data
html.

Did I get that right?

No. The ParserDelegator.parse() method handles reading and decoding the
HTML returned from the URL. Whenever it has decoded some element of HTML
it sends it to your code for interpretation, via the callback you
registered with it. Your callback should override certain methods in
HTMLEditorKit.ParserCallback, and the appropriate method will be called
depending on the type of element the parser has detected.

Typically you'd declare your callback to extend
HTMLEditorKit.ParserCallback, and then override whichever methods you
wanted to be able to handle those elements. As the parser detects each
type of HTML element it calls the appropriate callback method in the
HTMLEditorKit.ParserCallback object it was passed. If you override that
method, your code can process the HTML element; if you don't, the
default action takes place (which, AFAIK, is to ignore it).

There's a simple example of how to use HTMLEditorKit.ParserCallback here:

http://www.java2s.com/Tutorial/Java...avaxswingtexthtmlHTMLEditorKittoparseHTML.htm

Of course, you can write your own parser if you wish. In which case you
would need to do everything you've outlined above.

I don't know what a "callback" is. :(

In Java-speak it would be a "listener". It's a method which you register
with some other piece of code. Under certain predefined circumstances
that other piece of code "calls back" to your code via the callback method.
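
A bare-bones illustration of the pattern, nothing to do with HTML (all
names here are made up for the example):

public class CallbackSketch {
    // A callback is an object you hand to other code; that code invokes
    // ("calls back") one of its methods when something of interest happens.
    interface LineListener {
        void onLine(String line);
    }

    // The "other piece of code": it does the work and calls back per line.
    static void process(String text, LineListener listener) {
        for (String line : text.split("\n")) {
            listener.onLine(line);
        }
    }

    public static void main(String[] args) {
        process("first\nsecond", new LineListener() {
            @Override
            public void onLine(String line) {
                // Our code runs here, "called back" by process()
                System.out.println("got: " + line);
            }
        });
    }
}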
 

Joe Snodgrass

On 18/10/10 14:16, Joe Snodgrass wrote:
One of today's most useful and generalized programming applications is
to take the data displayed on someone else's website and reformat it
according to one's own standards, for a new and improved website.
The first thing you need to be aware of is copyright issues. Taking data
from someone else's website and making it available directly, rather
than via accredited and referenced hyperlinks, will almost certainly be
a breach of copyright. Even hot-linking is dubious.
This is how Mark Zuckerberg got his first incarnation of facebook
running, by repurposing the jpegs in Harvard's online student
photobook, and then allowing the other students to type in snide
comments about how bad everybody looked.
News aggregators also do this.
Assuming my computer has already requested a page from the server, what
tool do I use to intercept the content from that page, as it arrives
on my pc?  
Normally you wouldn't. Capturing the content being directed to another
application would be fairly complicated. You may be able to do it with
an application such as Wireshark (which is a packet sniffer/traffic
analyser) and get it to save all the traffic in a file for later
analysis and processing.
A better option is to read the web content directly in your application,
by opening the desired URL, then parse the response. You can generally
only do this if the response is HTML in some form. Other responses, such
as JavaScript code or streamed/downloaded content, need other
techniques.
And what is the name of this general technique? (Not
including "hacking" of course.)
I think a generic term is "web-scraping", although content authors may
use other terms.
It's something I've only done once myself, to generate a vCalendar
calendar from a web page containing a fixture list for a sports league.
In this case all I did was to read from a URL by opening a Reader:
 java.io.Reader reader;
 reader = new InputStreamReader(new URL(args[0]).openStream());
then read from it using:
 HtmlParserCallback callback = new HtmlParserCallback();
 new ParserDelegator().parse(reader, callback, true);
My HtmlParserCallback was a class which extended
javax.swing.text.html.HTMLEditorKit.ParserCallback
Is this right?
Supposing it's html, I need a string processor, maybe perl, to
intercept the code as it arrives, methodically reading through the raw
html, as strings.  As it comes in, the html format would be identical
to what I see when I give my browser the "show source code" command.
My code would have to "dig" its way down to the html that I care
about, skipping everything I don't care about, by finding opening
tags, then discarding everything until the closing tag.  Little by
little, it would zero in on the part I want, also discarding non-data
html.
Did I get that right?

No. The ParserDelegator.parse() method handles reading and decoding the
HTML returned from the URL. Whenever it has decoded some element of HTML
it sends it to your code for interpretation, via the callback you
registered with it. Your callback should override certain methods in
HTMLEditorKit.ParserCallback, and the appropriate method will be called
depending on the type of element the parser has detected.

Typically you'd declare your callback to extend
HTMLEditorKit.ParserCallback, and then override whichever methods you
wanted to be able to handle those elements. As the parser detects each
type of HTML element it calls the appropriate callback method in the
HTMLEditorKit.ParserCallback object it was passed. If you override that
method, your code can process the HTML element; if you don't, the
default action takes place (which, AFAIK, is to ignore it).

There's a simple example of how to use HTMLEditorKit.ParserCallback here:

http://www.java2s.com/Tutorial/Java/0120__Development/Usejavaxswingte...

Of course, you can write your own parser if you wish. In which case you
would need to do everything you've outlined above.


I don't know what a "callback" is.  :(

In Java-speak it would be a "listener". It's a method which you register
with some other piece of code. Under certain predefined circumstances
that other piece of code "calls back" to your code via the callback method.

What's a good book (preferably cheap) to read up on these class
libraries you mention that I never heard of? TIA.
 

Lew

Joe said:
What's a good book (preferably cheap) to read up on these class
libraries you mention that I never heard of?  TIA.

http://download.oracle.com/javase/6/docs/api/

particularly:
<http://download.oracle.com/javase/6/docs/api/javax/swing/text/html/package-frame.html>
<http://download.oracle.com/javase/6/docs/api/javax/swing/text/html/HTMLEditorKit.html>
<http://download.oracle.com/javase/6/docs/api/javax/swing/text/html/HTMLEditorKit.ParserCallback.html>

Plus the link Nigel already provided you.
 

Nigel Wade

What's a good book (preferably cheap) to read up on these class
libraries you mention that I never heard of? TIA.

Sorry, I don't know of any. When I did this I used Google to find resources.
 
