Any (preferably Java) API for screen scraping sites that can log in and batch user actions?

onetitfemme

Say, people would like to log into their hotmail, yahoo and gmail
accounts and "keep an eye" on some text/part of a site
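
For the login part, here is a minimal sketch in Java, assuming a
plain form-based login; the URL and the form field names are made up,
and real webmail logins add redirects and JavaScript on top of this:

import java.io.*;
import java.net.*;

public class FormLogin {
    public static void main(String[] args) throws IOException {
        // Keep session cookies across requests (Java 6+).
        CookieHandler.setDefault(new CookieManager());

        // Hypothetical login form; field names are placeholders.
        URL login = new URL("https://mail.example.com/login");
        HttpURLConnection conn = (HttpURLConnection) login.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded");
        String form = "user=" + URLEncoder.encode("me", "UTF-8")
                    + "&pass=" + URLEncoder.encode("secret", "UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(form.getBytes("UTF-8"));
        out.close();
        conn.getResponseCode(); // fire the request; cookies get stored

        // Later requests through the same CookieHandler stay logged in.
        URL inbox = new URL("https://mail.example.com/inbox");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(inbox.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; )
            System.out.println(line);
        in.close();
    }
}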
..
I think something like that should be out there, since not all sites
provide RSS feeds, nor are they really interested in providing
consistent and informative content (what we (almost) all are looking
for).
..
I have been mostly programming Java lately. This is how I see such an
API could, very basically indeed, be implemented:
..
1. Get the HTML text.
2. Run it through an HTML-to-XML/XHTML cleanser (tidy nicely fits the
bill, but I truly hate how it changes character entities whichever way
it sees fit without giving you an option to leave them as you coded
them. I haven't thoroughly checked JTidy, though)
3. Parse the output of step 2 using a SAX parser and handle the
callbacks it produces, based on
4. some XPath-like metadata that is kept for the page, plus some more
metadata on how it should be processed ... (a rough sketch follows)
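
Here is a minimal sketch of steps 1-4, assuming JTidy for the cleanup
and the JDK's javax.xml.xpath for the extraction; the URL and the
XPath expression are placeholders, not a real mail page:

import java.io.InputStream;
import java.net.URL;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class PageScraper {
    public static void main(String[] args) throws Exception {
        // 1. Get the HTML text (the URL is a placeholder).
        InputStream html = new URL("http://www.example.com/").openStream();

        // 2. Clean it up into well-formed XHTML with JTidy.
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(html, null);

        // 3./4. Pull out the interesting part with an XPath expression
        // kept as per-page metadata (this expression is made up).
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
                "//div[@id='inbox']//tr", doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++)
            System.out.println(hits.item(i).getTextContent());
    }
}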
..
I know XPath might not be the right technology, since it works on a
DOM, and that might get a little taxing when you are processing many
pages ...
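
If the DOM turns out to be too heavy, the same extraction can be done
in plain SAX callbacks, matching the wanted path by hand. A rough
sketch (the element name and the input file are made up):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TitleGrabber extends DefaultHandler {
    private boolean inTitle;
    private final StringBuilder text = new StringBuilder();

    public void startElement(String uri, String local, String qname,
                             Attributes atts) {
        // Track when we are inside the element we care about.
        if ("title".equals(local) || "title".equals(qname))
            inTitle = true;
    }

    public void endElement(String uri, String local, String qname) {
        if ("title".equals(local) || "title".equals(qname))
            inTitle = false;
    }

    public void characters(char[] ch, int start, int len) {
        if (inTitle) text.append(ch, start, len);
    }

    public static void main(String[] args) throws Exception {
        // "cleaned.xhtml" stands in for the tidied output of step 2.
        TitleGrabber grabber = new TitleGrabber();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("cleaned.xhtml"), grabber);
        System.out.println(grabber.text);
    }
}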
..
I recall there was some Java project called HTMLClient, but I wonder
what happened to it
..
I think search engines use similar algorithms and I was wondering
about how the masters do it
..
Thanks
onetitfemme
 
Arne Vajhøj

onetitfemme said:
Say, people would like to log into their hotmail, yahoo and gmail
accounts and "keep an eye" on some text/part of a site
.
I think something like that should be out there, since not all sites
provide RSS feeds, nor are they really interested in providing
consistent and informative content (what we (almost) all are looking
for).
.
I have been mostly programming Java lately. This is how I see such an
API could, very basically indeed, be implemented:
[...]
I recall there was some Java project called HTMLClient, but I wonder
what happened to it
.
I think search engines use similar algorithms and I was wondering
about how the masters do it

There is a long list of software here:

http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java/view

Arne
 
alex_f_il

You can try SWExplorerAutomation (SWEA, http://webunittesting.com).
SWEA creates an object model (automation interface) for any Web
application running in Internet Explorer. SWEA works with DHTML
pages, HTML dialogs, alerts, and frames.
SWEA is a .NET API, but you can use J# for the development.
 
Harlan Messinger

onetitfemme said:
Say, people would like to log into their hotmail, yahoo and gmail
accounts and "keep an eye" on some text/part of a site
.
I think something like that should be out there, since not all sites
provide RSS feeds, nor are they really interested in providing
consistent and informative content (what we (almost) all are looking
for).
.
I have been mostly programming Java lately. This is how I see such an
API could, very basically indeed, be implemented:

And then every time a provider changes the layout of its screen--then what?

[...]
I recall there was some Java project called HTMLClient, but I wonder
what happened to it
.
I think search engines use similar algorithms and I was wondering
about how the masters do it

Search engines read the pages they find without knowing in advance
what they contain or where to find the different pieces. That's very
different from knowing in advance the structure of some page, knowing
what you want to extract from that page, and writing a program to
extract that information.
 
onetitfemme

Harlan Messinger said:
And then every time a provider changes the layout of its screen--then what?
otf: well, this, as they say, is where the rubber meets the road ;-)
..
I think such scraping APIs should have provisions for these cases, or
don't they? Which of the APIs in that long list do?
..
I also see a way to reset the page context in a more or less automatic
way. If the scraper notices incompatible changes in the page, it
simply opens the page to the fleshy, slick end users (those sinner
ones, you know) and lets them deal with it, while detecting the
actions the user took ... ;-) and while doing so it transmits that
information to a distribution server, so the many other users of this
scraper and its page contexts can update their "request contexts"
after some technical supervision ... This way the people responsible
for the server end would have to change their pages so constantly and
crazily that it might even be counterproductive to themselves
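
A hypothetical guard for the "incompatible change" detection,
assuming the per-page extraction rule is stored as an XPath
expression as sketched earlier in the thread:

import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ContextGuard {
    // Returns true when the stored expression no longer matches,
    // i.e. the page layout has presumably changed.
    public static boolean isStale(Document page, String storedXPath)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
                storedXPath, page, XPathConstants.NODESET);
        return hits.getLength() == 0;
    }
}

A real version would want to sanity-check the shape of what matched,
not just that something matched, before trusting the result or
declaring the context stale and kicking it over to a human.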
..
I think this is technically feasible and easily so, but do you see
other issues lurking in there?
..
I could imagine some people wouldn't like this kind of stuff. But I
think true freedom means they should be free to dump all their crud
on us, and we should be free to selectively filter in the type of
crud we deem appropriate
..
It amazes me how many people are very careful about what they eat and
then sit for hours watching CNN and Hollywood crap, even happily so ;-)
..
otf
 
