Any (preferably Java) API for screen scraping sites, able to log in and batch user actions?

Discussion in 'Java' started by onetitfemme, Sep 7, 2006.

  1. onetitfemme

    onetitfemme Guest

    Say, people would like to log into their hotmail, yahoo and gmail
    accounts and "keep an eye" on some text/part of a site
    ..
    I think something like that should be out there, since not all sites
    provide RSS feeds nor are they really interested in providing
    consistent and informative content (what we (almost) all are looking
    for).
    ..
    I have been programming mostly in Java lately. This is how I think such
    an API could (very basically indeed) be implemented:
    ..
    1. Get the HTML text.
    2. Run it through an HTML-to-XML/XHTML cleanser (Tidy nicely fits the
    bill, but I truly hate how it changes character entities however it
    sees fit, without giving you an option to leave them as you coded
    them. I haven't thoroughly checked JTidy, though).
    3. Parse the output of step 2 with a SAX parser and handle the
    callbacks it produces, based on
    4. some XPath-like metadata that is kept about the page, plus some more
    metadata on how it should be processed ...
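
A minimal sketch of steps 3 and 4, assuming step 2 has already produced well-formed XHTML. The class name `PathTextExtractor` and the slash-separated path format are hypothetical stand-ins for the "XPath-like metadata"; they are not taken from any of the tools mentioned in this thread. Only the JDK's built-in SAX parser is used.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text of every element whose path from the root equals a given path. */
class PathTextExtractor extends DefaultHandler {
    private final String targetPath;                  // hypothetical rule format, e.g. "/html/body/div/span"
    private final Deque<String> stack = new ArrayDeque<>();
    private final List<String> matches = new ArrayList<>();
    private StringBuilder current;                    // non-null while inside a matching element

    PathTextExtractor(String targetPath) {
        this.targetPath = targetPath;
    }

    /** Builds the path of the element currently being parsed, root first. */
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = stack.descendingIterator(); it.hasNext();) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(qName);
        if (current == null && currentPath().equals(targetPath)) {
            current = new StringBuilder();            // start collecting text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && currentPath().equals(targetPath)) {
            matches.add(current.toString().trim());   // matching element closed
            current = null;
        }
        stack.pop();
    }

    /** Parses the XHTML once, streaming, and returns the text of all matching elements. */
    static List<String> extract(String xhtml, String path) throws Exception {
        PathTextExtractor handler = new PathTextExtractor(path);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.matches;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><span>Inbox (3)</span></div><p>ignored</p></body></html>";
        System.out.println(extract(xhtml, "/html/body/div/span"));
    }
}
```

Because the handler only keeps a stack of open tag names, memory use stays small no matter how large the page is. Real pages would also need the entity and DOCTYPE handling that this sketch leaves out.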
    ..
    I know XPath might not be the right technology since it uses the DOM
    and it might get a little taxing when you are processing many pages ...
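
For comparison, a small sketch of the XPath route using the JDK's `javax.xml.xpath` package. As far as I know, the default implementation parses the whole input into an in-memory tree before evaluating the expression, which is the cost being worried about here. The class name `XPathLookup` and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathLookup {
    /** Evaluates an XPath expression against an XHTML string and returns the result's string value. */
    static String evaluate(String xhtml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The complete document is read before the expression is evaluated
        return xpath.evaluate(expression, new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div id=\"status\">2 new messages</div></body></html>";
        System.out.println(evaluate(xhtml, "//div[@id='status']"));
    }
}
```

The expression is far more convenient to write than a hand-rolled SAX handler, so the trade-off is mostly memory and parse time per page against ease of writing the extraction rules.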
    ..
    I recall there was some Java project called HTMLCLient, but I wonder
    what happened to it
    ..
    I think search engines use similar algorithms and I was wondering
    about how the masters do it
    ..
    Thanks
    onetitfemme
    onetitfemme, Sep 7, 2006
    #1

  2. Re: Any (preferably Java) API for screen scraping sites, able to log in and batch user actions?

    onetitfemme wrote:
    > Say, people would like to log into their hotmail, yahoo and gmail
    > accounts and "keep an eye" on some text/part of a site
    > .
    > I think something like that should be out there, since not all sites
    > provide RSS feeds nor are they really interested in providing
    > consistent and informative content (what we (almost) all are looking
    > for).
    > .
    > I have been programming mostly in Java lately. This is how I think such
    > an API could (very basically indeed) be implemented:


    > I recall there was some Java project called HTMLCLient, but I wonder
    > what happened to it
    > .
    > I think search engines use similar algorithms and I was wondering
    > about how the masters do it


    There is a long list of such software here:

    http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java/view

    Arne
    Arne Vajhøj, Sep 7, 2006
    #2

  3. onetitfemme

    Guest

    You can try SWExplorerAutomation, or SWEA (http://webunittesting.com).
    SWEA creates an object model (automation interface) for any Web
    application running in Internet Explorer. It works with DHTML
    pages, HTML dialogs, dialogs (alerts) and frames.
    SWEA is a .NET API, but you can use J# for the development.




    , Sep 7, 2006
    #3
  4. Re: Any (preferably Java) API for screen scraping sites, able to log in and batch user actions?

    onetitfemme wrote:
    > Say, people would like to log into their hotmail, yahoo and gmail
    > accounts and "keep an eye" on some text/part of a site
    > .
    > I think something like that should be out there, since not all sites
    > provide RSS feeds nor are they really interested in providing
    > consistent and informative content (what we (almost) all are looking
    > for).
    > .
    > I have been programming mostly in Java lately. This is how I think such
    > an API could (very basically indeed) be implemented:


    And then every time a provider changes the layout of its screen--then what?

    [...]

    > I recall there was some Java project called HTMLCLient, but I wonder
    > what happened to it
    > .
    > I think search engines use similar algorithms and I was wondering
    > about how the masters do it


    A search engine reads the pages it finds without knowing in advance
    what they contain or where to find the different pieces. That's very
    different from knowing in advance the structure of some page, knowing
    what you want to extract from that page, and writing a program to
    extract that information.
    Harlan Messinger, Sep 7, 2006
    #4
  5. onetitfemme

    onetitfemme Guest

    > And then every time a provider changes the layout of its screen--then what?
    otf: Well, this, as they say, is where the rubber meets the road ;-)
    ..
    I think such scraping APIs should have provisions for these cases, or
    don't they? Which of these APIs (in the long list) do that?
    ..
    I also see a way to reset the page context more or less automatically.
    If the scraper notices incompatible changes in the page, it simply
    opens the page for the fleshy, slick end users (those sinner ones, you
    know) and lets them deal with it while recording the actions they
    take ... ;-) While doing so, it transmits the information to a
    distribution server so that, after some technical supervision, the many
    other users of this scraper can update their "request contexts" for
    those pages ... This way, the people responsible for the server end
    would have to change their pages so crazily and constantly that it
    might even be counterproductive for themselves.
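
One crude way the "notices incompatible changes" part might be sketched, assuming the page has already been cleaned into well-formed XHTML: record a fingerprint of the tag structure when the extraction rules are written, and compare it on every later visit. The class and method names (`LayoutFingerprint`, `of`, `layoutChanged`) are hypothetical, and comparing the whole page is an assumption made to keep the example short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LayoutFingerprint {
    /** Returns a string describing only the tag structure of the page, ignoring text and attributes. */
    static String of(String xhtml) throws Exception {
        StringBuilder shape = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                shape.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                shape.append("</").append(qName).append('>');
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return shape.toString();
    }

    /** True when the page no longer has the structure the extraction rules were written for. */
    static boolean layoutChanged(String knownFingerprint, String xhtml) throws Exception {
        return !knownFingerprint.equals(of(xhtml));
    }

    public static void main(String[] args) throws Exception {
        String known = of("<html><body><div><span>Inbox (3)</span></div></body></html>");
        // Same structure, different text: the rules should still apply
        System.out.println(layoutChanged(known, "<html><body><div><span>Inbox (7)</span></div></body></html>"));
        // Different structure: hand the page over to the user
        System.out.println(layoutChanged(known, "<html><body><table><tr><td>Inbox (7)</td></tr></table></body></html>"));
    }
}
```

A whole-page fingerprint would trip on every extra list item or banner, so in practice one would probably fingerprint only the subtree around the watched text.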
    ..
    I think this is technically feasible and easily so, but do you see
    other issues lurking in there?
    ..
    I could imagine some people wouldn't like this kind of stuff. But I
    think, true freedom means they should be free to dump on us all their
    crud and we should be free to selectively filter in the type of crud we
    deem appropriate
    ..
    It amazes me how many people are very careful about what they eat and
    then sit for hours to watch CNN and Hollywood crap, even happily so ;-)
    ..
    otf
    onetitfemme, Sep 8, 2006
    #5