Web scraping from Java

Tom N

I have some existing web scraping Java code which uses an old
sourceforge project called jacobie (now an orphan and only supports
IE6).

I'm looking for a recommended open-source replacement, one that is
popular and easy to use (popular means unlikely to become an orphan).

My existing code uses Jacobie to navigate the web, and then has ad-hoc
HTML parsing code in Java. Jacobie is a Java class library that uses
the jacob project to drive Internet Explorer. Jacobie is sparsely
documented and any attempts by me to understand or modify it become
bogged down in the inherently un-understandable Windows/IE interface -
yuk.

The web navigation required by my code is fairly straightforward and
will have to be rewritten for a new library (no big deal) but I don't
want to rewrite the parsing so I need something that will allow easy
access to the raw HTML.
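
One way to keep the existing parser untouched is to put a small seam between navigation and parsing, so that whichever library ends up doing the browsing only has to hand back a String of raw HTML. The following is a minimal sketch of that idea, not code from jacobie or any of the libraries mentioned; the names PageSource, CannedPageSource and parseTitle are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ScrapeSketch {

    /** Hypothetical seam: anything that can navigate to a URL and return its raw HTML. */
    interface PageSource {
        String rawHtml(String url) throws Exception;
    }

    /** Stand-in for existing ad-hoc parsing code; it only ever sees a String. */
    static String parseTitle(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return null; // no title element found
        }
        return html.substring(start + "<title>".length(), end).trim();
    }

    /** Hypothetical test double that serves canned pages instead of browsing. */
    static class CannedPageSource implements PageSource {
        private final Map<String, String> pages = new HashMap<>();

        void put(String url, String html) {
            pages.put(url, html);
        }

        @Override
        public String rawHtml(String url) {
            String html = pages.get(url);
            if (html == null) {
                throw new IllegalArgumentException("no canned page for " + url);
            }
            return html;
        }
    }

    public static void main(String[] args) throws Exception {
        CannedPageSource source = new CannedPageSource();
        source.put("http://example.com/",
                "<html><head><title> Example </title></head></html>");
        // The parser does not know or care where the HTML came from.
        System.out.println(parseTitle(source.rawHtml("http://example.com/")));
    }
}
```

With this shape, swapping the navigation library means writing one new PageSource implementation, and the parsing code can be exercised against saved pages without a browser.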

On the other hand, for the future, something that allows some sort of
scriptable parsing of the web content would be good.

I'd prefer to drive an actual browser rather than using a virtual
browser (or the option of both would be good).

Currently thinking that JWebUnit[1] looks like a good candidate.

JWebUnit drives HtmlUnit[2] ("GUI-Less browser for Java programs").
There is a work-in-progress to provide an HtmlUnit interface for
Selenium[3] (front end for multiple real browsers, Firefox and IE
included) - unclear how usable this currently is.

I also came across webdriver [9], which sounds similar in concept.
It seems to support IE, Firefox and HtmlUnit. I get the feeling that it
is a less broadly supported project than JWebUnit.

I don't know if these approaches support any kind of scriptable parsing.
Perhaps that is a separate issue because I could easily use a completely
separate tool to parse the HTML once navigated and retrieved.
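
As an illustration of what "scriptable" parsing as a separate step might look like, here is a minimal sketch using the XPath [8] support that ships with the JDK (javax.xml.xpath). It assumes the retrieved page has already been turned into well-formed XML with no namespace declared, which real-world HTML usually is not without a clean-up pass; the class and method names (XPathSketch, select) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    /**
     * Evaluates an XPath expression against well-formed markup and returns
     * the text content of every matching node.
     * Assumption: the input parses as XML and declares no namespace.
     */
    static List<String> select(String xhtml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href='/a'>First</a><a href='/b'>Second</a>"
                + "</body></html>";
        // The expression is plain data, so it could live in a config file.
        System.out.println(select(page, "//a/@href"));
    }
}
```

Because the expression is just a string, the extraction rules can be changed without recompiling, which is the main attraction over hand-written parsing code.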

Having a quick read of the web page for Web-Harvest[4] suggests it may
be a good avenue for future parsing with less pain than current ad-hoc
code.

Any comments/suggestions?

Browsers:
Firefox seems to be the most likely host browser - I have no particular
need to use any specific browser (other than it being the latest
version of that browser).
Currently I am using IE6 with jacobie. Also have installed Firefox
(latest), Chrome (latest) and Opera (latest). Obviously, IE6 is getting
long in the tooth with IE8 out now. I have not upgraded to IE7 or IE8
because I am not sure whether jacobie will work with them. At some
stage, I'd like to move from Win XP to Windows 7 but I don't want to use
anything proprietary so moving to Windows 7 should not be an issue
(apart from Win7 likely not supporting IE6 or IE7).

Development tools: NetBeans 6.5
Platform: Windows XP.

[1] http://jwebunit.sourceforge.net/
[2] http://htmlunit.sourceforge.net/
[3] http://seleniumhq.org/projects/remote-control/
[4] http://web-harvest.sourceforge.net/
[5] http://simile.mit.edu/wiki/Solvent
[6] http://simile.mit.edu/wiki/Piggy_Bank
[7] http://en.wikipedia.org/wiki/Xquery
[8] http://en.wikipedia.org/wiki/Xpath
[9] http://code.google.com/p/webdriver/
 
Roedy Green

> I have some existing web scraping Java code which uses an old
> sourceforge project called jacobie (now an orphan and only supports
> IE6).

I do it with the http package, see
http://mindprod.com/products1.html#HTTP

Examples are in http://mindprod.com/products1.html#SUBMITTER
and
http://mindprod.com/products1.html#AMERICANTAX

I use indexOf and regexes to find what I want in the HTML.

There is a tool called TagSoup for helping clean up invalid HTML.
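
The indexOf-and-regex style described above might look roughly like the sketch below. It is not taken from the mindprod.com packages linked in this post; the class name, helper names and the sample markup are assumptions made for illustration, and the pattern only handles double-quoted href attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrapeSketch {

    /** Returns the text between the first startMarker and the next endMarker, or null. */
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    // Matches <a ... href="..."> and captures the attribute value.
    // Assumption: values are double-quoted; other quoting styles are ignored.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Collects every double-quoted href value found in anchor tags. */
    static List<String> hrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Price: <b>42.50</b></p>"
                + "<a href=\"/one\">1</a> <A class=\"x\" HREF=\"/two\">2</A>";
        System.out.println(between(html, "<b>", "</b>"));
        System.out.println(hrefs(html));
    }
}
```

This kind of code is quick to write but tied to the exact markup of the page, so it tends to need adjusting whenever the site changes its layout.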
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
~ Noam Chomsky
 
Tom Anderson

> Currently thinking that JWebUnit[1] looks like a good candidate.
>
> JWebUnit drives HtmlUnit[2] ("GUI-Less browser for Java programs").

I've used HtmlUnit a lot. It's good, and not only does it surf the web, it
does all the HTML parsing for you, and makes extracting information from
the pages very easy. I'd use that, and not bother with a layer on top of
it. It's not a 'real' browser, but I suspect that doesn't matter for you.
If it does, Selenium or WebDriver would be your choices, although I can't
comment on how you extract information from those.

tom
 