Suggestions on Parsing HTML

  • Thread starter burgermeister01

burgermeister01

Hi,

I'm working on a school project, and I was hoping to get some
suggestions from the group. As part of a project I need a program to be
able to go to dictionary.com and look up a word a user specifies and
return the definition. So far I've figured out how to pull data from a
URL, and I can get a page's HTML code no problem. The next step,
displaying the text, is what is giving me a problem. How can I rip out
just the HTML that I want and leave all the rest behind? So far, my
best idea is just to use some very clever and meticulous text parsing,
but that seems tedious and unreliable (what if dictionary.com makes a
change to their HTML code?). Is there an easier way that I don't know
of? Keep in mind that I have to be able to display this text to a
command line and a GUI so if Java has some kind of built-in HTML
reader, that would only half work.
 
Jon Martin Solaas

There is a simple HTML parser in Swing (of all places ...). More
advanced ones surely exist, but this one is easy to use and ships in the
runtime library.

http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html

If the web pages change you're still stuck. Maybe the site has some
interface for third parties?
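
For anyone who wants to see roughly what using the Swing parser looks like, here is a minimal sketch. It uses ParserDelegator and HTMLEditorKit.ParserCallback from the runtime library; the class name SwingTextExtractor, the method extractText, and the sample HTML are made up for illustration and are not taken from the linked article. It only collects the text runs and ignores the tags, so picking out the definition itself would still need extra logic.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
class SwingTextExtractor {

    /** Returns the text runs found in the HTML, one per line, with tags dropped. */
    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();

        // The parser calls handleText for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append('\n');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample markup invented for the example, not real dictionary output.
        String html = "<html><body><h1>dogma</h1>"
                + "<p>A principle laid down as true.</p></body></html>";
        System.out.print(extractText(html));
    }
}
```

Since the result is a plain String, the same method can feed both a command-line printout and a GUI text component.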
 
burgermeister01

Thanks, that library looks as though it's really going to make my life
easier. Also, just in case by some fluke someone is looking to do the
same thing as me, it seems as though Merriam-Webster's website is
easier to work with. I also expect String.split to be useful, in
addition to indexOf, etc.
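
A rough sketch of the indexOf/split idea, in case it helps someone. The markers and the sample markup are invented for the example (I'm not claiming any dictionary site uses them), and the class name MarkerExtractor and the helper between are hypothetical.

```java
// Hypothetical helper class, named for this example only.
class MarkerExtractor {

    /**
     * Returns the text between the first occurrence of start and the next
     * occurrence of end, or null if either marker is missing.
     */
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = text.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // Invented markup; a real page will look different.
        String html = "<div class=\"def\">1. a doctrine; 2. a settled belief</div>";
        String body = between(html, "<div class=\"def\">", "</div>");
        if (body != null) {
            // Split the senses on ';' and print one per line.
            for (String sense : body.split(";")) {
                System.out.println(sense.trim());
            }
        }
    }
}
```

This is exactly the fragile kind of parsing I was worried about: if the site renames the marker, between returns null, so the caller has to handle that case.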
 
Oliver Wong

If you're open to alternative dictionaries, look for one with an open
API. I know Gnome has a widget that allows you to place a dictionary in the
toolbar. You might want to find out which API they're using and use it as
well. You might be able to avoid dealing with HTML altogether if you use an
API (you'd be dealing with XML instead), and the service provider is less
likely to change the HTML formatting if they've published the API openly.
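
To give an idea of what "dealing with XML instead" might look like, here is a small sketch using the DOM parser that ships with Java. The response format (an entry element containing definition elements) is purely an assumption for illustration; a real service would document its own schema. The class name XmlDefinitionReader and the method readDefinitions are also made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class XmlDefinitionReader {

    /** Returns the text of every definition element in the given XML. */
    static List<String> readDefinitions(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // "definition" is an assumed element name, not a real API's schema.
            NodeList nodes = doc.getElementsByTagName("definition");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented response body for the example.
        String xml = "<entry word=\"dogma\">"
                + "<definition>a doctrine</definition>"
                + "<definition>a settled belief</definition>"
                + "</entry>";
        for (String definition : readDefinitions(xml)) {
            System.out.println(definition);
        }
    }
}
```

Looking elements up by name tends to survive layout changes better than searching the HTML for fixed strings.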

Another thing you might try is using the Google websearch API. In
"normal" Google, if you prefix a search query with "define:", you'll get the
definition of the word, instead of pages which contain the word as keywords.
E.g. "define:dogma" gives you definitions for the word "dogma". Maybe this
facility is also accessible via Google's search API. The API developer kit
contains sample programs in Java.

http://www.google.com/apis/

- Oliver
 
