parsing HTML

Drew

Hi All:

I'm working on a mini HTML parser. Basically, what I need to do is
take an HTML file and parse through it. I want to pick out all of the
text that is between table data tags <td> and </td> and all of the
text between list item tags <li> and </li>.

Since it's possible that a line of HTML could have no spaces at all,
like the one below:

<tr><td>SomeFixture</td></tr>

I'm thinking that I'm going to need to read the HTML file one line at
a time, then look for < and its closing >. If the text between the
two is td or li, then start capturing text at the location of > + 1
and do that until I hit another < with /td (or /li) after it.

Does this sound reasonable? Or am I coming up with too difficult a
solution? Does Java have any built-in HTML parsing methods that make
this easier?

Or even if there's an existing Java program that I could modify for
this, that's great too.

Any help is appreciated!

Drew
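
For illustration, the scan described in the post above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the class and method names (TagTextScanner, textBetween, indexOfIgnoreCase) are made up for this example, the whole document is assumed to be held in one String rather than read line by line, and there is no handling of nested tables, comments, scripts, or a missing closing tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library: a naive tag scan.
class TagTextScanner {

    /** Returns the raw text between each <tag ...> and the next </tag. */
    static List<String> textBetween(String html, String tag) {
        List<String> found = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag;
        int pos = 0;
        while (true) {
            int start = indexOfIgnoreCase(html, open, pos);
            if (start < 0) break;
            int nameEnd = start + open.length();
            if (nameEnd >= html.length()) break;
            char c = html.charAt(nameEnd);
            // Skip longer names such as <link> when looking for <li>.
            if (c != '>' && !Character.isWhitespace(c)) {
                pos = nameEnd;
                continue;
            }
            int gt = html.indexOf('>', nameEnd);              // end of the opening tag
            if (gt < 0) break;
            int end = indexOfIgnoreCase(html, close, gt + 1); // start of the closing tag
            if (end < 0) break;
            found.add(html.substring(gt + 1, end));           // text starts at '>' + 1
            pos = end + close.length();
        }
        return found;
    }

    /** Case-insensitive indexOf, so <TD> and <td> are treated alike. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // prints [SomeFixture]
        System.out.println(textBetween("<tr><td>SomeFixture</td></tr>", "td"));
    }
}
```

Working on the whole string instead of one line at a time sidesteps the case where a cell's text spans several lines.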
 
Thomas Weidenfeller

Drew said:
I'm thinking that I'm going to need to read the HTML file one line at
a time.

One character at a time, if you want to build a real parser.

Drew said:
Does this sound reasonable? Or am I coming up with too difficult of a
solution.

In the general case, unless your HTML is laid out in a particular simple
way that is well known to you, your solution is too simple,
not too difficult. A real-world HTML parser is a tricky thing, because
it has to deal with different HTML standards and all kinds of common
HTML errors that page designers like to make.

Drew said:
Does Java have any built in HTML parsing methods that make
this easier?

Yes, but the parser is limited. See the FAQ in my sig for more information.

Drew said:
Or even if there's an existing Java program that I could modify for
this, that's great too.

http://htmlparser.sourceforge.net/ gets recommended from time to time. I
have no experience with it.

/Thomas
 
TechBookReport

If it were me, I'd consider using either regular expressions or XSL; both
are well supported in Java.

Pan
=========================================================================
TechBookReport Java http://www.techbookreport.com/JavaIndex.html
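
As a rough illustration of the regular-expression route suggested above, here is a minimal sketch using java.util.regex. The class name CellRegex and the exact pattern are assumptions made for this example, not something from the thread. It assumes reasonably tidy HTML: every <td> and <li> has an explicit closing tag, and cells are not nested inside each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the text out of <td>...</td> and <li>...</li>.
class CellRegex {
    // Group 1: tag name (td or li).
    // Group 2: everything up to the first matching closing tag (reluctant match).
    private static final Pattern CELL = Pattern.compile(
            "<(td|li)\\b[^>]*>(.*?)</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // DOTALL lets text span lines

    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            out.add(m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [SomeFixture]
        System.out.println(extract("<tr><td>SomeFixture</td></tr>"));
    }
}
```

Any markup inside a cell (for example <b>) is returned as part of the captured text, so a second pass would be needed to strip it.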
 
Hal Rosser

Drew said:
I'm thinking that I'm going to need to read the HTML file one line at
a time. Then look for < and its closing >. If the text between the
two is td or li, then start capturing text at the location of > + 1
and do that until I hit another < with at /td after it.

Does this sound reasonable? Or am I coming up with too difficult of a
solution. Does Java have any built in HTML parsing methods that make
this easier?

You would probably need to read it one character at a time.
But as another poster mentioned, regex may be a good alternative.
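
To make the one-character-at-a-time idea concrete, here is a small sketch of a two-state scanner (inside a tag / outside a tag). The names CharScanner, extract and tagName are invented for this example. It assumes well-behaved input: no '>' inside attribute values, no comments or scripts, and no <td> or <li> nested inside another one.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: reads one character at a time and collects cell text.
class CharScanner {

    static List<String> extract(Reader in) throws IOException {
        List<String> out = new ArrayList<>();
        StringBuilder tag = new StringBuilder();  // characters of the tag being read
        StringBuilder text = new StringBuilder(); // text of the current cell or item
        boolean inTag = false;
        boolean capturing = false;
        int ch;
        while ((ch = in.read()) != -1) {
            char c = (char) ch;
            if (inTag) {
                if (c != '>') {
                    tag.append(c);
                    continue;
                }
                inTag = false; // reached the closing '>' of this tag
                String name = tagName(tag);
                if (name.equals("td") || name.equals("li")) {
                    capturing = true;
                    text.setLength(0);
                } else if (capturing && (name.equals("/td") || name.equals("/li"))) {
                    out.add(text.toString());
                    capturing = false;
                }
                // Any other tag (such as <b>) is dropped; its text is still captured.
            } else if (c == '<') {
                inTag = true;
                tag.setLength(0);
            } else if (capturing) {
                text.append(c);
            }
        }
        return out;
    }

    /** Convenience overload for in-memory HTML. */
    static List<String> extract(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lower-cased tag name up to the first whitespace, e.g. "td" or "/li". */
    private static String tagName(StringBuilder tag) {
        int end = 0;
        while (end < tag.length() && !Character.isWhitespace(tag.charAt(end))) end++;
        return tag.substring(0, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints [SomeFixture]
        System.out.println(extract("<tr><td>SomeFixture</td></tr>"));
    }
}
```

Because the scanner works on a Reader, line breaks in the file make no difference; for a real file, wrapping a FileReader in a BufferedReader would keep the per-character reads cheap.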
 
Marcin Grunwald


There is already an HTML parser in the JDK; maybe try it before you write your own.
Start by checking this:
javax.swing.text.html.parser.DocumentParser
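
One common way to drive the JDK's built-in parser is through javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback, rather than using DocumentParser directly. The sketch below takes that route, which is an assumption on my part about what is most convenient here; the class name SwingCellExtractor is made up for the example. As noted earlier in the thread, this parser is limited (it targets old HTML versions), so results on modern or badly broken pages may differ from what a browser shows.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: collects the text found inside <td> and <li> elements.
class SwingCellExtractor extends HTMLEditorKit.ParserCallback {
    private final List<String> cells = new ArrayList<>();
    private StringBuilder current; // non-null while inside a <td> or <li>

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TD || t == HTML.Tag.LI) {
            current = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (current != null) {
            current.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if ((t == HTML.Tag.TD || t == HTML.Tag.LI) && current != null) {
            cells.add(current.toString());
            current = null;
        }
    }

    static List<String> extract(Reader html) throws IOException {
        SwingCellExtractor callback = new SwingCellExtractor();
        // Third argument: true means charset declarations in the page are ignored.
        new ParserDelegator().parse(html, callback, true);
        return callback.cells;
    }

    /** Convenience overload for in-memory HTML. */
    static List<String> extract(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<html><body><table><tr><td>SomeFixture</td></tr></table></body></html>"));
    }
}
```

The callback approach means the parser deals with tag syntax and case, and the extraction logic only has to track whether it is currently inside a cell or list item.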
 
