parsing HTML

Drew · Feb 28, 2005

Hi All:

I'm working on a mini HTML parser. Basically, what I need to do is
take an HTML file and parse through it. I want to pick out all of the
text that is between table data tags <td> and </td> and all of the
text between list item tags <li> and </li>.

Since it's possible that a line of HTML could have no spaces at all,
like the one below:

<tr><td>SomeFixture</td></tr>

I'm thinking that I'm going to need to read the HTML file one line at
a time. Then look for < and its closing >. If the text between the
two is td or li, then start capturing text at the location of > + 1
and do that until I hit another < with a /td after it.

Does this sound reasonable? Or am I coming up with too difficult a
solution? Does Java have any built-in HTML parsing methods that make
this easier?

Or even if there's an existing Java program that I could modify for
this, that's great too.

Any help is appreciated!

Drew

Thomas Weidenfeller · Feb 28, 2005

Drew said:
I'm thinking that I'm going to need to read the HTML file one line at
a time.

One character at a time, if you want to build a real parser.

Does this sound reasonable? Or am I coming up with too difficult of a
solution.

In the general case, unless your HTML is laid out in a particular simple
way that is well known to you, your solution is too simple, not too
difficult. A real-world HTML parser is a tricky thing, because it has to
deal with different HTML standards and all kinds of common HTML errors
that page designers tend to make.

Does Java have any built in HTML parsing methods that make
this easier?

Yes, but the parser is limited. See the FAQ in my sig for more information.

Or even if there's an existing Java program that I could modify for
this, that's great too.

http://htmlparser.sourceforge.net/ gets recommended from time to time. I
have no experience with it.

/Thomas

TechBookReport · Feb 28, 2005

If it were me, I'd consider using either regular expressions or XSL, both
of which are well supported in Java.
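
To make the regular-expression route concrete, here is a minimal sketch using java.util.regex. The class name CellTextRegex and the pattern are only illustrative assumptions, and the approach is only safe for simple, well-behaved markup: it will not cope with nested tables or with <li> items whose closing tag is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class CellTextRegex {
    // Matches <td ...>text</td> or <li ...>text</li>, ignoring case and
    // letting "." span line breaks. The reluctant ".*?" stops at the first
    // matching close tag, so nested td/li elements are NOT handled.
    private static final Pattern CELL = Pattern.compile(
            "<(td|li)\\b[^>]*>(.*?)</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            // Drop any inline tags (e.g. <b>) left inside the captured text.
            out.add(m.group(2).replaceAll("<[^>]*>", "").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<tr><td>SomeFixture</td></tr><ul><li>One</li></ul>";
        System.out.println(extract(html)); // prints [SomeFixture, One]
    }
}
```

Because the pattern works on the whole string rather than on individual lines, it does not matter whether the HTML contains spaces or line breaks.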

Pan
=========================================================================
TechBookReport Java http://www.techbookreport.com/JavaIndex.html

Hal Rosser · Mar 1, 2005

I'm thinking that I'm going to need to read the HTML file one line at
a time. Then look for < and its closing >. If the text between the
two is td or li, then start capturing text at the location of > + 1
and do that until I hit another < with at /td after it.

Does this sound reasonable? Or am I coming up with too difficult of a
solution. Does Java have any built in HTML parsing methods that make
this easier?

You would probably need to read it one char at a time.
But as another poster mentioned, regex may be a good alternative.
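
For what it's worth, reading one character at a time could look roughly like the sketch below. It is not a real HTML parser: the class name TagScanner and its methods are made up for illustration, and it ignores comments, scripts, entities, attribute values containing ">", and nested td/li elements.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner; all names here are illustrative only.
class TagScanner {
    // Reads one character at a time and collects the text inside td/li.
    static List<String> extract(Reader in) throws IOException {
        List<String> results = new ArrayList<>();
        StringBuilder tag = new StringBuilder();  // contents of the tag being read
        StringBuilder text = new StringBuilder(); // text captured inside td/li
        boolean inTag = false;     // true between '<' and '>'
        boolean capturing = false; // true while inside an open td or li
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (ch == '<') {
                inTag = true;
                tag.setLength(0);
            } else if (inTag && ch == '>') {
                inTag = false;
                String name = tagName(tag.toString());
                boolean opens = name.equals("td") || name.equals("li");
                boolean closes = name.equals("/td") || name.equals("/li");
                if ((opens || closes) && capturing) {
                    // A close tag, or a new open tag after an unclosed
                    // one, ends the current capture.
                    results.add(text.toString().trim());
                }
                if (opens || closes) {
                    capturing = opens;
                    text.setLength(0);
                }
            } else if (inTag) {
                tag.append(ch);
            } else if (capturing) {
                text.append(ch);
            }
        }
        return results;
    }

    // Convenience overload for in-memory HTML.
    static List<String> extract(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lower-cases the tag and cuts it at the first whitespace,
    // so "TD class=x" becomes "td".
    private static String tagName(String raw) {
        String s = raw.trim().toLowerCase();
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String html = "<tr><td>SomeFixture</td></tr><ul><li class=\"a\">One</li></ul>";
        System.out.println(extract(html)); // prints [SomeFixture, One]
    }
}
```

To read from a file instead, pass a BufferedReader wrapped around a FileReader to extract(Reader).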

Marcin Grunwald · Mar 1, 2005

There is already an HTML parser in the JDK; maybe try it before you write
your own. Start by checking this:
javax.swing.text.html.parser.DocumentParser
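
A rough sketch of how the JDK parser might be used for this task follows. It goes through javax.swing.text.html.parser.ParserDelegator, the public entry point that hands the work to DocumentParser, and receives events through an HTMLEditorKit.ParserCallback. The class name SwingCellExtractor and the capture logic are assumptions for illustration only. Nested td/li elements are not handled, and the parser itself only understands older HTML.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical wrapper around the JDK's Swing HTML parser.
class SwingCellExtractor {
    static List<String> extract(Reader in) throws IOException {
        final List<String> results = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean capturing = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TD || t == HTML.Tag.LI) {
                    capturing = true;
                    text.setLength(0);
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (capturing) {
                    text.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if ((t == HTML.Tag.TD || t == HTML.Tag.LI) && capturing) {
                    results.add(text.toString().trim());
                    capturing = false;
                }
            }
        };
        // The third argument (ignoreCharSet = true) tells the parser not to
        // fail when the document declares a different character set.
        new ParserDelegator().parse(in, callback, true);
        return results;
    }

    // Convenience overload for in-memory HTML.
    static List<String> extract(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><table><tr><td>SomeFixture</td></tr></table>"
                + "<ul><li>One</li></ul></body></html>";
        System.out.println(extract(html));
    }
}
```

The callback style means you never touch individual characters or angle brackets yourself; the parser reports start tags, text, and end tags as it finds them.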
