Extract data from XHTML

Damo_Suzuki

Hi,
I am extracting data from an HTML document. I used JTidy to convert it
to XHTML. Now that I have the XHTML, how can I extract data from it?
Say I wanted to extract a node with the tag <h2 class="r">.......</h2>.
Does anyone know how to do this, or have sample code that does it? I've
been banging my head against a brick wall for a few days now trying to
get this working.
Thanks
 
Flo 'Irian' Schaetz

Damo_Suzuki said:
I am extracting data from an HTML document. I used JTidy to convert it
to XHTML. Now that I have the XHTML, how can I extract data from it?

As a valid XHTML document is well-formed XML, you should be able to parse
it with either a DOMParser or a SAXParser. Searching for them on Google
should bring up plenty of examples of how to use them.

Flo
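
To make the DOM route concrete, here is a minimal sketch using only the JDK's own DocumentBuilder and XPath classes. The class name H2Extractor, the method extractH2 and the sample markup are made up for illustration, and the sketch assumes the XHTML is already available as a String. The XPath matches on local-name() because XHTML produced by a tidying step usually carries the XHTML namespace, which a plain //h2 would not match on a namespace-aware parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of JTidy or the JDK.
public class H2Extractor {

    // Returns the text of every <h2 class="r"> element in the given XHTML.
    public static List<String> extractH2(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // local-name() ignores the namespace, so this matches h2 with or
            // without xmlns="http://www.w3.org/1999/xhtml" on the root.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[local-name()='h2' and @class='r']",
                    doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also collects text from child elements such as <a>.
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the tidied page.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h2 class=\"r\">First</h2>"
                + "<h2>Skipped</h2>"
                + "<h2 class=\"r\"><a href=\"#\">Second</a></h2>"
                + "</body></html>";
        System.out.println(extractH2(sample));
    }
}
```

One thing to watch for: if the document carries an XHTML DOCTYPE, the parser may try to fetch the DTD over the network, which can be slow or fail. Setting an EntityResolver on the DocumentBuilder is one way to control that.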
 
Damo_Suzuki

Hi,
Now that it's in XHTML, can I use DocumentBuilder to extract data from
it? I don't want to write the XHTML to a file. My code looks like this:

tidy.parse(in, System.out);

DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(XXXXXXXXXX);

In the tidy.parse call, 'in' is the document I want to extract data
from. It is fetched straight off the web, "JTidied" and output to the
console. Can I somehow use that output as the parameter, where all the
X's are, for the DocumentBuilder parse method?
Thanks
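
One possible way to avoid the file, sketched under assumptions: send the tidied output to a ByteArrayOutputStream instead of System.out, then hand the buffered bytes to DocumentBuilder.parse as a ByteArrayInputStream. The class InMemoryParse and the XhtmlWriter interface below are hypothetical stand-ins so the example runs without JTidy on the classpath. In the real code, the lambda body would be the existing tidy.parse(in, out) call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of JTidy or the JDK.
public class InMemoryParse {

    // Stand-in for any step that writes XHTML to a stream,
    // for example: out -> tidy.parse(in, out)
    public interface XhtmlWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    // Buffers the written XHTML in memory and parses it into a DOM Document.
    public static Document toDocument(XhtmlWriter writer) {
        try {
            // Capture the output in memory instead of the console or a file.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writer.writeTo(buffer);

            DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
            domFactory.setNamespaceAware(true);
            DocumentBuilder builder = domFactory.newDocumentBuilder();

            // This InputStream is what goes where the X's are.
            return builder.parse(new ByteArrayInputStream(buffer.toByteArray()));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse buffered XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up XHTML standing in for the tidied page.
        Document doc = toDocument(out -> out.write(
                "<html><body><h2 class=\"r\">Hello</h2></body></html>"
                        .getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

It may also be worth checking the JTidy Javadoc for a parseDOM method on the Tidy class. As far as I know it returns an org.w3c.dom.Document directly, which could make the second parse unnecessary.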
 
