Accessing attributes in HTML with DOM

Damo

Hi,
I'm trying to extract text from an HTML page using DOM. I ran JTidy
on it first. The HTML itself is not very descriptive: there are no
standout tags around the text I need to extract. The way I was
thinking of doing it was by accessing the attributes, but I keep getting a
NullPointerException. This is the HTML:


<div class="mb16">
<div id="r_t0" class="prel">
<a id="r0_t" class="L4"href="http://java.sun.com/"">
<b>Java</b> Technology</a></div>
<div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
extensions, news, tutorials, and product information.</div>
<div id="r_b0" class="prel T11"><a id="r0_b"
href="http://java.sun.com/">
<img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0"class="bb"
/></a>
<span id="r0_u" class="T10">java.sun.com/</span>
<strong>&middot;</strong> <a class="L5 nw"
href="http://www.askcache.com">
Cached</a> 1f40 <strong>&middot;</strong>
<a class="L5 L5V" href="javascript:void(0)">Save</a>
</div>
</div>


This is the part I want to skip to in order to extract text. It's buried in loads
of other HTML. Can anyone please help me do this?
 
Daniel Pitts

The example HTML is a good start. Perhaps you should consider giving us
the code that produces the NPE, and what you expect the output to be.
Also, if it's a valid XML document, perhaps you should consider using
XPath. It helps select data based on the path to that data (including
selections based on element names, attributes, order, etc.).
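To make the XPath suggestion concrete, here is a minimal sketch using the standard javax.xml.xpath API. It rests on several assumptions: the class name XPathExtract and the sample markup are made up for illustration (the markup is a simplified, well-formed stand-in for the posted fragment), the document is parsed with the stock JAXP parser rather than handed over as a JTidy DOM, and the elements carry no XHTML namespace. If any of those do not hold for the real page, the expressions would need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathExtract {

    // Returns the whitespace-normalized text of every node the expression matches.
    static List<String> texts(Document doc, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            // getTextContent() joins all descendant text, so nested <b> tags vanish.
            out.add(hits.item(i).getTextContent().trim().replaceAll("\\s+", " "));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for the posted HTML (an assumption, not the real page).
        String xml = "<html><body><div class=\"mb16\">"
                + "<div id=\"r_t0\" class=\"prel\">"
                + "<a id=\"r0_t\" class=\"L4\" href=\"http://java.sun.com/\">"
                + "<b>Java</b> Technology</a></div>"
                + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>.</div>"
                + "<div id=\"r_b0\" class=\"prel T11\">"
                + "<span id=\"r0_u\" class=\"T10\">java.sun.com/</span></div>"
                + "</div></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // One path per wanted string: the title link, the description, the URL span.
        String expr = "//div[@class='mb16']/div[1]/a"
                + " | //div[@class='mb16']/div[@class='T1']"
                + " | //div[@class='mb16']//span[@class='T10']";
        for (String s : texts(doc, expr)) {
            System.out.println(s);
        }
    }
}
```

Note that @class='mb16' is an exact string comparison, so it would not match an element whose class attribute lists several names.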
 
Damo

If I can get at the first div, I can get its child nodes. How would one
use XPath to get it?
The code below is what I have:




NodeList sections = document.getElementsByTagName("div");
for (int i = 0; i < sections.getLength(); i++)
{
    Element section = (Element) sections.item(i);

    Attr attr = (Attr) section.getAttributeNode("class");
    boolean wasSpecified = attr != null && attr.getSpecified();

    String at = attr.getValue();
    if (at == "mb16")
    {
        // I have a recursive method to get the text nodes for here
        // if I can get at the child nodes of that particular div
    }
}
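A likely source of the NullPointerException in the loop above: getAttributeNode("class") returns null for any div that has no class attribute, and attr.getValue() is then called regardless (wasSpecified is computed but never consulted). The sketch below is one possible way around that, not a definitive fix. It uses Element.getAttribute, which the DOM specification says returns an empty string for a missing attribute, and compares with equals. The class and method names (ClassFilter, hasClass, divsWithClass) are hypothetical, and matching on whitespace-separated class tokens is a design choice added here so that class="prel T11" also counts as "prel".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClassFilter {

    // True if the element's class attribute contains the given name as one token.
    static boolean hasClass(Element e, String name) {
        // Per the DOM spec this is "" when the attribute is absent.
        String value = e.getAttribute("class");
        if (value == null) {
            return false; // defensive, in case a DOM implementation deviates
        }
        for (String token : value.trim().split("\\s+")) {
            if (token.equals(name)) { // equals compares contents, == compares references
                return true;
            }
        }
        return false;
    }

    // Collects every <div> in the document that carries the given class name.
    static List<Element> divsWithClass(Document doc, String name) {
        List<Element> out = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (hasClass(div, name)) {
                out.add(div);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up document: one div without a class, two with.
        String xml = "<body><div/><div class=\"mb16\"><div class=\"prel T11\"/></div></body>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(divsWithClass(doc, "mb16").size());
        System.out.println(divsWithClass(doc, "prel").size());
    }
}
```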
 
Damo

Oh and the output I want is

Java Technology

Sun's home for Java Offers
Windows, Solaris, and Linux Java Development Kits (JDKs),
extensions, news, tutorials, and product information.

java.sun.com/

all stored as 3 different Strings
 
Damo

I'm now using this code. It finds the div nodes with the class attribute
"prel", but it will not get their child nodes.

if(attr.getValue()=="prel"): Is there something wrong with this line?




NodeList sections = document.getElementsByTagName("div");
System.out.println(sections.getLength());
for (int i = 0; i < sections.getLength(); i++)
{
    Element section = (Element) sections.item(i);
    Attr attr = (Attr) section.getAttributeNode("class");
    if (attr == null)
    {
        System.out.println("false");
    }
    else
    {
        System.out.println(attr.getValue());
        if (attr.getValue() == "prel")
        {
            NodeList name = section.getChildNodes();

            System.out.println(name.getLength());
            for (int j = 0; j < name.getLength(); j++)
            {
                Element list = (Element) name.item(j);
                String title = getText(list.getFirstChild());
                System.out.println(title);
            }
        }
    }

}
 
Andrew Thompson

Damo said:
I'm now using this code. It finds the div nodes with the class attribute
"prel", but it will not get their child nodes.

// compares references to the two strings
if(attr.getValue()=="prel"): Is there something wrong with this line?

// compares contents of strings
if(attr.getValue().equals("prel"))

Andrew T.
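Once the comparison uses equals, the inner loop from the earlier post may hit a second problem: getChildNodes() also returns text nodes (for example, the whitespace between tags), so casting every child to Element can throw a ClassCastException. The following is a rough sketch of one way to handle that, under stated assumptions: ChildText, collectText and childElementTexts are hypothetical names, and collectText is only a stand-in for the poster's own recursive getText method, whose source was not shown. It walks the tree by hand rather than calling getTextContent(), on the assumption that the DOM in use may not implement that DOM Level 3 method.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ChildText {

    // Concatenates all text nodes below the given node and collapses whitespace.
    static String collectText(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, sb);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    private static void append(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            append(c, sb);
        }
    }

    // Returns one string per child element, skipping text and comment children.
    static List<String> childElementTexts(Element parent) {
        List<String> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() != Node.ELEMENT_NODE) {
                continue; // a cast to Element here would fail for text nodes
            }
            out.add(collectText(kid));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Simplified, made-up fragment with whitespace between the child divs.
        String xml = "<div class=\"mb16\">\n"
                + "  <div class=\"prel\"><a><b>Java</b> Technology</a></div>\n"
                + "  <div class=\"T1\">Sun's home for <b>Java</b>.</div>\n"
                + "</div>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        for (String s : childElementTexts(doc.getDocumentElement())) {
            System.out.println(s);
        }
    }
}
```

Applied to the real fragment, the third child div would also yield the "Cached" and "Save" link text, so the URL string would probably need a narrower selection, such as the span with id r0_u.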
 
