Accessing attributes in HTML with DOM

Damo · Jan 16, 2007

Hi,
I'm trying to extract text from an HTML page using DOM. I ran JTidy
on it first. The HTML itself is not very descriptive: there are no
standout tags around the text I need to extract. The way I was
thinking of doing it was by accessing the attributes, but I keep getting a
NullPointerException. This is the HTML:

<div class="mb16">
  <div id="r_t0" class="prel">
    <a id="r0_t" class="L4" href="http://java.sun.com/">
      <b>Java</b> Technology</a></div>
  <div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
    Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
    extensions, news, tutorials, and product information.</div>
  <div id="r_b0" class="prel T11"><a id="r0_b"
      href="http://java.sun.com/">
    <img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0" class="bb"
      /></a>
    <span id="r0_u" class="T10">java.sun.com/</span>
    <strong>·</strong> <a class="L5 nw"
      href="http://www.askcache.com">
      Cached</a> 1f40 <strong>·</strong>
    <a class="L5 L5V" href="javascript:void(0)">Save</a>
  </div>
</div>

This is the part I want to skip to in order to extract the text. It's
buried in loads of other HTML. Can anyone please help me do this?

Daniel Pitts · Jan 16, 2007

The example HTML is a good start. Perhaps you should consider giving us
the code that produces the NPE and what you expect the output to be.
Also, if it's a valid XML document, consider using XPath: it helps
select data based on the path to that data (including selections based
on element names, attributes, order, etc.).
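
To illustrate the XPath suggestion, here is a minimal sketch using the standard javax.xml.xpath API. It assumes the tidied page parses as well-formed XML. The class name XPathDivSketch, the helper textOf, and the cut-down SAMPLE markup are hypothetical stand-ins, not anything from the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathDivSketch {

    // Cut-down, well-formed stand-in for the posted markup (an assumption;
    // the real page would first have to be tidied into valid XML).
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"other\">skip me</div>"
        + "<div class=\"mb16\">"
        + "<div id=\"r_t0\" class=\"prel\">"
        + "<a id=\"r0_t\" href=\"http://java.sun.com/\"><b>Java</b> Technology</a></div>"
        + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>.</div>"
        + "</div>"
        + "</body></html>";

    // Parses the XML, evaluates the XPath expression, and returns the text
    // content of the first matching node, or null when nothing matches.
    static String textOf(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select by attribute value, then walk down to the element of interest.
        System.out.println(textOf(SAMPLE, "//div[@class='mb16']/div[@class='prel']/a"));
        System.out.println(textOf(SAMPLE, "//div[@class='mb16']/div[@class='T1']"));
    }
}
```

The predicate [@class='mb16'] matches a div whose class attribute is exactly mb16, wherever that div sits in the document, so the surrounding HTML does not need to be walked by hand.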

Damo · Jan 16, 2007

If I can get at the first div, I can get its child nodes. How would one
use XPath to get it?
The code below is what I have:

NodeList sections = document.getElementsByTagName("div");
for (int i = 0; i < sections.getLength(); i++)
{
    Element section = (Element) sections.item(i);

    Attr attr = (Attr) section.getAttributeNode("class");
    boolean wasSpecified = attr != null && attr.getSpecified();

    String at = attr.getValue();
    if (at == "mb16")
    {
        // I have a recursive method to get the text nodes for here
        // if I can get at the child nodes of that particular div
    }
}
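
One plausible source of the NullPointerException in a loop like the one above: getAttributeNode returns null for any div that has no class attribute, so calling getValue() on the result fails for that div. A minimal sketch of a guarded version follows. The class NullSafeAttrSketch, the method idsOfDivsWithClass, and the SAMPLE markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NullSafeAttrSketch {

    // Hypothetical sample: the first div has no class attribute at all.
    static final String SAMPLE =
        "<root><div id=\"a\"/><div id=\"b\" class=\"mb16\"/>"
        + "<div id=\"c\" class=\"prel\"/></root>";

    // Returns the id of every div whose class attribute equals `wanted`.
    static List<String> idsOfDivsWithClass(String xml, String wanted) {
        List<String> ids = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList sections = doc.getElementsByTagName("div");
            for (int i = 0; i < sections.getLength(); i++) {
                Element section = (Element) sections.item(i);
                Attr attr = section.getAttributeNode("class");
                if (attr == null) {
                    // No class attribute: skip instead of dereferencing null.
                    continue;
                }
                if (wanted.equals(attr.getValue())) {
                    ids.add(section.getAttribute("id"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(idsOfDivsWithClass(SAMPLE, "mb16"));
    }
}
```

Element.getAttribute("class") is another option: it returns an empty string rather than null when the attribute is absent.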

Damo · Jan 16, 2007

Oh, and the output I want is:

Java Technology

Sun's home for Java Offers
Windows, Solaris, and Linux Java Development Kits (JDKs),
extensions, news, tutorials, and product information.

java.sun.com/

all stored as 3 different Strings.

Damo · Jan 17, 2007

I'm now using this code. It finds the div nodes with the class attribute
"prel", but it will not get their child nodes.

if(attr.getValue()=="prel"): Is there something wrong with this line?

NodeList sections = document.getElementsByTagName("div");
System.out.println(sections.getLength());
for (int i = 0; i < sections.getLength(); i++)
{
    Element section = (Element) sections.item(i);
    Attr attr = (Attr) section.getAttributeNode("class");
    if (attr == null)
    {
        System.out.println("false");
    }
    else
    {
        System.out.println(attr.getValue());
        if (attr.getValue() == "prel")
        {
            NodeList name = section.getChildNodes();

            System.out.println(name.getLength());
            for (int j = 0; j < name.getLength(); j++)
            {
                Element list = (Element) name.item(j);
                String title = getText(list.getFirstChild());
                System.out.println(title);
            }
        }
    }
}
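
Separately from the comparison question, the inner loop above casts every child node to Element. getChildNodes() also returns text nodes, including the whitespace between tags, so that cast can throw a ClassCastException. The sketch below checks the node type first. It uses the standard getTextContent() in place of the getText helper, whose implementation was not posted, and the class name, method name, and sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildElementsSketch {

    // Hypothetical sample: line breaks between tags become text-node children.
    static final String SAMPLE =
        "<div class=\"prel\">\n"
        + "  <a id=\"r0_t\"><b>Java</b> Technology</a>\n"
        + "  <span>java.sun.com/</span>\n"
        + "</div>";

    // Returns the trimmed text of each element child of the root element,
    // skipping text nodes and other non-element children.
    static List<String> childElementTexts(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // whitespace or other non-element node
                }
                // getTextContent() concatenates the text of all descendants.
                texts.add(child.getTextContent().trim());
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(childElementTexts(SAMPLE));
    }
}
```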

Andrew Thompson · Jan 17, 2007

Damo said:
I'm now using this code. It finds the div nodes with the class attribute
"prel", but it will not get their child nodes.

// compares references to the two strings

if(attr.getValue()=="prel"): Is there something wrong with this line?

// compares contents of strings
if(attr.getValue().equals("prel"))

Andrew T.
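
A small sketch of the difference between == and equals() on strings. The class and method names are hypothetical. The hasClassToken helper is an extra assumption, added because one div in the posted HTML carries class="prel T11", which an exact equals("prel") would not match.

```java
import java.util.Arrays;

class StringEqualitySketch {

    // True only when both variables point at the very same String object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // True when the two strings contain the same characters.
    static boolean sameContents(String a, String b) {
        return a != null && a.equals(b);
    }

    // True when the whitespace-separated class value contains the token.
    static boolean hasClassToken(String classValue, String token) {
        return Arrays.asList(classValue.trim().split("\\s+")).contains(token);
    }

    public static void main(String[] args) {
        // new String(...) stands in for a value built at run time, such as
        // one returned by a DOM parser; it is a distinct object from the literal.
        String fromDom = new String("prel");

        System.out.println(sameReference(fromDom, "prel"));  // false
        System.out.println(sameContents(fromDom, "prel"));   // true
        System.out.println(hasClassToken("prel T11", "prel")); // true
    }
}
```

Writing the literal first, as in "prel".equals(attr.getValue()), also avoids a NullPointerException if the value is ever null.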
