Accessing attributes in HTML with DOM

Discussion in 'Java' started by Damo, Jan 16, 2007.

  1. Damo

    Damo Guest

    Hi,
    I'm trying to extract text from an HTML page using DOM, after first
    running it through JTidy. The HTML itself is not very descriptive:
    there are no standout tags around the text I need to extract. I was
    thinking of locating the text through the attributes, but I keep
    getting a NullPointerException. This is the HTML:


    <div class="mb16">
    <div id="r_t0" class="prel">
    <a id="r0_t" class="L4"href="http://java.sun.com/"">
    <b>Java</b> Technology</a></div>
    <div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
    Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
    extensions, news, tutorials, and product information.</div>
    <div id="r_b0" class="prel T11"><a id="r0_b"
    href="http://java.sun.com/">
    <img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0"class="bb"
    /></a>
    <span id="r0_u" class="T10">java.sun.com/</span>
    <strong>&middot;</strong> <a class="L5 nw"
    href="http://www.askcache.com">
    Cached</a> 1f40 <strong>&middot;</strong>
    <a class="L5 L5V" href="javascript:void(0)">Save</a>
    </div>
    </div>


    This is the part I want to skip to in order to extract the text. It's
    buried in loads of other HTML. Can anyone please help me do this?
    Damo, Jan 16, 2007
    #1

  2. Damo

    Daniel Pitts Guest

    The example HTML is a good start, but you should also post the code
    that produces the NPE and the output you expect.
    Also, if it's a well-formed XML document, consider using XPath, which
    selects data based on the path to that data (including selections
    based on element names, attributes, order, etc.).
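
    As a rough sketch of the XPath route: this assumes the tidied page is
    already available as a well-formed org.w3c.dom.Document. The class name
    XPathExtractDemo, the helper methods, and the cut-down SAMPLE markup are
    made up for illustration; only the ids and class names come from the
    posted HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExtractDemo {
    // Hypothetical, already well-formed stand-in for the tidied page.
    static final String SAMPLE =
        "<html><body><div class=\"nav\">other markup</div>"
      + "<div class=\"mb16\">"
      + "<div id=\"r_t0\" class=\"prel\">"
      + "<a id=\"r0_t\" class=\"L4\" href=\"http://java.sun.com/\">"
      + "<b>Java</b> Technology</a></div>"
      + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>.</div>"
      + "<div id=\"r_b0\" class=\"prel T11\">"
      + "<span id=\"r0_u\" class=\"T10\">java.sun.com/</span></div>"
      + "</div></body></html>";

    // Parses a well-formed XML/XHTML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the trimmed string value of the first node matching expr
    // (an empty string if nothing matches).
    static String text(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Restrict each lookup to the div with class "mb16", then pick by id.
        String title   = text(doc, "//div[@class='mb16']//a[@id='r0_t']");
        String summary = text(doc, "//div[@class='mb16']//div[@id='r0_a']");
        String url     = text(doc, "//div[@class='mb16']//span[@id='r0_u']");
        System.out.println(title);
        System.out.println(summary);
        System.out.println(url);
    }
}
```

    The string value of an element is the concatenation of all its
    descendant text nodes, so nested <b> tags do not need special handling.
    Note that @class='mb16' only matches when the attribute is exactly
    that string.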
    Daniel Pitts, Jan 16, 2007
    #2

  3. Damo

    Damo Guest

    If I can get at the first div, I can get its child nodes. How would one
    use XPath to get it?
    The code below is what I have:




    NodeList sections = document.getElementsByTagName("div");
    for (int i = 0; i < sections.getLength(); i++)
    {
        Element section = (Element) sections.item(i);

        Attr attr = (Attr) section.getAttributeNode("class");
        boolean wasSpecified = attr != null && attr.getSpecified();

        String at = attr.getValue();
        if (at == "mb16")
        {
            // I have a recursive method to get the text nodes for here
            // if I can get at the child nodes of that particular div
        }
    }
    Damo, Jan 16, 2007
    #3
  4. Damo

    Damo Guest

    Oh and the output I want is

    Java Technology

    Sun's home for Java Offers
    Windows, Solaris, and Linux Java Development Kits (JDKs),
    extensions, news, tutorials, and product information.

    java.sun.com/

    all stored as different Strings
    Damo, Jan 16, 2007
    #4
  5. Damo

    Damo Guest

    Oh and the output I want is

    Java Technology

    Sun's home for Java Offers
    Windows, Solaris, and Linux Java Development Kits (JDKs),
    extensions, news, tutorials, and product information.

    java.sun.com/

    all stored as 3 different Strings
    Damo, Jan 16, 2007
    #5
  6. Damo

    Damo Guest

    I'm now using this code. It finds the div nodes with the class
    attribute "prel", but it will not get their child nodes.

    if(attr.getValue()=="prel"): Is there something wrong with this line?




    NodeList sections = document.getElementsByTagName("div");
    System.out.println(sections.getLength());
    for (int i = 0; i < sections.getLength(); i++)
    {
        Element section = (Element) sections.item(i);
        Attr attr = (Attr) section.getAttributeNode("class");
        if (attr == null)
        {
            System.out.println("false");
        }
        else
        {
            System.out.println(attr.getValue());
            if (attr.getValue() == "prel")
            {
                NodeList name = section.getChildNodes();

                System.out.println(name.getLength());
                for (int j = 0; j < name.getLength(); j++)
                {
                    Element list = (Element) name.item(j);
                    String title = getText(list.getFirstChild());
                    System.out.println(title);
                }
            }
        }
    }
    Damo, Jan 17, 2007
    #6
  7. Damo wrote:
    > I'm now using this code. It finds the div nodes with the attribute
    > "prel", but it will not get its child nodes.

    // compares references to the two strings
    > if(attr.getValue()=="prel"): Is there something wrong with this line?

    // compares contents of strings
    if(attr.getValue().equals("prel"))
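
    A fuller sketch of the loop with that change applied, plus three
    related guards that seem worth having: a div without a class attribute
    (calling getValue() on a null Attr would explain the earlier
    NullPointerException), a multi-valued class such as "prel T11", and
    child nodes that are text rather than elements (casting those to
    Element throws a ClassCastException). The names ClassAttributeDemo,
    hasClass and textOfChildElements are invented here, and getTextContent()
    stands in for the getText helper that was not posted; it assumes a DOM
    Level 3 implementation such as the JDK's built-in parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassAttributeDemo {
    // Hypothetical, well-formed stand-in for the tidied page.
    static final String SAMPLE =
        "<div class=\"mb16\">"
      + "<div id=\"r_t0\" class=\"prel\">"
      + "<a href=\"http://java.sun.com/\"><b>Java</b> Technology</a></div>"
      + "<div class=\"T1\">description</div>"
      + "<div class=\"prel T11\"><span>java.sun.com/</span> Cached</div>"
      + "<div>no class attribute here</div>"
      + "</div>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // True if the space-separated class attribute contains the given name.
    static boolean hasClass(Element e, String name) {
        // getAttribute returns "" rather than null when the attribute is absent
        String value = e.getAttribute("class");
        return Arrays.asList(value.trim().split("\\s+")).contains(name);
    }

    // Collects the text of every child element of each div carrying className.
    static List<String> textOfChildElements(Document doc, String className) {
        List<String> out = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!hasClass(div, className)) {
                continue;
            }
            NodeList kids = div.getChildNodes();
            for (int j = 0; j < kids.getLength(); j++) {
                Node kid = kids.item(j);
                // Children may be text or comment nodes, so check before using
                if (kid.getNodeType() == Node.ELEMENT_NODE) {
                    out.add(kid.getTextContent().trim());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (String s : textOfChildElements(parse(SAMPLE), "prel")) {
            System.out.println(s);
        }
    }
}
```

    Splitting the class value on whitespace is a design choice, not a
    requirement: plain equals("prel") is enough if only exact matches are
    wanted.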

    Andrew T.
    Andrew Thompson, Jan 17, 2007
    #7