HTML Parser Help Please

Discussion in 'Java' started by ZOCOR, Sep 30, 2004.

  1. ZOCOR

    ZOCOR Guest

    Hi

    I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I
    have found this Swing HTML parser extremely difficult to use.

    I am trying to parse an HTML file and extract specific information from it
    into a table. Consider this snippet of my HTML and the table I would like to
    generate from it:

    HTML source:

    <HTML>
    <TITLE></TITLE>
    <BODY>
    <PRE>
    Identifer: ABCDEFG
    </PRE>
    data: 123456
    <PRE>
    </PRE>
    </BODY>
    </HTML>

    TABLE:

    ABCDEFG 123456


    Here is the code I have so far:

    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import java.io.*;

    public class HTMLParser extends HTMLEditorKit
    {
        // getParser() is protected in HTMLEditorKit; widen it to public
        public HTMLEditorKit.Parser getParser()
        {
            return super.getParser();
        }

        public static void main (String[] args)
        {
            try
            {
                Reader r = new FileReader("html_file.html");
                HTMLEditorKit.Parser parse = new HTMLParser().getParser();
                HTMLEditorKit.ParserCallback cb =
                    new HTMLEditorKit.ParserCallback()
                {
                    private boolean inPre = false; // true while inside <PRE>

                    public void handleStartTag(HTML.Tag t,
                        MutableAttributeSet a, int pos)
                    {
                        if (t == HTML.Tag.PRE)
                        {
                            inPre = true;
                        }
                    }
                    public void handleEndTag(HTML.Tag t, int pos)
                    {
                        if (t == HTML.Tag.PRE)
                        {
                            inPre = false;
                        }
                    }
                    public void handleText(char[] data, int pos)
                    {
                        // print what's between the pre tags
                        if (inPre)
                        {
                            System.out.println(new String(data));
                        }
                    }
                };

                parse.parse(r, cb, true);
            }
            catch (IOException e)
            {
                System.out.println(e);
            }
        }
    }

    I would appreciate it very much if someone could solve this problem for me.
    I tried the Sun tutorial, but the examples aren't clear enough for me.

    Thanks

    ZOCOR
    ZOCOR, Sep 30, 2004
    #1

  2. I've never used this HTML Parser before, but I've done similar things
    when scraping HTML off websites. My general solution is to:

    1. Get the HTML as text (which you already have).
    2. Run it through an HTML-to-XHTML cleanser (I like JTidy).
    3. Parse the XHTML using Java's XML parsers.
    4. Use XPath statements to get the values I want.

    This probably isn't very efficient for getting small bits of data, but
    it works.
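
    Steps 3 and 4 above can be sketched with the JDK's built-in XML and XPath
    classes. This is only a rough illustration: it assumes step 2 has already
    produced well-formed XHTML with no DOCTYPE and no default namespace (a
    cleanser's real output may include both), and the class name XhtmlExtract,
    the helper extract(), and the XPath expressions are made up for this
    example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlExtract {
    // Hypothetical helper: parses well-formed XHTML and evaluates one
    // XPath expression against it, returning the trimmed string result.
    static String extract(String xhtml, String expression) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed output of the cleansing step for the HTML in the question.
        String xhtml = "<html><head><title></title></head><body>"
            + "<pre>Identifer: ABCDEFG</pre>data: 123456<pre></pre>"
            + "</body></html>";

        // Text of the first <pre>, with the label stripped.
        String id = extract(xhtml,
            "substring-after(/html/body/pre[1], 'Identifer: ')");
        // First bare text node directly under <body>, with the label stripped.
        String data = extract(xhtml,
            "substring-after(/html/body/text()[1], 'data: ')");

        System.out.println(id + " " + data);
    }
}
```

    If the cleansed document declares the XHTML namespace, the expressions
    would need a namespace-aware parser and a NamespaceContext on the XPath
    object, so treat the paths here as a starting point only.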

    //Nathan
    Nathan Zumwalt, Sep 30, 2004
    #2

  3. ZOCOR

    Paul Lutus Guest

    ZOCOR wrote:

    > Hi
    >
    > I am using HTMLEditorKit.Parser class to parse a HTML file. However, I
    > have found this Swing HTML parser extremely difficult to use.


    Problem: "difficult".

    > I am trying to parse a HTML file and extracting specific information from
    > it into a table.


    Problem: "trying".

    > Consider the snippet of my HTML and the table I like it
    > to generate:


    You left out the table, the final goal of your program.

    / ...

    > I would appreciate it very much if someone could solve this problem for
    > me.


    Which problem, "difficult" or "trying"? Children are both difficult and
    trying, but this is not a specific complaint. Neither is yours.

    Tell us what you wanted, what you got, and how they differ.

    > I tried the sun tutortial, but the examples aren't that clear enough
    > for me.


    Clear enough to do what?

    --
    Paul Lutus
    http://www.arachnoid.com
    Paul Lutus, Sep 30, 2004
    #3
  4. ZOCOR

    John K Guest

    John K, Sep 30, 2004
    #4