HTML Processing in Java

Discussion in 'Java' started by Honza, Nov 29, 2005.

  1. Honza

    Honza Guest

    Hello,

    I would like to process HTML pages in Java. The very first task would
    be to ignore unnecessary information such as comments (everything in <!--
    -->) or images.
    What would be the best starting point?
    I have found JTidy and HTML Parser on SourceForge, but neither of them
    seems able to ignore tags - or did I miss it?

    Thank you for any clue
    Honza
    Honza, Nov 29, 2005
    #1

  2. Roedy Green

    Roedy Green Guest

    On 29 Nov 2005 01:11:37 -0800, "Honza" <> wrote,
    quoted or indirectly quoted someone who said :

    >I would like to process HTML pages in Java. The very first task would
    >be to ignore unnecessary information such as comments (everything in <!--
    >-->) or images.
    >What would be the best starting point?


    See http://mindprod.com/products1.html#ENTITIES
    to strip out the HTML and, optionally, convert the &xxx; entities back
    to normal characters.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Nov 29, 2005
    #2

  3. zero

    zero Guest

    "Honza" <> wrote in news:1133255497.231778.229120
    @g14g2000cwa.googlegroups.com:

    > Hello,
    >
    > I would like to process HTML pages in Java. The very first task would
    > be to ignore unnecessary information such as comments (everything in <!--
    > -->) or images.
    > What would be the best starting point?
    > I have found JTidy and HTML Parser on SourceForge, but neither of them
    > seems able to ignore tags - or did I miss it?
    >
    > Thank you for any clue
    > Honza
    >
    >


    I would be very surprised if either of those actually did anything with the
    comments. If they do, why not just remove the code that handles them?

    --
    Beware the False Authority Syndrome
    zero, Nov 29, 2005
    #3
  4. Oliver Wong

    Oliver Wong Guest

    "Honza" <> wrote in message
    news:...
    > Hello,
    >
    > I would like to process HTML pages in Java. The very first task would
    > be to ignore unnecessary information such as comments (everything in <!--
    > -->) or images.
    > What would be the best starting point?
    > I have found JTidy and HTML Parser on SourceForge, but neither of them
    > seems able to ignore tags - or did I miss it?
    >
    > Thank you for any clue
    > Honza


    I haven't used the parsers you're talking about, but if you find any
    SAX-based parser, you'll just receive a series of "events", each
    representing the discovery of some construct in the HTML document, and
    you can simply ignore the "comment" events.

    - Oliver
    Oliver Wong, Nov 29, 2005
    #4
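
A minimal sketch of the SAX idea in post #4, assuming well-formed XHTML input so that the JDK's built-in SAX parser can be used; "real world" HTML would need an HTML-tolerant SAX parser instead. The class and method names (SaxCommentDemo, textOf) are made up for illustration. With standard SAX, comments are only delivered to a separately registered LexicalHandler, so a plain DefaultHandler never sees them, and images are skipped simply by not reacting to their element events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxCommentDemo {

    /** Returns only the character data of a well-formed XHTML string. */
    static String textOf(String xhtml) {
        StringBuilder out = new StringBuilder();

        // Only text events are handled. Element events (including <img/>)
        // fall through to DefaultHandler's no-op methods, and comment events
        // are never delivered because no LexicalHandler is registered.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };

        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalStateException("Could not parse input", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><!-- hidden --><p>Hello"
                + "<img src=\"x.png\"/> world</p></body></html>";
        System.out.println(textOf(page)); // prints "Hello world"
    }
}
```

The same handler should carry over to an HTML-aware SAX parser, since only the XMLReader would change.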
  5. Roedy Green

    Roedy Green Guest

    On Tue, 29 Nov 2005 11:01:38 GMT, Roedy Green
    <> wrote, quoted or
    indirectly quoted someone who said :

    >See http://mindprod.com/products1.html#ENTITIES
    >to strip out the HTML and, optionally, convert the &xxx; entities back
    >to normal characters.


    With a simple modification, you could strip just comments, not all
    HTML tags.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Nov 29, 2005
    #5
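
For illustration only, stripping just the comments could look roughly like the sketch below. This is not the code from the mindprod.com package mentioned in post #5; it is a regex-based stand-in, and the names CommentStripper and stripComments are hypothetical. A regex is fragile on real pages (for example, "-->" inside a script block), so treat it as a starting point rather than a robust filter.

```java
import java.util.regex.Pattern;

final class CommentStripper {

    // DOTALL lets '.' match newlines, so multi-line comments are covered;
    // the reluctant '*?' stops at the first closing "-->".
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Removes every <!-- ... --> span and leaves all other markup intact. */
    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("<p>keep<!-- drop\nthis --> me</p>"));
        // prints: <p>keep me</p>
    }
}
```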
  6. Abhijat Vatsyayan

    Honza wrote:
    > Hello,
    >
    > I would like to process HTML pages in Java. The very first task would
    > be to ignore unnecessary information such as comments (everything in <!--
    > -->) or images.
    > What would be the best starting point?
    > I have found JTidy and HTML Parser on SourceForge, but neither of them
    > seems able to ignore tags - or did I miss it?
    >
    > Thank you for any clue
    > Honza
    >

    Take a look at the classes ParserDelegator (in package
    javax.swing.text.html.parser) and HTMLEditorKit.ParserCallback (in
    package javax.swing.text.html).

    You can subclass ParserCallback and pass your subclass to the parse
    method of a ParserDelegator object. This is quite like using SAX
    parsers for XML documents.

    Abhijat
    Abhijat Vatsyayan, Nov 29, 2005
    #6
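
A minimal sketch of the approach in post #6, assuming the goal is to collect page text while ignoring comments and images. The names SwingHtmlText, TextCollector and extractText are hypothetical; ParserDelegator.parse and the ParserCallback hooks are the standard Swing API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingHtmlText {

    /** Callback that keeps text and deliberately ignores everything else. */
    static final class TextCollector extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleText(char[] data, int pos) {
            // Separate text runs with a space so words do not fuse together.
            text.append(data).append(' ');
        }

        @Override
        public void handleComment(char[] data, int pos) {
            // Intentionally empty: comments are dropped.
        }

        @Override
        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Empty elements such as <img> arrive here; doing nothing skips them.
        }
    }

    /** Parses the given HTML and returns the collected text. */
    static String extractText(String html) {
        TextCollector collector = new TextCollector();
        try (StringReader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(reader, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.text.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><!-- secret --><p>Hello "
                + "<img src=\"x.png\"> world</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

Exact whitespace in the result depends on how the Swing parser splits text runs, so the sketch does not rely on it.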
  7. Honza

    Honza Guest

    Thank you guys, I will check the possibilities.

    I have found another interesting application which could also be a
    solution to my problem. Its name is Muffin - http://muffin.doit.org/
    It is a highly customizable proxy written in Java that lets you filter
    HTML content.
    I am going to try it out tomorrow.

    Thanks a lot
    Honza
    Honza, Nov 29, 2005
    #7
  8. Honza

    Honza Guest

    Hello Abhijat,

    I have tested HTMLEditorKit today. It is really very easy to use and it
    would be appropriate for my purpose...

    BUT: I've tested it with "real world" HTML pages and I find it not
    robust enough. The results are not accurate enough, and the number of
    errors is too high when parsing any "badly written" HTML page.

    I have found a nice page benchmarking "real world" SAX HTML parsers. I
    think I will use one of them...

    Link: http://www.portletbridge.org/saxbenchmark

    Honza
    Honza, Nov 30, 2005
    #8