XML Parser VS HTML Parser

Discussion in 'Java' started by ZOCOR, Oct 3, 2004.

  1. ZOCOR

    ZOCOR Guest

    Hi

    Can an XML parser be used to parse an HTML document, even if it is not
    well-formed?

    If the answer is yes to both, can you recommend a Java XML parser class
    (from the standard API)?

    Cheers

    ZOCOR



    ---
    Outgoing mail is certified Virus Free.
    Checked by AVG anti-virus system (http://www.grisoft.com).
    Version: 6.0.760 / Virus Database: 509 - Release Date: 10/09/2004
     
    ZOCOR, Oct 3, 2004
    #1

  2. ZOCOR

    Sudsy Guest

    ZOCOR wrote:
    > Hi
    >
    > Can a XML parser be used to parse a HTML document? even if it is not
    > well-formed?


    No; an XML parser will balk at a lot of HTML, because most HTML is not
    well-formed.

    > If the answer is yes to both, can you recommend a Java XML parser class
    > (from the standard API)?


    Search the archives for alternate approaches.
     
    Sudsy, Oct 3, 2004
    #2

  3. ZOCOR

    [private] Guest

    ZOCOR wrote:
    > Can a XML parser be used to parse a HTML document? even if it is not
    > well-formed?
    >

    It can parse it as long as the HTML is well-formed. XML isn't as
    relaxed as HTML, so any unclosed elements will make the parser throw an
    exception (probably org.xml.sax.SAXException, but I can't verify that
    right now).
     
    [private], Oct 3, 2004
    #3
  4. ZOCOR wrote:


    > Can a XML parser be used to parse a HTML document? even if it is not
    > well-formed?


    No, an XML parser can't parse HTML, unless of course it is XHTML. But
    HTML 3.2 or HTML 4.01 cannot be parsed with an XML parser.
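
A minimal sketch of the difference, using only the JAXP classes in the standard API. The class name `WellFormedCheck`, the method `parsesAsXml`, and the sample markup are made up for illustration. The error handler only rethrows fatal errors so that nothing is printed to stderr.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class WellFormedCheck {

    /** Returns true if the markup parses as XML, false on a well-formedness error. */
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Quiet handler: ignore warnings and recoverable errors, rethrow fatal ones.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            return false; // not well-formed XML
        }
    }

    public static void main(String[] args) throws Exception {
        // XHTML style: the empty element is closed explicitly with <br/>.
        System.out.println(parsesAsXml("<html><body><p>one<br/>two</p></body></html>"));
        // HTML 4 style: <br> and <p> are never closed.
        System.out.println(parsesAsXml("<html><body><p>one<br>two</body></html>"));
    }
}
```

The first call should report success and the second failure, since the unclosed `<br>` is a well-formedness error in XML.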

    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Oct 3, 2004
    #4
  5. ZOCOR wrote:

    > Can a XML parser be used to parse a HTML document? even if it is not
    > well-formed?


    A SAX or DOM parser will throw exceptions on data that's not well-formed.
    So, the answer is no, it cannot.

    --
    /**
    * @author Darryl L. Pierce <>
    * @see The Infobahn Offramp <http://mcpierce.mypage.org>
    * @quote "Lobby, lobby, lobby, lobby, lobby, lobby..." - Adrian Monk
    */
     
    Darryl L. Pierce, Oct 3, 2004
    #5
  6. "[private]" <"[private]"@[private].net> writes:

    > It can parse it as long as the HTML is well-formed.


    Except for XHTML, HTML cannot be assumed to be well-formed since HTML
    does not "end" empty elements properly; they are only empty by
    implication, like <br>.

    Also, real-world HTML is packed full of implicit begin and end tags a
    parser needs to be aware of.
     
    Tor Iver Wilhelmsen, Oct 3, 2004
    #6
  7. ZOCOR

    CarlosRivera Guest

    You could use tidy or similar to turn html into xhtml and then use an
    XML parser.

    ZOCOR wrote:
    > Hi
    >
    > Can a XML parser be used to parse a HTML document? even if it is not
    > well-formed?
    >
    > If the answer is yes to both, can you recommend a Java XML parser class
    > (from the standard API)?
     
    CarlosRivera, Oct 3, 2004
    #7
  8. ZOCOR

    ZOCOR Guest

    "Darryl L. Pierce" <> wrote in message
    news:1096821414.TMHnUn2xrpVueIRygtEFdA@teranews...
    > ZOCOR wrote:
    >
    > > Can a XML parser be used to parse a HTML document? even if it is not
    > > well-formed?

    >
    > A SAX or DOM parser will throw exceptions on data that's not well-formed.
    > So, the answer is no, it cannot.


    Well, can't I just catch the exceptions so that processing can continue?

    What's the problem?

    ZOCOR



     
    ZOCOR, Oct 4, 2004
    #8
  9. "ZOCOR" <> writes:

    > Whats the problem?


    <br> and the like, which are (implicitly) empty elements that a SAX
    parser will not report an end element for, since they are start tags
    for containing elements as far as the parser knows.

    So you need to add a bunch of logic that handles optional start
    elements, implicit end elements, and non-terminated empty elements.

    But, hey, if you don't consider that a problem...
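
A rough sketch of why catching the exception does not help much. The XML specification requires a parser to stop normal processing once it hits a fatal (well-formedness) error, so the handler only sees the events that came before the first problem. The class name `SaxStopsAtError`, the method `elementsSeen`, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SaxStopsAtError {

    /** Returns the element names reported before parsing ended or failed. */
    static List<String> elementsSeen(String markup) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } catch (SAXParseException e) {
            // Swallowing the exception does not resume the parse:
            // everything after the error position is simply never reported.
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // The unclosed <br> makes </p> a mismatched end tag, which is a fatal error.
        String html = "<html><body><p>first<br></p><h1>wanted</h1></body></html>";
        System.out.println(elementsSeen(html)); // h1 is never reported
    }
}
```

So if the text you want sits after the first unclosed tag, a plain SAX parser never gets that far, whether or not the exception is caught.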
     
    Tor Iver Wilhelmsen, Oct 4, 2004
    #9
  10. ZOCOR

    ZOCOR Guest

    > > Whats the problem?
    >
    > <br> and the like, which are (implicitly) empty elements that a SAX
    > parser will not report an end element for, since they are start tags
    > for containing elements as far as the parser knows.
    >
    > So you need to add a bunch of logic that handles optional start
    > elements, implicit end elements, and non-terminated empty elements.
    >
    > But, hey, if you don't consider that a problem...


    Well, I'm only after specific text contained in certain tags, which
    fortunately have end tags. As for the other tags, I couldn't care less
    about them.
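
One option nobody in the thread mentions, offered here only as a sketch: the standard API ships a lenient HTML parser in `javax.swing.text.html.parser`, which copes with unclosed tags and can be used just to collect the text inside one kind of tag. The class name `TagTextExtractor`, the method `textOf`, and the sample markup are invented for illustration, and the parser's DTD is an old HTML one, so modern pages may not come through cleanly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagTextExtractor {

    /** Collects the text found inside every occurrence of the wanted tag. */
    static String textOf(String html, HTML.Tag wanted) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // > 0 while we are inside the wanted tag

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == wanted) depth++;
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == wanted && depth > 0) depth--;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) out.append(data);
            }
        };
        // Third argument: ignore any charset declared in the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // <br> and both <p> elements are left unclosed on purpose.
        String html = "<html><body><h1>Wanted text</h1><p>one<br>two<p>three</body></html>";
        System.out.println(textOf(html, HTML.Tag.H1));
    }
}
```

Because the callback ignores every tag except the one asked for, the sloppy markup around it does not matter.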


    ZOCOR



     
    ZOCOR, Oct 4, 2004
    #10
  11. ZOCOR

    Brusque Guest

    "ZOCOR" <> wrote in message
    news:w5Q7d.12595$...
    > Hi
    >
    > Can a XML parser be used to parse a HTML document? even if it is not
    > well-formed?
    >
    > If the answer is yes to both, can you recommend a Java XML parser class
    > (from the standard API)?
    >
    > Cheers
    >
    > ZOCOR
    >


    Never used it myself, but maybe this is worth a try:
    http://www.apache.org/~andyc/neko/doc/html/
     
    Brusque, Oct 4, 2004
    #11
  12. ZOCOR

    Paul King Guest

    Brusque wrote:
    > "ZOCOR" <> wrote in message
    > news:w5Q7d.12595$...
    >
    >>Hi
    >>
    >>Can a XML parser be used to parse a HTML document? even if it is not
    >>well-formed?
    >>
    >>If the answer is yes to both, can you recommend a Java XML parser class
    >>(from the standard API)?
    >>
    >>Cheers
    >>
    >>ZOCOR
    >>

    >
    >
    > Never used it myself, but maybe this is worth a try:
    > http://www.apache.org/~andyc/neko/doc/html/
    >
    >


    CyberNeko HTML Parser (above link) works well in my experience. If that
    doesn't suit, you might like to try tagsoup (which also works well):
    http://mercury.ccil.org/~cowan/XML/tagsoup/

    If you find them too heavyweight, a regex might be what you are after.
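
A small sketch of the regex route using `java.util.regex`, assuming the tags you care about always have an end tag and are not nested inside themselves. The class name `RegexTagText`, the method `textBetween`, and the sample markup are hypothetical. This is brittle: comments, scripts, or a `>` inside an attribute value can all fool it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTagText {

    /** Returns the raw content between each <tag ...> and </tag> pair, case-insensitively. */
    static List<String> textBetween(String html, String tag) {
        String name = Pattern.quote(tag);
        // Start tag with optional attributes, shortest possible body, then the end tag.
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // The unclosed <br> and <p> tags are simply skipped over.
        String html = "<p>x<br><H1 class=\"a\">Hello</h1><p>y";
        System.out.println(textBetween(html, "h1"));
    }
}
```

The captured text still contains any inner markup and entities such as `&amp;`, so it may need further cleanup.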

    Cheers, Paul.
     
    Paul King, Oct 5, 2004
    #12
