XML parser

Discussion in 'XML' started by vc, Jun 27, 2005.

  1. vc

    vc Guest

    Hi,

    I'm looking for an XML parser that wouldn't stop if it finds a minor error
    in an XML file. I need to parse an HTML file and there are a lot of HTML
    pages that, for instance, don't enclose attribute values in quotes.
    Or, for instance, most HTML pages don't have a root tag/element (that
    could be "html"). Instead, they have a "doctype" tag before and at the same
    level as "html", and XML parsers report an error "no root tag found".

    I have tried 3-4 SAX parsers, but none of them works :-(

    It would be great if you can recommend a C++ or Java (preferably SAX 2.0
    compliant) XML parser.


    Thank you in advance,

    vc
     
    vc, Jun 27, 2005
    #1

  2. vc wrote:
    > Hi,
    >
    > I'm looking for an XML parser that wouldn't stop if it finds a minor error
    > in an XML file. I need to parse an HTML file and there are a lot of HTML
    > pages that, for instance, don't enclose attribute values in quotes.
    > Or, for instance, most HTML pages don't have a root tag/element (that
    > could be "html"). Instead, they have a "doctype" tag before and at the same
    > level as "html", and XML parsers report an error "no root tag found".
    >
    > I have tried 3-4 SAX parsers, but none of them works :-(
    >
    > It would be great if you can recommend a C++ or Java (preferably SAX 2.0
    > compliant) XML parser.
    >
    >
    > Thank you in advance,
    >
    > vc
    >
    >
    >


    Why don't you use an HTML parser?

    Try this one:
    http://people.apache.org/~andyc/neko/doc/html/
    It's a nice toy.

    --
    Regards,

    ///
    (. .)
    -----ooO--(_)--Ooo-----
    | Philippe Poulard |
    -----------------------
     
    Philippe Poulard, Jun 27, 2005
    #2

  3. vc

    Phlip Guest

    vc wrote:

    > I'm looking for an XML parser that wouldn't stop if it finds a minor error
    > in an XML file. I need to parse an HTML file and there are a lot of HTML
    > pages that, for instance, don't enclose attribute values in quotes.


    Use tidy -asxhtml to convert it to XHTML. Then use XPath to query into it.

    http://tidy.sourceforge.net/

    Shell out to tidy with system() or _popen() - don't bother linking against it.

    And note the entire purpose of XML is to be a well-formed data language, not
    a forgiving Notepad-oriented markup language. I really doubt you'll find an
    XML parser that permits ill-formed input!
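
    The shell-out approach above might look roughly like this in Python. It is
    only a sketch: it assumes a tidy executable is on the PATH, and the helper
    names tidy_command and html_to_xhtml are made up for illustration.

```python
import shutil
import subprocess


def tidy_command(path):
    """Build the argument list for converting one HTML file to XHTML."""
    # -asxhtml: emit well-formed XHTML; -quiet: suppress nonessential output;
    # -numeric: write numeric entities rather than named ones.
    return ["tidy", "-asxhtml", "-quiet", "-numeric", path]


def html_to_xhtml(path):
    """Run tidy on `path` and return the XHTML text it writes to stdout."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(tidy_command(path), capture_output=True, text=True)
    # tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

    The returned text can then be handed to an ordinary XML parser or an XPath
    engine, as suggested above.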

    --
    Phlip
    http://www.c2.com/cgi/wiki?ZeekLand
     
    Phlip, Jun 27, 2005
    #3
  4. vc

    Peter Flynn Guest

    vc wrote:

    > Hi,
    >
    > I'm looking for an XML parser that wouldn't stop if it finds a minor error
    > in an XML file.


    onsgmls keeps going to the end (or until a configurable number of errors).
    Part of OpenSP from http://sourceforge.net/projects/openjade/

    > I need to parse an HTML file and there are a lot of HTML
    > pages that, for instance, don't enclose attribute values in quotes.


    But they may be perfectly valid SGML, not XML. SGML permits lots of
    abbreviations that are not allowed in XML.

    Or they may just be garbage (more likely :)
    You can run them through HTML Tidy to try and make them XHTML.

    > Or, for instance, most HTML pages don't have a root tag/element (that
    > could be "html").


    That, too, is permitted in some older SGML DTDs for HTML.

    > Instead, they have a "doctype" tag before and at the same
    > level as "html", and XML parsers report an error "no root tag found".


    That's a DocType Declaration. It specifies the version of HTML being used
    (in theory: in practice it's garbage added by editors which don't know
    what they are doing and just throw it in to confuse things).

    Again, use HTML Tidy to try and make the file into XHTML.
    Then validate with:

    $ onsgmls -wxml -s /your/path/to/xml.dec filename.xml

    If you use Emacs, this can be configured to happen automatically when you
    validate a document, and the error lines get coloured and become links to
    the location in the document where the error was spotted.

    You will need a copy of the XML Declaration (xml.dec). The original at
    http://www.w3.org/TR/NOTE-sgml-xml-971215 is starting to suffer from
    bitrot and W3C neglect, so I have put a working copy online at
    http://xml.silmaril.ie/xml.dec_onsgmls (note this is slightly different
    from the original, which is available at http://xml.silmaril.ie/xml.dec_jc)
    Just rename it to xml.dec on your machine.

    ///Peter
    --
    sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
    &;top"
     
    Peter Flynn, Jun 27, 2005
    #4
  5. vc

    Nick Kew Guest

    vc wrote:
    > Hi,
    >
    > I'm looking for an XML parser that wouldn't stop if it finds a minor error
    > in an XML file. I need to parse an HTML file and there are a lot of HTML
    > pages that, for instance, don't enclose attribute values in quotes.
    > Or, for instance, most HTML pages don't have a root tag/element (that
    > could be "html"). Instead, they have a "doctype" tag before and at the same
    > level as "html", and XML parsers report an error "no root tag found".


    People have suggested Tidy, nekohtml and onsgmls. I'd suggest the HTML
    parser from libxml2 in preference to those for most purposes.

    But you don't necessarily need any such thing. Although XML parsers
    are required to stop on encountering a fatal error, many of them can
    be set to continue. For example, mod_validator sets Xerces to continue
    so it will report all errors in an XML document.
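
    To illustrate the "required to stop" part, here is a small sketch using
    Python's standard xml.sax module (not Xerces). The helper name
    first_fatal_error is invented for this example. Python's expat-based
    parser has no switch to resume after a fatal error, so this only shows the
    stop-and-report behaviour, not the continue mode described above.

```python
import xml.sax


def first_fatal_error(markup):
    """Return a description of the first fatal error, or None if well-formed."""
    try:
        # The default error handler raises on the first fatal error.
        xml.sax.parseString(markup.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return "line %d: %s" % (exc.getLineNumber(), exc.getMessage())
    return None


# An unquoted attribute value is a well-formedness error in XML.
print(first_fatal_error("<a href=page.html>link</a>"))
# The quoted form parses without complaint, so this prints None.
print(first_fatal_error('<a href="page.html">link</a>'))
```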

    --
    Nick Kew
     
    Nick Kew, Jun 28, 2005
    #5
  6. Nick Kew wrote:
    : vc wrote:
    : > Hi,
    : >
    : > I'm looking for an XML parser that wouldn't stop if it finds a minor error
    : > in an XML file. I need to parse an HTML file and there are a lot of HTML
    : > pages that, for instance, don't enclose attribute values in quotes.
    : > Or, for instance, most HTML pages don't have a root tag/element (that
    : > could be "html"). Instead, they have a "doctype" tag before and at the same
    : > level as "html", and XML parsers report an error "no root tag found".

    : People have suggested Tidy, nekohtml and onsgmls. I'd suggest the HTML
    : parser from libxml2 in preference to those for most purposes.

    : But you don't necessarily need any such thing. Although XML parsers
    : are required to stop on encountering a fatal error, many of them can
    : be set to continue. For example, mod_validator sets Xerces to continue
    : so it will report all errors in an XML document.

    Another is the Perl module HTML::Parser.

    It is the same idea as a SAX parser, but it expects HTML, handles many of
    the quirks that are common in real pages, is quite speedy, and comes
    pre-installed with many Perl distributions.
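
    For comparison, the same event-driven idea exists in Python's standard
    library as html.parser.HTMLParser. The sketch below uses that module
    rather than the Perl one, and the names EventLogger and parse_events are
    invented for the example. It accepts unquoted attribute values and a
    doctype sitting in front of the markup.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record SAX-style events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        self.events.append(("decl", decl))

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are optional in HTML.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


def parse_events(markup):
    """Feed the markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


for event in parse_events("<!DOCTYPE html><p class=note>Hi<br><a href=page.html>link</a>"):
    print(event)
```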


    --

    This space not for rent.
     
    Malcolm Dew-Jones, Jun 28, 2005
    #6