Parsing HTML without Perl

Discussion in 'HTML' started by TLOlczyk, Jun 29, 2004.

  1. TLOlczyk

    TLOlczyk Guest

    Is there a library that will allow me to parse bad HTML?
    ( good html too, but any lib will do that ).

    Yes I can use Perl, but I want the flexibility to use any one
    of several languages. So a shared object/dll would be best.
    I've looked at libxml ( actually libxml2 ) and expat ( I know
    they are really XML parsers, but one can hope ), and neither
    handles HTML well enough. I'm totally confused by libwww.
    The libwww people suggest looking at the parser in Amaya,
    but I don't know how good it is or if I can extract it from
    the rest of Amaya.

    Suggestions?


    The reply-to email address is .
    This is an address I ignore.
    To reply via email, remove 2002 and change yahoo to
    interaccess,

    **
    Thaddeus L. Olczyk, PhD

    There is a difference between
    *thinking* you know something,
    and *knowing* you know something.
     
    TLOlczyk, Jun 29, 2004
    #1

  2. On Tue, 29 Jun 2004 05:23:09 -0500 TLOlczyk posted:

    > Is there a library that will allow me to parse bad HTML?
    > ( good html too, but any lib will do that ).
    >
    > Yes I can use Perl, but I want the flexibility to use any one
    > of several languages. So a shared object/dll would be best.
    > I've looked at libxml ( actually libxml2 ) and expat ( I know
    > they are really XML parsers, but one can hope ), and neither
    > handles HTML well enough. I'm totally confused by libwww.
    > The libwww people suggest looking at the parser in Amaya,
    > but I don't know how good it is or if I can extract it from
    > the rest of Amaya.
    >
    > Suggestions?
    >


    Georg Rehm describes (in: Mehler & Lobin: Automatische Textanalyse,
    2004) a two-step process for converting arbitrary HTML webpages to
    XHTML. According to him it works in 98.7% of all cases:

    1) use tidy to read and try to convert the HTML to XHTML

    2) if 1) fails, they use HTML::TreeBuilder (a Perl module, see
    http://www.cpan.org) and then run tidy again.

    Of 10000 files picked at random, 9872 could be successfully converted in
    this fashion. Only 5 of the resulting files were not well-formed afterwards
    (tested with expat).
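
    The well-formedness check at the end can be sketched with the expat
    bindings that ship in Python's standard library. This is only an
    illustration of the idea, not the harness Rehm actually used:

```python
# Minimal sketch of a well-formedness test like the one described
# above, using the expat bindings from Python's standard library.
import xml.parsers.expat

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML under expat."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False

print(is_well_formed("<html><body><p>ok</p></body></html>"))  # True
print(is_well_formed("<html><p>unclosed</html>"))             # False
```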


    Kind regards
    David
     
    David Christopher Weichert, Jun 29, 2004
    #2

  3. TLOlczyk

    TLOlczyk Guest

    On Tue, 29 Jun 2004 12:46:52 +0200, David Christopher Weichert
    <> wrote:

    >On Tue, 29 Jun 2004 05:23:09 -0500 TLOlczyk posted:
    >
    >> Is there a library that will allow me to parse bad HTML?
    >> ( good html too, but any lib will do that ).
    >>
    >> Yes I can use Perl, but I want the flexibility to use any one
    >> of several languages. So a shared object/dll would be best.
    >> I've looked at libxml ( actually libxml2 ) and expat ( I know
    >> they are really XML parsers, but one can hope ), and neither
    >> handles HTML well enough. I'm totally confused by libwww.
    >> The libwww people suggest looking at the parser in Amaya,
    >> but I don't know how good it is or if I can extract it from
    >> the rest of Amaya.
    >>
    >> Suggestions?
    >>

    >
    >Georg Rehm describes (in: Mehler & Lobin: Automatische Textanalyse,
    >2004) a two step process for converting arbitrary HTML Webpages to
    >XHTML. According to him it works in 98.7 % of all cases:
    >
    >1) use tidy to read and try to convert the HTML to XHTML
    >

    No. Tidy chokes on embedded < > and a few other bad constructs.
    At least it did last I looked. I will have to try it again.
    Also, it comes as a standalone application. If I were to use
    something like tidy, I would prefer it to be a lib (though I
    understand they are working on it).

    PS: I never tried it with JavaScript. How does it handle that?
    Both expat and libxml choke on the first for loop.
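
    That failure is easy to reproduce. Feeding a page whose script block
    contains a bare < to expat (here via Python's stdlib bindings, purely
    as an illustration) fails right at the for loop:

```python
# Reproduces the failure described above: an XML parser hits the bare
# "<" inside the script's for loop and rejects the whole document.
import xml.parsers.expat

page = "<html><script>for (var i=0; i < 3; i++) {}</script></html>"
parser = xml.parsers.expat.ParserCreate()
try:
    parser.Parse(page, True)
    print("parsed")
except xml.parsers.expat.ExpatError as err:
    print("expat rejected the page:", err)
```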

    >2) if 1) fails they use HTML::Treebuilder (Perl module, see:
    >http://www.cpan.org) and then again tidy.
    >

    As I said before I would like to be independent of Perl.



     
    TLOlczyk, Jun 29, 2004
    #3
  4. On Tue, 29 Jun 2004 06:09:33 -0500 TLOlczyk posted:

    > On Tue, 29 Jun 2004 12:46:52 +0200, David Christopher Weichert
    > <> wrote:
    >
    >>On Tue, 29 Jun 2004 05:23:09 -0500 TLOlczyk posted:
    >>
    >>> Is there a library that will allow me to parse bad HTML?
    >>> ( good html too, but any lib will do that ).
    >>>
    >>> Yes I can use Perl, but I want the flexibility to use any one
    >>> of several languages. So a shared object/dll would be best.
    >>> I've looked at libxml ( actually libxml2 ) and expat ( I know
    >>> they are really XML parsers, but one can hope ), and neither
    >>> handles HTML well enough. I'm totally confused by libwww.
    >>> The libwww people suggest looking at the parser in Amaya,
    >>> but I don't know how good it is or if I can extract it from
    >>> the rest of Amaya.
    >>>
    >>> Suggestions?
    >>>

    >>
    >>Georg Rehm describes (in: Mehler & Lobin: Automatische Textanalyse,
    >>2004) a two step process for converting arbitrary HTML Webpages to
    >>XHTML. According to him it works in 98.7 % of all cases:
    >>
    >>1) use tidy to read and try to convert the HTML to XHTML
    >>

    > No. Tidy chokes on embedded < > and a few other bad constructs.
    > At least it did last I looked. I will have to try it again.
    > Also, it comes as a standalone application. If I were to use
    > something like tidy, I would prefer it to be a lib (though I
    > understand they are working on it).


    tidy also comes as a lib.

    >
    > Ps: I never tried it with javascript. How does it handle that.


    Just tried tidy on a randomly picked page with JavaScript
    (http://javascript.internet.com/). Tidy reported warnings and errors, but
    could not fix them automatically. Seems Rehm used easier examples. I
    can't say whether this behaviour is down to the JavaScript or to other
    stuff wrong with that particular file.

    > Both expat and libxml choke on the first for loop.
    >
    >>2) if 1) fails they use HTML::Treebuilder (Perl module, see:
    >>http://www.cpan.org) and then again tidy.
    >>

    > As I said before I would like to be independent of Perl.
    >

    Rehm states that he used HTML::TreeBuilder in only 2.7% of all cases and
    that otherwise tidy was sufficient. This may have to do with the fact that
    the pages he sampled apparently worked better with tidy than the random
    sample I picked. (Rehm sampled pages from German educational institutions.)

    Looks like tidy on its own is not the solution, but might be of some use.


    Good luck
    David
     
    David Christopher Weichert, Jun 29, 2004
    #4
  5. TLOlczyk

    TLOlczyk Guest

    On Tue, 29 Jun 2004 16:34:02 +0200, David Christopher Weichert
    <> wrote:

    >On Tue, 29 Jun 2004 06:09:33 -0500 TLOlczyk posted:
    >
    >> On Tue, 29 Jun 2004 12:46:52 +0200, David Christopher Weichert
    >> <> wrote:
    >>
    >>>On Tue, 29 Jun 2004 05:23:09 -0500 TLOlczyk posted:
    >>>
    >>>> Is there a library that will allow me to parse bad HTML?
    >>>> ( good html too, but any lib will do that ).
    >>>>
    >>>> Yes I can use Perl, but I want the flexibility to use any one
    >>>> of several languages. So a shared object/dll would be best.
    >>>> I've looked at libxml ( actually libxml2 ) and expat ( I know
    >>>> they are really XML parsers, but one can hope ), and neither
    >>>> handles HTML well enough. I'm totally confused by libwww.
    >>>> The libwww people suggest looking at the parser in Amaya,
    >>>> but I don't know how good it is or if I can extract it from
    >>>> the rest of Amaya.
    >>>>
    >>>> Suggestions?
    >>>>
    >>>
    >>>Georg Rehm describes (in: Mehler & Lobin: Automatische Textanalyse,
    >>>2004) a two step process for converting arbitrary HTML Webpages to
    >>>XHTML. According to him it works in 98.7 % of all cases:
    >>>
    >>>1) use tidy to read and try to convert the HTML to XHTML
    >>>

    >> No. Tidy chokes on embedded < > and a few other bad constructs.
    >> At least it did last I looked. I will have to try it again.
    >> Also, it comes as a standalone application. If I were to use
    >> something like tidy, I would prefer it to be a lib (though I
    >> understand they are working on it).

    >
    >tidy also comes as a lib.
    >
    >>
    >> Ps: I never tried it with javascript. How does it handle that.

    >
    >Just tried tidy on a randomly picked page with JavaScript
    >(http://javascript.internet.com/). Tidy reported warnings and errors, but
    >could not fix them automatically. Seems Rehm used easier examples. I
    >can't say whether this behaviour is down to the JavaScript or to other
    >stuff wrong with that particular file.
    >

    That's because with anything but the simplest JavaScript,
    you are going to encounter something like:
    for (var i=0; i < something; i++)
    which is going to choke any XML-based parser.
    You can't change it to:
    for (var i=0; i &lt; something; i++)
    because that will break any JavaScript interpreter.

    If you want to really parse HTML, you need to pick out
    the JavaScript on the fly.
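
    An HTML-aware parser does exactly that: it switches to a raw-text
    (CDATA) mode inside <script>, so the bare < never reaches the markup
    tokenizer. A small sketch using the lenient HTMLParser from Python's
    standard library, shown only as an example of the technique:

```python
# Illustration: an HTML-aware parser treats <script> content as raw
# text, so the "<" in the for loop passes through untouched.
from html.parser import HTMLParser

class ScriptGrabber(HTMLParser):
    """Collects the raw text of every <script> element."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if self.in_script:
            self.scripts.append(data)

p = ScriptGrabber()
p.feed("<html><script>for (var i=0; i < n; i++) f(i);</script></html>")
print("".join(p.scripts))  # the loop source, "<" intact
```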

    >> Both expat and libxml choke on the first for loop.
    >>
    >>>2) if 1) fails they use HTML::Treebuilder (Perl module, see:
    >>>http://www.cpan.org) and then again tidy.
    >>>

    >> As I said before I would like to be independent of Perl.
    >>

    >Rehm states that he used HTML::Treebuilder only in 2.7 % of all cases and
    >that otherwise tidy was sufficient. This may have to do with the fact that
    >the pages he sampled seemingly worked better with tidy than the random
    >sample I picked. (Rehm sampled pages from German educational institutions).
    >
    >Looks like tidy on its own is not the solution, but might be of some use.
    >

    No. Tidy as it works now is a recipe for disaster. There are a few
    known exceptions, so you spend a lot of time writing code to get
    around them. Then more and more exceptions pop up, requiring more and
    more complex code, until your project collapses because maintaining
    all the exceptions takes up all your time.

    The answer to the problem is not a program which can take relatively
    good HTML and produce really good HTML.

    The solution is to start with a parser that can handle badly formed
    but serviceable HTML in the first place.

    Which tells me that we have gone far afield.

    So, back to my main question:
    does anyone out there know of an HTML parser that comes as a shared
    object/DLL?


     
    TLOlczyk, Jun 29, 2004
    #5
