HTML DOM parser?

Discussion in 'Python' started by Paul Rubin, Jul 31, 2003.

  1. Paul Rubin

    Paul Rubin Guest

    Is there an HTML DOM parser available for Python? Preferably one that
    does a reasonable job with the crappy HTML out there on real web
    pages, that doesn't get upset about unterminated tables and stuff like
    that. Many extra points if it understands Javascript. Application is
    a screen scraping web robot. Thanks.
     
    Paul Rubin, Jul 31, 2003
    #1
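
For readers who want a standard-library starting point for the question above, here is a minimal sketch of a tolerant tree builder on top of Python 3's `html.parser`. The names `Node`, `TolerantTreeBuilder`, and `parse_html` are hypothetical and not from any library mentioned in this thread. It does not apply HTML's implicit-closing rules, so the tree shape can differ from what a browser builds, and it does nothing about JavaScript.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial, assumed list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A very small DOM-like element node (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # holds Node objects and text strings

    def find_all(self, tag):
        """Return all descendant nodes with the given tag, in document order."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    found.append(child)
                found.extend(child.find_all(tag))
        return found


class TolerantTreeBuilder(HTMLParser):
    """Builds a tree without failing on unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._open = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self._open[-1])
        self._open[-1].children.append(node)
        if tag not in VOID:
            self._open.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # a stray end tag with no match is silently ignored.
        for i in range(len(self._open) - 1, 0, -1):
            if self._open[i].tag == tag:
                del self._open[i:]
                return

    def handle_data(self, data):
        if data.strip():
            self._open[-1].children.append(data)


def parse_html(text):
    builder = TolerantTreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


if __name__ == "__main__":
    page = "<html><body><table><tr><td>cell one<tr><td>cell two</table><p>after</body>"
    root = parse_html(page)
    print([td.children for td in root.find_all("td")])
```

Unclosed `<tr>` and `<td>` elements simply stay open until an enclosing end tag such as `</table>` closes them, which is enough for many scraping tasks but is not a faithful browser DOM.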

  2. adfgvx wrote:

    > Try tidy. There are two Python wrappers: mxtidy and utidy. The latter is


    Where can we get utidy?

    > more recent and uses the new tidylib. BUT it will only correct a bad HTML
    > page and transform it to XML or XHTML output, which you then load as a DOM
    > with another parser. Personally I use pyRXP.


    Bye,
    Walter Dörwald
     
    Walter Dörwald, Jul 31, 2003
    #2
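
The second half of the tidy workflow described above (load the cleaned XHTML into a DOM with another parser) might look roughly like this. It is a sketch in Python 3 using the standard library's `xml.dom.minidom` rather than pyRXP. The `XHTML` string is an assumed stand-in for tidy's output, and `extract_links` is a hypothetical helper.

```python
from xml.dom import minidom

# Assumed sample: stands in for well-formed XHTML that tidy might emit
# after cleaning a broken page. It is not real tidy output.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<table><tr><td><a href="/a">first</a></td></tr></table>
<p><a href="/b">second</a></p>
</body>
</html>"""


def extract_links(xhtml_text):
    """Return (href, link text) pairs from a well-formed XHTML document."""
    doc = minidom.parseString(xhtml_text)
    links = []
    for anchor in doc.getElementsByTagName("a"):
        # Join only the direct text children of the <a> element.
        text = "".join(
            node.data
            for node in anchor.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        links.append((anchor.getAttribute("href"), text))
    return links


if __name__ == "__main__":
    for href, text in extract_links(XHTML):
        print(href, text)
```

Because `minidom` is a strict XML parser, it raises an error on markup that is not well-formed, which is exactly why the tidy step has to come first.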

  3. John J. Lee

    John J. Lee Guest

    Paul Rubin writes:

    > Is there an HTML DOM parser available for Python? Preferably one that
    > does a reasonable job with the crappy HTML out there on real web
    > pages, that doesn't get upset about unterminated tables and stuff like
    > that. Many extra points if it understands Javascript. Application is
    > a screen scraping web robot. Thanks.


    glork. I just started working on this myself.

    Email me if you'd like the code, such as it is. I've wrapped the
    Mozilla JS interpreter but am currently stuck on a segfault, so I
    could certainly do with a collaborator.

    I'm using utidylib and 4DOM (latter from PyXML).

    Mind you, if you actually want to get a job done <wink>, for a
    quick-but-bulky (and somewhat closed) solution, try PyKDE (KHTML /
    KJS) or IE automation (MSHTML / JScript). Mozilla + XPCOM also, but I
    think it requires rebuilding Mozilla to get PyXPCOM support. There's
    also httpunit (in Java, useable from Jython).


    John
     
    John J. Lee, Aug 1, 2003
    #3
  4. Gilles Lenfant
    "Paul Rubin" wrote in message news:
    ...
    > Is there an HTML DOM parser available for Python? Preferably one that
    > does a reasonable job with the crappy HTML out there on real web
    > pages, that doesn't get upset about unterminated tables and stuff like
    > that. Many extra points if it understands Javascript. Application is
    > a screen scraping web robot. Thanks.


    Windoze IE5(+) + Win32All Python package only:

    Use IE as a COM object, browse to the file or URL, then get its DOM root.
    But any JavaScript found in that page is executed at page load and may fool
    your app.

    --Gilles
     
    Gilles Lenfant, Aug 1, 2003
    #4
  5. Paul Rubin

    Paul Rubin Guest

    writes:
    > Here is a quick example of using automation with IE
    > # This is a sample of automating IE using Python.


    Thanks, I should have said I'm running under gnu/linux and I was
    hoping for a standalone solution (some of the ones suggested sound
    worth looking into). Even connecting up Python to Mozilla sounds
    awfully heavyweight.
     
    Paul Rubin, Aug 2, 2003
    #5
  6. John J. Lee

    John J. Lee Guest

    Paul Rubin writes:

    > writes:
    > > Here is a quick example of using automation with IE
    > > # This is a sample of automating IE using Python.

    >
    > Thanks, I should have said I'm running under gnu/linux and I was
    > hoping for a standalone solution (some of the ones suggested sound
    > worth looking into). Even connecting up Python to Mozilla sounds
    > awfully heavyweight.


    PyKDE is less hassle, I think. It's certainly heavyweight, though.
    Probably more lightweight still is HttpUnit on Jython. I haven't used
    either, but I have compiled PyKDE recently, and didn't run into
    problems (but if you're unlucky, you may have to compile Qt, KDE, sip
    and PyQt first!).

    I seem to have got a basic JavaScript wrapper working now (I'm using
    libjs from Mozilla's standalone spidermonkey distribution), bound 4DOM
    to it, and extracted & executed the script from a web page. Quite a
    lot more to do, though (browser-like interface of some sort,
    javascript: scheme URLs, implement window object, wiring up event
    attributes to the JS interpreter, getting the DOM actually working
    properly, understanding what document.write does, trying to connect
    the DOM to my Python HTML form and HTTP cookies interfaces...).

    Anybody happen to know where JavaScript's document.some_form is
    documented? Official W3C DOM has document.forms, but real browser
    DOMs apparently have forms directly on the document object.


    John
     
    John J. Lee, Aug 2, 2003
    #6
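
The difference raised above between `document.forms` and `document.some_form` can be illustrated on the Python side with a small wrapper. This is a sketch under assumptions: `BrowserLikeDocument` is a hypothetical class, it uses Python 3's `xml.dom.minidom` instead of 4DOM, and it only imitates the browser behaviour of exposing named forms as attributes of the document. It is not how any browser or the code discussed in this thread implements it.

```python
from xml.dom import minidom


class BrowserLikeDocument:
    """Wraps a DOM document and exposes forms the way browsers appear to."""

    def __init__(self, dom_document):
        self._doc = dom_document

    @property
    def forms(self):
        # W3C-style access: the collection of all <form> elements.
        return list(self._doc.getElementsByTagName("form"))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        # Private names are never treated as form names.
        if name.startswith("_"):
            raise AttributeError(name)
        # Browser-style access: document.<form name>.
        for form in self.forms:
            if form.getAttribute("name") == name:
                return form
        raise AttributeError(name)


if __name__ == "__main__":
    page = (
        '<html><body>'
        '<form name="login" action="/in"/>'
        '<form name="search" action="/q"/>'
        '</body></html>'
    )
    document = BrowserLikeDocument(minidom.parseString(page))
    print(len(document.forms))
    print(document.search.getAttribute("action"))
```

A JavaScript binding would need the same fallback in its property-lookup hook, so that a script referring to `document.login` resolves to the form named `login`.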
  7. Paul Rubin

    Guest

    (John J. Lee) wrote in message news:<>...
    > Paul Rubin writes:
    >
    > > writes:
    > > > Here is a quick example of using automation with IE
    > > > # This is a sample of automating IE using Python.

    > >
    > > Thanks, I should have said I'm running under gnu/linux and I was
    > > hoping for a standalone solution (some of the ones suggested sound
    > > worth looking into). Even connecting up Python to Mozilla sounds
    > > awfully heavyweight.

    >
    > PyKDE is less hassle, I think. It's certainly heavyweight, though.
    > Probably more lightweight still is HttpUnit on Jython. I haven't used
    > either, but I have compiled PyKDE recently, and didn't run into
    > problems (but if you're unlucky, you may have to compile Qt, KDE, sip
    > and PyQt first!).
    >
    > I seem to have got a basic JavaScript wrapper working now (I'm using
    > libjs from Mozilla's standalone spidermonkey distribution), bound 4DOM
    > to it, and extracted & executed the script from a web page. Quite a
    > lot more to do, though (browser-like interface of some sort,
    > javascript: scheme URLs, implement window object, wiring up event
    > attributes to the JS interpreter, getting the DOM actually working
    > properly, understanding what document.write does, trying to connect
    > the DOM to my Python HTML form and HTTP cookies interfaces...).
    >
    > Anybody happen to know where JavaScript's document.some_form is
    > documented? Official W3C DOM has document.forms, but real browser
    > DOMs apparently have forms directly on the document object.
    >
    >
    > John


    Try here: http://msdn.microsoft.com/library/d...hor/dhtml/reference/dhtml_reference_entry.asp
     
    , Aug 4, 2003
    #7
