Parsing HTML/XML documents

Discussion in 'Python' started by pabloski@giochinternet.com, Apr 26, 2007.

  1. Guest

    I need to parse real-world HTML/XML documents, and I found two nice Python
    solutions: BeautifulSoup and Tidy.

    However, I also found pyXPCOM, which is a wrapper for Gecko. So I was thinking
    Gecko surely handles bad HTML in a more consistent and error-proof way
    than BS and Tidy.

    I'm interested in using the Mozilla DOM from inside a Python script, but
    I'm a bit confused about how I can use pyXPCOM to accomplish this.

    Any suggestions?
     
    , Apr 26, 2007
    #1

  2. Stefan Behnel

    pabloski@giochinternet.com wrote:
    > I need to parse real-world HTML/XML documents, and I found two nice Python
    > solutions: BeautifulSoup and Tidy.


    There's also lxml, in case you want a real XML tool.
    http://codespeak.net/lxml/
    http://codespeak.net/lxml/dev/parsing.html#parsers


    > However, I also found pyXPCOM, which is a wrapper for Gecko. So I was thinking
    > Gecko surely handles bad HTML in a more consistent and error-proof way
    > than BS and Tidy.
    >
    > I'm interested in using the Mozilla DOM from inside a Python script, but
    > I'm a bit confused about how I can use pyXPCOM to accomplish this.


    I've never used it, but I doubt Gecko would yield substantially better results
    than any of the three above. You're dealing with broken data here, so it just
    depends on your input which one of them wins.
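
    One way to see how a given parser copes with your input is to feed it a
    deliberately broken snippet and look at what comes out. The sketch below is
    only an illustration: it uses the standard library's html.parser as a
    stand-in (it is not BeautifulSoup, Tidy or lxml), and the names
    LinkCollector, extract_links and BROKEN_HTML are made up for this example.
    The same broken string can be handed to any of the tools above to compare.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags; end tags are not required."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return all href values found in markup, in document order."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Hypothetical test input: unclosed <p>, badly nested <b>, missing </a> and </html>.
BROKEN_HTML = (
    "<html><body><p>First<p>Second "
    '<a href="/a">one<b>bold</a> <a href="/b">two</body>'
)

if __name__ == "__main__":
    print(extract_links(BROKEN_HTML))
```

    This only shows that the links survive the broken nesting; it says nothing
    about how each tool would repair the tree, which is where they differ.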

    Stefan
     
    Stefan Behnel, Apr 26, 2007
    #2

  3. Max M Guest

    Stefan Behnel wrote:
    > pabloski@giochinternet.com wrote:
    >> I need to parse real-world HTML/XML documents, and I found two nice Python
    >> solutions: BeautifulSoup and Tidy.

    >
    > There's also lxml, in case you want a real XML tool.
    > http://codespeak.net/lxml/
    > http://codespeak.net/lxml/dev/parsing.html#parsers


    I have used both BeautifulSoup and lxml. They are both good tools.

    lxml is blindingly fast compared to BeautifulSoup, though.

    A simple tool for importing contact information from 6000 XML files of
    23 MBytes into Zope runs in about 30 seconds. No optimisations at all,
    just inefficient XPath expressions.

    That is pretty good in my book.
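
    A minimal sketch of that kind of import loop, under stated assumptions: the
    element names contact, name and email are hypothetical (the real files are
    not shown here), and read_contacts and demo are names invented for the
    example. It uses lxml when it is installed and falls back to the standard
    library's ElementTree, whose findall/findtext accept the same simple path
    expressions.

```python
import glob
import os
import tempfile

try:
    from lxml import etree  # fast parser with full XPath support
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, XPath subset only


def read_contacts(directory):
    """Yield one dict per <contact> element in the *.xml files of directory."""
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        tree = etree.parse(path)
        # ".//contact" searches the whole tree below the root on every call.
        for contact in tree.findall(".//contact"):
            yield {
                "name": contact.findtext("name", default="").strip(),
                "email": contact.findtext("email", default="").strip(),
            }


def demo():
    """Write two tiny sample files to a temp directory and read them back."""
    people = [("Ann", "ann@example.org"), ("Bob", "bob@example.org")]
    with tempfile.TemporaryDirectory() as tmp:
        for i, (name, email) in enumerate(people):
            target = os.path.join(tmp, "contact%d.xml" % i)
            with open(target, "w", encoding="utf-8") as fh:
                fh.write(
                    "<contacts><contact><name>%s</name>"
                    "<email>%s</email></contact></contacts>" % (name, email)
                )
        # Consume the generator before the temporary directory is removed.
        return list(read_contacts(tmp))


if __name__ == "__main__":
    for record in demo():
        print(record)
```

    Storing the records in Zope is left out; each dict would simply be passed
    on to whatever creates the objects there.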

    --

    hilsen/regards Max M, Denmark

    http://www.mxm.dk/
    IT's Mad Science
     
    Max M, Apr 26, 2007
    #3
