Parsing HTML/XML documents

Discussion in 'Python' started by pabloski@giochinternet.com, Apr 26, 2007.

  1. Guest

    I need to parse real-world HTML/XML documents, and I found two nice Python
    solutions: BeautifulSoup and Tidy.

    However, I also found pyXPCOM, which is a wrapper for Gecko. So I was
    thinking that Gecko surely handles bad HTML in a more consistent and
    error-proof way than BS and Tidy.

    I'm interested in using the Mozilla DOM from inside a Python script, but
    I'm a bit confused about how I can use pyXPCOM to accomplish this.

    Any suggestions?
     
    , Apr 26, 2007
    #1

  2. wrote:
    > I need to parse real world HTML/XML documents and I found two nice python
    > solution: BeautifulSoup and Tidy.


    There's also lxml, in case you want a real XML tool.
    http://codespeak.net/lxml/
    http://codespeak.net/lxml/dev/parsing.html#parsers


    > However I found pyXPCOM that is a wrapper for Gecko. So I was thinking
    > Gecko surely handles bad html in a more consistent and error-proof way
    > than BS and Tidy.
    >
    > I'm interested in using Mozilla DOM from inside a Python script, however
    > I'm a bit confused about how can I use pyXPCOM to accomplish this job.


    I've never used it, but I doubt Gecko would yield substantially better results
    than any of the three above. You're dealing with broken data here, so it just
    depends on your input which one of them wins.

    Stefan
     
    Stefan Behnel, Apr 26, 2007
    #2
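As a rough illustration of the lxml suggestion above, here is a minimal sketch of pulling links out of deliberately broken HTML. It assumes lxml is installed and uses `lxml.html.fromstring`; if lxml is missing, it falls back to the standard library's `html.parser`, which is not mentioned in the thread and is only there so the snippet still runs. The sample markup and the `extract_links` helper are made up for this example.

```python
from html.parser import HTMLParser

try:
    import lxml.html  # the parser suggested in the thread
except ImportError:
    lxml = None  # lxml not available: use the stdlib fallback below


class _HrefCollector(HTMLParser):
    """Stdlib fallback: collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs
                              if name == "href" and value)


def extract_links(markup):
    """Return the href of every <a> element, tolerating broken markup."""
    if lxml is not None:
        # libxml2's HTML parser recovers from unclosed and misnested tags.
        tree = lxml.html.fromstring(markup)
        return [str(href) for href in tree.xpath("//a/@href")]
    collector = _HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# Hypothetical input: unclosed <p>, <a> and <div> tags, no closing </html>.
BROKEN = ('<html><body><p>First<p>Second <a href="/a">link'
          '<div><a href="/b">other</body>')

links = extract_links(BROKEN)
print(links)
```

How the parsers rebuild the tree around the unclosed tags can differ, which is the point made in the post: which tool "wins" depends on the input, so it is worth trying them on a sample of your own documents.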

  3. Max M Guest

    Stefan Behnel wrote:
    > wrote:
    >> I need to parse real world HTML/XML documents and I found two nice python
    >> solution: BeautifulSoup and Tidy.

    >
    > There's also lxml, in case you want a real XML tool.
    > http://codespeak.net/lxml/
    > http://codespeak.net/lxml/dev/parsing.html#parsers


    I have used both BeautifulSoup and lxml. They are both good tools.

    lxml is blindingly fast compared to BeautifulSoup, though.

    A simple tool for importing contact information from 6000 XML files of
    23 MBytes into Zope runs in about 30 seconds. No optimisations at all.
    Just inefficient XPath expressions.

    That is pretty good in my book.

    --

    hilsen/regards Max M, Denmark

    http://www.mxm.dk/
    IT's Mad Science
     
    Max M, Apr 26, 2007
    #3
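The import tool described above is not shown in the thread, so the following is only a sketch of the general approach: parse each XML document and pull out contact fields with simple path expressions. The element names (`contact`, `name`, `email`), the sample documents, and the `extract_contacts` helper are all assumptions for illustration. The code uses lxml when it is installed and otherwise falls back to the standard library's ElementTree, which offers a similar API for this subset of path expressions.

```python
try:
    from lxml import etree  # the library discussed in the thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, similar API

# Hypothetical contact documents standing in for the files on disk.
DOCS = [
    "<contacts><contact><name>Ann</name>"
    "<email>ann@example.org</email></contact></contacts>",
    "<contacts><contact><name>Bob</name>"
    "<email>bob@example.org</email></contact>"
    "<contact><name>Cy</name></contact></contacts>",
]


def extract_contacts(xml_text):
    """Return (name, email) tuples for every <contact> in one document."""
    root = etree.fromstring(xml_text)
    contacts = []
    # ".//contact" searches the whole tree; simple, not tuned for speed.
    for node in root.findall(".//contact"):
        # findtext returns None when the child element is missing.
        contacts.append((node.findtext("name"), node.findtext("email")))
    return contacts


all_contacts = [c for doc in DOCS for c in extract_contacts(doc)]
print(all_contacts)
```

With real files, `etree.parse(path).getroot()` would replace `etree.fromstring(xml_text)`; lxml additionally provides a full `xpath()` method on elements for expressions beyond what `findall` supports.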
