beautifulsoup vs. tidy

Discussion in 'Python' started by bruce, Jul 1, 2006.

  1. bruce

    bruce Guest

    hi...

    never used perl, but i have an issue trying to resolve some html that
    appears to be "dirty/malformed" regarding the overall structure. in
    researching validators, i came across the beautifulsoup app and wanted to
    know if anybody could give me pros/cons of the app as it relates to any of
    the other validation apps...

    the issue i'm facing involves parsing some websites, so i'm trying to
    extract information based on the DOM/XPath functions.. i'm using perl to
    handle the extraction....

    thanks

    -bruce
     
    bruce, Jul 1, 2006
    #1

  2. Ravi Teja

    Ravi Teja Guest

    bruce wrote:
    > hi...
    >
    > never used perl, but i have an issue trying to resolve some html that
    > appears to be "dirty/malformed" regarding the overall structure. in
    > researching validators, i came across the beautifulsoup app and wanted to
    > know if anybody could give me pros/cons of the app as it relates to any of
    > the other validation apps...
    >
    > the issue i'm facing involves parsing some websites, so i'm trying to
    > extract information based on the DOM/XPath functions.. i'm using perl to
    > handle the extraction....


    1.) XPath is not a good idea at all with "malformed" HTML or perhaps
    web pages in general.
    2.) BeautifulSoup is not a validator but works well with bad HTML. Also
    look at Mechanize and ClientForm.
    3.) XMLStarlet is a good XML validator
    (http://xmlstar.sourceforge.net/). It's not Python but you don't need
    to care about the language it is written in.
    4.) For a simple HTML validator, just use http://validator.w3.org/
     
    Ravi Teja, Jul 1, 2006
    #2

  3. Paddy

    Paddy Guest

    bruce wrote:
    > hi...
    >
    > never used perl, but i have an issue trying to resolve some html that
    > appears to be "dirty/malformed" regarding the overall structure. in
    > researching validators, i came across the beautifulsoup app and wanted to
    > know if anybody could give me pros/cons of the app as it relates to any of
    > the other validation apps...
    >

    I'm not too sure of what you are after. You mention tidy in the subject
    which made me think that maybe you were trying to generate well-formed
    HTML from malformed webpages that nonetheless browsers can interpret.
    If that is the case then try HTML tidy:
    http://www.w3.org/People/Raggett/tidy/

    - Pad.
     
    Paddy, Jul 1, 2006
    #3
  4. bruce wrote:

    > that's exactly what i'm trying to accomplish... i've used tidy, but it seems
    > to still generate warnings...
    >
    > initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
    >
    > the xpath/libxml functions in the perl app complain regarding the file.


    what exactly do they complain about ?

    </F>
     
    Fredrik Lundh, Jul 1, 2006
    #4
  5. Paul Boddie

    Paul Boddie Guest

    Ravi Teja wrote:
    >
    > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
    > web pages in general.


    import libxml2dom
    import urllib
    f = urllib.urlopen("http://wiki.python.org/moin/")
    s = f.read()
    f.close()
    # s contains HTML not XML text
    d = libxml2dom.parseString(s, html=1)
    # get the community-related links
    for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
        print label.nodeValue

    Of course, lxml should be able to do this kind of thing as well. I'd be
    interested to know why this "is not a good idea", though.

    Paul
     
    Paul Boddie, Jul 1, 2006
    #5
  6. Matt Good

    Matt Good Guest

    bruce wrote:
    > that's exactly what i'm trying to accomplish... i've used tidy, but it seems
    > to still generate warnings...
    >
    > initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
    >
    > the xpath/libxml functions in the perl app complain regarding the file. my
    > thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
    > functions are too strict!


    Clean HTML is not valid XML. If you want to process the output with an
    XML library you'll need to tell Tidy to output XHTML. Then it should
    be valid for XML processing.
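
    To make the distinction concrete, here is a minimal sketch (not taken
    from the thread) that uses Python 3's standard-library
    xml.etree.ElementTree as a stand-in for a strict XML parser such as the
    libxml-based Perl code. The two fragments and the helper name
    parses_as_xml are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments for illustration; neither comes from the thread.
# Both are acceptable to a browser, but only the second is well-formed XML.
html_fragment = "<html><body><p>first<br>second</p></body></html>"
xhtml_fragment = "<html><body><p>first<br/>second</p></body></html>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The unclosed <br> makes the closing </p> a mismatched tag for an XML parser.
html_ok = parses_as_xml(html_fragment)
# The self-closed <br/> keeps every element properly nested and terminated.
xhtml_ok = parses_as_xml(xhtml_fragment)

print(html_ok, xhtml_ok)
```

    With Tidy itself, the -asxhtml command-line option is the switch that
    requests XHTML output.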

    Of course BeautifulSoup is also a very nice library if you need to
    extract some information, but don't necessarily require XML processing
    to do it.

    -- Matt Good
     
    Matt Good, Jul 1, 2006
    #6
  7. Ravi Teja

    Ravi Teja Guest

    Paul Boddie wrote:
    > Ravi Teja wrote:
    > >
    > > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
    > > web pages in general.

    >
    > import libxml2dom
    > import urllib
    > f = urllib.urlopen("http://wiki.python.org/moin/")
    > s = f.read()
    > f.close()
    > # s contains HTML not XML text
    > d = libxml2dom.parseString(s, html=1)
    > # get the community-related links
    > for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    >     print label.nodeValue


    I wasn't aware that your module does html as well.

    > Of course, lxml should be able to do this kind of thing as well. I'd be
    > interested to know why this "is not a good idea", though.


    No reason that you don't know already.

    http://www.boddie.org.uk/python/HTML.html

    "If the document text is well-formed XML, we could omit the html
    parameter or set it to have a false value."

    XML parsers are not required to be forgiving to be regarded compliant.
    And much HTML out there is not well formed.
     
    Ravi Teja, Jul 1, 2006
    #7
  8. Ravi Teja wrote:

    >> Of course, lxml should be able to do this kind of thing as well. I'd be
    >> interested to know why this "is not a good idea", though.

    >
    > No reason that you don't know already.
    >
    > http://www.boddie.org.uk/python/HTML.html
    >
    > "If the document text is well-formed XML, we could omit the html
    > parameter or set it to have a false value."
    >
    > XML parsers are not required to be forgiving to be regarded compliant.
    > And much HTML out there is not well formed.


    so? once you run it through an HTML-aware parser, the *resulting*
    structure is well formed.

    a site generator->converter->xpath approach is no less reliable than any
    other HTML-scraping approach.
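
    A rough sketch of that converter-then-query idea, using only the
    Python 3 standard library. The TreeBuilder class, the VOID set, and the
    sample markup are hypothetical, and the repair rules are deliberately
    tiny; this is not how Tidy, BeautifulSoup, or libxml2's HTML parser
    actually work. ElementTree also supports only a small subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML (partial list, for the sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Toy HTML-to-tree converter that repairs unclosed tags while parsing."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # A new <li> or <p> implicitly closes a still-open one of the same kind.
        if tag in ("li", "p") and self.stack[-1].tag == tag:
            self.stack.pop()
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        node = self.stack[-1]
        if len(node):
            # Text after a closed child is stored as that child's tail.
            last = node[-1]
            last.tail = (last.tail or "") + data
        else:
            node.text = (node.text or "") + data


# Made-up malformed input: unclosed <li> and <p> elements.
messy = "<ul><li><a href='/a'>Alpha</a><li><a href='/b'>Beta</a></ul><p>unclosed"

builder = TreeBuilder()
builder.feed(messy)
builder.close()

# The resulting tree is well formed, so path queries work on it.
links = [(a.get("href"), a.text) for a in builder.root.findall(".//li/a")]
print(links)
```

    The input never has to be well formed; only the tree produced by the
    HTML-aware step does, and that tree can be serialized with ET.tostring
    and handed to any XML tool.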

    </F>
     
    Fredrik Lundh, Jul 2, 2006
    #8
  9. bruce

    Guest

    bruce wrote:
    > hi paddy...
    >
    > that's exactly what i'm trying to accomplish... i've used tidy, but it seems
    > to still generate warnings...
    >
    > initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
    >
    > the xpath/libxml functions in the perl app complain regarding the file. my
    > thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
    > functions are too strict!
    >
    > which is why i decided to see if anyone on the python side has
    > experienced/solved this problem..


    FWIW here's my usual approach:

    http://copia.ogbuji.net/blog/2005-07-22/Beyond_HTM

    Personally, I avoid Tidy. I've too often seen it crash or hang on
    really bad HTML. TagSoup seems to be built like a tank. I've also
    never seen BeautifulSoup choke, but I don't use it as much as TagSoup.

    --
    Uche Ogbuji Fourthought, Inc.
    http://uche.ogbuji.net http://fourthought.com
    http://copia.ogbuji.net http://4Suite.org
    Articles: http://uche.ogbuji.net/tech/publications/
     
    , Jul 3, 2006
    #9
