HTML purifier using BeautifulSoup?

Discussion in 'Python' started by Dan Stromberg, Dec 21, 2004.

  1. Has anyone tried to construct an HTML janitor script using BeautifulSoup?

    My situation:

    I'm trying to convert a series of web pages from .html to PalmDoc format,
    using Plucker, which is written in Python. The Plucker project suggests
    passing HTML through "tidy", to get well-formed HTML for Plucker to work
    with.

    However, some of the pages I want to convert are so bad that even tidy
    pukes on them.

    I was thinking that BeautifulSoup might be more tolerant of really bad
    HTML... Which led me to the question this article started out with. :)

    Thanks!
     
    Dan Stromberg, Dec 21, 2004

  2. Dan Stromberg wrote:
    > Has anyone tried to construct an HTML janitor script using BeautifulSoup?
    >
    > My situation:
    >
    > I'm trying to convert a series of web pages from .html to PalmDoc format,
    > using Plucker, which is written in Python. The Plucker project suggests
    > passing HTML through "tidy", to get well-formed HTML for Plucker to work
    > with.
    >
    > However, some of the pages I want to convert are so bad that even tidy
    > pukes on them.
    >
    > I was thinking that BeautifulSoup might be more tolerant of really bad
    > HTML... Which led me to the question this article started out with. :)
    >
    > Thanks!


    I have used BeautifulSoup for screen scraping, pulling HTML into
    structured form (using XML). Is that similar to a janitor script? I
    used it because tidy was puking on some HTML. BS has been excellent.
     
    Jonathan Clark, Jan 7, 2005
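
For readers wondering what such a janitor pass might look like, here is a minimal sketch. It assumes the modern `bs4` package (BeautifulSoup 4, which postdates this thread) and Python's built-in `html.parser` backend. The function name `clean_html` and the decision to drop scripts, styles, and comments are illustrative assumptions, not something either poster described.

```python
try:
    # Third-party package (assumed installed): pip install beautifulsoup4
    from bs4 import BeautifulSoup, Comment
except ImportError:
    BeautifulSoup = Comment = None


def clean_html(raw_html):
    """Parse possibly malformed HTML and return re-serialized markup.

    Hypothetical helper for illustration; not part of Plucker or tidy.
    """
    if BeautifulSoup is None:
        raise RuntimeError("bs4 is not installed; try: pip install beautifulsoup4")

    # Build a tree from the input without requiring it to be well-formed.
    soup = BeautifulSoup(raw_html, "html.parser")

    # Drop elements a document converter is unlikely to need (an assumption).
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # Drop HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Serializing the tree writes a closing tag for every element,
    # including ones the source left open.
    return str(soup)
```

Calling `clean_html('<p>Hello <b>world<p>Second')` should return markup in which every `<p>` and `<b>` has a closing tag, though the nesting the parser chooses may differ from what a browser would infer. Whether the result is clean enough for Plucker would need testing on the actual pages.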
