Seek HTML cleanup utilities

Discussion in 'Python' started by Jon Roland, Nov 21, 2004.


    I would like to make a number of changes to HTML files that HTML Tidy
    does not currently support. Most of them arise from OCR recognition
    errors, and many from the way my OCR program, FineReader, saves to
    HTML. I have begun to write stream-editing scripts in Python, but I
    wonder whether someone else has already done so. It would save me a
    lot of time to use or modify existing utilities, and I would
    appreciate pointers to any that are available. Please respond by
    email.

    Some of the kinds of cleanup I want to be able to do include:

    1. Removal of empty tag pairs.

    2. Trimming/moving whitespace around tags:
    a. Removal of whitespace following a <p> and preceding
    a </p>.
    b. Moving whitespace that follows a start tag to precede
    it, and whitespace that precedes an end tag to follow it.

    3. Moving certain punctuation -- comma, period,
    semi-colon, etc. -- outside of certain end tags, such
    as </i>, </b>, etc.

    4. Removal of certain attributes:
    a. In <font> tag, face="Times New Roman" (or
    whatever) so that it will be viewed with default font face.
    b. In <font> tag, size="2" (or whatever) so that it
    will be viewed with default font size.

    5. Changing of certain attributes:
    a. In <font> tag, absolute size="4" to relative
    size="+1" (or whatever).

    6. Changing of certain tags:
    a. <em> to <i>.
    b. <strong> to <b>.

    7. Removal of certain tags, such as <p>, from around
    all the contents of table cells.

    8. For all tables, removal of empty topmost and
    bottommost rows, leftmost and rightmost columns.
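
To make items 1, 2a, 3 and 6 above concrete, here is a rough regex-based sketch in Python. It is only an illustration, not an existing utility: the function names and the tag sets (INLINE_TAGS, REMOVABLE_TAGS) are assumptions, and regular expressions only cope with simple, well-formed markup.

```python
import re

# Assumed tag sets for illustration; adjust to the documents at hand.
INLINE_TAGS = "i|b|em|strong|u"
REMOVABLE_TAGS = "i|b|em|strong|u|font|span|p"


def rename_tags(html, mapping=None):
    """Item 6: rename tags, e.g. <em> to <i> and <strong> to <b>."""
    mapping = {"em": "i", "strong": "b"} if mapping is None else mapping
    names = "|".join(re.escape(name) for name in mapping)

    def swap(match):
        slash, name, rest = match.groups()
        return "<%s%s%s>" % (slash, mapping[name.lower()], rest)

    return re.sub(r"<(/?)(%s)\b([^>]*)>" % names, swap, html,
                  flags=re.IGNORECASE)


def trim_paragraph_whitespace(html):
    """Item 2a: strip whitespace right after <p ...> and right before </p>."""
    html = re.sub(r"(<p\b[^>]*>)\s+", r"\1", html, flags=re.IGNORECASE)
    return re.sub(r"\s+(</p\s*>)", r"\1", html, flags=re.IGNORECASE)


def move_punctuation_out(html):
    """Item 3: move trailing punctuation from inside an inline end tag to
    just after it. A character entity such as &amp; directly before the
    end tag is matched first, so its semicolon stays in place."""
    pattern = r"((?:&#?\w+;)?)([,.;:]*)(</(?:%s)\s*>)" % INLINE_TAGS
    return re.sub(pattern, r"\1\3\2", html, flags=re.IGNORECASE)


def remove_empty_pairs(html):
    """Item 1: drop pairs of the listed tags that enclose only whitespace."""
    pattern = re.compile(r"<(%s)\b[^>]*>\s*</\1\s*>" % REMOVABLE_TAGS,
                         re.IGNORECASE)
    previous = None
    while previous != html:  # repeat so newly emptied parents go too
        previous = html
        html = pattern.sub("", html)
    return html


def cleanup(html):
    """Apply the passes in an order where earlier ones help later ones."""
    html = rename_tags(html)
    html = trim_paragraph_whitespace(html)
    html = move_punctuation_out(html)
    return remove_empty_pairs(html)


print(cleanup("<p> <em>Hello,</em> world<b></b>. </p>"))
# prints: <p><i>Hello</i>, world.</p>
```

Table cells are deliberately left out of REMOVABLE_TAGS, since dropping an empty <td> would shift the columns; the table items (7 and 8) probably need a real parser rather than regexes.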

    I could go on, but this provides a sample.
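
For the <font> attribute changes (items 4 and 5), something along these lines might serve as a starting point. Again this is only a sketch under assumptions: fix_font_tags and its parameters are hypothetical names, the defaults simply mirror the examples in the list (drop face, drop size="2", turn size="4" into size="+1"), and attribute values containing ">" or quote characters are not handled.

```python
import re

FONT_TAG = re.compile(r"<font\b([^>]*)>", re.IGNORECASE)
# name=value pairs with double-quoted, single-quoted or unquoted values
ATTR = re.compile(r"""([a-zA-Z-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")


def fix_font_tags(html, drop_attrs=("face",), drop_sizes=("2",),
                  size_map=None):
    """Rewrite each <font ...> start tag: drop unwanted attributes,
    drop sizes treated as the default, and map absolute sizes to
    relative ones."""
    size_map = {"4": "+1"} if size_map is None else size_map

    def rewrite(match):
        kept = []
        for name, raw in ATTR.findall(match.group(1)):
            key = name.lower()
            value = raw.strip("\"'")  # remove surrounding quotes, if any
            if key in drop_attrs:
                continue  # item 4a: e.g. face="Times New Roman"
            if key == "size":
                if value in drop_sizes:
                    continue  # item 4b: size regarded as the default
                value = size_map.get(value, value)  # item 5a
            kept.append('%s="%s"' % (key, value))
        return "<font%s>" % "".join(" " + attr for attr in kept)

    return FONT_TAG.sub(rewrite, html)


print(fix_font_tags('<font face="Times New Roman" size="4">x</font>'))
# prints: <font size="+1">x</font>
```

A <font> tag that loses all of its attributes is left in place as a bare <font>...</font> pair; unwrapping those would be a separate pass.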

    Please visit my website to see what kinds of HTML documents I am
    producing.
    Jon Roland, Nov 21, 2004
