Seek HTML cleanup utilities

Jon Roland

I have a number of changes I like to make to HTML files that are not
currently supported by HTML Tidy. Most of them arise from OCR
recognition errors, and many from the way my OCR program, FineReader,
saves to HTML. I have begun to write stream-editing scripts in Python,
but wonder whether someone else may have already done so. It would
save me a lot of time to use or modify already-written utilities. I
would appreciate direction to any that are available. Please respond
by email.

Some of the kinds of cleanup I want to be able to do include:

1. Removal of empty tag pairs.

2. Trimming/moving whitespace around tags:
a. Removal of whitespace following a <p> and preceding
a </p>.
b. Moving whitespace that follows a start tag to precede
it, and whitespace that precedes an end tag to follow it.

3. Moving certain punctuation -- comma, period,
semi-colon, etc. -- outside of certain end tags, such
as </i>, </b>, etc.

4. Removal of certain attributes:
a. In <font> tag, face="Times New Roman" (or
whatever) so that it will be viewed with default font face.
b. In <font> tag, size="2" (or whatever) so that it
will be viewed with default font size.

5. Changing of certain attributes:
a. In <font> tag, absolute size="4" to relative
size="+1" (or whatever).

6. Changing of certain tags:
a. <em> to <i>.
b. <strong> to <b>.

7. Removal of certain tags, such as <p>, from around
all the contents of table cells.

8. For all tables, removal of empty topmost and
bottommost rows, leftmost and rightmost columns.
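
To make items 1 through 3 concrete, here is a minimal regex sketch in Python of the kind of stream editing I have in mind. The function names and the INLINE tag set are hypothetical, chosen only for illustration. Regular expressions are an approximation here: markup with a ">" inside an attribute value, or abbreviations such as "etc." inside italics, would need more careful handling.

```python
import re

# Hypothetical set of inline tags to process; adjust as needed.
INLINE = r"(?:i|b|em|strong|font)"

def remove_empty_pairs(html):
    """Item 1: drop pairs such as <i></i>, keeping any whitespace they held."""
    pattern = re.compile(r"<(%s)\b[^>]*>(\s*)</\1>" % INLINE, re.IGNORECASE)
    previous = None
    # Repeat until stable so that <b><i></i></b> disappears entirely.
    while previous != html:
        previous = html
        html = pattern.sub(r"\2", html)
    return html

def trim_paragraph_whitespace(html):
    """Item 2a: strip whitespace just after <p> and just before </p>."""
    html = re.sub(r"(<p\b[^>]*>)\s+", r"\1", html, flags=re.IGNORECASE)
    return re.sub(r"\s+(</p>)", r"\1", html, flags=re.IGNORECASE)

def move_whitespace_outside(html):
    """Item 2b: '<i> word </i>' becomes ' <i>word</i> '."""
    html = re.sub(r"(<%s\b[^>]*>)(\s+)" % INLINE, r"\2\1", html,
                  flags=re.IGNORECASE)
    return re.sub(r"(\s+)(</%s>)" % INLINE, r"\2\1", html,
                  flags=re.IGNORECASE)

def move_punctuation_outside(html):
    """Item 3: '<i>word,</i>' becomes '<i>word</i>,'."""
    return re.sub(r"([,.;:]+)(</(?:i|b)>)", r"\2\1", html,
                  flags=re.IGNORECASE)
```

The order of application matters: moving whitespace outside the tags first can expose new empty pairs, so remove_empty_pairs is probably best run last.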

I could go on, but this provides a sample.
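
For the attribute and tag changes in items 4 through 7, a similarly rough sketch follows. Again the function names are hypothetical, and the base size of 3 is an assumption taken from the usual HTML default, which matches the size="4" to size="+1" example above. The table-cell function only handles a cell wrapped in a single <p> and leaves anything else alone.

```python
import re

FONT_TAG = re.compile(r"<font\b[^>]*>", re.IGNORECASE)

def drop_attribute(html, name):
    """Item 4: remove name="..." from every <font> start tag."""
    attr = re.compile(
        r"""\s+%s\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""" % re.escape(name),
        re.IGNORECASE)
    return FONT_TAG.sub(lambda m: attr.sub("", m.group(0)), html)

def relative_size(html, base=3):
    """Item 5: rewrite absolute size="4" as size="+1" (assumed base of 3)."""
    size = re.compile(r"""(\bsize\s*=\s*)(["']?)(\d)\2""", re.IGNORECASE)

    def fix(m):
        delta = int(m.group(3)) - base
        # A size equal to the base is left untouched.
        return '%s"%+d"' % (m.group(1), delta) if delta else m.group(0)

    return FONT_TAG.sub(lambda m: size.sub(fix, m.group(0)), html)

def rename_tags(html, mapping=(("em", "i"), ("strong", "b"))):
    """Item 6: <em> to <i> and <strong> to <b>, start and end tags alike."""
    for old, new in mapping:
        html = re.sub(r"<(/?)%s\b" % old, r"<\g<1>%s" % new, html,
                      flags=re.IGNORECASE)
    return html

def unwrap_cell_paragraph(html):
    """Item 7: '<td><p>cell</p></td>' becomes '<td>cell</td>'.

    Cells holding more than one paragraph are left unchanged.
    """
    pattern = re.compile(
        r"(<td\b[^>]*>)\s*<p\b[^>]*>((?:(?!</?p\b).)*)</p>\s*(</td>)",
        re.IGNORECASE | re.DOTALL)
    return pattern.sub(r"\1\2\3", html)
```

Item 8, trimming empty outer rows and columns, depends on the table's full row and column structure, so it is probably better served by a real parser than by regular expressions.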

Please visit my website at http://www.constitution.org to see what
kinds of HTML documents I am producing.
 
