Right tool and method to strip off html files (python, sed, awk?)

sebzzz · Jul 13, 2007

Hi,

I'm in the process of refactoring a lot of HTML documents, and I'm
using HTML Tidy to do part of this work (clean up, convert to XHTML,
and remove font and center tags).

Now, Tidy will only do part of the work I need to do. I have to
remove all the presentational tags and attributes from the pages (in
other words, strip the pages down), including the tables that are
used for layout rather than data (how do I tell them apart?).

I thought about doing that with Python (which I'm in the process of
learning), but maybe another tool (like sed?) would be better suited
for this job.

I kind of know generally what I need to do:

1- Find all HTML files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Apply some regular expressions recursively to the file to do the
things I want (delete certain tags and certain attributes when it
encounters them).
4- Write the changed file, and go through all the files like that.

But I don't know how to do it for real: the syntax and everything. I
also want to pick the tool that's the easiest for this job. I heard
about BeautifulSoup and lxml for Python, but I don't know if those
modules would help.

Now, I know I'm not at the best place to ask if Python is the right
choice (anyway, even my little finger tells me it is), but if I can do
the same thing more simply with another tool, it would be good to know.

Another argument for the other tools is that I know how to use the
Unix find program to find the files and feed them to grep or sed, but
I still don't know the syntax for doing that with Python (fetch files,
change them, then write them), and I don't know if I should read the
files and treat them as a whole or just line by line. Of course I
could mix commands with some Python: pipe the find command to my
program's standard input, and my program's standard output to the
original file. But how do I control STDIN and STDOUT with Python?

Sorry if that's a lot of questions in one, and I will probably get a
lot of RTFM (which I'm doing, btw), but I feel a little lost in all
that right now.

Any help would be really appreciated.
Thanks

Jay Loden · Jul 13, 2007

I thought about doing that with Python (which I'm in the process of
learning), but maybe another tool (like sed?) would be better suited
for this job.

Generally speaking, in my experience, the best tool for the job is the one you know how to use.

There are of course places where certain tools are very well suited - e.g. Perl when it comes to regular expressions and text processing. BUT, the time it will take you to learn Perl would be better spent getting the work done in Python or sed/awk etc. Similarly, maintaining a script in a language you don't know well will introduce headaches later. In short, you're almost always best off using the tool you are most comfortable with.

I kind of know generally what I need to do:

That's usually a good start.

1- Find all HTML files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Apply some regular expressions recursively to the file to do the
things I want (delete certain tags and certain attributes when it
encounters them).
4- Write the changed file, and go through all the files like that.

This is one valid approach. There are a lot of things that you can do to help define your problem better though. For instance:

* Are the files matching a predefined template of some kind?
Can you use this to help define some of your processing rules?

* Do you know what kind of regular expressions you are going to need?
For that matter, are you even comfortable using regular expressions?
From the sound of your post, you may not have experience with them, so
that's going to be a hurdle to overcome when it comes to using them.

* Regular expressions are one approach to the problem. However, they
may not be the most maintainable or practical, depending on the actual
requirements. An HTML or XML processing module might be a better option,
particularly if the HTML Tidied pages are valid XHTML.

* Define your program requirements in smaller more specific terms, e.g.
"need to remove all of the following tags: <font>, <center>" or
"need to clean orphaned/invalid tags" - this will help you define
the actual problem statement better and makes it easier to see what
the best solution is. Are you just looking to strip all the HTML from
some files? Perhaps lynx/links with the --dump option is all you need,
as opposed to a full HTML parsing script.
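
To make the "HTML processing module" option above a bit more concrete, here is a minimal sketch using `html.parser` from the Python 3 standard library (this thread predates Python 3, so take it as an illustration, not as what anyone here actually ran). The `DROP_TAGS` and `DROP_ATTRS` sets and the names `PresentationStripper` and `strip_presentation` are made up for the example, and it makes no attempt to decide which tables are layout tables.

```python
import html
from html.parser import HTMLParser

# Assumed lists, for illustration only; adjust them to your own requirements.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "bgcolor", "border", "color", "face", "size", "style"}


class PresentationStripper(HTMLParser):
    """Re-emit the document, skipping presentational tags and attributes."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closing):
        if tag in DROP_TAGS:
            return  # drop the tag itself; its contents are still emitted
        kept = "".join(
            f" {name}" if value is None else f' {name}="{html.escape(value)}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, "")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " /")  # XHTML-style <br />

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def strip_presentation(html_text):
    """Return html_text without the tags and attributes listed above."""
    parser = PresentationStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For instance, `strip_presentation('<p align="center"><font size="2">Hi &amp; bye</font><br /></p>')` returns `<p>Hi &amp; bye<br /></p>`. Compared with regular expressions, a parser copes with tags and attributes that span several lines, which is one reason it tends to be easier to maintain.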

But I don't know how to do it for real: the syntax and everything. I
also want to pick the tool that's the easiest for this job. I heard
about BeautifulSoup and lxml for Python, but I don't know if those
modules would help.

See above about defining the problem statement. If you get it pinned down to a finite set of requirements

Now, I know I'm not at the best place to ask if Python is the right
choice (anyway, even my little finger tells me it is), but if I can do
the same thing more simply with another tool, it would be good to know.

If all you've got is a hammer, everything looks like a nail - it's important to not be so dogmatic about one programming language or tool of any kind that you can't see when there's a much more efficient solution available. However, should you end up determining that what is needed is a good all-purpose scripting/programming language, I'm sure you'll find Python plenty capable and this list quite helpful in conquering any problems along the way.

Another argument for the other tools is that I know how to use the
Unix find program to find the files and feed them to grep or sed, but
I still don't know the syntax for doing that with Python (fetch files,
change them, then write them), and I don't know if I should read the
files and treat them as a whole or just line by line. Of course I
could mix commands with some Python: pipe the find command to my
program's standard input, and my program's standard output to the
original file. But how do I control STDIN and STDOUT with Python?

Either approach is perfectly valid should you end up using Python; you can either feed a list of filenames to Python on the command line, write a recursive directory reading function that will get the filenames, or control STDOUT/STDIN. Again see my first point about defining a problem statement, and then you can Google for example code to help you. The Python Cookbook is often enormously helpful as well, since you can find sample code for manipulating STDIN/STDOUT, reading a directory recursively, and handling command line arguments. But, it's important to know which one you want before you can search for it...
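
As a rough sketch of the approaches just mentioned (walking directories recursively, rewriting each file, or acting as a STDIN/STDOUT filter), something like the following should work in Python 3. The function names `find_html_files`, `rewrite_in_place`, and `filter_stream` are hypothetical, as is the UTF-8 default encoding; `transform` stands for whatever text-to-text cleanup function you end up writing.

```python
import os
import sys


def find_html_files(root):
    """Yield the path of every .html/.htm file under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)


def rewrite_in_place(path, transform, encoding="utf-8"):
    """Read the whole file, apply transform(text) -> text, and write it back."""
    # newline="" keeps the file's original line endings untouched
    with open(path, encoding=encoding, newline="") as handle:
        text = handle.read()
    new_text = transform(text)
    if new_text != text:  # only rewrite files that actually changed
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(new_text)


def filter_stream(transform, src=None, dst=None):
    """Filter mode: read all of src (default STDIN), write the result to dst (default STDOUT)."""
    src = sys.stdin if src is None else src
    dst = sys.stdout if dst is None else dst
    dst.write(transform(src.read()))
```

A loop such as `for path in find_html_files("."): rewrite_in_place(path, transform)` covers steps 1, 2, and 4 of the plan. Reading each file as a whole rather than line by line is the safer choice here, since an HTML tag can span several lines. Work on a copy of the files first, since the rewrite is destructive.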

Sorry if that's a lot of questions in one, and I will probably get a
lot of RTFM (which I'm doing, btw), but I feel a little lost in all
that right now.

Reading the manual is excellent and important, but it won't always help you with feeling overwhelmed. The best thing to do is break a big problem into little problems and work on those so they don't seem so insurmountable. (You may be detecting a pattern to the advice I'm giving by now).

HTH,

-Jay

Eric_Dexter · Jul 13, 2007

You might find a text editor is the way to go. You can use AutoIt,
either through Python or by itself, to control the text editor you
use. I just downloaded PSPad and it looks like it will do that. It
may be a pain to script, though.

http://sourceforge.net/projects/dex-tracker/

Eric_Dexter · Jul 13, 2007

Let me add to that: it may be a pain to script with AutoIt, and I am not
giving more of an example because it won't insert a text file at a
location like mdipad will.

Stefan Behnel · Jul 14, 2007

1- Find all HTML files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Apply some regular expressions recursively to the file to do the
things I want (delete certain tags and certain attributes when it
encounters them).
4- Write the changed file, and go through all the files like that.

Use the lxml.html.clean module, which is made exactly for that purpose. It's
not released yet, but you can use it from the current html branch of lxml.
There will soon be an official alpha of the 2.0 series, which will contain
lxml.html:

http://codespeak.net/svn/lxml/branch/html/

It looks like you're on Ubuntu, so compiling it from sources after an SVN
checkout should be as simple as the usual setup.py dance. Please report back
to the lxml mailing list if you find any problems or have any further ideas on
how to make it even more versatile than it already is.

For lxml in general, see:

http://codespeak.net/lxml/

Stefan

sebzzz · Jul 15, 2007

Thank you guys for all the good advice.

I'll be working on defining a clearer problem (I think this advice is
good for all areas of life).

I appreciate the help. The Python community looks really open to
learners and beginners, and I hope to be helping people myself before
too long (well, after a reasonably long time to learn the theory and
mature with it, of course) ;-)
