Right tool and method to strip off html files (python, sed, awk?)

Discussion in 'Python' started by sebzzz@gmail.com, Jul 13, 2007.

  1. Guest

    Hi,

    I'm in the process of refactoring a lot of HTML documents, and I'm
    using HTML Tidy to do part of this work (clean up, convert to
    XHTML, and remove font and center tags).

    Now, Tidy will only do part of the work I need to do: I still have
    to remove all the presentational tags and attributes from the pages
    (strip them down, in other words), including the tables that are
    used purely for layout (how do I tell those apart?).

    I thought about doing that with Python (which I'm in the process of
    learning), but maybe another tool (like sed?) would be better
    suited for this job.

    I kind of know generally what I need to do:

    1- Find all HTML files in the folders (sub-folders ...)
    2- Do some file I/O and feed sed or Python or whatever else with the file.
    3- Apply some regular expressions recursively to the file to do the
    things I want (delete certain tags and attributes when it
    encounters them).
    4- Write the changed file, and go through all the files like that.
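    A minimal sketch of those four steps, using only Python's standard
    library (the "site" folder name and the identity transform are just
    placeholders for illustration):

```python
import os

def find_html_files(root):
    """Yield the path of every .html/.htm file under root (step 1)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)

def rewrite_in_place(path, transform):
    """Read a file, apply a transform function, write it back (steps 2-4)."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    cleaned = transform(text)
    with open(path, "w", encoding="utf-8") as f:
        f.write(cleaned)

if __name__ == "__main__":
    # transform would hold the actual tag-stripping logic (step 3);
    # the identity lambda here is a placeholder
    for path in find_html_files("site"):
        rewrite_in_place(path, lambda text: text)
```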

    But I don't know how to do it for real, the syntax and everything.
    I also want to pick the tool that's easiest for this job. I've
    heard about BeautifulSoup and lxml for Python, but I don't know if
    those modules would help.

    Now, I know this isn't the best place to ask whether Python is the
    right choice (anyway, even my little finger tells me it is), but if
    I can do the same thing more simply with another tool it would be
    good to know.

    Another argument for the other tools is that I know how to use the
    Unix find program to find the files and feed them to grep or sed,
    but I still don't know the syntax with Python (fetch files, change
    them, then write them), and I don't know whether I should read the
    files and treat them as a whole or just line by line. Of course I
    could mix commands with some Python: pipe find's output to my
    program's standard input, and my program's standard output back to
    the original file. But how do I control STDIN and STDOUT with
    Python?

    Sorry if that's a lot of questions in one, and I will probably get
    a lot of RTFM (which I'm doing, btw), but I feel a little lost in
    all of that right now.

    Any help would be really appreciated.
    Thanks
    , Jul 13, 2007
    #1

  2. Jay Loden Guest

    wrote:
    > I thought about doing that with Python (which I'm in the process of
    > learning), but maybe another tool (like sed?) would be better
    > suited for this job.


    Generally speaking, in my experience, the best tool for the job is the one you know how to use ;) There are of course places where certain tools are very well suited - e.g. Perl when it comes to regular expressions and text processing. BUT, the time it will take you to learn Perl would be better spent getting the work done in Python or sed/awk etc. Similarly, maintaining a script in a language you don't know well will introduce headaches later. In short, you're almost always best off using the tool you are most comfortable with.

    > I kind of know generally what I need to do:


    That's usually a good start ;)

    > 1- Find all HTML files in the folders (sub-folders ...)
    > 2- Do some file I/O and feed sed or Python or whatever else with the file.
    > 3- Apply some regular expressions recursively to the file to do the
    > things I want (delete certain tags and attributes when it
    > encounters them).
    > 4- Write the changed file, and go through all the files like that.


    This is one valid approach. There are a lot of things that you can do to help define your problem better though. For instance:

    * Are the files matching a predefined template of some kind?
    Can you use this to help define some of your processing rules?

    * Do you know what kind of regular expressions you are going to need?
    For that matter, are you even comfortable using regular expressions?
    From the sound of your post, you may not have experience with them, so
    that's going to be a hurdle to overcome when it comes to using them.

    * Regular expressions are one approach to the problem. However, they
    may not be the most maintainable or practical, depending on the actual
    requirements. An HTML or XML processing module might be a better option,
    particularly if the Tidy-processed pages are valid XHTML.

    * Define your program requirements in smaller more specific terms, e.g.
    "need to remove all of the following tags: <font>, <center>" or
    "need to clean orphaned/invalid tags" - this will help you define
    the actual problem statement better and makes it easier to see what
    the best solution is. Are you just looking to strip all the HTML from
    some files? Perhaps lynx/links with the -dump option is all you need,
    as opposed to a full HTML parsing script.

    > But I don't know how to do it for real, the syntax and everything.
    > I also want to pick the tool that's easiest for this job. I've
    > heard about BeautifulSoup and lxml for Python, but I don't know if
    > those modules would help.


    See above about defining the problem statement. If you get it pinned down to a finite set of requirements, you can take those smaller problems and determine if, for example, lxml is the right tool for the job. If you come back to the Python mailing list with a smaller problem, e.g. "how can I remove all <center> tags from HTML pages", you're much more likely to get a quick, practical, and useful answer to your question(s).
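    As a concrete illustration of that smaller problem, removing all
    <font> and <center> tags needs nothing beyond the standard
    library's HTMLParser. This is only a sketch, not a full solution:
    the tag set is an example, and comments and doctypes are dropped
    rather than preserved:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Re-emit HTML, dropping the named tags but keeping their content."""
    STRIP = {"font", "center"}  # example set of presentational tags

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.STRIP:
            # get_starttag_text() preserves the original attributes
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.STRIP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        # pass entities like &amp; through untouched
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

    The content inside the stripped tags survives; only the tags
    themselves disappear.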

    > Now, I know this isn't the best place to ask whether Python is the
    > right choice (anyway, even my little finger tells me it is), but if
    > I can do the same thing more simply with another tool it would be
    > good to know.


    If all you've got is a hammer, everything looks like a nail ;) - it's important to not be so dogmatic about one programming language or tool of any kind that you can't see when there's a much more efficient solution available. However, should you end up determining that what is needed is a good all-purpose scripting/programming language, I'm sure you'll find Python plenty capable and this list quite helpful in conquering any problems along the way.

    > Another argument for the other tools is that I know how to use the
    > Unix find program to find the files and feed them to grep or sed,
    > but I still don't know the syntax with Python (fetch files, change
    > them, then write them), and I don't know whether I should read the
    > files and treat them as a whole or just line by line. Of course I
    > could mix commands with some Python: pipe find's output to my
    > program's standard input, and my program's standard output back to
    > the original file. But how do I control STDIN and STDOUT with
    > Python?


    Any of these approaches is perfectly valid should you end up using Python: you can feed a list of filenames to Python on the command line, write a recursive directory-reading function that will get the filenames, or control STDOUT/STDIN. Again, see my first point about defining a problem statement, and then you can Google for example code to help you. The Python Cookbook is often enormously helpful as well, since you can find sample code for manipulating STDIN/STDOUT, reading a directory recursively, and handling command line arguments. But it's important to know which one you want before you can search for it...
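    For the STDIN/STDOUT route specifically, the sys module is all it
    takes. A minimal filter sketch (the transform here is only a
    placeholder for the real clean-up logic):

```python
import sys

def transform(text):
    # placeholder: the real tag-stripping logic would go here
    return text.replace("<center>", "").replace("</center>", "")

if __name__ == "__main__":
    # read the whole document from STDIN, write the result to STDOUT;
    # reading it whole rather than line by line is usually right for
    # HTML, since tags can span lines
    sys.stdout.write(transform(sys.stdin.read()))
```

    Saved as e.g. clean.py, it could sit at the end of a pipeline
    (`python clean.py < page.html > cleaned.html`).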

    > Sorry if that's a lot of questions in one, and I will probably get
    > a lot of RTFM (which I'm doing, btw), but I feel a little lost in
    > all of that right now.


    Reading the manual is excellent and important, but it won't always help you with feeling overwhelmed. The best thing to do is break a big problem into little problems and work on those so they don't seem so insurmountable. (You may be detecting a pattern to the advice I'm giving by now).

    HTH,

    -Jay
    Jay Loden, Jul 13, 2007
    #2

  3. Guest

    On Jul 13, 1:57 pm, wrote:
    > [original post quoted in full - snipped]

    You might find a text editor is the way to go. You can use AutoIt,
    either through Python or by itself, to control the text editor you
    use. I just downloaded PSPad and it looks like it will do that,
    though it may be a pain to script.

    http://sourceforge.net/projects/dex-tracker/
    , Jul 14, 2007
    #3
  4. Guest

    On Jul 13, 7:07 pm, "" <> wrote:
    > [previous reply and quoted post snipped]


    Let me add that it may be a pain to script with AutoIt, and I'm not
    giving a fuller example because it won't insert a text file at a
    location the way mdipad will.
    , Jul 14, 2007
    #4
  5. wrote:
    > 1- Find all HTML files in the folders (sub-folders ...)
    > 2- Do some file I/O and feed sed or Python or whatever else with the file.
    > 3- Apply some regular expressions recursively to the file to do the
    > things I want (delete certain tags and attributes when it
    > encounters them).
    > 4- Write the changed file, and go through all the files like that.


    Use the lxml.html.clean module, which is made exactly for that purpose. It's
    not released yet, but you can use it from the current html branch of lxml.
    There will soon be an official alpha of the 2.0 series, which will contain
    lxml.html:

    http://codespeak.net/svn/lxml/branch/html/

    It looks like you're on Ubuntu, so compiling it from sources after an SVN
    checkout should be as simple as the usual setup.py dance. Please report back
    to the lxml mailing list if you find any problems or have any further ideas on
    how to make it even more versatile than it already is.

    For lxml in general, see:

    http://codespeak.net/lxml/

    Stefan
    Stefan Behnel, Jul 14, 2007
    #5
  6. Guest

    Thank you guys for all the good advice.

    I'll be working on defining a clearer problem (I think this advice
    is good for all areas of life).

    I appreciate the help; the Python community looks really open to
    learners and beginners. I hope to be helping people myself before
    too long (well, after a reasonably long time to learn the theory
    and mature with it, of course) ;-)
    , Jul 16, 2007
    #6
