Removing certain tags from html files

sebzzz · Jul 27, 2007

Hi,

I'm doing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (HTML Tidy warper for python).

Essentially what it does is fetch all the html files in a given
directory (and it's subdirectories) clean the code with Tidy (removes
deprecated tags, change the output to be xhtml) and than BeautifulSoup
removes a couple of things that I don't want in the files (Because I'm
stripping the files to bare bone, just keeping layout information).

Finally, I want to remove all trace of layout tables (because the new
layout will be in css for positioning). Now, there is tables to layout
things on the page and tables to represent tabular data, but I think
it would be too hard to make a script that finds out the difference.

My question, since I'm quite new to python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in it. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Is re the good module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would it be a good idea?

Now, I'm quite new to the concept of regular expressions, but would it
ressemble something like this: re.compile("<table.*>")?

Thanks for the help.

Marc 'BlackJack' Rintsch · Jul 27, 2007

My question, since I'm quite new to python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in it. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Than take a hold on the content and add it to the parent. Somthing like
this should work:

from BeautifulSoup import BeautifulSoup

def remove(soup, tagname):
for tag in soup.findAll(tagname):
contents = tag.contents
parent = tag.parent
tag.extract()
for tag in contents:
parent.append(tag)

def main():
source = '<a><b>This is a <c>Test</c></b></a>'
soup = BeautifulSoup(source)
print soup
remove(soup, 'b')
print soup

Is re the good module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would it be a good idea?

No regular expressions are not a very good idea. They get very
complicated very quickly while often still miss some corner cases.

Ciao,
Marc 'BlackJack' Rintsch

sebzzz · Jul 27, 2007

Than take a hold on the content and add it to the parent. Somthing like
this should work:

from BeautifulSoup import BeautifulSoup

def remove(soup, tagname):
for tag in soup.findAll(tagname):
contents = tag.contents
parent = tag.parent
tag.extract()
for tag in contents:
parent.append(tag)

def main():
source = '<a><b>This is a <c>Test</c></b></a>'
soup = BeautifulSoup(source)
print soup
remove(soup, 'b')
print soup

No regular expressions are not a very good idea. They get very
complicated very quickly while often still miss some corner cases.

Thanks a lot for that.

It's true that regular expressions could give me headaches (especially
to find where the tag ends).

Stefan Behnel · Jul 28, 2007

I'm doing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (HTML Tidy warper for python).

Essentially what it does is fetch all the html files in a given
directory (and it's subdirectories) clean the code with Tidy (removes
deprecated tags, change the output to be xhtml) and than BeautifulSoup
removes a couple of things that I don't want in the files (Because I'm
stripping the files to bare bone, just keeping layout information).

Finally, I want to remove all trace of layout tables (because the new
layout will be in css for positioning). Now, there is tables to layout
things on the page and tables to represent tabular data, but I think
it would be too hard to make a script that finds out the difference.

My question, since I'm quite new to python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in it. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Use lxml.html. Honestly, you can't have HTML cleanup simpler than that.

It's not released yet (lxml is, but lxml.html is just close), but you can
build it from an SVN branch:

http://codespeak.net/svn/lxml/branch/html/

Looks like you're on Linux, so that's a simple run of setup.py.

Then, use the dedicated "clean" module for your job. See the "Cleaning up
HTML" section in the docs for some examples:

http://codespeak.net/svn/lxml/branch/html/doc/lxmlhtml.txt

and the docstring of the Cleaner class to see all the available options:

http://codespeak.net/svn/lxml/branch/html/src/lxml/html/clean.py

In case you still prefer BeautifulSoup for parsing (just in case you're not
dealing with HTML-like pages, but just with real tag soup), you can also use
the ElementSoup parser:

http://codespeak.net/svn/lxml/branch/html/src/lxml/html/ElementSoup.py

but lxml is generally quite good in dealing with broken HTML already.

Have fun,
Stefan

Removing tags with BeautifulSoup	0	Aug 8, 2007
strip away html tags from extracted links	2	Nov 29, 2013
Removing html tags	2	Aug 25, 2009
Removing empty tags	2	Feb 24, 2011
how to make a tree with randomly selected html tags from an array in python?	0	Mar 10, 2013
I need help in understanding these files on my phone, Could someone help me understand these files? Urgent help needed. Please help.	1	Jun 4, 2023
Right tool and method to strip off html files (python, sed, awk?)	5	Jul 13, 2007
Fetching data from a HTML file	4	Mar 23, 2012

Removing certain tags from html files

sebzzz

Marc 'BlackJack' Rintsch

sebzzz

Stefan Behnel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads