Removing tags with BeautifulSoup

sebzzz · Aug 8, 2007

Hi,

I'm in the process of cleaning some HTML files with BeautifulSoup and
I want to remove all traces of the tables. Here is the bit of code
that deals with tables:

def remove(soup, tagname):
    for tag in soup.findAll(tagname):
        contents = tag.contents
        parent = tag.parent
        tag.extract()
        for tag in contents:
            parent.append(tag)

remove(soup, "table")
remove(soup, "tr")
remove(soup, "td")

It works fine but leaves an empty table structure at the end of the
soup. Like:

<table>
<tr>
<td></td>
</tr>

<tr>
<td></td>
</tr>

<tr>
...

And the extract method of BeautifulSoup seems to extract only what is
inside the tags.

So I'm just looking for a quick and dirty way to remove this table
structure at the end of the documents. I'm thinking of using re, but
there must be a way to do it with BeautifulSoup; maybe I'm missing
something.
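
For what it's worth, here is a minimal sketch of one way this could be done inside BeautifulSoup itself. It assumes the newer bs4 package rather than the BeautifulSoup 3 API used above, and relies on its Tag.unwrap() method, which replaces a tag with its own children in place instead of appending them to the end of the parent. The helper name unwrap_all and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def unwrap_all(soup, tagname):
    """Replace every <tagname> element with its children, in place."""
    # find_all returns a plain list, so modifying the tree while
    # looping over it does not disturb the iteration.
    for tag in soup.find_all(tagname):
        tag.unwrap()


# Hypothetical sample document, not taken from the original files.
html = (
    "<div><table><tr><td>one</td></tr>"
    "<tr><td>two</td></tr></table><p>after</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

for name in ("table", "tr", "td"):
    unwrap_all(soup, name)

result = str(soup)
print(result)
```

If the cell contents should go as well, Tag.decompose() on each table would drop the whole structure instead of keeping the children.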

Another thing that makes me wonder. This code:

for script in soup("script"):
    soup.script.extract()

Works fine and removes the script tags, but:

for table in soup("table"):
    soup.table.extract()

Raises AttributeError: 'NoneType' object has no attribute 'extract'

Oh, and BTW, when I extract script tags this way, the whole tag is gone,
like I want; it doesn't remove only the content of the tag.
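
A likely explanation for the difference, sketched under the assumption that some tables in the documents are nested (script tags normally are not): soup("table") lists every table, inner ones included, while soup.table.extract() always removes the first table still in the soup together with everything nested inside it. The soup can therefore run out of tables before the loop does, and soup.table becomes None. The snippet below uses the newer bs4 package and invented sample HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one table nested inside another.
html = (
    "<table><tr><td>"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

tables = soup("table")   # same as find_all: outer and nested table
print(len(tables))

soup.table.extract()     # removes the outer table and the nested one with it
first_after = soup.table
print(first_after)       # nothing left, so another soup.table.extract() fails

# Acting on the loop variable avoids the problem.
soup = BeautifulSoup(html, "html.parser")
for table in soup("table"):
    table.extract()
remaining = soup.find("table")
```

Extracting the loop variable instead of soup.table works whether or not the tables are nested, because each iteration acts on a tag that is known to exist.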

Thanks in advance
