Use BeautifulSoup to delete certain tags while keeping their content

Jackie Wang

Dear all,

I have the following html code:

<td valign="top" headers="col1">
<font size="2">
Center Bank
<br />
Los Angeles, CA
</font>
</td>

<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>

How should I delete the 'font' tags while keeping the content inside?
Ideally I want to get:

<td valign="top" headers="col1">
Center Bank
<br />
Los Angeles, CA
</td>

<td valign="top" headers="col1">
Salisbury
Bank and Trust Company
<br />
Lakeville, CT
</td>

Thank you.

Jackie
 
Paul Boddie

Jackie said:
How should I delete the 'font' tags while keeping the content inside?

This sounds like an editing exercise, really. If you're comfortable
learning a new tool, I can recommend XSLT for this kind of job. Here's
the stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="font">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>

</xsl:stylesheet>

This just describes two things: firstly, that you want to recognise
font elements and to include their contents, not each element's start
and end tags; secondly, that all other parts of the document should be
copied.

You can apply stylesheets using a number of XSL processors. The
xsltproc program is usually available where libxslt is installed, and
although I'm sure others will be along to tell you all about their
favourite libraries and tools, here's how I use mine within Python:

# XSLTools: http://www.python.org/pypi/XSLTools
# libxml2dom: http://www.python.org/pypi/libxml2dom
import XSLTools.XSLOutput
import libxml2dom
# If s is the document text...
d = libxml2dom.parseString(s)
# Save the above stylesheet to a file somewhere, then...
proc = XSLTools.XSLOutput.Processor(["/tmp/no-font.xsl"])
# Get the result document
d2 = proc.get_result(d)

Anyway, this is just one option of many to deal with this kind of
problem.
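
For readers who don't have XSLTools installed, here is a rough sketch of applying the same stylesheet with lxml instead. This is an alternative, not the method used above. It assumes lxml is available and that the fragment is well-formed XML (real-world HTML may need an HTML parser first). The names STYLESHEET, SAMPLE and strip_font_xslt are made up for illustration.

```python
from lxml import etree

# The stylesheet from the post, inlined so the sketch is self-contained.
STYLESHEET = b"""<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="font">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

# Second cell from the question (nested font elements).
SAMPLE = """<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>"""


def strip_font_xslt(markup):
    """Return markup with every font element replaced by its contents."""
    transform = etree.XSLT(etree.fromstring(STYLESHEET))
    result = transform(etree.fromstring(markup))
    # Serialize the root element only, so no XML declaration is emitted.
    return etree.tostring(result.getroot(), encoding="unicode")


if __name__ == "__main__":
    print(strip_font_xslt(SAMPLE))
```

The identity template copies everything, and the more specific font template wins for font elements, so only their start and end tags disappear.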

Paul
 
Stefan Behnel

[fixing the subject appropriately]

Jackie said:
How should I delete the 'font' tags while keeping the content inside?

Amongst many other goodies for working with HTML, the Elements in lxml.html
have a ".drop_tag()" method specifically for that purpose.

http://codespeak.net/lxml/
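
A minimal sketch of how drop_tag() might be used for this, assuming lxml is installed. The sample cell is wrapped in a table and row here only so the HTML parser sees it in a valid context. SAMPLE_TABLE and strip_font_lxml are hypothetical names.

```python
import lxml.html

# Second cell from the question, wrapped in <table><tr> for the HTML parser.
SAMPLE_TABLE = """<table><tr>
<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>
</tr></table>"""


def strip_font_lxml(markup):
    """Drop every font tag but keep its text and child elements."""
    root = lxml.html.fromstring(markup)
    # Snapshot the matches first, since drop_tag() modifies the tree.
    for font in list(root.iter("font")):
        font.drop_tag()
    return lxml.html.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_font_lxml(SAMPLE_TABLE))
```

Note that lxml serializes as HTML here, so the br element comes back without the trailing slash.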

Stefan
 
John Nagle

Jackie said:
How should I delete the 'font' tags while keeping the content inside?

See the BeautifulSoup documentation. Find the font tags with findAll,
make a list, then go in and use "extract" and "replaceWith" appropriately.
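
As a rough illustration of that idea: the sketch below assumes the newer bs4 package (Beautiful Soup 4) rather than the version discussed in this thread. In bs4 the method is spelled find_all, and Tag.unwrap() does the "replace the tag with its contents" step in a single call. SAMPLE and strip_font_bs4 are made-up names.

```python
from bs4 import BeautifulSoup

# Second cell from the question (nested font tags).
SAMPLE = """<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>"""


def strip_font_bs4(markup):
    """Remove font tags while keeping whatever they contain."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list-like result, so unwrapping while looping is safe.
    for font in soup.find_all("font"):
        font.unwrap()  # replaces the tag with its own children
    return str(soup)


if __name__ == "__main__":
    print(strip_font_bs4(SAMPLE))
```

Nested font tags are handled because each one is unwrapped in turn and its children stay in the tree.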

John Nagle
 