You may also want to look at this stream editor:
http://cheeseshop.python.org/pypi/SE/2.2 beta
It allows multiple replacements in a definition format of utmost simplicity:
<div><p><em>"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"</em></p>
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"
-- Peter Norvig, <a class="reference"
Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):
Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
"<a class\="reference"=" # *** This deletes the fragment
# "-- Peter Norvig, <a class\="reference"=" # Or like this if Peter Norvig has to go too
''')
"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"
-- Peter Norvig,
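
For readers who do not have SE installed, the same idea can be approximated with the standard library. This is only a rough sketch in Python 3 using `re`, not SE's own mechanism: the names `RULES` and `strip_tags` and the shortened sample are made up for illustration. Because the rules run one after another here, the comment rule has to come before the tag rule, otherwise a tag nested in a comment would be cut out first and leave half a comment behind.

```python
import re

# Ordered (pattern, replacement) rules; an empty replacement deletes the match.
RULES = [
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),  # comments first, nested tags go with them
    (re.compile(r"<.*?>", re.DOTALL), ""),       # every complete tag
    (re.compile(r'<a class="reference"'), ""),   # the leftover tag fragment
]

def strip_tags(text):
    """Apply each rule in turn and return the edited text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# Shortened, hypothetical sample with a comment and an unterminated tag.
sample = ('<div><p><em>"Python"</em></p><!-- a <b>nested</b> tag -->\n'
          '-- Peter Norvig, <a class="reference"')
print(strip_tags(sample))
```

The unterminated `<a class="reference"` survives the tag rule because it has no closing `>`, which is why it needs a literal rule of its own.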
The remaining ampersand codes such as &quot; you can either translate or delete:
Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
"<a class\="reference"=" # This deletes the fragment
# "-- Peter Norvig, <a class=\\"reference\\"=" # Or like this if Peter Norvig has to go too
htm2iso.se # This is a file (contained in the SE package) that translates all ampersand codes.
# Naming the file is all you need to do to include the replacements which it defines.
''')
'Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
'
-- Peter Norvig,
If instead of "htm2iso.se" you write "&quot;=" you delete it and your output will be:
Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
-- Peter Norvig,
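
The choice between translating and deleting can also be sketched with the standard library alone. This assumes Python 3 and assumes the quotation marks in the page source were encoded as `&quot;`; `html.unescape` stands in for what the htm2iso.se file does and is not part of SE.

```python
import html

text = "&quot;Python has been an important part of Google&quot;"

translated = html.unescape(text)      # translate: &quot; becomes "
deleted = text.replace("&quot;", "")  # delete: drop the entity altogether

print(translated)
print(deleted)
```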
Your Tag_Stripper also does files:
'my_file_without_tags'
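
How SE itself is called on files is not shown above, so the following is only a standard-library sketch of the same job in Python 3. The helper `strip_file` and the file names are hypothetical; it returns the output name, much like the result shown above.

```python
import os
import re
import tempfile

TAG = re.compile(r"<.*?>", re.DOTALL)

def strip_file(src_path, dst_path):
    """Read src_path, delete all complete tags, write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(TAG.sub("", text))
    return dst_path

# Demonstration on a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "my_file.htm")
dst = os.path.join(workdir, "my_file_without_tags")
with open(src, "w", encoding="utf-8") as f:
    f.write("<p>Hello, <em>world</em></p>")
print(strip_file(src, dst))
```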
A stream editor is not a substitute for a parser. It does, however, handle simple translation jobs like this one more
economically than a parser, which does a lot of work that you don't need.
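
For comparison, here is roughly what the parser route looks like with Python 3's standard `html.parser`. The class name `TextExtractor` and the shortened sample are made up for illustration. The parser tokenizes the whole document and understands tags, attributes and entities, which is more machinery than plain tag deletion needs, but it is the safer choice once the markup gets irregular.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &quot; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

extractor = TextExtractor()
extractor.feed('<div><p><em>&quot;Python&quot;</em></p></div>')
extractor.close()
text = "".join(extractor.chunks)
print(text)
```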
Regards
Frederic
----- Original Message -----
From: "DH" <[email protected]>
Newsgroups: comp.lang.python
To: <[email protected]>
Sent: Thursday, August 24, 2006 7:41 PM
Subject: Re: Taking data from a text file to parse html page