How to find <tag> to </tag> HTML strings and 'save' them?

M

mark

Hi All,

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[

I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>

Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.
From where I would then like to 'diff' the results to see if they
match.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?

I've tried allsorts but I've been learning Python for 1 week and just
don't know enough to mod example scripts it seems. don't even get me
started on python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)

Thanks in advance,

Mark.
 
J

Jorge Godoy

Hi All,

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[

I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>

Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.
From where I would then like to 'diff' the results to see if they
match.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?

I've tried allsorts but I've been learning Python for 1 week and just
don't know enough to mod example scripts it seems. don't even get me
started on python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)

Thanks in advance,

Mark.


Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.
 
M

mark

Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief :) :
................................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
................................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
for the comma I tried:
soup.find(text=",").replaceWith("")

but that throws this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'

Again working with the 'Removing Elements' example I tried:
soup = BeautifulSoup("you are a banana, banana, banana")
a = str(",")
comments = soup.findAll(text=",")
[",".extract() for "," in comments]
But if I'm doing 'import beautifulSoup' this give me a "soup =
BeautifulSoup("you are a banana, banana, banana")
TypeError: 'module' object is not callable" error, "import
beautifulSoup from BeautifulSoup" does nothing

Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?

Thanks in advance!

Mark.

...................

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[
I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>
Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?
I've tried allsorts but I've been learning Python for 1 week and just
don't know enough to mod example scripts it seems. don't even get me
started on python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)
Thanks in advance,

Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.
 
G

Gabriel Genellina

from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
...............................

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

findAll() returns a list. You convert the list to its string
representation, using str(...), and that's the way lists look like: with
[] around, and commas separating elements. If you don't like that, don't
use str(some_list).
Do you like an item by line? Use "\n".join(myTagSearch) (remember to strip
the str() around findAll)
Do you like comma separated items? Use ",".join(myTagSearch)
Read about lists here http://docs.python.org/lib/typesseq.html and strings
here http://docs.python.org/lib/string-methods.html

For the remaining questions, I strongly suggest reading the Python
Tutorial (or any other book like Dive into Python). You should grasp some
basic knowledge of the language at least, before trying to use other tools
like BeautifulSoup; it's too much for a single step.
 
M

Mark Crowther

Yep, I agree! once I've got this done I'll be back to trawling the
tutorials.
Life never gives you the convenience of learning something fully
before having to apply what you have learnt ;]

Thanks for the feedback and links, I'll be sure to check those out.

Mark.

En Sun, 25 Mar 2007 19:44:17 -0300, <[email protected]> escribió:




from BeautifulSoup import BeautifulSoup
import re
page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
myTagSearch = str(soup.findAll('h2'))
myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()
del myTagSearch
...............................
Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

findAll() returns a list. You convert the list to its string
representation, using str(...), and that's the way lists look like: with
[] around, and commas separating elements. If you don't like that, don't
use str(some_list).
Do you like an item by line? Use "\n".join(myTagSearch) (remember to strip
the str() around findAll)
Do you like comma separated items? Use ",".join(myTagSearch)
Read about lists herehttp://docs.python.org/lib/typesseq.htmland strings
herehttp://docs.python.org/lib/string-methods.html

For the remaining questions, I strongly suggest reading the Python
Tutorial (or any other book like Dive into Python). You should grasp some
basic knowledge of the language at least, before trying to use other tools
like BeautifulSoup; it's too much for a single step.
 
J

John Nagle

Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief :) :
...............................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
...............................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

Ah. What you want is more like this:

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags : # for each H2 tag
texts = htag.findAll(text=True) # find all text items within this h2
s = ' '.join(texts).strip() + '\n' # combine text items into clean string
myFile.write(s) # write each text from an H2 element on a line.

myFile.close()

John Nagle
 
M

Max Erickson

John Nagle said:
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags,
both cases

Have you been bitten by this? When I read this, I was operating under
the assumption that BeautifulSoup wasn't case sensitive, and then I
tried this:

So I am a little curious.


max
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,013
Latest member
KatriceSwa

Latest Threads

Top