How to find <tag> to </tag> HTML strings and 'save' them?

mark · Mar 25, 2007

Hi All,

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[

I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>

Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.

From where I would then like to 'diff' the results to see if they
match.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?

I've tried all sorts but I've been learning Python for 1 week and just
don't know enough to mod example scripts, it seems. Don't even get me
started on the Python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me.

Thanks in advance,

Mark.

Michael Bentley · Mar 25, 2007

don't even get me
started on python docs.. ayaa ;]

ok, try getting started with this then:
http://www.crummy.com/software/BeautifulSoup/

Jorge Godoy · Mar 25, 2007

Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.

mark · Mar 25, 2007

Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief:
................................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
................................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
for the comma I tried:
soup.find(text=",").replaceWith("")

but that throws this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'

Again working with the 'Removing Elements' example I tried:
soup = BeautifulSoup("you are a banana, banana, banana")
a = str(",")
comments = soup.findAll(text=",")
[",".extract() for "," in comments]
But if I'm doing 'import BeautifulSoup' this gives me the following error:
soup = BeautifulSoup("you are a banana, banana, banana")
TypeError: 'module' object is not callable
and "import beautifulSoup from BeautifulSoup" does nothing.

Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?

Thanks in advance!

Mark.
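
A note on the TypeError above. 'import X' binds the name to the module, while 'from X import X' binds it to the class inside that module, which is why the working script uses 'from BeautifulSoup import BeautifulSoup'. The sketch below uses the standard library's datetime module as a stand-in (an assumption for illustration only, chosen because it has the same module-and-class naming and runs without BeautifulSoup installed):

```python
import datetime  # binds the name to the module, not the class

try:
    datetime(2007, 3, 25)  # calling a module object fails
except TypeError as err:
    message = str(err)  # mentions that the module is "not callable"

from datetime import datetime  # now the name refers to the class

stamp = datetime(2007, 3, 25)  # works: the class is callable
print(message)
print(stamp)
```

The exact wording of the error message can differ between Python versions, so the sketch only relies on it being a TypeError.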

Gabriel Genellina · Mar 26, 2007

On Sun, 25 Mar 2007 19:44:17 -0300, mark wrote:
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
...............................

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

findAll() returns a list. You convert the list to its string
representation using str(...), and that's what lists look like: with
[] around, and commas separating elements. If you don't like that, don't
use str(some_list).
Do you want one item per line? Use "\n".join(myTagSearch) (remember to
remove the str() around findAll; join() expects strings, so each tag may
need converting with str() first).
Do you want comma-separated items? Use ",".join(myTagSearch)
Read about lists here: http://docs.python.org/lib/typesseq.html and strings
here: http://docs.python.org/lib/string-methods.html
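
The difference between str(some_list) and join() can be seen without BeautifulSoup at all. In this minimal sketch, plain strings stand in for the tags that findAll() returns; the sample data and variable names are made up for illustration:

```python
# Plain strings stand in for the tags returned by findAll() (hypothetical data).
tags = [
    '<h2 class="r"><a href="http://www.someplace.com/">Go Someplace</a></h2>',
    '<h2 class="r"><a href="http://www.example.com/">Go Elsewhere</a></h2>',
]

# str() on a list gives the list's own representation: brackets and commas.
as_list_repr = str(tags)

# join() glues the items together with the separator only.
# str(t) converts each item first, since join() accepts only strings.
one_per_line = "\n".join(str(t) for t in tags)

print(as_list_repr)
print(one_per_line)
```

With real tags the same pattern should apply: drop the outer str() and join the individually converted items instead.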

For the remaining questions, I strongly suggest reading the Python
Tutorial (or any other book like Dive into Python). You should grasp some
basic knowledge of the language at least, before trying to use other tools
like BeautifulSoup; it's too much for a single step.

Mark Crowther · Mar 26, 2007

Yep, I agree! Once I've got this done I'll be back to trawling the
tutorials.
Life never gives you the convenience of learning something fully
before having to apply what you have learnt ;]

Thanks for the feedback and links, I'll be sure to check those out.

Mark.

John Nagle · Mar 26, 2007

mark said:
Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief:
...............................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
...............................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

Ah. What you want is more like this:

from BeautifulSoup import BeautifulSoup  # same import as in the original script

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2': True, 'H2': True})  # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags:  # for each H2 tag
    texts = htag.findAll(text=True)  # find all text items within this h2
    s = ' '.join(texts).strip() + '\n'  # combine text items into clean string
    myFile.write(s)  # write the text from each H2 element on its own line

myFile.close()

John Nagle
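
For the comparison step mark described at the start of the thread, one option is the standard library's difflib. Nobody in the thread suggested it, so treat this as an assumption rather than the thread's answer. A minimal sketch, with two made-up lists standing in for the lines of the 'reference results' and 'test results' files:

```python
import difflib

# Hypothetical heading texts, one per line, as a results file might hold them.
reference = ["Go Someplace\n", "Tomato Soup\n", "Cream of Tomato\n"]
test = ["Go Someplace\n", "Tomato Soup\n", "Cream of Mushroom\n"]

# unified_diff yields no lines at all when the two sequences are identical.
diff_lines = list(
    difflib.unified_diff(reference, test, fromfile="reference", tofile="test")
)
matches = not diff_lines

print("".join(diff_lines) if diff_lines else "Results match")
```

With real files, the two lists could come from calling readlines() on each results file.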

Max Erickson · Mar 26, 2007

John Nagle said:
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

Have you been bitten by this? When I read this, I was operating under
the assumption that BeautifulSoup wasn't case sensitive, and then I
tried this:

So I am a little curious.

max
