How to extract a part of html file

Joe · Oct 20, 2005

I'm trying to extract part of the HTML code between two tags. The part
I want begins with <span class="boldyellow"><B><U> and ends with
TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>

I was thinking of using a regular expression, but I'm having a hard time
getting the desired string. I use

htmlSource = urllib.urlopen("http://address/")
s = htmlSource.read()
htmlSource.close()

to get the HTML into a string. Now I want to match string s from the
<span class tag to <img src="http://whatever/some.gif"> </TD></TR></TABLE>
and store that into a new string.
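
For anyone reading this on Python 3, where urllib.urlopen no longer exists, the fetch step might look roughly like the sketch below. The helper name fetch_html and the UTF-8 fallback are assumptions made for illustration, not part of the original post.

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the document at `url` as a str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes on Python 3
        # Use the charset from the Content-Type header when there is one;
        # the UTF-8 fallback is an assumption.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://address/") should then give the same kind of string s as the snippet above, with the response closed automatically by the with block.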

Thanks

Ben Finney · Oct 20, 2005

Joe said:
I'm trying to extract part of html code from a tag to a tag

For tag soup, use BeautifulSoup:

<URL:http://www.crummy.com/software/BeautifulSoup/>

Available as a package in Debian, probably other decent OSen also.

Mike Meyer · Oct 20, 2005

Ben Finney said:
For tag soup, use BeautifulSoup:
<URL:http://www.crummy.com/software/BeautifulSoup/>

Except he's trying to extract an apparently random part of the
file. BeautifulSoup is a wonderful thing for dealing with X/HTML
documents as structured documents, which is how you want to deal with
them most of the time.

In this case, an re works nicely:

String.find also works really well:

>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
' and ends with '

Not a lot to choose between them.
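
To compare the two approaches side by side, here is a minimal sketch of both, using the start and end markers from Joe's question. The function names are made up for illustration, and the regular expression is only one plausible pattern, not necessarily the one originally posted. Both versions return None when a marker is missing. The bare find() arithmetic above does not guard against that: find() returns -1 on failure, so start would silently point at the wrong place.

```python
import re

# Markers taken from the question.
START = '<span class="boldyellow"><B><U>'
STOP = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def extract_with_find(s, start_marker=START, stop_marker=STOP):
    """Return the text between the markers, or None if either is absent."""
    begin = s.find(start_marker)
    if begin == -1:
        return None
    begin += len(start_marker)  # skip past the start marker itself
    end = s.find(stop_marker, begin)  # only look after the start marker
    if end == -1:
        return None
    return s[begin:end]


def extract_with_re(s, start_marker=START, stop_marker=STOP):
    """Same result using a regular expression."""
    # re.escape keeps characters such as '.' and '?' in the markers literal;
    # the non-greedy group stops at the first end marker; DOTALL lets the
    # match span multiple lines.
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(stop_marker)
    match = re.search(pattern, s, re.DOTALL)
    return match.group(1) if match else None
```

With s holding the page source, extract_with_find(s) and extract_with_re(s) should give the same substring, which fits the point that there is not much to choose between them.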

<mike

Joe · Oct 20, 2005

Thanks Mike, that is just what I was looking for. I have looked at
BeautifulSoup, but it doesn't really do what I want it to do; maybe I'm
just new to Python and don't exactly know what it is doing just yet.
However, string find works. Thanks
