Parsing HTML

C Gillespie

Dear All,

I have what is hopefully a very simple problem. I wish to parse an HTML page
and extract everything between the <body> tags.

E.g.
<head>
<body>
<b>afsdf</b>
</body>
</head>

Would give
<body>
<b>afsdf</b>
</body>

I've been playing about with htmllib with no success. Any suggestions?

Thanks

Colin
 
William Park

C Gillespie said:
I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

1. Take a look at
http://freshmeat.net/projects/bashdiff/
and if you want to give it a try, I'll give you some pointers.
Essentially,
x=()
array -p '<body>' -q '</body>' x "..."

2. In Python, read the whole thing as a string. Delete everything before
'<body>' and everything after '</body>'.

3. Use your editor. :)
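
Option 2 could look roughly like the following in Python. This is only a sketch: the helper name keep_body is made up for illustration, and it assumes a lowercase '<body>' tag with no attributes, exactly as in the example from the question.

```python
def keep_body(page):
    """Drop everything before '<body>' and everything after '</body>'."""
    # partition() splits around the first match; an empty separator in the
    # result means the tag was not found.
    _, open_tag, rest = page.partition('<body>')
    inner, close_tag, _ = rest.partition('</body>')
    if not open_tag or not close_tag:
        return ''  # assumption: return an empty string when a tag is missing
    return open_tag + inner + close_tag


page = '<head>\n<body>\n<b>afsdf</b>\n</body>\n</head>\n'
print(keep_body(page))
```

Real pages often write the tag as '<BODY>' or add attributes to it, so treat this as a starting point only.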
 
Leif K-Brooks

C said:
I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

People are actually suggesting using DOM for this?! A simple approach is
much better:

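def get_body(html):
    # '<body' (no closing '>') also matches tags that carry attributes.
    body_start = html.find('<body')
    body_end = html.find('</body>', body_start)
    if body_start == -1 or body_end == -1:
        return ''  # no complete body element found
    return html[body_start:body_end + len('</body>')]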
 
wes weston

C said:
I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

#--------------------------------------------------------------------------
def TokenizeHTML( s ):
    #return a list containing two types of tokens:
    # 1. html tokens starting with '<' and ending with '>'
    # 2. strings between '>' and '<'
    #(text and tokens replace the original names str and list, which
    # shadowed Python builtins)
    state = 0
    htmlStr = ""
    text = ""
    tokens = []
    for ch in s:
        if state == 0: #initial state; detection state
            if ch == '<':
                state = 1
                htmlStr += ch
            else:
                state = 2
                text += ch
        elif state == 1: #html state; in a <> pair
            htmlStr += ch
            if ch == '>':
                state = 0
                tokens.append(htmlStr)
                htmlStr = ""
        elif state == 2: #non html state; not in a <> pair
            if ch == '<':
                state = 1
                tokens.append(text)
                text = ""
                htmlStr = "<"
            else:
                text += ch
    if len(text) > 0:
        tokens.append(text)
    return tokens
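
One way to use such a token list for the original question is sketched below. To keep the example self-contained it builds the tokens with a regular expression rather than the function above (an assumption made for brevity; the token shapes are meant to be the same), and the names tokenize and body_from_tokens are hypothetical.

```python
import re


def tokenize(s):
    # Stand-in tokenizer: either a '<...>' tag or a run of text without '<'.
    return re.findall(r'<[^>]*>|[^<]+', s)


def body_from_tokens(tokens):
    """Join the tokens from the opening body tag through '</body>'."""
    collected = []
    inside = False
    for tok in tokens:
        lowered = tok.lower()  # compare case-insensitively, keep original text
        if lowered.startswith('<body'):
            inside = True
        if inside:
            collected.append(tok)
            if lowered.startswith('</body'):
                break
    return ''.join(collected)


html = '<head>\n<BODY class="x">\n<b>afsdf</b>\n</body>\n</head>'
print(body_from_tokens(tokenize(html)))
```

Matching on the lowercased token handles upper or mixed case tags, but a '<body' that appears inside a comment would still confuse it.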
 
Richard Brodie

People are actually suggesting using DOM for this?! A simple approach is
much better:

"For every complex problem, there is a solution that is simple ... and wrong"
Yes, it will work, some of the time. However, it doesn't handle the following
properly (there are probably others).

1. Comments.
2. CDATA sections.
3. White space.
4. Mixed or upper case.

The advantage of using a proper parser is that it caters for these sorts of
things, and you only have to get it right once. OTOH, these advantages are
largely negated if you can't be sure your input HTML is valid. What works best for
you depends on what you are using it for.
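
For comparison, here is a rough sketch of the parser route using html.parser from the Python 3 standard library (the htmllib module mentioned in the question no longer exists in Python 3). The class name BodyExtractor and the helper extract_body are invented for this example, and the code rebuilds the body from parser events rather than slicing the original text, so whitespace inside closing tags is normalised.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Rebuild the markup found from <body> through </body>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body':  # tag names arrive lowercased
            self.in_body = True
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self.in_body:  # self-closing tags such as <br/>
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.in_body:
            self.parts.append('</%s>' % tag)
        if tag == 'body':
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append('&#%s;' % name)


def extract_body(html):
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


sample = '<html><!-- <body> in a comment --><BODY >\n<b>a &amp; b</b>\n</Body></html>'
print(extract_body(sample))
```

The '<body>' inside the comment is ignored and the mixed-case tags are still recognised. Comments that sit inside the body are dropped by this sketch, since handle_comment is not overridden.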
 
Thomas Guettler

On Thu, 08 Jul 2004 17:04:24 +0100, C Gillespie wrote:
I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

HTML can be broken in many ways. If you want
a solution that can read most of the HTML on the
web, you can run it through tidy and have it output XML.

XML is much easier to handle with SAX/DOM.
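
A minimal sketch of the DOM half of that idea follows, using xml.dom.minidom from the standard library. It does not run tidy itself: the string tidied is a hand-written stand-in for the kind of well-formed XHTML tidy is expected to produce, and the helper name body_xml is hypothetical.

```python
from xml.dom import minidom

# Assumption: this stands in for tidy's XML/XHTML output; invoking tidy
# is outside the scope of the sketch.
tidied = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>demo</title></head>'
    '<body><b>afsdf</b></body>'
    '</html>'
)


def body_xml(xhtml):
    """Return the serialized <body> element, or '' if there is none."""
    doc = minidom.parseString(xhtml)
    bodies = doc.getElementsByTagName('body')
    return bodies[0].toxml() if bodies else ''


print(body_xml(tidied))
```

minidom raises an exception on input that is not well-formed XML, which is why the clean-up step has to come first.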

Regards,
Thomas
 
