Parsing html

C Gillespie · Jul 8, 2004

Dear All,

I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

E.g.
<head>
<body>
afsdf
</body>
</head>

Would give
<body>
afsdf
</body>

I've been playing about with htmllib with no success. Any suggestions?

Thanks

Colin

William Park · Jul 8, 2004

C Gillespie said:
Dear All,

I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

E.g.
<head>
<body>
afsdf
</body>
</head>

Would give
<body>
afsdf
</body>

I've been playing about with htmllib with no success. Any suggestions?

Thanks

Colin

1. Take a look at
http://freshmeat.net/projects/bashdiff/
and if you want to give it a try, I'll give you some pointers.
Essentially,
x=()
array -p '<body>' -q '</body>' x "..."

2. In Python, read the whole thing as a string. Delete everything before
'<body>' and everything after '</body>'.

3. Use your editor.

Leif K-Brooks · Jul 8, 2004

C Gillespie said:
I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

People are actually suggesting using DOM for this?! A simple approach is
much better:

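def get_body(html):
    # Return the <body>...</body> span, or '' if either tag is missing.
    body_start = html.find('<body')
    body_end = html.find('</body>', body_start)
    if body_start == -1 or body_end == -1:
        return ''
    return html[body_start:body_end + len('</body>')]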

Lee Harr · Jul 8, 2004

Dear All,

I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

I have not used it yet,
but I hear that Beautiful Soup
works well:

http://www.crummy.com/software/BeautifulSoup/

wes weston · Jul 9, 2004

C Gillespie said:
Dear All,

I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

E.g.
<head>
<body>
afsdf
</body>
</head>

Would give
<body>
afsdf
</body>

I've been playing about with htmllib with no success. Any suggestions?

Thanks

Colin

#--------------------------------------------------------------------------
def TokenizeHTML( s ):
    #return a list containing two types of tokens:
    # 1. html tokens starting with '<' and ending with '>'
    # 2. strings between '>' and '<'
    state = 0
    htmlStr = ""
    text = ""      #renamed from 'str' to avoid shadowing the builtin
    tokens = []    #renamed from 'list' to avoid shadowing the builtin
    for ch in s:
        if state == 0: #initial state; detection state
            if ch == '<':
                state = 1
                htmlStr += ch
            else:
                state = 2
                text += ch
        elif state == 1: #html state; in a <> pair
            htmlStr += ch
            if ch == '>':
                state = 0
                tokens.append(htmlStr)
                htmlStr = ""
        elif state == 2: #non html state; not in a <> pair
            if ch == '<':
                state = 1
                tokens.append(text)
                text = ""
                htmlStr = "<"
            else:
                text += ch
    if len(text) > 0:
        tokens.append(text)
    return tokens
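
One way a token list like this could be used to pull out the body is sketched below. To keep the snippet self-contained, it uses a regular expression as a stand-in for TokenizeHTML (for well-formed input it should yield the same two kinds of tokens). The helper names tokenize and body_tokens are made up for this illustration and are not part of the original post.

```python
import re

def tokenize(s):
    # Stand-in for TokenizeHTML: tags ('<...>') and the text runs between them.
    return re.findall(r'<[^>]*>|[^<]+', s)

def body_tokens(tokens):
    # Return the tokens from the opening <body ...> tag through </body>,
    # or an empty list if either tag is missing.
    start = end = None
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if start is None and (low == '<body>' or low.startswith('<body ')):
            start = i
        elif start is not None and low.startswith('</body'):
            end = i
            break
    if start is None or end is None:
        return []
    return tokens[start:end + 1]

page = "<head>\n<body>\nafsdf\n</body>\n</head>"
print(''.join(body_tokens(tokenize(page))))
```

Because the tags are compared in lower case, this also copes with <BODY> and with attributes on the opening tag, though it still knows nothing about comments or CDATA.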

Richard Brodie · Jul 9, 2004

People are actually suggesting using DOM for this?! A simple approach is
much better:

"For every complex problem, there is a solution that is simple ... and wrong"
Yes, it will work, some of the time. However, it doesn't handle the following
properly (there are probably others).

1. Comments.
2. CDATA sections.
3. White space.
4. Mixed or upper case.

The advantage of using a proper parser is that it caters for these sorts of things,
and you only have to get it right once. OTOH, these advantages are largely
negated if you can't be sure your input HTML is valid. What works best for
you depends on what you are using it for.
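
As a rough illustration of the parser route, here is a minimal sketch using Python 3's standard-library html.parser module (a later relative of htmllib, not something used in the posts above). The names BodyExtractor and get_body_with_parser are made up for this example. It returns the markup between the body tags, and in this small test it copes with an upper-case tag, an attribute, and a comment that happens to contain "<body>"; it is not a claim about every case in the list.

```python
from html.parser import HTMLParser

class BodyExtractor(HTMLParser):
    """Collect the markup found between <body ...> and </body>."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body':              # tag names arrive lower-cased
            self.in_body = True
        elif self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False
        elif self.in_body:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append('&#%s;' % name)

def get_body_with_parser(html):
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

page = '<html><!-- <body> in a comment --><BODY class="x">\n<p>afsdf</p>\n</BODY></html>'
print(get_body_with_parser(page))
```

Comments inside the body are dropped here because handle_comment is left at its default; whether that is wanted depends on the use case.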

C Gillespie · Jul 9, 2004

Dear All,

Thanks for all the suggestions, much appreciated.

Colin

Thomas Guettler · Jul 9, 2004

On Thu, 08 Jul 2004 17:04:24 +0100, C Gillespie wrote:

Dear All,

I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

E.g.
<head>
<body>
afsdf
</body>
</head>

Would give
<body>
afsdf
</body>

I've been playing about with htmllib with no success. Any suggestions?

HTML can be broken in many ways. If you want
a solution which can read most of the HTML on the
web, you can use tidy and use XML as output.

XML is much easier to handle with SAX/DOM.
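
Assuming tidy has already been run and has produced well-formed XML, the DOM step could look something like the sketch below, written for Python 3's standard library. The sample string is only a stand-in for tidy's output (tidy itself is not invoked here), and get_body_xml is a name made up for this example.

```python
from xml.dom import minidom

# Stand-in for tidy's XML output; a real page would be read from a file.
xhtml = (
    '<html><head><title>t</title></head>'
    '<body><p>afsdf</p></body></html>'
)

def get_body_xml(xml_text):
    # Parse the document and serialise the first <body> element, tags included.
    doc = minidom.parseString(xml_text)
    bodies = doc.getElementsByTagName('body')
    if not bodies:
        return ''
    return bodies[0].toxml()

print(get_body_xml(xhtml))
```

Note that minidom raises an error on input that is not well-formed, which is why the tidy step comes first.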

Regards,
Thomas

Istvan Albert · Jul 9, 2004

You could use pyparsing too:

http://pyparsing.sourceforge.net/

i.
