Extract Title from HTML documents

Nickolay Kolev

Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

p = MyParser()
p.feed(file('test.html').read())
p.close()
print p.title.strip()

---------------- END SCRIPT -------------------------


Many thanks in advance!

Best regards,
Nickolay Kolev
 
Mike Meyer

Nickolay Kolev said:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

I'm pretty sure the trailing "+ ' '" is wrong. At least I never needed
it when I was using sgmllib for this kind of thing.
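
To illustrate the point: the parser may hand the title text to handle_data in several pieces, so gluing a space onto each piece can split a word. What follows is only a minimal sketch for Python 3, where sgmllib is no longer shipped, so it assumes html.parser as a stand-in. The names TitleParser and extract_title are made up for this example and are not from the original script.

```python
# Python 3 sketch. html.parser is used here as a stand-in for sgmllib.
from html.parser import HTMLParser


class TitleParser(HTMLParser):  # hypothetical name, not from the thread
    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.done = False      # set once the first <title> has been closed
        self.chunks = []       # raw text pieces seen inside <title>

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.inside_title:
            self.inside_title = False
            self.done = True

    def handle_data(self, data):
        # May be called more than once for a single title; store pieces unchanged.
        if self.inside_title:
            self.chunks.append(data)

    @property
    def title(self):
        # Join with no separator, then collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())


def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Joining the pieces with an empty string and normalising whitespace afterwards gives the same result whether the text arrives in one call or in several, which is why no separator is added inside handle_data.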

<mike
 
Max M

Nickolay said:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

You only need one tag here, so using a regex is OK.

import re

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
match = linkPattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(0)

If you need more than just the title I would definitely go with
BeautifulSoup.

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
 
Mike Meyer

Max M said:
You only need one tag here, so using a regex is OK.

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
                                             ^^^^
Shouldn't that be </title>?
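
For reference, here is a minimal sketch of the regex approach with the closing tag changed to </title>, written for Python 3. The names TITLE_RE and title_from_source are invented for this example. It also assumes the page has a single, well-formed title element, which a regex cannot guarantee on arbitrary HTML.

```python
import re

# Non-greedy match up to the first closing </title>.
# IGNORECASE handles <TITLE>; DOTALL lets the title span several lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_source(source):  # hypothetical helper name
    match = TITLE_RE.search(source)
    if match is None:
        return ""
    # group(1) is the text between the tags; group(0) would include the tags.
    return match.group(1).strip()
```

Calling title_from_source(source) on the page text returns the title string, or an empty string when no title element is found.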

<mike
 
Walter Dörwald

Nickolay said:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

You might try XIST (http://www.livinglogic.de/Python/xist):
---
from ll.xist import parsers, xfind
from ll.xist.ns import html

e = parsers.parseFile("test.html", tidy=True)
print unicode(xfind.first(e//html.title))
 
