Extract Title from HTML documents

Nickolay Kolev · Nov 4, 2004

Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

p = MyParser()
p.feed(file('test.html').read())
p.close()
print p.title.strip()

---------------- END SCRIPT -------------------------

Many thanks in advance!

Best regards,
Nickolay Kolev

Anakim Border · Nov 4, 2004

You may find BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)
useful.

from BeautifulSoup import BeautifulSoup
b = BeautifulSoup()
b.feed(file('test.html').read())
print b.first('title').renderContents()

HTH

Mike Meyer · Nov 5, 2004

Nickolay Kolev said:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    inside_title = False
    title = ''

    def start_title(self, attrs):
        self.inside_title = True

    def end_title(self):
        self.inside_title = False

    def handle_data(self, data):
        if self.inside_title and data:
            self.title = self.title + data + ' '

I'm pretty sure the trailing "+ ' '" is wrong. At least I never needed
it when I was using sgmllib for this kind of thing.

<mike
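One way to sidestep the question of the extra space is to collect the text chunks in a list, join them unchanged, and normalize whitespace once at the end. If the parser happens to deliver the title in several pieces, appending ' ' after each piece would put a space in the middle of a word. The following is only a minimal sketch, assuming Python 3, where sgmllib is no longer available and html.parser.HTMLParser fills a similar role. TitleParser and extract_title are hypothetical names, not something from the posts above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.parts = []  # raw text chunks, joined later

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.inside_title = False

    def handle_data(self, data):
        # Text may arrive in more than one call, so only accumulate here.
        if self.inside_title:
            self.parts.append(data)

    @property
    def title(self):
        # Join chunks as-is, then collapse runs of whitespace.
        return " ".join("".join(self.parts).split())


def extract_title(chunks):
    """Feed an iterable of HTML string chunks and return the title text."""
    parser = TitleParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.title


# The title is deliberately split across two chunks.
print(extract_title(["<html><head><title>Hel", "lo  World</title></head></html>"]))
```

Joining with '' and splitting on whitespace afterwards keeps words intact however the input is chunked, and it also removes the need for the final strip().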

Max M · Nov 5, 2004

Nickolay said:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

You only need one tag here, so using a regex is OK.

import re

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
match = linkPattern.search(source)
if match is None:
    result = ''
else:
    result = match.group(0)

If you need more than just the title I would definitely go with
BeautifulSoup.

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

Mike Meyer · Nov 5, 2004

Max M said:
You only need one tag here, so using a regex is OK.

linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
                                           ^^^^^^^
Shouldn't that be </title>?

<mike
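For reference, a corrected version of the regex approach might look like the sketch below. It assumes Python 3 and that the whole document is already in a string. TITLE_RE and extract_title_re are hypothetical names chosen for illustration. The pattern closes on </title> and returns only the captured text between the tags, where group(0) would return the tags as well.

```python
import re

# Non-greedy match between the opening and closing title tags.
# re.I handles <TITLE>, re.S lets the title span several lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.I | re.S)


def extract_title_re(source):
    """Return the title text, or '' when no <title> element is found."""
    match = TITLE_RE.search(source)
    if match is None:
        return ""
    # group(1) is the text between the tags, without the tags themselves.
    return match.group(1).strip()


print(extract_title_re("<html><head><TITLE> My Page </TITLE></head><body>hi</body></html>"))
```

This is fine for a quick one-off, but it will also match a <title> that appears inside a comment or a script, so a real parser is the safer choice for untrusted input.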

Walter Dörwald · Nov 5, 2004

Nickolay said:
Hi all,

I am looking for a way to extract the titles of HTML documents. I have
made an honest attempt at doing it, and it even works. Is there an
easier (faster / more efficient / clearer) way?

You might try XIST (http://www.livinglogic.de/Python/xist):
---
from ll.xist import parsers, xfind
from ll.xist.ns import html

e = parsers.parseFile("test.html", tidy=True)
print unicode(xfind.first(e//html.title))
