BeautifulSoup

yamamoto · Jan 13, 2010

Hi,
I am new to Python. I'd like to extract "a" tag from a website by
using "beautifulsoup" module.
but it doesnt work!

# sample.py

from BeautifulSoup import BeautifulSoup as bs
import urllib
url="http://www.d-addicts.com/forum/torrents.php"
doc=urllib.urlopen(url).read()
soup=bs(doc)
result=soup.findAll("a")
for i in result:
    print i

Traceback (most recent call last):
  File "C:\Users\falcon\workspace\p\pyqt\ex1.py", line 8, in <module>
    soup=bs(doc)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python26\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36

any suggestion?
thanks in advance

Peter Otten · Jan 13, 2010

When BeautifulSoup encounters an error that it cannot fix, the first thing
you need is a better error message:

from BeautifulSoup import BeautifulSoup as bs
import urllib
import HTMLParser

url = "http://www.d-addicts.com/forum/torrents.php"
doc = urllib.urlopen(url).read()

#doc = doc.replace("\>", "/>")

try:
    soup=bs(doc)
except HTMLParser.HTMLParseError as e:
    # show the offending source line with a caret under the reported column
    lines = doc.splitlines(True)
    print lines[e.lineno-1].rstrip()
    print " " * e.offset + "^"
else:
    # parsing succeeded: print every <a> tag
    result = soup.findAll("a")
    for i in result:
        print i
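
The pointer logic in the except branch can be tried without fetching the page. Here is a minimal sketch, assuming a hypothetical helper name `point_at` (not part of BeautifulSoup or HTMLParser) and assuming the line number is 1-based and the column 0-based, which is how the script above treats `e.lineno` and `e.offset`:

```python
def point_at(doc, lineno, offset):
    """Return the offending line plus a caret line marking the column.

    Assumption: lineno is 1-based and offset is 0-based, matching how the
    script above uses e.lineno and e.offset.
    """
    lines = doc.splitlines(True)
    line = lines[lineno - 1].rstrip()
    return line + "\n" + " " * offset + "^"


# Made-up sample document with a "\>" tag ending on line 2.
sample = "<html>\n<a href='x' \\>link</a>\n</html>\n"
print(point_at(sample, 2, 12))
```

The caret should land under the backslash of the broken tag ending, which is the kind of hint that suggests the manual fix.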

Once you know the origin of the problem you can devise a manual fix. Here
you could uncomment the line

doc = doc.replace("\>", "/>")

Keep in mind though that what fixes this broken document may break another
(valid) one.
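
To illustrate that caveat, here is a rough sketch comparing the blanket replace with a narrower regular expression. The function name `fix_backslash_tag_ends` and the sample string are made up for the example. The regex is still only a heuristic (an attribute value containing "\>" would also be rewritten), so treat it as one possible approach rather than a general repair:

```python
import re

# Matches "\>" only when an opening "<" precedes it with no other "<" or ">"
# in between, i.e. when it looks like the end of a tag.
BROKEN_TAG_END = re.compile(r"(<[^<>]*?)\\>")


def fix_backslash_tag_ends(doc):
    """Rewrite '\\>' to '/>' at tag ends only, leaving text content alone."""
    return BROKEN_TAG_END.sub(r"\1/>", doc)


# Hypothetical input: "\>" appears once in ordinary text and once in a tag.
doc = "<p>prompt C:\\> dir</p><br \\>"

print(doc.replace("\\>", "/>"))     # blanket replace also changes the text
print(fix_backslash_tag_ends(doc))  # only the <br \> tag is changed
```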

Peter

Rolando Espinoza La Fuente · Jan 13, 2010

Hi,

You can also check out a high-level framework for scraping:
http://scrapy.org/

Their docs include an example of extracting torrent data from mininova:
http://doc.scrapy.org/intro/overview.html

You will need to understand regular expressions, XPath expressions,
callbacks, etc.
The FAQ explains how Scrapy compares to BeautifulSoup:
http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-beautifulsoul-or-lxml

Regards,

Phlip · Jan 15, 2010

John said:
It's just somebody pirating movies. Ineptly. Ignore.

Anyone who leaves their movies hanging out in <a> tags, without a daily download
limit or a daily hashtag, deserves to be taught a lesson!

John Nagle · Jan 15, 2010

It's just somebody pirating movies. Ineptly. Ignore.

John Nagle

John Bokma · Jan 15, 2010

yamamoto said:
Hi,
I am new to Python. I'd like to extract "a" tag from a website by
using "beautifulsoup" module.
but it doesnt work!
[..]

check_for_whole_start_tag
self.error("malformed start tag")
File "C:\Python26\lib\HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36

any suggestion?

I guess you're using 3.1.0. If yes, see:
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

You might want to do:

sudo easy_install -U "BeautifulSoup==3.0.7a"

and try again.
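
Before downgrading, it may be worth checking which version is actually installed. This is a small sketch, assuming the 3.x module is importable as `BeautifulSoup` (as in the original script) and exposes `__version__`. The helper name `is_31_series` is made up for the example:

```python
def is_31_series(version):
    """True if a version string such as "3.1.0.1" belongs to the 3.1 series."""
    return version.split(".")[:2] == ["3", "1"]


try:
    import BeautifulSoup  # module name used by the 3.x releases
    installed = getattr(BeautifulSoup, "__version__", None)
except ImportError:
    installed = None  # not installed under that name

if installed is None:
    print("BeautifulSoup 3.x module not found")
elif is_31_series(installed):
    print("3.1 series installed: " + installed)
else:
    print("installed version: " + installed)
```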

John Bokma · Jan 15, 2010

John Nagle said:
It's just somebody pirating movies. Ineptly. Ignore.

Wow, what a childish reply. You should've followed your own advice and
ignored the OP instead of replying with a top post + full quote (!).

John Bokma · Jan 15, 2010

John Bokma said:
yamamoto said:

Hi,
I am new to Python. I'd like to extract "a" tag from a website by
using "beautifulsoup" module.
but it doesnt work!
[..]

check_for_whole_start_tag
self.error("malformed start tag")
File "C:\Python26\lib\HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36

any suggestion?

I guess you're using 3.1.0. If yes, see:
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

You might want to do:

sudo easy_install -U "BeautifulSoup==3.0.7a"

and try again.

Forgot to add, see also:
http://johnbokma.com/mexit/2009/09/26/python-downgrading-beatifulsoup.html

Phlip · Jan 15, 2010

John said:
Wow, what a childish reply. You should've followed your own advice and
ignored the OP instead of replying with a top post + full quote (!).

Mr Manners reminds the Gentle Poster(s) that...

A> as Google vs China shows, all programmers should resist hacking, no
matter how inept it may be, by any means necessary

B> John should not have attempted to leave a dead trail in the archives.
Searches for BeautifulSoup should always return answered questions.
