HTMLParser error

J

jonbutler88

I'm writing a simple website spider in Python and keep getting this
error, and I'm not sure what to do. The problem seems to be in the
feed() method of HTMLParser.

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: Spider instance has no attribute 'rawdata'

Any ideas on how to fix this? I'm using Python 2.5.2 on Mac OS X.
 
J

jonbutler88

In the absence of minimal runnable code reproducing the error
message...

Did you remember to INITIALIZE the attribute to a null value
somewhere prior to that statement?
--
        Wulfraed        Dennis Lee Bieber               KD6MOG
        (e-mail address removed)              (e-mail address removed)
                HTTP://wlfraed.home.netcom.com/
        (Bestiaria Support Staff:               (e-mail address removed))
                HTTP://www.bestiaria.com/

It's not a variable I set; it's one of HTMLParser's built-in
attributes. I am using it with urlopen to get the source of a website
and feed it to HTMLParser.

    def parse(self, page):
        try:
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

This is the code I am using. I have tested the other modules and they
work fine, but I haven't got a clue how to fix this one.
 
A

alex23

You're not providing enough information. Try to post a minimal code
fragment that demonstrates your error; it gives us all a common basis
for discussion.

Is your Spider class a subclass of HTMLParser? Is it over-riding
__init__? If so, is it doing something like:

    super(Spider, self).__init__()

If this is your issue, looking at the HTMLParser code you could get
away with just doing the following in __init__:

    self.reset()

This appears to be the function that adds the .rawdata attribute.

Ideally, you should use the former super() syntax...you're less
reliant on the implementation of HTMLParser that way.

- alex23
 
A

alex23

Is your Spider class a subclass of HTMLParser? Is it over-riding
__init__? If so, is it doing something like:

    super(Spider, self).__init__()

If this is your issue[...]

I'm sorry, this really wasn't clear at all. What I meant was that you
need to call the HTMLParser.__init__ inside your Spider.__init__ in
order to have it initialise properly. Failing to do so would lead to
the .rawdata attribute not being defined. The super() function is the
best way to achieve this.
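
As a rough illustration of the point above, here is a minimal sketch
written for Python 3, where the module is named html.parser rather
than HTMLParser. The class names BrokenSpider and WorkingSpider and
the helper try_feed are made up for this example and are not taken
from the poster's code.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class BrokenSpider(HTMLParser):
    def __init__(self):
        # The parent __init__ is never called, so rawdata is never created.
        self.queue = []


class WorkingSpider(HTMLParser):
    def __init__(self):
        super().__init__()  # let HTMLParser set up its own state first
        self.queue = []

    def handle_starttag(self, tag, attrs):
        # Collect every href on an <a> tag, wherever it appears in attrs.
        if tag == "a":
            self.queue.extend(value for name, value in attrs if name == "href")


def try_feed(parser, html):
    """Return None on success, or the AttributeError message on failure."""
    try:
        parser.feed(html)
    except AttributeError as exc:
        return str(exc)
    return None
```

Feeding a page to BrokenSpider should fail with an AttributeError that
mentions rawdata, while WorkingSpider parses the same page normally.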

Sorry for the rambling, hopefully some of that is relevant.

- alex23
 
J

jonbutler88

Sorry, I'm new to both Python and newsgroups, so this is all pretty
confusing. So I need a line in the __init__ function of my class? The
Spider class I made inherits from HTMLParser. It's only the feed()
function that produces errors, though; the rest seems to work fine.

Thanks for the help,
Jon
 
A

alex23

Let me repeat: it would make this a lot easier if you would paste
actual code.

As you say, your Spider class inherits from HTMLParser, so you need to
make sure that you set it up correctly so that the HTMLParser
functionality you've inherited will work correctly (or work as you
want it to work). If you've added your own __init__ to Spider, then
the __init__ on HTMLParser is no longer called unless you *explicitly*
call it yourself.

Unfortunately, my earlier advice wasn't totally correct... HTMLParser
is an old-style object, whereas super() only works for new-style
objects, I believe. (If you don't know about old- v new-style objects,
see http://docs.python.org/ref/node33.html). So there are a couple of
approaches that should work for you:

    class SpiderBroken(HTMLParser):
        def __init__(self):
            pass # don't do any ancestral setup

    class SpiderOldStyle(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)

    class SpiderNewStyle(HTMLParser, object):
        def __init__(self):
            super(SpiderNewStyle, self).__init__()

Python 2.5.1 (r251:54863, May  1 2007, 17:47:05) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> html = open('temp.html','r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\lib\HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: SpiderBroken instance has no attribute 'rawdata'

The old-style version is probably easiest, so putting this line in
your __init__ should fix your issue:

    HTMLParser.__init__(self)

If this still isn't clear, please let me know.

- alex23
 
J

jonbutler88

OK, here's what I have so far:

#!/usr/bin/env python
from HTMLParser import HTMLParser
from urllib2 import urlopen, HTTPError

class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.found = []
        self.queue = []

    def handle_starttag(self, tag, attrs):
        try:
            if tag == 'a':
                if attrs[0][0] == 'href':
                    self.queue.append(attrs[0][1])
        except HTMLParseError:
            print 'Error parsing HTML tags'

    def parse(self, page):
        try:
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

    def crawl(self, site):
        self.queue.append(site)
        while 1:
            try:
                url = self.queue.pop(0)
                self.parse(url)
            except IndexError:
                break
            self.found.append(url)
        return self.found

if __name__ == '__main__':
    s = Spider()
    site = raw_input("What site would you like to scan? http://")
    s.crawl(site)

Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1107, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1064, in do_open
    h = http_class(host) # will parse host:port
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 639, in __init__
    self._set_hostport(host, port)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 651, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?

Thanks
 
A

alex23

Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
[...snip...]
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Okay. What I did was put some output in your Spider.parse method:

    def parse(self, page):
        try:
            print 'http://' + page
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

And here's the output:

    >python spider.py
    What site would you like to scan? http://www.google.com
    http://www.google.com
    http://http://images.google.com.au/imghp?hl=en&tab=wi

The links you're finding on each page already have the protocol
specified. I'd remove the 'http://' addition from parse, and just add
it to 'site' in the main section.

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        site = 'http://' + site
        s.crawl(site)

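
A related sketch, assuming Python 3 (where urljoin lives in
urllib.parse): instead of gluing 'http://' onto every link, each href
can be resolved against the page it was found on, which leaves
absolute links untouched and completes relative ones. The helper names
with_scheme and absolute_url are hypothetical and not part of the
spider above.

```python
from urllib.parse import urljoin


def with_scheme(site):
    """Prepend http:// only when the user typed a bare host name."""
    return site if "://" in site else "http://" + site


def absolute_url(base, link):
    """Resolve link against base; absolute links come back unchanged."""
    return urljoin(base, link)
```

With these, a link such as
http://images.google.com.au/imghp?hl=en&tab=wi found on
http://www.google.com/ would be queued as-is rather than picking up a
second 'http://' prefix.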
Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?

You haven't overwritten Spider.__init__. What you're doing every time
you create a Spider object is first get HTMLParser to initialise it as
it would any other HTMLParser object - which is what adds the .rawdata
attribute to each HTMLParser instance - *and then* doing the Spider-
specific initialisation you need.

Here's an abbreviated copy of the actual HTMLParser class featuring
only its __init__ and reset methods:

    class HTMLParser(markupbase.ParserBase):
        def __init__(self):
            """Initialize and reset this instance."""
            self.reset()

        def reset(self):
            """Reset this instance.  Loses all unprocessed data."""
            self.rawdata = ''
            self.lasttag = '???'
            self.interesting = interesting_normal
            markupbase.ParserBase.reset(self)

When you initialise an instance of HTMLParser, it calls its reset
method, which sets rawdata to an empty string, or adds it to the
instance if it doesn't already exist. So when you call
HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
method on the Spider instance, which it inherits from HTMLParser...
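
The same chain of calls can be seen in a stripped-down model. This is
only a sketch with stand-in classes (Base, Child and ForgetfulChild
are invented names, not the real HTMLParser), written so that it runs
on Python 3:

```python
class Base:
    def __init__(self):
        self.reset()  # same pattern as HTMLParser.__init__

    def reset(self):
        # The attribute is created on whatever instance self refers to.
        self.rawdata = ""


class Child(Base):
    def __init__(self):
        Base.__init__(self)  # runs the inherited reset on this Child
        self.queue = []      # then the subclass adds its own state


class ForgetfulChild(Base):
    def __init__(self):
        self.queue = []  # Base.__init__ never runs, so rawdata is missing
```

A Child instance ends up with both rawdata and queue, whereas a
ForgetfulChild instance only has queue, which is the situation that
produced the original AttributeError.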

Are you familiar with object oriented design at all? If you're not,
let me know and I'll track down some decent intro docs. Inheritance is
a pretty fundamental concept but I don't think I'm doing it justice.
 
J

jonbutler88

Nope, this is my first experience with object oriented programming.
I've only been learning Python for a few weeks, but it seemed simple
enough to inspire me to be a bit ambitious. If you could hook me up
with some good docs that would be great. I was about to buy a book on
Python, specifically OO based, but I'll look at these docs first. I
understand most of the concepts of inheritance; I've just never used
them before.

Thanks
 
A

alex23

Ah, okay, I'm really sorry, if I'd known I would've tried to explain
things a little differently :)

Mark Pilgrim's Dive Into Python is a really good place to start:
http://www.diveintopython.org/toc/index.html

For a quick overview of object oriented programming in Python, try:
http://www.freenetpages.co.uk/hp/alan.gauld/
Specifically: http://www.freenetpages.co.uk/hp/alan.gauld/tutclass.htm

But don't hesitate to ask questions here or even contact me privately
if you'd prefer.
 
J

jjbutler88

Thanks for the help, and sorry for the delayed reply. I flew out to
Detroit yesterday and the wifi here is rubbish. I will definitely get
reading Dive Into Python, and the other article cleared a lot up for
me. Hopefully I won't have these errors any more; if I keep getting
them I'll get in touch.

Cheers
 
