HTMLParser not parsing whole html file

josh logan

Hello,

I wanted to use Python to scrape an HTML file for score data, but I'm
having trouble.
I'm using HTMLParser, and the parsing seems to fizzle out around line
192 or so. After that point, none of the event methods are called
(handle_starttag, handle_endtag, etc.), and I don't understand why,
because it is an HTML page of over 1000 lines.

Could someone tell me if this is a bug or simply a misunderstanding on
how HTMLParser works? I'd really appreciate some help in
understanding.

I am using Python 3.1.2 on Windows 7 (hopefully shouldn't matter).

I put the HTML file on pastebin, because I couldn't think of anywhere
better to put it:
http://pastebin.com/wu6Pky2W

The source code has been pared down to the simplest form to exhibit
the problem. It is displayed below, and is also on pastebin for
download (http://pastebin.com/HxwRTqrr):


import sys
import re
import os.path
import itertools as it
import urllib.request
from html.parser import HTMLParser
import operator as op


base_url = 'http://www.dci.org'

class TestParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Report where each start tag was found, plus its attributes.
        print('position {}, starting tag {} with attrs {}'.format(
            self.getpos(), tag, attrs))

    def handle_endtag(self, tag):
        print('ending tag {}'.format(tag))


def do_parsing_from_file_stream(fname):
    parser = TestParser()

    with open(fname) as f:
        # Feed the file to the parser one line at a time.
        for num, line in enumerate(f, start=1):
            # print('Sending line {} through parser'.format(num))
            parser.feed(line)


if __name__ == '__main__':
    do_parsing_from_file_stream(sys.argv[1])
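
One way to pin down where the events stop is to record the position of the last tag event and compare it with the length of the input. The following is only a minimal sketch: the names LastEventParser and last_event_position are hypothetical helpers, not part of the original script, and the sample string stands in for the real file.

```python
from html.parser import HTMLParser


class LastEventParser(HTMLParser):
    """Remembers where the most recent tag event fired (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.last_pos = None  # (line, column) of the last start/end tag seen
        self.events = 0       # number of start/end tag events so far

    def handle_starttag(self, tag, attrs):
        self.last_pos = self.getpos()
        self.events += 1

    def handle_endtag(self, tag):
        self.last_pos = self.getpos()
        self.events += 1


def last_event_position(text):
    """Parse text and return (position of last tag event, event count)."""
    parser = LastEventParser()
    parser.feed(text)
    parser.close()  # tell the parser that no more input is coming
    return parser.last_pos, parser.events


# Stand-in input; in practice this would be the contents of the HTML file.
sample = "<html>\n<body>\n<p>hi</p>\n</body>\n</html>\n"
pos, count = last_event_position(sample)
print("last tag event at", pos, "after", count, "events;",
      "input has", sample.count("\n"), "lines")
```

If the reported line is far short of the end of the file, the markup just after that position is the place to look.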
 
josh logan

Sorry, the group doesn't like how I surrounded the Python code's
pastebin URL with parentheses:

http://pastebin.com/HxwRTqrr
 
josh logan

I found the error. The HTML file I'm parsing has invalid HTML at line
193. It has something like:

<a href="mystuff "class = "stuff">

Note there is no space between the closing quote of the "href"
attribute value and the "class" attribute. I guess I'll go through each
file and correct these issues as I parse them.

Thanks for reading, anyway.
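
A rough sketch of that kind of pre-cleaning, for the specific error described above, might look like the following. The regular expression is a heuristic, not a general HTML repair tool: it assumes each quoted attribute value sits on a single line, and it could also touch similar-looking text outside of tags. The names fix_attr_spacing and TagCollector are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Heuristic: a quoted attribute value ("..." or '...') that is immediately
# followed by a letter is assumed to be missing the separating space.
_MISSING_SPACE = re.compile(r"""(=\s*(?:"[^"]*"|'[^']*'))(?=[A-Za-z])""")


def fix_attr_spacing(html_text):
    """Insert a space between a quoted value and the next attribute name."""
    return _MISSING_SPACE.sub(r"\1 ", html_text)


class TagCollector(HTMLParser):
    """Collects (tag, attribute dict) pairs for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


broken = '<a href="mystuff "class = "stuff">link</a>'
fixed = fix_attr_spacing(broken)

parser = TagCollector()
parser.feed(fixed)
parser.close()

print(fixed)
print(parser.tags)
```

Each line (or the whole file contents) can be passed through fix_attr_spacing before it is handed to parser.feed().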
 
Stefan Behnel

josh logan, 25.10.2010 04:14:
I found the error. The HTML file I'm parsing has invalid HTML at line
193. It has something like:

<a href="mystuff "class = "stuff">

Note there is no space between the closing quote for the "href" tag
and the class attribute. I guess I'll go through each file and correct
these issues as I parse them.

HTMLParser is not made to deal with non-HTML input. You can take a look at
lxml.html or BeautifulSoup (up to 3.0), which handle these problems a lot
better.

Stefan
 
John Nagle

You might try HTML5lib:

http://code.google.com/p/html5lib/

The HTML 5 spec formalizes the concept of "bad HTML". Really. There's
a specified way to parse the most common HTML errors. Most browsers
are far more tolerant of bad HTML than they should be, and not in a
consistent way. The HTML 5 spec tries to fix that.

I use BeautifulSoup, but it's being abandoned for the Python 3
transition.
"http://www.crummy.com/software/BeautifulSoup/3.1-problems.html"

John Nagle
 