HTML data extraction?

Discussion in 'Python' started by Dave Kuhlman, Dec 22, 2003.

  1. Dave Kuhlman

    Dave Kuhlman Guest

    I recently read an article by Jon Udell about extracting data from
    Web pages as a poor person's Web services. So, I have a question:

    Is there any Python support for finding and extracting information
    from HTML documents?

    I'd like something that would do things like the following:

    - return the data which is inside a <b> tag which is inside a
    <li> tag.

    - return the data which is inside a <a> tag that has attribute
    href="http://www.python.org".

    - Etc.

    It would be a sort of structured grep for HTML.

    I've found the HTMLParser and htmllib modules in the Python
    standard library, but I'm wondering if there is anything at a
    higher level.

    Web searches did not turn up anything interesting.

    Thanks for help.

    Dave

    --
    http://www.rexx.com/~dkuhlman
    Dave Kuhlman, Dec 22, 2003
    #1

  2. djw

    djw Guest

    I don't know if there is anything at a higher level (I guess a Google
    session would tell you that), but doing what you describe with the
    HTMLParser module is very straightforward. All you have to do is keep
    some state flags in the derived HTMLParser class that indicate the
    found/not-found state of what you are looking for and have that control
    the collection of data between the flags.

    Starting with the example in the docs, and adding some (untested) additions:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def __init__( self ):
            HTMLParser.__init__( self )
            self.in_bold_tag = False
            self.in_list_tag = False
            self.data_in_bold_list = ''

        def handle_starttag(self, tag, attrs):
            print "Encountered the beginning of a %s tag" % tag
            if tag == 'b': self.in_bold_tag = True
            if tag == 'li' : self.in_list_tag = True

        def handle_endtag(self, tag):
            print "Encountered the end of a %s tag" % tag
            if tag == 'b': self.in_bold_tag = False
            if tag == 'li' : self.in_list_tag = False

        def handle_data( self, data ):
            if self.in_bold_tag and self.in_list_tag:
                self.data_in_bold_list = ''.join( [ self.data_in_bold_list,
                                                    data ] )

    This is just an outline, but you get the idea...
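
The state-flag idea described in this post can be sketched for both of the queries in the original question. This is only an illustration, assuming Python 3 (where the module is named html.parser) and markup that closes its <li>, <b> and <a> tags. The class names BoldInListParser and LinkTextParser and the helper run are hypothetical, not part of any library.

```python
from html.parser import HTMLParser


class BoldInListParser(HTMLParser):
    """Collect the text of every <b> element that sits inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0    # number of <li> elements currently open
        self.b_depth = 0     # number of <b> elements open inside an <li>
        self.results = []    # one string per matching <b> element
        self._chunks = []    # text pieces of the <b> being read

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "b" and self.li_depth:
            self.b_depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1
        elif tag == "b" and self.b_depth:
            self.b_depth -= 1
            if self.b_depth == 0:
                # The outermost matching <b> just closed: store its text.
                self.results.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self.b_depth:
            self._chunks.append(data)


class LinkTextParser(HTMLParser):
    """Collect the text of every <a> whose href equals the given URL."""

    def __init__(self, href):
        super().__init__()
        self.href = href
        self.in_match = False  # True while inside a matching <a>
        self.results = []
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a" and dict(attrs).get("href") == self.href:
            self.in_match = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_match:
            self.in_match = False
            self.results.append("".join(self._chunks))

    def handle_data(self, data):
        if self.in_match:
            self._chunks.append(data)


def run(parser, html):
    """Feed a complete document to the parser and return what it collected."""
    parser.feed(html)
    parser.close()
    return parser.results
```

For example, run(LinkTextParser("http://www.python.org"), page) returns the link text of each matching anchor in the string page. Counters are used instead of plain booleans so that nested lists do not switch the state off too early; pages that omit closing </li> tags would need extra handling.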

    -Don



    djw, Dec 22, 2003
    #2

  3. John J. Lee

    John J. Lee Guest

    [Sorry if this got posted twice, not sure what I did...]

    Dave Kuhlman writes:
    [...]
    > I'd like something that would do things like the following:
    >
    > - return the data which is inside a <b> tag which is inside a
    > <li> tag.
    >
    > - return the data which is inside a <a> tag that has attribute
    > href="http://www.python.org".
    >
    > - Etc.
    >
    > It would be a sort of structured grep for HTML.


    1. http://wwwsearch.sf.net/bits/pullparser.py

    It's a port of Perl's HTML::TokeParser.

    p = pullparser.PullParser(f)
    p.get_tag("li")
    p.get_tag("b")
    print p.get_text()


    p = pullparser.PullParser(f)
    for tag in p:
        tag = p.get_tag("a")
        if dict(tag.attrs).get("href") == "http://www.python.org":
            print p.get_text()

    I'll release a beta version in a day or so with a couple of minor
    changes (including that .get_text() will no longer raise
    NoMoreTagsError) and a proper tarball package.


    2. Stuff your data through mxTidy or uTidylib to get XHTML, then
    query it with XPath from PyXML.

    http://www.zvon.org/xxl/XPathTutorial/General/examples.html

    In fact, tidying HTML is sometimes necessary even if you don't need
    XHTML or a tree-based API.


    3. microdom

    http://www.xml.com/pub/a/2003/10/15/microdom.html

    Haven't used it myself.
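
The tidy-then-XPath route from option 2 can be sketched as follows. This is a rough illustration, not the PyXML API: it swaps in the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, and it assumes Python 3 and input that has already been tidied into well-formed XHTML carrying the usual XHTML namespace. The function names bold_in_list_items and link_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the tidied document declares the standard XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def bold_in_list_items(xhtml):
    """Return the text of every <b> found anywhere inside an <li>."""
    root = ET.fromstring(xhtml)
    matches = root.findall(".//x:li//x:b", XHTML_NS)
    # itertext() also picks up text in elements nested inside the <b>.
    return ["".join(b.itertext()) for b in matches]


def link_text(xhtml, href):
    """Return the text of every <a> whose href attribute equals href.

    The URL is pasted into the path expression, so this sketch assumes
    it contains no single quotes.
    """
    root = ET.fromstring(xhtml)
    path = ".//x:a[@href='%s']" % href
    return ["".join(a.itertext()) for a in root.findall(path, XHTML_NS)]
```

Both functions take the XHTML as a string. Documents that use named entities such as &nbsp; or that have no namespace declaration would need adjustments before ElementTree accepts them or the paths match.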


    John
    John J. Lee, Dec 22, 2003
    #3
