Parsing HTML?

Discussion in 'Python' started by Benjamin, Apr 3, 2008.

  1. Benjamin

    Benjamin Guest

    I'm trying to parse an HTML file. I want to retrieve all of the text
    inside a certain tag that I find with XPath. The DOM seems to make
    this available with the innerHTML element, but I haven't found a way
    to do it in Python.
     
    Benjamin, Apr 3, 2008
    #1

  2. > I'm trying to parse an HTML file. I want to retrieve all of the text
    > inside a certain tag that I find with XPath. The DOM seems to make
    > this available with the innerHTML element, but I haven't found a way
    > to do it in Python.


    Have you tried http://www.google.com/search?q=python html parser ?

    HTH,
    Daniel
     
    Daniel Fetchinson, Apr 3, 2008
    #2

  3. Benjamin

    Guest

    BeautifulSoup does what I need it to. Though, I was hoping to find
    something that would let me work with the DOM the way JavaScript can
    work with web browsers' implementations of the DOM. Specifically, I'd
    like to be able to access the innerHTML element of a DOM element.
    Python's built-in HTMLParser is SAX-based, so I don't want to use
    that, and the minidom doesn't appear to implement this part of the
    DOM.
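
    For comparison, here is a minimal sketch of what the event-based route looks like. It uses the standard library's html.parser module (the Python 3 name of the HTMLParser module mentioned above). The TagTextCollector class and the text_inside helper are hypothetical names made up for this illustration, and the sketch matches elements by tag name only, not by XPath.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # how many target tags are currently open
        self.chunks = []  # text fragments seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested children is reported here as well.
        if self.depth:
            self.chunks.append(data)


def text_inside(html, tag):
    """Return the concatenated text inside all <tag> elements."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


sample = "<div><p>Hello <b>world</b></p><p>!</p></div>"
print(text_inside(sample, "p"))
```

    The parser only reports events, so the caller has to track nesting state by hand. A tree-based API avoids that bookkeeping.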

    On Wed, Apr 2, 2008 at 10:37 PM, Daniel Fetchinson
    <> wrote:
    > > I'm trying to parse an HTML file. I want to retrieve all of the text
    > > inside a certain tag that I find with XPath. The DOM seems to make
    > > this available with the innerHTML element, but I haven't found a way
    > > to do it in Python.

    >
    > Have you tried http://www.google.com/search?q=python html parser ?
    >
    > HTH,
    > Daniel
    >
     
    Benjamin, Apr 3, 2008
    #3
  4. Benjamin

    Paul Boddie Guest

    On 3 Apr, 06:59, Benjamin <> wrote:
    > I'm trying to parse an HTML file. I want to retrieve all of the text
    > inside a certain tag that I find with XPath. The DOM seems to make
    > this available with the innerHTML element, but I haven't found a way
    > to do it in Python.


    With libxml2dom you'd do the following:

    1. Parse the file using libxml2dom.parse with html set to a true
    value.
    2. Use the xpath method on the document to select the desired
    element.
    3. Use the toString method on the element to get the text of the
    element (including start and end tags), or the textContent
    property to get the text between the tags.

    See the Package Index page for more details:

    http://www.python.org/pypi/libxml2dom

    Paul
     
    Paul Boddie, Apr 3, 2008
    #4
  5. Benjamin

    7stud Guest

    On Apr 3, 12:39 am, wrote:
    > BeautifulSoup does what I need it to.  Though, I was hoping to find
    > something that would let me work with the DOM the way JavaScript can
    > work with web browsers' implementations of the DOM.  Specifically, I'd
    > like to be able to access the innerHTML element of a DOM element.
    > Python's built-in HTMLParser is SAX-based, so I don't want to use
    > that, and the minidom doesn't appear to implement this part of the
    > DOM.
    >


    innerHTML has never been part of the DOM. It is, however, a de facto
    browser standard. That's probably why you aren't having any luck
    using a Python module that implements the DOM.
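
    Since minidom has no innerHTML, a rough equivalent can be built from the standard DOM calls it does provide. The sketch below is only an approximation: inner_markup and text_content are hypothetical helper names, and minidom parses well-formed XML (for example XHTML), not arbitrary real-world HTML.

```python
from xml.dom import minidom


def inner_markup(element):
    """Serialize the children of an element, roughly what innerHTML gives."""
    return "".join(child.toxml() for child in element.childNodes)


def text_content(node):
    """Concatenate all descendant text, ignoring the markup."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


# Assumption: the input is well-formed XML; minidom rejects tag soup.
doc = minidom.parseString("<div><p>Hello <b>world</b>!</p></div>")
p = doc.getElementsByTagName("p")[0]

print(inner_markup(p))
print(text_content(p))
```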
     
    7stud, Apr 4, 2008
    #5
  6. Benjamin wrote:
    > I'm trying to parse an HTML file. I want to retrieve all of the text
    > inside a certain tag that I find with XPath. The DOM seems to make
    > this available with the innerHTML element, but I haven't found a way
    > to do it in Python.


    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")

    http://codespeak.net/lxml

    Stefan
     
    Stefan Behnel, Apr 7, 2008
    #6
  7. Benjamin

    Benjamin Guest

    On Apr 3, 9:10 pm, 7stud <> wrote:
    > On Apr 3, 12:39 am, wrote:
    >
    > > BeautifulSoup does what I need it to.  Though, I was hoping to find
    > > something that would let me work with the DOM the way JavaScript can
    > > work with web browsers' implementations of the DOM.  Specifically, I'd
    > > like to be able to access the innerHTML element of a DOM element.
    > > Python's built-in HTMLParser is SAX-based, so I don't want to use
    > > that, and the minidom doesn't appear to implement this part of the
    > > DOM.

    >
    > innerHTML has never been part of the DOM.  It is, however, a de facto
    > browser standard.  That's probably why you aren't having any luck
    > using a Python module that implements the DOM.


    That makes sense.
     
    Benjamin, Apr 26, 2008
    #7
  8. Benjamin

    Benjamin Guest

    On Apr 6, 11:03 pm, Stefan Behnel <> wrote:
    > Benjamin wrote:
    > > I'm trying to parse an HTML file.  I want to retrieve all of the text
    > > inside a certain tag that I find with XPath.  The DOM seems to make
    > > this available with the innerHTML element, but I haven't found a way
    > > to do it in Python.

    >
    >     import lxml.html as h
    >     tree = h.parse("somefile.html")
    >     text = tree.xpath("string( some/element[@condition] )")
    >
    > http://codespeak.net/lxml
    >
    > Stefan


    I actually had trouble getting this to work. I guess only newer versions
    of lxml have the html module, and I couldn't get it installed. lxml
    does look pretty cool, though.
     
    Benjamin, Apr 26, 2008
    #8
  9. Benjamin wrote:
    > On Apr 6, 11:03 pm, Stefan Behnel <> wrote:
    >> Benjamin wrote:
    >>> I'm trying to parse an HTML file. I want to retrieve all of the text
    >>> inside a certain tag that I find with XPath. The DOM seems to make
    >>> this available with the innerHTML element, but I haven't found a way
    >>> to do it in Python.

    >> import lxml.html as h
    >> tree = h.parse("somefile.html")
    >> text = tree.xpath("string( some/element[@condition] )")
    >>
    >> http://codespeak.net/lxml
    >>
    >> Stefan

    >
    > I actually had trouble getting this to work. I guess only newer versions
    > of lxml have the html module, and I couldn't get it installed. lxml
    > does look pretty cool, though.


    Yes, the above code requires lxml 2.x. However, older versions should allow
    you to do this:

    import lxml.etree as et
    parser = et.HTMLParser()
    tree = et.parse("somefile.html", parser)
    text = tree.xpath("string( some/element[@condition] )")

    lxml.html is just a dedicated package that makes HTML handling beautiful. It's
    not required for parsing HTML and doing general XML stuff with it.

    Stefan
     
    Stefan Behnel, Apr 26, 2008
    #9
