Parsing HTML?

Benjamin

I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
 
benash

BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.
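
For comparison, here is a minimal sketch of what the event-based approach looks like. It assumes Python 3, where the module is html.parser rather than HTMLParser; the TagTextCollector class and the sample markup are hypothetical and only illustrate that the state has to be tracked by hand.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collect text found inside a given tag."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # number of open target tags we are inside
        self.chunks = []    # text pieces seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = TagTextCollector("p")
collector.feed("<div><p>Hello <b>bold</b> world</p><span>skipped</span></div>")
collector.close()
text = "".join(collector.chunks)
```

This yields only the text, not the markup, and it selects by tag name rather than by XPath, which is why a tree-based library is a better fit for the question.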
 
Paul Boddie

Benjamin said:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

With libxml2dom you'd do the following:

1. Parse the file using libxml2dom.parse with html set to a true
   value.
2. Use the xpath method on the document to select the desired
   element.
3. Use the toString method on the element to get the text of the
   element (including start and end tags), or the textContent
   property to get the text between the tags.

See the Package Index page for more details:

http://www.python.org/pypi/libxml2dom

Paul
 
7stud

benash said:
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM.  Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.
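
An innerHTML-like result can still be approximated on top of a plain DOM implementation by serializing the child nodes yourself. The following is a minimal sketch using xml.dom.minidom; the inner_markup helper and the sample markup are hypothetical, and minidom only accepts well-formed XML, so this works for XHTML-style input rather than arbitrary HTML.

```python
from xml.dom import minidom


def inner_markup(element):
    """Hypothetical helper: serialize the children of a DOM element."""
    # toxml() works on text nodes and element nodes alike, so joining
    # the serialized children gives the markup between the tags.
    return "".join(child.toxml() for child in element.childNodes)


# Hypothetical, well-formed sample document.
doc = minidom.parseString('<div><p id="x">Hello <b>bold</b> world</p></div>')
p = doc.getElementsByTagName("p")[0]
inner = inner_markup(p)
```

Calling p.toxml() instead would include the enclosing start and end tags as well.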
 
Stefan Behnel

Benjamin said:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan
 
Benjamin

7stud said:
innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.

That makes sense.
 
Benjamin

Stefan Behnel said:
    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan

I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.
 
Stefan Behnel

Benjamin said:
I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

    import lxml.etree as et
    parser = et.HTMLParser()
    tree = et.parse("somefile.html", parser)
    text = tree.xpath("string( some/element[@condition] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan
 
