"Soup Strainer" for ElementSoup?

erikcw

Hi all,

I was reading in the Beautiful Soup documentation that you should use
a "Soup Strainer" object to keep memory usage down.

Since I'm already using ElementTree elsewhere in the project, I
figured it would make sense to use ElementSoup to keep the API
consistent (and cElementTree should be faster, right?).

I can't seem to figure out how to pass ElementSoup a "soup strainer"
though.

Any ideas?

Also - do I need to use the extract() method with ElementSoup like I
do with Beautiful Soup to keep garbage collection working?

Thanks!
Erik
 
John Nagle

I really should get my version of BeautifulSoup merged back into
the mainstream. I have one that's been modified to use weak pointers
for all "up" and "left" links, which makes the graph cycle free. So
the memory is recovered by reference count update as soon as you
let go of the head of the tree. That helps with the garbage problem.
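
To illustrate the idea (this is not the modified BeautifulSoup itself, just a minimal sketch with a hypothetical `Node` class), a tree whose "up" links are weak references contains no reference cycles, so CPython's reference counting frees the root as soon as the last strong reference to it goes away, even with the cycle collector switched off:

```python
import gc
import weakref


class Node:
    """Hypothetical tree node: strong links down, weak link up."""

    def __init__(self, name):
        self.name = name
        self.children = []    # strong references, parent -> child
        self._parent = None   # weak reference, child -> parent

    @property
    def parent(self):
        # Dereference the weak link; None if unset or the parent is gone.
        return self._parent() if self._parent is not None else None

    def append(self, child):
        child._parent = weakref.ref(self)
        self.children.append(child)
        return child


gc_was_enabled = gc.isenabled()
gc.disable()  # rule out the cycle collector; only refcounting remains
try:
    root = Node("html")
    body = root.append(Node("body"))
    parent_name_before = body.parent.name

    probe = weakref.ref(root)  # lets us observe when root is freed
    del root                   # "let go of the head of the tree"
    freed_without_gc = probe() is None
finally:
    if gc_was_enabled:
        gc.enable()
```

The immediate release relies on CPython's reference counting; other Python implementations may free the tree later. A "left" (previous sibling) link could be handled the same way.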

What are you parsing? If you're parsing well-formed XML,
BeautifulSoup is overkill. If you're parsing real-world HTML,
ElementTree is too brittle.

John Nagle
 
erikcw

I'm parsing real-world HTML with BeautifulSoup and XML with
cElementTree.

I'm guessing that the only benefit to using ElementSoup is that I'll
have one less API to keep track of, right? Or are there memory
benefits in converting the Soup object to an ElementTree?

Any idea about using a Soup Strainer with ElementSoup?

Thanks!
 
Stefan Behnel

erikcw said:
I'm parsing real-world HTML with BeautifulSoup and XML with
cElementTree.

I'm guessing that the only benefit to using ElementSoup is that I'll
have one less API to keep track of, right?

If your "real-world" HTML is still somewhat close to HTML, lxml.html might be
an option. It combines the ElementTree API with a good close-to-HTML parser
and some helpful HTML handling tools.

http://codespeak.net/lxml
http://codespeak.net/lxml/lxmlhtml.html

You can also use it with the BeautifulSoup parser if you really need to.

http://codespeak.net/lxml/elementsoup.html

Stefan
 
Fredrik Lundh

erikcw said:
I'm parsing real-world HTML with BeautifulSoup and XML with
cElementTree.

I'm guessing that the only benefit to using ElementSoup is that I'll
have one less API to keep track of, right? Or are there memory
benefits in converting the Soup object to an ElementTree?

It's purely an API thing: ElementSoup loads the entire HTML file with
BeautifulSoup, and then uses the resulting BS data structure to build an
ET tree.

The ET tree doesn't contain cycles, though, so you can safely pull out
the strings you need from ET and throw away the rest of the tree.
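
A rough sketch of that "pull out the strings, drop the tree" pattern, using only the standard library. The `extract_link_targets` helper and the sample markup are made up for illustration, and `ET.fromstring` stands in for whatever tree ElementSoup would hand back:

```python
import xml.etree.ElementTree as ET


def extract_link_targets(root):
    """Return the href of every <a> element as plain Python strings."""
    return [a.get("href") for a in root.iter("a") if a.get("href")]


# Stand-in for a tree built by ElementSoup (hypothetical sample markup).
doc = ET.fromstring(
    "<html><body>"
    "<p>See <a href='/first'>one</a> and <a href='/second'>two</a>.</p>"
    "<a>no target</a>"
    "</body></html>"
)

links = extract_link_targets(doc)

# The strings do not refer back to the tree, so dropping the last
# reference to the root lets the whole tree be reclaimed.
del doc
```

After `del doc`, only the small `links` list remains; no `extract()` calls are needed to break cycles first.
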

erikcw said:
Any idea about using a Soup Strainer with ElementSoup?

The strainer is used when parsing the file, to control what goes into
the BS tree; to add straining support to ES, you could e.g. add a
parseOnlyThese option that's passed through to BS.
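
One way such a pass-through might look, as a sketch only: `make_soup` and `RecordingSoup` are hypothetical names, not part of ElementSoup, and the soup class is injected so the snippet runs without BeautifulSoup installed. The `parseOnlyThese` keyword is the one BeautifulSoup 3 accepts; the later bs4 package calls it `parse_only`.

```python
def make_soup(markup, soup_class, parse_only_these=None, **soup_kwargs):
    """Build a soup, forwarding an optional strainer to the soup class.

    Hypothetical helper: soup_class would be BeautifulSoup in real use,
    and parse_only_these a SoupStrainer instance.
    """
    if parse_only_these is not None:
        # Keyword name as used by BeautifulSoup 3.
        soup_kwargs["parseOnlyThese"] = parse_only_these
    return soup_class(markup, **soup_kwargs)


class RecordingSoup:
    """Stand-in for BeautifulSoup that only records its arguments."""

    def __init__(self, markup, **kwargs):
        self.markup = markup
        self.kwargs = kwargs


strainer = object()  # placeholder for a real SoupStrainer
strained = make_soup("<a href='/x'>x</a>", RecordingSoup,
                     parse_only_these=strainer)
unstrained = make_soup("<a href='/x'>x</a>", RecordingSoup)
```

In real use the wrapper would sit where ElementSoup constructs its BeautifulSoup object, so the strainer limits what goes into the BS tree before the ET tree is built from it.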

</F>
 
