Help with using findAll() in BeautifulSoup

Alexnb · Jul 12, 2008

Okay, I am not sure if there is a better way of doing this than findAll() but
that is how I am doing it right now. I am making an app that screen-scrapes
dictionary.com for definitions. However, I would like to have the type of
the word for each definition. For example, if def1 and def2 are noun
definitions but def3 isn't:

noun
def1
def2
verb
def3

Something like that. Now, I can get the definitions just fine. The
problem comes when I want to get the type. I can get the types, but I don't
know which definitions they go with. So I can get noun and verb, but for
all I know noun is def1, and verb is def2 and def3. I am wondering if there
is a way to use findAll() but have it stop once it hits a certain thing, or
some other way to do just that. For example, if I have

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do something like findAll('span', {'class': 'pg'}), but
have it tell me how many <table> elements come after each span and before the
next one, so I know how many definitions it has.

Here is the code I am using (I used "cheese" because that is kind of my test
word for everything in the app):

import urllib
from BeautifulSoup import BeautifulSoup

class defWord:
    def __init__(self, word):
        self.word = word

        def get_types(term):
            soup = BeautifulSoup(urllib.urlopen(
                'http://dictionary.reference.com/search?q=%s' % term))

            for tabs in soup.findAll('span', {'class': 'pg'}):
                yield tabs.contents[0].string

        self.mainList = list(get_types(self.word))
        print self.mainList

type = defWord("cheese")

I don't know if this is really something anyone can help me fix or if I have
to do it on my own. But I would love some help.

Stefan Behnel · Jul 12, 2008

Alexnb said:
Okay, I am not sure if there is a better way of doing this than findAll() but
that is how I am doing it right now.

Consider using lxml.html and lxml.cssselect.

http://codespeak.net/lxml/

I am making an app that screen-scrapes
dictionary.com for definitions.

Do they have a policy for doing that?

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do something like findAll('span', {'class': 'pg'}), but
have it tell me how many <table> elements come after each span and before the
next one, so I know how many definitions it has.

You didn't say where the "span" is in the HTML code, but lxml.cssselect should
get you pretty close to what you want. If your tables are descendants of the
"span"s, a selector like:

"span.pg table"

might work. There's also a CSS syntax for siblings (the "+" and "~"
combinators).
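
Since the markup around the "span" isn't shown, here is a minimal sketch of the sibling case, assuming each <span class="pg"> label is followed by one <table> per definition until the next label. It uses the standard library's html.parser as a stand-in so that it runs without lxml or BeautifulSoup; it is not what either library does internally. The sample HTML and the names TypeGrouper and group_definitions are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page structure is not shown in this thread.
SAMPLE_HTML = """
<div>
  <span class="pg">noun</span>
  <table><tr><td>def1</td></tr></table>
  <table><tr><td>def2</td></tr></table>
  <span class="pg">verb</span>
  <table><tr><td>def3</td></tr></table>
</div>
"""


class TypeGrouper(HTMLParser):
    """Attach the text of each <table> to the most recent <span class="pg">."""

    def __init__(self):
        super().__init__()
        self.groups = []        # (word_type, [definition, ...]) in document order
        self._in_label = False  # True while inside <span class="pg">
        self._table_depth = 0   # > 0 while inside a (possibly nested) <table>
        self._buffer = []       # text fragments of the element being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "pg" in classes and self._table_depth == 0:
            self._in_label = True
            self._buffer = []
        elif tag == "table":
            if self._table_depth == 0:
                self._buffer = []
            self._table_depth += 1

    def handle_data(self, data):
        if self._in_label or self._table_depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_label:
            # Assumes the label span contains no nested <span>.
            self._in_label = False
            self.groups.append(("".join(self._buffer).strip(), []))
        elif tag == "table" and self._table_depth:
            self._table_depth -= 1
            # Tables that appear before any label are ignored.
            if self._table_depth == 0 and self.groups:
                text = " ".join("".join(self._buffer).split())
                self.groups[-1][1].append(text)


def group_definitions(html):
    """Return [(word_type, [definitions])] from a single pass over the HTML."""
    parser = TypeGrouper()
    parser.feed(html)
    parser.close()
    return parser.groups


for word_type, definitions in group_definitions(SAMPLE_HTML):
    print(word_type, len(definitions), definitions)
```

The point of the sketch is the single pass in document order: every table is credited to the last label seen, so the count per type falls out without a second lookup. If the real page nests the tables differently, the grouping rule would need adjusting.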

Stefan

Paul McGuire · Jul 12, 2008

Do they have a policy for doing that?

From the Dictionary.com Terms of Use
(http://dictionary.reference.com/help/terms.html):

3.2 You will not modify, publish, transmit, participate in the
transfer or sale, create derivative works, or in any way exploit, any
of the content, in whole or in part, found on the Site. You will
download copyrighted content solely for your personal use, but will
make no other use of the content without the express written
permission of Lexico and the copyright owner. You will not make any
changes to any content that you are permitted to download under this
Agreement, and in particular you will not delete or alter any
proprietary rights or attribution notices in any content. You agree
that you do not acquire any ownership rights in any downloaded
content.

IANAL, but it seems pretty clear that, unless this content scraper is
"solely for your personal use," you'll need to get written permission
to include content that you have scraped from Dictionary.com into your
app.

-- Paul
