newb: BeautifulSoup

crybaby

I need to traverse an HTML page with a big table that has many rows and
columns. For example, how do I go to the 35th td tag and use a regex to
retrieve the content? After that is done, how do I move down 15 td tags
from the 35th tag (35+15) and use a regex to retrieve the content?
 
TheFlyingDutchman

Make the file an xhtml file (valid xml) if it isn't already and then
you can use software written to process XML files:

http://pyxml.sourceforge.net/topics/
 
7stud

1) You can find your table using one of these methods:

a)
target_table = soup.find('table', id='car_parts')

b)
tables = soup.findAll('table')
target_table = tables[2]

The tables are put in a list in the order that they appear on the
page.


2) You can get all the td's in the table using this statement:

all_tds = target_table.findAll('td')


3) You can get the contents of the tags using these statements:

print all_tds[34].string
print all_tds[49].string


Here is an example:

from BeautifulSoup import BeautifulSoup

doc = """
<html>
<head>
<title></title>
</head>
<body>
<table>
</table>

<table>
<tr><td>hello</td></tr>
<tr><td>world</td><td>goodbye</td></tr>
</table>
</body>
</html>
"""

soup = BeautifulSoup(doc)

tables = soup.findAll('table')
target_table = tables[1]

all_tds = target_table.findAll('td')
print all_tds[0].string
print all_tds[2].string

--output:--
hello
goodbye
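
The original question also asked about running a regex on the 35th cell and then on the cell 15 positions further on. A minimal sketch of that step follows. It is written for Python 3 and uses only the standard library's html.parser in place of BeautifulSoup, so it is a swapped-in technique and not the code from this thread. The names CellCollector and grab, the generated 60-cell table, and the number pattern are all made up for illustration. It collects every td on the page (it does not pick a target table first) and assumes no nested tables.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <td> in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []        # one string per <td>
        self._in_td = False    # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text nested in child tags (e.g. <font>) is appended as well.
        if self._in_td:
            self.cells[-1] += data


def grab(cells, index, pattern):
    """Return the first regex match in cells[index], or None if there is none."""
    match = re.search(pattern, cells[index])
    return match.group(0) if match else None


# Made-up sample data: a single row with 60 cells.
doc = "<table><tr>" + "".join(
    "<td>price %d.5</td>" % n for n in range(1, 61)
) + "</tr></table>"

parser = CellCollector()
parser.feed(doc)
parser.close()

number = r"\d+\.\d+"
first = grab(parser.cells, 34, number)         # 35th td (indexes start at 0)
second = grab(parser.cells, 34 + 15, number)   # 15 cells further on
print(first, second)                           # prints: 35.5 50.5
```

With BeautifulSoup the same idea would apply to the all_tds list above: take the cell's text, then hand it to re.search.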
 
crybaby

I added extra td tags to your example, and for some reason I am
getting None when I do the following:

print all_tds[0].string
print all_tds[8].string


from BeautifulSoup import BeautifulSoup

doc = """
<html>
<head>
<title></title>
</head>
<body>
<table>
</table>

<table>
<tr><td>hello</td></tr>
<tr><td>world</td><td>goodbye</td></tr>
<tr>
<td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
<td align=right width=80><font size=2 face="New Times Roman,Times,Serif">&nbsp;48.884&nbsp;</font></td>
<td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
<td align=right width=80><font size=2 face="New Times Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>
<td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
<td align=right width=80><font size=2 face="New Times Roman,Times,Serif">&nbsp;69.322&nbsp;</font></td>
<td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
<td align=right width=80><font size=2 face="New Times Roman,Times,Serif">&nbsp;99.740&nbsp;</font></td>
<td width=1 height=0 bgcolor="#800000"><img src="/img/spacer.gif" width=1 height=0 alt="|"/></td>
</tr>
</table>
</body>
</html>
"""

soup = BeautifulSoup(doc)

tables = soup.findAll('table')
target_table = tables[1]

all_tds = target_table.findAll('td')
print all_tds[0].string
print all_tds[8].string
tds_str = all_tds[8].string
print tds_str

The output I am getting is the following:
None
None

I am not sure why I am getting None for these lines:

print all_tds[0].string
print all_tds[8].string
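
A possible explanation for index 8, offered as a sketch and not as a confirmed diagnosis: as far as the BeautifulSoup 3 documentation describes it, .string is only set when a tag's single child is a text node. In the table above, the td at index 8 holds only an img tag, and the td at index 3 holds a font tag that wraps the text, so neither has a text node as its direct child. The Python 3 snippet below uses only the standard library's html.parser (not BeautifulSoup) to list the direct children of each td in a shortened copy of the markup. The class name TdChildren is made up for illustration. This does not account for the None reported for index 0.

```python
from html.parser import HTMLParser


class TdChildren(HTMLParser):
    """Hypothetical helper: list each <td>'s direct children by kind."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.children = []   # one list per <td>: tag names or "#text"
        self._depth = None   # nesting depth inside the current <td>, None outside

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.children.append([])
            self._depth = 0
        elif self._depth is not None:
            if self._depth == 0:
                self.children[-1].append(tag)
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td":
            self._depth = None
        elif self._depth:
            self._depth -= 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img .../> never change the depth.
        if self._depth == 0:
            self.children[-1].append(tag)

    def handle_data(self, data):
        # Only non-blank text sitting directly inside the <td> is recorded.
        if self._depth == 0 and data.strip():
            self.children[-1].append("#text")


# Shortened copy of the markup from the post above.
doc = """
<table>
<tr><td>hello</td></tr>
<tr>
<td width=1><img src="/img/spacer.gif" alt="|"/></td>
<td align=right><font size=2>&nbsp;48.884&nbsp;</font></td>
</tr>
</table>
"""

parser = TdChildren()
parser.feed(doc)
parser.close()
print(parser.children)   # prints: [['#text'], ['img'], ['font']]
```

Only the first cell has a text node as its direct child. For the numeric cells, the text would have to be read from the font tag inside the td instead of from the td itself.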
