parsing html table cells

lrlebron

I am trying to parse an HTML page that has strings that look like this:

<tr class="bg2" height="17" valign="middle" align="right"><td align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>

I want to get the numbers inside the table cells.

I would like to end up with a simple string that looks like this (for this
row):
4 47 1 19

The number of table cells that contain numbers may vary from row to
row.
I'm new to Ruby, so bear with me. I'm also learning to use hpricot and
have been able to get the table rows using it.


thanks,

Luis
 
David Vallner

I'd use XPath, though I'm not sure whether that's doable with hpricot's CSS
selectors or its (admittedly, I think, basic) XPath support.

If you know the webpage is valid XHTML, I'd say switch to REXML; if not,
massage it with tidy (maybe hpricot can do this better too) and then switch
to REXML.

The code would probably be something like this (where doc is the REXML document):

bg2_strings = doc.elements.to_a(%{//tr[@class='bg2']}).map { |bg2_row|
  bg2_row.elements.to_a('td').map { |cell| cell.text }.join(' ').strip.gsub(/\s+/, ' ')
}

Which might be horribly wrong, because I find REXML's XPath API hard to
memorise. YMMV. (It also hates the text() node test with a passion,
whence the second map.)
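
For anyone who wants to try the REXML route end to end, here is a minimal sketch rather than a definitive implementation. It assumes the markup is well-formed XML, so the sample row from the question is wrapped in a <table> root, and a second row with a made-up class is added only to show the filter. The names SAMPLE and bg2_strings are invented for this example and do not come from the thread.

```ruby
require 'rexml/document'

# The sample row from the question, wrapped in a <table> root element.
# The second row is an assumption, added to show that other classes are skipped.
SAMPLE = <<~XML
  <table>
    <tr class="bg2" height="17" valign="middle" align="right"><td align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>
    <tr class="bg1"><td>ignored</td></tr>
  </table>
XML

# Hypothetical helper (not from the thread): returns one string per bg2 row.
def bg2_strings(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//tr[@class='bg2']").map do |row|
    row.get_elements('td')
       .map { |cell| cell.text.to_s.strip } # text is nil for an empty <td>
       .reject(&:empty?)                    # drop cells with no content
       .join(' ')
  end
end

puts bg2_strings(SAMPLE)
```

With the sample above this prints 4 47 1 19. Dropping empty cells before joining means the leading empty cell does not need the strip/gsub cleanup afterwards.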

David Vallner


 
lrlebron

Thanks for your help. I was able to get it with some hpricot code:

# Number of cells in this row
intCells = tr.search("td").length

# Skip the first (empty) cell and print the remaining ones
1.upto(intCells-1) do |i|
  print tr.search("td:eq(#{i})").inner_html + ' '
end
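
The loop above relies on the first cell always being the empty one and leaves a trailing space. A possible variation, sketched here in plain Ruby, is to collect the cell texts first and then filter and join them. The array cells is a stand-in for whatever the parser returns for each td (with hpricot that would presumably be each cell's inner_html, which is an assumption here), and row_to_string is a name invented for this example.

```ruby
# Hypothetical helper (not from the thread): joins the non-empty
# cell texts of one row with single spaces.
def row_to_string(cells)
  cells.map(&:strip).reject(&:empty?).join(' ')
end

# Stand-in for the contents of each <td> in the sample row.
cells = ['', '4', '47', '1', '19']
puts row_to_string(cells)
```

For the sample row this prints 4 47 1 19, and it keeps working if the empty cell moves or the number of cells changes.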


thanks,

Luis


 
