parsing html table cells

Discussion in 'Ruby' started by lrlebron@gmail.com, Nov 11, 2006.

  1. Guest

    I am trying to parse an HTML page that has strings that look like this

    <tr class="bg2" height="17" valign="middle" align="right"><td
    align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>

    to get the numbers inside the table cells.

    I would like to end up with a simple string that looks like this (for this
    row)
    4 47 1 19

    The number of table cells in a row that have numbers may vary for
    different rows.
    I'm new to Ruby so bear with me. I'm also learning to use hpricot and
    have been able to get the table rows using it.


    thanks,

    Luis
     
    , Nov 11, 2006
    #1

  2. David Vallner

    wrote:
    > I am trying to parse an html page that has strings that look like this
    >
    > <tr class="bg2" height="17" valign="middle" align="right"><td
    > align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr> to
    > get the numbers inside the table cells.
    >
    > I would like to end up with a simple string that looks like this (for this
    > row)
    > 4 47 1 19
    >
    > The number of table cells in a row that have numbers may vary for
    > different rows.
    > I'm new to Ruby so bear with me. I'm also learning to use hpricot and
    > have been able to get the table rows using it
    >


    I'd use XPath, though I'm not sure whether that's doable with hpricot's CSS
    selectors or its (admittedly, I think) basic XPath support.

    If you know the webpage is valid XHTML, I'd say switch to REXML; if not,
    massage it with tidy (maybe hpricot can do this better too) and then switch
    to REXML.

    The code would probably be something like (where doc is the REXML document):

    bg2_strings = doc.elements.to_a(%{//tr[@class='bg2']}).map { |bg2_row|
      bg2_row.elements.to_a('td').map { |cell| cell.text }.join(' ').strip.gsub(/\s+/, ' ')
    }

    Which might be horribly wrong, because I find REXML's XPath API hard to
    memorise. YMMV. (It also hates the text() node test with a passion,
    whence the second map.)

    David Vallner


     
    David Vallner, Nov 11, 2006
    #2

  3. Guest

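    Thanks for your help. I was able to get it with some hpricot code:

    # Number of cells in this row; cell 0 is the empty label cell.
    intCells = tr.search("td").length

    # Print cells 1..last, separated by spaces.
    1.upto(intCells - 1) do |i|
      print tr.search("td:eq(#{i})").inner_html + ' '
    end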
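
    If the string is needed as a value rather than printed, the joining step
    might look roughly like this. It is a sketch without hpricot so that it
    runs on its own: cells is a hypothetical array standing in for the
    inner_html of each td in a row, and row_to_string is a made-up helper name.

```ruby
# Hypothetical helper: turns the cell contents of one row into "4 47 1 19".
# `cells` is assumed to be an array of strings, one per <td>, in page order.
def row_to_string(cells)
  cells.map { |c| c.to_s.strip } # tolerate nil or padded cells
       .reject(&:empty?)         # skip empty cells such as the first one
       .join(' ')                # single spaces, no trailing space
end

puts row_to_string(['', '4', '47', '1', '19'])
puts row_to_string(['', '7', '12'])
```

    Filtering on emptiness instead of skipping index 0 also covers rows where
    the empty cell is not the first one.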


    thanks,

    Luis


     
    , Nov 11, 2006
    #3
