How to get table from some html

Discussion in 'Perl Misc' started by dysgraphia, Feb 5, 2007.

  1. dysgraphia

    dysgraphia Guest

    I am new to Perl and also to the Mechanize module.
    So far I have obtained a table, table[4] below, with
    useful text I would like to put into a tabular format like:

    List Position Patient Name Weight Height Clinic Doctor

    but I am unsure as to how to proceed.
    I will want to send the data to an Access db later so hopefully this
    format will be amenable to this.

    Any suggestions or assistance appreciated!

    Below is my code followed by the relevant portion of html.
    In practice the daily list may vary in length up to about 30 patients.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTTP::Cookies;

    my $mech = WWW::Mechanize->new(
    agent => 'Mozilla/4.0',
    cookie_jar => {}
    );

    my $url = 'http://www.somemedicaldata'; # not a real page

    $mech->get($url);
    unless ($mech->success) {
    die "Cannot get login page $url: ",
    $mech->response->status_line;
    }

    my $content = $mech->content();

    print "Content is: \"$content\"\n";

    # get table data
    my @table;
    my $tmp = $content;
    my $tablecount=0;

    while (my $result=$tmp=~/(?=\x3CTABLE).*?(?=\x3C\/TABLE\x3E)/igsm)
    {
    $tablecount++;
    $table[$tablecount]= $&;
    }

    print "Number of tables: \"$tablecount\"\n";

    # table4 has the useful data
    my ($dd1,$dd2) = split('<tr class="texttab" ',$table[4]);
    $table[4] = $dd2;

    # Save table4 raw to see what is collected
    open(my $fh, '>', 'table4raw.txt') or die "Cannot write table4raw.txt: $!";
    print $fh $table[4];
    close($fh);
    # end of code

    This is the table4 html:

    <table width="741" border="0" cellpadding="2" cellspacing="1">
    <tr bgcolor="#CC9966"

    class="texttab"> <td><div align="center"><font
    color="#663300"><strong>List
    Position</strong></font></div></td>
    <td><div align="center"><font

    color="#663300"><strong>Patient Name</strong></font></div></td>
    <td><div
    align="center"><font
    color="#663300"><strong>Weight</strong></font></div></td>

    <td><div align="center"><font
    color="#663300"><strong>Height</strong></font></div></td>
    <td><div align="center"><font
    color="#663300"><strong>Clinic</strong></font></div></td>
    <td><div align="center"><font
    color="#663300"><strong>Doctor</strong></font></div></td>
    </tr> <tr
    class="texttab" > <td
    align="center">1</td> <td align="center">A Smith
    </td>
    <td align="center">78.0</td> <td
    align="center">185</td>
    <td align="center">AM</td> <td align="center">F
    Magoo</td> </tr>
    <tr class="texttab" bgcolor=#FFFFFF >
    <td
    align="center">2</td> <td align="center">B
    Smith</td> <td
    align="center">56.0</td> <td
    align="center">165</td> <td
    align="center">PM</td> <td align="center">L
    Magee</td> </tr>
    <tr class="texttab" >
    <td align="center">3</td>
    <td align="center">C Smith </td>
    <td
    align="center">66.0</td> <td
    align="center">171</td> <td
    align="center">RM</td> <td align="center">R
    Magaa</td> </tr>
     
    dysgraphia, Feb 5, 2007
    #1

  2. On Feb 5, 5:48 am, dysgraphia wrote:
    > I am new to Perl and also to the Mechanize module.
    > So far I have obtained a table, table[4] below, with
    > useful text I would like to put into a tabular format like:
    >
    > List Position Patient Name Weight Height Clinic Doctor
    >
    > but I am unsure as to how to proceed.
    > I will want to send the data to an Access db later so hopefully this
    > format will be amenable to this.
    >
    > Any suggestions or assistance appreciated!


    I suggest you parse HTML with an HTML parser. Looking for a module with
    "HTML" and "Parser" in its name would be a good start. Since you are
    specifically looking to parse tables, you may want to see if there's
    one with "Table" in its name too.
     
    Brian McCauley, Feb 5, 2007
    #2

  3. dysgraphia wrote:

    > So far I have obtained a table, table[4] below, with
    > useful text I would like to put into a tabular format like:



    > Any suggestions or assistance appreciated!



    use HTML::TableExtract;
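
    A minimal sketch of how that module might be applied, assuming a cut-down, single-line copy of the posted table held in a string. The $html variable and the choice of three columns are illustrative and not from the thread; the real page wraps its header text across lines, so the header patterns may need adjusting there.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Cut-down stand-in for the page (an assumption); the real content
# would come from $mech->content.
my $html = q{
<table><tr><td>Layout table to be ignored</td></tr></table>
<table>
<tr><td><strong>List Position</strong></td><td><strong>Patient Name</strong></td><td><strong>Weight</strong></td></tr>
<tr><td align="center">1</td><td align="center">A Smith </td><td align="center">78.0</td></tr>
<tr><td align="center">2</td><td align="center">B Smith</td><td align="center">56.0</td></tr>
</table>
};

# Only tables whose header row matches these strings are extracted.
my $te = HTML::TableExtract->new(
    headers => [ 'List Position', 'Patient Name', 'Weight' ] );
$te->parse($html);

my @rows;
foreach my $ts ( $te->tables ) {
    foreach my $row ( $ts->rows ) {
        # Trim stray whitespace left over from the markup.
        my @cells = map {
            my $c = defined $_ ? $_ : '';
            $c =~ s/^\s+|\s+$//g;
            $c;
        } @$row;
        push @rows, \@cells;
    }
}

print join( "\t", @$_ ), "\n" for @rows;
```

    Selecting by header text rather than by position (table[4]) should keep working if the page gains or loses a layout table.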


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 5, 2007
    #3
  4. dysgraphia

    dysgraphia Guest

    Brian McCauley wrote:
    >
    > I suggest you parse HTML with an HTML parser. Looking for a module with
    > "HTML" and "Parser" in its name would be a good start. Since you are
    > specifically looking to parse tables, you may want to see if there's
    > one with "Table" in its name too.
    >


    Thanks Brian, I will look through the modules based on your suggestions.
    Your help is appreciated!...cheers, Peter
     
    dysgraphia, Feb 5, 2007
    #4
  5. gf

    gf Guest

    I am partial to HTML::TreeBuilder for my parsing.

    After a tree has been built from the HTML you use the methods in
    HTML::Element to traverse the tree. look_down() is very powerful and
    is my go-to routine.

    You can easily find the location of your target table in the tree with
    look_down(), then loop through the rows and cells, extracting the
    contents of the cells using as_text().

    Use an array to mimic the table structure. This is untested and
    doesn't check for all errors, but I'd loop through the table with
    something like:

    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $html = get('the URL you want to retrieve') or die "Can't get URL.\n";
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @table_data;
    foreach my $table ( $tree->look_down( '_tag' => 'table' ) )
    {
    foreach my $tr ( $table->look_down( '_tag' => 'tr' ) )
    {
    my @row_data;
    foreach my $td ( $table->look_down( '_tag' => 'td' ) )
    {
    push @row_data, $td->as_text();
    }
    push @table_data, [@row_data];
    }
    }

    foreach my $r (@table_data)
    {
    print join( "\t", @$r ), "\n";
    }

    You might have to flesh out the look_down() calls to narrow your table
    selections, but for a single table embedded in a page it should
    suffice.
     
    gf, Feb 5, 2007
    #5
  6. gf

    gf Guest

    > foreach my $td ( $table->look_down( '_tag' => 'td' ) )
    > {
    > push @row_data, $td->as_text();
    > }


    OOPS, that should be

    foreach my $td ( $tr->look_down( '_tag' => 'td' ) )
    {
    push @row_data, $td->as_text();
    }
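
    Putting the correction together, here is a self-contained sketch that runs against an inline snippet instead of a live URL. The $html string is a trimmed, made-up stand-in for the posted table.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Trimmed stand-in for the real page (an assumption for this sketch).
my $html = q{
<table>
<tr class="texttab"><td>List Position</td><td>Patient Name</td></tr>
<tr class="texttab"><td align="center">1</td><td align="center">A Smith</td></tr>
<tr class="texttab"><td align="center">2</td><td align="center">B Smith</td></tr>
</table>
};

my $tree = HTML::TreeBuilder->new_from_content($html);

my @table_data;
foreach my $tr ( $tree->look_down( '_tag' => 'tr' ) ) {
    # Cells are looked up under the current row, not the whole table.
    my @row_data = map { $_->as_text() } $tr->look_down( '_tag' => 'td' );
    push @table_data, \@row_data;
}
$tree->delete;    # release the tree once the text has been copied out

print join( "\t", @$_ ), "\n" for @table_data;
```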
     
    gf, Feb 5, 2007
    #6
  7. dysgraphia

    dysgraphia Guest

    gf wrote:
    > I am partial to HTML::TreeBuilder for my parsing.
    >
    > After a tree has been built from the HTML you use the methods in
    > HTML::Element to traverse the tree. look_down() is very powerful and
    > is my go-to routine.


    Thanks gf!
    I have had a look at your suggestion of HTML::TreeBuilder and can see
    it is most likely worth me learning. I have installed the module and
    given it some trial runs on example code and your code. Comments of mine
    below.

    > You can easily find the location of your target table in the tree with
    > look_down(), then loop through the rows and cells, extracting the
    > contents of the cells using as_text().
    >
    > Use an array to mimic the table structure. This is untested and
    > doesn't check for all errors, but I'd loop through the table with
    > something like:
    >
    > use warnings;
    > use strict;
    > use LWP::Simple;
    > use HTML::TreeBuilder;
    >
    > my $html = get('the URL you want to retrieve') or die "Can't get URL.\n";
    > my $tree = HTML::TreeBuilder->new_from_content($html);
    >
    > my @table_data;
    > foreach my $table ( $tree->look_down( '_tag' => 'table' ) )
    > {
    > foreach my $tr ( $table->look_down( '_tag' => 'tr' ) )
    > {
    > my @row_data;
    > foreach my $td ( $table->look_down( '_tag' => 'td' ) )
    > {
    > push @row_data, $td->as_text();
    > }
    > push @table_data, [@row_data];
    > }
    > }
    >
    > foreach my $r (@table_data)
    > {
    > print join( "\t", @$r ), "\n";
    > }
    >
    > You might have to flesh out the look_down() calls to narrow your table
    > selections, but for a single table embedded in a page it should
    > suffice.
    >


    I tried your code and it ran perfectly. My project has a
    table-within-tables structure. The HTML has a lot of dross that I want
    to avoid.
    I did a bit of digging and found some articles by Sean M. Burke, e.g.
    http://aspn.activestate.com/ASPN/docs/ActivePerl-5.6/site/lib/HTML/Tree/Scanning.html
    and tried to use his suggestion for rejecting certain tables.
    He wrote:

    $h1 = $tree->look_down('_tag', 'h1');
    returns the first element at-or-under $tree whose "_tag" attribute has
    the value "h1".......
    you could exclude ``h1'' elements that contain the word ``visit'' under
    them:

    my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
    $_[0]->as_text !~ m/\bvisit/i
    }
    );

    I adapted and tried this code but could not get the table to be excluded.
    In my case the HTML has a large (approx 700 line) table I don't want.
    This table has tags like <option>....</option> to identify it but
    putting this in the above code did not work.
    Any comments or suggestions of yours are welcome...thanks again for your
    help so far....cheers, Peter
     
    dysgraphia, Feb 6, 2007
    #7
  8. dysgraphia

    dysgraphia Guest

    Tad McClellan wrote:
    >>Any suggestions or assistance appreciated!

    >
    > use HTML::TableExtract;
    >

    Thanks Tad I will check this module out.
     
    dysgraphia, Feb 6, 2007
    #8
  9. dysgraphia

    dysgraphia Guest

    Michele Dondi wrote:
    > You will have to parse it. So use some HTML parsing module. One such
    > module that gets mentioned frequently here is HTML::TokeParser. There
    > are others though, and you may want to check some of them to find the
    > best one for you.


    Thanks for your input Michele, I will have a look at TokeParser.
    >
    >>List Position Patient Name Weight Height Clinic Doctor

    >
    > Do you mean in pure text? Then use some pure text table formatting
    > module, like Text::Table or Perl6::Form.
    >
    > Michele


    I am using Perl 5.8 from ActiveState. My initial requirement was to see
    the text in either a text editor or spreadsheet format. This was just to
    ensure I am getting the data correctly as I will have a need to download
    many files on a weekly basis. When the parsing looks OK I will then send
    it to a db.

    Again, thanks for your help Michele...appreciated...cheers, Peter
     
    dysgraphia, Feb 6, 2007
    #9
  10. gf

    gf Guest

    On Feb 6, 6:45 am, dysgraphia wrote:

    > I tried your code and it ran perfectly.


    That occasionally happens. :)

    [...]

    > my $real_h1 = $tree->look_down(
    > '_tag', 'h1',
    > sub {
    > $_[0]->as_text !~ m/\bvisit/i
    > }
    > );
    >


    You're on the right track, just keep following it. Because you're so
    close to the answer I'm just going to say "keep going".

    sub {} calls in look_down() are your friends - they're really
    powerful. Sometimes I've needed to use multiple embedded subs to chain
    together the results of the look_down(). In effect this causes the
    test to drill down into the HTML deeper and deeper to determine if the
    child nodes contain what you want.

    And remember that the criteria passed to a look_down() are ANDed
    together: an element is returned only if it passes every attribute
    test and every embedded sub {}.

    Also, the use of qr// regexp patterns can be powerful OR tests.

    Stylistically I like to use the '=>' operator to separate my argument
    pairs in the look_down() parameter list rather than plain commas.
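
    A small self-contained check of how look_down() criteria combine, run on made-up markup (the class names are invented for illustration). As the HTML::Element documentation describes it, an element is returned only when it passes every criterion in the list; alternation inside a single qr// is one way to express an either/or test.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented markup, just enough to exercise the criteria.
my $tree = HTML::TreeBuilder->new_from_content(q{
<table>
<tr class="row1"><td>alpha</td></tr>
<tr class="row2"><td>beta</td></tr>
<tr class="other"><td>alpha again</td></tr>
</table>
});

# Every criterion must hold: tag is tr, class matches, and the sub is true.
my @both = $tree->look_down(
    '_tag'  => 'tr',
    'class' => qr/^row[12]$/,
    sub { $_[0]->as_text =~ /alpha/ },
);

# Alternation inside one qr// gives an either/or test on one attribute.
my @either = $tree->look_down(
    '_tag'  => 'tr',
    'class' => qr/^(?:row2|other)$/,
);

print scalar(@both),   " row(s) pass all three tests\n";
print scalar(@either), " row(s) have class row2 or other\n";
```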

    OK, I lied. Here's an (untested) example of drilling in farther.

    [...]
    foreach my $_tr (
    $tree->look_down(
    '_tag' => 'tr',
    'class' => qr/row[123]/,
    sub {
    $_[0]->look_down(
    '_tag' => 'td',
    'id' => qr/^datafield_(?:name|date|age)/,
    sub {
    $_[0]->as_text() =~ /\bfoo\b/;
    }
    );
    }
    )
    )
    {
    ; # ...do something revolutionary here
    }
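
    Applied to the earlier question about skipping the large table full of <option> tags, the same idea might look like the sketch below. The markup and the id attributes are invented so the result is easy to see; the real page will differ.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Two sibling tables: one holding a drop-down, one holding the data.
my $tree = HTML::TreeBuilder->new_from_content(q{
<table id="picker"><tr><td><select><option>AM</option><option>PM</option></select></td></tr></table>
<table id="patients"><tr><td>1</td><td>A Smith</td></tr></table>
});

# Keep only tables with no <option> element anywhere beneath them.
# The inner look_down() runs in scalar context, so it returns the
# first match or undef, and the sub negates that.
my @wanted = $tree->look_down(
    '_tag' => 'table',
    sub { !$_[0]->look_down( '_tag' => 'option' ) },
);

print $_->attr('id'), "\n" for @wanted;
```

    With nested tables, an outer table that wraps the drop-down table is rejected as well, because the inner look_down() searches all descendants.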
     
    gf, Feb 6, 2007
    #10
  11. dysgraphia

    dysgraphia Guest

    Michele Dondi wrote:
    > If you want to export to a spreadsheet probably the best way would be
    > to write your data to CSV, in which case Text::CSV_XS comes as a
    > precious tool.
    >
    >
    > Michele


    Thanks again Michele! I will install the Text::CSV_XS module...cheers, Peter
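
    A minimal sketch of that CSV step, assuming the rows have already been collected into an array of arrays as in gf's example. The file name patients.csv is hypothetical, and the rows are copied from the sample table in the first post.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Rows as they might look after parsing: headings first, then data.
my @table_data = (
    [ 'List Position', 'Patient Name', 'Weight', 'Height', 'Clinic', 'Doctor' ],
    [ 1, 'A Smith', '78.0', 185, 'AM', 'F Magoo' ],
    [ 2, 'B Smith', '56.0', 165, 'PM', 'L Magee' ],
);

# eol makes print() terminate each record with a newline.
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );

# patients.csv is a hypothetical output name.
open my $fh, '>', 'patients.csv' or die "Cannot write patients.csv: $!";
$csv->print( $fh, $_ ) for @table_data;
close $fh or die "Cannot close patients.csv: $!";
```

    Letting the module do the quoting avoids trouble if a name ever contains a comma or a quote character.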
     
    dysgraphia, Feb 7, 2007
    #11
  12. dysgraphia

    dysgraphia Guest

    gf wrote:
    > You're on the right track, just keep following it. Because you're so
    > close to the answer I'm just going to say "keep going".
    >
    > sub {} calls in look_down() are your friends - they're really
    > powerful. Sometimes I've needed to use multiple embedded subs to chain
    > together the results of the look_down(). In effect this causes the
    > test to drill down into the HTML deeper and deeper to determine if the
    > child nodes contain what you want.
    >
    > And remember that the criteria passed to a look_down() are ANDed
    > together: an element is returned only if it passes every attribute
    > test and every embedded sub {}.
    >
    > Also, the use of qr// regexp patterns can be powerful OR tests.
    >
    > Stylistically I like to use the '=>' operator to separate my argument
    > pairs in the look_down() parameter list rather than plain commas.
    >
    > OK, I lied. Here's an (untested) example of drilling in farther.
    >
    > [...]
    > foreach my $_tr (
    > $tree->look_down(
    > '_tag' => 'tr',
    > 'class' => qr/row[123]/,
    > sub {
    > $_[0]->look_down(
    > '_tag' => 'td',
    > 'id' => qr/^datafield_(?:name|date|age)/,
    > sub {
    > $_[0]->as_text() =~ /\bfoo\b/;
    > }
    > );
    > }
    > )
    > )
    > {
    > ; # ...do something revolutionary here
    > }
    >


    Thanks gf!...A carton of cyber-beer is on its way!!
    After wrestling with this code of yours I now am getting closer to the
    finishing line. In order to better see what I am getting I
    write to an Excel sheet for now....a hangover from the time when I
    generated this data using web queries and VBA, a method that fell over
    recently due to changes in the web page.
    I am still not sure I understand how your code does its magic yet but
    the output to the spreadsheet is very close to what I want. All the text
    from both tables is coming through OK but I would like a different final
    layout.
    Basically the first table has general data about a group of
    patients/tests whilst the second table has specific data about each
    patient within the group. At present my output is coming out as:
    Table_1 field headings (one row)
    Table_1 data (one row)
    Table_2 field headings (one row)
    Table_2 data (many rows)

    My objective is a table format like:
    Table_1 field headings Table_2 field headings
    Table_1 data Table_2 data row 1
    Table_1 data Table_2 data row 2
    Table_1 data Table_2 data row 3
    etc etc

    That is, the Table_1 data is repeated for each row of Table_2 data.
    This final output will be sent to a relational db.
    I have been looking at building this table from array elements of
    Table_1 and Table_2 but so far without success.
    Any pointers?
    Cheers, Peter
     
    dysgraphia, Feb 9, 2007
    #12
  13. dysgraphia

    dysgraphia Guest

    Jim Gibson wrote:
    (snipped for brevity)

    > Yes. Show us some code. How can anybody help you without knowing what
    > you have done so far.
    >
    > I suggest you divide your problem into two parts:
    >
    > 1. Parsing the web page, extracting the data, and storing every piece
    > of data you need into internal Perl structures (arrays, hashes,
    > arrays-of-hashes, etc.)
    >
    > 2. Transforming the extracted data into the form you need.
    >
    > Let us know with which of these parts are you having trouble.
    >
    > Use Data::Dumper to confirm that part 1 is working the way you want. If
    > you want help with part 2, write a test program that starts with simple
    > versions of your data structures generated from assignment statements
    > and post the entire program. That way, anybody can run your program for
    > themselves and suggest fixes or improvements.
    >
    > Good luck.


    G'day Jim!
    Thanks for your input and suggestions which have been quite helpful.
    The code being used is in the preceding posts of this thread. This
    includes my original code plus the valuable suggestions of gf. In the
    interests of brevity I did not repeat all the code but I take your point
    that for a new poster entering the thread it would have been better to
    have repeated it.
    Thanks for the mention of Data::Dumper which I will investigate
    as an alternative to the Excel route.
    Cheers, Peter
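
    Following Jim's advice, a test program for part 2 might start from hand-written data like the sketch below. The Table_1 field names and values are invented for illustration, since the thread never shows them. It repeats the Table_1 data in front of each Table_2 row and uses Data::Dumper to inspect the result.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-written stand-ins for the parsed tables.
# The Table_1 headings and values are invented for this sketch.
my @t1_head = ( 'List Date', 'Ward' );
my @t1_data = ( '2007-02-05', 'East' );
my @t2_head = ( 'List Position', 'Patient Name', 'Weight' );
my @t2_rows = (
    [ 1, 'A Smith', '78.0' ],
    [ 2, 'B Smith', '56.0' ],
);

# One heading row, then Table_1 data repeated before each Table_2 row.
my @merged = ( [ @t1_head, @t2_head ] );
push @merged, [ @t1_data, @$_ ] for @t2_rows;

# Inspect the structure, then print it tab-separated.
$Data::Dumper::Indent = 1;
print Dumper( \@merged );
print join( "\t", @$_ ), "\n" for @merged;
```

    Once the merged rows look right in the dump, the same array of arrays can be handed to whatever writes the spreadsheet or database.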
     
    dysgraphia, Feb 10, 2007
    #13
