How to get table from some html

dysgraphia · Feb 5, 2007

I am new to Perl and also to the Mechanize module.
So far I have obtained a table, table[4] below, with
useful text I would like to put into a tabular format like:

List Position Patient Name Weight Height Clinic Doctor

but I am unsure as to how to proceed.
I will want to send the data to an Access db later so hopefully this
format will be amenable to this.

Any suggestions or assistance appreciated!

Below is my code followed by the relevant portion of html.
In practice the daily list may vary in length up to about 30 patients.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

my $mech = WWW::Mechanize->new(
agent => 'Mozilla/4.0',
cookie_jar => {}
);

my $url = 'http://www.somemedicaldata'; # not a real page

$mech->get($url);
unless ($mech->success) {
die "Cannot get login page $url: ",
$mech->response->status_line;
}

my $content = $mech->content();

print "Content is: \"$content\"\n";

# get table data
my @table;
my $tmp = $content;
my $tablecount=0;

while (my $result=$tmp=~/(?=\x3CTABLE).*?(?=\x3C\/TABLE\x3E)/igsm)
{
$tablecount++;
$table[$tablecount]= $&;
}

print "Number of tables: \"$tablecount\"\n";

# table4 has the useful data
my ($dd1,$dd2) = split('<tr class="texttab" ',$table[4]);
$table[4] = $dd2;

# Save table4 raw to see what is collected
open(my $fh, '>', 'table4raw.txt') or die "Cannot open table4raw.txt: $!";
print $fh $table[4];
close($fh);
# end of code

This is the table4 html:

<table width="741" border="0" cellpadding="2" cellspacing="1">
<tr bgcolor="#CC9966"

class="texttab"> <td><div align="center"><font
color="#663300"><strong>List
Position</strong></font></div></td>
<td><div align="center"><font

color="#663300"><strong>Patient Name</strong></font></div></td>
<td><div
align="center"><font
color="#663300"><strong>Weight</strong></font></div></td>

<td><div align="center"><font
color="#663300"><strong>Height</strong></font></div></td>
<td><div align="center"><font
color="#663300"><strong>Clinic</strong></font></div></td>
<td><div align="center"><font
color="#663300"><strong>Doctor</strong></font></div></td>
</tr> <tr
class="texttab" > <td
align="center">1</td> <td align="center">A Smith
</td>
<td align="center">78.0</td> <td
align="center">185</td>
<td align="center">AM</td> <td align="center">F
Magoo</td> </tr>
<tr class="texttab" bgcolor=#FFFFFF >
<td
align="center">2</td> <td align="center">B
Smith</td> <td
align="center">56.0</td> <td
align="center">165</td> <td
align="center">PM</td> <td align="center">L
Magee</td> </tr>
<tr class="texttab" >
<td align="center">3</td>
<td align="center">C Smith </td>
<td
align="center">66.0</td> <td
align="center">171</td> <td
align="center">RM</td> <td align="center">R
Magaa</td> </tr>

Brian McCauley · Feb 5, 2007

dysgraphia said:
I am new to Perl and also to the Mechanize module.
So far I have obtained a table, table[4] below, with
useful text I would like to put into a tabular format like:

List Position Patient Name Weight Height Clinic Doctor

but I am unsure as to how to proceed.
I will want to send the data to an Access db later so hopefully this
format will be amenable to this.

Any suggestions or assistance appreciated!

I suggest you parse HTML with an HTML parser. Looking for a module with
"HTML" and "Parser" in its name would be a good start. Since you are
specifically looking to parse tables, you may want to see if there's
one with "Table" in its name too.

Tad McClellan · Feb 5, 2007

dysgraphia said:
So far I have obtained a table, table[4] below, with
useful text I would like to put into a tabular format like:

Any suggestions or assistance appreciated!

use HTML::TableExtract;
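
For reference, a minimal sketch of how HTML::TableExtract might be applied to the table in the question. The inline HTML is a cut-down, hypothetical stand-in for the real page (which would come from $mech->content), and the whitespace trimming is an assumption about what the final output should look like:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: one unwanted table, then the patient table.
my $html = <<'HTML';
<table><tr><td>Navigation</td></tr></table>
<table width="741" border="0">
<tr class="texttab"><td><strong>List Position</strong></td>
    <td><strong>Patient Name</strong></td><td><strong>Weight</strong></td>
    <td><strong>Height</strong></td><td><strong>Clinic</strong></td>
    <td><strong>Doctor</strong></td></tr>
<tr class="texttab"><td align="center">1</td><td align="center">A Smith </td>
    <td align="center">78.0</td><td align="center">185</td>
    <td align="center">AM</td><td align="center">F Magoo</td></tr>
<tr class="texttab"><td align="center">2</td><td align="center">B Smith</td>
    <td align="center">56.0</td><td align="center">165</td>
    <td align="center">PM</td><td align="center">L Magee</td></tr>
</table>
HTML

# Select the table by its column headings instead of by position.
my @headers = ( 'List Position', 'Patient Name', 'Weight',
                'Height', 'Clinic', 'Doctor' );
my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);

my @rows;
foreach my $ts ( $te->tables ) {
    foreach my $row ( $ts->rows ) {
        # Empty cells may come back undefined; normalise and trim them.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @cells;
        push @rows, \@cells;
    }
}

print join( "\t", @$_ ), "\n" for @rows;
```

Selecting by headings avoids depending on the table being the fourth one on the page. In the real page the first heading is broken across two lines, so the header strings may need adjusting before they match.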

dysgraphia · Feb 5, 2007

Brian said:
I suggest you parse HTML with an HTML parser. Looking for a module with
"HTML" and "Parser" in its name would be a good start. Since you are
specifically looking to parse tables, you may want to see if there's
one with "Table" in its name too.

Thanks Brian, I will look through the modules based on your suggestions.
Your help is appreciated!...cheers, Peter

gf · Feb 5, 2007

I am partial to HTML::TreeBuilder for my parsing.

After a tree has been built from the HTML you use the methods in
HTML::Element to traverse the tree. look_down() is very powerful and
is my go-to routine.

You can easily find the location of your target table in the tree with
look_down(), then loop through the rows and cells, extracting the
contents of the cells using as_text().

Use an array to mimic the table structure. This is untested and
doesn't check for all errors, but I'd loop through the table with
something like:

use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get('the URL you want to retrieve') or die "Can't get URL.\n";
my $tree = HTML::TreeBuilder->new_from_content($html);

my @table_data;
foreach my $table ( $tree->look_down( '_tag' => 'table' ) )
{
foreach my $tr ( $table->look_down( '_tag' => 'tr' ) )
{
my @row_data;
foreach my $td ( $table->look_down( '_tag' => 'td' ) )
{
push @row_data, $td->as_text();
}
push @table_data, [@row_data];
}
}

foreach my $r (@table_data)
{
print join( "\t", @$r ), "\n";
}

You might have to flesh out the look_down() calls to narrow your table
selections, but for a single table embedded in a page it should
suffice.

gf · Feb 5, 2007

foreach my $td ( $table->look_down( '_tag' => 'td' ) )

{
push @row_data, $td->as_text();
}

OOPS, that should be

foreach my $td ( $tr->look_down( '_tag' => 'td' ) )
{
push @row_data, $td->as_text();
}

dysgraphia · Feb 6, 2007

gf said:
I am partial to HTML::TreeBuilder for my parsing.

After a tree has been built from the HTML you use the methods in
HTML::Element to traverse the tree. look_down() is very powerful and
is my go-to routine.

Thanks gf!
I have had a look at your suggestion of HTML::TreeBuilder and can see
it is most likely worth me learning. I have installed the module and
given it some trial runs on example code and your code. Comments of mine
below.

You can easily find the location of your target table in the tree with
look_down(), then loop through the rows and cells, extracting the
contents of the cells using as_text().

Use an array to mimic the table structure. This is untested and
doesn't check for all errors, but I'd loop through the table with
something like:

use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get('the URL you want to retrieve') or die "Can't get URL.\n";
my $tree = HTML::TreeBuilder->new_from_content($html);

my @table_data;
foreach my $table ( $tree->look_down( '_tag' => 'table' ) )
{
foreach my $tr ( $table->look_down( '_tag' => 'tr' ) )
{
my @row_data;
foreach my $td ( $table->look_down( '_tag' => 'td' ) )
{
push @row_data, $td->as_text();
}
push @table_data, [@row_data];
}
}

foreach my $r (@table_data)
{
print join( "\t", @$r ), "\n";
}

You might have to flesh out the look_down() calls to narrow your table
selections, but for a single table embedded in a page it should
suffice.

I tried your code and it ran perfectly. My project has a
table-within-tables structure. The HTML has a lot of dross that I want
to avoid.
I did a bit of digging and found some articles by Sean M. Burke, e.g.
http://aspn.activestate.com/ASPN/docs/ActivePerl-5.6/site/lib/HTML/Tree/Scanning.html
and tried to use his suggestion for rejecting certain tables.
He wrote:

$h1 = $tree->look_down('_tag', 'h1');
returns the first element at-or-under $tree whose "_tag" attribute has
the value "h1" ...
you could exclude "h1" elements that contain the word "visit" under
them:

my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
$_[0]->as_text !~ m/\bvisit/i
}
);

I adapted and tried this code but could not get the table to be excluded.
In my case the HTML has a large (approx 700 line) table I don't want.
This table has tags like <option>....</option> to identify it but
putting this in the above code did not work.
Any comments or suggestions of yours are welcome...thanks again for your
help so far....cheers, Peter

dysgraphia · Feb 6, 2007

Tad said:
use HTML::TableExtract;

Thanks Tad I will check this module out.

dysgraphia · Feb 6, 2007

Michele said:
You will have to parse it. So use some HTML parsing module. One such
module that gets mentioned frequently here is HTML::TokeParser. There
are others though, and you may want to check some of them to find the
best one for you.

Thanks for your input Michele, I will have a look at TokeParser.

Do you mean in pure text? Then use some pure text table formatting
module, like Text::Table or Perl6::Form.

Michele

I am using Perl 5.8 from ActiveState. My initial requirement was to see
the text in either a text editor or spreadsheet format. This was just to
ensure I am getting the data correctly as I will have a need to download
many files on a weekly basis. When the parsing looks OK I will then send
it to a db.

Again, thanks for your help Michele...appreciated...cheers, Peter

gf · Feb 6, 2007

I tried your code and it ran perfectly.

That occasionally happens.

[...]

my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
$_[0]->as_text !~ m/\bvisit/i
}
);

You're on the right track, just keep following it. Because you're so
close to the answer I'm just going to say "keep going".

sub {} calls in look_down() are your friends - they're really
powerful. Sometimes I've needed to use multiple embedded subs to chain
together the results of the look_down(). In effect this causes the
test to drill down into the HTML deeper and deeper to determine if the
child nodes contain what you want.

And remember that all the criteria passed to a look_down() are ANDed
together: an element is returned only if every attribute test and every
embedded sub {} test succeeds.

For OR tests, qr// regexp patterns with alternation are powerful.

Stylistically I like to use the '=>' operator to separate my argument
pairs in the look_down() parameter list rather than plain commas.

OK, I lied. Here's an (untested) example of drilling in farther.

[...]
foreach my $_tr (
$tree->look_down(
'_tag' => 'tr',
'class' => qr/row[123]/,
sub {
$_[0]->look_down(
'_tag' => 'td',
'id' => qr/^datafield_(?:name|date|age)/,
sub {
$_[0]->as_text() =~ /\bfoo\b/;
}
);
}
)
)
{
; # ...do something revolutionary here
}
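
Applied to the earlier question about skipping the large table that contains <option> elements, the same nesting idea might be sketched as follows. The sample HTML and its id attributes are hypothetical, and this is only one possible way to write the test:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page: a form table to skip and a data table to keep.
my $html = <<'HTML';
<html><body>
<table id="menu"><tr><td>
  <select><option>One</option><option>Two</option></select>
</td></tr></table>
<table id="patients">
<tr class="texttab"><td>1</td><td>A Smith</td></tr>
<tr class="texttab"><td>2</td><td>B Smith</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Keep only tables that have no <option> element anywhere beneath them.
my @wanted = $tree->look_down(
    _tag => 'table',
    sub {
        my $option = $_[0]->look_down( _tag => 'option' );
        return !defined $option;
    },
);

my @table_data;
foreach my $table (@wanted) {
    foreach my $tr ( $table->look_down( _tag => 'tr' ) ) {
        push @table_data,
            [ map { $_->as_text } $tr->look_down( _tag => 'td' ) ];
    }
}

print join( "\t", @$_ ), "\n" for @table_data;
```

Because look_down() searches all descendants, an outer layout table that merely wraps the unwanted table would be rejected as well. With a table-within-tables page, that may or may not be what is wanted.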

dysgraphia · Feb 7, 2007

Michele said:
If you want to export to a spreadsheet probably the best way would be
to write your data to CSV, in which case Text::CSV_XS comes as a
precious tool.

Michele

Thanks again Michele! I will install the Text::CSV_XS module...cheers, Peter
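
A minimal sketch of the CSV route, assuming the rows have already been extracted into an array of array references. The sample rows mirror the table in the question; the output file name patients.csv is hypothetical:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Stand-in for data extracted from the page.
my @table_data = (
    [ 'List Position', 'Patient Name', 'Weight', 'Height', 'Clinic', 'Doctor' ],
    [ 1, 'A Smith', '78.0', 185, 'AM', 'F Magoo' ],
    [ 2, 'B Smith', '56.0', 165, 'PM', 'L Magee' ],
    [ 3, 'C Smith', '66.0', 171, 'RM', 'R Magaa' ],
);

# binary => 1 tolerates unusual characters; eol adds the line ending.
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } )
    or die "Cannot create CSV writer: " . Text::CSV_XS->error_diag;

my $file = 'patients.csv';    # hypothetical output file
open my $out, '>', $file or die "Cannot open $file: $!";
foreach my $row (@table_data) {
    $csv->print( $out, $row ) or die "Cannot write row to $file";
}
close $out or die "Cannot close $file: $!";
```

The module takes care of quoting, so names containing commas or quotes should survive the trip into a spreadsheet or an Access import.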

dysgraphia · Feb 9, 2007

gf said:
You're on the right track, just keep following it. Because you're so
close to the answer I'm just going to say "keep going".

sub {} calls in look_down() are your friends - they're really
powerful. Sometimes I've needed to use multiple embedded subs to chain
together the results of the look_down(). In effect this causes the
test to drill down into the HTML deeper and deeper to determine if the
child nodes contain what you want.

And remember that all the criteria passed to a look_down() are ANDed
together: an element is returned only if every attribute test and every
embedded sub {} test succeeds.

For OR tests, qr// regexp patterns with alternation are powerful.

Stylistically I like to use the '=>' operator to separate my argument
pairs in the look_down() parameter list rather than plain commas.

OK, I lied. Here's an (untested) example of drilling in farther.

[...]
foreach my $_tr (
$tree->look_down(
'_tag' => 'tr',
'class' => qr/row[123]/,
sub {
$_[0]->look_down(
'_tag' => 'td',
'id' => qr/^datafield_(?:name|date|age)/,
sub {
$_[0]->as_text() =~ /\bfoo\b/;
}
);
}
)
)
{
; # ...do something revolutionary here
}

Thanks gf!...A carton of cyber-beer is on its way!!
After wrestling with this code of yours I now am getting closer to the
finishing line. In order to better see what I am getting I
write to an Excel sheet for now....a hangover from the time when I
generated this data using web queries and VBA, a method that fell over
recently due to changes in the web page.
I am still not sure I understand how your code does its magic yet but
the output to the spreadsheet is very close to what I want. All the text
from both tables is coming through OK but I would like a different final
layout.
Basically the first table has general data about a group of
patients/tests whilst the second table has specific data about each
patient within the group. At present my output is coming out as:
Table_1 field headings (one row)
Table_1 data (one row)
Table_2 field headings (one row)
Table_2 data (many rows)

My objective is a table format like:
Table_1 field headings Table_2 field headings
Table_1 data Table_2 data row 1
Table_1 data Table_2 data row 2
Table_1 data Table_2 data row 3
etc etc

That is, the Table_1 data is repeated for each row of Table_2 data.
This final output will be sent to a relational db.
I have been looking at building this table from array elements of
Table_1 and Table_2 but so far without success.
Any pointers?
Cheers, Peter

dysgraphia · Feb 10, 2007

Jim Gibson wrote:
(snipped for brevity)

Yes. Show us some code. How can anybody help you without knowing what
you have done so far?

I suggest you divide your problem into two parts:

1. Parsing the web page, extracting the data, and storing every piece
of data you need into internal Perl structures (arrays, hashes,
arrays-of-hashes, etc.)

2. Transforming the extracted data into the form you need.

Let us know which of these parts you are having trouble with.

Use Data::Dumper to confirm that part 1 is working the way you want. If
you want help with part 2, write a test program that starts with simple
versions of your data structures generated from assignment statements
and post the entire program. That way, anybody can run your program for
themselves and suggest fixes or improvements.

Good luck.

G'day Jim!
Thanks for your input and suggestions which have been quite helpful.
The code being used is in the preceding posts of this thread. This
includes my original code plus the valuable suggestions of gf. In the
interests of brevity I did not repeat all the code but I take your point
that for a new poster entering the thread it would have been better to
have repeated it.
Thanks for the mention of Data::Dumper, which I will investigate
as an alternative to the Excel route.
Cheers, Peter
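
Following Jim's suggestion of a small test program built from assignment statements, part 2 (repeating the Table_1 data in front of every Table_2 row) might be sketched like this. The Table_1 field names and values are hypothetical, since the thread never shows them:

```perl
use strict;
use warnings;
use Data::Dumper;

# Simple stand-ins for the structures produced by the parsing step.
my @table1_head = ( 'List Date', 'Ward' );       # hypothetical fields
my @table1_data = ( '2007-02-09', 'East' );      # hypothetical values
my @table2_head = ( 'List Position', 'Patient Name', 'Weight',
                    'Height', 'Clinic', 'Doctor' );
my @table2_rows = (
    [ 1, 'A Smith', '78.0', 185, 'AM', 'F Magoo' ],
    [ 2, 'B Smith', '56.0', 165, 'PM', 'L Magee' ],
    [ 3, 'C Smith', '66.0', 171, 'RM', 'R Magaa' ],
);

# One heading row, then the Table_1 data repeated before each Table_2 row.
my @combined = ( [ @table1_head, @table2_head ] );
push @combined, [ @table1_data, @$_ ] for @table2_rows;

# Inspect the structure, then print it tab-separated.
$Data::Dumper::Indent = 1;
print Dumper( \@combined );
print join( "\t", @$_ ), "\n" for @combined;
```

Each element of @combined is a flat row, which should map directly onto one record in a relational table.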
