Extracting a table from a webpage


googlinggoogler

Hi

Hope this is the right group; I don't usually post, but I'm really stuck.

I would like to scrape all the values from the table
http://www.morningstar.co.uk/UK/ISAQuickrank/default.aspx?tab=2&sortby=ReturnM60&lang=en-GB

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

The other thing is I want to get all the pages; as you can see from
that page, there are something like ~3800 rows in the table.

I have already tried to manipulate my HTTP POSTs with the Firefox
plugin Tamper Data (great extension, comes highly recommended!), but
the script that serves that page is well written and guards against
this. So I looked at the HTTP transfers triggered by the "next"
button at the bottom, which led me to find that it produces an
absolutely massive string that I can't even begin to understand; plus
I think it uses some sort of validation process based on the field
names (e.g. "__EVENTVALIDATION").
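
For reference, the kind of replay I was attempting looks roughly like
this (only a sketch using WWW::Mechanize; the form number and the
__EVENTTARGET value are my guesses and would need checking against the
real page source):

#!/usr/bin/perl
use warnings;
use strict;
use WWW::Mechanize;

my $url = "http://www.morningstar.co.uk/UK/ISAQuickrank/"
        . "default.aspx?tab=2&sortby=ReturnM60&lang=en-GB";

my $mech = WWW::Mechanize->new();
$mech->get($url);

# ASP.NET keeps its state in hidden fields (__VIEWSTATE,
# __EVENTVALIDATION), and the "next" button fires a JavaScript
# __doPostBack(target, argument) call. Replaying it means setting
# __EVENTTARGET by hand and resubmitting the form. The target id
# below is a guess -- the real one is in the page source.
$mech->form_number(1);
$mech->field('__EVENTTARGET', 'some_pager_control_id');
$mech->field('__EVENTARGUMENT', '');
$mech->submit();

print $mech->content();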

Any advice, even if it's just on the scraping, would be gratefully received.

Kind regards and thanks in advance

Dave
 

Ben Bullock

I would like to scrape all the values from the table
http://www.morningstar.co.uk/UK/ISAQuickrank/default.aspx?tab=2&sortby=ReturnM60&lang=en-GB

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

It's difficult to analyze your problem without seeing the code you are
using. HTML::TableExtract shouldn't have a problem getting that table
out. I happened to have an old table-extracting script lying around,
which I've modified for your case:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::TableExtract;
use LWP::Simple;

# Cache the page locally so repeated runs don't hit the server.
my $isafilename = "isa.html";
if (!-f $isafilename) {
    my $isaurl = "http://www.morningstar.co.uk/UK/ISAQuickrank/"
               . "default.aspx?tab=2&sortby=ReturnM60&lang=en-GB";
    my $isadata = get($isaurl) or die "Could not fetch $isaurl";
    open my $isafile, ">", $isafilename or die $!;
    print $isafile $isadata;
    close $isafile or die $!;
}

# Report the position and row count of every table in the page.
my $te = HTML::TableExtract->new();
$te->parse_file($isafilename);
foreach my $ts ($te->tables) {
    print "Table found at ", join(',', $ts->coords), " with ";
    print scalar(@{$ts->rows}), " rows\n";
}

This worked correctly for me and found four tables in the page.

The other thing is I want to get all the pages; as you can see from
that page, there are something like ~3800 rows in the table.

I have already tried to manipulate my HTTP POSTs with the Firefox
plugin Tamper Data (great extension, comes highly recommended!), but
the script that serves that page is well written and guards against
this. So I looked at the HTTP transfers triggered by the "next"
button at the bottom, which led me to find that it produces an
absolutely massive string that I can't even begin to understand; plus
I think it uses some sort of validation process based on the field
names (e.g. "__EVENTVALIDATION").

Hmm, I manually changed the tab= string in the URL to "tab=2",
"tab=3", etc. and got the subsequent tables correctly, so it doesn't
seem to me that they are trying to hide the data.
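
If that holds, a plain loop over the tab values might get you all the
pages without fighting the postback at all (a sketch; the number of
tabs is a guess you'd want to check or detect from the page):

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

# Guess at the page count -- with ~3800 rows it could be large;
# check how many tabs the site actually serves.
my $last_tab = 20;

for my $tab (1 .. $last_tab) {
    my $url = "http://www.morningstar.co.uk/UK/ISAQuickrank/"
            . "default.aspx?tab=$tab&sortby=ReturnM60&lang=en-GB";
    my $filename = "isa-$tab.html";
    next if -f $filename;        # skip pages already cached
    my $data = get($url) or die "Could not fetch $url";
    open my $file, ">", $filename or die $!;
    print $file $data;
    close $file or die $!;
    sleep 1;                     # be polite to the server
}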
 

Gunnar Hjalmarsson

I would like to scrape all the values from the table
http://www.morningstar.co.uk/UK/ISAQuickrank/default.aspx?tab=2&sortby=ReturnM60&lang=en-GB

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

I decided to play a little with HTML::TableExtract, and this worked fine:

my $te = HTML::TableExtract->new(
    headers => [
        qw(Fund\sName Risk Std\sDev YTD 1\sYr 3\sYr\nAnlsd 5\sYr 10\sYr)
    ],
);
$te->parse($html);
printf "%-42s%-13s%7s%7s%7s%7s%7s%7s\n", @$_
    for ($te->tables)[0]->rows;
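
Here $html is assumed to already hold the raw page source; note that
the headers option takes regular expressions, which is why the spaces
are written as \s. Fetching the page first, e.g. with LWP::Simple,
makes the snippet self-contained:

use LWP::Simple;

my $html = get("http://www.morningstar.co.uk/UK/ISAQuickrank/"
             . "default.aspx?tab=2&sortby=ReturnM60&lang=en-GB")
    or die "Could not fetch the page";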
 
