Extracting a table from a webpage


googlinggoogler

Hi

Hope this is the right group; I don't usually post, but I'm really stuck.

I would like to scrape all the values from the table
http://www.morningstar.co.uk/UK/ISAQuickrank/default.aspx?tab=2&sortby=ReturnM60&lang=en-GB

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

The other thing is I want to get all the pages; as you can see from
that page, there are something like ~3800 rows in the table.

I have already tried to manipulate my HTTP POSTs with the Firefox
plugin Tamper Data (great extension, comes highly recommended!), but
the script that serves that page is well written and guards against
this. So I looked at the HTTP transfers triggered by the "next"
button at the bottom, which led me to find that it produces an
absolutely massive string that I can't even begin to understand; plus
I think it uses some sort of validation process based on the field
names (e.g. "__EVENTVALIDATION").
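
For reference, the kind of replay I was attempting looks roughly like
this (only a sketch using WWW::Mechanize; the form number and the
__EVENTTARGET value are my guesses and would need checking against the
real page source):

#!/usr/bin/perl
use warnings;
use strict;
use WWW::Mechanize;

my $url = "http://www.morningstar.co.uk/UK/ISAQuickrank/"
        . "default.aspx?tab=2&sortby=ReturnM60&lang=en-GB";

my $mech = WWW::Mechanize->new();
$mech->get($url);

# ASP.NET keeps its state in hidden fields (__VIEWSTATE,
# __EVENTVALIDATION), and the "next" button fires a JavaScript
# __doPostBack(target, argument) call. Replaying it means setting
# __EVENTTARGET by hand and resubmitting the form. The target id
# below is a guess -- the real one is in the page source.
$mech->form_number(1);
$mech->field('__EVENTTARGET', 'some_pager_control_id');
$mech->field('__EVENTARGUMENT', '');
$mech->submit();

print $mech->content();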

Any advice, even if it's just on the scraping, would be gratefully received.

Kind regards and thanks in advance

Dave
 

Ben Bullock

I would like to scrape all the values from the table
http://www.morningstar.co.uk/UK/ISAQuickrank/default.aspx?tab=2&sortby=ReturnM60&lang=en-GB

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

It's difficult to analyze your problem without seeing the code you are
using. HTML::TableExtract shouldn't have a problem getting that table
out. I happened to have an old table-extracting script lying around,
which I've modified for your case:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::TableExtract;
use LWP::Simple;

# Cache the page locally so repeated runs don't hit the server.
my $isafilename = "isa.html";
if (!-f $isafilename) {
    my $isaurl = "http://www.morningstar.co.uk/UK/ISAQuickrank/"
               . "default.aspx?tab=2&sortby=ReturnM60&lang=en-GB";
    my $isadata = get($isaurl) or die "Could not fetch $isaurl";
    open my $isafile, ">", $isafilename or die $!;
    print $isafile $isadata;
    close $isafile or die $!;
}

# Report the position and row count of every table in the page.
my $te = HTML::TableExtract->new();
$te->parse_file($isafilename);
foreach my $ts ($te->tables) {
    print "Table found at ", join(',', $ts->coords), " with ";
    print scalar(@{$ts->rows}), " rows\n";
}

This worked correctly for me and found four tables in the page.

The other thing is I want to get all the pages; as you can see from
that page, there are something like ~3800 rows in the table.

I have already tried to manipulate my HTTP POSTs with the Firefox
plugin Tamper Data (great extension, comes highly recommended!), but
the script that serves that page is well written and guards against
this. So I looked at the HTTP transfers triggered by the "next"
button at the bottom, which led me to find that it produces an
absolutely massive string that I can't even begin to understand; plus
I think it uses some sort of validation process based on the field
names (e.g. "__EVENTVALIDATION").

Hmm, I manually changed the tab= string in the URL to "tab=2",
"tab=3", etc. and got the subsequent tables correctly, so it doesn't
seem to me that they are trying to hide the data.
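
If that holds, a plain loop over the tab values might get you all the
pages without fighting the postback at all (a sketch; the number of
tabs is a guess you'd want to check or detect from the page):

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

# Guess at the page count -- with ~3800 rows it could be large;
# check how many tabs the site actually serves.
my $last_tab = 20;

for my $tab (1 .. $last_tab) {
    my $url = "http://www.morningstar.co.uk/UK/ISAQuickrank/"
            . "default.aspx?tab=$tab&sortby=ReturnM60&lang=en-GB";
    my $filename = "isa-$tab.html";
    next if -f $filename;        # skip pages already cached
    my $data = get($url) or die "Could not fetch $url";
    open my $file, ">", $filename or die $!;
    print $file $data;
    close $file or die $!;
    sleep 1;                     # be polite to the server
}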
 

Gunnar Hjalmarsson

I would like to scrape all the values from the table
http://www.morningstar.co.uk/UK/ISAQuickrank/default.aspx?tab=2&sortby=ReturnM60&lang=en-GB

But I'm having difficulty getting HTML::TableExtract to achieve this; I
keep getting null values back.

I decided to play a little with HTML::TableExtract, and this worked fine:

my $te = HTML::TableExtract->new(
    headers => [
        qw(Fund\sName Risk Std\sDev YTD 1\sYr 3\sYr\nAnlsd 5\sYr 10\sYr)
    ],
);
$te->parse($html);
printf "%-42s%-13s%7s%7s%7s%7s%7s%7s\n", @$_
    for ($te->tables)[0]->rows;
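
Here $html is assumed to already hold the raw page source; note that
the headers option takes regular expressions, which is why the spaces
are written as \s. Fetching the page first, e.g. with LWP::Simple,
makes the snippet self-contained:

use LWP::Simple;

my $html = get("http://www.morningstar.co.uk/UK/ISAQuickrank/"
             . "default.aspx?tab=2&sortby=ReturnM60&lang=en-GB")
    or die "Could not fetch the page";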
 
