Extracting table in html page

shankar_perl_rookie · Jul 21, 2010

Hello All,

I have an HTML file from which I am trying to extract a table. The problem
is that there are a lot of tables in the page, and the table I want
appears after a particular string, say $some_text. I know how to search
for the string in the HTML page, but what I want to do is capture the
table that immediately follows $some_text.

Any suggestions on how to do this?

Thanks,
Shankar

Jim Gibson · Jul 22, 2010

The most reliable way would be to use the HTML::Parser module to parse
the html file, register appropriate handlers for the table elements
(<table>, <tr>, <td>) and one for text elements, look for your string,
and process the next table encountered in a callback (handler
subroutines are called as callbacks by the parsing method).
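
The handler approach described above could be sketched roughly as follows. This is only an illustration, not a tested solution for any particular page: the function name table_after_text and the row/cell layout it returns are made up for this example, and it assumes every cell is closed with an explicit </td> or </th>. Text from nested tables is simply folded into the enclosing cell.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the rows (array refs of cell strings) of the
# first table that starts after $some_text, or undef if none is found.
sub table_after_text {
    my ($html, $some_text) = @_;

    my $seen_text = 0;   # true once $some_text has appeared in a text node
    my $depth     = 0;   # <table> nesting depth, counted only after the text
    my $done      = 0;   # true once the wanted table has been closed
    my (@rows, $cell);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            return if $done || !$seen_text;
            if    ($tag eq 'table')                               { $depth++ }
            elsif ($depth == 1 && $tag eq 'tr')                   { push @rows, [] }
            elsif ($depth == 1 && ($tag eq 'td' || $tag eq 'th')) { $cell = '' }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $seen_text = 1 if !$seen_text && index($text, $some_text) >= 0;
            $cell .= $text if defined $cell;   # collect text while inside a cell
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            return if $done || !$depth;
            if ($depth == 1 && ($tag eq 'td' || $tag eq 'th') && defined $cell) {
                push @rows, [] unless @rows;   # cell without an enclosing <tr>
                push @{ $rows[-1] }, $cell;
                undef $cell;
            }
            elsif ($tag eq 'table') {
                $done = 1 if --$depth == 0;    # outermost wanted table closed
            }
        }, 'tagname' ],
    );
    $parser->unbroken_text(1);   # deliver each text run in a single callback
    $parser->parse($html);
    $parser->eof;

    return $done ? \@rows : undef;
}
```

Calling table_after_text($html, $some_text) then gives an array reference with one entry per row, or undef when the text or a complete table after it is missing.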

Another way would be to use a module to extract tables from HTML. There
are at least two on CPAN: HTML::TableExtract and HTML::TableParser. The
problem using these is to find the table after the specified text. Is
there some other way of identifying the table?

The quick and dirty way is to use a regular expression (untested):

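if( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
    # table contents in $1
    # \Q..\E keeps spaces and metacharacters in $some_text literal;
    # <table\b[^>]*> also matches a <table> tag that has attributes
}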

However, this will not always work. It fails if you have nested tables,
for example, which are common in some HTML. Still, if you are in a
hurry it might work for you. It is always better to use a real
parser for HTML.
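
To make the nested-table failure concrete, here is a small self-contained demonstration. The sample HTML and the value of $some_text are invented for the example; the pattern is a non-greedy one of the same shape as in the post above.

```perl
use strict;
use warnings;

my $some_text = 'Results';          # made-up marker text for this demo
my $html = <<'HTML';
<p>Results</p>
<table>
  <tr><td>outer 1 <table><tr><td>inner</td></tr></table></td></tr>
  <tr><td>outer 2</td></tr>
</table>
HTML

# Non-greedy capture: it stops at the FIRST </table> after the opening tag.
my ($captured) = $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx;

# The first </table> belongs to the inner table, so the capture is cut
# short and the second outer row never makes it into the result.
my $has_inner   = $captured =~ /inner/   ? 1 : 0;
my $has_outer_2 = $captured =~ /outer 2/ ? 1 : 0;
print "inner: $has_inner, outer 2: $has_outer_2\n";   # inner: 1, outer 2: 0
```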

sln · Jul 22, 2010

It's ALWAYS trivial to parse a markup language's markup,
i.e. to parse out tags (open|close), attributes, and content.
Creating an element tree (document) with HTML is another
process altogether. Xhtml/Xml, not so bad, sgml er ..
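
As a rough illustration of a markup-level parse with regular expressions, the sketch below tokenizes the page into comments, tags, and text, and keeps a depth counter so that nested tables stay inside the captured table. The function name table_source_after is invented for this example. It assumes no attribute value contains a literal '>' and that $some_text appears in the raw source exactly as given; it builds no element tree.

```perl
use strict;
use warnings;

# Hypothetical helper: return the source of the first <table>...</table>
# that starts after $some_text (nested tables included), or undef.
sub table_source_after {
    my ($html, $some_text) = @_;

    my $start = index($html, $some_text);
    return if $start < 0;
    pos($html) = $start + length $some_text;   # begin scanning after the text

    my ($depth, $from) = (0, undef);
    # Tokens in document order: comment, tag, run of text, or a stray '<'.
    while ($html =~ m{ \G (?: <!--.*?--> | (<[^>]*>) | [^<]+ | < ) }gcsx) {
        # Save the capture and offsets before the next match overwrites them.
        my ($tag, $tok_start, $tok_end) = ($1, $-[0], $+[0]);
        next unless defined $tag;

        if ($tag =~ m{\A<table\b}i) {
            $from = $tok_start if $depth++ == 0;   # outermost table opens here
        }
        elsif ($tag =~ m{\A</table\b}i && $depth > 0) {
            return substr($html, $from, $tok_end - $from) if --$depth == 0;
        }
    }
    return;   # no table after the text, or the table was never closed
}
```

The returned string can then be handed to whatever cell-level processing is needed, since it is a complete table even when other tables are nested inside it.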

I always laugh when people say a 'real parser for HTML' because they
don't know what they're saying; they are just parroting phrases from
so-called gods, then passing them along.
As if a SAX parser does nothing more than a realtime parse on a stream,
ie: a markup parse. Easily done by regular expressions.

Oh, and before anybody starts that "regular language" crap, they better
be able to explain what the "can't" part means!

-sln

HASM · Jul 22, 2010

The most reliable way would be to use the HTML::Parser module to parse
the html file,

Or HTML::TreeBuilder;

use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::UserAgent;

my $url = 'http://www.example.com/...';
my $browser = LWP::UserAgent->new;
my $response = $browser->request (HTTP::Request->new(GET => $url));
if ($response->is_success) {
    my $tree = HTML::TreeBuilder->new;
    # parse_content returns the tree itself
    my $content = $tree->parse_content($response->decoded_content);
    # search for text with look_down (there are other ways)
    my $text = $content->look_down (...);
    # then for your table
    my $table = $content->look_down ('_tag', 'table', ...);
}

etc,

-- HASM

sopan.shewale · Jul 22, 2010

The best way may be:
split on $some_text and throw away the first part.
my ($junk, $interest_html) = split (/\Q$some_text\E/, $html, 2);

on $interest_html - use HTML::TreeBuilder to parse the tables.
grab the first table - you are done.
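
Put together, the split-then-parse idea might look like the sketch below. The function name first_table_after and the returned row layout are made up for illustration. It assumes $some_text occurs literally in the raw HTML, and it relies on HTML::TreeBuilder tolerating the stray closing tags left at the start of the fragment. Rows of nested tables, if any, end up in the same list.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: rows of the first table after $some_text, each row an
# array ref of trimmed cell text; undef if the text or a table is missing.
sub first_table_after {
    my ($html, $some_text) = @_;

    # Limit of 2 keeps everything after the first occurrence in one piece.
    my (undef, $interest_html) = split /\Q$some_text\E/, $html, 2;
    return unless defined $interest_html;

    my $tree  = HTML::TreeBuilder->new_from_content($interest_html);
    my $table = $tree->look_down(_tag => 'table');   # first table in document order

    my $rows;
    if ($table) {
        $rows = [
            map { [ map { $_->as_trimmed_text } $_->look_down(_tag => qr/^t[dh]$/) ] }
            $table->look_down(_tag => 'tr')
        ];
    }
    $tree->delete;   # release the tree explicitly
    return $rows;
}
```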

Let me know if you find it difficult to use HTML::TreeBuilder.

--sopan shewale
