Extracting a table from an HTML page

  • Thread starter shankar_perl_rookie
shankar_perl_rookie

Hello All,

I have an HTML file from which I am trying to extract a table. The
problem is that there are a lot of tables in the page, and the table I
want appears after a particular string, say $some_text. I know how to
search for the string in the HTML page, but what I want to do is capture
the table that immediately follows $some_text.

Any suggestions on how to do this?

Thanks,
Shankar
 
Jim Gibson

The most reliable way would be to use the HTML::Parser module to parse
the HTML file: register handlers for the table elements (<table>, <tr>,
<td>) and one for text, look for your string in the text handler, and
then process the next table encountered. The handler subroutines are
called as callbacks by the parsing method.
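
As a rough sketch of that handler approach (not tested against real pages): the helper name table_after_text and the row/cell layout of the result are my own choices, and the code assumes every cell has a closing </td> or </th>. Text from nested tables is simply folded into the enclosing cell.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the rows (array of arrays of cell text) of
# the first table that starts after $some_text is seen in a text node.
sub table_after_text {
    my ($html, $some_text) = @_;
    my ($seen, $depth, $done) = (0, 0, 0);
    my (@rows, $cell);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            return if $done || !$seen;          # ignore tables before the text
            if ($tag eq 'table') { $depth++ }   # count nesting depth
            elsif ($depth == 1 && $tag eq 'tr') { push @rows, [] }
            elsif ($depth == 1 && ($tag eq 'td' || $tag eq 'th')) { $cell = '' }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            return if $done || !$depth;
            if ($tag eq 'table') {
                $done = 1 if --$depth == 0;     # outermost table is finished
            }
            elsif ($depth == 1 && ($tag eq 'td' || $tag eq 'th') && defined $cell) {
                push @rows, [] unless @rows;    # cell without an explicit <tr>
                push @{ $rows[-1] }, $cell;
                undef $cell;
            }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $done;
            if ($depth) { $cell .= $text if defined $cell }
            elsif (index($text, $some_text) >= 0) { $seen = 1 }
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);   # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return \@rows;
}
```

Calling table_after_text($html, $some_text) returns a reference to an empty array if the string or a following table is not found.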

Another way would be to use a module that extracts tables from HTML.
There are at least two on CPAN: HTML::TableExtract and HTML::TableParser.
The problem with using these is finding the table after the specified
text. Is there some other way of identifying the table?
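
If the table can be identified by its column headers instead, something like the following might do. This is only a sketch: the helper name rows_by_headers is made up, and the header strings you pass in are whatever your table actually uses.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: returns the data rows of the first table whose
# header row matches all of the given header strings.
sub rows_by_headers {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new( headers => \@headers );
    $te->parse($html);
    $te->eof;

    my @rows;
    for my $table ($te->tables) {
        push @rows, $table->rows;   # each row is an array reference
        last;                       # only the first matching table
    }
    return \@rows;
}
```

For example, rows_by_headers($html, 'Name', 'Price') would pick a table with those two column headings, assuming such headings exist in your page.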

The quick and dirty way is to use a regular expression (untested):

if ( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
    # table contents in $1
}

The \Q...\E keeps any regex metacharacters (and, under /x, any spaces)
in $some_text literal, and <table\b[^>]*> also matches a <table> tag
that carries attributes.

However, this will not always work. It fails if you have nested tables,
for example, which are common in some HTML. Still, if you are in a hurry
it might work for you. It is always better to use a real parser for HTML.
 
sln

[snip]

However, this will not always work. It fails if you have nested tables,
for example, which is a common occurrence in some HTML. However, if you
are in a hurry it might work for you. It is always better to use a real
parser for HTML.

It's ALWAYS trivial to parse a markup language's markup,
i.e. to parse out tags (open|close), attributes and content.
Creating an element tree (document) from HTML is another
process altogether. XHTML/XML, not so bad; SGML, er ...

I always laugh when people say a 'real parser for HTML' because they
don't know what they're saying; they are just parroting phrases from
so-called gods and passing them along.
As if a SAX parser did anything more than a real-time parse on a stream,
i.e. a markup parse. That is easily done with regular expressions.

Oh, and before anybody starts that "regular language" crap, they had
better be able to explain what the "can't" part means!

-sln
 
HASM

The most reliable way would be to use the HTML::Parser module to parse
the html file,

Or HTML::TreeBuilder:

use HTML::TreeBuilder;
use LWP::UserAgent;

my $url = 'http://www.example.com/...';
my $browser = LWP::UserAgent->new;
my $response = $browser->get($url);
if ($response->is_success) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($response->decoded_content);
    # search for the text with look_down (there are other ways)
    my $text = $tree->look_down(...);
    # then look for your table
    my $table = $tree->look_down('_tag', 'table', ...);
}

etc.
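
One possible way to do the "text first, then table" search with HTML::TreeBuilder, as a sketch only: the helper name cells_of_table_after is invented here, and it relies on objectify_text so that text nodes show up as "~text" pseudo-elements that look_down can visit in document order alongside the tables.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the text of each <td> in the first table
# that follows $some_text in document order (empty list if none).
sub cells_of_table_after {
    my ($html, $some_text) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    $tree->objectify_text;    # text nodes become "~text" pseudo-elements

    my ($seen, $table) = (0, undef);
    # In list context look_down returns matches in document (pre-)order.
    for my $node ($tree->look_down(sub {
            my $tag = $_[0]->tag;
            $tag eq '~text' || $tag eq 'table';
        })) {
        if ($node->tag eq '~text') {
            $seen = 1 if index($node->attr('text'), $some_text) >= 0;
        }
        elsif ($seen) {
            $table = $node;   # first table after the text
            last;
        }
    }

    $tree->deobjectify_text;  # turn text back into strings so as_text works
    my @cells = $table
        ? map { $_->as_text } $table->look_down(_tag => 'td')
        : ();
    $tree->delete;            # free the tree
    return @cells;
}
```

Note that the string has to sit inside a single text node for this to match; text split across inline tags such as <b> would need extra handling.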

-- HASM
 
sopan.shewale

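The best way may be to split on $some_text and throw away the first part:

my ($junk, $interest_html) = split /\Q$some_text\E/, $html, 2;

The \Q...\E keeps $some_text literal, and the limit of 2 keeps everything
after the first match in one piece even if the string occurs again later.

Then use HTML::TreeBuilder on $interest_html to parse the tables.
Grab the first table and you are done.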
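
Put together, the split-then-parse idea might look like the sketch below. The helper name first_table_after_split is hypothetical, and it assumes $some_text occurs in ordinary page text, so that the remainder does not start in the middle of a tag.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the rows (array of arrays of cell text) of
# the first table found after the first occurrence of $some_text.
sub first_table_after_split {
    my ($html, $some_text) = @_;

    # Keep only what follows the first occurrence of the string.
    my (undef, $rest) = split /\Q$some_text\E/, $html, 2;
    return [] unless defined $rest;

    my $tree  = HTML::TreeBuilder->new_from_content($rest);
    my $table = $tree->look_down(_tag => 'table');   # first table only

    my @rows;
    if ($table) {
        for my $tr ($table->look_down(_tag => 'tr')) {
            my @cells = $tr->look_down(sub { $_[0]->tag =~ /^t[dh]$/ });
            push @rows, [ map { $_->as_text } @cells ];
        }
    }
    $tree->delete;   # free the tree
    return \@rows;
}
```

With nested tables the inner rows would also be listed, so this version is best suited to simple, flat tables.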

Let me know if you find it difficult to use HTML::TreeBuilder.

--sopan shewale
 
