Extracting a table from an HTML page

Discussion in 'Perl Misc' started by shankar_perl_rookie, Jul 21, 2010.

  1. Hello All,

    I have an HTML file from which I am trying to extract a table. The
    problem is that there are a lot of tables in the page, and the table I
    am looking for appears after a particular string, say $some_text. I
    know how to search for the string in the HTML page, but what I want to
    do is capture the table that immediately follows $some_text.

    Any suggestions on how to do this?

    Thanks,
    Shankar
     
    shankar_perl_rookie, Jul 21, 2010
    #1

  2. shankar_perl_rookie

    Jim Gibson Guest

    In article
    <>,
    shankar_perl_rookie <> wrote:

    > Hello All,
    >
    > I have an html file where I am trying to extract a table. The problem
    > I am facing is there are lot of tables in the page and the table I am
    > looking to extract appears after a particular string say $some_text. I
    > know of a way that I can search for the string in the html page but
    > what I want to do is capture a table that immediately follows the
    > $some_text.
    >
    > Any suggestions on how to do this ??


    The most reliable way would be to use the HTML::Parser module to parse
    the html file, register appropriate handlers for the table elements
    (<table>, <tr>, <td>) and one for text elements, look for your string,
    and process the next table encountered in a callback (handler
    subroutines are called as callbacks by the parsing method).
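
The handler approach described above might look roughly like the following. This is only a sketch, not a tested solution: the helper name table_after_text is made up for illustration, and it assumes that $some_text sits inside a single text node (not split across inline tags) and that every cell has an explicit closing tag.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the rows (array of arrays of cell text) of the
# first table that starts after $some_text has been seen in the page text.
sub table_after_text {
    my ($html, $some_text) = @_;

    my $seen_text = 0;    # true once $some_text has appeared
    my $depth     = 0;    # table nesting depth inside the wanted table
    my $done      = 0;    # true once the wanted table has been closed
    my (@rows, $cell);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            return if $done || !$seen_text;
            $depth++ if $tag eq 'table';
            return unless $depth == 1;              # skip nested tables
            push @rows, [] if $tag eq 'tr';
            $cell = '' if $tag eq 'td' || $tag eq 'th';
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            return if $done || !$depth;
            if ($depth == 1 && defined $cell && ($tag eq 'td' || $tag eq 'th')) {
                push @{ $rows[-1] }, $cell if @rows;
                undef $cell;
            }
            if ($tag eq 'table') {
                $done = 1 if --$depth == 0;         # wanted table has ended
            }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            if (!$seen_text) {
                $seen_text = 1 if index($text, $some_text) >= 0;
                return;
            }
            $cell .= $text if $depth == 1 && defined $cell;
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);    # deliver each run of text in one piece
    $parser->parse($html);
    $parser->eof;
    return \@rows;
}
```

Calling table_after_text($html, $some_text) returns a reference to an array of rows; text inside nested tables is skipped rather than merged into the outer cell.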

    Another way would be to use a module that extracts tables from HTML.
    There are at least two on CPAN: HTML::TableExtract and
    HTML::TableParser. The difficulty with these is finding the table that
    follows the specified text. Is there some other way of identifying the
    table?
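
One possible way around that difficulty, sketched here under assumptions, is to cut the page at $some_text first and hand only the remainder to HTML::TableExtract. The helper name rows_after_text is hypothetical, and the sketch assumes $some_text does not itself sit inside a table (otherwise the remainder starts mid-table).

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: rows of the first top-level table after $some_text.
sub rows_after_text {
    my ($html, $some_text) = @_;

    my $pos = index($html, $some_text);
    return [] if $pos < 0;                      # text not found
    my $tail = substr($html, $pos + length $some_text);

    # depth => 0, count => 0 selects the first table that is not nested
    # inside another table in the remaining HTML.
    my $te = HTML::TableExtract->new(depth => 0, count => 0);
    $te->parse($tail);
    $te->eof;

    my ($ts) = $te->tables;
    return $ts ? [ $ts->rows ] : [];
}
```

Each element of the returned array is a reference to one row's cell values.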

    The quick and dirty way is to use a regular expression (untested):

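    # \Q...\E matches $some_text literally (even under /x);
    # [^>]* lets the opening tag carry attributes
    if( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
        # table contents in $1
    }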

    However, this will not always work. It fails if you have nested tables,
    for example, which is a common occurrence in some HTML. However, if you
    are in a hurry it might work for you. It is always better to use a real
    parser for HTML.
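
The nested-table failure is easy to reproduce. The snippet below is a small illustration with made-up sample HTML; the pattern is a variant of the one above with \Q...\E quoting and an attribute-tolerant opening tag, both of which are additions for this sketch.

```perl
use strict;
use warnings;

my $some_text = 'Results';
my $html = 'Results <table><tr><td>outer '
         . '<table><tr><td>inner</td></tr></table>'
         . ' tail</td></tr></table>';

# The non-greedy (.*?) stops at the FIRST </table>, which belongs to the
# inner table, so the outer table is cut short.
my ($captured) =
    $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx;

print "$captured\n";
# prints: <tr><td>outer <table><tr><td>inner</td></tr>
```

The capture ends in the middle of the outer cell, which is why a parser that tracks nesting is the safer choice.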

    --
    Jim Gibson
     
    Jim Gibson, Jul 22, 2010
    #2

  3. shankar_perl_rookie

    Guest

    On Wed, 21 Jul 2010 16:08:17 -0700, Jim Gibson <> wrote:

    >In article
    ><>,
    >shankar_perl_rookie <> wrote:
    >

    [snip]

    >The quick and dirty way is to use a regular expression (untested):
    >
    >if( $html =~ m{ $some_text .*? <table> (.*?) </table> }isx ) {
    > # table contents in $1
    >}
    >
    >However, this will not always work. It fails if you have nested tables,
    >for example, which is a common occurrence in some HTML. However, if you
    >are in a hurry it might work for you. It is always better to use a real
    >parser for HTML.


    It's ALWAYS trivial to parse a markup language's markup,
    i.e. parse out tags (open|close), attributes and content.
    Creating an element tree (document) from HTML is another
    process altogether. XHTML/XML, not so bad; SGML, er ...

    I always laugh when people say a 'real parser for HTML' because they
    don't know what they're saying; instead, they're just parroting phrases
    from so-called gods, then passing them along.
    As if a SAX parser does nothing more than a realtime parse on a stream,
    i.e. a markup parse. Easily done by regular expressions.

    Oh, and before anybody starts that "regular language" crap, they better
    be able to explain what the "can't" part means!

    -sln
     
    , Jul 22, 2010
    #3
  4. shankar_perl_rookie

    HASM Guest

    Jim Gibson <> writes:

    >> I have an html file where I am trying to extract a table. The problem
    >> I am facing is there are lot of tables in the page and the table I am
    >> looking to extract appears after a particular string say $some_text.


    > The most reliable way would be to use the HTML::Parser module to parse
    > the html file,


    Or HTML::TreeBuilder;

    use HTML::TreeBuilder;
    use LWP::UserAgent;

    my $url = 'http://www.example.com/...';
    my $browser = LWP::UserAgent->new;
    my $response = $browser->request (HTTP::Request->new(GET => $url));
    if ($response->is_success) {
        # parse_content builds the tree in place
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_content($response->decoded_content);
        # search for the text with look_down (there are other ways)
        my $text = $tree->look_down (...);
        # then for your table
        my $table = $tree->look_down ('_tag', 'table', ...);
    }

    etc,

    -- HASM
     
    HASM, Jul 22, 2010
    #4
  5. shankar_perl_rookie

    Guest

    The best way may be to split on $some_text and throw away the first part:
    my ($junk, $interest_html) = split (/\Q$some_text\E/, $html, 2);

    Then use HTML::TreeBuilder on $interest_html to parse the tables.
    Grab the first table and you are done.

    Let me know if you find it difficult to use HTML::TreeBuilder.
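
Put together, the split-then-parse idea could be sketched as follows. The helper name first_table_after is invented for this example, and the sketch assumes the table that follows $some_text has no tables nested inside it (look_down would otherwise also return the nested rows and cells).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: rows of the first table found after $some_text.
sub first_table_after {
    my ($html, $some_text) = @_;

    # A limit of 2 keeps everything after the first match in one piece;
    # \Q...\E treats $some_text as a literal string, not a pattern.
    my (undef, $interest_html) = split /\Q$some_text\E/, $html, 2;
    return [] unless defined $interest_html;    # text not found

    my $tree  = HTML::TreeBuilder->new_from_content($interest_html);
    my $table = $tree->look_down(_tag => 'table');   # first table in document order

    my @rows;
    if ($table) {
        for my $tr ($table->look_down(_tag => 'tr')) {
            push @rows,
                [ map { $_->as_trimmed_text } $tr->look_down(_tag => qr/^t[dh]$/) ];
        }
    }
    $tree->delete;    # release the tree's memory
    return \@rows;
}
```

The result is a reference to an array of rows, each row being a reference to an array of trimmed cell text.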

    --sopan shewale



    On Jul 22, 3:21 am, shankar_perl_rookie <> wrote:
    > Hello All,
    >
    > I have an html file where I am trying to extract a table. The problem
    > I am facing is there are lot of tables in the page and the table I am
    > looking to extract appears after a particular string say $some_text. I
    > know of a way that I can search for the string in the html page but
    > what I want to do is capture a table that immediately follows the
    > $some_text.
    >
    > Any suggestions on how to do this ??
    >
    > Thanks,
    > Shankar
     
    , Jul 22, 2010
    #5
