HTML::TokeParser & TableExtract

Discussion in 'Perl Misc' started by Abram, Apr 25, 2006.

  1. Abram

    Abram Guest

    I'm fairly new to Perl, so bare with me.

    I am trying to extract a table from an HTML file and parse through each
    row, then dump the extracted cell data into a csv file. This was
    pretty easy to accomplish with HTML::TokeParser, however I have one
    problem. Each HTML file I need to parse has three tables with the same
    structure. I need to separate these three tables into three csv files.

    I can use TableExtract to get the exact tables using the depth and
    count matching (depth is always 2 and count is 5-7), but I am not sure
    how to then parse only that table and extract the data. I'm sure this
    is pretty simple stuff, and I'll kick myself when I see the answer.

    Thanks in advance.

    --Abram
     
    Abram, Apr 25, 2006
    #1

  2. Abram

    David Squire Guest

    Abram wrote:
    > I'm fairly new to Perl, so bare with me.


    What an image! :) I guess you mean "bear with me" :)

    (Sorry, but it seems to be spelling/idiom correction day here).

    DS
     
    David Squire, Apr 25, 2006
    #2

  3. Abram

    Dr.Ruud Guest

    David Squire schreef:

    > it seems to be spelling/idiom correction day here


    How perfect, on my birthday!

    --
    Affijn, Ruud (44)

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 25, 2006
    #3
  4. Abram

    Abram Guest

    Ha! My brain has become a bit mushy with my hours of "learning" perl,
    so I didn't even notice... I better put something on!

    At least it got some attention, any suggestions (not on my apparel, but
    the html data extraction)?
     
    Abram, Apr 25, 2006
    #4
  5. Abram <> wrote:

    > I can use TableExtract to get the exact tables using the depth and
    > count matching (depth is always 2 and count is 5-7), but I am not sure
    > how to then parse only that table and extract the data. I'm sure this
    > is pretty simple stuff, and I'll kick myself when I see the answer.



    From "perldoc HTML::TableExtract":

    $te = new HTML::TableExtract( depth => 2, count => 2 );
    $te->parse($html_string);
    foreach $ts ($te->table_states) {
        print "Table found at ", join(',', $ts->coords), ":\n";
        foreach $row ($ts->rows) {
            print " ", join(',', @$row), "\n";
        }
    }


    That seems to do it.

    Are you having trouble modifying that to produce CSV?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 26, 2006
    #5
  6. Abram

    Abram Guest

    Thanks Tad,


    > Tad McClellan wrote:
    > Are you having trouble modifying that to produce CSV?


    Actually yes. I have been using the code from perldoc (slightly
    modified), but cannot seem to get the proper structure for csv. That
    is why I was looking into TokeParser as I could easily parse through
    each TD and conditionally extract the data.

    Could you provide some help on how to get this done with TableExtract?
    My HTML looks something like this:
    ....
    <table>
    <tr>
    <td> Header 1 </td>
    <td> Header 2 </td>
    <td> Header 3 </td>
    </tr>
    <!-- Data Starts Here -->
    <tr id="Data_Row_1">
    <td> data 1_1 </td>
    <td> data 1_2 </td>
    <td> data 1_3 </td>
    </tr>
    <tr id="Data_Row_1_1">
    <td colspan=3> More data for 1 </td>
    </tr>
    <tr id="Data_Row_2">
    <td> data 2_1 </td>
    <td> data 2_2 </td>
    <td> data 2_3 </td>
    </tr>
    <tr id="Data_Row_2_1">
    <td colspan=3> More data for 2 </td>
    </tr>
    </table>
    (NOTE: Actual html doesn't have tr id's, used just to illustrate
    associated rows)

    To make things even more interesting I need to extract the "More data
    for NN" row and append it to the data row.

    Any suggestions?
     
    Abram, Apr 26, 2006
    #6
  7. "Abram" <> wrote in news:1146068409.560958.129860
    @i39g2000cwa.googlegroups.com:

    > Thanks Tad,
    >
    >
    >> Tad McClellan wrote:
    >> Are you having trouble modifying that to produce CSV?

    >
    > Actually yes. I have been using the code from perldoc (slightly
    > modified), but cannot seem to get the proper structure for csv. That
    > is why I was looking into TokeParser as I could easily parse through
    > each TD and conditionally extract the data.


    ....

    > <tr id="Data_Row_1">
    > <td> data 1_1 </td>
    > <td> data 1_2 </td>
    > <td> data 1_3 </td>
    > </tr>
    > <tr id="Data_Row_1_1">
    > <td colspan=3> More data for 1 </td>
    > </tr>


    ....

    > (NOTE: Actual html doesn't have tr id's, used just to illustrate
    > associated rows)
    >
    > To make things even more interesting I need to extract the "More data
    > for NN" row and append it to the data row.


    Which column are you supposed to put the data in "More data for NN"?

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 26, 2006
    #7
  8. Abram

    Abram Guest

    Sinan,

    > Which column are you supposed to put the data in "More data for NN"?


    The last column of the row. So it would look like this in the csv:
    data 1_1,data 1_2,data 1_3,More data for 1
    data 2_1,data 2_2,data 2_3,More data for 2
    data 3_1,data 3_2,data 3_3,More data for 3
    data 4_1,data 4_2,data 4_3,More data for 4
    ....etc...

    --Abram
     
    Abram, Apr 26, 2006
    #8
  9. Abram <> wrote:
    >> Tad McClellan wrote:
    >> Are you having trouble modifying that to produce CSV?

    >
    > Actually yes. I have been using the code from perldoc (slightly
    > modified), but cannot seem to get the proper structure for csv.



    It is _already_ CSV, with extra spaces at the beginning and
    no quotes around fields.

    Modify the boilerplate code to eliminate the extra spaces, and
    to put quotes around fields.
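
    A minimal sketch of that modification, using only core Perl. The helper
    name csv_line is hypothetical, and the handling of undefined cells is an
    assumption about what the table parser may hand back for empty cells. For
    real data a CPAN module such as Text::CSV covers more edge cases.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one row of cell values into a quoted CSV line.
sub csv_line {
    my @fields = @_;
    for my $field (@fields) {
        $field = '' unless defined $field;   # assumption: empty cells may be undef
        $field =~ s/^\s+|\s+$//g;            # strip leading and trailing whitespace
        $field =~ s/"/""/g;                  # double any embedded quotes
        $field = qq{"$field"};               # wrap the field in quotes
    }
    return join(',', @fields) . "\n";
}

print csv_line(' data 1_1 ', ' data 1_2 ', ' data 1_3 ');
```

    In the perldoc loop this would replace the join(',', @$row) line, e.g.
    print csv_line(@$row);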


    > Could you provide some help on how to get this done with TableExtract?



    Sure.

    Post your broken code, and someone will help you fix it.


    > To make things even more interesting I need to extract the "More data
    > for NN" row and append it to the data row.



    How do you identify what is to be joined?

    Does it always have the "More data" text in it? (I doubt it)

    Are there times when there is NOT a "continuation" row?

    Can there be more than one "continuation row"?

    etc...


    > Any suggestions?



    If you need debugging help, you pretty much have to post the
    code that you want debugged...


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 26, 2006
    #9
  10. "Abram" <> wrote in news:1146090054.571782.194790
    @i40g2000cwc.googlegroups.com:

    > Sinan,
    >
    >> Which column are you supposed to put the data in "More data for NN"?

    >
    > The last column of the row. So it would look like this in the csv:
    > data 1_1,data 1_2,data 1_3,More data for 1
    > data 2_1,data 2_2,data 2_3,More data for 2
    > data 3_1,data 3_2,data 3_3,More data for 3
    > data 4_1,data 4_2,data 4_3,More data for 4


    Each regular row will contain 3 elements. The continuation row will have
    only one element. Join that element with the third column of the previous
    row.
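
    A rough sketch of that algorithm in core Perl. The helper name merge_rows
    and the sample rows are hypothetical. It assumes each row arrives as an
    array ref of cell text and that a continuation row has exactly one
    non-empty cell; how the colspan row actually comes back from the parser
    is not shown in the thread. It appends the continuation text as an extra
    last column, which matches the CSV layout requested above.

```perl
use strict;
use warnings;

# Hypothetical helper: fold each one-cell continuation row into the row before it.
sub merge_rows {
    my @rows = @_;
    my @merged;
    for my $row (@rows) {
        # Cells that actually contain something (assumption: spanned cells are empty/undef).
        my @filled = grep { defined($_) && /\S/ } @$row;
        if (@filled == 1 && @merged) {
            push @{ $merged[-1] }, $filled[0];   # append as an extra last column
        }
        else {
            push @merged, [ map { defined($_) ? $_ : '' } @$row ];   # regular row, copied
        }
    }
    return @merged;
}

my @rows = (
    [ 'data 1_1', 'data 1_2', 'data 1_3' ],
    [ 'More data for 1', undef, undef ],
    [ 'data 2_1', 'data 2_2', 'data 2_3' ],
    [ 'More data for 2', undef, undef ],
);
print join(',', @$_), "\n" for merge_rows(@rows);
```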

    For more help, post your best attempt to implement the algorithm above. If
    it does not work and I don't get a chance, someone else will definitely
    help you fix it.

    Sinan
    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 27, 2006
    #10
  11. Abram

    Abram Guest

    Thanks for all your help guys! I think I got it.

    Here's what ended up working for me, please advise as to any better
    approaches.

    #!/usr/bin/perl
    use HTML::TableExtract;

    # Declare the subroutines
    sub trim($);

    my $html_file = "C:/webfiles/test.htm";
    $te = HTML::TableExtract->new( depth => 1, count => 6 );
    $te->parse_file($html_file);

    my $log = "c://perl//pl_projects//web_parser.log";
    open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";
    my $we_need_to_truncate = 0;
    foreach $ts ($te->tables) {
        foreach $row ($ts->rows) {
            $counter ++;
            if ($counter > 4 ){
                for ($i=1; $i<6; $i++) {
                    #If the table has no top keywords we need to truncate the file
                    if (@$row[$i] =~ m/No keywords rank*/){
                        $we_need_to_truncate = 1;
                    }
                    # $bit is used to determine if we need to join row to the previous row
                    if(!$bit){
                        $str = $str.trim(@$row[$i]).",";
                    }else{
                        $str = $str.trim(@$row[$i]);
                    }
                }
                if ($bit){
                    $bit=0;
                    $str = trim($str)."\n";
                }else{
                    $bit=1;
                }
            }
        }
    }
    #Write the file
    my $old_fh = select($LF);
    print $str;
    select ($old_fh);
    close($LF) or die "Couldn't close $log: $!\n";

    #remove the last three rows if need be
    if($we_need_to_truncate){
        for ($i=1; $i<4; $i++){
            truncatefile($log);
        }
    }

    # Perl trim function to remove whitespace from the start and end of the string
    sub trim($)
    {
        my $string = shift;
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;

        return $string;
    }

    sub truncatefile()
    {
        open (FH, "+< $log") or die "can't update $log: $!";
        while (<FH>) {
            $addr = tell(FH) unless eof(FH);
        }
        truncate(FH, $addr) or die "can't truncate $log: $!";
    }


    --Abram
     
    Abram, Apr 27, 2006
    #11
  12. Abram

    Dr.Ruud Guest

    Abram schreef:
    > Thanks for all your help guys! I think I got it.
    >
    > Here's what ended up working for me, please advise as to any better
    > approaches.
    >
    > #!/usr/bin/perl


    Missing:

    use strict;
    use warnings;

    > use HTML::TableExtract;
    >
    > # Declare the subroutines
    > sub trim($);


    Not necessary.


    > my $html_file = "C:/webfiles/test.htm";


    Use single quotes when double quotes are not needed.


    > $te = HTML::TableExtract->new( depth => 1, count => 6 );


    my $te = ...


    > $te->parse_file($html_file);
    >
    > my $log = "c://perl//pl_projects//web_parser.log";


    Replace the double forward slashes with single ones.


    > open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";


    Is there a special reason to use uppercase for the lexical filehandle?
    See also the 3-arguments form: perldoc -f open.


    > my $we_need_to_truncate = 0;


    I would use a shorter name, like $must_truncate or even just $truncate.


    > foreach $ts ($te->tables) {


    for my $ts ($te->tables) {
    (further my's not mentioned)

    > foreach $row ($ts->rows) {
    > $counter ++;


    How high may that counter go?


    > if ($counter > 4 ){
    > for ($i=1; $i<6; $i++) {
    > #If the table has no top keywords we need to truncate the
    > file
    > if (@$row[$i] =~ m/No keywords rank*/){


    Zero, one or more k's? Just remove that asterisk.
    Is that text at the start of a line? Add an anchor.
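
    A small sketch of the difference, with made-up sample strings. The
    anchored pattern assumes the text starts the cell, which is exactly the
    open question above.

```perl
use strict;
use warnings;

my $loose  = qr/No keywords rank*/;   # "k*" means zero or more k's
my $strict = qr/^No keywords rank/;   # literal text, anchored at the start

# Made-up sample cells, for illustration only.
for my $text ('No keywords ranked', 'No keywords ran out', 'Note: No keywords rank') {
    printf "%-24s loose=%d strict=%d\n", $text,
        ($text =~ $loose  ? 1 : 0),
        ($text =~ $strict ? 1 : 0);
}
```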


    > $we_need_to_truncate = 1;
    > }
    > # $bit is used to determine if we need to join row to the
    > previous row


    > if(!$bit){
    > $str = $str.trim(@$row[$i]).",";
    > }else{
    > $str = $str.trim(@$row[$i]);
    > }


    Some variants:
    $str = $str.trim(@$row[$i]) . ($bit ? '' : ',');
    or
    $str = $str.trim(@$row[$i]);
    $str .= ',' if $bit == 0;
    or
    $str = $str.trim(@$row[$i]);
    $bit or $str .= ',';


    > }
    > if ($bit){
    > $bit=0;


    if ($bit) {
    $bit = 0;

    Whitespace is quite cheap.


    > $str = trim($str)."\n";
    > }else{
    > $bit=1;
    > }
    > }
    > }
    > }
    > #Write the file
    > my $old_fh = select($LF);
    > print $str;
    > select ($old_fh);


    Brackets are not needed with select.


    > close($LF) or die "Couldn't close $log: $!\n";


    Brackets are not needed with close.


    > #remove the last three rows if need be
    > if($we_need_to_truncate){
    > for ($i=1; $i<4; $i++){
    > truncatefile($log);
    > }


    if ($truncate) { truncatefile($log) for 1 .. 3 }


    > }
    >


    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 27, 2006
    #12
  13. Abram

    Ben Morrow Guest

    Quoth "Abram" <>:
    > Thanks for all your help guys! I think I got it.
    >
    > Here's what ended up working for me, please advise as to any better
    > approaches.
    >
    > #!/usr/bin/perl


    You definitely want

    use strict;
    use warnings;

    here. Get Perl to help you get things right.

    > use HTML::TableExtract;
    >
    > # Declare the subroutines
    > sub trim($);
    >
    > my $html_file = "C:/webfiles/test.htm";
    > $te = HTML::TableExtract->new( depth => 1, count => 6 );
    > $te->parse_file($html_file);
    >
    > my $log = "c://perl//pl_projects//web_parser.log";


    Why have you doubled these slashes? Are you confusing them with
    backslashes (which do need doubling in a "" string)?

    > open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";


    It's better to use three-arg open when you can (all the time,
    basically), and you don't need those parens since you're using 'or'
    instead of '||'.

    open my $LF, '>>', $log or die "...";

    You get lots of points for 1. using lexical FHs 2. checking the return
    value and 3. including both the file and $! in the message, though :).

    BTW, do you realise that putting "\n" on the end of a 'die' message
    suppresses the file/line-number information? This is probably a
    situation (a message directed at the user rather than a developer) where
    that is appropriate, but in case you didn't know...
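
    A quick way to see the difference for yourself, as a sketch: eval traps
    each die so both messages can be compared side by side.

```perl
use strict;
use warnings;

eval { die "Couldn't open file" };      # no trailing newline
my $with_location = $@;

eval { die "Couldn't open file\n" };    # trailing newline
my $without_location = $@;

print $with_location;      # Perl appends " at <script> line <n>." and a newline
print $without_location;   # printed exactly as written
```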

    > my $we_need_to_truncate = 0;


    I wouldn't have the '= 0' here: undef is a perfectly good false value.
    But that's probably a matter of taste...

    > foreach $ts ($te->tables) {
    > foreach $row ($ts->rows) {
    > $counter ++;
    > if ($counter > 4 ){
    > for ($i=1; $i<6; $i++) {


    A much more Perlish way to write that is

    for my $i (1..5) {

    which also makes the upper bound clearer.

    > #If the table has no top keywords we need to truncate the
    > file
    > if (@$row[$i] =~ m/No keywords rank*/){
    > $we_need_to_truncate = 1;
    > }
    > # $bit is used to determine if we need to join row to the
    > previous row
    > if(!$bit){
    > $str = $str.trim(@$row[$i]).",";
    > }else{
    > $str = $str.trim(@$row[$i]);
    > }
    > }
    > if ($bit){
    > $bit=0;
    > $str = trim($str)."\n";
    > }else{
    > $bit=1;
    > }
    > }
    > }
    > }
    > #Write the file
    > my $old_fh = select($LF);
    > print $str;
    > select ($old_fh);


    You can tell print which filehandle to print to without selecting it:

    print $LF $str;

    Note the lack of comma after '$LF'.
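
    Putting the three-arg open and the direct print together, as a minimal
    sketch. The temporary file from the core File::Temp module is only a
    stand-in for the real log path, and the sample $str is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file stands in for the log path from the original script.
my ($tmp_fh, $log) = tempfile(UNLINK => 1);
close $tmp_fh or die "Couldn't close $log: $!\n";

my $str = "data 1_1,data 1_2,data 1_3,More data for 1\n";

# Three-argument open: the mode and the file name are separate arguments.
open my $LF, '>>', $log or die "Couldn't open $log for writing: $!\n";
print {$LF} $str;    # same as: print $LF $str;  (no comma after the handle)
close $LF or die "Couldn't close $log: $!\n";
```

    The braces around the handle are optional here; some people find they
    make the missing comma look less like a typo.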

    > close($LF) or die "Couldn't close $log: $!\n";
    >
    > #remove the last three rows if need be
    > if($we_need_to_truncate){
    > for ($i=1; $i<4; $i++){
    > truncatefile($log);
    > }
    > }
    >
    > # Perl trim function to remove whitespace from the start and end of the
    > string
    > sub trim($)


    You don't need to prototype (the '($)') Perl subs. This one does no
    harm...

    > {
    > my $string = shift;
    > $string =~ s/^\s+//;
    > $string =~ s/\s+$//;
    >
    > return $string;
    > }
    >
    > sub truncatefile()


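    ....but this one is wrong: the empty prototype declares a sub that takes
    no arguments, yet you call it with a parameter above. It happens to work,
    because the call is compiled before the prototype is seen and $log is
    visible to the whole file, but that's not good practice; so you want
    something more like

    sub truncatefile {
        my ($log) = @_; # get the parameters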

    > {
    > open (FH, "+< $log") or die "can't update $log: $!";
    > while (<FH>) {
    > $addr = tell(FH) unless eof(FH);
    > }
    > truncate(FH, $addr) or die "can't truncate $log: $!";
    > }


    This is a really inefficient way of removing the last line from the
    file. As you accumulate the whole file before you print it, you can just
    use something like (untested)

    $str =~ s/(?: [^\n]* \n ){0,3} $//x;

    before you print it; or, better, push the lines onto an array as you go
    rather than joining them with "\n" and then chop off the last three
    elements.
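
    The array variant could look something like this sketch. The variable
    names and the sample lines are placeholders, not taken from the original
    script.

```perl
use strict;
use warnings;

my @lines = map { "row $_" } 1 .. 5;   # stand-in for the CSV lines built in the loop
my $must_truncate = 1;                 # stand-in for the "No keywords rank" check

# Drop the last three lines before anything is written to disk.
# The length check avoids a fatal error when fewer than three lines exist.
splice(@lines, -3) if $must_truncate && @lines >= 3;

print "$_\n" for @lines;
```

    Nothing has to be reopened or truncated afterwards, since the unwanted
    lines never reach the file.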

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine William Blake
    Under every grief and pine 'Auguries of Innocence'
    Runs a joy with silken twine.
     
    Ben Morrow, Apr 27, 2006
    #13
  14. Abram

    Abram Guest

    Thanks for the tips! I knew it was a bit sloppy and long-form.
     
    Abram, Apr 27, 2006
    #14
  15. Abram <> wrote:

    > please advise as to any better
    > approaches.
    >
    > #!/usr/bin/perl
    > use HTML::TableExtract;



    You are missing:

    use warnings;
    use strict;


    > sub trim($);



    Prototypes almost certainly don't do what you think they do; consider
    not using prototypes.
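
    One illustration of the kind of surprise meant here, as a sketch with
    made-up sub names: a ($) prototype puts its argument in scalar context,
    so passing an array hands the sub the element count rather than the
    elements.

```perl
use strict;
use warnings;

sub with_proto($) { return $_[0] }   # ($) forces scalar context on the argument
sub no_proto      { return $_[0] }   # plain sub: arguments are flattened into @_

my @cells = ('a', 'b', 'c');

my $from_proto = with_proto(@cells);   # @cells in scalar context is its length
my $from_plain = no_proto(@cells);     # first element of the list

print "with prototype: $from_proto\n";
print "without prototype: $from_plain\n";
```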


    > for ($i=1; $i<6; $i++) {


    foreach my $i ( 1 .. 5 ) {


    > my $old_fh = select($LF);
    > print $str;
    > select ($old_fh);



    What is the point of those 3 lines?

    What is wrong with this 1 line instead?

    print $LF $str;

    I have never used select() for this purpose in over 10 years
    of Perl programming.

    Where did you learn about using select() like that?


    > for ($i=1; $i<4; $i++){



    foreach my $i ( 1 .. 3 ){


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 28, 2006
    #15
  16. Abram

    Ben Morrow Guest

    Quoth "Dr.Ruud" <>:
    > Abram schreef:
    > >
    > > # Declare the subroutines
    > > sub trim($);

    >
    > Not necessary.


    But useful if you want to call it without parens later.
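
    For instance, in this sketch the forward declaration (written without a
    prototype) is what lets the call compile without parens, even though the
    body of trim appears further down the file:

```perl
use strict;
use warnings;

sub trim;    # forward declaration, no prototype

my $clean = trim "  data 1_1  ";   # no parens needed because trim is predeclared

sub trim {
    my $string = shift;
    $string =~ s/^\s+|\s+$//g;     # strip leading and trailing whitespace
    return $string;
}

print "[$clean]\n";
```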

    > > my $html_file = "C:/webfiles/test.htm";

    >
    > Use single quotes when double quotes are not needed.


    I believe this is considered a matter of style (I agree with you, but
    others do not).

    > > open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";

    >
    > Is there a special reason to use uppercase for the lexical filehandle?


    It is traditional, from when it was usual to use an uppercase bareword
    :).

    I frequently use a convention like $IN is a file/$in is a line read from
    that file.

    > > if (@$row[$i] =~ m/No keywords rank*/){

    >
    > Zero, one or more k's? Just remove that asterisk.


    I suspect he was thinking of /...rank.*/... but still, not necessary.

    > > close($LF) or die "Couldn't close $log: $!\n";

    >
    > Brackets are not needed with close.


    Again, a matter of style. Some people are more comfortable with function
    calls having parens.

    Ben

    --
    I touch the fire and it freezes me, []
    I look into it and it's black.
    Why can't I feel? My skin should crack and peel---
    I want the fire back... Buffy, 'Once More With Feeling'
     
    Ben Morrow, Apr 28, 2006
    #16
  17. Abram

    David Combs Guest

    In article <>,
    Ben Morrow <> wrote:
    >
    >
    >before you print it; or, better, push the lines onto an array as you go
    >rather than joining them with "\n" and then chop off the last three
    >elements.


    General question about pushing onto an array (and GC):

    Suppose you're reading some large file, and for each
    (or certain) lines in it,

    you want to modify it somehow
    and then push it onto an array.

    Now, about GC and thrashing (eg GC'ing way too often
    for comfort):

    If instead of the above, suppose you first pushed
    the (certain) lines onto the array, and then
    later, in a 2nd pass (through the array) you
    do the per-line modifications.

    Under what conditions might that be a big win,
    in that (with luck) you'd end up pushing each
    line at the end of the "free space" (gotten by
    the most recent GC)?

    That is, if you pushed something onto the array,
    and the array wasn't already at the *end* of
    the free-space, the perl-os would have to *copy*
    the entire array, and then append the line.

    (OOPS: arrays are surely just an array of pointers
    *to* the lines. So, translate the problem to instead
    appending lines onto a single ever-growing *STRING*.)

    Anyway, you can see what I'm getting at: how to
    **sometimes** program so as to minimize the
    GC's.

    What features does perl6 have towards this end?

    (I believe some languages let one allocate *multiple*
    garbage-collectable spaces, so when you *really* need
    to, you can set it up so that you do your appending
    to one continuous structure in one space, and the
    other things on another space, thus keeping them
    from interfering with each other.)

    This having to copy at each append can rapidly overwhelm
    all other cpu-usage, what with it being an n-squared space-usage
    process.


    Comments?

    Thanks,

    David
     
    David Combs, May 22, 2006
    #17