Help with HTML::TableContentParser

roberthuberjr

I have an HTML file that I'm reading in from my file system (not via
get($URL)), and I'm trying to parse the table out of it using
HTML::TableContentParser.

My script runs, but produces no results. I'm thinking I need to read
the file into an array to get the parser to work on it. Here's my
code (lifted mostly from the Perl Cookbook):

#!/usr/bin/perl
use HTML::TableContentParser;
use HTML::Entities;
use strict;
my $FILE = open("<netflow_data");
#??????read the file into an array maybe????
my $tcp = new HTML::TableContentParser;
my $tables = $tcp->parse($FILE);
my $modules = $tables->[1];
foreach my $r (@{ $modules->{rows} })
{
my ($date_time, $sip, $dip, $protocol, $sport, $dport,
$tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn) =
parse_module_row($r, $FILE);
print "$date_time $sip $dip $protocol $sport $dport $tcpflags
$tos $num_pkts $num_octets $$srcasn $dstasn\n";
}
sub parse_module_row
{
my ($row, $FILE) = @_;
my ($date_time, $sip, $dip, $protocol, $sport, $dport,
$tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn);
$date_time = $row->{cells}[0]{data};
$sip = $row->{cells}[1]{data};
$dip = $row->{cells}[2]{data};
$protocol = $row->{cells}[3]{data};
$sport = $row->{cells}[4]{data};
$dport = $row->{cells}[5]{data};
$tcpflags = $row->{cells}[6]{data};
$tos = $row->{cells}[7]{data};
$num_pkts = $row->{cells}[8]{data};
$num_octets = $row->{cells}[9]{data};
$srcasn = $row->{cells}[10]{data};
$dstasn = $row->{cells}[10]{data};
return ($date_time, $sip, $dip, $protocol, $sport, $dport,
$tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn);
}
 
J. Gleixner

r> I have an html file I'm reading in from my file system (not via
r> get($URL)) that I'm trying to parse the table from using
r> HTML::TableContentParser.
r>
r> My script runs, but produces no results. I'm thinking I need to read
r> the file into an array to get the parser to work on it. Here's my
r> code (lifted from the Perl cookbook mostly) -
r>
r> #!/usr/bin/perl
r> use HTML::TableContentParser;
r> use HTML::Entities;
r> use strict;
r> my $FILE = open("<netflow_data");

That doesn't do what you think it does.

open( my $file, '<', 'netflow_data')
or die "Can't read netflow_data: $!";

perldoc -f open

To learn how to read in the contents of the file

perldoc perlopentut

or

perldoc File::Slurp

r> #??????read the file into an array maybe????
r> my $tcp = new HTML::TableContentParser;
r> my $tables = $tcp->parse($FILE);

Never used it, but looking at the example in the documentation:

use HTML::TableContentParser;
$p = HTML::TableContentParser->new();
$html = read_html_from_somewhere();
$tables = $p->parse($html);

You need to read the information from the
file and pass that to parse(). Reading the
above perldoc(umentation) should help you figure
out how to read a file.
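
For readers following along, here is a minimal sketch of the advice above: slurp the file into a scalar with core Perl, then hand that string to parse(). Note that open returns a success flag, not the file contents, which is why the original $FILE was never usable as HTML. The slurp_file helper name is made up for illustration, and the parse() call mirrors the documentation example quoted above; it assumes HTML::TableContentParser is installed and that parse() returns an array reference of tables, as the original code also assumes.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string (core Perl only).
sub slurp_file {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't read $path: $!";
    local $/;                 # no record separator, so <$fh> returns everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

# 'netflow_data' is the file name from the original post.
if ( -e 'netflow_data' ) {
    require HTML::TableContentParser;    # loaded only when there is a file to parse
    my $html   = slurp_file('netflow_data');
    my $tables = HTML::TableContentParser->new->parse($html);
    printf "found %d table(s)\n", scalar @$tables;
}
```

File::Slurp, mentioned above, offers a ready-made alternative to the hand-written helper.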
 
Uri Guttman

r> my $modules = $tables->[1];
r> foreach my $r (@{ $modules->{rows} })
r> {
r> my ($date_time, $sip, $dip, $protocol, $sport, $dport,
r> $tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn) =
r> parse_module_row($r, $FILE);

GACK!!

use a hash or something. returning long lists of scalars is bug prone
and painful to read.

r> sub parse_module_row
r> {
r> my ($row, $FILE) = @_;
r> my ($date_time, $sip, $dip, $protocol, $sport, $dport,
r> $tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn);

declare those where you need them

r> $date_time = $row->{cells}[0]{data};

my $date_time = $row->{cells}[0]{data};

r> $sip = $row->{cells}[1]{data};
r> $dip = $row->{cells}[2]{data};
r> $protocol = $row->{cells}[3]{data};
r> $sport = $row->{cells}[4]{data};
r> $dport = $row->{cells}[5]{data};
r> $tcpflags = $row->{cells}[6]{data};
r> $tos = $row->{cells}[7]{data};
r> $num_pkts = $row->{cells}[8]{data};
r> $num_octets = $row->{cells}[9]{data};
r> $srcasn = $row->{cells}[10]{data};
r> $dstasn = $row->{cells}[10]{data};

do you notice any repetition in that code? i can barely see the
differences as it is almost all repeated code.

the last two are the same data. is that a bug? another reason to not do
repeated lines where you edit each one slightly

first, factor out $row->{cells} into $cells

my $cells = $row->{cells} ;

then you can use a hash to get all the data with a simpler slice and
map:

# declare this outside the sub:

my( @fields ) = qw(
date_time sip dip protocol sport dport tcpflags tos
num_pkts num_octets srcasn dstasn
) ;

my %row ;
@row{ @fields } = map $cells->[$_]{data}, 0 .. $#fields ;

return \%row ;

look ma, no fugly repetition!!

uri
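
Put together, the pieces above might look like the following sketch. The sample table is invented so the script runs without the parser; it only mimics the rows/cells/data shape the original code relies on. Reading srcasn from cell 10 and dstasn from cell 11 assumes the repeated index 10 in the original was a typo.

```perl
use strict;
use warnings;

# Field names in column order, declared once outside the sub.
my @fields = qw(
    date_time sip dip protocol sport dport tcpflags tos
    num_pkts num_octets srcasn dstasn
);

# Return one row as a hash reference keyed by field name.
sub parse_module_row {
    my ($row) = @_;
    my $cells = $row->{cells};
    my %row;
    @row{@fields} = map { $cells->[$_]{data} } 0 .. $#fields;
    return \%row;
}

# Hypothetical stand-in for $tables->[1]: one row of twelve placeholder cells.
my $modules = {
    rows => [ { cells => [ map { +{ data => "v$_" } } 0 .. $#fields ] } ],
};

foreach my $r ( @{ $modules->{rows} } ) {
    my $rec = parse_module_row($r);
    # A hash-ref slice prints the values in field order.
    print join( ' ', @{$rec}{@fields} ), "\n";
}
```

The caller now handles a single hash reference instead of twelve positional scalars, so adding or reordering a column only means editing @fields.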
 
David Combs

r> my $modules = $tables->[1];
r> foreach my $r (@{ $modules->{rows} })
r> {
r> my ($date_time, $sip, $dip, $protocol, $sport, $dport,
r> $tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn) =
r> parse_module_row($r, $FILE);

UG> GACK!!

UG> use a hash or something.

Probably totally obvious, but could you show how you'd simplify
that with a hash? Thanks!

UG> returning long lists of scalars is bug prone
UG> and painful to read.

r> sub parse_module_row
r> {
r> my ($row, $FILE) = @_;
r> my ($date_time, $sip, $dip, $protocol, $sport, $dport,
r> $tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn);

UG> declare those where you need them

r> $date_time = $row->{cells}[0]{data};

UG> my $date_time = $row->{cells}[0]{data};

r> $sip = $row->{cells}[1]{data};
r> $dip = $row->{cells}[2]{data};
r> $protocol = $row->{cells}[3]{data};
r> $sport = $row->{cells}[4]{data};
r> $dport = $row->{cells}[5]{data};
r> $tcpflags = $row->{cells}[6]{data};
r> $tos = $row->{cells}[7]{data};
r> $num_pkts = $row->{cells}[8]{data};
r> $num_octets = $row->{cells}[9]{data};
r> $srcasn = $row->{cells}[10]{data};
r> $dstasn = $row->{cells}[10]{data};

UG> do you notice any repetition in that code? i can barely see the
UG> differences as it is almost all repeated code.

UG> the last two are the same data. is that a bug? another reason to not do
UG> repeated lines where you edit each one slightly

UG> first, factor out $row->{cells} into $cells

UG> my $cells = $row->{cells} ;

UG> then you can use a hash to get all the data with a simpler slice and
UG> map:

UG> # declare this outside the sub:

UG> my( @fields ) = qw(
UG> date_time sip dip protocol sport dport tcpflags tos
UG> num_pkts num_octets srcasn dstasn
UG> ) ;

UG> my %row ;
UG> @row{ @fields } = map $cells->[$_]{data}, 0 .. $#fields ;

Again, probably totally (or mostly) obvious, but could you explain
the above line a bit?

Perhaps also "expand" it to show any implicit (and thus missing) right-arrows.

Thanks!

David
 
Uri Guttman

r> my $modules = $tables->[1];
r> foreach my $r (@{ $modules->{rows} })
r> {
r> my ($date_time, $sip, $dip, $protocol, $sport, $dport,
r> $tcpflags, $tos, $num_pkts, $num_octets, $srcasn, $dstasn) =
r> parse_module_row($r, $FILE);
DC> Probably totally obvious, but could you show how you'd simplify
DC> that with a hash? Thanks!

the code is below. you even ask to explain that more.
first, factor out $row->{cells} into $cells

my $cells = $row->{cells} ;

then you can use a hash to get all the data with a simpler slice and
map:

# declare this outside the sub:

my( @fields ) = qw(
date_time sip dip protocol sport dport tcpflags tos
num_pkts num_octets srcasn dstasn
) ;

my %row ;
@row{ @fields } = map $cells->[$_]{data}, 0 .. $#fields ;

DC> Again probably totally (or mostly) obvious, but you explain
DC> the above line a bit?

it is just grabbing a list of field values via the map and its index
list argument. this was done in the long set of redundant assignments in
the OP's code. same code, just replaced the integer with $_. the left
side is just a simple assignment to hash slice of the field names. it is
actually very simple as the original code was simple. read more about
hash slices at:

http://sysarch.com/Perl/hash_slice.txt

DC> Perhaps also "expand" it to show any implicit thus missing right-arrows.

no, the code is simple enough. and there is only one implicit right
arrow in the only place it could be. i leave locating that as an exercise.

uri
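
For anyone who wants the slice spelled out, here is a small sketch comparing the hash-slice form with an equivalent loop. The three sample cells are invented for illustration. The loop form also writes out the optional arrow between the two adjacent subscripts, which Perl lets you omit.

```perl
use strict;
use warnings;

# Shortened field list and hypothetical cell data, just for the comparison.
my @fields = qw(date_time sip dip);
my $cells  = [
    { data => '2008-01-01 00:00' },
    { data => '10.0.0.1' },
    { data => '10.0.0.2' },
];

# Slice form: the map builds the list of values, the slice assigns them
# to the keys in @fields, position by position.
my %sliced;
@sliced{@fields} = map { $cells->[$_]{data} } 0 .. $#fields;

# Loop form: the same assignments, one key at a time, with the arrow
# between [$i] and {data} written explicitly.
my %looped;
for my $i ( 0 .. $#fields ) {
    $looped{ $fields[$i] } = $cells->[$i]->{data};
}

# Both hashes hold identical key/value pairs.
for my $f (@fields) {
    print "$f: $sliced{$f}\n" if $sliced{$f} eq $looped{$f};
}
```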
 
