PERL/HTML: extract repetitive information


seminex

Hi all,

I have an HTML page (~5 MB) like this:

<HTML>
<BODY BGCOLOR=#FFFFFF LINK=000066 VLINK=000066 TOPMARGIN=0
LEFTMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0><font size=12
color=#000000>Date debut: 07/07/2007<br>Date fin: 08/07/2007<br>Heure
debut : 01:00:00<br>Heure fin: 01:00:00<br>FTI : access<br><br><TABLE
BORDER=>
<tr bgcolor=#FFFFE8>
<th width=100 align=CENTER>login</th>
<th width=70 align=CENTER>ip</th>
<th width=230 align=CENTER>Num. appelant</th>
<th width=80 align=CENTER>J debut</th>
<th width=60 align=CENTER>H debut</th>
<th width=80 align=CENTER>J fin</th>
<th width=60 align=CENTER>H fin</th>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.26</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-06</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>23:59:50</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:00</td>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:02</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:12</td>
</tr>
</HTML>

I would like to extract only the start times; in this example, "23:59:50" and
"00:00:02".
I've tried several Perl programs that use a regular expression
( /^(\d\d):(\d\d):(\d\d)/ ) to extract the times, but the cells are often
empty (an HTML error), and I get this:

[..]
1 <tr>
2 <td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
3 <td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
4 <td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
5 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
6 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
7 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
8 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
9 </tr>
[..]

So, does anybody have a sample that extracts _only_ line 6?

Because this:

use HTML::Parser;

my @tableau;

# Text handler: keep any text chunk that starts with hh:mm:ss
sub tparse {
    my @input = @_;
    chomp(@input);
    if ($input[0] =~ /^(\d\d):(\d\d):(\d\d)/) {
        push(@tableau, $input[0]);
    }
}

my $p = HTML::Parser->new( api_version => 3,
                           text_h => [\&tparse, "dtext"]);
$p->parse_file(shift || die "Cannot open the file! ($!)\n") ||
    die $!;

This extracts lines 6 and 8, but ONLY when they contain a time such as
00:00:01. When a cell is empty, my script picks up the next time instead,
which throws off the rest of the script.

Thanks in advance.
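
The misalignment described above comes from matching text chunks without knowing which cell they belong to. One possible way around it, staying with HTML::Parser, is to count the <td> cells in each row and look only at the cell at the wanted position. The following is a minimal sketch under stated assumptions: the sample HTML is a shortened stand-in for the real page, "H debut" is assumed to be the fifth cell of every row, and the names $WANTED, start_tag, end_tag and cell_text are hypothetical helpers, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Shortened stand-in for the real page; the second row has empty time cells.
my $html = <<'END_HTML';
<table>
<tr><td>login/access</td><td>192.168.30.26</td><td>Supervision ACCESLIBRE</td><td>2007-07-06</td><td>23:59:50</td><td>2007-07-07</td><td>00:00:00</td></tr>
<tr><td>login/access</td><td>192.168.30.41</td><td>Supervision ACCESLIBRE</td><td>2007-07-07</td><td> </td><td>2007-07-07</td><td> </td></tr>
</table>
END_HTML

my $WANTED = 4;     # assumption: zero-based position of the "H debut" cell
my $col    = -1;    # position of the current <td> within its row
my $in_td  = 0;     # true while the parser is inside a <td>
my $text   = '';    # text collected for the current cell
my @tableau;        # one entry per row: the time, or undef for an empty cell

sub start_tag {
    my ($tag) = @_;
    if    ($tag eq 'tr') { $col = -1 }
    elsif ($tag eq 'td') { $col++; $in_td = 1; $text = '' }
}

sub cell_text { $text .= $_[0] if $in_td }

sub end_tag {
    my ($tag) = @_;
    return unless $tag eq 'td' && $in_td;
    $in_td = 0;
    return unless $col == $WANTED;
    # Record undef for an empty cell so later rows stay aligned.
    push @tableau, $text =~ /(\d\d:\d\d:\d\d)/ ? $1 : undef;
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&start_tag, 'tagname'],
    end_h       => [\&end_tag,   'tagname'],
    text_h      => [\&cell_text, 'dtext'],
);
$p->parse($html);
$p->eof;

print defined $_ ? "found '$_'\n" : "found nothing\n" for @tableau;
```

Because every row contributes exactly one entry, an empty cell shows up as a gap instead of silently shifting the later values.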
 

Tad McClellan

I would like to extract only the start times; in this example, "23:59:50" and
"00:00:02".
1 <tr>
2 <td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
3 <td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
4 <td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
5 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
6 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
7 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
8 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
9 </tr>
[..]

So, does anybody have a sample that extracts _only_ line 6?


Regexes are not the Right Tool for parsing context free languages
such as HTML.

Use a module that understands HTML for processing HTML data:

----------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TableExtract;

# Slurp the whole HTML document from the DATA section
my $html = do { local $/; <DATA> };
my $te = HTML::TableExtract->new();
$te->parse($html);

# Examine all matching tables
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # Index 4 is the fifth column, "H debut"
        print "found '$row->[4]'\n";
    }
}


__DATA__
<HTML>
<BODY BGCOLOR=#FFFFFF LINK=000066 VLINK=000066 TOPMARGIN=0
LEFTMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0><font size=12
color=#000000>Date debut: 07/07/2007<br>Date fin: 08/07/2007<br>Heure
debut : 01:00:00<br>Heure fin: 01:00:00<br>FTI : access<br><br><TABLE
BORDER=>
<tr bgcolor=#FFFFE8>
<th width=100 align=CENTER>login</th>
<th width=70 align=CENTER>ip</th>
<th width=230 align=CENTER>Num. appelant</th>
<th width=80 align=CENTER>J debut</th>
<th width=60 align=CENTER>H debut</th>
<th width=80 align=CENTER>J fin</th>
<th width=60 align=CENTER>H fin</th>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.26</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-06</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>23:59:50</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:00</td>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:02</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:12</td>
</tr>
</HTML>
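
A variation on the script above, offered only as a sketch: HTML::TableExtract can also select columns by header text through its headers option, which avoids hard-coding the column index. The sample table below is a shortened, assumed stand-in for the real page, @start_times is a hypothetical name, and tables is used as the accessor found in more recent versions of the module (the script above uses the older table_states).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Shortened stand-in for the real page; the second data row has empty time cells.
my $html = <<'END_HTML';
<table>
<tr><th>login</th><th>J debut</th><th>H debut</th><th>J fin</th><th>H fin</th></tr>
<tr><td>login/access</td><td>2007-07-06</td><td>23:59:50</td><td>2007-07-07</td><td>00:00:00</td></tr>
<tr><td>login/access</td><td>2007-07-07</td><td> </td><td>2007-07-07</td><td> </td></tr>
<tr><td>login/access</td><td>2007-07-07</td><td>00:00:02</td><td>2007-07-07</td><td>00:00:12</td></tr>
</table>
END_HTML

# Select the column by its header text instead of by position.
my $te = HTML::TableExtract->new( headers => ['H debut'] );
$te->parse($html);

my @start_times;    # one entry per data row; undef marks an empty cell
foreach my $table ($te->tables) {
    foreach my $row ($table->rows) {
        # Only the requested column is returned, so it sits at index 0.
        my $cell = defined $row->[0] ? $row->[0] : '';
        push @start_times, $cell =~ /(\d\d:\d\d:\d\d)/ ? $1 : undef;
    }
}

print defined $_ ? "found '$_'\n" : "found nothing\n" for @start_times;
```

Keeping an undef placeholder for empty cells means each row still yields exactly one entry, so the start times stay lined up with their rows.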
 

Seminex

Great, it works!
Two solutions, two ways of proceeding, that's great!

Thanks, all!

Thanks a lot!

I'm discovering HTML::TableExtract, it's fun :p

Thanks!

:)
 
