PERL/HTML: extract repetitive information

seminex · Jul 11, 2007

Hi all,

I have an HTML page (~5 MB) like this:

<HTML>
<BODY BGCOLOR=#FFFFFF LINK=000066 VLINK=000066 TOPMARGIN=0
LEFTMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0>Date debut: 07/07/2007 Date fin: 08/07/2007 Heure
debut : 01:00:00 Heure fin: 01:00:00 FTI : access <TABLE
BORDER=>
<tr bgcolor=#FFFFE8>
<th width=100 align=CENTER>login</th>
<th width=70 align=CENTER>ip</th>
<th width=230 align=CENTER>Num. appelant</th>
<th width=80 align=CENTER>J debut</th>
<th width=60 align=CENTER>H debut</th>
<th width=80 align=CENTER>J fin</th>
<th width=60 align=CENTER>H fin</th>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.26</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-06</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>23:59:50</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:00</td>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:02</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:12</td>
</tr>
</HTML>

I want to extract only the first time of each row; in this example, "23:59:50" and
"00:00:02".
I've tried several Perl programs that use a regular expression ( /^(\d\d):(\d\d):(\d\d)/ )
to extract these times, but often the cells are empty (an HTML error), and I get this:

[..]
1 <tr>
2 <td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
3 <td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
4 <td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
5 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
6 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
7 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
8 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
9 </tr>
[..]

So I'm asking whether anybody has a sample that extracts _only_ line 6.

Because this:

sub tparse {
    @input = @_;
    chomp(@input);
    if ($input[0] =~ /^(\d\d):(\d\d):(\d\d)/) {
        push(@tableau, $input[0]);
    }
}

my $p = HTML::Parser->new( api_version => 3,
                           text_h      => [\&tparse, "dtext"] );
$p->parse_file(shift || die "Cannot open the file! ($!)\n") ||
    die $!;

This extracts lines 6 and 8, but ONLY when they contain a time such as
00:00:01. When a cell is empty, my script picks up the next match
instead, which throws off the rest of the script.

Thanks in advance.
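
One way to keep empty cells from shifting the results, sketched here under assumptions: instead of collecting any text that looks like a time, count the <td> cells in each row and record the fifth one whether or not it holds a time. The helper name first_hours and the column index 4 are illustrative and do not come from the original script; the handlers use the same HTML::Parser version 3 API as above.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the text of the 5th <td> of every row.
# An empty cell yields an empty string, so rows stay aligned.
sub first_hours {
    my ($html) = @_;
    my @hours;
    my $col   = -1;    # index of the current <td> within its row
    my $in_td = 0;     # true while inside a <td> element
    my $text  = '';    # text accumulated for the current cell

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if    ($tag eq 'tr') { $col = -1 }
            elsif ($tag eq 'td') { $col++; $in_td = 1; $text = '' }
        }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub { $text .= $_[0] if $in_td }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'td' && $in_td;
            $in_td = 0;
            if ($col == 4) {              # assumed position of "H debut"
                $text =~ s/^\s+|\s+$//g;  # trim surrounding whitespace
                push @hours, $text;
            }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @hours;
}
```

Each data row then contributes exactly one entry, and rows with a missing time can be skipped or reported by testing for the empty string.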

Tad McClellan · Jul 11, 2007

I want to extract only the first time of each row; in this example, "23:59:50" and
"00:00:02".

1 <tr>
2 <td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
3 <td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
4 <td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
5 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
6 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
7 <td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
8 <td width=60 align=LEFT bgcolor=#FFFFE8> </td>
9 </tr>
[..]

So I'm asking whether anybody has a sample that extracts _only_ line 6.

Regexes are not the Right Tool for parsing context free languages
such as HTML.

Use a module that understands HTML for processing HTML data:

----------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TableExtract;

my $html = do { local $/; <DATA> };   # slurp the whole __DATA__ section
my $te = HTML::TableExtract->new();
$te->parse($html);

# Examine all matching tables
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # index 4 is the fifth column ("H debut")
        print "found '$row->[4]'\n";
    }
}

__DATA__
<HTML>
<BODY BGCOLOR=#FFFFFF LINK=000066 VLINK=000066 TOPMARGIN=0
LEFTMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0>Date debut: 07/07/2007 Date fin: 08/07/2007 Heure
debut : 01:00:00 Heure fin: 01:00:00 FTI : access <TABLE
BORDER=>
<tr bgcolor=#FFFFE8>
<th width=100 align=CENTER>login</th>
<th width=70 align=CENTER>ip</th>
<th width=230 align=CENTER>Num. appelant</th>
<th width=80 align=CENTER>J debut</th>
<th width=60 align=CENTER>H debut</th>
<th width=80 align=CENTER>J fin</th>
<th width=60 align=CENTER>H fin</th>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.26</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-06</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>23:59:50</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:00</td>
</tr>
<tr>
<td width=100 align=CENTER bgcolor=#FFFFE8>login/access</td>
<td width=70 align=LEFT bgcolor=#FFFFE8>192.168.30.41</td>
<td width=230 align=LEFT bgcolor=#FFFFE8>Supervision ACCESLIBRE</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:02</td>
<td width=80 align=LEFT bgcolor=#FFFFE8>2007-07-07</td>
<td width=60 align=LEFT bgcolor=#FFFFE8>00:00:12</td>
</tr>
</HTML>
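
A variation on the script above, offered as a sketch and not as part of the original reply: HTML::TableExtract can select columns by header text through its headers option, so the time column can be picked by name ("H debut") instead of by position 4. The helper name hours_by_header is hypothetical, and tables is, as far as I know, the method name in 2.x releases of the module (table_states being the older name for it).

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: returns one entry per data row of the table whose
# header row contains "H debut"; empty cells become empty strings.
sub hours_by_header {
    my ($html) = @_;

    # Only the columns under the listed headers are kept in each row.
    my $te = HTML::TableExtract->new( headers => ['H debut'] );
    $te->parse($html);
    $te->eof;

    my @hours;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            # A missing cell may come back undefined, so guard against it.
            my $cell = defined $row->[0] ? $row->[0] : '';
            push @hours, $cell =~ /(\d\d:\d\d:\d\d)/ ? $1 : '';
        }
    }
    return @hours;
}
```

Selecting by header should also keep the header row itself out of the results, and an empty string marks a row whose time is missing instead of letting a later value take its place.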

Seminex · Jul 11, 2007

Great, it works!
Two solutions, two ways of proceeding, that's great!

Thanks, all!

Thanks a lot!

I'm discovering HTML::TableExtract, and it's fun.

Thanks!
