To extract numbers from files with Perl

Luca Villa

I have thousands of files named like these:

c:\input\pumico-home.html
c:\input\ofofo-home.html
c:\input\cimaba-office.html
c:\input\plata-home.html
c:\input\plata-office.html
c:\input\zito-home.html

I need a Perl script that, only for the files that match
"c:\input\*-home.html", performs some regular expression extractions like
in these two examples:

for a "pumico-home.html" that contains:
ziritabcdef12.80tttcucurullumnopq1zzzspugnizuabcdef1.25tttcantabarramnopq2zzzlocomotoabcdef0.32tttyamazetamnopq1zzz

it generates a "pumico-home-extract.txt" file that contains these
three couples of numbers, delimited by "|":
12.80|1|1.25|2|0.32|1

for a "ofofo-home.html" that contains:
lumabcdef7.44tttcimizetamnopq3zzzpupopoabcdef5.11tttpletoramnopq2zzz

it generates a "ofofo-home-extract.txt" file that contains these two
couples of numbers, delimited by "|":
7.44|3|5.11|2

Note that the numbers always come in couples, as in the examples. The
number of couples in each source file can vary from one to hundreds.


I already found the regular expressions that extract the numbers:
abcdef(\d+\.\d\d)ttt
mnopq(\d+)zzz
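
As a minimal sketch of how the two patterns above could be combined: a single global match with both capture groups, evaluated in list context, returns the numbers couple by couple. The sub name extract_pairs is hypothetical, and the lazy ".*?" between the two patterns assumes that each decimal number is followed by its integer before the next decimal number starts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all captured numbers, in order, joined by "|".
sub extract_pairs {
    my ($text) = @_;
    # In list context, /g returns every capture of every match:
    # ($1, $2) of the first couple, then of the second, and so on.
    # /s lets "." cross newlines in case a couple spans two lines.
    my @numbers = $text =~ /abcdef(\d+\.\d\d)ttt.*?mnopq(\d+)zzz/gs;
    return join '|', @numbers;
}

my $sample = 'lumabcdef7.44tttcimizetamnopq3zzzpupopoabcdef5.11tttpletoramnopq2zzz';
print extract_pairs($sample), "\n";    # prints 7.44|3|5.11|2
```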

I'm stuck on the rest... (including file handling...)


Thanks in advance for any help
 
Luca Villa

quasi-solution:

{local @ARGV=<c:/input/*-home.html>; local $^I='.extract.txt'; local $\=$/;
while( <> ){
print join'|',/([\d.]+)/g if /\d/
}
}

This is still not the solution because it puts the new file in
pumico-home.html and the old file in pumico-home.html.extract.txt
 
Michele Dondi

Luca Villa said:
I need a Perl script that only for the files of those that match
"c:\input\*-home.html" performs some regular expression extractions like
in this two examples:

You can directly use glob().

Luca Villa said:
for a "pumico-home.html" that contains:
ziritabcdef12.80tttcucurullumnopq1zzzspugnizuabcdef1.25tttcantabarramnopq2zzzlocomotoabcdef0.32tttyamazetamnopq1zzz

it generates a "pumico-home-extract.txt" file that contains these

perldoc -f open

Luca Villa said:
three couples of numbers, delimited by "|":
12.80|1|1.25|2|0.32|1

local ($,,$\)=("|", "\n");
print /\d+(?:\.\d+)?/g;

Luca Villa said:
I'm stuck on the rest... (including file handling...)

That is in the docs.
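
A small sketch of what the "local ($,,$\)" and "print" lines in this reply do, applied to a sample string from the question: $, is the output field separator and $\ is the output record separator. The variable names $line and @numbers are only illustrative.

```perl
use strict;
use warnings;

my $line = 'lumabcdef7.44tttcimizetamnopq3zzzpupopoabcdef5.11tttpletoramnopq2zzz';

# In list context the /g match returns every number found in $line.
my @numbers = $line =~ /\d+(?:\.\d+)?/g;

{
    # $, is printed between the items of a print list,
    # $\ is appended after each print statement.
    # local restores both when the block ends.
    local ( $,, $\ ) = ( '|', "\n" );
    print @numbers;    # prints 7.44|3|5.11|2
}
```

Note that this pattern picks up every number in the text, not only those between the "abcdef...ttt" and "mnopq...zzz" markers, so it relies on the files containing no other digits.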


Michele
 
Tad McClellan

Luca Villa said:
quasi-solution:

{local @ARGV=<c:/input/*-home.html>; local $^I='.extract.txt'; local $\=$/;
                                           ^^^
                                           ^^^
That turns on inplace editing.

while( <> ){
print join'|',/([\d.]+)/g if /\d/
}
}

This is still not the solution because it puts the new file in
pumico-home.html and the old file in pumico-home.html.extract.txt


That's what inplace editing is supposed to do.

If that is not what you wanted done, then you should not have
turned on inplace editing, in which case, you would have to
handle the file naming in your own code.


# untested
foreach my $fname ( glob 'c:/input/*-home.html' ) {
    # pumico-home.html -> pumico-home-extract.txt
    (my $outname = $fname) =~ s/\.html$/-extract.txt/;
    open my $extract, '>', $outname or die "could not open '$outname' $!";

    local @ARGV = $fname;    # make <> read only this file
    local $\ = $/;           # append a newline to every print
    while( <> ){
        next unless /\d/;    # skip lines without digits
        print {$extract} join( '|', /([\d.]+)/g );
    }

    close $extract;
}
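
A variant of the same idea, as a sketch only: it uses the two marker patterns from the original question instead of "[\d.]+", so stray digits or dots elsewhere in the HTML are ignored, and it reads each file in one piece so that all couples end up on one output line. The sub name write_extract is hypothetical, and the combined pattern assumes each decimal number is followed by its integer.

```perl
use strict;
use warnings;

# Hypothetical helper: writes "<name>-extract.txt" next to $fname
# and returns the name of the file it wrote.
sub write_extract {
    my ($fname) = @_;
    ( my $outname = $fname ) =~ s/\.html$/-extract.txt/;

    # Slurp the whole file so couples that span lines are still found.
    open my $in, '<', $fname or die "could not open '$fname': $!";
    my $html = do { local $/; <$in> };
    close $in;

    # List-context /g returns ($1, $2) for every couple, in order.
    my @numbers = $html =~ /abcdef(\d+\.\d\d)ttt.*?mnopq(\d+)zzz/gs;

    open my $out, '>', $outname or die "could not open '$outname': $!";
    print {$out} join( '|', @numbers ), "\n";
    close $out or die "could not close '$outname': $!";
    return $outname;
}

write_extract($_) for glob 'c:/input/*-home.html';
```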
 
Michele Dondi

Tad McClellan said:
That's what inplace editing is supposed to do.

If that is not what you wanted done, then you should not have
turned on inplace editing, in which case, you would have to
handle the file naming in your own code.

Speaking of which, the wild feature request of the day is: $^I could
take a subref that is passed a string (the original filename)
and returns a modified string.
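
Since $^I only accepts a backup-suffix string, the idea can only be approximated in user code. A rough sketch, in which filter_files is a hypothetical helper and not a Perl built-in: one subref maps each input filename to an output filename, and a second one turns each input line into the text to write.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: $^I itself only takes a string.
# $namer maps an input filename to an output filename; $filter maps
# each input line to the text to write (or undef to skip the line).
sub filter_files {
    my ( $namer, $filter, @files ) = @_;
    for my $fname (@files) {
        my $outname = $namer->($fname);
        open my $in,  '<', $fname   or die "could not open '$fname': $!";
        open my $out, '>', $outname or die "could not open '$outname': $!";
        while ( my $line = <$in> ) {
            my $result = $filter->($line);
            print {$out} $result if defined $result;
        }
        close $in;
        close $out or die "could not close '$outname': $!";
    }
    return;
}

filter_files(
    sub { ( my $name = shift ) =~ s/\.html$/-extract.txt/; $name },
    sub { $_[0] =~ /\d/ ? join( '|', $_[0] =~ /([\d.]+)/g ) . "\n" : undef },
    glob('c:/input/*-home.html'),
);
```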


Michele
 