To extract numbers from files with Perl

Discussion in 'Perl Misc' started by Luca Villa, Nov 11, 2007.

  1. Luca Villa

    Luca Villa Guest

    I have thousands of files named like these:

    c:\input\pumico-home.html
    c:\input\ofofo-home.html
    c:\input\cimaba-office.html
    c:\input\plata-home.html
    c:\input\plata-office.html
    c:\input\zito-home.html

    I need a Perl script that, only for the files matching
    "c:\input\*-home.html", performs some regular expression extractions,
    as in these two examples:

    for a "pumico-home.html" that contains:
    ziritabcdef12.80tttcucurullumnopq1zzzspugnizuabcdef1.25tttcantabarramnopq2zzzlocomotoabcdef0.32tttyamazetamnopq1zzz

    it generates a "pumico-home-extract.txt" file that contains these
    three pairs of numbers, delimited by "|":
    12.80|1|1.25|2|0.32|1

    for an "ofofo-home.html" that contains:
    lumabcdef7.44tttcimizetamnopq3zzzpupopoabcdef5.11tttpletoramnopq2zzz

    it generates an "ofofo-home-extract.txt" file that contains these two
    pairs of numbers, delimited by "|":
    7.44|3|5.11|2

    Note that the numbers always come in pairs, as in the examples. The
    number of pairs in each source file can vary from one to hundreds.


    I already found the regular expressions that extract the numbers:
    abcdef(\d+\.\d\d)ttt
    mnopq(\d+)zzz

    I'm stuck on the rest... (including file handling...)
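
One way the two patterns above could be combined, as a minimal sketch: a single match with /g in list context returns every captured number in order. The helper name extract_pairs is hypothetical, and the sketch assumes the two kinds of numbers strictly alternate, as the post says they do.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the numbers found in
# one string, in order of appearance, using the two patterns from the post.
sub extract_pairs {
    my ($text) = @_;
    # Combined pattern: the first number between "abcdef" and "ttt", then
    # (non-greedily) the next number between "mnopq" and "zzz".
    # In list context, /g returns the captures of every match.
    return $text =~ /abcdef(\d+\.\d\d)ttt.*?mnopq(\d+)zzz/gs;
}

my $sample = 'lumabcdef7.44tttcimizetamnopq3zzzpupopoabcdef5.11tttpletoramnopq2zzz';
print join('|', extract_pairs($sample)), "\n";   # prints 7.44|3|5.11|2
```

Because each match consumes one number of each kind, a first number with no second number after it would be silently skipped rather than reported.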


    Thanks in advance for any help
     
    Luca Villa, Nov 11, 2007
    #1


  2. Luca Villa

    Luca Villa Guest

    quasi-solution:

    {local @ARGV=<c:/input/*-home.html>; local $^I='.extract.txt'; local $\=$/;
    while( <> ){
    print join'|',/([\d.]+)/g if /\d/
    }
    }

    This is still not the solution because it puts the new file in
    pumico-home.html and the old file in pumico-home.html.extract.txt
     
    Luca Villa, Nov 11, 2007
    #2

  3. On Sun, 11 Nov 2007 08:58:46 -0800, Luca Villa wrote:

    >I need a Perl script that only for the files of those that match "c:
    >\input\*-home.html" performs some regular expression extractions like
    >in this two examples:


    You can directly use glob().

    >for a "pumico-home.html" that contains:
    >ziritabcdef12.80tttcucurullumnopq1zzzspugnizuabcdef1.25tttcantabarramnopq2zzzlocomotoabcdef0.32tttyamazetamnopq1zzz
    >
    >it generates a "pumico-home-extract.txt" file that contains these


    perldoc -f open

    >three couples of numbers, delimited by "|":
    >12.80|1|1.25|2|0.32|1


    local ($,,$\)=("|", "\n");
    print /\d+(?:\.\d+)?/g;
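
To see what those two variables do, here is a small sketch. `$,` is the separator print puts between its arguments and `$\` is what it appends at the end. The helper name format_numbers is hypothetical, and the in-memory filehandle is used only so the result can be inspected as a string.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): formats the numbers found in
# $line the way print does with $, = "|" and $\ = "\n", and returns the
# resulting text instead of writing it to STDOUT.
sub format_numbers {
    my ($line) = @_;
    my $out = '';
    open my $fh, '>', \$out or die "cannot open in-memory handle: $!";
    {
        # Output field separator and output record separator, localized
        # so the change does not leak out of this block.
        local ($,, $\) = ('|', "\n");
        print {$fh} $line =~ /\d+(?:\.\d+)?/g;
    }
    close $fh;
    return $out;
}

print format_numbers('lumabcdef7.44tttcimizetamnopq3zzz');   # prints 7.44|3 and a newline
```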

    >I'm stuck on the rest... (including file handling...)


    That is in the docs.


    Michele
    --
    {$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
    (($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
    ..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
    256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
     
    Michele Dondi, Nov 11, 2007
    #3
  4. Luca Villa wrote:
    > quasi-solution:
    >
    > {local @ARGV=<c:/input/*-home.html>; local $^I='.extract.txt'; local $

                                                 ^^^
                                                 ^^^
    That turns on inplace editing.


    > \=$/;
    > while( <> ){
    > print join'|',/([\d.]+)/g if /\d/
    > }
    > }
    >
    > This is still not the solution because it puts the new file in
    > pumico-home.html and the old file in pumico-home.html.extract.txt



    That's what inplace editing is supposed to do.

    If that is not what you wanted done, then you should not have
    turned on inplace editing, in which case, you would have to
    handle the file naming in your own code.


    # untested
    foreach my $fname ( glob 'c:/input/*-home.html' ) {
        # pumico-home.html -> pumico-home-extract.txt
        (my $outname = $fname) =~ s/\.html$/-extract.txt/;
        open my $extract, '>', $outname or die "could not open '$outname' $!";

        local @ARGV = $fname;   # make <> read just this one file
        local $\ = $/;          # append a newline to every print
        while( <> ){
            next unless /\d/;
            print {$extract} join( '|', /([\d.]+)/g );
        }

        close $extract;
    }
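
A variation on the same idea, as a sketch only: it slurps each file so that all pairs land on one output line, and it uses the two specific patterns from the original post instead of matching every digit run. The helper name extract_dir is hypothetical. It assumes the directory path contains no whitespace (glob would split on it) and that the two kinds of numbers strictly alternate.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): for every FOO-home.html in
# $dir, writes FOO-home-extract.txt next to it and returns the names of
# the files written. Assumes $dir contains no whitespace.
sub extract_dir {
    my ($dir) = @_;
    my @written;
    for my $fname ( glob "$dir/*-home.html" ) {
        # Slurp the whole file so all pairs end up on a single line.
        open my $in, '<', $fname or die "could not open '$fname': $!";
        my $text = do { local $/; <$in> };
        close $in;

        # Both patterns from the original post, combined into one match.
        my @numbers = $text =~ /abcdef(\d+\.\d\d)ttt.*?mnopq(\d+)zzz/gs;

        (my $outname = $fname) =~ s/\.html$/-extract.txt/;
        open my $out, '>', $outname or die "could not open '$outname': $!";
        print {$out} join('|', @numbers), "\n";
        close $out or die "could not close '$outname': $!";
        push @written, $outname;
    }
    return @written;
}

# Usage, with the directory from the original post:
# extract_dir('c:/input');
```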


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad McClellan, Nov 12, 2007
    #4
  5. On Mon, 12 Nov 2007 01:39:46 GMT, Tad McClellan wrote:

    >That's what inplace editing is supposed to do.
    >
    >If that is not what you wanted done, then you should not have
    >turned on inplace editing, in which case, you would have to
    >handle the file naming in your own code.


    Speaking of which, the wild feature request of the day is: $^I could
    take a subref which would be passed a string (the original filename)
    and should return a modified string.
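
Perl's $^I does not accept a subref, but the idea can be approximated in ordinary code. The sketch below is only an illustration: filter_to, its argument order, and the two callbacks are all hypothetical names, not an existing interface.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl feature): for each file in @$files, asks
# $namer for the output filename, then writes $transform->($line) for
# every input line to that file. Returns the names of the files written.
sub filter_to {
    my ($namer, $transform, $files) = @_;
    my @written;
    for my $fname (@$files) {
        my $outname = $namer->($fname);
        die "refusing to overwrite '$fname'" if $outname eq $fname;
        open my $in,  '<', $fname   or die "could not open '$fname': $!";
        open my $out, '>', $outname or die "could not open '$outname': $!";
        while ( my $line = <$in> ) {
            print {$out} $transform->($line);
        }
        close $in;
        close $out or die "could not close '$outname': $!";
        push @written, $outname;
    }
    return @written;
}

# Usage, with the glob and the renaming rule discussed in this thread:
# filter_to(
#     sub { (my $n = shift) =~ s/\.html$/-extract.txt/; $n },
#     sub { my @n = $_[0] =~ /([\d.]+)/g; @n ? join('|', @n) . "\n" : '' },
#     [ glob 'c:/input/*-home.html' ],
# );
```

Unlike in-place editing, this leaves the original files untouched, which is what the original poster was after.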


    Michele
     
    Michele Dondi, Nov 12, 2007
    #5
