Speeding up glob?

Discussion in 'Perl Misc' started by Jim, Apr 25, 2005.

  1. Jim

    Jim Guest

    Hi

    I have a very simple perl program that runs _very_ slowly. Here's my
    code:

    #!/usr/local/bin/perl
    #
    # script to keep only a week's worth of files
    #
    use File::stat;

    $time = time;

    # get list of all files in the backup directory
    @files = glob ("/backup/output.log*");

    unless (@files[0]) {
        print "No files to process\n";
        exit;
    }

    while (<@files>) {
        $filename = $_;
        $st = stat($_);

        $mod_time = $time - $st->mtime;

        # if file edit time is greater than x days, delete the file
        # 1440 minutes in a day
        # 86400 seconds in a day
        # 604800 seconds in a week
        # 2419200 seconds in a month
        # 7257600 seconds in 90 days

        if ($mod_time > 7257600) {
            print "Deleting file $filename\n";
            unlink ($filename);
        }
        else {
            # do nothing
        }
    }

    There are several thousand files (~21K) in this directory and many
    thousands of those files fit the criteria to delete. It takes a really
    long time to run this program. What's the holdup? Is it glob? My OS
    (Solaris 8)? IO? Any way to speed this up? Thanks.

    Jim
     
    Jim, Apr 25, 2005
    #1

  2. Jim

    Mark Clements Guest

    Jim wrote:
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    >
    > #!/usr/local/bin/perl
    > #
    > # script to keep only a weeks worth of files
    > #

    You need to run with strict and warnings turned on. Please read the
    posting guidelines (subject "Posting Guidelines for
    comp.lang.perl.misc"), which are posted regularly.

    > use File::stat;
    >
    > $time = time;
    >
    > # get list of all files in the backup directory
    > @files = glob ("/backup/output.log*");
    >
    > unless (@files[0]) {
    > print "No files to process\n";
    > exit;
    > }
    >
    > while (<@files>) {
    > $filename = $_;
    > $st = stat($_);

    <snip>

    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.

    Solaris doesn't (or didn't - I stand open to correction) perform very
    well with large directories on ufs. How long does, say, ls take to
    complete in this directory?

    Secondly, you can benchmark your programs using a number of different
    methods to work out where the bottlenecks are. Check out

    Benchmark::Timer
    Devel::DProf
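
    For example, a minimal sketch of timing the two phases separately with
    Benchmark::Timer (the tag names here are arbitrary):

    use strict;
    use warnings;
    use Benchmark::Timer;

    my $t = Benchmark::Timer->new;

    $t->start('glob');
    my @files = glob('/backup/output.log*');
    $t->stop('glob');

    $t->start('stat+unlink');
    # ... the stat/unlink loop goes here ...
    $t->stop('stat+unlink');

    print $t->report;    # mean wall-clock time per tag

    For a whole-program profile instead, run the script under the profiler
    and inspect the output with dprofpp:

    perl -d:DProf script.pl && dprofpp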

    regards,

    Mark
     
    Mark Clements, Apr 25, 2005
    #2

  3. Jim

    J. Gleixner Guest

    Jim wrote:

    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.


    Using File::Find or readdir, and processing & unlinking each file as it
    passes your test, would probably be better. It's similar to reading and
    processing each line of a file one at a time, compared to slurping in
    the entire file and then iterating through each line.

    The "fastest" option would be to just use find (man find), and you'll
    probably need to use xargs as well (man xargs).
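
    A minimal sketch of the readdir version, assuming the directory,
    filename pattern, and 90-day cutoff from the original post:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $dir    = '/backup';
    my $cutoff = 90 * 86400;    # seconds; adjust to the real retention
    my $now    = time;

    opendir my $dh, $dir or die "Can't open $dir: $!";
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /^output\.log/;
        my $path = "$dir/$name";
        my @st   = stat $path or do { warn "stat $path: $!"; next };
        if ($now - $st[9] > $cutoff) {    # $st[9] is mtime
            print "Deleting file $path\n";
            unlink $path or warn "Can't unlink $path: $!";
        }
    }
    closedir $dh;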
     
    J. Gleixner, Apr 25, 2005
    #3
  4. Jim

    Guest

    Jim <> wrote:
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    >

    ....
    > @files = glob ("/backup/output.log*");
    >
    > while (<@files>) {


    You are double globbing. Don't do that.
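
    (Anything inside <> that is not a filehandle is handed to glob(), so

    while (<@files>) { ... }

    is roughly equivalent to

    while (defined($_ = glob("@files"))) { ... }

    meaning all ~21K names get joined into one giant pattern and pushed
    back through the glob machinery, while a plain

    foreach my $filename (@files) { ... }

    just walks the list already in memory.)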

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 25, 2005
    #4
  5. Jim

    Tintin Guest

    "Jim" <> wrote in message
    news:...
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    <snip>
    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.


    The bottleneck is mostly going to be OS & I/O, but you could try

    find /backup -name "output.log*" -mtime +7 | xargs rm -f
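
    If it has to stay inside Perl, the same pipeline can be shelled out to
    (a sketch; note that find's -mtime +7 means "more than 7 days old",
    which matches the script's "week's worth" comment rather than its
    90-day constant):

    system(q{find /backup -name "output.log*" -mtime +7 | xargs rm -f}) == 0
        or warn "find/xargs pipeline failed: $?";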
     
    Tintin, Apr 25, 2005
    #5
  6. Jim

    peter pilsl Guest

    Jim wrote:
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    <snip>
    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.
    >
    > Jim



    just for comparison:

    I just wrote a small script that creates 20k empty files, stats each
    one, and deletes them again. It's pretty fast on my machine: Linux
    2.4.x on an Athlon 1800XP with 1 GB RAM and IDE disks in software RAID
    level 1, with loads of daemons running on it. So definitely not a
    machine with fast I/O.


    # time ./p.pl create
    0.18user 0.71system 0:00.88elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (357major+76minor)pagefaults 0swaps

    # time ./p.pl delete
    0.12user 1.18system 0:01.29elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (364major+820minor)pagefaults 0swaps

    So it's not the globbing itself, but maybe the double-globbing, as Xho
    already pointed out!


    Try the following on your machine:


    #!/usr/bin/perl -w
    use strict;

    # create 20001 empty files named x0 .. x20000
    if ($ARGV[0] =~ /create/) {
        foreach (0..20000) {
            open (FH, ">x$_") or die "open x$_: $!";
            close FH;
        }
    }

    # stat each one and unlink it again
    if ($ARGV[0] =~ /delete/) {
        my @files = glob("x*");
        foreach (@files) {
            stat($_);
            unlink($_);
        }
    }


    best,
    peter

    --
    http://www.goldfisch.at/know_list
     
    peter pilsl, Apr 25, 2005
    #6
  7. Jim

    Ala Qumsieh Guest

    Jim wrote:

    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:


    > unless (@files[0]) {


    This works. But you probably meant:

    unless (@files) {

    or
    unless ($files[0]) {

    Type this for more info on the diff between $files[0] and @files[0]:

    perldoc -q 'difference.*\$array'
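
    In short, @files[0] is a one-element array slice while $files[0] is a
    plain scalar element; with warnings enabled Perl points this out:

    use warnings;
    my @files = ('a.log', 'b.log');
    my $ok  = $files[0];    # scalar element - what was meant
    my $odd = @files[0];    # one-element slice; warns:
                            # "Scalar value @files[0] better written as $files[0]"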

    > while (<@files>) {


    This is doing much more work than you think it is. Change it to:

    foreach (@files) {

    --Ala
     
    Ala Qumsieh, Apr 25, 2005
    #7
  8. Jim

    Tad McClellan Guest

    Jim <> wrote:


    > unless (@files[0]) {



    You should always enable warnings when developing Perl code!


    > # if file edit time is greater than x days, delete the file
    > # 1440 minutes in a day
    > # 86400 seconds in a day
    > # 604800 seconds in a week
    > # 2419200 seconds in a month



    You do not need "in a week" nor "in a month".

    You already have how many in a day, multiply by 90 to get how
    many are in 90 days.


    > # 7257600 seconds in 90 days



    Wrong answer... (86400 * 90 = 7776000; 7257600 is 84 days, i.e. 12 weeks.)


    > if ($mod_time > 7257600) {



    if ($mod_time > 60 * 60 * 24 * 90) {


    Perl will constant-fold it for you.
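
    For readability the figure can also be named (a sketch; the constant
    name is invented here):

    use constant SECONDS_PER_90_DAYS => 60 * 60 * 24 * 90;    # 7_776_000

    if ($mod_time > SECONDS_PER_90_DAYS) {
        print "Deleting file $filename\n";
        unlink $filename;
    }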


    > There are several thousand files (~21K) in this directory



    Then the largest bottleneck is probably the OS and filesystem,
    not the programming language (though your algorithm seems
    sub-optimal too).


    > It takes a really
    > long time to run this program. What's the holdup?



    There are several thousand files (~21K) in that directory.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 25, 2005
    #8
  9. Jim

    Joe Smith Guest

    Jim wrote:

    > I have a very simple perl program that runs _very_ slowly.


    You posted this question earlier and have already gotten an
    answer. Why are you not accepting the answers already given?

    > while (<@files>) {


    Big error right there. '<' and '>' are *not* appropriate here.

    > $st = stat($_);
    > $mod_time = $time - $st->mtime;
    > # 1440 minutes in a day


    Why are you doing it that way? Have you not heard of -M?

    if (-M $_ > 7) { print "File $_ is older than 7.000 days\n"; }
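
    Spelling the whole test out with -M, which reports a file's age in days
    relative to script start time (a sketch; swap 7 for 84 or 90 depending
    on what the cutoff is really meant to be):

    foreach my $file (@files) {
        if (-M $file > 7) {
            print "Deleting file $file\n";
            unlink $file or warn "Can't unlink $file: $!";
        }
    }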

    -Joe
     
    Joe Smith, Apr 26, 2005
    #9
  10. Jim

    Jim Guest

    In article <ITcbe.1259$>,
    says...

    >
    > Type this for more info on the diff between $files[0] and @files[0]:
    >
    > perldoc -q 'difference.*\$array'
    >
    > > while (<@files>) {

    >
    > This is doing much more work than you think it is. Change it to:
    >
    > foreach (@files) {
    >



    Changing my while to a foreach has sped up the program considerably.
    Thanks to those for the help.

    Jim
     
    Jim, Apr 26, 2005
    #10
  11. Jim

    Jim Guest

    In article <>, says...
    > Jim wrote:
    >
    > > I have a very simple perl program that runs _very_ slowly.

    >
    > You posted this question earlier and have already gotten an
    > answer. Why are you not accepting the answers already given?
    >
    > > while (<@files>) {

    >
    > Big error right there. '<' and '>' are *not* appropriate here.
    >
    > > $st = stat($_);
    > > $mod_time = $time - $st->mtime;
    > > # 1440 minutes in a day

    >
    > Why are you doing it that way? Have you not heard of -M?
    >
    > if (-M $_ > 7) { print "File $_ is older than 7.000 days\n"; }
    >
    > -Joe
    >

    I hadn't posted this question earlier. This is the first time I've
    posted it. I hope. Unless I'm losing my mind. :)

    The foreach speeds things up noticeably. I'm using that now.

    No, I have not heard of -M(). Thanks for pointing it out to me.

    Jim
     
    Jim, Apr 26, 2005
    #11
  12. Jim

    Jim Guest

    In article <>,
    says...
    > You already have how many in a day, multiply by 90 to get how
    > many are in 90 days.
    >
    >
    > > # 7257600 seconds in 90 days

    >
    >
    > Wrong answer...


    I would argue that it's not the "wrong answer", just a different answer
    than you would've used. Six of one, half dozen of the other it seems to
    me.


    >
    > > if ($mod_time > 7257600) {

    >
    >
    > if ($mod_time > 60 * 60 * 24 * 90) {
    >
    >
    > Perl will constant-fold it for you.
    >
    >
    > > There are several thousand files (~21K) in this directory

    >
    >
    > Then the largest bottleneck is probably the OS and filesystem,
    > not the programming language (though your algorithm seems
    > sub-optimal too).
    >
    >
    > > It takes a really
    > > long time to run this program. What's the holdup?

    >
    >
    > There are several thousand files (~21K) in that directory.


    Using a foreach instead of my while (the "double globbing" issue, I
    believe it was called) sped things up noticeably.

    Thanks for your help.

    Jim
     
    Jim, Apr 26, 2005
    #12
  13. Jim

    Big and Blue Guest

    > Jim wrote:
    >.....
    >> There are several thousand files (~21K) in this directory


    Which is a cause of your problem. And it will cause you problems every
    time you try to access any of the files too.

    If you had 21000 documents you wouldn't throw them all into one drawer
    and expect to find one quickly. You'd arrange them by some category and
    put each category into a separate drawer. File systems have a directory
    hierarchy for just such an arrangement. If you use that facility you
    will find your code runs much faster.
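
    For these logs, that might mean something like one subdirectory per
    month (a sketch; the layout is invented):

    use POSIX qw(strftime);

    my $subdir = '/backup/' . strftime('%Y-%m', localtime);    # e.g. /backup/2005-04
    mkdir $subdir unless -d $subdir;
    # then write new logs as "$subdir/output.log.$$" or similar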



    --
    Just because I've written it doesn't mean that
    either you or I have to believe it.
     
    Big and Blue, Apr 27, 2005
    #13
  14. Jim

    peter pilsl Guest

    Big and Blue wrote:
    >
    > Which is a cause of your problem. And it will cause you problems
    > every time you try to access any of the files too.
    >
    > If you had 21000 documents you wouldn't throw them all into one
    > drawer and expect to find one quickly. You'd arrange them by some
    > category and put each category into a separate drawer. File system
    > have directory hierarchy for just such an arrangement. If you use the
    > facility you will find your code runs much faster.


    21000 files in one folder is not a big deal for modern filesystems.
    Like I wrote in my posting:

    creating 21k files takes about 1 second on my old server; getting the
    stat of each and unlinking it takes about 2 seconds.

    best,
    peter


    --
    http://www.goldfisch.at/know_list
     
    peter pilsl, Apr 27, 2005
    #14
  15. Jim

    Guest

    Big and Blue <> wrote:
    > > Jim wrote:
    > >.....
    > >> There are several thousand files (~21K) in this directory

    >
    > Which is a cause of your problem.


    Actually, that wasn't the cause of his problems.

    >
    > If you had 21000 documents you wouldn't throw them all into one
    > drawer and expect to find one quickly.


    Lots of things are done on computers differently than they are done
    by hand.

    > You'd arrange them by some
    > category and put each category into a separate drawer.


    Assuming that there are different categories to arrange them into.
    Maybe the name of the file is the optimal level of categorization
    that exists. In which case I would break them into different drawers only
    for the artificial reason that drawers only have a certain physical
    capacity.

    > File system have
    > directory hierarchy for just such an arrangement.


    Good file systems will let you use the natural arrangement rather
    than an artificial one. And if that means 20,000 files in a directory,
    so be it. On a good file system, it makes a negligible difference. Even
    on a bad file system, I suspect it makes far, far less difference than
    the double-globbing issue does.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 27, 2005
    #15
  16. Jim

    Anno Siegel Guest

    Jim <> wrote in comp.lang.perl.misc:
    > In article <ITcbe.1259$>,
    > says...
    >
    > >
    > > Type this for more info on the diff between $files[0] and @files[0]:
    > >
    > > perldoc -q 'difference.*\$array'
    > >
    > > > while (<@files>) {

    > >
    > > This is doing much more work than you think it is. Change it to:
    > >
    > > foreach (@files) {
    > >

    >
    >
    > Changing my while to a foreach has sped up the program considerably.
    > Thanks to those for the help.


    You're missing the point. The speed difference between while and foreach
    is marginal. Globbing all the filenames (again) in

    while ( <@files> ) {

    is what kills it.

    Anno
     
    Anno Siegel, Apr 29, 2005
    #16
