the fastest way to create a directory

Discussion in 'Perl Misc' started by George Mpouras, Jul 15, 2013.

  1. Create a directory with all upper directories if missing.
    it uses the minimum possible disk access and checks.




    Mkdir_recursive('/some/dir/d1/d2') or die;

    sub Mkdir_recursive
    {
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    mkdir $_[0] || return undef
    }
     
    George Mpouras, Jul 15, 2013
    #1
    1. Advertising

  2. Ο "Henry Law" έγÏαψε στο μήνυμα
    news:...

    On 15/07/13 12:09, George Mpouras wrote:
    > Create a directory with all upper directories if missing.


    Use File::path.


    no it is slow
     
    George Mpouras, Jul 15, 2013
    #2
    1. Advertising

  3. >>>>> "GM" == George Mpouras
    >>>>> <>
    >>>>> writes:


    GM> Ο "Henry Law" έγÏαψε στο μήνυμα
    GM> news:...

    GM> On 15/07/13 12:09, George Mpouras wrote:
    >> Create a directory with all upper directories if missing.


    GM> Use File::path.

    GM> no it is slow

    If you create so many directories that the execution time of File::path
    is greater than the debugging time of roll-your-own code, Perl is
    probably the wrong langauge for the task.

    Charlton


    --
    Charlton Wilbur
     
    Charlton Wilbur, Jul 15, 2013
    #3
  4. George Mpouras

    Tim McDaniel Guest

    In article <ks0l7j$9j4$>,
    George Mpouras <> wrote:
    >Create a directory with all upper directories if missing.


    On Linux-like systems,
    system("mkdir", "-p", $dir);
    Unless you're creating millions of directories per day, doing it with
    a known way is far faster and far more reliable than trying to code
    your own.

    --
    Tim McDaniel,
     
    Tim McDaniel, Jul 15, 2013
    #4
  5. yes there are about thousands dirs. try to undestand the code. looks simple
    but it is not
     
    George Mpouras, Jul 15, 2013
    #5
  6. I have found that usingn Perl with clever code you can be faster than the
    usual c
    of cource if you write the same code with C it will be faster but then, the
    deadlines will pass !
     
    George Mpouras, Jul 15, 2013
    #6
  7. On 2013-07-15 15:45, George Mpouras <> wrote:
    > Ο "Henry Law" έγÏαψε στο μήνυμα
    > news:...
    >
    > On 15/07/13 12:09, George Mpouras wrote:
    >> Create a directory with all upper directories if missing.

    >
    > Use File::path.
    >
    >
    > no it is slow


    On my system it is exactly as fast as your version, creating 2*70644
    directories.

    Here is the test program:


    #!/usr/bin/perl
    use warnings;
    use strict;
    use File::path qw(make_path);
    use Time::HiRes qw(time);

    $| = 1;

    my $n = $ARGV[0];
    {
    my $t0 = time;
    for my $i (0 .. $n) {
    for my $j (0 .. $n) {
    for my $k (0 .. $n) {
    make_path("t1/$i/$j/$k");
    }
    }
    printf("%3d %9.6f\n", $i, time - $t0);
    }
    }
    {
    my $t0 = time;
    for my $i (0 .. $n) {
    for my $j (0 .. $n) {
    for my $k (0 .. $n) {
    make_path("t2/$k/$j/$i");
    }
    }
    printf("%3d %9.6f\n", $i, time - $t0);
    }
    }
    __END__

    (second test program uses your Mkdir_recursive instead of make_path)

    With $n = 40, The first loop takes about 20 seconds, the second about 24
    seconds for both programs. As expected, they are completely disk-bound.
    strace shows that they are also performing essentially the same system
    calls.

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Jul 15, 2013
    #7
  8. On 2013-07-15 21:04, George Mpouras <> wrote:
    > yes there are about thousands dirs. try to undestand the code. looks
    > simple but it is not


    What? It's trivial. It also has a bug: It doesn't perform silently under
    use warnings. Ok, so maybe it isn't trivial after all ...

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Jul 15, 2013
    #8
  9. George Mpouras

    Tim McDaniel Guest

    In article <ks1o4c$5jj$>,
    George Mpouras <> wrote:
    >yes there are about thousands dirs. try to undestand the code. looks simple
    >but it is not


    It *IS* simple. Microsecond efficiency is highly unlikely to be a
    problem for you, so you should, in almost all cases, so with the
    reliable solution.

    --
    Tim McDaniel,
     
    Tim McDaniel, Jul 15, 2013
    #9
  10. On 2013-07-15 22:40, Ben Morrow <> wrote:
    > Quoth George Mpouras <>:
    >> Create a directory with all upper directories if missing.
    >> it uses the minimum possible disk access and checks.
    >>
    >> Mkdir_recursive('/some/dir/d1/d2') or die;
    >>
    >> sub Mkdir_recursive
    >> {
    >> return 1 if $_[0] eq '' || -d $_[0];
    >> Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    >> mkdir $_[0] || return undef
    >> }

    >
    > You'd be better off calling mkdir blind and keying off $! if it fails.
    > That way you save a stat in the case where the creation succeeds.


    That shouldn't make a noticeable difference. If the stat does cause any
    disk accesses, those would also have been caused by the mkdir, and if it
    doesn't (i.e. everything is already in the cache) the time for the stat
    calls is completely swamped by the mkdir's.

    To my surprise the second loop of my test program seems actually to be
    a bit faster with a blind mkdir, but the difference is less than the
    variability, so I'd need more runs to see if the difference is
    significant.

    hp


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Jul 16, 2013
    #10
  11. Please write the code you test
     
    George Mpouras, Jul 17, 2013
    #11
  12. Please write the code you test
     
    George Mpouras, Jul 17, 2013
    #12
  13. On 2013-07-17 08:09, George Mpouras <> wrote:
    > Please write the code you test


    Please quote relevant parts of the postings you are replying to.

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Jul 17, 2013
    #13
  14. "Peter J. Holzer" <> writes:
    > On 2013-07-15 22:40, Ben Morrow <> wrote:
    >> Quoth George Mpouras <>:
    >>> Create a directory with all upper directories if missing.
    >>> it uses the minimum possible disk access and checks.
    >>>
    >>> Mkdir_recursive('/some/dir/d1/d2') or die;
    >>>
    >>> sub Mkdir_recursive
    >>> {
    >>> return 1 if $_[0] eq '' || -d $_[0];
    >>> Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    >>> mkdir $_[0] || return undef
    >>> }

    >>
    >> You'd be better off calling mkdir blind and keying off $! if it fails.
    >> That way you save a stat in the case where the creation succeeds.

    >
    > That shouldn't make a noticeable difference. If the stat does cause any
    > disk accesses, those would also have been caused by the mkdir, and if it
    > doesn't (i.e. everything is already in the cache) the time for the stat
    > calls is completely swamped by the mkdir's.


    Both stat and mkdir are system calls and 'one system call' is going to
    be faster than 'two system calls'. 'In certain situations' (eg, for
    the FFS and UFS filesystems or one of the Linux ext? mounted with the
    dirsync option), mkdir will be executed synchronously and then, it is
    going to take a 'long' time (possibly, an infinitely/ arbitrary long
    time for storage devices expected to fail routinely during normal
    operation where the error handling method is 'retry until successful
    or powered down' aka SSDs). But all these ought to be regarded as
    corner case and "everything's in the cache" the general one and in
    this case, Ben's suggestion is sensible if it is expected that
    directories a more often created than not created.

    I don't think I'd use recursion for that because an iterative
    equivalent is still fairly simple, example

    ------------------
    use Errno qw(EEXIST ENOTDIR);

    sub mkdir_p
    {
    my ($cur, $next, $remain);

    $remain = $_[0];
    while ($remain) {
    ($next, $remain) = $remain =~ /^(\/*[^\/]*)(.*)/;
    $cur .= $next;

    mkdir($cur) or do {
    return if $! != EEXIST;
    next if $remain;

    -d($cur) or $! = ENOTDIR, return;
    };
    }

    return 1;
    }

    mkdir_p($ARGV[0]) or die("$!");
    --------------------

    NB: I didn't perform any benchmarks on this. But it works with
    pathnames a la /////b/c///d/e//g while the original doesn't (OTOH, it
    doesn't support MS-DOSE [German for 'tin'] but that's not my problem
    :).
     
    Rainer Weikusat, Jul 17, 2013
    #14
  15. On 2013-07-17 09:49, Rainer Weikusat <> wrote:
    > "Peter J. Holzer" <> writes:
    >> On 2013-07-15 22:40, Ben Morrow <> wrote:
    >>> Quoth George Mpouras <>:
    >>>> sub Mkdir_recursive
    >>>> {
    >>>> return 1 if $_[0] eq '' || -d $_[0];
    >>>> Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    >>>> mkdir $_[0] || return undef
    >>>> }
    >>>
    >>> You'd be better off calling mkdir blind and keying off $! if it fails.
    >>> That way you save a stat in the case where the creation succeeds.

    >>
    >> That shouldn't make a noticeable difference. If the stat does cause any
    >> disk accesses, those would also have been caused by the mkdir, and if it
    >> doesn't (i.e. everything is already in the cache) the time for the stat
    >> calls is completely swamped by the mkdir's.

    >
    > Both stat and mkdir are system calls and 'one system call' is going to
    > be faster than 'two system calls'.


    As stated, that's trivially untrue ('one call to exec(2)' will *not* be
    faster than 'two calls to time(2)' except under pathological
    circumstances), but even if I translate that into 'an additional system
    call will take additional time', it's not necessarily true.

    In this case I think that the stat system call would normally add a
    little time but it will be completely swamped by the time spent in
    mkdir:

    Each successful mkdir call will cause at least 5 disk accesses on a
    typical Linux file system: 1 for the journal, 2 for the inode and
    content of the parent directory and 2 for the inode and content of the
    new directory (Oh, I forgot the bitmaps, add another 2 or 4 ...). These
    will happen *after* mkdir returns because of the writeback cache, and
    the kernel will almost certainly succeed in coalescing at least some and
    maybe many of those writes, but if you create a lot of directories
    (George wrote about "thousands", in my tests I created about 150000)
    these writes will eventually dominate.

    Now, in addition to writing new blocks, where does the pair
    stat($d); mkdir($d)
    spend time?

    If the ancestor directories of $d are in cache (that would be the normal
    case), both stat and mkdir will walk exactly the same in-memory
    structure until they fail to find $d. So, yes, that part will be
    uselessly duplicated, but it's very fast compared to actually writing a
    new directory to the disk, so the extra time is negligible.

    If the ancestor directories of $d are not in cache, stat will load them
    into the cache, which may take a noticable time. But that time will then
    be saved by mkdir which can now use the cache instead of loading the
    directories itself: So again the difference is one walk through
    in-memory structures, which is insignificant compared to loading the
    structures from disk and then writing a new directory (which will happen
    anyway).

    The ratios will be different depending on the relative speed of RAM and
    storage: Maybe SSDs are fast enough that the additional walk through the
    cache is noticable, but I doubt it. Of course anybody is free to post
    benchmark results to prove me wrong.

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Jul 17, 2013
    #15
  16. your code is not good. It ALWAYS access the hard disk as many times as the
    number of upper dirs .
    This completly uneccassery .
    To check ti with your own eyes run:



    #!/usr/bin/perl
    use Errno qw(EEXIST ENOTDIR);
    my $access=0;
    mkdir_p("/opt/d1/d2/d3/d4/d5") or die("$!");
    print "disk touches : $access\n";


    sub mkdir_p
    {
    my ($cur, $next, $remain);
    $remain = $_[0];

    while ($remain) {
    ($next, $remain) = $remain =~ /^(\/*[^\/]*)(.*)/;
    $cur .= $next;

    $access++;

    mkdir($cur) or do {
    return if $! != EEXIST;
    next if $remain;

    -d($cur) or $! = ENOTDIR, return;
    };
    }

    return 1;
    }
     
    George Mpouras, Jul 17, 2013
    #16
  17. Ο "Peter J. Holzer" έγÏαψε στο μήνυμα
    news:...

    On 2013-07-15 21:04, George Mpouras
    <> wrote:
    > yes there are about thousands dirs. try to undestand the code. looks
    > simple but it is not


    What? It's trivial.


    probably you understand nothing at all
     
    George Mpouras, Jul 17, 2013
    #17
  18. George Mpouras

    Tim McDaniel Guest

    In article <ks71fk$2rgk$>,
    George Mpouras <> wrote:
    >your code is not good. It ALWAYS access the hard disk as many times as the
    >number of upper dirs .


    I think you don't understand disk caching. Any reasonable modern
    system will cache data.

    --
    Tim McDaniel,
     
    Tim McDaniel, Jul 17, 2013
    #18
  19. "Peter J. Holzer" <> writes:
    > On 2013-07-17 09:49, Rainer Weikusat <> wrote:
    >> "Peter J. Holzer" <> writes:
    >>> On 2013-07-15 22:40, Ben Morrow <> wrote:
    >>>> Quoth George Mpouras <>:
    >>>>> sub Mkdir_recursive
    >>>>> {
    >>>>> return 1 if $_[0] eq '' || -d $_[0];
    >>>>> Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    >>>>> mkdir $_[0] || return undef
    >>>>> }
    >>>>
    >>>> You'd be better off calling mkdir blind and keying off $! if it fails.
    >>>> That way you save a stat in the case where the creation succeeds.
    >>>
    >>> That shouldn't make a noticeable difference. If the stat does cause any
    >>> disk accesses, those would also have been caused by the mkdir, and if it
    >>> doesn't (i.e. everything is already in the cache) the time for the stat
    >>> calls is completely swamped by the mkdir's.

    >>
    >> Both stat and mkdir are system calls and 'one system call' is going to
    >> be faster than 'two system calls'.

    >
    > As stated, that's trivially untrue ('one call to exec(2)' will *not* be
    > faster than 'two calls to time(2)' except under pathological
    > circumstances),


    Yes. And a single call to pause(2) will take forever except 'under
    pathological circumstances'. Another interesting (Linux-)system call would be
    reboot because not only doesn't it ever return, it might even cause
    the system to be powered down!

    Morale: Thanks to inherent vagueness in natural language and the
    sloppy ways people usually use it, anything can be interpreted in a
    way which doesn't make sense.

    > but even if I translate that into 'an additional system
    > call will take additional time', it's not necessarily true.


    What about trying the intended meaning: Calling stat and mkdir is
    going to take more time than calling stat and not mkdir or mkdir and
    not stat (the two system calls in question).

    BTW: Should I immediately supply a fresh round of nonsensical
    interpretations?

    [...]

    > Each successful mkdir call will cause at least 5 disk accesses on a
    > typical Linux file system: 1 for the journal, 2 for the inode and
    > content of the parent directory and 2 for the inode and content of the
    > new directory (Oh, I forgot the bitmaps, add another 2 or 4 ...). These
    > will happen *after* mkdir returns because of the writeback cache, and
    > the kernel will almost certainly succeed in coalescing at least some and
    > maybe many of those writes, but if you create a lot of directories
    > (George wrote about "thousands", in my tests I created about 150000)
    > these writes will eventually dominate.


    [more of this]

    As I already tried to communicate in the earlier posting: 'Disk I/O'
    is in itself a pathological situation or at least one the kernel tries
    very hard to avoid. Since it is going to be slower than pretty much
    everything else, talking about the execution speed of different
    algorithms becomes completely meaningless then. Consequently, it
    should be disregarded when doing this.
     
    Rainer Weikusat, Jul 17, 2013
    #19
  20. "George Mpouras"
    <>
    writes:
    > your code is not good. It ALWAYS access the hard disk as many times as
    > the number of upper dirs .


    Believe it or not but I understand how this algorithm works because I
    wrote it. Anything like this depends heavily on what the expected
    situation happens to be. Eg, if I run your code with an argument of
    /tmp/a/b/c/d/e, assuming nothing except /tmp exists initially, it will
    do the following system calls:

    stat("/tmp/a/b/c/d/e", 0x602130) = -1 ENOENT (No such file or directory)
    stat("/tmp/a/b/c/d", 0x602130) = -1 ENOENT (No such file or directory)
    stat("/tmp/a/b/c", 0x602130) = -1 ENOENT (No such file or directory)
    stat("/tmp/a/b", 0x602130) = -1 ENOENT (No such file or directory)
    stat("/tmp/a", 0x602130) = -1 ENOENT (No such file or directory)
    stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=94208, ...}) = 0
    mkdir("/tmp/a", 0777) = 0
    mkdir("/tmp/a/b", 0777) = 0
    mkdir("/tmp/a/b/c", 0777) = 0
    mkdir("/tmp/a/b/c/d", 0777) = 0
    mkdir("/tmp/a/b/c/d/e", 0777) = 0

    OTOH, the mkdir_p I posted will just do

    mkdir("/tmp", 0777) = -1 EEXIST (File exists)
    mkdir("/tmp/a", 0777) = 0
    mkdir("/tmp/a/b", 0777) = 0
    mkdir("/tmp/a/b/c", 0777) = 0
    mkdir("/tmp/a/b/c/d", 0777) = 0
    mkdir("/tmp/a/b/c/d/e", 0777) = 0
     
    Rainer Weikusat, Jul 17, 2013
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    1
    Views:
    403
  2. =?Utf-8?B?cmdoYXRvbA==?=

    Re: fastest way to access global vars

    =?Utf-8?B?cmdoYXRvbA==?=, Mar 5, 2004, in forum: ASP .Net
    Replies:
    1
    Views:
    375
  3. Andreas Klemt
    Replies:
    1
    Views:
    465
    Steve C. Orr, MCSD
    Aug 10, 2003
  4. Replies:
    14
    Views:
    1,441
    Roedy Green
    Aug 24, 2007
  5. Seth Brundle
    Replies:
    4
    Views:
    455
    A. Sinan Unur
    Sep 22, 2005
Loading...

Share This Page