# Let's talk a little more about directories


#### George Mpouras

Ok, let's talk a little more about creating directories (or any other
trie-structure entries).
There are three basic methods:

1) The simple one: just create them from top to bottom. This always gives
us N disk accesses. Silly, but do not underestimate it, because it does
not consume CPU.

2) The smarter one: search from bottom to top for the first existing node
and start creating from that point downward; this gives us N accesses only
in the worst case.

3) The wisest approach: build an array of all the upper nodes, then
perform a binary search on this array using a node-existence check.
This gives us log(N) accesses in the worst case.

So what is the best solution? Unfortunately, there is no single one…

If your directories "usually" exist, the 2nd method will be the faster
one, because it will finish within the very first iterations.
If your directories "usually" do not exist, the 3rd method will be the
faster one, because it will dig faster into the bottomless abyss.

Because the 2nd method is almost trivial, I will write the code only for
the 3rd, binary-search method:

mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";

sub mkdir_btree
{
# split on both slash styles so Windows-style paths also work
my @array = split /[\/\\]/, $_[0];
return 1 if -1 == $#array;
# re-attach the leading slash of absolute paths to the first component
$array[0] eq '' && splice @array, 0, 2, "/$array[1]";
my ($begin, $end) = (0, scalar @array);

# binary search for the deepest prefix that already exists
while ($begin < $end)
{
my $cut = int(($begin + $end)/2);
-d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
}

return 1 unless $begin < @array;
# create everything below the deepest existing prefix
mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
1
}

Also, here is a small benchmark, if you want to test it (it is better to
run it multiple times):

#!/usr/bin/perl
# Dir creation benchmark
use strict;
use warnings;
use Time::HiRes;
use File::Path;

my $how_many = 1000;
my $how_deep = 40;
my $root_test_dir = '/tmp/test_this';

# create the array of random test paths
print "Creating the test dirs array\n";
my @Test_dirs;
my $start_time = [];
for (0 .. $how_many) { my @R = $root_test_dir;
for (0 .. $how_deep) {
push @R, sprintf "%s%02d", ('a'..'z')[int(rand 26)], int(rand 100) }
push @Test_dirs, join('/',@R) }

foreach my $code (qw/

Mkdir_recursive
mkdir_btree
File_Path_module

/){
system("/bin/rm -rf $root_test_dir") if -d $root_test_dir;
$start_time = [ Time::HiRes::gettimeofday ];
print "Testing $code ... ";
foreach (@Test_dirs) { no strict 'refs'; &$code($_) or die "$!" }
print "finished at ", Time::HiRes::tv_interval($start_time) ," sec\n"
}

sub File_Path_module
{
my $err;
File::Path::mkpath( $_[0], {'error' => \$err} );
@{$err} ? 0 : 1
}

sub Mkdir_recursive
{
return 1 if !defined $_[0] || $_[0] eq '' || -d $_[0];
Mkdir_recursive( $_[0] =~ /^(.*?)\/[^\/]+$/ ) or return undef;
mkdir $_[0] or return undef;
1
}

sub mkdir_btree
{
my @array = split /\//, $_[0];
return 1 if -1 == $#array;
$array[0] eq '' && splice @array, 0, 2, "/$array[1]";
my ($begin, $end) = (0, scalar @array);

while ($begin < $end)
{
my $cut = int(($begin + $end)/2);
-d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
}

return 1 unless $begin < @array;
mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
1
}

END {
print "Clearing test dir: $root_test_dir\n";
system("/bin/rm -rf $root_test_dir") if -d $root_test_dir }


#### Rainer Weikusat

"George Mpouras" writes:
1) The simple one: just create them from top to bottom. This always
gives us N disk accesses. Silly, but do not underestimate it, because
it does not consume CPU.

2) The smarter one: search from bottom to top for the first existing
node and start creating from that point downward; this gives us N
accesses only in the worst case.

As I already tried to tell you last time: Method 1 is tuned for the case
where most directories have to be created and method 2 for the case
where most directories already exist. Both stat (the system call
behind -X) and mkdir need to determine information about a
particular directory entry: stat because that's what it is supposed to
do, and mkdir because it must not do anything if this directory entry
already exists. This means creating all directories in the path
/a/b/c/d from scratch will first perform four stat calls,

stat("/a/b/c/d");
stat("/a/b/c");
stat("/a/b");
stat("/a");

which will fail with ENOENT, followed by four mkdir calls,

mkdir("/a");
mkdir("/a/b");
mkdir("/a/b/c");
mkdir("/a/b/c/d");

for the 'backward' method, while the other will just do the 'mkdirs'.
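That mkdir-only, top-down strategy can be sketched in runnable Perl (my own illustration of method 1, not code from the thread; the function name `mkdir_topdown` is mine). Treating EEXIST as success means no stat calls are issued at all:

```perl
use strict;
use warnings;
use Errno ();   # enables the %! hash of named errno flags

# Method 1 as a sketch: mkdir every prefix top-down and treat
# 'already exists' as success, so no stat calls are needed.
sub mkdir_topdown {
    my ($path) = @_;
    my @parts  = grep { length } split m{/+}, $path;
    my $prefix = $path =~ m{^/} ? '' : '.';
    for my $part (@parts) {
        $prefix .= "/$part";
        # one mkdir per component; EEXIST just means this prefix
        # was already there, which is fine
        mkdir $prefix or $!{EEXIST} or return 0;
    }
    return 1;
}
```

Note this still pays N system calls per path, exactly as the 'forward' method above describes; it just never stats.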
3) The wisest approach: build an array of all the upper
nodes, then perform a binary search on this array using a node-existence
check. This gives us log(N) accesses in the worst case.

So what is the best solution? Unfortunately, there is no single one…

If your directories "usually" exist, the 2nd method will be the faster
one, because it will finish within the very first iterations.
If your directories "usually" do not exist, the 3rd method will be the
faster one, because it will dig faster into the bottomless abyss.

Because the 2nd method is almost trivial, I will write the code only
for the 3rd, binary-search method:

mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";

sub mkdir_btree
{
my @array = split /[\/\\]/, $_[0];

I don't know if this is for Windows, but a valid UNIX(*) pathname may use
any number of slashes to separate two directories,

[rw@sapphire]~ $perl -e "chdir('//////////////'); system('pwd');"
/


#### Rainer Weikusat

"George Mpouras"
<n[email protected]>
writes:

[...]
3) The wisest approach: build an array of all the upper
nodes, then perform a binary search on this array using a node-existence
check. This gives us log(N) accesses in the worst case.

So what is the best solution? Unfortunately, there is no single one…

If your directories "usually" exist, the 2nd method will be the faster
one, because it will finish within the very first iterations.
If your directories "usually" do not exist, the 3rd method will be the
faster one, because it will dig faster into the bottomless abyss.
[...]

mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";

sub mkdir_btree
{
my @array = split /[\/\\]/, $_[0];
return 1 if -1 == $#array;
$array[0] eq '' && splice @array, 0, 2, "/$array[1]";
my ($begin, $end) = (0, scalar @array);

while ($begin < $end)
{
my $cut = int(($begin + $end)/2);
-d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
}

return 1 unless $begin < @array;
mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
1
}

The 'log(N) worst case' is not true: this algorithm has a constant
overhead of log(N) stat calls in every case (this could be improved
somewhat by using mkdir in the loop). This means it is worse than
'create from the top' when all directories have to be created, and it
will become worse than 'create from the bottom' if the number of
subdirectories which have to be created at depth n is large enough
that (m - 1) * log(n) stat calls amount to more work than what was
avoided when creating the first subdirectory at this depth. The work
saved when creating the first directory is (based on mkdir)

2n - 1 - n - log(n)

This must be larger than (m - 1) * log(n) for the binary search to be
beneficial. Putting this together,

2n - 1 - n - log(n) > (m - 1) * log(n)

this ends up as (barring arithmetic errors)

(n - 1)/log(n) > m

The /usr tree on the machine I used as example has a maximum depth of
12. Using this program,

--------------
sub log2
{
return log($_[0])/log(2);
}

sub threshold
{
return ($_[0] - 1) / log2($_[0]);
}

print($_, "\t", threshold($_), "\n") for 2 .. 12;
--------------

yields the following threshold values:

2 1
3 1.26185950714291
4 1.5
5 1.72270623229357
6 1.93426403617271
7 2.13724312264813
8 2.33333333333333
9 2.52371901428583
10 2.70926996097583
11 2.89064826317888
12 3.06837240216243

(as soon as a directory at depth [first column] has more
subdirectories than [second column], the algorithm sucks).

I've used find /usr -type d to generate a list of directories, edited
the first line to remove a trailing slash and then used the following
program

-------------
use List::Util qw(sum);

@ds = map {chomp; $_} <STDIN>;

for (@ds) {
/(.*?\/)[^\/]*$/ and $path = $1;
# print($_, "\t", $path, "\n");

++$dirs{$path};
}

for (keys(%dirs)) {
$depth = tr/\///;
push(@{$counts[$depth]}, $dirs{$_});
}

for (1 .. $#counts) {
print($_, "\t", scalar(@{$counts[$_]}), "\t", sum(@{$counts[$_]}) / @{$counts[$_]}, "\n") if @{$counts[$_]};
}
---------------

to calculate the average number of subdirectories at each depth.
The result is (depth, number of times encountered, avg)

1 1 1
2 1 10
3 6 82.3333333333333
4 228 10.0833333333333
5 673 2.93610698365527
6 521 8.13435700575816
7 556 2.42446043165468
8 289 1.96193771626298
9 136 2.63970588235294
10 85 3.23529411764706
11 56 3.17857142857143
12 5 5

In other words (if the results are at least mostly correct), this
idea sucks dead hippopotamuses through nanotubes in the real world.


#### George Mpouras

On 27/7/2013 15:25, Rainer Weikusat wrote:
The 'log(N) worst case' is not true: this algorithm has a constant
overhead of log(N) stat calls in every case (this could be improved
somewhat by using mkdir in the loop). This means it is worse than
'create from the top' when all directories have to be created and will
Do not forget that computers do not know in advance whether the
directories have to be created; so it finds out faster whether they
have to be.

become worse than 'create from the bottom' if the number of
subdirectories which have to be created at depth n is large enough
that (m - 1) * log(n) stat calls amount to more work than what was
avoided when creating the first subdirectory at this depth. [...]

In other words (if the results are at least mostly correct), this
idea sucks dead hippopotamuses through nanotubes in the real world.

This idea does not suck. When deep directories usually do not exist, it
is the faster one.


#### George Mpouras

In other words (if the results are at least mostly correct), this
idea sucks dead hippopotamuses through nanotubes in the real world.

Binary search is the fastest method to access a specific record in an
unknown data structure; that is why it is often used by databases. But as
I said, there is no single absolute answer. The following factors must be
considered before deciding what to do on a specific machine:

* what is the time cost of a CPU cycle
* what is the time cost of asking for a node's existence
* what is the time cost of creating a node
* what is the statistical behaviour of the application
* what is the current status of your data

So the problem is not so deterministic after all.


#### Willem

George Mpouras wrote:
) Ok, let's talk a little more about creating directories (or any other
) trie-structure entries).
) There are three basic methods:
)
) 1) The simple one: just create them from top to bottom. This always
) gives us N disk accesses. Silly, but do not underestimate it, because
) it does not consume CPU.
)
) 2) The smarter one: search from bottom to top for the first existing
) node and start creating from that point downward; this gives us N
) accesses only in the worst case.
)
) 3) The wisest approach: build an array of all the upper nodes, then
) perform a binary search on this array using a node-existence check.
) This gives us log(N) accesses in the worst case.

Have you considered the possibility that it takes O(N) disk accesses
to check whether a directory (a1/a2/.../aN) exists?
The OS will have to open a1, find a2, open a2, find a3, etc.
However, if you already have a handle to directory a2,
then checking or creating a3 is a single step.

If that is the case, method 1 will always be fastest.
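Willem's observation can be approximated in pure Perl (a sketch of my own, using chdir, since core Perl has no portable mkdirat; `mkdir_descend` is a hypothetical name and handles relative paths only). Once the working directory is the already-created prefix, each further component costs a single-name lookup instead of a full path walk:

```perl
use strict;
use warnings;
use Cwd ();

# Create a relative path one component at a time, descending as we
# go, so the kernel resolves one name per call instead of re-walking
# the whole prefix for every mkdir.
sub mkdir_descend {
    my ($path) = @_;
    my $saved = Cwd::getcwd();            # remember where we started
    for my $part (grep { length } split m{/+}, $path) {
        unless (-d $part) {
            mkdir $part or do { chdir $saved; return 0 };
        }
        chdir $part or do { chdir $saved; return 0 };
    }
    chdir $saved or return 0;             # restore the original cwd
    return 1;
}
```

Since the current directory is process-global state, this trick is unsuitable for threaded or signal-heavy code; a real implementation would want openat/mkdirat-style calls instead.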

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT


#### gamo

my @array = split /[\/\\]/, $_[0];


#### George Mpouras

On 27/7/2013 23:26, (e-mail address removed) wrote:
my @array = split /[\/\\]/, $_[0];
Maybe this one? split /\/+/, $_[0];
What is the VMS?


#### Keith Keller

Ok, let's talk a little more about creating directories (or any other
trie-structure entries).
There are three basic methods:

The best method is to use File::Path unless you have thoroughly
documented that you need to optimize directory creation because
File::Path isn't fast enough. Using File::Path optimizes for
programmer time and portability.

If you really believe that your methods are superior, you should write
them up as a patch to File::Path, so that others can thoroughly test
your code and include it in the module if it works properly.

--keith


#### George Mpouras

On 28/7/2013 00:13, Keith Keller wrote:
The best method is to use File::Path unless you have thoroughly
documented that you need to optimize directory creation because
File::Path isn't fast enough. Using File::Path optimizes for
programmer time and portability.

If you really believe that your methods are superior, you should write
them up as a patch to File::Path, so that others can thoroughly test
your code and include it in the module if it works properly.

--keith
Thanks for the comment, but for me it is off topic. Are you interested
in Perl? Are the modules a religion for you?


#### gamo

Maybe this one? split /\/+/, $_[0];
What is the VMS?
-----------------------------

I don't think it works. As far as I remember, directories in VMS are not
a special file; e.g., the command to change directory is SET DEF=, and
cd was not available until the apparition of the POSIX-compatible
interface.
And why is it important to be compatible with VMS/OpenVMS? Because it has
some very interesting security features. The last time I saw it working
with an X interface was in a hydroelectric power station, controlling the
opening and closing of the water gates.
VMS has other remarkable utilities, such as the EDT editor; hence I use
the JED editor emulating EDT, with Perl syntax highlighting, of course.


#### Charlton Wilbur

GM> On 28/7/2013 00:13, Keith Keller wrote:

GM> Thanks for the comment but for me, it is off topic . Are you
GM> interested in Perl ? Are the modules a religion for you ?

You know, if the sole object is to create directories as quickly as
possible, I bet C's an order of magnitude faster than Perl.

When you choose to use Perl in the first place, you are choosing to
optimize for programmer time over CPU time. To make that choice and
then to squander the benefits you get from it by writing buggy,
unportable code because you think it saves you time, instead of
determining what your actual time requirements are and seeing if the
standard, well-tested, portable approach will suffice, is fucking
moronic (and if that's what you want, PHP is *over there* and ColdFusion
is *down the hall*); beyond that, it shows a fundamental lack of
understanding of what software engineering (as opposed to mere coding)
is.

Charlton


#### Keith Keller

Thanks for the comment but for me, it is off topic .

What a stupid comment. File::Path may be off topic for *you*, but it is
certainly not for new Perl programmers who might be tempted to take
your advice.

Are you interested in Perl ? Are the modules a religion for you ?

No, but I'm not clueless enough to presume that I can do better than
people who have been contributing to Perl's core for many years.

If you're just playing around for kicks, that's your business. But it's
ridiculous to assume that your methods are worth recommending,
especially to people new to Perl, over module code that has been in
production use for years. You may never realize that, but it's still
important for people who care about Perl to warn new programmers
against it.

--keith


#### George Mpouras

I like this place for discussing interesting things. If you want to kick
something, kick yourself.

George


#### Rainer Weikusat

Keith Keller said:
The best method is to use File::Path unless you have thoroughly
documented that you need to optimize directory creation because
File::Path isn't fast enough. Using File::Path optimizes for
programmer time and portability.

File::Path happens to be part of the Perl distribution because the guy
who happened to have written it happened to be on a sufficiently
friendly/familiar footing with the guy (or guys) who made such
decisions at that time, as such things commonly happen in 'open
source land'. It certainly wasn't ever peer-reviewed by someone with
at least half a clue (e.g., the outgoing check for 'has another process
created the directory meanwhile' is totally useless because said other
process could create it one microsecond after the check) and isn't a
particularly good implementation. If you think it is helpful for you,
then, by all means, use it. If someone else either doesn't think so or
wants to spend some time researching sensible solutions to a
particular problem, even a problem you really don't care about, it
would be appropriate to let him, instead of demanding that he make
the same choices you happen to have made (even if 'you'
encompasses the population of all of China) because you happen to
If you really believe that your methods are superior, you should write
them up as a patch to File::Path, so that others

.... can laugh at you for being so naive as to try to 'contribute'
something to 'an open source project' because your solution is better
than the one the Really Important and Well-Connected Guy[tm] (who,
coincidentally, makes a living by offering his 'superior Perl
expertise' to the highest bidder) wrote.


#### Rainer Weikusat

my @array = split /[\/\\]/, $_[0];

Well, so what? Does VMS run on anything but VAX (fossilized) and
Alpha (dead)? Looking in the other direction: a sensible
implementation for VMS also won't be compatible with anything except
'similar systems' (that would be VMS). An idea I recently had
while looking at the File::Path code: given that there really isn't
such a thing as 'distributable pre-compiled Perl code'[*], the
characteristics of the host filesystem are known at compile time, and
this means selecting an implementation which works for a particular
kind of file system/operating system should really be done at
compile time and not at runtime with the help of 'generic' switching
code which is, in effect, the least common denominator.

[*] I had a short conversation with the present 'Perl compiler code'
because my boss really didn't like the fact that we were distributing
Perl source code to customers. I asked it "Can you support this?" and


#### George Mpouras

On 28/7/2013 13:00, Henry Law wrote:
So do I; but I'm afraid I have decided that the things you discuss are
interesting only to you. Go right ahead, but I won't see them. Or, to
be precise, I shall see only the parts quoted by the people who really
know what they're talking about (Charlton, Ben and a dozen others) when

It is OK, I can live with that.
I had many other things to say in this and other threads.
But since no one cares, and I have received such bad treatment here, I
had better leave this place.

Good luck to everyone.


#### Rainer Weikusat

[...]
An idea I recently had while looking at the File::Path code: given
that there really isn't such a thing as 'distributable pre-compiled
Perl code'[*], the characteristics of the host filesystem are known
at compile time, and this means selecting an implementation which
works for a particular kind of file system/operating system should
really be done at compile time and not at runtime with the help of
'generic' switching code which is, in effect, the least common
perl seems a little schizophrenic in this respect, since it includes
both File::Spec, which provides 'portable pathname manipulation' in
an IMHO sensible way[*] by delegating all actual work to a
system-specific implementation module, and File::Basename (used by
File::Path), which employs if - elsif - else cascades to select the
code to execute for a particular system at runtime.

[*] Mostly. It still relies on a hardcoded 'system list' in order to
determine which module to load, uses @ISA for forwarding calls to
'implementation modules', and loads these via require. Ideally, it
should select a module based on 'system name' only (possibly with an
override facility), and the implementation modules should put their
public routines in @EXPORT. The 'entry point module' could then just
use the implementation module in order to put everything in place with
The File::Spec::Functions module provides something similar to this
but it uses

foreach my $meth (@EXPORT, @EXPORT_OK) {
my $sub = File::Spec->can($meth);
no strict 'refs';
*{$meth} = sub {&$sub('File::Spec', @_)};
}

at runtime in order to generate the proxy subs. The 'double call' is
necessary to work around the IMHO totally weird decision to use
'method calls' for this in the first place (an explanation of why this
could make any sense would be greatly appreciated) ...
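The compile-time selection idea can be sketched like this (toy package names of my own invention, not File::Spec's real layout): the implementation package is chosen once, at compile time, from $^O, so no per-call if/elsif cascade remains at runtime:

```perl
use strict;
use warnings;

# Two toy implementation packages standing in for per-OS modules
# (hypothetical names, for illustration only).
package Path::Impl::Unix  { sub sep { '/'  } }
package Path::Impl::Win32 { sub sep { '\\' } }

package main;

# The implementation is fixed at compile time via use constant;
# later calls dispatch straight to the chosen package.
use constant IMPL => ($^O eq 'MSWin32' ? 'Path::Impl::Win32'
                                       : 'Path::Impl::Unix');

print 'separator on this system: ', IMPL->sep, "\n";
```

A real module would `use` the selected implementation inside the BEGIN-time branch and let its @EXPORT install the routines directly, which is essentially the scheme described above for File::Spec, minus the @ISA forwarding.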


#### Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat said:
my @array = split /[\/\\]/, $_[0];

Well, so what? Does VMS run on anything but VAX(fossilized) and

Modern versions run on IA64,

That would be something like this

http://www8.hp.com/us/en/products/integrity-servers/product-detail.html?oid=4311905#!tab=features

Should I ever get stinkin' filthy rich (hardly a reason to fear that
this might actually happen), I'll buy one of these and see if I can
turn it into a Linux workstation. Until then, I don't expect any code
written by me to run on such hardware ...