Mailbox-style directory hashing

s1037989

I whipped up this quick and ugly script and I wanted to post it for
code review and others' benefit.

With an array such as:
qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)

The program returns:
# perl list2fs.pl 2
/a/aa/aaa/aaaa/aaaa
/a/aa/aaa/aaab/aaab
/a/aa/aaa/aaac/aaac
/b/bb/bbbb
/b/bc/bcc/bccc
/b/bc/bcd/bcdd
/b/bc/bce/bcee
/b/bc/bcf/bcff
/c/cccc
/d/dddd

Now, as you can see, what this program does is take a list of filenames
and "hashify" it, mailbox-storage style, allowing no more than 2 (or
whatever $ARGV[0] is) filenames in a single directory. The point,
obviously, is that if you have 100,000 filenames and ext3 won't store
100,000 files in a single directory, you can use this technique to
break them down.

I had to make it recursive because I'm dealing with unknown data. I
don't know that 3 levels or even 6 levels will suffice. Therefore, the
recursion figures it out.

Now, that said, this is NOT intended for "hashifying" mail storage
dirs. It IS intended to "hashify" a HUGE list of filenames.
Unfortunately this code is VERY inefficient. So, I post it here so
people can see my idea if it helps, and so that people can maybe direct
me to an existing CPAN module that would accomplish the same thing?
Or, perhaps someone likes what I've started and wants to help improve
the code?

---------------------
#!/usr/bin/perl -w

use strict;
$ARGV[0] ||= 1;

#my @list = map { chomp; $_ } <STDIN>;
my @list = @_ = qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);

my %hash = a(1, \@list);
b('', %hash);

sub a {
    my %list = map { lc(substr($_, 0, $_[0])) => 1 } @{$_[1]};
    my %hash = ();
    foreach my $hash ( keys %list ) {
        my @hash = grep { lc(substr($hash, 0)) eq lc(substr($_, 0, length($hash))) } @{$_[1]};
        if ( $#hash >= $ARGV[0] ) {
            %{$hash{$hash}} = a($_[0]+1, \@hash);
        } elsif ( $#hash >= 0 ) {
            %{$hash{$hash}} = map { lc($_) => 1 } @hash;
        } elsif ( $#hash == -1 ) {
            # %{$hash{$hash}} = ();
        }
    }
    return %hash;
}

sub b {
    my ($root, %hash) = @_;
    foreach ( sort keys %hash ) {
        if ( ref $hash{$_} eq 'HASH' ) {
            b($root.'/'.$_, %{$hash{$_}});
        } else {
            print "$root/$_\n";
        }
    }
}
 
Tad McClellan

Or, perhaps someone likes what I've started and wants to help improve
the code?


use warnings;

Lexical warnings are a big improvement over global warnings.
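
That is, something like this at the top of the script (a minimal
sketch):

#!/usr/bin/perl
use strict;
use warnings;   # lexically scoped; replaces the global -w switch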
 
Peter J. Holzer

I whipped up this quick and ugly script and I wanted to post it for
code review and others' benefit.

With an array such as:
qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)

The program returns:
# perl list2fs.pl 2
/a/aa/aaa/aaaa/aaaa
/a/aa/aaa/aaab/aaab
/a/aa/aaa/aaac/aaac
/b/bb/bbbb
/b/bc/bcc/bccc
/b/bc/bcd/bcdd
/b/bc/bce/bcee
/b/bc/bcf/bcff
/c/cccc
/d/dddd

Now, as you can see, what this program does is take a list of filenames
and "hashify" it, mailbox-storage style, allowing no more than 2 (or
whatever $ARGV[0] is) filenames in a single directory. The point,
obviously, is that if you have 100,000 filenames and ext3 won't store
100,000 files in a single directory, you can use this technique to
break them down.

Ext3 will happily store 100000 filenames in a directory - it just won't
be very quick in retrieving them (Even that isn't true for Linux 2.6 any
more - ext3 directories now use a structure called an "htree" to quickly
access files in huge directories). But assuming you need to use ext3
with older kernel versions or other filesystems with linear directories:


Are all the filenames guaranteed to be the same length? Otherwise, what
happens if you have these file names?

aaab aaac aaad aaae aaa

You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
have both.


Do you have all the filenames in advance, or is it possible to create
new files after the structure has been created? If it is the latter,
you will probably need a way to split an existing directory if the
number of files in it becomes too large - what happens when somebody
accesses the directory in the middle of the split operation? How do you
determine when you have to split? Count the files every time you create
a new file?


Finally, I assume you used the value of 2 for demonstration purposes
only; such a small value is not practical. It makes little sense to
restrict directory sizes to much less than a disk block. A structure
like the one you showed, with lots of directories containing a single
file each, would be slower than one that is one level less deep.
Now, that said, this is NOT intended for "hashifying" mail storage
dirs. It IS intended to "hashify" a HUGE list of filenames.
Unfortunately this code is VERY inefficient.

How did you determine that it is very inefficient?
So, I post it here so people can see my idea if it helps, and so that
people can maybe direct me to an existing CPAN module that would
accomplish the same thing?

When I needed to do similar things I used a hash function (e.g. MD5 or
SHA-1) on the key (the filename in your case) to get nice, uniformly
distributed constant length filenames and then computed the number of
levels from the maximum number of files I had to store. With 256 files
per directory, 2 levels would be enough for 16 million files and 3
levels would be enough for 4 billion files. I never needed more :).
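
Something like this, for example (an untested sketch - the two-level
layout and the path names are just for illustration):

use Digest::MD5 qw(md5_hex);

# Map a key (here: a domain name) to a fixed-depth path, using the
# first two bytes of the hex digest as the two directory levels.
sub hashed_path {
    my ($root, $key) = @_;
    my $digest = md5_hex(lc $key);
    return join '/', $root, substr($digest, 0, 2), substr($digest, 2, 2),
        $digest;
}

# e.g. hashed_path('/blacklist', 'example.com') gives a path of the
# form /blacklist/xx/yy/xxyy..., so an existence check costs only a
# handful of stat() calls.
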
Or, perhaps someone likes what I've started and wants to help improve
the code?

If you avoid copying around huge lists it might be faster. But I'm not
sure the code even does what you want - see my questions above.

hp
 
s1037989

Peter, thanks for your thoughtful reply! :) My response below...
I whipped up this quick and ugly script and I wanted to post it for
code review and others' benefit.

With an array such as:
qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)

The program returns:
# perl list2fs.pl 2
/a/aa/aaa/aaaa/aaaa
/a/aa/aaa/aaab/aaab
/a/aa/aaa/aaac/aaac
/b/bb/bbbb
/b/bc/bcc/bccc
/b/bc/bcd/bcdd
/b/bc/bce/bcee
/b/bc/bcf/bcff
/c/cccc
/d/dddd

Now, as you can see, what this program does is take a list of filenames
and "hashify" it, mailbox-storage style, allowing no more than 2 (or
whatever $ARGV[0] is) filenames in a single directory. The point,
obviously, is that if you have 100,000 filenames and ext3 won't store
100,000 files in a single directory, you can use this technique to
break them down.

Ext3 will happily store 100000 filenames in a directory - it just won't
be very quick in retrieving them (Even that isn't true for Linux 2.6 any
more - ext3 directories now use a structure called an "htree" to quickly
access files in huge directories). But assuming you need to use ext3
with older kernel versions or other filesystems with linear directories:

Ok, I didn't know that. I thought I had read somewhere that the limit
was 32,000 (and thus the reason for using something like reiserfs), but
that was apparently an ancient article... Oops!
Are all the filenames guaranteed to be the same length? Otherwise, what
happens if you have these file names?

No. So yes, that's a good question. To be more specific about the
filenames, they are actually URLs.
aaab aaac aaad aaae aaa

You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
have both.

Actually, in this case, aaa the file would go in aaa the directory.
Do you have all the filenames in advance, or is it possible to create
new files after the structure has been created? If it is the latter,
you will probably need a way to split an existing directory if the
number of files in it becomes too large - what happens when somebody
accesses the directory in the middle of the split operation? How do you
determine when you have to split? Count the files every time you create
a new file?

I have the majority of the filenames in advance, but unfortunately the
list will change. To be more specific about what I'm doing: the URLs
are blacklisted domains to be used with squid. I'm writing my own
redirect program in Perl (because I'm not satisfied with squidGuard and
do not know C well enough to modify squidGuard). I have been pondering
the idea of giving each URL its own filename on the FS (disk space is
not an issue) because that would be -- I think -- far more efficient
for looking up a domain name than parsing a file and/or storing the
whole list in memory. It would be as simple as: does this file exist?
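
Roughly like this, against the directory tree the script above builds
(is_blacklisted is just a hypothetical helper name, and this is an
untested sketch):

# Descend through the prefix directories as far as they exist, then
# check whether the name is present as a plain file at that level.
sub is_blacklisted {
    my ($root, $name) = @_;
    $name = lc $name;
    my $path = $root;
    for my $len (1 .. length $name) {
        my $dir = "$path/" . substr($name, 0, $len);
        last unless -d $dir;
        $path = $dir;
    }
    return -f "$path/$name";
}

# e.g. is_blacklisted('/blacklist', 'badhost.example') costs one stat()
# per directory level instead of parsing the whole list into memory.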

Oh, the other reason for using a FS versus a MySQL DB is that I hate to
write Web apps and it would be very simple for someone to just create a
new blacklist using Samba from a Windows box.

So as far as handling a changing list, it could just be done nightly.
Blow away the list, and rebuild.

I know this is not the best approach, but I want to make something
simple for a prototype. I hate to invest too much time into it only to
decide it won't be used.

Lastly, just thinking about the problem got me interested in solving
the problem generically. As such, I wrote the quick script and posted
it for review and perhaps for others' benefit.
Finally, I assume you used the value of 2 for demonstration purposes
only; such a small value is not practical. It makes little sense to
restrict directory sizes to much less than a disk block. A structure
like the one you showed, with lots of directories containing a single
file each, would be slower than one that is one level less deep.

Correct, 2 was used merely for demonstration purposes. I intend to set
the number to around 1000. I don't want the directory listing to be
too overwhelming.
How did you determine that it is very inefficient?

By running it... :) It was AWFUL! I ran it against a file containing
nearly 600,000 entries. I never waited long enough for it to finish.
When I needed to do similar things I used a hash function (e.g. MD5 or
SHA-1) on the key (the filename in your case) to get nice, uniformly
distributed constant length filenames and then computed the number of
levels from the maximum number of files I had to store. With 256 files
per directory, 2 levels would be enough for 16 million files and 3
levels would be enough for 4 billion files. I never needed more :).

This is good information, thanks. I'll consider its practical uses,
but it would defeat the goal of simple blacklist updates via Samba. :(
If you avoid copying around huge lists it might be faster. But I'm not
sure the code even does what you want - see my questions above.

How could I avoid that?
 
Peter J. Holzer

Peter, thanks for your thoughtful reply! :) My response below...
I whipped up this quick and ugly script and I wanted to post it for
code review and others' benefit.

With an array such as:
qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)

The program returns:
# perl list2fs.pl 2
/a/aa/aaa/aaaa/aaaa
/a/aa/aaa/aaab/aaab [...]
/c/cccc
/d/dddd

Now, as you can see, what this program does is take a list of filenames
and "hashify" it, mailbox-storage style, allowing no more than 2 (or
whatever $ARGV[0] is) filenames in a single directory. The point,
obviously, is that if you have 100,000 filenames and ext3 won't store
100,000 files in a single directory, you can use this technique to
break them down.

Ext3 will happily store 100000 filenames in a directory - it just won't
be very quick in retrieving them (Even that isn't true for Linux 2.6 any
more - ext3 directories now use a structure called an "htree" to quickly
access files in huge directories). But assuming you need to use ext3
with older kernel versions or other filesystems with linear directories:

Ok, I didn't know that. I thought I had read somewhere that the limit
was 32,000 (and thus the reason for using something like reiserfs), but
that was apparently an ancient article... Oops!

AFAIK, there was never a limit on the number of files in an ext2/ext3
directory. 32 k may be the maximum number of subdirectories (because
the link count is 16 bit, and each ".." entry is a link). But you
wouldn't want more than a few thousand files in a single directory
because it would be slow.

Actually, in this case, aaa the file would go in aaa the directory.

Right. But my gut feeling that you don't handle that case correctly was
right. See below.

I have the majority of the filenames in advance, but unfortunately the
list will change. To be more specific about what I'm doing: the URLs
are blacklisted domains to be used with squid.

Are they URLs or domains? With URLs you may need to be careful because
of '/' and possibly other characters (especially if you access the
filesystem via Windows/Samba).
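
For example, something like this (just a sketch; the set of characters
kept here is only illustrative):

use URI::Escape qw(uri_escape);

my $url = 'http://www.example.com/some/path';
# Escape everything outside a conservative set, so '/' and other
# special characters cannot be mistaken for directory separators.
my $name = uri_escape($url, '^A-Za-z0-9._-');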

By running it... :) It was AWFUL! I ran it against a file containing
nearly 600,000 entries. I never waited long enough for it to finish.

That's probably because it doesn't finish at all if one of the strings
is a prefix of more than $ARGV[0] other strings. Change the
initialization to

my @list = @_ = qw(aa aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);

and try it. (It will probably crash pretty soon when it runs out of
memory, but if you had an infinite amount of memory it would run
forever)

This is good information, thanks. I'll consider its practical uses,
but it would defeat the goal of simple blacklist updates via Samba. :(

You want the directory structure to be created with the script anyway,
don't you? Or do you expect Windows users to create and delete
individual files to edit the blacklist?

How could I avoid that?

I was wrong. You are passing the lists by reference, and the hashes
which you copy needlessly aren't that big (at most 70 elements, assuming
printable ASCII). The problem is that you are calling grep way too often
which also causes the endless recursion (and would be slow even without
that).

Here is my version of your function "a" (you have some talent for
choosing horrible names, btw. I could only figure out what the variables
were supposed to be by reading the code - the names were at best
useless and at worst misleading):

sub a {
    my ($prefix_len, $list) = @_;
    #print " " x $prefix_len, (scalar @$list), " elements\n";
    my %hash = ();
    push @{ $hash{lc(substr($_, 0, $prefix_len))} }, $_ for @$list;
    foreach my $sublist ( values %hash ) {
        if ( @$sublist > $ARGV[0] ) {
            $sublist = a($prefix_len+1, $sublist);
        } else {
            $sublist = { map { lc($_) => 1 } @$sublist };
        }
    }
    return \%hash;
}

The biggest change is that I didn't try to figure out the prefixes
first and then find all strings with the same prefix (which caused
duplication and the endless recursion), but do both in one loop - that
way each string can only ever be assigned to one prefix, and the lists
must get shorter. The switching between array and hash references is
mostly so that a can be called with an array reference as before; I
could have used hash references throughout. It also now returns a hash
reference instead of a list-which-is-a-hash, which I consider cleaner.
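
The calling code and b() need a small matching change to take a hash
reference instead of a hash - something like this (untested):

my $tree = a(1, \@list);
b('', $tree);

sub b {
    my ($root, $tree) = @_;
    foreach ( sort keys %$tree ) {
        if ( ref $tree->{$_} eq 'HASH' ) {
            b("$root/$_", $tree->{$_});
        } else {
            print "$root/$_\n";
        }
    }
}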

I didn't have a list of 600000 domain names handy, so I used two lists
with 35000 and 96000 names: The run times (with $ARGV[0] == 1000) were
about 3 and 9 seconds, respectively, on a PIII 850.

hp
 
