Mailbox-style directory hashing

Discussion in 'Perl Misc' started by s1037989@gmail.com, Oct 31, 2006.

  1. Guest

    I whipped up this quick and ugly script and I wanted to post it for
    code review and others' benefit.

    With an array such as:
    qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)

    The program returns:
    # perl list2fs.pl 2
    /a/aa/aaa/aaaa/aaaa
    /a/aa/aaa/aaab/aaab
    /a/aa/aaa/aaac/aaac
    /b/bb/bbbb
    /b/bc/bcc/bccc
    /b/bc/bcd/bcdd
    /b/bc/bce/bcee
    /b/bc/bcf/bcff
    /c/cccc
    /d/dddd

    Now as you can see, what this program does is take a list of filenames
    and "hashify" it like mailbox storage, allowing no more than 2 (or
    whatever $ARGV[0] is) filenames in a single directory. The point,
    obviously, is that if you have 100000 filenames and ext3 won't store
    100000 files in a single directory, you can use this technique to break
    them down.

    I had to make it recursive because I'm dealing with unknown data. I
    don't know that 3 levels or even 6 levels will suffice. Therefore, the
    recursion figures it out.

    Now, that said, this is NOT intended for "hashifying" mail storage
    dirs. It IS intended to "hashify" a HUGE list of filenames.
    Unfortunately this code is VERY inefficient. So, I post it here so
    people can see my idea if it helps, and so that people can maybe direct
    me to an existing CPAN module that would accomplish the same thing?
    Or, perhaps someone likes what I've started and wants to help improve
    the code?

    ---------------------
    #!/usr/bin/perl -w

    use strict;
    $ARGV[0] ||= 1;

    #my @list = map { chomp; $_ } <STDIN>;
    my @list = @_ = qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);

    my %hash = a(1, \@list);
    b('', %hash);

    sub a {
        my %list = map { lc(substr($_, 0, $_[0])) => 1 } @{$_[1]};
        my %hash = ();
        foreach my $hash ( keys %list ) {
            my @hash = grep { lc(substr($hash, 0)) eq lc(substr($_, 0, length($hash))) } @{$_[1]};
            if ( $#hash >= $ARGV[0] ) {
                %{$hash{$hash}} = a($_[0]+1, \@hash);
            } elsif ( $#hash >= 0 ) {
                %{$hash{$hash}} = map { lc($_) => 1 } @hash;
            } elsif ( $#hash == -1 ) {
                # %{$hash{$hash}} = ();
            }
        }
        return %hash;
    }

    sub b {
        my ($root, %hash) = @_;
        foreach ( sort keys %hash ) {
            if ( ref $hash{$_} eq 'HASH' ) {
                b($root.'/'.$_, %{$hash{$_}});
            } else {
                print "$root/$_\n";
            }
        }
    }
     
    , Oct 31, 2006
    #1

  2. <> wrote:

    > Or, perhaps someone likes what I've started and wants to help improve
    > the code?
    >
    > ---------------------
    > #!/usr/bin/perl -w



    use warnings;

    Lexical warnings are a big improvement over global warnings.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 1, 2006
    #2

  3. Guest

    Tad McClellan wrote:
    > <> wrote:
    >
    > > Or, perhaps someone likes what I've started and wants to help improve
    > > the code?
    > >
    > > ---------------------
    > > #!/usr/bin/perl -w

    >
    >
    > use warnings;
    >
    > Lexical warnings are a big improvement over global warnings.


    Interesting...! Why is that?

     
    , Nov 1, 2006
    #3
  4. Dr.Ruud Guest

    s1037989 wrote:
    > Tad McClellan:
    >> s1037989:


    >>> #!/usr/bin/perl -w

    >>
    >> use warnings;
    >>
    >> Lexical warnings are a big improvement over global warnings.

    >
    > Interesting...! Why is that?


    Read perllexwarn: `perldoc perllexwarn`.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Nov 1, 2006
    #4
  5. <> wrote:
    >
    > Tad McClellan wrote:
    >> <> wrote:
    >>
    >> > Or, perhaps someone likes what I've started and wants to help improve
    >> > the code?
    >> >
    >> > ---------------------
    >> > #!/usr/bin/perl -w

    >>
    >>
    >> use warnings;
    >>
    >> Lexical warnings are a big improvement over global warnings.

    >
    > Interesting...! Why is that?



    For the same reason that lexical variables are a big improvement
    over global variables.

    You can control the scope.
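    For example, you can turn warnings (or a single category of warnings)
    on or off for one block - something the global -w switch cannot do. A
    minimal sketch:

    use strict;
    use warnings;                      # lexical: in effect to end of file

    {
        no warnings 'uninitialized';   # disabled only inside this block
        my $undef;
        my $sum = $undef + 1;          # no warning here
    }

    my $other;
    my $n = $other + 1;                # warns: Use of uninitialized value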


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 1, 2006
    #5
  6. On 2006-10-31 23:40, <> wrote:
    > I whipped up this quick and ugly script and I wanted to post it for
    > code review and others' benefit.
    >
    > With an array such as:
    > qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
    >
    > The program returns:
    > # perl list2fs.pl 2
    > /a/aa/aaa/aaaa/aaaa
    > /a/aa/aaa/aaab/aaab
    > /a/aa/aaa/aaac/aaac
    > /b/bb/bbbb
    > /b/bc/bcc/bccc
    > /b/bc/bcd/bcdd
    > /b/bc/bce/bcee
    > /b/bc/bcf/bcff
    > /c/cccc
    > /d/dddd
    >
    > Now as you can see, what this program does is take a list of filenames
    > and "hashify" it like mailbox storage, allowing no more than 2 (or
    > whatever $ARGV[0] is) filenames in a single directory. The point,
    > obviously, is that if you have 100000 filenames and ext3 won't store
    > 100000 files in a single directory, you can use this technique to break
    > them down.


    Ext3 will happily store 100000 filenames in a directory - it just won't
    be very quick in retrieving them (Even that isn't true for Linux 2.6 any
    more - ext3 directories now use a structure called an "htree" to quickly
    access files in huge directories). But assuming you need to use ext3
    with older kernel versions or other filesystems with linear directories:


    Are all the filenames guaranteed to be the same length? Otherwise, what
    happens if you have these file names?

    aaab aaac aaad aaae aaa

    You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
    have both.


    Do you have all the filenames in advance or is it possible to create new
    files after the structure has been created? If it is the latter, you
    probably will need a way to split an existing directory if the number
    of files in it becomes too large - what happens when somebody accesses
    the directory in the middle of the split operation? How do you determine
    when you have to split? Count files each time you create a new file?


    Finally, I assume you used the value of 2 for demonstration purposes
    only; such a small value is not practical. It makes little sense to
    restrict directory sizes to much less than a disk block, and a structure
    like the one you showed, with lots of directories holding a single file
    each, would be slower than one that is one level shallower.

    > Now, that said, this is NOT intended for "hashifying" mail storage
    > dirs. It IS intended to "hashify" a HUGE list of filenames.
    > Unfortunately this code is VERY inefficient.


    How did you determine that it is very inefficient?

    > So, I post it here so people can see my idea if it helps, and so that
    > people can maybe direct me to an existing CPAN module that would
    > accomplish the same thing?


    When I needed to do similar things I used a hash function (e.g. MD5 or
    SHA-1) on the key (the filename in your case) to get nice, uniformly
    distributed constant length filenames and then computed the number of
    levels from the maximum number of files I had to store. With 256 files
    per directory, 2 levels would be enough for 16 million files and 3
    levels would be enough for 4 billion files. I never needed more :).
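    A minimal sketch of that scheme (the helper name hashed_path and the
    exact two-level layout here are illustrative, not code from any
    particular module):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);   # core module

    # Map a key (here, a filename) to a fixed two-level path derived from
    # its MD5 digest: two hex digits per level = 256 subdirectories each.
    sub hashed_path {
        my ($key) = @_;
        my $digest = md5_hex($key);
        return join '/', substr($digest, 0, 2), substr($digest, 2, 2), $key;
    }

    print hashed_path('example.com'), "\n";   # prints something like "ab/cd/example.com"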

    > Or, perhaps someone likes what I've started and wants to help improve
    > the code?


    If you avoid copying around huge lists it might be faster. But I'm not
    sure the code even does what you want - see my questions above.

    hp

    --
    _ | Peter J. Holzer | > Why should anyone invent something that
    |_|_) | Sysadmin WSR | > doesn't exist?
    | | | | What else would be the point of inventing?
    __/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd
     
    Peter J. Holzer, Nov 1, 2006
    #6
  7. Guest

    Peter, thanks for your thoughtful reply! :) My response below...

    Peter J. Holzer wrote:
    > On 2006-10-31 23:40, <> wrote:
    > > I whipped up this quick and ugly script and I wanted to post it for
    > > code review and others' benefit.
    > >
    > > With an array such as:
    > > qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
    > >
    > > The program returns:
    > > # perl list2fs.pl 2
    > > /a/aa/aaa/aaaa/aaaa
    > > /a/aa/aaa/aaab/aaab
    > > /a/aa/aaa/aaac/aaac
    > > /b/bb/bbbb
    > > /b/bc/bcc/bccc
    > > /b/bc/bcd/bcdd
    > > /b/bc/bce/bcee
    > > /b/bc/bcf/bcff
    > > /c/cccc
    > > /d/dddd
    > >
    > > Now as you can see, what this program does is take a list of filenames
    > > and "hashify" it like mailbox storage, allowing no more than 2 (or
    > > whatever $ARGV[0] is) filenames in a single directory. The point,
    > > obviously, is that if you have 100000 filenames and ext3 won't store
    > > 100000 files in a single directory, you can use this technique to break
    > > them down.

    >
    > Ext3 will happily store 100000 filenames in a directory - it just won't
    > be very quick in retrieving them (Even that isn't true for Linux 2.6 any
    > more - ext3 directories now use a structure called an "htree" to quickly
    > access files in huge directories). But assuming you need to use ext3
    > with older kernel versions or other filesystems with linear directories:


    Ok, I didn't know that. I thought I had read somewhere that the limit
    was 32,000 (and thus the reason for using something like reiserfs), but
    that was apparently an ancient article... Oops!

    > Are all the filenames guaranteed to be the same length? Otherwise, what
    > happens if you have these file names?


    No, they aren't, so that's a good question. To be more specific about
    the filenames, they are actually URLs.

    > aaab aaac aaad aaae aaa
    >
    > You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
    > have both.


    Actually, in this case, aaa the file would go inside aaa the directory
    (i.e. it would be stored as /a/aa/aaa/aaa).

    >
    > Do you have all the filenames in advance or is it possible to create new
    > files after the structure has been created? If it is the latter, you
    > probably will need a way to split an existing directory if the number
    > of files in it becomes too large - what happens when somebody accesses
    > the directory in the middle of the split operation? How do you determine
    > when you have to split? Count files each time you create a new file?


    I have the majority of the filenames in advance, but unfortunately the
    list will change. To be more specific about what I'm doing: the URLs
    are blacklisted domains to be used with squid. I'm writing my own
    redirect program in Perl (because I'm not satisfied with squidGuard
    and do not know C well enough to modify squidGuard). I have been
    pondering the idea of giving each URL its own filename on the FS (disk
    space is not an issue) because that would be -- I think -- far more
    efficient for looking up a domain name than parsing a file and/or
    storing the whole list in memory. It would be as simple as: does this
    file exist?
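    A minimal sketch of that lookup against the tree the script builds
    ($root is a placeholder, and the walk-down logic is an assumption, not
    code from this thread):

    use strict;
    use warnings;

    my $root = '/var/lib/blacklist';   # placeholder location

    # Walk down the prefix directories until the domain turns up as a
    # plain file, or there is no deeper directory to descend into.
    sub is_blacklisted {
        my ($domain) = @_;
        $domain = lc $domain;
        my $dir = $root;
        for my $len (1 .. length $domain) {
            return 1 if -f "$dir/$domain";              # found as a file here
            my $next = "$dir/" . substr($domain, 0, $len);
            return 0 unless -d $next;                   # nowhere deeper to look
            $dir = $next;
        }
        return -f "$dir/$domain";                       # file stored inside its own prefix dir
    }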

    Oh, the other reason for using a FS versus a MySQL DB is that I hate to
    write Web apps and it would be very simple for someone to just create a
    new blacklist using Samba from a Windows box.

    So as far as handling a changing list, it could just be done nightly.
    Blow away the list, and rebuild.

    I know this is not the best approach, but I want to make something
    simple for a prototype. I hate to invest too much time into it only to
    decide it won't be used.

    Lastly, just thinking about this got me interested in solving the
    problem generically. As such, I wrote the quick script and posted
    it for review and perhaps for others' benefit.

    > Finally, I assume you used the value of 2 for demonstration purposes
    > only; such a small value is not practical. It makes little sense to
    > restrict directory sizes to much less than a disk block, and a structure
    > like the one you showed, with lots of directories holding a single file
    > each, would be slower than one that is one level shallower.


    Correct, 2 was used merely for demonstration purposes. I intend to set
    the number to around 1000. I don't want the directory listing to be
    too overwhelming.

    > > Now, that said, this is NOT intended for "hashifying" mail storage
    > > dirs. It IS intended to "hashify" a HUGE list of filenames.
    > > Unfortunately this code is VERY inefficient.

    >
    > How did you determine that it is very inefficient?


    By running it... :) It was AWFUL! I ran it against a file containing
    nearly 600,000 entries. I never waited long enough for it to finish.

    > > So, I post it here so people can see my idea if it helps, and so that
    > > people can maybe direct me to an existing CPAN module that would
    > > accomplish the same thing?

    >
    > When I needed to do similar things I used a hash function (e.g. MD5 or
    > SHA-1) on the key (the filename in your case) to get nice, uniformly
    > distributed constant length filenames and then computed the number of
    > levels from the maximum number of files I had to store. With 256 files
    > per directory, 2 levels would be enough for 16 million files and 3
    > levels would be enough for 4 billion files. I never needed more :).


    This is good information, thanks. I'll consider its practical
    uses, but it would defeat the goal of simple blacklist updates via
    Samba. :(

    > > Or, perhaps someone likes what I've started and wants to help improve
    > > the code?

    >
    > If you avoid copying around huge lists it might be faster. But I'm not
    > sure the code even does what you want - see my questions above.


    How could I avoid that?

     
    , Nov 2, 2006
    #7
  8. On 2006-11-02 18:41, <> wrote:
    > Peter, thanks for your thoughtful reply! :) My response below...
    >
    > Peter J. Holzer wrote:
    >> On 2006-10-31 23:40, <> wrote:
    >> > I whipped up this quick and ugly script and I wanted to post it for
    >> > code review and others' benefit.
    >> >
    >> > With an array such as:
    >> > qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
    >> >
    >> > The program returns:
    >> > # perl list2fs.pl 2
    >> > /a/aa/aaa/aaaa/aaaa
    >> > /a/aa/aaa/aaab/aaab

    [...]
    >> > /c/cccc
    >> > /d/dddd
    >> >
    >> > Now as you can see, what this program does is take a list of filenames
    >> > and "hashify" it like mailbox storage, allowing no more than 2 (or
    >> > whatever $ARGV[0] is) filenames in a single directory. The point,
    >> > obviously, is that if you have 100000 filenames and ext3 won't store
    >> > 100000 files in a single directory, you can use this technique to break
    >> > them down.

    >>
    >> Ext3 will happily store 100000 filenames in a directory - it just won't
    >> be very quick in retrieving them (Even that isn't true for Linux 2.6 any
    >> more - ext3 directories now use a structure called an "htree" to quickly
    >> access files in huge directories). But assuming you need to use ext3
    >> with older kernel versions or other filesystems with linear directories:

    >
    > Ok, I didn't know that. I thought I had read somewhere that the limit
    > was 32,000 (and thus the reason for using something like reiserfs), but
    > that was apparently an ancient article... Oops!


    AFAIK, there was never a limit on the number of files in an ext2/ext3
    directory. 32 k may be the maximum number of subdirectories (because the
    link count is 16 bits, and each ".." entry is a link). But you wouldn't
    want more than a few thousand files in a single directory because it was slow.


    >> aaab aaac aaad aaae aaa
    >>
    >> You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
    >> have both.

    >
    > Actually, in this case, aaa the file would go in aaa the directory.


    Right. But my gut feeling that you don't handle that case correctly was
    right. See below.


    >> Do you have all the filenames in advance or is it possible to create new
    >> files after the structure has been created? If it is the latter, you
    >> probably will need a way to split an existing directory if the number
    >> of files in it becomes too large - what happens when somebody accesses
    >> the directory in the middle of the split operation? How do you determine
    >> when you have to split? Count files each time you create a new file?

    >
    > I have the majority of the filenames in advance, but unfortunately the
    > list will change. To be more specific about what I'm doing: the URLs
    > are blacklisted domains to be used with squid.


    Are they URLs or domains? With URLs you may need to be careful because
    of '/' and possibly other characters (especially if you access the
    filesystem via Windows/Samba).


    >> > Now, that said, this is NOT intended for "hashifying" mail storage
    >> > dirs. It IS intended to "hashify" a HUGE list of filenames.
    >> > Unfortunately this code is VERY inefficient.

    >>
    >> How did you determine that it is very inefficient?

    >
    > By running it... :) It was AWFUL! I ran it against a file containing
    > nearly 600,000 entries. I never waited long enough for it to finish.


    That's probably because it doesn't finish at all if one of the strings
    is a prefix of more than $ARGV[0] other strings. Change the
    initialization to

    my @list = @_ = qw(aa aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);

    and try it. (It will probably crash pretty soon when it runs out of
    memory, but if you had an infinite amount of memory it would run
    forever)


    >> > So, I post it here so people can see my idea if it helps, and so that
    >> > people can maybe direct me to an existing CPAN module that would
    >> > accomplish the same thing?

    >>
    >> When I needed to do similar things I used a hash function (e.g. MD5 or
    >> SHA-1) on the key (the filename in your case) to get nice, uniformly
    >> distributed constant length filenames and then computed the number of
    >> levels from the maximum number of files I had to store. With 256 files
    >> per directory, 2 levels would be enough for 16 million files and 3
    >> levels would be enough for 4 billion files. I never needed more :).

    >
    > This is good information, thanks. I'll consider its practical
    > uses, but it would defeat the goal of simple blacklist updates via
    > Samba. :(


    You want the directory structure to be created with the script anyway,
    don't you? Or do you expect Windows users to create and delete
    individual files to edit the blacklist?


    >> > Or, perhaps someone likes what I've started and wants to help improve
    >> > the code?

    >>
    >> If you avoid copying around huge lists it might be faster. But I'm not
    >> sure the code even does what you want - see my questions above.

    >
    > How could I avoid that?


    I was wrong. You are passing the lists by reference, and the hashes
    which you copy needlessly aren't that big (at most 70 elements, assuming
    printable ASCII). The problem is that you are calling grep way too often
    which also causes the endless recursion (and would be slow even without
    that).

    Here is my version of your function "a" (you have some talent for
    choosing horrible names, btw. I could only figure out what the variables
    were supposed to be by reading the code - the names were at best
    useless and at worst misleading):

    sub a {
        my ($prefix_len, $list) = @_;
        #print " " x $prefix_len, (scalar @$list), " elements\n";
        my %hash = ();
        push @{ $hash{lc(substr($_, 0, $prefix_len))} }, $_ for @$list;
        foreach my $sublist ( values %hash ) {
            if ( @$sublist > $ARGV[0] ) {
                $sublist = a($prefix_len+1, $sublist);
            } else {
                $sublist = { map { lc($_) => 1 } @$sublist };
            }
        }
        return \%hash;
    }

    The biggest change is that I didn't try to figure out the prefixes first
    and then find all strings with the same prefix (which caused duplication
    and the endless recursion) but do that in one loop - that way each
    string can only ever be assigned to one prefix and the lists must get
    shorter. The switching between array and hash references is mostly so
    that a can be called with an array reference as before; I could have
    used hash references throughout. It also now returns a hash reference
    instead of a list-which-is-a-hash, which I consider cleaner.
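    Since a now returns a hash reference, the call site has to change to
    match, e.g. (sketch):

    my $hash = a(1, \@list);   # was: my %hash = a(1, \@list);
    b('', %$hash);             # dereference; b itself is unchanged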

    I didn't have a list of 600000 domain names handy, so I used two lists
    with 35000 and 96000 names: The run times (with $ARGV[0] == 1000) were
    about 3 and 9 seconds, respectively, on a PIII 850.

    hp

    --
    _ | Peter J. Holzer | > Why should anyone invent something that
    |_|_) | Sysadmin WSR | > doesn't exist?
    | | | | What else would be the point of inventing?
    __/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd
     
    Peter J. Holzer, Nov 3, 2006
    #8
