looking for some size optimization

Discussion in 'Perl Misc' started by Marc Espie, Apr 15, 2007.

  1. Marc Espie

    Marc Espie Guest

    I'm looking at a script that handles a huge amount of data... basically,
    the filenames from +4000 packages in order to recognize conflicts.

    Currently, it builds a big hash through a loop that constructs
    $all_conflict like this:

    my $file= File::Spec->canonpath($self->fullname());
    push ${$all_conflict->{$file}}, $pkgname;


    I end up with a hash of 250M.

    I can't really use Devel::Size with any benefit, since all the data
    is one single chunk (all the other data in the program amount to <2~3M)

    I expect the $pkgname strings to be shared. In fact, I tried replacing
    $pkgname with \$pkgname, with negative size results (+20M).

    I've tried splitting the path along `frequent' directories, with inconclusive
    gains (most of the files live under /usr/local, /etc, or /var/www):
    -20M at most.

    I'm looking for bright ideas to try and reduce the size used... without
    being too detrimental in terms of speed...
     
    Marc Espie, Apr 15, 2007
    #1

  2. Mumia W.

    Mumia W. Guest

    If swapping is a problem, you could try using a tied database hash such
    as DB_File or NDBM_File or SDBM_File or GDBM_File.

    Perhaps try all of those to see which one performs best. If nothing
    works, buy more ram :)
     
    Mumia W., Apr 15, 2007
    #2

  3. anno4000

    anno4000 Guest

    No, you end up with a syntax error, unless the second code line is
    really

    push @{$all_conflict->{$file}}, $pkgname;

    I'll assume that, but please don't re-type code, copy/paste it to
    make sure such errors don't happen.
    What makes you expect that? You're pushing unrelated strings on
    arrays.
    Sure. That way, you are storing a reference in addition to the strings.
    That comes at the cost of additional (anonymous, but real) hashes.
    The best storage strategy often depends on the nature of the data.
    In your case, I'll assume that actual conflicts are rare. That means
    the majority of your arrays of package names contain only one element.
    That is wasteful. You could use two hashes, one to detect if a file
    name has been seen before and another to keep information about actual
    conflicts. Here is some code. I assume that pairs of package names
    and file names can be read from DATA:

    my ( %seen, %conflicts);
    while ( <DATA> ) {
        my ( $package, $file) = split;
        if ( exists $seen{ $file} ) {
            $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
            push @{ $conflicts{ $file} }, $package;  # add another package
        } else {
            $seen{ $file} = $package;                # mark as seen
        }
    }

    print "No conflicts\n" unless %conflicts;
    print "$_ in ", join( ', ' => @{ $conflicts{ $_} }), "\n"
        for keys %conflicts;

    This stores the first package name a file is seen with as a plain
    string in %seen, not a one-element array. Only when a file is seen
    a second time is an array built in %conflicts. If conflicts are
    rare, this should save some memory.

    Anno
     
    anno4000, Apr 15, 2007
    #3
  4. xhoster

    xhoster Guest

    How many more than 4000 do you have? To take up the amount of room you
    are talking about, it seems you would need orders of magnitude more
    than 4000. But maybe I don't understand the nature of the data. What does
    a conflict end up being represented as?
    Is $pkgname the name of the package declared in $file, or the name
    of the package used in $file. In any event, are you pushing the same
    $pkgname onto same file's list over and over? Maybe you should use a hash
    of hash instead of hash of array:

    $all_conflict->{$file}->{$pkgname} = undef;
    Hash keys are shared (with substantial overhead) but array values are not.
    If your paths are really so long, you might try compressing them, or
    digesting them.
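
    A minimal sketch of the digesting idea, assuming the core Digest::MD5
    module (the register() helper is hypothetical, not from the script):
    the 16-byte binary digest replaces the full path as the hash key,
    trading readable keys for fixed-size ones.

    ```perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my %all_conflict;

    # Key on the binary MD5 of the canonical path instead of the path
    # itself; long paths shrink to a fixed 16-byte key.
    sub register {
        my ($file, $pkgname) = @_;
        push @{ $all_conflict{ md5($file) } }, $pkgname;
    }

    register('/usr/local/share/doc/foo/README', 'foo-1.0');
    register('/usr/local/share/doc/foo/README', 'bar-2.3');
    ```

    The cost is that the original path can no longer be recovered from
    the key, so this only helps if the reporting pass keeps paths
    elsewhere (or re-reads them).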

    [...]
    Without knowing more about the exact nature of the data, it is hard to
    say. Optimization that involves changing the way a structure is built
    is often driven by the way the structure is used, so showing that part
    may also be useful.

    Xho
     
    xhoster, Apr 15, 2007
    #4
  5. Marc Espie

    Marc Espie Guest

    Oops, sorry about that. I usually copy&paste code indeed, and I could
    have sworn I did.
    They're not necessarily unrelated.

    I guess I'll have to share more code ;-) . I grab the full list
    for each package, and populate the hash with that string like so:

    $plist->visit('register', $filehash, $dephash, $plist->pkgname());

    (not quoting my plist and visitor code, I assume you can figure out that
    it will end up calling the above snippet for each item in the packing-list)

    so I am certain that I call the registration routine with an identical
    string for each item in a given package. I haven't looked too closely at
    perl's internals, but I would have assumed it to share the string in such
    a case ?
    This looks like a good idea indeed.
    I'm wondering if maybe I should build a cache of pkgname lists,
    and try to share as many as I can...

    anyways, I'll try your idea (and the corresponding one) and tell you whether
    I see any improvement.

    While I'm there, responding to the other persons: sorry, my first message
    wasn't clear. I do not want to go off to disk to a dbm or a database.
    As it stands, the program works. It's just that 250M is over the default
    ulimit of the considered system, which means that people have to remember
    to bump the limits before running it. There's also the fact that the current
    set of packages is bound to grow, so if I can find a good idea to reduce
    memory usage, that would be cool. But using external storage is not the
    solution I'm looking for.
     
    Marc Espie, Apr 16, 2007
    #5
  6. Marc Espie

    Marc Espie Guest

    Thanks for the insight. With a few lines of additional code, this does shrink
    the process footprint from >250M to 190M. Very nice indeed.
    Doesn't appear to be slower, the computation is IO-bound anyways (reading
    packing-lists from gzip'ed archives).

    The code now looks like:

    my $file = File::Spec->canonpath($self->fullname());
    if (exists $all_conflict->{$file}) {
        $list->{$all_conflict->{$file}}->{$pkgname} ||=
            [@{$all_conflict->{$file}}, $pkgname];
        $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
    } else {
        $list->{$pkgname} ||= [$pkgname];
        $all_conflict->{$file} = $list->{$pkgname};
    }

    I may try to tweak it a bit further, but it's already a very nice improvement.
    (if somebody has other ideas, I'm still game ;-) )

    Anyways, thanks a lot ;-)

    (that's the ports/infrastructure/packages/find-all-conflicts script used in
    OpenBSD, btw)

    I think I'm going to use an extra seen array as well... I need to do some
    measurements, but I suspect most of my files don't appear in more than
    one package, so scanning the conflicts hash later probably can be sped up
    by a large amount.
     
    Marc Espie, Apr 16, 2007
    #6
  7. Uri Guttman

    Uri Guttman Guest

    ME> my $file= File::Spec->canonpath($self->fullname());
    ME> if (exists $all_conflict->{$file}) {
    ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
    ME> [@{$all_conflict->{$file}}, $pkgname ];

    that is slow as you copy the existing array back into another anon
    array. and you have a lot of code redundancy all over this. you always
    are assigning an anon array but autovivification will handle that for
    you. put this before the if (and i am not even sure you need a
    conditional there at all but i haven't followed the logic flow)

    push( @{$all_conflict->{$file}}, $pkgname );

    that will replace the first line in both clauses and be much faster as
    well (no wasteful extra copying). it might even save space as perl won't
    be allocating and freeing as many buffers so less storage could be used.
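
    For readers unfamiliar with the point: push autovivifies the inner
    array, so no explicit initialization is needed. A standalone
    illustration (file and package names are made up):

    ```perl
    use strict;
    use warnings;

    my %all_conflict;

    # Dereferencing a nonexistent hash value in lvalue context creates
    # the array automatically; no exists() check or [] assignment needed.
    push @{ $all_conflict{'/etc/foo.conf'} }, 'foo-1.0';
    push @{ $all_conflict{'/etc/foo.conf'} }, 'bar-2.3';

    print scalar @{ $all_conflict{'/etc/foo.conf'} }, "\n";   # prints 2
    ```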


    ME> $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
    ME> } else {
    ME> $list->{$pkgname} ||= [$pkgname];
    ME> $all_conflict->{$file} = $list->{$pkgname};
    ME> }

    as for the conflict hash, i am sure it can be reduced but i don't know
    the logic. you change $all_conflict->{$file} after each push (your ||=
    code) which makes no sense to me. maybe you should clearly explain the
    data structure you want to get out of this. i have yet to see such an
    explanation in this thread (or i am not awake yet). i can't see how 4000
    entries of maybe a few hundred bytes each will use up 250MB (or even 190).

    ME> (that's the ports/infrastructure/packages/find-all-conflicts
    ME> script used in OpenBSD, btw)

    any url to get that directly? if that is published code then my autoviv
    fix will save tons of time for many users. that copy anon arrays to
    themselves thing is massively bad code. for more on autovivification see
    my article at http://sysarch.com/Perl/autoviv.txt.

    knowing that the build code is poorly designed, now i am confident that
    the data structure is also poorly designed and can be majorly optimized.

    uri
     
    Uri Guttman, Apr 16, 2007
    #7
  8. Marc Espie

    Marc Espie Guest

    But still, the new structure takes less space than the old one.
    The trick is that there is one single ref for each pkgname set now.

    The simple one-line push will build one separate array for each file,
    and this looks like it is the stuff gobbling memory.
    We are talking 4000 packages. Which contain about 700000 files, total.
    Go ahead, knock yourself out...

    You can get this through OpenBSD's cvsweb, from openbsd.org.
    http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

    the extra pm files live under
    http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/pkg_add
     
    Marc Espie, Apr 16, 2007
    #8
  9. Ilya Zakharevich

    Ilya Zakharevich Guest

    So you want a hash with 700000 entries? With 100 bytes/entry of overhead,
    it will take at least about 70M + the size of the entries.

    You did not explain what is the distribution of lengths of arrays
    which are values of the hash. E.g, if you have a lot of singletons,
    it would save space to store them directly in the hash (the price
    being an extra check when adding a new entry).

    You may also win by enumerating the packages, and storing the index:

    my $idx = $seen{$pkgname};
    $idx = $seen{$pkgname} = ++$max_pk_idx, $pk[$idx] = $pkgname
    unless defined $idx;
    push @{$all_conflict->{$file}}, 0 + $idx; # Cvt to "pure-number"

    Also, take care not to stringify $all_conflict->{$file}[$n], copy it
    to a scratch variable if you ever need the string value of it...

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 16, 2007
    #9
  10. Uri Guttman

    Uri Guttman Guest

    ME> But still, the new structure takes less space than the old one.
    ME> The trick is that there is one single ref for each pkgname set now.

    that may be the case but the intermediate storage will be dramatically
    reduced too with my code.

    ME> The simple one-line push will build one separate array for each file,
    ME> and this looks like it is the stuff gobbling memory.

    i will take a look at that when i can.

    ME> We are talking 4000 packages. Which contain about 700000 files, total.

    i didn't see the 700k files part anywhere before. the dataset wasn't
    defined for me.

    ME> Go ahead, knock yourself out...

    ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
    ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

    is this your program for the bsd distro? or are you trying to improve it
    for them?

    uri
     
    Uri Guttman, Apr 16, 2007
    #10
  11. Uri Guttman

    Uri Guttman Guest

    ME> Go ahead, knock yourself out...

    ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
    ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

    a quick scan shows many places to improve that script, noticeably in
    speed. one of your change logs comments on a 5% speedup and i can easily
    get more as you are still doing that anon array copy stuff.

    it would help to have access to the package data and some functional
    specs. i will still do a basic code review here in a day or so if the
    time works out. i don't understand the logic or goals enough to do a
    deeper review as in shrinking the data structures which is the big
    issue.

    do you have any descriptions of the package info and what defines a
    package collision? this is something which should be in the
    docs/pod/comments of the script.

    uri
     
    Uri Guttman, Apr 17, 2007
    #11
  12. Marc Espie

    Marc Espie Guest

    Please explain to me how this is massively bad code. I know about
    auto-vivification. I don't see any unintended auto-vivification in my
    data structure.

    Also, when you talk about copying anon arrays to themselves, I assume you're
    referring to code like:
    $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
        [@{$all_conflict->{$file}}, $pkgname];
    $all_conflict->{$file} =
        $pkg_list->{$all_conflict->{$file}}->{$pkgname};

    but it doesn't intend to copy anon arrays to themselves. The idea
    is that for each set of pkgname1, pkgname2, pkgname3, I want to
    have one single ref to an array with [pkgname1, pkgname2, pkgname3].

    Assuming I have initially $all_conflict->{$file} corresponding to
    $r = [$pkgname1, $pkgname2], then I will try to build
    $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    unique new reference.

    I can use references as keys in hashes, right?
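
    Refs do work as hash keys (they stringify to something like
    ARRAY(0x...)), but an alternative sketch is to intern each package
    set under a key built from its contents, so the same set always
    yields the same shared arrayref regardless of which ref it grew
    from. The intern_set() helper is hypothetical, not from the script:

    ```perl
    use strict;
    use warnings;

    my %interned;   # "pkg1\0pkg2" => the one shared arrayref for that set

    # Key the cache on the joined names rather than on a reference's
    # string form; identical sets share one arrayref.
    sub intern_set {
        my $key = join "\0", @_;
        return $interned{$key} ||= [@_];
    }

    my $a = intern_set('pkgname1', 'pkgname2');
    my $b = intern_set('pkgname1', 'pkgname2');
    print $a == $b ? "shared\n" : "distinct\n";   # prints "shared"
    ```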
     
    Marc Espie, Apr 17, 2007
    #12
  13. Marc Espie

    Marc Espie Guest

    It's probably impractical to give you access to the package data. A full
    package build is slightly over 5G in size these days....
    The documentation is scattered in other parts of the tools. Basic conflict
    information is documented in OpenBSD::pkgCfl(3p)
    The packing-list format is partly documented in pkg_create(1), partly
    in OpenBSD::packingElement(3p). The format of package specifications is
    explained in package_specs(7) (you can get all this through the cvsman
    interface in OpenBSD).

    These are somewhat big tools, and still evolving. The part that's more or
    less stable is documented...

    As far as find-all-conflicts goes, there are two passes: one that
    builds a correspondence between each file-like object in all packages,
    and the corresponding pkgnames that contain this object, and a second
    pass that checks for each resulting list of pkgnames whether there is
    an actual collision. Some packages cannot be installed at the same time,
    as they conflict (they usually are variant versions of the same software),
    so it's okay for the same file to be present in both packages. The situation
    is slightly more complex, because each package actually pulls a dependency
    tree, so we have to look at the closure of packages through dependencies:
    even if a file belongs to 2 packages that do not conflict directly, there
    might be a conflict deeper in the dependency chain, which results in both
    packages not being installable simultaneously anyways.

    In the end, the script reports the (very small) number of files which happen
    to collide, but that no-one is yet aware of. As a result, we decide whether
    we want to change the file names, or whether the conflict should be
    explicitly registered...
     
    Marc Espie, Apr 17, 2007
    #13
  14. Uri Guttman

    Uri Guttman Guest

    ME> Please explain to me how this is massively bad code. I know about
    ME> auto-vivification. I don't see any unintended auto-vivification in my
    ME> data structure.

    it doesn't autovivify at all is my point.

    ME> Also, when you talk about copying anon arrays to themselves, I assume you're
    ME> referring to code like:
    ME> $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
    ME> [@{$all_conflict->{$file}}, $pkgname ];
    ME> $all_conflict->{$file} =
    ME> $pkg_list->{$all_conflict->{$file}}->{$pkgname};

    ME> but it doesn't intend to copy anon arrays to themselves. The idea
    ME> is that for each set of pkgname1, pkgname2, pkgname3, I want to
    ME> have one single ref to an array with [pkgname1, pkgname2, pkgname3].

    ok, then you can make a better data structure for that. i still don't
    get the conflict logic as it isn't spelled out.

    ME> Assuming I have initially $all_conflict->{$file} corresponding to
    ME> $r = [$pkgname1, $pkgname2], then I will try to build
    ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    ME> unique new reference.

    ME> I can use references as keys in hashes, right ?

    yes, you can do that.

    i need to grok the conflict logic. my gut says there is a much better
    structure out there and i trust it well.

    uri
     
    Uri Guttman, Apr 17, 2007
    #14
  15. Ilya Zakharevich

    Ilya Zakharevich Guest

    Likewise here. To the OP: we MUST know what kind of information one
    must have AT THE END, and what is the "elementary event" which
    triggers update of the info. [Some stats of the state at the end
    are also useful.]

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 17, 2007
    #15
  16. anno4000

    anno4000 Guest

    With lots of filenames and their common prefixes, a trie comes to mind.
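
    A minimal sketch of that idea, storing each path as nested hashes
    keyed by its components so common prefixes like /usr/local are
    stored only once (the add_path() helper and the sentinel key are
    illustrative assumptions, not from the script):

    ```perl
    use strict;
    use warnings;
    use File::Spec;

    my %trie;

    # Each directory component becomes a nested hash key, so shared
    # prefix directories cost their storage only once.
    sub add_path {
        my ($path, $pkgname) = @_;
        my $node = \%trie;
        for my $part (File::Spec->splitdir($path)) {
            next if $part eq '';          # skip the empty leading component
            $node = $node->{$part} ||= {};
        }
        push @{ $node->{"\0pkgs"} }, $pkgname;   # payload under a sentinel key
    }

    add_path('/usr/local/bin/foo', 'foo-1.0');
    add_path('/usr/local/bin/bar', 'bar-2.0');
    ```

    Whether this wins depends on per-hash overhead: with many tiny
    hashes the bookkeeping can eat the savings from the shared prefixes.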

    Anno
     
    anno4000, Apr 17, 2007
    #16
  17. Marc Espie

    Marc Espie Guest

    Yeah, well... sorry about that. The point is that I've spent a lot of time
    looking at the problem, and trying to optimize things. I was more looking
    into some insight as to perl-specific memory issues I might have missed.

    It's not a vital problem, the current program works, but it would be nice
    to shave even more memory.

    As far as stats go, I have a locate(1) database built from the same data.
    The whole information is roughly 60M, uncompressed. Which translates to
    a tight locate database of 5.6M (unfortunately, it's useless for the
    problem at hand, since lookups are expensive). This means that each
    entry is about 85 bytes long. So I have a factor *3 for each entry when
    I store it as a perl hash entry...

    I've tried my best to extract the relevant info from a large program, with
    a huge dataset (+4G packages).

    If you do not see how to optimize things further, no hardship...
     
    Marc Espie, Apr 18, 2007
    #17
  18. Ilya Zakharevich

    Ilya Zakharevich Guest

    You still do not follow... We see tens of ways; but without knowing
    (exactly) what you do, and (approximately) how the data looks, we
    cannot choose between those which have a chance to work, and those
    which won't.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 19, 2007
    #18
