looking for some size optimization

Discussion in 'Perl Misc' started by Marc Espie, Apr 15, 2007.

  1. Marc Espie

    Marc Espie Guest

    I'm looking at a script that handles a huge amount of data... basically,
    the filenames from +4000 packages in order to recognize conflicts.

    Currently, it builds a big hash through a loop that constructs
    $all_conflict like this:

    my $file= File::Spec->canonpath($self->fullname());
    push ${$all_conflict->{$file}}, $pkgname;


    I end up with a hash of 250M.

    I can't really use Devel::Size with any benefit, since all the data
    is one single chunk (all the other data in the program amount to <2~3M)

    I expect the $pkgname strings to be shared. In fact, I tried replacing
    $pkgname with \$pkgname, with negative size results (+20M).

    I've tried splitting the path along `frequent' directories, with inconclusive
    gains (most of the files live under /usr/local, /etc, or /var/www):
    -20M at most.

    I'm looking for bright ideas to try and reduce the size used... without
    being too detrimental in terms of speed...
     
    Marc Espie, Apr 15, 2007
    #1

  2. Mumia W.

    Mumia W. Guest

    On 04/15/2007 07:22 AM, Marc Espie wrote:
    > I'm looking at a script that handles a huge amount of data... basically,
    > the filenames from +4000 packages in order to recognize conflicts.
    >
    > Currently, it builds a big hash through a loop that constructs
    > $all_conflict like this:
    >
    > my $file= File::Spec->canonpath($self->fullname());
    > push ${$all_conflict->{$file}}, $pkgname;
    >
    >
    > I end up with a hash of 250M.
    > [...]


    If swapping is a problem, you could try using a tied database hash such
    as DB_File or NDBM_File or SDBM_File or GDBM_File.

    Perhaps try all of those to see which one performs best. If nothing
    works, buy more ram :)


    --
    Count the YOYOs:
    http://home.earthlink.net/~mumia.w.18.spam/games_fever/
     
    Mumia W., Apr 15, 2007
    #2

  3. Anno

    Anno Guest

    Marc Espie <> wrote in comp.lang.perl.misc:
    > I'm looking at a script that handles a huge amount of data... basically,
    > the filenames from +4000 packages in order to recognize conflicts.
    >
    > Currently, it builds a big hash through a loop that constructs
    > $all_conflict like this:
    >
    > my $file= File::Spec->canonpath($self->fullname());
    > push ${$all_conflict->{$file}}, $pkgname;
    >
    >
    > I end up with a hash of 250M.


    No, you end up with a syntax error, unless the second code line is
    really

    push @{$all_conflict->{$file}}, $pkgname;

    I'll assume that, but please don't re-type code, copy/paste it to
    make sure such errors don't happen.

    > I can't really use Devel::Size with any benefit, since all the data
    > is one single chunk (all the other data in the program amount to <2~3M)
    >
    > I expect the $pkgname strings to be shared. In fact, I tried replacing


    What makes you expect that? You're pushing unrelated strings on
    arrays.

    > $pkgname with \$pkgname, with negative size results (+20M).


    Sure. That way, you are storing a reference in addition to the strings.

    > I've tried splitting the path along `frequent' directories, with inconclusive
    > gains (most of the files live under /usr/local, /etc, or /var/www):
    > -20M at most.


    That comes at the cost of additional (anonymous, but real) hashes.

    > I'm looking for bright ideas to try and reduce the size used... without
    > being too detrimental in terms of speed...


    The best storage strategy often depends on the nature of the data.
    In your case, I'll assume that actual conflicts are rare. That means
    the majority of your arrays of package names contain only one element.
    That is wasteful. You could use two hashes, one to detect if a file
    name has been seen before and another to keep information about actual
    conflicts. Here is some code. I assume that pairs of package names
    and file names can be read from DATA:

    my ( %seen, %conflicts);
    while ( <DATA> ) {
        my ( $package, $file) = split;
        if ( exists $seen{ $file} ) {
            $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
            push @{ $conflicts{ $file} }, $package;  # add another package
        } else {
            $seen{ $file} = $package;                # mark as seen
        }
    }

    print "No conflicts\n" unless %conflicts;
    print "$_ in ", join( ', ' => @{ $conflicts{ $_} }), "\n"
        for keys %conflicts;

    This stores the first package name a file is seen with as a plain
    string in %seen, not a one-element array. Only when a file is seen
    a second time is an array built in %conflicts. If conflicts are
    rare, this should save some memory.

    Anno
     
    Anno, Apr 15, 2007
    #3
  4. Xho

    Xho Guest

    Marc Espie wrote:
    > I'm looking at a script that handles a huge amount of data... basically,
    > the filenames from +4000 packages in order to recognize conflicts.


    How many more than 4000 do you have? To take up the amount of room you
    are talking about, it seems you would need orders of magnitude more
    than 4000. But maybe I don't understand the nature of the data. What does
    a conflict end up being represented as?

    >
    > Currently, it builds a big hash through a loop that constructs
    > $all_conflict like this:
    >
    > my $file= File::Spec->canonpath($self->fullname());
    > push ${$all_conflict->{$file}}, $pkgname;


    Is $pkgname the name of the package declared in $file, or the name
    of the package used in $file? In any event, are you pushing the same
    $pkgname onto the same file's list over and over? Maybe you should use a hash
    of hashes instead of a hash of arrays:

    $all_conflict->{$file}->{$pkgname} = undef;

    > I end up with a hash of 250M.
    >
    > I can't really use Devel::Size with any benefit, since all the data
    > is one single chunk (all the other data in the program amount to <2~3M)
    >
    > I expect the $pkgname strings to be shared.


    Hash keys are shared (with substantial overhead) but array values are not.
    If your paths are really so long, you might try compressing them, or
    digesting them.
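    A sketch of the digesting idea (mine, not code from the thread): store a
    fixed-size MD5 digest of each path as the hash key instead of the full
    path string. Digest::MD5 ships with perl; the register() helper name is
    made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my %seen;       # digest => first package seen with that file
my %conflicts;  # digest => arrayref of packages sharing that file

# Hypothetical registration routine: keys are 16 raw bytes apiece,
# instead of full paths averaging ~85 bytes.
sub register {
    my ($file, $pkgname) = @_;
    my $key = md5($file);
    if (exists $seen{$key}) {
        $conflicts{$key} ||= [ $seen{$key} ];   # promote to an array
        push @{ $conflicts{$key} }, $pkgname;
    } else {
        $seen{$key} = $pkgname;
    }
}
```

    The cost is that a digest key cannot be turned back into a path, so the
    real script would need a side table or second pass to name the colliding
    files.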

    [...]
    > I'm looking for bright ideas to try and reduce the size used... without
    > being too detrimental in terms of speed...


    Without knowing more about the exact nature of the data, it is hard to say.
    Optimization that involves changing the way a structure is built is often
    driven by the way the structure is used, so showing that part may also be
    useful.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    Xho, Apr 15, 2007
    #4
  5. Marc Espie

    Marc Espie Guest

    In article <>,
    <-berlin.de> wrote:
    >Marc Espie <> wrote in comp.lang.perl.misc:
    >> I'm looking at a script that handles a huge amount of data... basically,
    >> the filenames from +4000 packages in order to recognize conflicts.
    >>
    >> Currently, it builds a big hash through a loop that constructs
    >> $all_conflict like this:
    >>
    >> my $file= File::Spec->canonpath($self->fullname());
    >> push ${$all_conflict->{$file}}, $pkgname;
    >>
    >>
    >> I end up with a hash of 250M.

    >
    >No, you end up with a syntax error, unless the second code line is
    >really
    >
    > push @{$all_conflict->{$file}}, $pkgname;
    >
    >I'll assume that, but please don't re-type code, copy/paste it to
    >make sure such errors don't happen.


    Oops, sorry about that. I usually copy&paste code indeed, and I could
    have sworn I did.

    >> I expect the $pkgname strings to be shared. In fact, I tried replacing

    >
    >What makes you expect that? You're pushing unrelated strings on
    >arrays.


    They're not necessarily unrelated.

    I guess I'll have to share more code ;-) . I grab the full list
    for each package, and populate the hash with that string like so:

    $plist->visit('register', $filehash, $dephash, $plist->pkgname());

    (not quoting my plist and visitor code, I assume you can figure out that
    it will end up calling the above snippet for each item in the packing-list)

    so I am certain that I call the registration routine with an identical
    string for each item in a given package. I haven't looked too closely at
    perl's internals, but I would have assumed it to share the string in such
    a case?
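    [Editorially, the answer can be checked directly: perl copies the string
    on each push rather than sharing it. A small sketch of mine, using the
    core Scalar::Util module, showing that the array ends up holding three
    distinct scalars:]

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical example: push the same variable three times and check
# whether the array elements are one shared scalar or three separate ones.
my $pkgname = 'graphviz-2.12';
my @slots;
push @slots, $pkgname for 1 .. 3;

# refaddr(\$_) gives the address of each element's underlying scalar;
# map aliases $_ to the elements themselves.
my %uniq = map { refaddr(\$_) => 1 } @slots;
print scalar(keys %uniq), "\n";   # prints 3: each push made a copy
```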

    >The best storage strategy often depends on the nature of the data.
    >In your case, I'll assume that actual conflicts are rare. That means
    >the majority of your arrays of package names contain only one element.
    >That is wasteful. You could use two hashes, one to detect if a file
    >name has been seen before and another to keep information about actual
    >conflicts. Here is some code. I assume that pairs of package names
    >and file names can be read from DATA:
    >
    > my ( %seen, %conflicts);
    > while ( <DATA> ) {
    > my ( $package, $file) = split;
    > if ( exists $seen{ $file} ) {
    > $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
    > push @{ $conflicts{ $file} }, $package; # add another package
    > } else {
    > $seen{ $file} = $package; # mark as seen
    > }
    > }


    This looks like a good idea indeed.
    I'm wondering if maybe I should build a cache of pkgname lists,
    and try to share as many as I can...

    anyways, I'll try your idea (and the corresponding one) and tell you whether
    I see any improvement.

    While I'm there, responding to the other persons: sorry, my first message
    wasn't clear. I do not want to go off to disk to a dbm or a database.
    As it stands, the program works. It's just that 250M is over the default
    ulimit of the considered system, which means that people have to remember
    to bump the limits before running it. There's also the fact that the current
    set of packages is bound to grow, so if I can find a good idea to reduce
    memory usage, that would be cool. But using external storage is not the
    solution I'm looking for.
     
    Marc Espie, Apr 16, 2007
    #5
  6. Marc Espie

    Marc Espie Guest

    In article <evvq4o$5ua$>,
    Marc Espie <> wrote:
    >In article <>,
    > <-berlin.de> wrote:
    >> my ( %seen, %conflicts);
    >> while ( <DATA> ) {
    >> my ( $package, $file) = split;
    >> if ( exists $seen{ $file} ) {
    >> $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
    >> push @{ $conflicts{ $file} }, $package; # add another package
    >> } else {
    >> $seen{ $file} = $package; # mark as seen
    >> }
    >> }

    >
    >This looks like a good idea indeed.
    >I'm wondering if maybe I should build a cache of pkgname lists,
    >and try to share as many as I can...


    Thanks for the insight. With a few lines of additional code, this does shrink
    the process footprint from >250M to 190M. Very nice indeed.
    It doesn't appear to be slower; the computation is IO-bound anyways (reading
    packing-lists from gzip'ed archives).

    The code now looks like:

    my $file = File::Spec->canonpath($self->fullname());
    if (exists $all_conflict->{$file}) {
        $list->{$all_conflict->{$file}}->{$pkgname} ||=
            [@{$all_conflict->{$file}}, $pkgname];
        $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
    } else {
        $list->{$pkgname} ||= [$pkgname];
        $all_conflict->{$file} = $list->{$pkgname};
    }

    I may try to tweak it a bit further, but it's already a very nice improvement.
    (if somebody has other ideas, I'm still game ;-) )

    Anyways, thanks a lot ;-)

    (that's the ports/infrastructure/packages/find-all-conflicts script used in
    OpenBSD, btw)

    I think I'm going to use an extra seen array as well... I need to do some
    measurements, but I suspect most of my files don't appear in more than
    one package, so scanning the conflicts hash later probably can be sped up
    by a large amount.
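    [For reference, the interning trick in that snippet can be written a bit
    more compactly. This is an editorial sketch, not the find-all-conflicts
    code; the %pool name and the register() wrapper are made up:]

```perl
use strict;
use warnings;

my %pool;          # "pkgA\0pkgB" => the one shared arrayref for that list
my %all_conflict;  # file => shared arrayref of package names

# Each distinct package list exists exactly once; files with the same
# conflict set all point at the same arrayref.
sub register {
    my ($file, $pkgname) = @_;
    my @list = exists $all_conflict{$file}
             ? (@{ $all_conflict{$file} }, $pkgname)
             : ($pkgname);
    # assumption: NUL cannot occur in a package name, so the join is unambiguous
    my $key = join "\0", @list;
    $all_conflict{$file} = $pool{$key} ||= [ @list ];
}
```

    As in Marc's version, the win depends on most files sharing a small
    number of distinct package lists.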
     
    Marc Espie, Apr 16, 2007
    #6
  7. Uri Guttman

    Uri Guttman Guest

    >>>>> "ME" == Marc Espie <> writes:


    ME> my $file= File::Spec->canonpath($self->fullname());
    ME> if (exists $all_conflict->{$file}) {
    ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
    ME> [@{$all_conflict->{$file}}, $pkgname ];

    that is slow as you copy the existing array back into another anon
    array. and you have a lot of code redundancy all over this. you always
    are assigning an anon array but autovivification will handle that for
    you. put this before the if (and i am not even sure you need a
    conditional there at all but i haven't followed the logic flow)

    push( @{$all_conflict->{$file}}, $pkgname );

    that will replace the first line in both clauses and be much faster as
    well (no wasteful extra copying). it might even save space as perl won't
    be allocating and freeing as many buffers so less storage could be used.


    ME> $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
    ME> } else {
    ME> $list->{$pkgname} ||= [$pkgname];
    ME> $all_conflict->{$file} = $list->{$pkgname};
    ME> }

    as for the conflict hash, i am sure it can be reduced but i don't know
    the logic. you change $all_conflict->{$file} after each push (your ||=
    code) which makes no sense to me. maybe you should clearly explain the
    data structure you want to get out of this. i have yet to see such an
    explanation in this thread (or i am not awake yet). i can't see how 4000
    entries of maybe a few hundred bytes each will use up 250MB (or even 190).

    ME> (that's the ports/infrastructure/packages/find-all-conflicts
    ME> script used in OpenBSD, btw)

    any url to get that directly? if that is published code then my autoviv
    fix will save tons of time for many users. that copy anon arrays to
    themselves thing is massively bad code. for more on autovivification see
    my article at http://sysarch.com/Perl/autoviv.txt.

    knowing that the build code is poorly designed, now i am confident that
    the data structure is also poorly designed and can be majorly optimized.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Apr 16, 2007
    #7
  8. Marc Espie

    Marc Espie Guest

    In article <>,
    Uri Guttman <> wrote:
    >>>>>> "ME" == Marc Espie <> writes:

    >
    >
    > ME> my $file= File::Spec->canonpath($self->fullname());
    > ME> if (exists $all_conflict->{$file}) {
    > ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
    > ME> [@{$all_conflict->{$file}}, $pkgname ];
    >
    >that is slow as you copy the existing array back into another anon
    >array. and you have a lot of code redundancy all over this. you always
    >are assigning an anon array but autovivification will handle that for
    >you. put this before the if (and i am not even sure you need a
    >conditional there at all but i haven't followed the logic flow)
    >
    > push( @{$all_conflict->{$file}}, $pkgname );


    But still, the new structure takes less space than the old one.
    The trick is that there is now one single ref for each pkgname set.

    The simple push line would build one separate array for each file,
    and that looks like the stuff that is gobbling memory.

    >as for the conflict hash, i am sure it can be reduced but i don't know
    >the logic. you change $all_conflict->{$file} after each push (your ||=
    >code) which makes no sense to me. maybe you should clearly explain the
    >data structure you want to get out of this. i have yet to see such an
    >explanation in this thread (or i am not awake yet). i can't see how 4000
    >entries of maybe a few hundred bytes each will use up 250MB (or even 190).


    We are talking 4000 packages. Which contain about 700000 files, total.

    > ME> (that's the ports/infrastructure/packages/find-all-conflicts
    > ME> script used in OpenBSD, btw)
    >
    >any url to get that directly? if that is published code then my autoviv
    >fix will save tons of time for many users. that copy anon arrays to
    >themselves thing is massively bad code. for more on autovivification see
    >my article at http://sysarch.com/Perl/autoviv.txt.


    >knowing that the build code is poorly designed, now i am confident that
    >the data structure is also poorly designed and can be majorly optimized.


    Go ahead, knock yourself out...

    You can get this through OpenBSD's cvsweb, from openbsd.org.
    http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

    the extra pm files live under
    http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/pkg_add
     
    Marc Espie, Apr 16, 2007
    #8
  9. Ilya Zakharevich

    [A complimentary Cc of this posting was sent to
    Marc Espie
    <>], who wrote in article <f00b1i$dmi$>:
    > > ME> my $file= File::Spec->canonpath($self->fullname());
    > > ME> if (exists $all_conflict->{$file}) {
    > > ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
    > > ME> [@{$all_conflict->{$file}}, $pkgname ];


    > We are talking 4000 packages. Which contain about 700000 files, total.


    So you want a hash with 700000 entries? With 100b/entry overhead, it
    will take at least about 70M + the size of the entries.

    You did not explain what is the distribution of lengths of arrays
    which are values of the hash. E.g, if you have a lot of singletons,
    it would save space to store them directly in the hash (the price
    being an extra check when adding a new entry).

    You may also win by enumerating the packages, and storing the index:

    my $idx = $seen{$pkgname};
    $idx = $seen{$pkgname} = ++$max_pk_idx, $pk[$idx] = $pkgname
    unless defined $idx;
    push @{$all_conflict->{$file}}, 0 + $idx; # Cvt to "pure-number"

    Also, take care not to stringify $all_conflict->{$file}[$n], copy it
    to a scratch variable if you ever need the string value of it...
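    [Fleshed out, Ilya's enumeration scheme might look like this; an
    editorial sketch in which pkg_index() and register() are invented names:]

```perl
use strict;
use warnings;

my %pkg_idx;        # package name => small integer
my @pkg_name;       # integer => package name (to map back at the end)
my %all_conflict;   # file => arrayref of package indices

# Return the package's index, assigning the next free one on first sight.
sub pkg_index {
    my ($pkgname) = @_;
    my $idx = $pkg_idx{$pkgname};
    unless (defined $idx) {
        $idx = $pkg_idx{$pkgname} = @pkg_name;  # next free slot
        push @pkg_name, $pkgname;
    }
    return $idx;
}

sub register {
    my ($file, $pkgname) = @_;
    # 0 + ... keeps the stored value a pure number, as Ilya advises,
    # so the arrays hold cheap IVs rather than strings
    push @{ $all_conflict{$file} }, 0 + pkg_index($pkgname);
}
```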

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 16, 2007
    #9
  10. Uri Guttman

    Uri Guttman Guest

    >>>>> "ME" == Marc Espie <> writes:

    >> that is slow as you copy the existing array back into another anon
    >> array. and you have a lot of code redundancy all over this. you always
    >> are assigning an anon array but autovivification will handle that for
    >> you. put this before the if (and i am not even sure you need a
    >> conditional there at all but i haven't followed the logic flow)
    >>
    >> push( @{$all_conflict->{$file}}, $pkgname );


    ME> But still, the new structure takes less space than the old one.
    ME> The trick is that there is one single ref for each pkgname set now.

    that may be the case but the intermediate storage will be dramatically
    reduced too with my code.

    ME> The simple line overhead will build one separate entry for each file
    ME> and this looks like it is the stuff gobbling memory.

    i will take a look at that when i can.

    >> as for the conflict hash, i am sure it can be reduced but i don't know
    >> the logic. you change $all_conflict->{$file} after each push (your ||=
    >> code) which makes no sense to me. maybe you should clearly explain the
    >> data structure you want to get out of this. i have yet to see such an
    >> explanation in this thread (or i am not awake yet). i can't see how 4000
    >> entries of maybe a few hundred bytes each will use up 250MB (or even 190).


    ME> We are talking 4000 packages. Which contain about 700000 files, total.

    i didn't see the 700k files part anywhere before. the dataset wasn't
    defined for me.

    >> knowing that the build code is poorly designed, now i am confident that
    >> the data structure is also poorly designed and can be majorly optimized.


    ME> Go ahead, knock yourself out...

    ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
    ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

    is this your program for the bsd distro? or are you trying to improve it
    for them?

    uri

     
    Uri Guttman, Apr 16, 2007
    #10
  11. Uri Guttman

    Uri Guttman Guest

    >>>>> "ME" == Marc Espie <> writes:

    ME> In article <>,
    ME> Uri Guttman <> wrote:
    >>
    >> any url to get that directly? if that is published code then my autoviv
    >> fix will save tons of time for many users. that copy anon arrays to
    >> themselves thing is massively bad code. for more on autovivification see
    >> my article at http://sysarch.com/Perl/autoviv.txt.


    >> knowing that the build code is poorly designed, now i am confident that
    >> the data structure is also poorly designed and can be majorly optimized.


    ME> Go ahead, knock yourself out...

    ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
    ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

    a quick scan shows many places to improve that script, notably in
    speed. one of your change logs comments on a 5% speedup and i can easily
    get more as you are still doing that anon array copy stuff.

    it would help to have access to the package data and some functional
    specs. i will still do a basic code review here in a day or so if the
    time works out. i don't understand the logic or goals enough to do a
    deeper review as in shrinking the data structures which is the big
    issue.

    do you have any descriptions of the package info and what defines a
    package collision? this is something which should be in the
    docs/pod/comments of the script.

    uri

     
    Uri Guttman, Apr 17, 2007
    #11
  12. Marc Espie

    Marc Espie Guest

    In article <>,
    Uri Guttman <> wrote:
    >any url to get that directly? if that is published code then my autoviv
    >fix will save tons of time for many users. that copy anon arrays to
    >themselves thing is massively bad code. for more on autovivification see
    >my article at http://sysarch.com/Perl/autoviv.txt.
    >
    >knowing that the build code is poorly designed, now i am confident that
    >the data structure is also poorly designed and can be majorly optimized.


    Please explain to me how this is massively bad code. I know about
    auto-vivification. I don't see any unintended auto-vivification in my
    data structure.

    Also, when you talk about copying anon arrays to themselves, I assume you're
    referring to code like:
    $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
        [@{$all_conflict->{$file}}, $pkgname];
    $all_conflict->{$file} =
        $pkg_list->{$all_conflict->{$file}}->{$pkgname};

    but it doesn't intend to copy anon arrays to themselves. The idea
    is that for each set of pkgname1, pkgname2, pkgname3, I want to
    have one single ref to an array with [pkgname1, pkgname2, pkgname3].

    Assuming I have initially $all_conflict->{$file} corresponding to
    $r = [$pkgname1, $pkgname2], then I will try to build
    $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    unique new reference.

    I can use references as keys in hashes, right?
     
    Marc Espie, Apr 17, 2007
    #12
  13. Marc Espie

    Marc Espie Guest

    In article <>,
    Uri Guttman <> wrote:
    >it would help to have access to the package data and some functional
    >specs. i will still do a basic code review here in a day or so if the
    >time works out. i don't understand the logic or goals enough to do a
    >deeper review as in shrinking the data structures which is the big
    >issue.


    It's probably impractical to give you access to the package data. A full
    package build is slightly over 5G in size these days....

    >do you have any descriptions of the package info and what defines a
    >package collision? this is something which should be in the
    >docs/pod/comments of the script.


    The documentation is scattered in other parts of the tools. Basic conflict
    information is documented in OpenBSD::pkgCfl(3p)
    The packing-list format is partly documented in pkg_create(1), partly
    in OpenBSD::packingElement(3p). The format of package specifications is
    explained in package_specs(7) (you can get all this through the cvsman
    interface in OpenBSD).

    These are somewhat big tools, and still evolving. The part that's more or
    less stable is documented...

    As far as find-all-conflicts goes, there are two passes: one that
    builds a correspondence between each file-like object in all packages
    and the corresponding pkgnames that contain this object, and a second
    pass that checks for each resulting list of pkgnames whether there is
    an actual collision. Some packages can not be installed at the same time,
    as they conflict (they usually are variant versions of the same software),
    so it's okay for the same file to be present in both packages. The situation
    is slightly more complex, because each package actually pulls a dependency
    tree, so we have to look at the closure of packages through dependencies:
    even if a file belongs to 2 packages that do not conflict directly, there
    might be a conflict deeper in the dependency chain, which results in both
    packages not being installable simultaneously anyways.

    In the end, the script reports the (very small) number of files which happen
    to collide, but that no-one is yet aware of. As a result, we decide whether
    we want to change the file names, or whether the conflict should be
    explicitly registered...
     
    Marc Espie, Apr 17, 2007
    #13
  14. Uri Guttman

    Uri Guttman Guest

    >>>>> "ME" == Marc Espie <> writes:

    ME> In article <>,
    ME> Uri Guttman <> wrote:
    >> any url to get that directly? if that is published code then my autoviv
    >> fix will save tons of time for many users. that copy anon arrays to
    >> themselves thing is massively bad code. for more on autovivification see
    >> my article at http://sysarch.com/Perl/autoviv.txt.
    >>
    >> knowing that the build code is poorly designed, now i am confident that
    >> the data structure is also poorly designed and can be majorly optimized.


    ME> Please explain to me how this is massively bad code. I know about
    ME> auto-vivification. I don't see any unintended auto-vivification in my
    ME> data structure.

    it doesn't autovivify at all is my point.

    ME> Also, when you talk about copying anon arrays to themselves, I assume you're
    ME> referring to code like:
    ME> $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
    ME> [@{$all_conflict->{$file}}, $pkgname ];
    ME> $all_conflict->{$file} =
    ME> $pkg_list->{$all_conflict->{$file}}->{$pkgname};

    ME> but it doesn't intend to copy anon arrays to themselves. The idea
    ME> is that for each set of pkgname1, pkgname2, pkgname3, I want to
    ME> have one single ref to an array with [pkgname1, pkgname2, pkgname3].

    ok, then you can make a better data structure for that. i still don't
    get the conflict logic as it isn't spelled out.

    ME> Assuming I have initially $all_conflict->{$file} corresponding to
    ME> $r = [$pkgname1, $pkgname2], then I will try to build
    ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    ME> unique new reference.

    ME> I can use references as keys in hashes, right ?

    yes, you can do that.

    i need to grok the conflict logic. my gut says there is a much better
    structure out there and i trust it well.

    uri

     
    Uri Guttman, Apr 17, 2007
    #14
  15. Ilya Zakharevich

    [A complimentary Cc of this posting was sent to
    Uri Guttman
    <>], who wrote in article <>:
    > ME> Assuming I have initially $all_conflict->{$file} corresponding to
    > ME> $r = [$pkgname1, $pkgname2], then I will try to build
    > ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    > ME> unique new reference.


    > yes, you can do that.
    >
    > i need to grok the conflict logic. my gut says there is a much better
    > structure out there and i trust it well.


    Likewise here. To the OP: we MUST know what kind of information one
    must have AT THE END, and what is the "elementary event" which
    triggers update of the info. [Also, some stats of the state at the
    end are also useful.]

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 17, 2007
    #15
  16. Anno

    Anno Guest

    Ilya Zakharevich <> wrote in comp.lang.perl.misc:
    > [A complimentary Cc of this posting was sent to
    > Uri Guttman
    > <>], who wrote in article <>:
    > > ME> Assuming I have initially $all_conflict->{$file} corresponding to
    > > ME> $r = [$pkgname1, $pkgname2], then I will try to build
    > > ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    > > ME> unique new reference.

    >
    > > yes, you can do that.
    > >
    > > i need to grok the conflict logic. my gut says there is a much better
    > > structure out there and i trust it well.

    >
    > Likewise here.


    With lots of filenames and their common prefixes, a trie comes to mind.
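    [A minimal sketch of such a trie, editorial rather than Anno's: each
    directory level becomes a nested hash, and the "\0pkgs" slot name is an
    arbitrary choice that cannot collide with a path component.]

```perl
use strict;
use warnings;

my %trie;   # nested hashes, one level per directory component

# Record that $path belongs to $pkgname, sharing prefix storage.
sub add_path {
    my ($path, $pkgname) = @_;
    my $node = \%trie;
    $node = $node->{$_} ||= {} for grep { length } split m{/}, $path;
    push @{ $node->{"\0pkgs"} }, $pkgname;
}

# Return the packages registered for exactly this path (empty list if none).
sub packages_for {
    my ($path) = @_;
    my $node = \%trie;
    for (grep { length } split m{/}, $path) {
        return unless $node = $node->{$_};
    }
    return @{ $node->{"\0pkgs"} || [] };
}
```

    Common prefixes like /usr/local are then stored once each; whether that
    beats flat hash keys depends on per-hash overhead, which is what Marc
    ran into when splitting on frequent directories.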

    Anno
     
    Anno, Apr 17, 2007
    #16
  17. Marc Espie

    Marc Espie Guest

    In article <f028f6$2hll$>,
    Ilya Zakharevich <> wrote:
    >[A complimentary Cc of this posting was sent to
    >Uri Guttman
    ><>], who wrote in article <>:
    >> ME> Assuming I have initially $all_conflict->{$file} corresponding to
    >> ME> $r = [$pkgname1, $pkgname2], then I will try to build
    >> ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
    >> ME> unique new reference.

    >
    >> yes, you can do that.
    >>
    >> i need to grok the conflict logic. my gut says there is a much better
    >> structure out there and i trust it well.

    >
    >Likewise here. To the OP: we MUST know what kind of information one
    >must have AT THE END, and what is the "elementary event" which
    >triggers update of the info. [Also, some stats of the state at the
    >end are also useful.]


    Yeah, well... sorry about that. The point is that I've spent a lot of time
    looking at the problem, and trying to optimize things. I was mostly looking
    for insight into perl-specific memory issues I might have missed.

    It's not a vital problem, the current program works, but it would be nice
    to shave even more memory.

    As far as stats go, I have a locate(1) database built from the same data.
    The whole information is roughly 60M, uncompressed. Which translates to
    a tight locate database of 5.6M (unfortunately, it's useless for the
    problem at hand, since lookups are expensive). This means that each
    entry is about 85 bytes long. So I have a factor *3 for each entry when
    I store it as a perl hash entry...

    I've tried my best to extract the relevant info from a large program, with
    a huge dataset (+4G packages).

    If you do not see how to optimize things further, no hardship...
     
    Marc Espie, Apr 18, 2007
    #17
  18. Ilya Zakharevich

    [A complimentary Cc of this posting was sent to
    Marc Espie
    <>], who wrote in article <f04q9v$2a3p$>:
    > If you do not see how to optimize things further, no hardship...


    You still do not follow... We see tens of ways; but without knowing
    (exactly) what you do, and (approximately) how the data looks, we
    cannot choose between those which have a chance to work, and those
    which won't.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 19, 2007
    #18
