looking for some size optimization

Marc Espie

I'm looking at a script that handles a huge amount of data... basically,
the filenames from +4000 packages in order to recognize conflicts.

Currently, it builds a big hash through a loop that constructs
$all_conflict like this:

my $file= File::Spec->canonpath($self->fullname());
push ${$all_conflict->{$file}}, $pkgname;


I end up with a hash of 250M.

I can't really use Devel::Size with any benefit, since all the data
is one single chunk (all the other data in the program amount to <2~3M)

I expect the $pkgname strings to be shared. In fact, I tried replacing
$pkgname with \$pkgname, with negative size results (+20M).

I've tried splitting the path along `frequent' directories, with inconclusive
gains (most of the files live under /usr/local, /etc, or /var/www):
-20M at most.

I'm looking for bright ideas to try and reduce the size used... without
being too detrimental in terms of speed...
 
Mumia W.

I'm looking at a script that handles a huge amount of data... basically,
the filenames from +4000 packages in order to recognize conflicts.

Currently, it builds a big hash through a loop that constructs
$all_conflict like this:

my $file= File::Spec->canonpath($self->fullname());
push ${$all_conflict->{$file}}, $pkgname;


I end up with a hash of 250M.
[...]

If swapping is a problem, you could try using a tied database hash such
as DB_File or NDBM_File or SDBM_File or GDBM_File.

Perhaps try all of those to see which one performs best. If nothing
works, buy more ram :)
 
anno4000

Marc Espie said:
I'm looking at a script that handles a huge amount of data... basically,
the filenames from +4000 packages in order to recognize conflicts.

Currently, it builds a big hash through a loop that constructs
$all_conflict like this:

my $file= File::Spec->canonpath($self->fullname());
push ${$all_conflict->{$file}}, $pkgname;


I end up with a hash of 250M.

No, you end up with a syntax error, unless the second code line is
really

push @{$all_conflict->{$file}}, $pkgname;

I'll assume that, but please don't re-type code, copy/paste it to
make sure such errors don't happen.
I can't really use Devel::Size with any benefit, since all the data
is one single chunk (all the other data in the program amount to <2~3M)

I expect the $pkgname strings to be shared. In fact, I tried replacing

What makes you expect that? You're pushing unrelated strings on
arrays.
$pkgname with \$pkgname, with negative size results (+20M).

Sure. That way, you are storing a reference in addition to the strings.
I've tried splitting the path along `frequent' directories, with inconclusive
gains (most of the files live under /usr/local, /etc, or /var/www):
-20M at most.

That comes at the cost of additional (anonymous, but real) hashes.
I'm looking for bright ideas to try and reduce the size used... without
being too detrimental in terms of speed...

The best storage strategy often depends on the nature of the data.
In your case, I'll assume that actual conflicts are rare. That means
the majority of your arrays of package names contain only one element.
That is wasteful. You could use two hashes, one to detect if a file
name has been seen before and another to keep information about actual
conflicts. Here is some code. I assume that pairs of package names
and file names can be read from DATA:

my ( %seen, %conflicts);
while ( <DATA> ) {
    my ( $package, $file) = split;
    if ( exists $seen{ $file} ) {
        $conflicts{ $file} ||= [ $seen{ $file}];   # transfer first package
        push @{ $conflicts{ $file} }, $package;    # add another package
    } else {
        $seen{ $file} = $package;                  # mark as seen
    }
}

print "No conflicts\n" unless %conflicts;
print "$_ in ", join( ', ' => @{ $conflicts{ $_} }), "\n"
    for keys %conflicts;

This stores the first package name a file is seen with as a plain
string in %seen, not a one-element array. Only when a file is seen
a second time is an array built in %conflicts. If conflicts are
rare, this should save some memory.

Anno
 
xhoster

I'm looking at a script that handles a huge amount of data... basically,
the filenames from +4000 packages in order to recognize conflicts.

How many more than 4000 do you have? To take up the amount of room you
are talking about, it seems you would need orders of magnitude more
than 4000. But maybe I don't understand the nature of the data. What does
a conflict end up being represented as?
Currently, it builds a big hash through a loop that constructs
$all_conflict like this:

my $file= File::Spec->canonpath($self->fullname());
push ${$all_conflict->{$file}}, $pkgname;

Is $pkgname the name of the package declared in $file, or the name
of the package used in $file? In any event, are you pushing the same
$pkgname onto the same file's list over and over? Maybe you should use
a hash of hashes instead of a hash of arrays:

$all_conflict->{$file}{$pkgname} = undef;
I end up with a hash of 250M.

I can't really use Devel::Size with any benefit, since all the data
is one single chunk (all the other data in the program amount to <2~3M)

I expect the $pkgname strings to be shared.

Hash keys are shared (with substantial overhead) but array values are not.
If your paths are really so long, you might try compressing them, or
digesting them.
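Digesting might look like this (a sketch; register() and the sample paths are made up for illustration, and a 16-byte binary MD5 makes accidental key collisions astronomically unlikely):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my %all_conflict;

# Key the conflict hash on a fixed-size 16-byte binary MD5 of the
# canonical path instead of the path itself.  Long paths then cost
# 16 bytes apiece as keys; the trade-off is that the original path
# is no longer recoverable from the key, so it must be kept
# elsewhere (or only for files that actually collide).
sub register {
    my ($file, $pkgname) = @_;
    push @{ $all_conflict{ md5($file) } }, $pkgname;
}

register('/usr/local/share/doc/foo/README', 'foo-1.0');
register('/usr/local/share/doc/foo/README', 'foo-gtk-1.0');
```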

....
I'm looking for bright ideas to try and reduce the size used... without
being too detrimental in terms of speed...

Without knowing more about the exact nature of the data, it is hard to say.
Optimization that involves changing the way a structure is built is often
driven by the way the structure is used, so showing that part may also be
useful.

Xho
 
Marc Espie

No, you end up with a syntax error, unless the second code line is
really

push @{$all_conflict->{$file}}, $pkgname;

I'll assume that, but please don't re-type code, copy/paste it to
make sure such errors don't happen.

Oops, sorry about that. I usually copy&paste code indeed, and I could
have sworn I did.
What makes you expect that? You're pushing unrelated strings on
arrays.

They're not necessarily unrelated.

I guess I'll have to share more code ;-) . I grab the full list
for each package, and populate the hash with that string like so:

$plist->visit('register', $filehash, $dephash, $plist->pkgname());

(not quoting my plist and visitor code, I assume you can figure out that
it will end up calling the above snippet for each item in the packing-list)

so I am certain that I call the registration routine with an identical
string for each item in a given package. I haven't looked too closely at
perl's internals, but I would have assumed it to share the string in such
a case ?
The best storage strategy often depends on the nature of the data.
In your case, I'll assume that actual conflicts are rare. That means
the majority of your arrays of package names contain only one element.
That is wasteful. You could use two hashes, one to detect if a file
name has been seen before and another to keep information about actual
conflicts. Here is some code. I assume that pairs of package names
and file names can be read from DATA:

my ( %seen, %conflicts);
while ( <DATA> ) {
    my ( $package, $file) = split;
    if ( exists $seen{ $file} ) {
        $conflicts{ $file} ||= [ $seen{ $file}];   # transfer first package
        push @{ $conflicts{ $file} }, $package;    # add another package
    } else {
        $seen{ $file} = $package;                  # mark as seen
    }
}

This looks like a good idea indeed.
I'm wondering if maybe I should build a cache of pkgname lists,
and try to share as many as I can...

anyways, I'll try your idea (and the corresponding one) and tell you whether
I see any improvement.

While I'm there, responding to the other persons: sorry, my first message
wasn't clear. I do not want to go off to disk to a dbm or a database.
As it stands, the program works. It's just that 250M is over the default
ulimit of the considered system, which means that people have to remember
to bump the limits before running it. There's also the fact that the current
set of packages is bound to grow, so if I can find a good idea to reduce
memory usage, that would be cool. But using external storage is not the
solution I'm looking for.
 
Marc Espie

my ( %seen, %conflicts);
while ( <DATA> ) {
    my ( $package, $file) = split;
    if ( exists $seen{ $file} ) {
        $conflicts{ $file} ||= [ $seen{ $file}];   # transfer first package
        push @{ $conflicts{ $file} }, $package;    # add another package
    } else {
        $seen{ $file} = $package;                  # mark as seen
    }
}

This looks like a good idea indeed.
I'm wondering if maybe I should build a cache of pkgname lists,
and try to share as many as I can...

Thanks for the insight. With a few lines of additional code, this does shrink
the process footprint from >250M to 190M. Very nice indeed.
Doesn't appear to be slower, the computation is IO-bound anyways (reading
packing-lists for gzip'ed archives).

The code now looks like:

my $file = File::Spec->canonpath($self->fullname());
if (exists $all_conflict->{$file}) {
    $list->{$all_conflict->{$file}}->{$pkgname} ||=
        [ @{$all_conflict->{$file}}, $pkgname ];
    $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
} else {
    $list->{$pkgname} ||= [$pkgname];
    $all_conflict->{$file} = $list->{$pkgname};
}

I may try to tweak it a bit further, but it's already a very nice improvement.
(if somebody has other ideas, I'm still game ;-) )

Anyways, thanks a lot ;-)

(that's the ports/infrastructure/packages/find-all-conflicts script used in
OpenBSD, btw)

I think I'm going to use an extra seen array as well... I need to do some
measurements, but I suspect most of my files don't appear in more than
one package, so scanning the conflicts hash later probably can be sped up
by a large amount.
 
Uri Guttman

ME> my $file= File::Spec->canonpath($self->fullname());
ME> if (exists $all_conflict->{$file}) {
ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
ME> [@{$all_conflict->{$file}}, $pkgname ];

that is slow as you copy the existing array back into another anon
array. and you have a lot of code redundancy all over this. you always
are assigning an anon array but autovivification will handle that for
you. put this before the if (and i am not even sure you need a
conditional there at all but i haven't followed the logic flow)

push @{$all_conflict->{$file}}, $pkgname;

that will replace the first line in both clauses and be much faster as
well (no wasteful extra copying). it might even save space as perl won't
be allocating and freeing as many buffers so less storage could be used.
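in isolation, the autovivification in question looks like this (a standalone sketch; the file and package names are made up):

```perl
use strict;
use warnings;

my %all_conflict;

# pushing through a dereference of a nonexistent hash slot makes perl
# autovivify an empty anonymous array behind the reference first --
# no explicit ||= [] or exists() check is needed.
push @{ $all_conflict{'/etc/foo.conf'} }, 'foo-1.0';
push @{ $all_conflict{'/etc/foo.conf'} }, 'bar-2.0';

print scalar @{ $all_conflict{'/etc/foo.conf'} }, "\n";   # prints 2
```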


ME> $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
ME> } else {
ME> $list->{$pkgname} ||= [$pkgname];
ME> $all_conflict->{$file} = $list->{$pkgname};
ME> }

as for the conflict hash, i am sure it can be reduced but i don't know
the logic. you change $all_conflict->{$file} after each push (your ||=
code) which makes no sense to me. maybe you should clearly explain the
data structure you want to get out of this. i have yet to see such an
explanation in this thread (or i am not awake yet). i can't see how 4000
entries of maybe a few hundred bytes each will use up 250MB (or even 190).

ME> (that's the ports/infrastructure/packages/find-all-conflicts
ME> script used in OpenBSD, btw)

any url to get that directly? if that is published code then my autoviv
fix will save tons of time for many users. that copy anon arrays to
themselves thing is massively bad code. for more on autovivification see
my article at http://sysarch.com/Perl/autoviv.txt.

knowing that the build code is poorly designed, now i am confident that
the data structure is also poorly designed and can be majorly optimized.

uri
 
Marc Espie

ME> my $file= File::Spec->canonpath($self->fullname());
ME> if (exists $all_conflict->{$file}) {
ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
ME> [@{$all_conflict->{$file}}, $pkgname ];

that is slow as you copy the existing array back into another anon
array. and you have a lot of code redundancy all over this. you always
are assigning an anon array but autovivification will handle that for
you. put this before the if (and i am not even sure you need a
conditional there at all but i haven't followed the logic flow)

push @{$all_conflict->{$file}}, $pkgname;

But still, the new structure takes less space than the old one.
The trick is that there is one single ref for each pkgname set now.

The simple one-line version will build one separate array for each file,
and that looks like the stuff that is gobbling memory.
as for the conflict hash, i am sure it can be reduced but i don't know
the logic. you change $all_conflict->{$file} after each push (your ||=
code) which makes no sense to me. maybe you should clearly explain the
data structure you want to get out of this. i have yet to see such an
explanation in this thread (or i am not awake yet). i can't see how 4000
entries of maybe a few hundred bytes each will use up 250MB (or even 190).

We are talking 4000 packages. Which contain about 700000 files, total.
ME> (that's the ports/infrastructure/packages/find-all-conflicts
ME> script used in OpenBSD, btw)

any url to get that directly? if that is published code then my autoviv
fix will save tons of time for many users. that copy anon arrays to
themselves thing is massively bad code. for more on autovivification see
my article at http://sysarch.com/Perl/autoviv.txt.
knowing that the build code is poorly designed, now i am confident that
the data structure is also poorly designed and can be majorly optimized.

Go ahead, knock yourself out...

You can get this through OpenBSD's cvsweb, from openbsd.org.
http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

the extra pm files live under
http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/pkg_add
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to Marc Espie]
ME> my $file= File::Spec->canonpath($self->fullname());
ME> if (exists $all_conflict->{$file}) {
ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
ME> [@{$all_conflict->{$file}}, $pkgname ];
We are talking 4000 packages. Which contain about 700000 files, total.

So you want a hash with 700000 entries? With 100b/entry overhead, it
will take at least about 70M + the size of the entries.

You did not explain the distribution of lengths of the arrays which
are the values of the hash. E.g., if you have a lot of singletons,
it would save space to store them directly in the hash (the price
being an extra check when adding a new entry).
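A sketch of that singleton check, as a single hash whose values are either a plain package name or an array ref once a collision appears (all names here are illustrative):

```perl
use strict;
use warnings;

# file => package name (plain string) for singletons,
# file => array ref only once a second package shows up.
my %pkgs_for;

sub add_file {
    my ($file, $pkgname) = @_;
    my $cur = $pkgs_for{$file};
    if (!defined $cur) {
        $pkgs_for{$file} = $pkgname;            # singleton: no array at all
    } elsif (ref $cur) {
        push @$cur, $pkgname;                   # already a conflict list
    } else {
        $pkgs_for{$file} = [ $cur, $pkgname ];  # promote on first collision
    }
}

add_file('/usr/local/bin/foo', 'foo-1.0');
add_file('/usr/local/bin/foo', 'foo-gtk-1.0');
add_file('/usr/local/bin/foo', 'foo-qt-1.0');
add_file('/etc/bar.conf',      'bar-2.0');
```

Consumers then check ref $pkgs_for{$file} to tell the two cases apart.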

You may also win by enumerating the packages, and storing the index:

my $idx = $seen{$pkgname};
$idx = $seen{$pkgname} = ++$max_pk_idx, $pk[$idx] = $pkgname
    unless defined $idx;
push @{$all_conflict->{$file}}, 0 + $idx;   # Cvt to "pure-number"

Also, take care not to stringify $all_conflict->{$file}[$n], copy it
to a scratch variable if you ever need the string value of it...
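A runnable version of the above, with the counter initialized so indices start at 0 and a decode step at report time (the file and package names are made up):

```perl
use strict;
use warnings;

my (%seen, @pk, $max_pk_idx, %all_conflict);
$max_pk_idx = -1;   # so the first index handed out is 0

sub add {
    my ($file, $pkgname) = @_;
    my $idx = $seen{$pkgname};
    # first sighting of this package: hand out a fresh index and
    # remember the name in @pk so it can be decoded later
    $idx = $seen{$pkgname} = ++$max_pk_idx, $pk[$idx] = $pkgname
        unless defined $idx;
    push @{ $all_conflict{$file} }, 0 + $idx;   # store a pure number
}

add('/usr/local/bin/foo', 'foo-1.0');
add('/usr/local/bin/foo', 'foo-gtk-1.0');

# decode indices back to names only when reporting:
my @names = map { $pk[$_] } @{ $all_conflict{'/usr/local/bin/foo'} };
print "@names\n";   # prints: foo-1.0 foo-gtk-1.0
```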

Hope this helps,
Ilya
 
Uri Guttman

ME> But still, the new structure takes less space than the old one.
ME> The trick is that there is one single ref for each pkgname set now.

that may be the case but the intermediate storage will be dramatically
reduced too with my code.

ME> The simple line overhead will build one separate entry for each file
ME> and this looks like it is the stuff gobbling memory.

i will take a look at that when i can.

ME> We are talking 4000 packages. Which contain about 700000 files, total.

i didn't see the 700k files part anywhere before. the dataset wasn't
defined for me.

ME> Go ahead, knock yourself out...

ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

is this your program for the bsd distro? or are you trying to improve it
for them?

uri
 
Uri Guttman

ME> Go ahead, knock yourself out...

ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

a quick scan shows many places to improve that script, notably in
speed. one of your change logs comments on a 5% speedup and i can easily
get more as you are still doing that anon array copy stuff.

it would help to have access to the package data and some functional
specs. i will still do a basic code review here in a day or so if the
time works out. i don't understand the logic or goals enough to do a
deeper review as in shrinking the data structures which is the big
issue.

do you have any descriptions of the package info and what defines a
package collision? this is something which should be in the
docs/pod/comments of the script.

uri
 
Marc Espie

any url to get that directly? if that is published code then my autoviv
fix will save tons of time for many users. that copy anon arrays to
themselves thing is massively bad code. for more on autovivification see
my article at http://sysarch.com/Perl/autoviv.txt.

knowing that the build code is poorly designed, now i am confident that
the data structure is also poorly design and can be majorly optimized.

Please explain to me how this is massively bad code. I know about
auto-vivification. I don't see any unintended auto-vivification in my
data structure.

Also, when you talk about copying anon arrays to themselves, I assume you're
referring to code like:
$pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
    [ @{$all_conflict->{$file}}, $pkgname ];
$all_conflict->{$file} =
    $pkg_list->{$all_conflict->{$file}}->{$pkgname};

but it doesn't intend to copy anon arrays to themselves. The idea
is that for each set of pkgname1, pkgname2, pkgname3, I want to
have one single ref to an array with [pkgname1, pkgname2, pkgname3].

Assuming I have initially $all_conflict->{$file} corresponding to
$r = [$pkgname1, $pkgname2], then I will try to build
$pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
unique new reference.

I can use references as keys in hashes, right ?
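One caveat worth spelling out here: a reference used as a hash key is stringified to its address, so it works as an identity key but cannot be dereferenced back (Tie::RefHash exists when you need the actual ref back), and the scheme is only safe while the referenced array stays alive, since a freed ref's address can be reused. A sketch, with names from the post above:

```perl
use strict;
use warnings;

my $r = [ 'pkgname1', 'pkgname2' ];

my %pkg_list;
# $r is stringified to something like "ARRAY(0x55d3...)" when used
# as a key; that is fine for identity, as in the scheme above, where
# the real ref is kept alive as a value elsewhere.
$pkg_list{$r}{'pkgname3'} = [ @$r, 'pkgname3' ];

my ($key) = keys %pkg_list;
print ref $key ? "still a ref\n" : "plain string now\n";   # plain string now
```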
 
Marc Espie

it would help to have access to the package data and some functional
specs. i will still do a basic code review here in a day or so if the
time works out. i don't understand the logic or goals enough to do a
deeper review as in shrinking the data structures which is the big
issue.

It's probably impractical to give you access to the package data. A full
package build is slightly over 5G in size these days....
do you have any descriptions of the package info and what defines a
package collision? this is something which should be in the
docs/pod/comments of the script.

The documentation is scattered in other parts of the tools. Basic conflict
information is documented in OpenBSD::pkgCfl(3p)
The packing-list format is partly documented in pkg_create(1), partly
in OpenBSD::packingElement(3p). The format of package specifications is
explained in package_specs(7) (you can get all this through the cvsman
interface in OpenBSD).

These are somewhat big tools, and still evolving. The part that's more or
less stable is documented...

As far as find-all-conflicts goes, there are two passes: one that
builds a correspondence between each file-like object in all packages
and the corresponding pkgnames that contain this object, and a second
pass that checks for each resulting list of pkgnames whether there is
an actual collision. Some packages cannot be installed at the same time,
as they conflict (they usually are variant versions of the same software),
so it's okay for the same file to be present in both packages. The situation
is slightly more complex, because each package actually pulls a dependency
tree, so we have to look at the closure of packages through dependencies:
even if a file belongs to 2 packages that do not conflict directly, there
might be a conflict deeper in the dependency chain, which results in both
packages not being installable simultaneously anyways.

In the end, the script reports the (very small) number of files which happen
to collide, but that no-one is yet aware of. As a result, we decide whether
we want to change the file names, or whether the conflict should be
explicitly registered...
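A heavily simplified sketch of that second pass; the %known table and registered_conflict() predicate are stand-ins for the real OpenBSD::pkgCfl machinery, and the dependency-closure check is omitted entirely:

```perl
use strict;
use warnings;

# packages already registered as conflicting (e.g. variants of the
# same software); sharing a file is fine for those.  Stand-in data.
my %known = ( 'foo-1.0|foo-gtk-1.0' => 1 );

sub registered_conflict {
    my ($p, $q) = @_;
    return $known{ join '|', sort $p, $q };
}

# file => list of packages containing it (output of the first pass)
my %conflicts = (
    '/usr/local/bin/foo' => [ 'foo-1.0', 'foo-gtk-1.0' ],  # expected overlap
    '/usr/local/bin/baz' => [ 'baz-1.0', 'quux-2.0' ],     # real collision
);

# report only the package pairs nobody has registered yet
my @report;
for my $file (sort keys %conflicts) {
    my @pkgs = @{ $conflicts{$file} };
    for my $i (0 .. $#pkgs - 1) {
        for my $j ($i + 1 .. $#pkgs) {
            push @report, "$file: $pkgs[$i] vs $pkgs[$j]"
                unless registered_conflict($pkgs[$i], $pkgs[$j]);
        }
    }
}
print "$_\n" for @report;
```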
 
Uri Guttman

ME> Please explain to me how this is massively bad code. I know about
ME> auto-vivification. I don't see any unintended auto-vivification in my
ME> data structure.

it doesn't autovivify at all is my point.

ME> Also, when you talk about copying anon arrays to themselves, I assume you're
ME> referring to code like:
ME> $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
ME> [@{$all_conflict->{$file}}, $pkgname ];
ME> $all_conflict->{$file} =
ME> $pkg_list->{$all_conflict->{$file}}->{$pkgname};

ME> but it doesn't intend to copy anon arrays to themselves. The idea
ME> is that for each set of pkgname1, pkgname2, pkgname3, I want to
ME> have one single ref to an array with [pkgname1, pkgname2, pkgname3].

ok, then you can make a better data structure for that. i still don't
get the conflict logic as it isn't spelled out.

ME> Assuming I have initially $all_conflict->{$file} corresponding to
ME> $r = [$pkgname1, $pkgname2], then I will try to build
ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
ME> unique new reference.

ME> I can use references as keys in hashes, right ?

yes, you can do that.

i need to grok the conflict logic. my gut says there is a much better
structure out there and i trust it well.

uri
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to Uri Guttman]
ME> Assuming I have initially $all_conflict->{$file} corresponding to
ME> $r = [$pkgname1, $pkgname2], then I will try to build
ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
ME> unique new reference.
yes, you can do that.

i need to grok the conflict logic. my gut says there is a much better
structure out there and i trust it well.

Likewise here. To the OP: we MUST know what kind of information one
must have AT THE END, and what is the "elementary event" which
triggers update of the info. [Also, some stats of the state at the
end are also useful.]

Hope this helps,
Ilya
 
anno4000

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to Uri Guttman]
ME> Assuming I have initially $all_conflict->{$file} corresponding to
ME> $r = [$pkgname1, $pkgname2], then I will try to build
ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
ME> unique new reference.
yes, you can do that.

i need to grok the conflict logic. my gut says there is a much better
structure out there and i trust it well.

Likewise here.

With lots of filenames and their common prefixes, a trie comes to mind.
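A minimal sketch of such a path trie (the node layout and the reserved "\0pkgs" key are made up; whether it actually saves memory depends on per-hash overhead versus the shared prefixes, so it would need measuring on the real data):

```perl
use strict;
use warnings;

# One hash node per directory component; package names live under a
# reserved key at the leaf.  Common prefixes like /usr/local/share
# are then stored once instead of once per file.
my %trie;

sub trie_add {
    my ($path, $pkgname) = @_;
    my $node = \%trie;
    # walk/autovivify one node per non-empty path component
    $node = $node->{$_} ||= {} for grep length, split m{/}, $path;
    # "\0pkgs" cannot clash with a real path component
    push @{ $node->{"\0pkgs"} }, $pkgname;
}

trie_add('/usr/local/bin/foo', 'foo-1.0');
trie_add('/usr/local/bin/foo', 'foo-gtk-1.0');
trie_add('/usr/local/bin/bar', 'bar-2.0');
```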

Anno
 
Marc Espie

[A complimentary Cc of this posting was sent to Uri Guttman]
ME> Assuming I have initially $all_conflict->{$file} corresponding to
ME> $r = [$pkgname1, $pkgname2], then I will try to build
ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
ME> unique new reference.
yes, you can do that.

i need to grok the conflict logic. my gut says there is a much better
structure out there and i trust it well.

Likewise here. To the OP: we MUST know what kind of information one
must have AT THE END, and what is the "elementary event" which
triggers update of the info. [Also, some stats of the state at the
end are also useful.]

Yeah, well... sorry about that. The point is that I've spent a lot of time
looking at the problem, and trying to optimize things. I was more looking
into some insight as to perl-specific memory issues I might have missed.

It's not a vital problem, the current program works, but it would be nice
to shave even more memory.

As far as stats go, I have a locate(1) database built from the same data.
The whole information is roughly 60M uncompressed, which translates to
a tight locate database of 5.6M (unfortunately, it's useless for the
problem at hand, since lookups are expensive). This means that each
entry is about 85 bytes long. So I have a factor of about 3 for each
entry when I store it as a perl hash entry...

I've tried my best to extract the relevant info from a large program, with
a huge dataset (+4G packages).

If you do not see how to optimize things further, no hardship...
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to Marc Espie]
If you do not see how to optimize things further, no hardship...

You still do not follow... We see tens of ways; but without knowing
(exactly) what you do, and (approximately) what the data looks like, we
cannot choose between those which have a chance to work and those
which won't.

Hope this helps,
Ilya
 
