Is there any performance benefit to...

Derek Fountain

I've inherited a piece of code that does something like this:

sub func {
    my %hash = ();

    # ...lots of code that populates and uses the hash

    %hash = ();
}

There are hundreds of these functions and they are called millions of
times in a procedure which takes hours, sometimes days, to complete. The
guy who wrote it was obviously concerned about performance.

The question pertains to that resetting of the hash at the end of the
function. Does it do anything that exiting the function doesn't do? In
Java-land I've seen things like that to force the garbage collector to
jump in, but in Perl-land won't it just get optimised away? The hash
doesn't get that big, as far as I can see, and there's nothing unusual
in any of the hash processing code.

Normally I'd just remove the line (or ignore it) but since it appears in
every one of these functions, and since removing a useful optimisation
might add an hour or two to my runtime, I thought I'd ask.
 
Derek Fountain

Bernard said:
Derek Fountain said:
I've inherited a piece of code that does something like this:

sub func {
my %hash = ();

...lots of code that populates and uses the hash

%hash = ();
}
[...]

Normally I'd just remove the line (or ignore it) but since it
appears in every one of these functions, and since removing a
useful optimisation might add an hour or two to my runtime, I
thought I'd ask.


Why don't you try running it without the line and seeing if there's a
benefit? Sheesh.

Because, as I said in the part you conveniently chopped out of the quote
above, this code takes many hours - maybe days - to run.

Sheesh.
 
Sisyphus

Derek Fountain said:
Bernard said:
Derek Fountain said:
I've inherited a piece of code that does something like this:

sub func {
my %hash = ();

...lots of code that populates and uses the hash

%hash = ();
}
[...]

Normally I'd just remove the line (or ignore it) but since it
appears in every one of these functions, and since removing a
useful optimisation might add an hour or two to my runtime, I
thought I'd ask.


Why don't you try running it without the line and seeing if there's a
benefit? Sheesh.

Because, as I said in the part you conveniently chopped out of the quote
above, this code takes many hours - maybe days - to run.

Sheesh.

Weeeeeeeelllllllllllll .... let's put it this way .... take a chance .... be
bold .... and do as Bernard suggested :)

You'll have to make allowances for me as I've had a little more bourbon than
is generally agreed to be "healthy" .... but I would have re-phrased
Bernard's (rhetorical) question as "Why don't you try running it without
the line and seeing if there's *not* a
benefit? Sheesh."

Weeeeeeeelllllllllllll .... in truth, I'm far too much of a snag to ever be
guilty of using the word "Sheesh". Instead, let me just say that what you
considered might be a "useful optimisation" is neither useful, nor an
optimisation :)

The only question that now remains to be answered is "Just how much Bourbon
have I, in fact, consumed ?" I don't know the answer .... though I know very
well that the last few sentences have taken an inordinately long time to
type.

Cheers,
Rob
 
xhoster

Derek Fountain said:
Because, as I said in the part you conveniently chopped out of the quote
above, this code takes many hours - maybe days - to run.

Hours? Who cares. Days? Well, then run it on a smaller sub-set of the
problem. Sheesh.

Xho
 
xhoster

Derek Fountain said:
I've inherited a piece of code that does something like this:

sub func {
my %hash = ();

...lots of code that populates and uses the hash

Does any of the code pass a reference to that hash to other lexical scopes,
which might then hang onto the reference? Does any of the code make closures
over the hash which are passed out of the lexical scope?

%hash = ();
}

There are hundreds of these functions and they are called millions of
times in a procedure which takes hours, sometimes days, to complete. The
guy who wrote it was obviously concerned about performance.

The question pertains to that resetting of the hash at the end of
function. Does it do anything that exiting the function doesn't do?

It is possible it does, depending on your answers to the questions above.

In Java-land I've seen things like that to force the garbage collector to
jump in, but in Perl-land won't it just get optimised away?

As far as I can tell, perl does very little optimization of this
nature. (In fact, in this case it can't safely optimize it away, because
AFAIK it does no flow analysis, so it doesn't know whether references to the
hash have been passed out of the sub or not.)
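
A minimal sketch of the closure case mentioned above, assuming a made-up sub (make_lookup and its $clear flag are hypothetical and not from the code under discussion). It is only meant to show that once a closure over the hash escapes, clearing the hash at the end of the sub has a visible effect, so perl could not simply drop the statement:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sub: returns a closure over its lexical %hash,
# optionally clearing the hash just before returning.
sub make_lookup {
    my ($clear) = @_;
    my %hash = (answer => 42);
    my $lookup = sub { my ($key) = @_; return $hash{$key} };
    %hash = () if $clear;    # the statement in question
    return $lookup;
}

# The closure keeps %hash alive after make_lookup() has returned.
my $kept    = make_lookup(0)->('answer');
my $cleared = make_lookup(1)->('answer');

print "without clear: ", (defined $kept    ? $kept    : 'undef'), "\n";
print "with clear:    ", (defined $cleared ? $cleared : 'undef'), "\n";
```

If nothing in a sub hands out a reference or closure like this, the clear has no such side effect and only its cost remains.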

The hash doesn't get that big, as far as I can see, and there's nothing
unusual in any of the hash processing code.

Normally I'd just remove the line (or ignore it) but since it appears in
every one of these functions, and since removing a useful optimisation
might add an hour or two to my runtime, I thought I'd ask.

Benchmark it and see. If you don't trust your own benchmark to adequately
reflect the reality of your program, then you surely shouldn't trust the
say-so of people who have never actually seen the code.

Xho
 
Derek Fountain

Hours? Who cares. Days? Well, then run it on a smaller sub-set of the
problem. Sheesh.

Maybe I've been in PHP-land too long, but when I used to frequent this
newsgroup a couple of years ago I found it invariably helpful. People
who knew the answers to on-topic questions would offer advice, and those
who didn't know the answers would keep quiet. People who wanted to learn
things about Perl would ask questions, and no one got shot down for
asking. Asking and learning always used to be the point of the newsgroup.
Seems things have changed.

For the record, I'm working on a cluster of 256 dual and quad processor
Intel machines, with several thousand hard drives - well over a million
quid's worth of hardware. It takes 4 of us working full time to drive this
thing, and running a testcase on it involves hours or days of
preparation. Changing the data to a smaller subset, then reconfiguring
the cluster to expect a new data set would take weeks of work. I guess
the smart-arses who like to say "just try it" don't have a whole lot of
experience of computing problems on this scale.

Trust me, sometimes it is simpler to just ask.
 
xhoster

Derek Fountain said:
Maybe I've been in PHP-land too long, but when I used to frequent this
newsgroup a couple of years ago I found it invariably helpful. People
who knew the answers to on-topic questions would offer advice, and those
who didn't know the answers would keep quiet. People who wanted to learn
things about Perl would ask questions, and no one got shot down for
asking. Asking and learning always used to be the point of the newsgroup.
Seems things have changed.

For the record, I'm working on a cluster of 256 dual and quad processor
Intel machines, with several thousand hard drives - well over a million
quid's worth of hardware.

Cool. If it is worth doing all that, then it is worth doing right, is it
not?

It takes 4 of us working full time to drive this
thing, and running a testcase on it involves hours or days of
preparation.

Frankly, I don't find this very impressive. I have a set-up only slightly
smaller than that, and it doesn't take 4 people working full time to drive
it and it doesn't take me hours or days of preparation to run a test case.

Changing the data to a smaller subset, then reconfiguring
the cluster to expect a new data set would take weeks of work.

Then you are probably doing it wrong. Maybe we could help you. If you let
us.

I guess
the smart-arses who like to say "just try it" don't have a whole lot of
experience of computing problems on this scale.

You guess wrong.

BTW, I notice you haven't answered the questions which I asked in
an effort to help you help me to help you. I guess pissing and moaning is
more important to you than actually getting the answer you claim you want.

Trust me, sometimes it is simpler to just ask.

Since you don't seem to want to listen to the answers you get, it would be
even simpler to just do nothing.

In the meantime, I've run a benchmark showing that including the %hash=()
at the end of the subroutine is slightly but reliably slower. On my
computer. With my version of Perl. With my "filler" code in the subroutine.
In the milieu of competing CPU and memory demands that exists on my system.
It didn't take me a week of preparation time to do this. If you think this
result transfers to your situation, then bon appétit.

Xho
 
jl_post

Derek said:
I've inherited a piece of code that does something like this:

sub func {
my %hash = ();

...lots of code that populates and uses the hash

%hash = ();
}

There are hundreds of these functions and they are called
millions of times in a procedure which takes hours,
sometimes days, to complete. The guy who wrote it was
obviously concerned about performance.

The question pertains to that resetting of the hash at
the end of function. Does it do anything that exiting
the function doesn't do?


Dear Derek,

Out of curiosity, does the code you inherited use "strict" and
"warnings"? The reason I ask this is because I've inherited lots of
code written by coders who didn't feel the need to add those pragmas,
and their code is often filled with portions of code that don't make
sense (for example, escaping (or back-slashing) the '.' and '_'
characters inside a string (not regular expression) and "resetting" an
array with something like "@array = 0;" (which does not empty the
array, by the way)).
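
To illustrate the "@array = 0;" point, here is a small sketch (the array contents and variable names are made up for the example). Assigning 0 assigns the one-element list (0), whereas assigning the empty list is what actually empties the array:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data, just for illustration.
my @array = (1, 2, 3);

@array = 0;                        # assigns the one-element list (0)
my $after_zero = scalar @array;    # number of elements left

@array = ();                       # assigning the empty list empties it
my $after_empty = scalar @array;

print "after '\@array = 0;'  the array has $after_zero element(s)\n";
print "after '\@array = ();' the array has $after_empty element(s)\n";
```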

When asked about the code, the original coder will usually say
something along the lines of doing it to be safe and/or efficient.
When I mention that one of the best ways to make code more safe (and
less error-prone) is to use the "strict" and "warnings" pragmas, the
response is something like, "I would have done that if I had the time.
But don't touch it because I know it works."

So setting %hash to an empty list at the end of its scope shouldn't
affect anything (and I'm guessing it won't make anything run any
faster, either, but you'd really have to benchmark your code (or a
reasonable mock-up) to be sure).

Since it is being set to "()" at the end of the function, it looks
like it will go out of scope right away, but it's possible it won't if
a reference to %hash has been passed to another area of the program.
If a reference has been passed out to another section of code, the
%hash will live on without being destroyed at the end of the function.
In that case, "%hash = ();" empties the one hash that ALL of those
references point to.
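
A rough sketch of that situation, assuming a hypothetical func() that stores a reference to its hash in an outer array (@escaped and the $clear flag are invented for the example and are not taken from Derek's code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: references to each call's %hash are kept out here.
my @escaped;

sub func {
    my ($clear) = @_;
    my %hash = (a => 1, b => 2);
    push @escaped, \%hash;    # a reference escapes the sub
    %hash = () if $clear;     # empties the hash the reference points to
    return;
}

func(0);    # hash survives with its contents
func(1);    # hash survives, but emptied

my $kept_keys    = scalar keys %{ $escaped[0] };
my $cleared_keys = scalar keys %{ $escaped[1] };

print "keys via reference, no clear:   $kept_keys\n";
print "keys via reference, with clear: $cleared_keys\n";
```

So a quick search for \%hash (or closures over the hash) in each function should show whether the clear can have any effect beyond its own cost.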

It's hard to say without seeing your code, but I'm guessing that no
references to %hash were passed out, otherwise, it would be rather
strange to hand out a reference that refers to an empty hash. But
maybe that function returns early and doesn't always reach that line.
It's especially hard to say what the original programmer meant as he
didn't appear to leave any comments explaining his intent.

If you can still reach the original programmer, I'd recommend asking
him what his intent was; if you can't, review your code for any \%hash
references. And to be honest, if that line truly is unnecessary, I
don't think removing it will make any noticeable difference in
execution speed, as input and output operations usually take orders
of magnitude more time than just clearing out a %hash (which would
probably be done along the way sometime anyway, with or without that
line of code).

I hope this helps, Derek.

-- Jean-Luc
 
Derek Fountain

Dear Derek,
<snip useful thoughts!>

Thanks for the input. The code I'm looking at uses warnings and strict,
and doesn't do anything clever with references. The fact no one has
jumped up and said "oh yes, that's useful because..." suggests to me
that clearing a hash before exiting a subroutine is not an optimisation
and the original coder was, as you say, guessing from a position of
ignorance. However...

So setting %hash to an empty list at the end of its scope shouldn't
affect anything (and I'm guessing it won't make anything run any
faster, either, but you'd really have to benchmark your code (or a
reasonable mock-up) to be sure).

I tried the following benchmark, which is not really analogous to my
code other than the fact that it uses a large hash in a subroutine:

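#!/usr/bin/perl -w
use strict;

sub f {

    my %h = ();

    # Populate the hash with just over a million entries.
    foreach my $i (0..1000000) { $h{$i}=$i }

    # Touch every key once.
    foreach my $k (keys %h ) { $h{0} = $k }

    # %h = ();    # uncommented for the second set of runs
}

foreach (0..10) { f() }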


I ran this 5 times on an otherwise idle dual CPU Linux box and it
averaged about 54 seconds per run. I then uncommented the line at the
end of the subroutine and tried another 5 runs. Now it averaged 46
seconds per run. I tried several other benchmarks and couldn't find
another one that shows such an optimisation. I just stumbled across this
one and don't know why it runs faster when that hash is cleared before
the subroutine exits.

So clearly, under some circumstances, it is a valid optimisation. I have
no idea whether it applies to my real data set on my cluster, and as
explained it's not easy to try it out. So for now the line stays in and
I'd better get on with what I'm supposed to be doing!

I'm deeply curious though...
 
Jürgen Exner

Derek said:
I ran this 5 times on an otherwise idle dual CPU Linux box and it
averaged about 54 seconds per run. I then uncommented the line at the
end of the subroutine and tried another 5 runs. Now it averaged 46
seconds per run.

You are aware of the Benchmark module, aren't you?

jue
 
xhoster

Jürgen Exner said:
You are aware of the Benchmark module, aren't you?

I am. But I am also aware of the unix/linux "time" command, and I
usually find it more convenient, and more reliable.

Xho
 
Peter Scott

I've inherited a piece of code that does something like this:

sub func {
my %hash = ();

...lots of code that populates and uses the hash

%hash = ();
}

There are hundreds of these functions and they are called millions of
times in a procedure which takes hours, sometimes days, to complete. The
guy who wrote it was obviously concerned about performance.

Have you asked him?

I would have said it was pointless myself, and if anything likely to
increase running time, but having written before that even the best
programmers are lousy at guessing about performance optimizations, I
didn't guess:

[peter@tweety ~]$ cat /tmp/foo
#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw(cmpthese);

cmpthese(20,
    { clear    => sub { my %h = map {$_,$_} 1..1E5; %h = () },
      no_clear => sub { my %h = map {$_,$_} 1..1E5 }
    }
);
[peter@tweety ~]$ /tmp/foo
         s/iter no_clear    clear
no_clear   2.29       --      -3%
clear      2.22       3%       --

Go figure.

[peter@tweety ~]$ perl -v
This is perl, v5.8.5 built for i386-linux-thread-multi
 
Eric J. Roode

I've inherited a piece of code that does something like this:

sub func {
my %hash = ();

...lots of code that populates and uses the hash

%hash = ();
}

There are hundreds of these functions and they are called millions of
times in a procedure which takes hours, sometimes days, to complete. The
guy who wrote it was obviously concerned about performance.

The question pertains to that resetting of the hash at the end of
function. Does it do anything that exiting the function doesn't do? In
Java-land I've seen things like that to force the garbage collector to
jump in, but in Perl-land won't it just get optimised away? The hash
doesn't get that big, as far as I can see, and there's nothing unusual
in any of the hash processing code.

Normally I'd just remove the line (or ignore it) but since it appears in
every one of these functions, and since removing a useful optimisation
might add an hour or two to my runtime, I thought I'd ask.

You are speculating, and asking others to speculate, about performance
and optimization. This is almost always a Bad Idea.

Here's my answer, for what it's worth (not much): %hash = (), either
at the start (initialization) or end (cleanup) of a function, will
add no performance and subtract a tiny fraction of performance.

I'm a pretty smart guy, and I've been programming in Perl for ten
years, but that previous paragraph is almost certainly useless. Why?
Because it's a gut-feel answer, without knowledge of the rest of
your system, the rest of your code, and what interactions there may
be.

What you appear to be doing is making some educated guesses about
what is slow in your program, then changing them, then seeing if
things run faster. This is a very poor optimization technique. What
you must, MUST do is to *measure* the performance of your code. Then
see what parts of the code are taking up the lion's share of the
time, and work on optimizing THOSE parts. For all you or I know,
%hash = () might be very slow, but the portion of the program in
which it appears has far slower bits. So if you remove the %hash =
() lines, you'll have improved the speed, but not nearly as much as
if you had spent your time working on the real problems.

Profile your code. Do a production run with Devel::DProf or
Devel::SmallProf. Find out what portions of the code are taking up
all the time, and prioritize them. THEN come and ask us why this
function or that statement or whatever is taking so long.

Don't guess, and don't ask us to guess. Software engineering is not
about guesses.

--
Eric
`$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
$!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
$_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`
 
