Increase WinXP/jre CPU usage?

P

Patricia Shanahan

Chris said:
Well, the generally reported figure is in that ball-park.

As for explaining it, I should first warn you that I'm not especially
knowledgeable about hardware/chip design, and I'm also relying on a (possibly
faulty) memory, so take all of the following with the usual pinch of salt, and
verify (or refute) it for yourself before depending on it.

That said, my understanding is that, although the Intel HT stuff duplicates
enough registers to allow two independent execution streams, it does /not/
duplicate the ALUs, or the instruction decode pipeline. So the actual
processing power is shared between the two threads, or the two "CPU"s running
them. That means that the HT architecture only provides a benefit when one
thread is stalled on a cache read, or otherwise has nothing in its instruction
pipeline, /and/ the other thread /does/ have all the data and decoded
instructions necessary to proceed. Since the two threads are competing for the
cache space (and in any case most programs spend a lot of time stalled one way
or another) that doesn't happen too very often.

There /are/ programs which benefit usefully from HT, but the general experience
seems to be that they are not common. The ideal case (I think) would be when
the two threads were executing the same (fairly small) section of code and (not
too big) section of data (so the instruction pipeline and cache would serve
both as well as the same circuitry could serve either one); and the mix of data
accesses is such that the interval between stalls for a cache refill is
approximately equal to the time taken for a cache refill. The less the actual
mix of instructions seen by each CPU resembles that case, the more the whole
system will degrade towards acting like one CPU time-sliced between the two
threads.

Note that, in the worst case, the cache behaviour of the two threads executing
at the same time may be worse than it would be if the same two threads were
time-sliced at coarse intervals by the OS but had the whole of the cache
available to each thread at a time.

I think it comes down to two questions:

1. When run as a single thread, does the job leave wasted cycles on the
table? That is, does it do significant amounts of waiting for cache
misses, mispredicted branches etc.? If the single thread really uses
almost all the instruction issue opportunities there is no gain.

2. Can the two thread share all the caches, branch predictors etc.
without getting in each other's way? That can happen either if they are
happy with the same cache contents or if they don't need much cache.

From this point of view, the two job test may have been unfair, because
two independent jobs are less likely to do a good job of cache sharing
than two threads in the same job.

Patricia
 
L

Luc The Perverse

Chris Uppal said:
Arne Vajhøj wrote:

[me:]
Intel claims HT = 1.3 CPU.

"Yeah, right"

;-)

I have seen code that do show the +30%.

Undoubtedly such code does exist. I'm only claiming that it's not the
norm.

It was no doubt a hand made assembly algorithm specifically designed to take
advantage of the hyperthreading limited abilities.
 
?

=?ISO-8859-1?Q?Arne_Vajh=F8j?=

Chris said:
Undoubtedly such code does exist. I'm only claiming that it's not the norm.

Single threaded code is probably the norm.

I would be surprised if a multi threaded app (multiple CPU intensive
threads ofcourse) did below +10%.

Especially if run with -server and enough execution of the code
to cause it to optimize hard.

The 30% is not unrealistic for that kind of app.

Arne
 
?

=?ISO-8859-1?Q?Arne_Vajh=F8j?=

Luc said:
Chris Uppal said:
Arne Vajhøj wrote:
[me:]
Or -- to put it another way -- the CPU usage reported by TaskManager is
misleading. It suggests that 50% of your available horse-power is
unused. My bet would be that it's more like 5% -- if not actually
zero.
I have seen code that do show the +30%.
Undoubtedly such code does exist. I'm only claiming that it's not the
norm.

It was no doubt a hand made assembly algorithm specifically designed to take
advantage of the hyperthreading limited abilities.

Nope.

Simple Java code.

Arne
 
S

Steve Brecher

Patricia Shanahan said:

OK, so I won't hope for improvement by multi-threading on my P4 with
"Hyper-Threading Technology."

But looking ahead to other hardware...

In a routine called from inner loops -- this routine is called 800 million
times in the timing test case I've been using -- I have something like this
(schematically):

for (int i = 0; i < n; i++) { //n is typically a single-digit value (min 2)
result = AStaticMethod(arg);
...
}

What would be the lowest-overhead way to multi-thread the executions of
AStaticMethod?
 
S

Steve Brecher

I said:
In a routine called from inner loops -- this routine is called 800
million times in the timing test case I've been using -- I have
something like this (schematically):

for (int i = 0; i < n; i++) { //n is typically a single-digit value
(min 2) result = AStaticMethod(arg);
...
}

What would be the lowest-overhead way to multi-thread the executions
of AStaticMethod?


P.S.: n becomes known at program startup.
 
P

Patricia Shanahan

Steve said:
Patricia Shanahan said:

OK, so I won't hope for improvement by multi-threading on my P4 with
"Hyper-Threading Technology."

But looking ahead to other hardware...

In a routine called from inner loops -- this routine is called 800 million
times in the timing test case I've been using -- I have something like this
(schematically):

for (int i = 0; i < n; i++) { //n is typically a single-digit value (min 2)
result = AStaticMethod(arg);
...
}

What would be the lowest-overhead way to multi-thread the executions of
AStaticMethod?


Rule #1 for optimizing loop nests, commonly followed by optimizing
compilers:

*** Examine the whole loop nest as a unit. ***

An innermost loop with small iteration count is not usually the best
place to begin optimization.

Are you using a "client" or "server" version of Java? My understanding
is that the "server" JVMs do more routine optimizations than the
"client" versions.

If you are not using a "server" JVM I would try that first. Never do for
yourself work the compiler can do for you.

Even without going multi-threaded, many loop nests can be made more
efficient by changing the order of the loops, loop unrolling etc.

Patricia
 
C

Chris Uppal

Steve said:
for (int i = 0; i < n; i++) { //n is typically a single-digit value (min
2) result = AStaticMethod(arg);
...
}


Adding to Patricia's observations.

Seeing the innermost loop of the code is not especially helpful for seeing
opportunities for parallelisation. If AStaticMethod() runs reasonably quickly
(as I'm sure it must if you execute it 8.0e8 times), then you certainly don't
want to be starting a new thread for each execution of that -- not even if you
mess around with thread pooling.

Also without knowing what sort of data-dependencies there are between
iterations, it is (as it were) if even more impossible to see what
parallelisation can be introduced.

Still, the structure of the example suggests that /maybe/ the app is something
like:

... outer loop definitionss...
{
for (int i = 0; i < n; i++)
{
result = AStaticMethod(arg);
}
}

with no dependencies between /any/ executions of AStaticMethod(). If so then
maybe you could turn your code inside-out so that there were n threads:


for (int i = 0; i < n; i++)
{
start thread which executes:
{
... outer loop definitions...
{
result = AStaticMethod(arg);
}
}
}

Although you would probably not want to share the result[] array between the
threads while they were actually executing.

-- chris
 
C

Chris Uppal

Patricia said:
2. Can the two thread share all the caches, branch predictors etc.
without getting in each other's way? That can happen either if they are
happy with the same cache contents or if they don't need much cache.

From this point of view, the two job test may have been unfair, because
two independent jobs are less likely to do a good job of cache sharing
than two threads in the same job.

Seems likely -- especially as the bulk of the code will have been generated
on-the-fly and so cannot be shared between the two processes (as far as the OS
knows) even though it should be functionally identical.

-- chris
 
S

Steve Brecher

Patricia Shanahan said:
Steve Brecher wrote: ....
Also, n becomes known at program startup, and its maximum value is a
compile-time constant (22), so any arrays of size<=n can be allocated at
startup.
result = AStaticMethod(arg); AStaticMethod returns a primitive type.
...
}

What would be the lowest-overhead way to multi-thread the executions
of AStaticMethod?


Rule #1 for optimizing loop nests, commonly followed by optimizing
compilers:

*** Examine the whole loop nest as a unit. ***

An innermost loop with small iteration count is not usually the best
place to begin optimization.

Are you using a "client" or "server" version of Java? My understanding
is that the "server" JVMs do more routine optimizations than the
"client" versions.

I'm using -server.
...
Even without going multi-threaded, many loop nests can be made more
efficient by changing the order of the loops, loop unrolling etc.

(For readers just joining us: I am a Java newbie.)

With respect and thanks, optimization is the not the issue. The code is a
port of long-standing C code that is highly optimized -- I'm very familiar
with optimization techniques; actually, that is my specialty. I'd like to
try multi-threading it.

The loops enumerate cases. For each case, there are "n" significant
computations, each accomplished in AStaticMethod (excuse the initial
upper-case :). Multi-threading the executions of that method would be a
very easy way to begin. Partitioning the enumeration of cases on the other
hand, would be difficult -- at this writing, I don't have a scheme to do
that.

So far my knowledge of Java multi-threading is based on rapid pass through
the relevant material in Flanagan's "Java in a Nutshell" and Sun's "The Java
Tutorials."

If possible, I would like a way to do the multi-threading that creates no
objects for each execution of AStaticMethod. Currently the code, after
startup, creates no objects and incurs no GC.
 
P

Patricia Shanahan

Steve Brecher wrote:
....
(For readers just joining us: I am a Java newbie.)

With respect and thanks, optimization is the not the issue. The code is a
port of long-standing C code that is highly optimized -- I'm very familiar
with optimization techniques; actually, that is my specialty. I'd like to
try multi-threading it.

The loops enumerate cases. For each case, there are "n" significant
computations, each accomplished in AStaticMethod (excuse the initial
upper-case :). Multi-threading the executions of that method would be a
very easy way to begin. Partitioning the enumeration of cases on the other
hand, would be difficult -- at this writing, I don't have a scheme to do
that.

So far my knowledge of Java multi-threading is based on rapid pass through
the relevant material in Flanagan's "Java in a Nutshell" and Sun's "The Java
Tutorials."

If possible, I would like a way to do the multi-threading that creates no
objects for each execution of AStaticMethod. Currently the code, after
startup, creates no objects and incurs no GC.

I don't know whether the books you have cover the java.util.concurrent
features added in JDK 1.5. You may find something useful there.

Patricia
 
P

Patrick May

I don't know the particulars of your problem, but you might want
to consider using a JavaSpace (introductory information is available
at http://www.jini.org/wiki/JavaSpaces_Specification). You can think
of a JavaSpace as a distributed shared memory. In the space-based
model, you would use the master-worker pattern to write tasks to the
space where they can be picked up by worker processes running in the
same JVM, different JVMs on the same machine, or JVMs running on
multiple machines. After processing, the workers write the result
back to the space where it is consolidated by the master.

GigaSpaces (http://www.gigaspaces.com) has a JavaSpace
implementation that allows the space to be embedded in the same JVM as
the master and workers, each of which is executing in a separate
thread. This gives you the simplicity and scalability of the space
approach with the speed of running in a single process.

Regards,

Patrick
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top