Re: Matrix performance question

Discussion in 'Java' started by Roedy Green, Nov 21, 2008.

  1. Roedy Green

    Roedy Green Guest

    On Thu, 20 Nov 2008 08:05:45 -0800, Patricia Shanahan <>
    wrote, quoted or indirectly quoted someone who said :

    >Any other ideas I should be considering? Any packages that would do this
    >operation fast?


    Other approaches:

    Transpose rows and columns and see if Java can optimise better (see
    the sketch below).

    Use Jet native compilation.
    see http://mindprod.com/jgloss/jet.html

    Write that code in C and hook to it with JNI. C has more efficient
    ways of storing matrices.

    write that code in assembler and hook to it with JNI

    mathapi http://mindprod.com/jgloss/mathapi.html
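
    A sketch of the transposed traversal, assuming an 8 x 10000 matrix
    like the one under discussion (names are mine):

    // Store the transpose (10000 x 8) and walk it so the inner loop
    // runs contiguously along each short row.
    static double[] multiplyTransposed(double[][] ut, double[] v) {
        double[] out = new double[ut[0].length];   // 8 outputs
        for (int j = 0; j < v.length; j++) {
            for (int i = 0; i < out.length; i++) {
                out[i] += ut[j][i] * v[j];
            }
        }
        return out;
    }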
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Your old road is
    Rapidly agin'.
    Please get out of the new one
    If you can't lend your hand
    For the times they are a-changin'.
     
    Roedy Green, Nov 21, 2008
    #1

  2. Tom Anderson

    Tom Anderson Guest

    On Fri, 21 Nov 2008, Eric Sosman wrote:

    > Roedy Green wrote:
    >> On Thu, 20 Nov 2008 08:05:45 -0800, Patricia Shanahan <>
    >> wrote, quoted or indirectly quoted someone who said :
    >>
    >>> Any other ideas I should be considering? Any packages that would do this
    >>> operation fast?

    >>
    >> Other approaches:
    >>
    >> Transpose rows and columns and see if Java can optimise better.

    >
    > Since she's using both the matrix and its transpose, she's got
    > to traverse it in both orientations anyhow. If Java does better
    > with one of them, it must do worse with the other.


    Maybe yes, maybe no. You have to think about the effect of the layout on
    memory and cache access patterns - does one more easily admit efficient
    use of the cache? I *think* that with Pat's array-of-arrays
    implementation, having the larger axis be contiguous is the right thing to
    do; with a packed array - one where you put all the rows (or columns) in
    one big array, and then do the index arithmetic yourself, which isn't hard
    - it doesn't matter either way, not with one of the dimensions that small.
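
    For concreteness, the index arithmetic is just this (a sketch,
    row-major shown; column-major is the mirror image):

    // Packed row-major: all rows of an m x n matrix concatenated into
    // one array, so element (i, j) lives at index i * n + j.
    // (Column-major would be j * m + i.)
    static double get(double[] data, int n, int i, int j) {
        return data[i * n + j];
    }
    static void set(double[] data, int n, int i, int j, double v) {
        data[i * n + j] = v;
    }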

    >> Write that code in C and hook to it with JNI. C has more efficient ways
    >> of storing matrices.

    >
    > Not *that* much more efficient.


    I don't believe C's multidimensional arrays are any more efficient than
    using a packed matrix in java. It boils down to the same data layout and
    sequence of calculations, neither of which is complicated, so i doubt C
    does it faster than java.

    > Besides, why re-invent wheels? If resorting to JNI, use it to get at
    > pre-written high-performance BLAS routines or the like. Vendors really
    > *do* optimize those things heavily, with knowledge of model-specific
    > caches, pipeline depths, pre-fetch strategies, and so on.


    s/vendors/Kazushige Goto/!

    I'd be interested to see a benchmark of the various leading BLAS
    implementations - current versions, with current compilers, on current
    hardware. I can't find anything more recent than 1995!

    tom

    --
    A paranoid-schizophrenic is a guy who just found out what's going on. --
    William S. Burroughs
     
    Tom Anderson, Nov 21, 2008
    #2

  3. Tom Anderson

    Tom Anderson Guest

    On Fri, 21 Nov 2008, Eric Sosman wrote:

    > Graham Jones wrote:
    >>
    >> "Patricia Shanahan" <> wrote in message
    >> news:...
    >>
    >> [...]
    >>> Thanks. I'm actually doing eight 10,000 element dot products followed by
    >>> eight 10,000 element daxpy-like operations. That does not affect the
    >>> number of floating point operations, but does change the memory access
    >>> pattern.
    >>>
    >>> [For anyone following this who is not familiar with BLAS, daxpy is
    >>> scalar constant times a vector plus a vector]

    >>
    >> Maybe worth storing U twice, as 8*10000 and 10000*8, and using only dot
    >> products, at least when you have several vectors for the same U.

    >
    > Watch out for the memory penalty, though.


    Moreover, the cache penalty. How big is Patricia's L2/3 cache? Probably
    0.5-4 MB? That's the size where the difference between working on a
    640 kB data structure (8 x 10,000 doubles at 8 bytes each) and two
    which total 1480 kB (per your calculations) is going to be very
    significant.
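
    In code, by the way, the daxpy quoted above is just this (a sketch):

    // daxpy: y = a*x + y - "scalar constant times a vector plus a
    // vector", as quoted above.
    static void daxpy(double a, double[] x, double[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] += a * x[i];
        }
    }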

    tom

    --
    Sorry. Went a bit Atari Teenage Riot there. -- Andrew
     
    Tom Anderson, Nov 21, 2008
    #3
  4. Tom Anderson

    Tom Anderson Guest

    On Fri, 21 Nov 2008, Patricia Shanahan wrote:

    > Patricia Shanahan wrote:
    >> Eric Sosman wrote:
    >>> Roedy Green wrote:

    > ...
    >>>> Write that code in C and hook to it with JNI. C has more efficient
    >>>> ways of storing matrices.
    >>>
    >>> Not *that* much more efficient. Besides, why re-invent wheels?
    >>> If resorting to JNI, use it to get at pre-written high-performance
    >>> BLAS routines or the like. Vendors really *do* optimize those things
    >>> heavily, with knowledge of model-specific caches, pipeline depths,
    >>> pre-fetch strategies, and so on.

    > ...
    >> My main concern with the mixed languages approaches is that this work is
    >> a lot of calls to a relatively short job. I will get some measurements
    >> on that. At least last time I tried it, JNI had a significant overhead
    >> in the transition between languages.

    >
    > In one of my performance test cases, the method is called 122,520 times,
    > taking a total of 434 seconds, mean time 3.5 milliseconds per call. What
    > do people think about the economics of an inter-language call for a 3.5
    > ms function?


    I believe the overhead for a JNI call - the call itself - is on the order
    of hundreds of instructions, which is hundreds of nanoseconds on a modern
    processor. That vanishes in comparison to 3.5 ms of computation.

    JNI's reputation for slowness comes, i believe, not from the overhead of
    calls, but from the overhead of manipulating java objects from the C side
    via the JNI interface. Those operations are also hundreds of instructions,
    but you do them hojillions of times per call, hence they really start to
    drag.

    > Each call allocates one double[10000].


    About that. The natural way to represent your arrays is as java arrays.
    You can get at arrays from JNI in bulk using GetDoubleArrayElements, but
    there's a chance that this might give you a copy rather than a direct
    pointer. That would work, but be very slow. I don't know what any
    production JVMs actually do. However, there is an alternative - NIO's
    DoubleBuffer. If you allocate them as direct buffers, then you can use
    JNI's GetDirectBufferAddress to get direct access to them. That is pretty
    much guaranteed to be the fastest way to get at an array from the C side.
    I believe access from the java side is still pretty fast, but i can't put
    a number on it.
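
    On the java side, allocation would look something like this (a sketch;
    the native method is a hypothetical stand-in):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.DoubleBuffer;

    // A direct buffer lives outside the java heap, so the C side can
    // get a stable pointer to it via GetDirectBufferAddress, with no
    // copying involved.
    static DoubleBuffer newVector(int length) {
        return ByteBuffer.allocateDirect(length * 8)   // 8 bytes per double
                         .order(ByteOrder.nativeOrder())
                         .asDoubleBuffer();
    }

    // Hypothetical native entry point that works on buffers in bulk:
    // static native void multiply(DoubleBuffer u, DoubleBuffer v,
    //                             DoubleBuffer result);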

    A related point is the use of arrays of arrays for matrices. This is
    convenient, but leads to a layout in memory that isn't native-friendly.
    Packing your matrix into a single array, and wrapping it in an object
    which does the index calculations for you, will give you a better layout
    (including better cache locality for operations in java) for minimal
    ergonomic loss. You will have to change things like:

    m[i][j] = m[i][0] + m[0][j] ;

    to:

    m.set(i, j, m.get(i, 0) + m.get(0, j)) ;

    Which is a pain, but is hopefully bearable.

    On the bright side, once you've got your matrices wrapped, you can change
    the implementation between arrays of arrays, packed arrays, buffers, or
    whatever else you can think of, with only localised changes to the code,
    which makes for convenient benchmarking of the alternatives.
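
    As a sketch of the wrapping (names made up by me, not Pat's code):

    // Hide the layout behind a tiny interface, then benchmark
    // interchangeable implementations behind it.
    interface Matrix {
        double get(int i, int j);
        void set(int i, int j, double v);
    }

    // One candidate: packed column-major, element (i, j) stored at
    // index j * rows + i of one contiguous array.
    class PackedColumnMajorMatrix implements Matrix {
        private final int rows;
        private final double[] data;
        PackedColumnMajorMatrix(int rows, int cols) {
            this.rows = rows;
            this.data = new double[rows * cols];
        }
        public double get(int i, int j) { return data[j * rows + i]; }
        public void set(int i, int j, double v) { data[j * rows + i] = v; }
    }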

    tom

    --
    Sorry. Went a bit Atari Teenage Riot there. -- Andrew
     
    Tom Anderson, Nov 21, 2008
    #4
  5. Tom Anderson

    Tom Anderson Guest

    On Sat, 22 Nov 2008, Graham Jones wrote:

    > "Tom Anderson" <> wrote in message
    > news:p...
    >> On Fri, 21 Nov 2008, Eric Sosman wrote:
    >>
    >>> Graham Jones wrote:
    >>>>
    >>>> "Patricia Shanahan" <> wrote in message
    >>>> news:...
    >>>>
    >>>> [...]
    >>>>> Thanks. I'm actually doing eight 10,000 element dot products followed by
    >>>>> eight 10,000 element daxpy-like operations. That does not affect the
    >>>>> number of floating point operations, but does change the memory access
    >>>>> pattern.
    >>>>>
    >>>>> [For anyone following this who is not familiar with BLAS, daxpy is
    >>>>> scalar constant times a vector plus a vector]
    >>>>
    >>>> Maybe worth storing U twice, as 8*10000 and 10000*8, and using only dot
    >>>> products, at least when you have several vectors for the same U.
    >>>
    >>> Watch out for the memory penalty, though.

    >>
    >> Moreover, the cache penalty. How big is Patricia's L2/3 cache? Probably 0.5
    >> - 4 MB? That's the size where the difference between working on a 640 kB
    >> data structure and two which total to 1480 kB (per your calculations) is
    >> going to be very significant.

    >
    > I think the dot product calculation is cache-friendly even when the vectors
    > are bigger than the cache. Anyway the code below seems to work about 6x
    > faster than Patricia reported, and making N ten times bigger and T ten times
    > smaller makes little odds.
    >
    > I'm really a C/C++ programmer who may be doing some numerical stuff in
    > Java soon. Have I done something silly or clever? ;-)


    You've done something very silly - you've contradicted me! For this, you
    will pay!!

    I've written an implementation of Patricia's function (as i understand
    it), with a timing driver and a bunch of matrix implementations:

    http://urchin.earth.li/~twic/Matrix_Layout/

    For a 10 000 x 8 matrix, doing 100 runs of 100 iterations each, i get the
    following results (times are in nanoseconds per iteration):

    Matrix Type                     min        median     95%        max
    RowMajorMatrix                  1087790.0  1096540.0  1146610.0  1165810.0
    ColumnMajorMatrix                593350.0   598450.0   641440.0   709700.0
    ParallelRowMajorMatrix           876650.0   922850.0   959030.0  1142850.0
    ParallelColumnMajorMatrix        366350.0   414790.0   428000.0   469190.0
    AntiParallelRowMajorMatrix       352050.0   409060.0   615760.0   670430.0
    AntiParallelColumnMajorMatrix    830310.0   833460.0   869010.0  1044430.0
    DualMatrix                       649780.0   726620.0   752110.0   807310.0
    PessimalDualMatrix               861620.0   866200.0   914710.0  1771520.0

    I'm running this code on a remote machine; i don't know how big its caches
    are, and i can't guarantee that it was unloaded, or equally loaded between
    tests.

    As to what these implementations are, well. The plain row- and
    column-major matrices should be obvious. The dual matrix is your idea, the
    pessimal dual is one which uses the two copies of the matrix the wrong way
    round. The parallel matrices, rather than computing each element of the
    output vector one after the other, compute them in parallel in one
    direction, to make the memory access pattern more linear. The antiparallel
    versions do the same, but to make them *less* linear. Or something. I'm
    pretty confused by the whole thing at this point.
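
    To give the flavour of the parallel idea, the dot-product half looks
    roughly like this (a simplified sketch, not the actual code at the
    URL above):

    // Instead of finishing each of the eight dot products in turn,
    // stream down the long axis once, updating all eight partial sums
    // together. With packed column-major storage - element (i, j) at
    // data[j * 8 + i] - the inner loop walks memory contiguously.
    static double[] multiply(double[] data, double[] v) {
        double[] acc = new double[8];
        for (int j = 0; j < v.length; j++) {
            int base = j * 8;
            for (int i = 0; i < 8; i++) {
                acc[i] += data[base + i] * v[j];
            }
        }
        return acc;
    }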

    So:

    Column-major beats row-major, as Patricia anticipated, and one would
    expect. Your dual matrix approach is faster than a row-major matrix, but a
    bit slower than a column-major matrix. Doing the right kind of
    parallelisation speeds up both the plain matrices to roughly the same
    speed, which is much faster than the dual matrix. I didn't try
    parallelising the dual matrix; i wouldn't have thought it was necessary,
    but i could be wrong.

    Please do report any errors, omissions, further results, etc. I've had
    enough of this for today!

    tom

    --
    Weird is Cool -- Naomi Clarke
     
    Tom Anderson, Nov 22, 2008
    #5
  6. Tom Anderson

    Tom Anderson Guest

    On Sun, 23 Nov 2008, Graham Jones wrote:

    > "Tom Anderson" <> wrote in message
    > news:p...
    >
    >> You've done something very silly - you've contradicted me! For this, you
    >> will pay!!
    >>
    >> I've written an implementation of Patricia's function (as i understand it),
    >> with a timing driver and a bunch of matrix implementations:
    >>
    >> http://urchin.earth.li/~twic/Matrix_Layout/
    >>
    >> For a 10 000 x 8 matrix, doing 100 runs of 100 iterations each, i get the
    >> following results (times are in nanoseconds per iteration):
    >>
    >> Matrix Type                     min        median     95%        max
    >> RowMajorMatrix                  1087790.0  1096540.0  1146610.0  1165810.0
    >> ColumnMajorMatrix                593350.0   598450.0   641440.0   709700.0
    >> ParallelRowMajorMatrix           876650.0   922850.0   959030.0  1142850.0
    >> ParallelColumnMajorMatrix        366350.0   414790.0   428000.0   469190.0
    >> AntiParallelRowMajorMatrix       352050.0   409060.0   615760.0   670430.0
    >> AntiParallelColumnMajorMatrix    830310.0   833460.0   869010.0  1044430.0
    >> DualMatrix                       649780.0   726620.0   752110.0   807310.0
    >> PessimalDualMatrix               861620.0   866200.0   914710.0  1771520.0

    >
    > I shall make sure I annoy you whenever I have a speed-critical routine
    > to write.


    *shakes fist*

    > But the big puzzle is why even the slowest versions take 1ms for
    > something Patricia said took 3.5ms.


    It's possible my computer's just a lot faster than hers. Or i
    misunderstood the calculation she's doing.

    Patricia was using arrays-of-arrays, so let's try that:

    UnpackedRowMajorMatrix 1408840.0 1456200.0 1763020.0 2343020.0
    UnpackedColumnMajorMatrix 519090.0 523810.0 542980.0 624270.0

    Interestingly, neither of these is as slow as 3.5 ms, and the
    column-major one is actually faster than the packed version! Let's try
    parallelising the cross-grain multiplication:
    multiplication:

    ParallelUnpackedColumnMajorMatrix 370980.0 376420.0 437200.0 500890.0

    Okay, so that's faster than the packed version too. Well, looks like my
    advice to use a packed matrix was pretty bad, then.

    Hmm. It seems hard to believe that my machine is an order of magnitude
    (and that's denary magnitude!) faster than hers. It's also hard to believe
    my code is an order of magnitude faster.

    Hang on, was her matrix 10 000 or 100 000 elements long?

    tom

    --
    The best way I know of to win an argument is to start by being in the
    right. -- Lord Hailsham
     
    Tom Anderson, Nov 23, 2008
    #6
  7. Tom Anderson

    Tom Anderson Guest

    On Sun, 23 Nov 2008, Patricia Shanahan wrote:

    > Tom Anderson wrote:
    > ...
    >> Hmm. It seems hard to believe that my machine is an order of magnitude (and
    >> that's denary magnitude!) faster than hers. It's also hard to believe my
    >> code is an order of magnitude faster.
    >>
    >> Hang on, was her matrix 10 000 or 100 000 elements long?

    > ...
    >
    > I've done a few checks, and found these issues:
    >
    > 1. My production arrays are up to 20,101 elements long.


    Going up to 20101 elements makes ParallelColumnMajorMatrix take a median
    time of 951450 ns.

    > 2. It is slightly faster, about 2ms, when run in the production
    > environment, -server and without profiling.
    >
    > 3. I am measuring it in the context of my program, not an isolated
    > benchmark. Since these calls are interleaved with other memory intensive
    > code, and the input vector is new each time, I may be getting more cache
    > misses.


    Good point - my code reuses the same input vector. If i modify it to use
    a fresh set of arrays each time (from a pre-prepared list, so they should
    be cache-cold), it comes out as 965280 ns, about the same. It's possible
    i'm not making things cache-cold enough.

    > 4. My code was very simple unoptimized code, pending a decision about
    > the most promising approaches to improve it.


    Fair enough. Please don't read any of what i've written as impugning your
    code or coding!

    > 5. There is a single post-processing pass over the input and output
    > vectors that is done after the main calculation.


    Okay, i don't have that. And i'm not even certain that the calculation i'm
    doing is the same as yours apart from that.

    > I'll do some more measurements to see which of these issues are important.


    If you get a chance, could you give my benchmark a run? There are some
    shell scripts which automate it, and it's not hard to do by hand.

    tom

    --
    Only the bagel has the correct aspect ratio.
     
    Tom Anderson, Nov 23, 2008
    #7
  8. Roedy Green

    Roedy Green Guest

    On Fri, 21 Nov 2008 07:19:49 -0800, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Other approaches:


    People love to argue about which approach will work better; however,
    the inner workings of CPUs, compilers, and operating systems are so
    complicated that any prediction is at best an educated guess informed
    by years of experience. Gone are the days when you could sit there
    with an adding machine toting up cycles to compare algorithms.

    If you have the time, and it really matters, you can perform all
    manner of experiments to see what actually works better.

    I once had a plum task, designing high voltage transmission lines.
    Back then they cost a million dollars a mile to build, and CPU time
    was extremely expensive relative to my time. Almost anything I could
    think of to improve the code had an enormous payback, so there was
    no question about trying out any idea.

    Today it is rare to get a task that justifies much in the way of
    tuning.

    The big problem with the JNI approach is the overhead of the JNI call.
    If you hop over to the C++ side, you want to do a substantial amount
    of work before you return to justify the cost of the call.

    The other thing is that Jet and Sun's -server HotSpot are so good now
    that there is a fair chance you won't get any benefit to speak of from
    coding in C++ or even assembler. You need to use some highly tuned,
    polished library code to make it worthwhile.

    This is not meant to discount any other advice you have received. Some
    of the best people are offering their thoughts.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    "Humanity is conducting an unintended, uncontrolled, globally pervasive experiment
    whose ultimate consequences could be second only to global nuclear war."
    ~ Environment Canada (The Canadian equivalent of the EPA on global warming)
     
    Roedy Green, Nov 24, 2008
    #8
  9. Tom Anderson

    Tom Anderson Guest

    On Sun, 23 Nov 2008, Graham Jones wrote:

    > "Tom Anderson" <> wrote in message
    > news:p...
    >
    >> If you get a chance, could you give my benchmark a run? There are some
    >> shell scripts which automate it, and it's not hard to do by hand.

    >
    > Shell scripts don't work too good on Vista...


    http://win-bash.sourceforge.net/
    http://gnuwin32.sourceforge.net/

    !

    > but I did as you asked of Patricia...
    >
    > I just changed the reporting to give the median time in ms. Ordinary
    > PC, Vista, duo CPU, 2.33 GHz, 4 MB L2 cache. jdk1.6.0_03 (1). I use
    > Eclipse in case that matters. Typical command line:
    >
    > java PatriciaTest PessimalDualMatrix 10000 8 1795725554
    >
    > (I confess my ignorance of Java is such that I don't know what -server
    > might do for me.)


    The Sun JVM has, broadly speaking, two modes: client and server. Client
    optimises for startup time and minimal memory consumption, and server
    optimises for speed. The mode affects the compiler (the server compiler is
    a *lot* better), and i think memory management. For tests like this, you
    really want server mode. I'm not sure which is the default; it used to be
    that client was, but i think it's now the case that it will be server if
    your machine is big enough, which i think means >2 cores and >2 GB RAM.
    Not sure.
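
    To force it, pass the flag explicitly, e.g.:

    java -server PatriciaTest PessimalDualMatrix 10000 8 1795725554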

    My results again:

    ParallelColumnMajorMatrix 414790.0
    DualMatrix 726620.0

    > ParallelColumnMajorMatrix 0.6838042
    > DualMatrix 0.5443394


    Interesting!

    Patricia's numbers:

    > ParallelColumnMajorMatrix 345285.12
    > DualMatrix 502855.2


    Vaguely similar to mine, although the difference is smaller than on my
    machine.

    > And replacing 10000 with 100000, which surely overflows my cache:
    >
    > ColumnMajorMatrix 6.429315
    > RowMajorMatrix 12.383814
    > ColumnMajorMatrix 6.390228
    > ParallelRowMajorMatrix 7.260556
    > ParallelColumnMajorMatrix 7.1654706
    > AntiParallelRowMajorMatrix 7.239504
    > AntiParallelColumnMajorMatrix 8.493671
    > DualMatrix 5.873374
    > PessimalDualMatrix 13.273498
    > UnpackedColumnMajorMatrix 7.186361
    > UnpackedRowMajorMatrix 16.7376
    > ParallelUnpackedColumnMajorMatrix 8.268656


    Also interesting! I note that the parallelisation, which speeds things up
    for the column-major matrix at N=10k, slows things down at N=100k.
    Possibly because it uses an accumulator array which is bigger than the
    cache. Not sure.

    > Differences in platform seem more important than differences in code.


    Really? There was about a 50% difference in times between your machine and
    mine. But there was a twofold or more difference in times between the best
    and worst strategies.

    What is particularly interesting, though, is that different strategies
    are best on different machines. On mine, the parallel column-major
    organisation was a fairly solid winner, especially against the dual
    matrix. On yours, the dual matrix convincingly edged it out, at both
    sizes.

    I guess the take-home message is that (a) you need to do benchmarks and
    (b) you need to do them on the target machine. Not surprising, i guess,
    but strikingly demonstrated here.

    It follows that there's scope for performance-critical libraries to
    include multiple strategies, and then to select an optimal one at
    runtime based on the characteristics of the host machine. There are
    some libraries which already do this, and they are some of the
    fastest libraries of their kind:

    http://math-atlas.sourceforge.net/
    http://www.fftw.org/
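
    In miniature, the idea is something like this (a sketch; this Matrix
    interface is a hypothetical stand-in, not my benchmark classes):

    // Time each candidate strategy briefly on the target machine and
    // keep the winner - a crude version of what ATLAS does at build
    // time and FFTW does at plan time.
    interface Matrix {
        double[] multiply(double[] v);
    }

    static Matrix pickFastest(Matrix[] candidates, double[] v, int trials) {
        Matrix best = null;
        long bestTime = Long.MAX_VALUE;
        for (Matrix m : candidates) {
            long start = System.nanoTime();
            for (int t = 0; t < trials; t++) {
                m.multiply(v);
            }
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = m;
            }
        }
        return best;
    }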

    tom

    --
    Tomorrow has made a phone call to today.
     
    Tom Anderson, Nov 24, 2008
    #9
  10. Tom Anderson

    Tom Anderson Guest

    On Tue, 25 Nov 2008, Graham Jones wrote:

    > "Tom Anderson" <> wrote in message
    > news:p...
    >
    >> The Sun JVM has, broadly speaking, two modes: client and server. Client
    >> optimises for startup time and minimal memory consumption, and server
    >> optimises for speed. The mode affects the compiler (the server compiler
    >> is a *lot* better), and i think memory management. For tests like this,
    >> you really want server mode.

    >
    > Yes indeed!
    >
    > ColumnMajorMatrix 0.3995627
    > RowMajorMatrix 0.44905844
    > ColumnMajorMatrix 0.39949283
    > ParallelRowMajorMatrix 0.43311155
    > ParallelColumnMajorMatrix 0.2588967
    > AntiParallelRowMajorMatrix 0.2619739
    > AntiParallelColumnMajorMatrix 0.6244125
    > DualMatrix 0.31512663
    > PessimalDualMatrix 0.53376275
    > UnpackedColumnMajorMatrix 0.4364346
    > UnpackedRowMajorMatrix 0.5116404
    > ParallelUnpackedColumnMajorMatrix 0.25846648


    Hurrah! And now my code's fastest again! :D

    tom

    --
    The literature is filled with bizarre occurrences for which we have
    no explanation
     
    Tom Anderson, Nov 25, 2008
    #10
