Re: Matrix performance question

Discussion in 'Java' started by John B. Matthews, Nov 20, 2008.

  1. In article <>,
    Patricia Shanahan <> wrote:

    > As indicated in the "Request for profiler recommendation" thread, I have
    > a program that is running inconveniently slowly. I applied one of the
    > recommendations, and determined that the time is being spent doing a
    > particular matrix operation, and in computing a Poisson distribution
    > probability. This article is about the matrix part of the problem.
    >
    > This is a high level description of the matrix operation, not
    > necessarily the current code.
    >
    > I have a matrix U, stored as a double[8][10,000], that represents part
    > of the result of a truncated singular value decomposition. U has 8
    > columns and about 10,000 rows, but I store it transposed to reduce the
    > total number of arrays.
    >
    > At several points in my code, I take a vector X, one row of 10,000
    > elements, and multiply it by U, producing a double[8] intermediate
    > result. I then multiply the intermediate result by the transpose of U,
    > producing another 10,000 element vector.
    >
    > I would like to do this faster, and have some ideas. To save time
    > trying things out, I would welcome opinions on which of these are most
    > likely to have as good performance as possible, as well as additional
    > suggestions. I will then implement and measure only the most promising
    > one or two ideas:
    >
    > 1. Throw it to Matlab. U resulted from a Matlab script, so I could keep
    > the Matlab session open and continue operating using it. However, I
    > would suffer the overhead of communication with Matlab for each vector.
    >
    > 2. Use Colt. Matlab outperformed Colt on the truncated singular value
    > decomposition, but that may be because Colt did not have a truncated
    > operation - I had to do the whole decomposition and truncate it myself.
    >
    > 3. Refactor to group multiple operations together. In some situations, I
    > need to do the operation on several vectors. I could group those vectors
    > into a matrix, and perhaps open up more optimization options. Does
    > anyone know whether the -server JVM restructures nested loops for cache
    > use optimization?
    >
    > Note that pre-computing U * U_transpose does not seem to be practical -
    > the result would take 800 MB of memory (10,000 x 10,000 doubles at 8
    > bytes each).
    >
    > Any other ideas I should be considering? Any packages that would do this
    > operation fast?
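
    For concreteness, the operation described above is just two small
    matrix-vector products back to back. A direct Java sketch (assuming U
    is stored transposed as a double[8][10000], per the description):

        // p = X * U (8 elements), then y = p * U_transpose (10,000 elements).
        // u[j] holds one column of the mathematical U, 10,000 entries long.
        static double[] project(double[][] u, double[] x) {
            final int k = u.length;       // 8
            final int n = u[0].length;    // ~10,000
            double[] p = new double[k];   // intermediate result
            for (int j = 0; j < k; j++) {
                double s = 0.0;
                for (int i = 0; i < n; i++) {
                    s += u[j][i] * x[i];  // sequential walk over u[j]
                }
                p[j] = s;
            }
            double[] y = new double[n];
            for (int j = 0; j < k; j++) { // again sequential over each u[j]
                final double pj = p[j];
                final double[] uj = u[j];
                for (int i = 0; i < n; i++) {
                    y[i] += pj * uj[i];
                }
            }
            return y;
        }

    That is only about 160,000 multiply-adds per call, all sequential array
    access. Grouping several X vectors into a matrix (idea 3 above) would
    let each u[j] row be reused for every vector while it is still in cache.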


    I was pleasantly surprised by the speed of the JScience library,
    possibly due to stack-based allocation afforded by Javolution, on which
    JScience is based. If staying in Java, it may be worth comparing
    JScience to Colt:

    <http://jscience.org/>
    <http://javolution.org/>

    To compare Java to another high-level language, I implemented a
    root-finding algorithm in Java & Ada:

    <http://sites.google.com/site/drjohnbmatthews/polyroots>
    <http://home.roadrunner.com/~jbmatthews/misc/groots.html>

    From the command line, the Ada appeared many times faster; but after
    factoring out JVM startup and Ada RT initialization, it was less than a
    factor of three.

    It's possible to mix Java & Ada, but you'd still need Matlab for
    truncation:

    <http://code.google.com/p/javada/>
    <http://www.adacore.com/2008/06/17/ada-java_interfacing_suite/>

    --
    John B. Matthews
    trashgod at gmail dot com
    http://home.roadrunner.com/~jbmatthews/
     
    John B. Matthews, Nov 20, 2008
    #1

  2. Tom Anderson

    On Thu, 20 Nov 2008, Patricia Shanahan wrote:

    > John B. Matthews wrote:
    >> In article <>,
    >> Patricia Shanahan <> wrote:
    >>
    >>> I have a matrix U, stored as a double[8][10,000], that represents part
    >>> of the result of a truncated singular value decomposition. U has 8
    >>> columns and about 10,000 rows, but I store it transposed to reduce the
    >>> total number of arrays.
    >>>
    >>> At several points in my code, I take a vector X, one row of 10,000
    >>> elements, and multiply it by U, producing a double[8] intermediate
    >>> result. I then multiply the intermediate result by the transpose of U,
    >>> producing another 10,000 element vector.
    >>>
    >>> Any other ideas I should be considering? Any packages that would do this
    >>> operation fast?

    >>
    >> To compare Java to another high-level language, I implemented a
    >> root-finding algorithm in Java & Ada:
    >>
    >> <http://sites.google.com/site/drjohnbmatthews/polyroots>
    >> <http://home.roadrunner.com/~jbmatthews/misc/groots.html>
    >>
    >> From the command line, the Ada appeared many times faster; but after
    >> factoring out JVM startup and Ada RT initialization, it was less than a
    >> factor of three.

    >
    > If I go with Ada for the critical code, it would have to be as a
    > combined Java/Ada program. The problem code is part of 25,000 line
    > program, most of which is most naturally expressed in terms of objects.


    If you're willing to do a big chunk of work in native code, doesn't
    fortran become an appealing option? Or rather, doesn't using BLAS or
    LAPACK through a minimal wrapper become attractive? BLAS will give you a
    constant-factor speedup, i would guess; if LAPACK has routines that are more
    specific to your needs, it might do even better.
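
    For illustration, the operation in question maps directly onto the BLAS
    routine dgemv (y := alpha*A*x + beta*y), so the wrapper really could be
    minimal. A hypothetical JNI sketch - the library name "blaswrap" and
    this exact signature are invented for the example:

        // Minimal JNI facade; the native side would forward to cblas_dgemv
        // (or the Fortran dgemv_) over the flattened array.
        public final class MiniBlas {
            static { System.loadLibrary("blaswrap"); }

            // a is the matrix flattened row-major; trans selects A or A^T
            public static native void dgemv(boolean trans, int rows, int cols,
                                            double alpha, double[] a,
                                            double[] x, double beta, double[] y);

            private MiniBlas() {}
        }

    Both of Patricia's products would then be dgemv calls against the same
    flattened copy of U, so the native-call overhead is paid once per
    product and amortised over 80,000 multiply-adds.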

    As for algorithmic improvements to your app - pass! I'd try a numerics
    newsgroup rather than cljp for that.

    tom

    --
    a moratorium on the future
     
    Tom Anderson, Nov 21, 2008
    #2

  3. Tom Anderson

    On Thu, 20 Nov 2008, John B. Matthews wrote:

    > I was pleasantly surprised by the speed of the JScience library,
    > possibly due to stack-based allocation afforded by Javolution, on which
    > JScience is based. If staying in Java, it may be worth comparing
    > JScience to Colt:
    >
    > <http://jscience.org/>
    > <http://javolution.org/>


    I'd never come across javolution before - it's a pretty impressive
    library. If it's as good as it claims, i'm surprised it's not more
    well-known.

    tom

    --
    a moratorium on the future
     
    Tom Anderson, Nov 21, 2008
    #3
  4. In article <>,
    Tom Anderson <> wrote:

    > On Thu, 20 Nov 2008, John B. Matthews wrote:
    >
    > > I was pleasantly surprised by the speed of the JScience library,
    > > possibly due to stack-based allocation afforded by Javolution, on which
    > > JScience is based. If staying in Java, it may be worth comparing
    > > JScience to Colt:
    > >
    > > <http://jscience.org/>
    > > <http://javolution.org/>

    >
    > I'd never come across javolution before - it's a pretty impressive
    > library. If it's as good as it claims, i'm surprised it's not more
    > well-known.


    Difficult question. I suppose there is the hope that one or more
    Javolution features will be obviated in a future release of the
    standard platform, e.g.:

    "It should be noted that future versions of the JVM may provide some
    limited support for stack allocation through escape analysis. Users can
    always turn-off stack allocation to revert to standard heap allocation."

    <http://javolution.org/api/javolution/context/StackContext.html>

    --
    John B. Matthews
    trashgod at gmail dot com
    http://home.roadrunner.com/~jbmatthews/
     
    John B. Matthews, Nov 21, 2008
    #4
  5. In article <>,
    Tom Anderson <> wrote:

    > On Thu, 20 Nov 2008, Patricia Shanahan wrote:
    >
    > > John B. Matthews wrote:
    > >> In article <>,
    > >> Patricia Shanahan <> wrote:
    > >>
    > >>> I have a matrix U, stored as a double[8][10,000], that represents part
    > >>> of the result of a truncated singular value decomposition. U has 8
    > >>> columns and about 10,000 rows, but I store it transposed to reduce the
    > >>> total number of arrays.
    > >>>
    > >>> At several points in my code, I take a vector X, one row of 10,000
    > >>> elements, and multiply it by U, producing a double[8] intermediate
    > >>> result. I then multiply the intermediate result by the transpose of U,
    > >>> producing another 10,000 element vector.
    > >>>
    > >>> Any other ideas I should be considering? Any packages that would do this
    > >>> operation fast?
    > >>
    > >> To compare Java to another high-level language, I implemented a
    > >> root-finding algorithm in Java & Ada:
    > >>
    > >> <http://sites.google.com/site/drjohnbmatthews/polyroots>
    > >> <http://home.roadrunner.com/~jbmatthews/misc/groots.html>
    > >>
    > >> From the command line, the Ada appeared many times faster; but after
    > >> factoring out JVM startup and Ada RT initialization, it was less than a
    > >> factor of three.

    > >
    > > If I go with Ada for the critical code, it would have to be as a
    > > combined Java/Ada program. The problem code is part of 25,000 line
    > > program, most of which is most naturally expressed in terms of objects.

    >
    > If you're willing to do a big chunk of work in native code, doesn't
    > fortran become an appealing option? Or rather, doesn't using BLAS or
    > LAPACK through a minimal wrapper become attractive? BLAS will give you a
    > constant-time speedup, i would guess; if LAPACK has routines that are more
    > specific to your needs, it might do even better.


    This makes sense. Indeed, one widely available Ada compiler (GNAT)
    typically implements the numeric parts of the language's standard
    library using BLAS. I guess in this use, Ada would be a kind of
    wrapper, albeit one with nice operator overloading. :)

    --
    John B. Matthews
    trashgod at gmail dot com
    http://home.roadrunner.com/~jbmatthews/
     
    John B. Matthews, Nov 21, 2008
    #5
  6. Stefan Ram

    "John B. Matthews" <> writes:
    >I was pleasantly surprised by the speed of the JScience library,
    >possibly due to stack-based allocation afforded by Javolution, on which
    >JScience is based. If staying in Java, it may be worth comparing
    >JScience to Colt:
    ><http://jscience.org/>
    ><http://javolution.org/>


    »Javolution is a pure Java Solution«

    http://javolution.org/

    Does this mean, they implement stack-based allocation
    with Java source code only?

    If so, how do they do this?
     
    Stefan Ram, Nov 21, 2008
    #6
  7. Tom Anderson

    On Fri, 21 Nov 2008, Stefan Ram wrote:

    > "John B. Matthews" <> writes:
    >> I was pleasantly surprised by the speed of the JScience library,
    >> possibly due to stack-based allocation afforded by Javolution, on which
    >> JScience is based. If staying in Java, it may be worth comparing
    >> JScience to Colt:
    >> <http://jscience.org/>
    >> <http://javolution.org/>

    >
    > »Javolution is a pure Java Solution«
    >
    > http://javolution.org/
    >
    > Does this mean, they implement stack-based allocation
    > with Java source code only?
    >
    > If so, how do they do this?


    "using thread-local pools or RTSJ ScopedMemory" [1]

    Although:

    "the default implementation [...] uses thread-local pools. RTSJ
    alternative implementations could use ScopedMemory for their stack
    allocations. "

    So it sounds like it's done with 'thread-local pools'. I have a hard time
    believing this would actually be faster than normal allocation in a modern
    VM. I'd be interested to see a performance comparison, and to read the
    source for this class (something i haven't got time to do right now).

    tom

    [1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
    package, with classes for doing all sorts of hairy and unusual things.

    --
    A paranoid-schizophrenic is a guy who just found out what's going on. --
    William S. Burroughs
     
    Tom Anderson, Nov 21, 2008
    #7
  8. Stefan Ram

    Tom Anderson <> writes:
    >[1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
    >package, with classes for doing all sorts of hairy and unusual things.


    If RTSJ is written in pure Java source code,
    my question of how it is done then pertains to RTSJ itself.

    If RTSJ is /not/ written in pure Java source code,
    then I can understand how it is done (by some magic
    in native methods or JVM-extensions). But if Javolution
    depends on such magic, which is not part of J2SE,
    I would not call it »pure Java«.
     
    Stefan Ram, Nov 21, 2008
    #8
  9. Tom Anderson

    On Fri, 21 Nov 2008, John B. Matthews wrote:

    > In article <>,
    > Tom Anderson <> wrote:
    >
    >> On Thu, 20 Nov 2008, John B. Matthews wrote:
    >>
    >>> I was pleasantly surprised by the speed of the JScience library,
    >>> possibly due to stack-based allocation afforded by Javolution, on which
    >>> JScience is based. If staying in Java, it may be worth comparing
    >>> JScience to Colt:
    >>>
    >>> <http://jscience.org/>
    >>> <http://javolution.org/>

    >>
    >> I'd never come across javolution before - it's a pretty impressive
    >> library. If it's as good as it claims, i'm surprised it's not more
    >> well-known.

    >
    > Difficult question. I suppose there is the hope that one or more
    > Javolution features will be obviated in a future release of the
    > standard platform, e.g.:
    >
    > "It should be noted that future versions of the JVM may provide some
    > limited support for stack allocation through escape analysis. Users can
    > always turn-off stack allocation to revert to standard heap allocation."
    >
    > <http://javolution.org/api/javolution/context/StackContext.html>


    Sun's JVM, as of 5 or 6, certainly does some stack allocation via escape
    analysis.

    I was more thinking of Javolution's various realtime-friendly classes, and
    fast XML serialisation, which seem pretty cool.

    tom

    --
    A paranoid-schizophrenic is a guy who just found out what's going on. --
    William S. Burroughs
     
    Tom Anderson, Nov 21, 2008
    #9
  10. In article <-berlin.de>,
    -berlin.de (Stefan Ram) wrote:

    > "John B. Matthews" <> writes:
    > >I was pleasantly surprised by the speed of the JScience library,
    > >possibly due to stack-based allocation afforded by Javolution, on which
    > >JScience is based. If staying in Java, it may be worth comparing
    > >JScience to Colt:
    > ><http://jscience.org/>
    > ><http://javolution.org/>

    >
    > »Javolution is a pure Java Solution«
    >
    > http://javolution.org/
    >
    > Does this mean, they implement stack-based allocation
    > with Java source code only?
    >
    > If so, how do they do this?


    Penetrating question. IIUC, they don't. Apparently, they use a
    configurable stack context, implemented using thread-local pools:

    <http://javolution.org/api/javolution/context/StackContext.html>
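
    For reference, the opt-in pattern described there looks something like
    this - a sketch based on my reading of the Javolution 5 javadoc, so the
    exact method names may vary between versions:

        import javolution.context.StackContext;
        import org.jscience.mathematics.number.LargeInteger;

        public final class StackDemo {
            static LargeInteger sumOfSquares(LargeInteger a, LargeInteger b) {
                StackContext.enter();    // allocations go to a thread-local pool
                try {
                    LargeInteger s = a.times(a).plus(b.times(b));
                    return StackContext.outerCopy(s); // copy out before pool reset
                } finally {
                    StackContext.exit(); // recycles everything allocated inside
                }
            }
        }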

    As Patricia has noted, her problem has low allocation overhead. In my
    tinkering with Polynomial<Complex> and Rational<LargeInteger>, there may
    be more benefit, although I have no definitive results.

    --
    John B. Matthews
    trashgod at gmail dot com
    http://home.roadrunner.com/~jbmatthews/
     
    John B. Matthews, Nov 21, 2008
    #10
  11. In article <-berlin.de>,
    -berlin.de (Stefan Ram) wrote:

    > Tom Anderson <> writes:
    > >[1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
    > >package, with classes for doing all sorts of hairy and unusual things.

    >
    > If RTSJ is written in pure Java source code,
    > my question of how it is done then pertains to RTSJ itself.
    >
    > If RTSJ is /not/ written in pure Java source code,
    > then I can understand how it is done (by some magic
    > in native methods or JVM-extensions). But if Javolution
    > depends on such magic, which is not part of J2SE,
    > I would not call it »pure Java«.


    I think they default to thread-local pools but leave the door open for
    RTSJ. [_terra_incognita_ to me]

    "This implementation uses thread-local pools. RTSJ alternative
    implementations could use ScopedMemory for their stack allocations."

    <http://javolution.org/api/javolution/context/StackContext.html#DEFAULT>

    --
    John B. Matthews
    trashgod at gmail dot com
    http://home.roadrunner.com/~jbmatthews/
     
    John B. Matthews, Nov 21, 2008
    #11
  12. In article <>,
    Tom Anderson <> wrote:

    > On Fri, 21 Nov 2008, Stefan Ram wrote:
    >
    > > "John B. Matthews" <> writes:
    > >> I was pleasantly surprised by the speed of the JScience library,
    > >> possibly due to stack-based allocation afforded by Javolution, on which
    > >> JScience is based. If staying in Java, it may be worth comparing
    > >> JScience to Colt:
    > >> <http://jscience.org/>
    > >> <http://javolution.org/>

    > >
    > > »Javolution is a pure Java Solution«
    > >
    > > http://javolution.org/
    > >
    > > Does this mean, they implement stack-based allocation
    > > with Java source code only?
    > >
    > > If so, how do they do this?

    >
    > "using thread-local pools or RTSJ ScopedMemory" [1]
    >
    > Although:
    >
    > "the default implementation [...] uses thread-local pools. RTSJ
    > alternative implementations could use ScopedMemory for their stack
    > allocations. "
    >
    > So it sounds like it's done with 'thread-local pools'. I have a hard time
    > believing this would actually be faster than normal allocation in a modern
    > VM. I'd be interested to see a performance comparison, and to read the
    > source for this class (something i haven't got time to do right now).


    There's a quick, built-in performance benchmark that's apropos:

    $ java -jar jscience.jar perf
    ....
    LargeInteger (StackContext) versus BigInteger
    LargeInteger (1024 bits) addition: 1.032 us
    LargeInteger (1024 bits) multiplication: 41.40 us
    BigInteger (1024 bits) addition: 4.487 us
    BigInteger (1024 bits) multiplication: 52.16 us
    ....

    I see the feature is enabled for LargeInteger, DenseVector and Matrix,
    where it appears to have a benefit. My JVM does slightly better on its
    own in my Complex test.

    --
    John B. Matthews
    trashgod at gmail dot com
    http://home.roadrunner.com/~jbmatthews/
     
    John B. Matthews, Nov 21, 2008
    #12
  13. Tom Anderson

    On Fri, 21 Nov 2008, Stefan Ram wrote:

    > Tom Anderson <> writes:
    >> [1] RTSJ = Real-Time Specification for Java, a fairly scary add-on
    >> package, with classes for doing all sorts of hairy and unusual things.

    >
    > If RTSJ is written in pure Java source code,
    > my question of how it is done then pertains to RTSJ itself.
    >
    > If RTSJ is /not/ written in pure Java source code,
    > then I can understand how it is done (by some magic
    > in native methods or JVM-extensions).


    RTSJ is definitely not pure java. It requires a specialised JVM, amongst
    other things.

    > But if Javolution
    > depends on such magic, which is not part of J2SE,
    > I would not call it »pure Java«.


    Presumably, that means you also wouldn't call any J2EE app pure java,
    either?

    tom

    --
    Sorry. Went a bit Atari Teenage Riot there. -- Andrew
     
    Tom Anderson, Nov 21, 2008
    #13
  14. Tom Anderson

    On Fri, 21 Nov 2008, John B. Matthews wrote:

    > In article <>,
    > Tom Anderson <> wrote:
    >
    >> On Fri, 21 Nov 2008, Stefan Ram wrote:
    >>
    >>> "John B. Matthews" <> writes:
    >>>> I was pleasantly surprised by the speed of the JScience library,
    >>>> possibly due to stack-based allocation afforded by Javolution, on which
    >>>> JScience is based. If staying in Java, it may be worth comparing
    >>>> JScience to Colt:
    >>>> <http://jscience.org/>
    >>>> <http://javolution.org/>
    >>>
    >>> »Javolution is a pure Java Solution«
    >>>
    >>> http://javolution.org/
    >>>
    >>> Does this mean, they implement stack-based allocation
    >>> with Java source code only?
    >>>
    >>> If so, how do they do this?

    >>
    >> "using thread-local pools or RTSJ ScopedMemory" [1]
    >>
    >> Although:
    >>
    >> "the default implementation [...] uses thread-local pools. RTSJ
    >> alternative implementations could use ScopedMemory for their stack
    >> allocations. "
    >>
    >> So it sounds like it's done with 'thread-local pools'. I have a hard time
    >> believing this would actually be faster than normal allocation in a modern
    >> VM. I'd be interested to see a performance comparison, and to read the
    >> source for this class (something i haven't got time to do right now).

    >
    > There's a quick, built-in performance benchmark that's apropos:
    >
    > $ java -jar jscience.jar perf
    > ...
    > LargeInteger (StackContext) versus BigInteger
    > LargeInteger (1024 bits) addition: 1.032 us
    > LargeInteger (1024 bits) multiplication: 41.40 us
    > BigInteger (1024 bits) addition: 4.487 us
    > BigInteger (1024 bits) multiplication: 52.16 us


    Interesting. The Large and Big integers are different classes, so this
    comparison is, if not apples and oranges, at least apples and pears. The
    JScience docs do claim that LargeInteger is faster than BigInteger, i
    assume regardless of allocation technique, so it's quite possible that
    the speedup comes from better algorithms there rather than from the
    allocation strategy.

    Having looked at the implementation of the "thread-local pools" approach,
    i really can't see how it could be faster than heap allocation. It's a
    straightforward object pooling system at heart, with pools that are
    thread-local rather than global. There's no magic. We know that object
    pooling is generally a bad idea from a throughput point of view in modern
    VMs, so this doesn't look great to me.
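
    To make "straightforward object pooling" concrete, the mechanism boils
    down to something like this (an illustrative reconstruction, not
    Javolution's actual code):

        import java.util.ArrayDeque;

        // Each thread owns a private free list, so acquire/recycle need no
        // synchronization - but objects must be handed back explicitly, and
        // a retained reference to a recycled object is a silent bug.
        abstract class ThreadLocalPool<T> {
            private final ThreadLocal<ArrayDeque<T>> free =
                new ThreadLocal<ArrayDeque<T>>() {
                    @Override protected ArrayDeque<T> initialValue() {
                        return new ArrayDeque<T>();
                    }
                };

            protected abstract T create(); // called when the pool is empty

            public T acquire() {
                T obj = free.get().pollFirst();
                return (obj != null) ? obj : create();
            }

            public void recycle(T obj) {
                free.get().addFirst(obj);
            }
        }

    The explicit recycle step is exactly the bookkeeping that a generational
    collector with bump-pointer allocation spares you, which is why pooling
    small objects rarely wins on a modern VM.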

    tom

    --
    Sorry. Went a bit Atari Teenage Riot there. -- Andrew
     
    Tom Anderson, Nov 21, 2008
    #14
  15. Tom Anderson wrote:
    >
    > Sun's JVM, as of 5 or 6, certainly does some stack allocation via escape
    > analysis.
    >
    > I was more thinking of Javolution's various realtime-friendly classes,
    > and fast XML serialisation, which seem pretty cool.
    >


    I think escape analysis is so far only used to eliminate unnecessary
    locking. The work on eliminating allocations hasn't yet made it into a
    release version.

    Mark Thornton
     
    Mark Thornton, Nov 22, 2008
    #15
