Architecture ideas for batch processing?

Discussion in 'Java' started by Alan Meyer, Jun 12, 2005.

  1. Alan Meyer

    Alan Meyer Guest

    I'm working on an architecture for a fairly large scale system
    in Java. User interactions will be through JSP and servlets,
    but there is also a big batch processing component with multiple
    jobs that can run for minutes to hours, and some of them running
    in parallel.

    I'm looking for architecture suggestions for the batch job
    portion.

    Specifically, does anyone have opinions on running jobs as
    separate threads in a single virtual machine, vs running
    multiple independent JVMs?

    A single JVM has the advantage of less system overhead and
    easier inter-thread communication - which may or may not turn
    out to be useful. But it means that if one job crashes it can
    bring the others down with it. It also means that if one job
    goes into a loop and has to be terminated, the operator will
    have no choice but to kill everything.

    Multiple independent jobs looks more robust, but Java's penchant
    for swallowing memory may make this significantly less
    efficient.

    Anyone have experience or opinions on this?

    Any other good ideas for someone doing batch processing in Java?

    Anyone know of open source or commercial toolkits of special
    interest for batch processing?

    Thanks.

    Alan
     
    Alan Meyer, Jun 12, 2005
    #1

  2. Alan Meyer

    Ray in HK Guest

    What is the market price of RAM?

    "Alan Meyer" <> wrote in message
    news:...
    > I'm working on an architecture for a fairly large scale system
    > in Java. User interactions will be through JSP and servlets,
    > but there is also a big batch processing component with multiple
    > jobs that can run for minutes to hours, and some of them running
    > in parallel.
    >
    > I'm looking for architecture suggestions for the batch job
    > portion.
    >
    > Specifically, does anyone have opinions on running jobs as
    > separate threads in a single virtual machine, vs running
    > multiple independent JVMs?
    >
    > A single JVM has the advantage of less system overhead and
    > easier inter-thread communication - which may or may not turn
    > out to be useful. But it means that if one job crashes it can
    > bring the others down with it. It also means that if one job
    > goes into a loop and has to be terminated, the operator will
    > have no choice but to kill everything.
    >
    > Multiple independent jobs looks more robust, but Java's penchant
    > for swallowing memory may make this significantly less
    > efficient.
    >
    > Anyone have experience or opinions on this?
    >
    > Any other good ideas for someone doing batch processing in Java?
    >
    > Anyone know of open source or commercial toolkits of special
    > interest for batch processing?
    >
    > Thanks.
    >
    > Alan
    >
    >
     
    Ray in HK, Jun 12, 2005
    #2

  3. Alan Meyer

    Alan Meyer Guest

    "Ray in HK" <> wrote in message news:d8ge0v$...
    > What is the market price of RAM?


    Thanks for the reply.

    RAM can be surprisingly expensive when you buy it in multi-bank,
    multi-gigabyte modules for Sun machines. But your point is well
    taken. The extra cost may be justified.

    There may be some processing efficiencies besides using
    less memory management in a multi-threaded vs. multi-tasking
    design.

    Maybe someone has experience in this area?
     
    Alan Meyer, Jun 12, 2005
    #3
  4. Alan Meyer

    Aquila Deus Guest

    Alan Meyer wrote:
    > I'm working on an architecture for a fairly large scale system
    > in Java. User interactions will be through JSP and servlets,
    > but there is also a big batch processing component with multiple
    > jobs that can run for minutes to hours, and some of them running
    > in parallel.
    >
    > I'm looking for architecture suggestions for the batch job
    > portion.
    >
    > Specifically, does anyone have opinions on running jobs as
    > separate threads in a single virtual machine, vs running
    > multiple independent JVMs?
    >
    > A single JVM has the advantage of less system overhead and
    > easier inter-thread communication - which may or may not turn
    > out to be useful. But it means that if one job crashes it can
    > bring the others down with it. It also means that if one job
    > goes into a loop and has to be terminated, the operator will
    > have no choice but to kill everything.
    >
    > Multiple independent jobs looks more robust, but Java's penchant
    > for swallowing memory may make this significantly less
    > efficient.
    >
    > Anyone have experience or opinions on this?
    >
    > Any other good ideas for someone doing batch processing in Java?
    >
    > Anyone know of open source or commercial toolkits of special
    > interest for batch processing?


    Multi-JVM. If you are worried about crashes, multiple processes are the
    only solution - even .NET's Application Domains cannot ensure 100%
    isolation inside a process.

    The memory problem can be solved later, but a single-process solution
    is flawed from the start.

    However, most developers of Java application servers wouldn't agree
    with me :)
     
    Aquila Deus, Jun 12, 2005
    #4
  5. Alan Meyer

    Aquila Deus Guest

    Alan Meyer wrote:
    > "Ray in HK" <> wrote in message news:d8ge0v$...
    > > What is the market price of RAM?

    >
    > Thanks for the reply.
    >
    > RAM can be surprisingly expensive when you buy it in multi-bank,
    > multi-gigabyte modules for Sun machines. But your point is well
    > taken. The extra cost may be justified.
    >
    > There may be some processing efficiencies besides using
    > less memory management in a multi-threaded vs. multi-tasking
    > design.
    >
    > Maybe someone has experience in this area?


    Ask Unix or Windows experts. There are at least three disadvantages:

    1. Creating a process uses more resources than creating a thread. But
    given Java's own resource needs, .... :)

    2. Context switching between processes takes longer than switching
    between threads.

    3. On non-SMP machines, synchronization between processes is much
    heavier than between threads. On x86, all you need for a mutex between
    threads is an atomic exchange instruction, but between processes....


    PS: none of the above is really important.
     
    Aquila Deus, Jun 12, 2005
    #5
  6. Alan Meyer

    Harald Guest

    "Alan Meyer" <> writes:

    > Specifically, does anyone have opinions on running jobs as
    > separate threads in a single virtual machine, vs running
    > multiple independent JVMs?



    > A single JVM has the advantage of less system overhead and

    [...]

    > Multiple independent jobs looks more robust, but Java's penchant

    [...]

    By default, the VM tries to operate with 70% more allocated memory
    than is currently needed for all objects. You can reduce this with
    -XX:MaxHeapFreeRatio (see [1]). Call the GC explicitly after
    dismissing any huge object to convince it to really obey the
    MaxHeapFreeRatio (see [2]). This, and the added stability may be in
    favor of running independent VMs.
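
    A minimal sketch for checking what such a setting does on a given
    JVM: HotSpot documents -XX:MaxHeapFreeRatio as the maximum percentage
    of the heap that may be free after a collection before the heap is
    shrunk, and the same ratio can be read from java.lang.Runtime. The
    class name HeapStats and its methods are hypothetical, made up for
    this illustration. Whether the heap really shrinks depends on the JVM
    version and collector in use.

```java
/** Hypothetical helper (not part of any library): reports how much of the committed heap is free. */
final class HeapStats {
    private HeapStats() {}

    /** Percentage of the committed heap that is currently free. */
    static double freeRatioPercent(long committedBytes, long freeBytes) {
        if (committedBytes <= 0) {
            throw new IllegalArgumentException("committedBytes must be positive");
        }
        return 100.0 * freeBytes / committedBytes;
    }

    /** The same figure for the running JVM, read from java.lang.Runtime. */
    static double currentFreeRatioPercent() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the heap currently committed by the JVM,
        // freeMemory() is the unused part of that committed heap.
        return freeRatioPercent(rt.totalMemory(), rt.freeMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.printf("committed=%d MB, free=%d MB, limit=%d MB, free ratio=%.1f%%%n",
                rt.totalMemory() / mb, rt.freeMemory() / mb, rt.maxMemory() / mb,
                currentFreeRatioPercent());
    }
}
```

    Running it as "java -XX:MaxHeapFreeRatio=30 HeapStats" and again
    without the flag, at the points where a job releases its large
    objects, shows whether the committed heap actually comes down.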

    On the other hand, Java's Process class is pretty poor compared to
    proper process management. If you need more than trivial communication
    between processes, go for threads. As for killing individual threads
    that have gone crazy, jdb's remote interface may be an option, though
    I never tried this myself.
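
    For the trivial-communication case, a sketch of running one job in
    its own JVM with java.lang.ProcessBuilder and killing it if it
    overruns. The class ChildJvmJob, its method names and the -1 "killed"
    status are assumptions made for this illustration, and
    Process.waitFor(timeout, unit) needs Java 8 or later. The only
    channel back to the parent here is the exit code.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs one batch job in a separate JVM. */
final class ChildJvmJob {
    private ChildJvmJob() {}

    /** Builds a command line that starts mainClass with the parent's java binary and classpath. */
    static List<String> commandFor(String mainClass, String... jobArgs) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(mainClass);
        for (String arg : jobArgs) {
            cmd.add(arg);
        }
        return cmd;
    }

    /**
     * Starts the command and waits for it. Returns the child's exit code,
     * or -1 (an arbitrary marker chosen here) if it was killed after the timeout.
     */
    static int run(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .inheritIO()            // child writes to the parent's stdout/stderr
                .start();
        if (!child.waitFor(timeout, unit)) {
            child.destroyForcibly();    // only this job dies, the parent keeps running
            child.waitFor();            // reap the process before returning
            return -1;
        }
        return child.exitValue();
    }
}
```

    A scheduler could call, for example,
    ChildJvmJob.run(ChildJvmJob.commandFor("com.example.PublishJob"), 2, TimeUnit.HOURS),
    where com.example.PublishJob is a placeholder class name.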

    Harald.

    [1] http://java.sun.com/docs/hotspot/VMOptions.html
    [2] http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/stuff/ConvinceGC.html


    --
    ---------------------+---------------------------------------------
    Harald Kirsch (@home)|
    Java Text Crunching: http://www.ebi.ac.uk/Rebholz-srv/whatizit/software
     
    Harald, Jun 12, 2005
    #6
  7. Alan Meyer

    Patrick May Guest

    Harald <> writes:
    > On the other hand, Java's Process class is pretty poor compared to
    > proper process management. If you need more than trivial
    > communication between processes, go for threads.


    Another alternative is to use a JavaSpace. For grid and
    autonomic computing systems it provides an elegant solution to the
    problem of interprocess communication.

    Regards,

    Patrick

    ------------------------------------------------------------------------
    S P Engineering, Inc. | The experts in large scale distributed OO
    | systems design and implementation.
    | (C++, Java, Common Lisp, Jini, CORBA, UML)
     
    Patrick May, Jun 12, 2005
    #7
  8. Alan Meyer

    Bjorn Borud Guest

    ["Alan Meyer" <>]
    |
    | RAM can be surprisingly expensive when you buy it in multi-bank,
    | multi-gigabyte modules for Sun machines. But your point is well
    | taken. The extra cost may be justified.

    if we are talking about a large system and the batch processing is
    easily separable from your other infrastructure, why not buy cheap
    Intel or AMD based machines, fill them up with RAM and run your batch
    jobs there? it might be cheaper overall?

    | There may be some processing efficiencies besides using
    | less memory management in a multi-threaded vs. multi-tasking
    | design.
    |
    | Maybe someone has experience in this area?

    it is hard to give any sort of meaningful answer as long as I have no
    idea what the nature of your batch processing tasks is :).

    -Bjørn
     
    Bjorn Borud, Jun 13, 2005
    #8
  9. Alan Meyer

    Lucy Guest

    "Aquila Deus" <> wrote in message
    news:...
    > Alan Meyer wrote:
    > > "Ray in HK" <> wrote in message

    news:d8ge0v$...
    > > > What is the market price of RAM?

    > >
    > > Thanks for the reply.
    > >
    > > RAM can be surprisingly expensive when you buy it in multi-bank,
    > > multi-gigabyte modules for Sun machines. But your point is well
    > > taken. The extra cost may be justified.
    > >
    > > There may be some processing efficiencies besides using
    > > less memory management in a multi-threaded vs. multi-tasking
    > > design.
    > >
    > > Maybe someone has experience in this area?

    >
    > Ask unix or windows experts. There are at least three disadvantages:
    >
    > 1.Allocating a process uses more resource than a new thread. But given
    > Java's own resource need, .... :)
    >
    > 2.Context-switching processes takes longer time than switching threads.


    So on a per batch basis, the percentage of time wasted is this:

    (A tiny part of a fraction of a second) / (A few minutes or hours)
    is approx == 0, so just forget about it.

    > 3.On non-SMP, sync between processes are super heavy compared to
    > thread's. In x86 all you need to do mutex between threads is an
    > exchange command, but between processes....
    >
    >
    > PS: none of above is really important.
    >
     
    Lucy, Jun 13, 2005
    #9
  10. Alan Meyer

    Bjorn Borud Guest

    ["Alan Meyer" <>]
    |
    | But it means that if one job crashes it can bring the others down
    | with it. It also means that if one job goes into a loop and has to
    | be terminated, the operator will have no choice but to kill
    | everything.

    what do you mean by "crash" here? threads dying because of unhandled
    exceptions or hard errors that make the JVM die?
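
    The distinction matters in practice. A sketch, assuming jobs are
    submitted to a java.util.concurrent thread pool (the class and method
    names below are hypothetical): an exception thrown by one job is
    confined to that job's Future, and a looping job can be stopped with
    Future.cancel(true), but only if the job checks its interrupt flag.
    Neither technique helps against a hard JVM failure such as a native
    crash, which takes every thread down.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of what threads in one JVM can and cannot isolate. */
final class ThreadIsolationDemo {
    private ThreadIsolationDemo() {}

    /** One job throws, the other still completes and its result is returned. */
    static String survivesSiblingFailure() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<String> failing = () -> {
                throw new IllegalStateException("job A failed");
            };
            Callable<String> healthy = () -> "job B done";
            Future<String> a = pool.submit(failing);
            Future<String> b = pool.submit(healthy);
            try {
                a.get();
            } catch (ExecutionException e) {
                // The failure surfaces here, wrapped, and affects only job A.
                System.err.println("job A: " + e.getCause());
            }
            return b.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Cancels a looping job; returns true if its thread stopped within five seconds. */
    static boolean cancelLoopingJob() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        Future<?> looping = pool.submit(() -> {
            started.countDown();
            // Cooperative loop: it exits only because it polls the interrupt flag.
            while (!Thread.currentThread().isInterrupted()) {
                // simulated endless work
            }
        });
        try {
            started.await();
            looping.cancel(true);   // interrupts the worker thread
            pool.shutdown();
            return pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

    A job that never checks for interruption cannot be stopped this way,
    which is the case where a separate process that can be killed from
    outside is the more robust choice.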

    I'd prefer to model the batch processing APIs so that you don't really
    have to make a decision before you know you have to. you just
    abstract away if the job runs in the same JVM or not. provide
    implementations for running jobs in the same JVM first. if it becomes
    a problem doing so or it would make sense for other reasons to move
    the processing elsewhere, implement whatever is needed for sending the
    batch job to a different JVM (possibly on a different machine).

    the important part is to

    - have proper abstractions so that later you have the freedom
    to choose.

    - implement the remote processing when needed, and not start by
    prematurely assuming that it is required.
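
    The two points above can be sketched as a pair of interfaces. All
    names here (BatchJob, JobLauncher, InJvmLauncher) are hypothetical
    and chosen for illustration; this is one possible shape, not a
    prescribed design. Callers depend only on JobLauncher, so a launcher
    that starts a separate JVM or ships the job to another machine can
    be added later without touching the jobs themselves.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** A unit of batch work; the int result plays the role of an exit status. */
interface BatchJob {
    int run() throws Exception;
}

/** Where and how a job runs is hidden behind this interface. */
interface JobLauncher {
    Future<Integer> launch(BatchJob job);

    /** Convenience: launch and block until the job's status is available. */
    default int launchAndWait(BatchJob job) {
        try {
            return launch(job).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for job", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed", e.getCause());
        }
    }
}

/** First implementation: jobs run as threads inside the current JVM. */
final class InJvmLauncher implements JobLauncher, AutoCloseable {
    private final ExecutorService pool;

    InJvmLauncher(int maxParallelJobs) {
        this.pool = Executors.newFixedThreadPool(maxParallelJobs);
    }

    @Override
    public Future<Integer> launch(BatchJob job) {
        Callable<Integer> task = job::run;
        return pool.submit(task);
    }

    @Override
    public void close() {
        pool.shutdown();    // lets running jobs finish, accepts no new ones
    }
}
```

    Usage is the same whichever launcher is behind the interface, for
    example: int status = launcher.launchAndWait(() -> 0);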

    good luck!

    -Bjørn
     
    Bjorn Borud, Jun 13, 2005
    #10
  11. Alan Meyer

    Bjorn Borud Guest

    [Harald <>]
    |
    | Call the GC explicitly after dismissing any huge object to convince
    | it to really obey the MaxHeapFreeRatio (see [2]).

    triggering major collections blindly every second is not exactly an
    optimal solution to this problem. this is indeed *very* bad advice.

    I would recommend reading a bit more about how the JVM you want to use
    manages its memory. if you use Sun's JVM for instance you can read:

    http://java.sun.com/docs/hotspot/

    I would also recommend using jvmstat and visualgc to inspect what your
    JVM is doing. after using it for a while you will become more
    familiar with the hotspot GC system and you will most likely be able
    to spot various problems with heap sizing etc quite fast.

    -Bjørn
     
    Bjorn Borud, Jun 13, 2005
    #11
  12. Alan Meyer

    Chris Uppal Guest

    Bjorn Borud wrote:

    > the important part is to
    >
    > - have proper abstractions so that later you have the freedom
    > to choose.
    >
    > - implement the remote processing when needed, and not start by
    > prematurely assuming that it is required.


    I agree with your philosophy here, but I think I'd come to the opposite
    conclusion.

    Presumably there is no compelling /need/ for the batch code to run in the same
    JVM as the online code (the two don't need to interact directly). In that case
    I'd want to start with the simple and inherently robust architecture of using
    separate processes, and take the options of moving them onto separate machines
    or of moving them into a shared JVM as and when it became necessary.

    -- chris
     
    Chris Uppal, Jun 13, 2005
    #12
  13. Alan Meyer

    Bjorn Borud Guest

    ["Chris Uppal" <-THIS.org>]
    |
    | I agree with your philosophy here, but I think I'd come to the opposite
    | conclusion.
    |
    | Presumably there is no compelling /need/ for the batch code to run
    | in the same JVM as the online code (the two don't need to interact
    | directly).

    the problem is that we can't really know that based on the information
    the OP has posted. if we knew exactly what problem he tried to solve
    it would be easier to give recommendations.

    -Bjørn
     
    Bjorn Borud, Jun 13, 2005
    #13
  14. Alan Meyer

    Alan Meyer Guest

    Alan Meyer wrote:
    > ...
    > I'm looking for architecture suggestions for the batch job
    > portion.
    > ...


    Thank you all for your ideas and suggestions. I will follow
    up on them.

    If anyone else has more ideas, please chime in. I will continue
    to follow the thread.

    Alan
     
    Alan Meyer, Jun 13, 2005
    #14
  15. Alan Meyer

    Chris Uppal Guest

    Bjorn Borud wrote:

    > the problem is that we can't really know that based on the information
    > the OP has posted. if we knew exactly what problem he tried to solve
    > it would be easier to give recommendations.


    Agreed.

    -- chris
     
    Chris Uppal, Jun 14, 2005
    #15
  16. Alan Meyer

    Bjorn Borud Guest

    ["HK" <>]
    |
    | May I kindly ask you to carefully read my posting and the
    | documentation of ConvinceGC before you suggest I am
    | a complete idiot?

    indeed, I was mistaken. I misread the API and thought the class was
    used to call System.gc() at a given interval indefinitely (which would
    be a bad idea, and indeed, bad advice).

    | The only thing that I can see that may have misled you
    | is a different understanding of what a "huge object" is.

    that, and the fact that I misread the API. my apologies.

    -Bjørn
     
    Bjorn Borud, Jun 14, 2005
    #16
  17. Alan Meyer

    HK Guest

    Bjorn Borud wrote:
    > [Harald <>]
    > |
    > | Call the GC explicitly after dismissing any huge object to convince
    > | it to really obey the MaxHeapFreeRatio (see [2]).
    >
    > triggering major collections blindly every second is not exactly an
    > optimal solution to this problem. this is indeed *very* bad advice.


    May I kindly ask you to carefully read my posting and the
    documentation of ConvinceGC before you suggest I am
    a complete idiot?

    The only thing that I can see that may have misled you
    is a different understanding of what a "huge object" is.
    For me this is one which needs more than 50% of the
    allocated memory. If such an object is not needed
    anymore, e.g. after a startup phase of a server,
    the only way I found to really get rid of it *and* free the
    memory for other processes was ConvinceGC. If you have
    a better solution, I would be eager to learn about it.



    >
    > I would recommend reading a bit more about how the JVM you want to use
    > manages its memory. if you use Sun's JVM for instance you can read:
    >
    > http://java.sun.com/docs/hotspot/


    Well, guess where I learned about -XX:MaxHeapFreeRatio.

    Harald.
     
    HK, Jun 14, 2005
    #17
  18. Alan Meyer

    Alan Meyer Guest

    Chris Uppal wrote:
    > Bjorn Borud wrote:
    >
    > > the problem is that we can't really know that based on the information
    > > the OP has posted. if we knew exactly what problem he tried to solve
    > > it would be easier to give recommendations.

    >
    > Agreed.
    >
    > -- chris


    The points about abstraction are well taken. I will indeed
    follow the advice given and design the interfaces to the programs
    so they can run as threads of one JVM, in separate JVMs, or on
    separate machines with no changes to the internals of the
    processing.

    I can't say anything about the specifics of the problem I'm
    working on because it's a commercial project and the customer
    requires confidentiality from the developers.

    Speaking generally, I can say that the batch processes prepare
    collections of documents for publication. There are user
    interactive components to the system, but the basic publication
    process is non-interactive. Publishing is done on a scheduled
    basis. Documents pass through a series of steps to transform
    them to make them usable by end users.

    Thanks.

    Alan
     
    Alan Meyer, Jun 14, 2005
    #18