Dump complete Java VM state as core dump (not via OS) possible?


halfdog

Hi everyone,

I have a problem debugging an application.

Background: Sometimes my application gets into a very unlikely state,
which at the moment results in an error message. The stack trace alone
is of little value, since this state is caused by the interaction of
more than one thread. The state is resolved by throwing an exception,
and the program continues normally.

Goal: If I reach this state, I want to suspend the application and
dump the complete state of all Java threads, objects, ... (a complete
Java memory core dump) to analyse it later.

Question: Is there a possibility to generate such dumps?

Just thinking:

One possibility would be to send a signal from the VM to some external
program (e.g. a UDP packet); this program attaches a standard OS-level
debugger (e.g. gdb under Linux), dumps the core, resumes the app and
detaches. But I guess reconstruction of the internal VM state from a
complete dump is rather hard, isn't it?
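
For illustration, the in-VM side of that idea could look roughly like
this (DumpTrigger, the port and the payload are made up; the external
watchdog that listens for the packet and runs gdb is not shown):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Hypothetical sketch: when the bad state is detected, notify an
// external watchdog via UDP, then pause so the watchdog has time to
// attach gdb and write the core file before the app resumes.
public class DumpTrigger {
    public static void requestDump() {
        try {
            DatagramSocket socket = new DatagramSocket();
            byte[] payload = "DUMP-REQUEST".getBytes("US-ASCII");
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("localhost"), 9999));
            socket.close();
            Thread.sleep(30000); // watchdog attaches gdb meanwhile
        } catch (Exception e) {
            // a debugging hook must never break the application itself
        }
    }
}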

I've looked at jdb to see if it has some java-core-dump functions, but
it seems it does not. Are there alternative implementations, or helper
scripts, that make jdb loop over all Java object addresses and dump
them?

Does anyone know about the jdb remote debugging interface? Is the
protocol public? Could I implement such a feature myself?
 

stixwix

halfdog said:
Goal: If I reach this state, I want to suspend the application and
dump the complete state of all Java threads, objects, ... (a complete
Java memory core dump) to analyse it later.

Question: Is there a possibility to generate such dumps?
If it stays in this state long enough for human intervention, then you
can do kill -3 [process number] under Linux, I think.
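
A related option from inside the VM (so nobody has to be around when
it happens): since Java 5, Thread.getAllStackTraces() produces a
similar thread dump programmatically. A minimal sketch (ThreadDumper
is a made-up helper name):

import java.util.Map;

// Print a stack trace of every live thread, similar in spirit to the
// dump that kill -3 (SIGQUIT) triggers on Linux.
public class ThreadDumper {
    public static void dumpAllThreads() {
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            System.err.println("Thread " + entry.getKey().getName()
                    + " (state = " + entry.getKey().getState() + ")");
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}

Note that this only captures stack traces, not the object state you
are after.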
Andy
 

Razvan

halfdog said:
Goal: If I reach this state, I want to suspend the application and
dump the complete state of all Java threads, objects, ... (a complete
Java memory core dump) to analyse it later.

Just curious: what tools would you use to analyze that?

I have never used very advanced debugging techniques. For me,
"System.out.println()" and Java exceptions are more than enough.
Whenever I had a complex issue I just analyzed the whole algorithm with
extra attention. I may lose several hours just thinking, but in the end
it has always paid off for me. The truth is that I am too lazy to use
more advanced debugging techniques.



Regards,
Razvan

http://www.mihaiu.name/2004/sun_java_scjp_310_035/
 

halfdog

stixwix said:
If it stays in this state long enough for human intervention, then you
can do kill -3 [process number] under Linux, I think.
Andy

There are two problems: the state is not reproducible - it occurs at
any time of day or night. Secondly, the server is contacted by various
XML-RPC clients. If they do not get a response within
[http-client-timeout] (120 sec), they fall out of sync and the
connected clients will report errors. So halting the system for longer
than one minute is out of the question.

(Apart from that: with gdb --pid [processID] you can attach to any
running process, then call "generate-core-file x.core" and "quit", and
you have a core dump to analyse without killing the process.)

Razvan said:
Just curious: what tools would you use to analyze that?

I heard that there is a tool to analyse OS process core dumps of Java
VMs and reconstruct some of the Java object state information. I have
no information on how to use it or how good the data reconstruction
would be.
Razvan said:
I have never used very advanced debugging techniques. For me,
"System.out.println()" and Java exceptions are more than enough.
Whenever I had a complex issue I just analyzed the whole algorithm with
extra attention. I may lose several hours just thinking, but in the end
it has always paid off for me. The truth is that I am too lazy to use
more advanced debugging techniques.

I also used these very frequently until I had some strange debugging
problems to solve:

1. A very rare timing race condition: there was no way to reproduce
the bug; it occurred at a frequency of about 1:1,000,000, though a
test program calling the methods could make them fail when running
long enough:

Problem: Attaching any loggers or System.outs changed the timing of
the program (possibly through thread scheduling when doing IO ops), so
that it was never possible to trigger the error. Removing the logging
output made it reappear.

Solution: I added many otherwise unneeded synchronized{} blocks, so
that the error resulted in a deadlock, which was debuggable.

Now it was a deadlock problem, which is also impossible to debug with
System.out, but with a debugger attached it is possible to see which
threads wait for which monitors. Afterwards you can fix the buggy code.
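
A simplified variant of the same trick (FreezeOnError is a made-up
helper, not my original code): instead of letting the error path
recover, block the detecting thread forever on a private monitor, so
the process hangs in a state that jstack or an attached debugger can
inspect at leisure:

// Call freeze() at the point where the inconsistent state is detected.
public final class FreezeOnError {
    private static final Object FREEZE_LOCK = new Object();

    public static void freeze() {
        synchronized (FREEZE_LOCK) {
            try {
                FREEZE_LOCK.wait(); // never notified: blocks forever
            } catch (InterruptedException ignored) {
                // if interrupted, just let the thread continue
            }
        }
    }
}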
 

Razvan Mihaiu

Just some suggestions:

1. Take the system out of the production environment. I mean,
replicate the system somewhere else. I cannot see you working calmly
and efficiently on a system that cannot be down for more than one
minute.

2. Insert as many "System.out" statements as possible. Sometimes they
can help even in case of deadlocks - for example, you expected a
certain statement to be printed but it wasn't. This could be a good
indication that a deadlock occurred.

I agree - in case of a complex system, this is a nightmare.

3. Create a test application that also parses the log file of your
application. When a certain error occurs, instruct the test
application to stop. Also keep a log file for the test application
itself. You need to know the exact steps that were taken right before
the error.
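
A sketch of such a watcher (a simplistic "tail" in Java; the log file
path comes from the command line, and the "ERROR" marker is just a
placeholder for whatever your application prints):

import java.io.BufferedReader;
import java.io.FileReader;

// Follow the application log and stop the test run when the error
// marker shows up; everything the watcher sees goes to its own output.
public class LogWatcher {
    public static void main(String[] args) throws Exception {
        BufferedReader log = new BufferedReader(new FileReader(args[0]));
        try {
            while (true) {
                String line = log.readLine();
                if (line == null) {
                    Thread.sleep(500); // at EOF: wait for more output
                    continue;
                }
                System.out.println(line); // the watcher's own log
                if (line.indexOf("ERROR") >= 0) {
                    System.out.println("Error marker seen - stopping");
                    break;
                }
            }
        } finally {
            log.close();
        }
    }
}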

Now, it all depends on how much time you can spend on this. On most
bugs you cannot spend too much time - so what I told you might still be
impractical.


Just out of curiosity: how many threads are you speaking about?


http://www.mihaiu.name/2004/sun_java_scjp_310_035/
 

halfdog

Razvan said:
1. Take the system out of the production environment. I mean,
replicate the system somewhere else. I cannot see you working calmly
and efficiently on a system that cannot be down for more than one
minute.

The main difficulty is: the errors are not easily replicable, but may
cause severe trouble in the future. Since I'm a perfectionist, I want
to fix them after a single occurrence. All errors that I could
replicate are already fixed.

I have a test system and test programs which I run in indefinite
loops, but they only produce some of the errors (or do you have a test
program for the cases "client DNS mapping changes during a
transaction" or "HTTPS certificate expires while reading data"?).

The remaining errors have no clear cause, e.g.:

"Resource already in use": the application detects that a resource is
still in use so it cannot open it, and the client call fails (no
exception, ...). As far as I know, there should be no lock on the
resource. Which other thread still holds the lock? Or is there even no
lock owner - perhaps it is just an error in the locking system itself?
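
If the locking were built on java.util.concurrent's ReentrantLock (an
assumption - I don't know whether our locking system could be ported
to it), the error message could at least name the owner. A sketch
(DiagnosticLock is made up):

import java.util.concurrent.locks.ReentrantLock;

// ReentrantLock.getOwner() is protected, so a small subclass can
// expose the owning thread for error messages like the one above.
public class DiagnosticLock extends ReentrantLock {
    public String describe() {
        Thread owner = getOwner();
        return owner == null
                ? "no owner (perhaps a bug in the locking system itself?)"
                : "held by thread " + owner.getName();
    }
}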
Razvan said:
2. Insert as many "System.out" statements as possible. Sometimes they
can help even in case of deadlocks - for example, you expected a
certain statement to be printed but it wasn't. This could be a good
indication that a deadlock occurred.

The difficulty is: one thread can detect that the problem has
occurred, but has nothing to do with causing it (see the
resource-in-use example), so writing log output in this thread does
not make much sense. So all the other threads have to produce the
logging output (they already do; with level=ALL I get about 100 kB/s
of log info on the server when running high-throughput tests, and the
client produces about 800 kB in a 30-second transaction). When I have
a clue about an error I work through it, but it is out of the question
to enable logging on the production site.
Razvan said:
Just out of curiosity: how many threads are you speaking about?

With no load, 20 (timeout checkers, cache optimizers); then one per
client and service (if a client needs 3 services in parallel, which is
normal, it will start 3 threads).
 

Razvan Mihaiu

halfdog said:
The main difficulty is: the errors are not easily replicable, but may
cause severe trouble in the future. Since I'm a perfectionist, I want
to fix them after a single occurrence. All errors that I could
replicate are already fixed.

If you can do that, please post your technique here. I will surely
follow this thread in the future.

Maybe with the tools that you mentioned you can do something like
this. Keep the group informed.
halfdog said:
With no load, 20 (timeout checkers, cache optimizers); then one per
client and service (if a client needs 3 services in parallel, which is
normal, it will start 3 threads).

I have to admit that I have not developed such a complex network
application in Java.



Regards,
Razvan
 

halfdog

I did some more searching and found some methods in java.lang.Runtime:

void traceInstructions(boolean on) - enables/disables tracing of instructions.
void traceMethodCalls(boolean on) - enables/disables tracing of method calls.

After my "bad event", i'll enable these for some seconds to capture at
least some of the other (live) threads, but their data still stays
invisible.
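
Switching them on around the event would look like this (TraceSwitch
is my own helper, not a JDK class):

// Enable both traces for a short window, then switch them off again.
public class TraceSwitch {
    public static void traceFor(long millis) throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        rt.traceMethodCalls(true);
        rt.traceInstructions(true);
        Thread.sleep(millis); // capture a few seconds of activity
        rt.traceInstructions(false);
        rt.traceMethodCalls(false);
    }
}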

It seems that there is no automated way to capture VM state data for
debugging; if someone has one, please let me know!
 

Chris Uppal

halfdog said:
void traceInstructions(boolean on) - enables/disables tracing of instructions.
void traceMethodCalls(boolean on) - enables/disables tracing of method calls.

I don't think either of them does anything.

It seems that there is no automated way to capture VM state data for
debugging; if someone has one, please let me know!

You /might/ be able to find something here

http://java.sun.com/j2se/1.5.0/docs/tooldocs/index.html#manage

In particular, there's a link to a troubleshooting guide:

http://java.sun.com/j2se/1.5/pdf/jdk50_ts_guide.pdf

Lastly, the -Xrunhprof option to the java command has some options
which /might/ be relevant. Try java -Xrunhprof:help for a start, and
then try Google for more explanations.

-- chris
 

halfdog

Wow, thanks, you put me on the right track. Your link pointed to the
J2SE tools, which I had never heard of until now, and there it was:
jsadebugd. The tool can attach to a running Java VM, or to a core dump
of a VM, and present the result via RMI:
$ ps aux | grep java
xxx 21849 0.0 13.1 278344 51012 pts/5 S May10 0:07
/home/xxx/external_data/java/jdk1.5.0_06/bin/java
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.util.logging.config.file=/home/xxx/var/tomcat/devel/conf/logging.properties
-Xdebug
-Xrunjdwp:transport=dt_socket,address=localhost:33333,server=y,suspend=n
-Djava.endorsed.dirs=/home/xxx/external_data/java/apache-tomcat-5.5.12/common/endorsed
-classpath
:/home/fiedler/external_data/java/apache-tomcat-5.5.12/bin/bootstrap.jar:/home/xxx/external_data/java/apache-tomcat-5.5.12/bin/commons-logging-api.jar
-Dcatalina.base=/home/xxx/var/tomcat/devel
-Dcatalina.home=/home/xxx/external_data/java/apache-tomcat-5.5.12
-Djava.io.tmpdir=/home/xxx/var/tomcat/devel/temp
org.apache.catalina.startup.Bootstrap start

$ gdb --pid 21849
Attaching to process 21849
(gdb) generate-core-file java.core
Saved corefile java.core
(gdb) quit

$ jsadebugd /home/xxx/external_data/java/jdk1.5.0_06/bin/java java.core DebugServer
Attaching to core java.core from executable
/home/xxx/external_data/java/jdk1.5.0_06/bin/java and starting RMI
services, please wait...
Debugger attached and RMI services started.

## Now open another console, print a stack trace
$ jstack DebugServer@localhost
Attaching to remote server DebugServer@localhost, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 1.5.0_06-b05
Thread t@ 22418: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.Object.wait() @bci=2, line=474 (Interpreted frame)
- xxxxxxxxxxxx.GenericCallDispatcher$CallDispatcherWorkerThread.run()
@bci=12, line=388 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=595 (Interpreted frame)

... and so on


Currently I'm looking for the RMI interface specification of
jsadebugd, to see if there are more methods for inspection available
out there. jdb and this server seem to be incompatible at the moment,
so one small step is still missing.

PS: Bad luck, Windows users: it seems that this ultra-chaotic,
scratch-your-nose-with-your-feet geek tool is only available for Linux.
 

Chris Uppal

halfdog said:
PS: Bad luck, Windows users: it seems that this ultra-chaotic,
scratch-your-nose-with-your-feet geek tool is only available for Linux.

Thanks for the follow-up. It does sound like a rather, um, baroque way
of getting at the data you need. But if it works, it works....

-- chris
 

halfdog

With the help of a guy from JavaSoft I managed to do it all: debug a
dump of a Java VM:

$ cat > gdb.commands <<END
gcore tomcat.core
detach
quit
END
$ gdb --pid 26003 -x gdb.commands
$ jsadebugd ..../jdk1.5.0_06/bin/java tomcat.core DebugServer &
$ jdb -connect
sun.jvm.hotspot.jdi.SADebugServerAttachingConnector:debugServerName=DebugServer@localhost

You can inspect all the data (threads, stack frames, local variables,
objects) just as if you had attached to a live VM; only modifications
(step, continue, set, ...) do not work, because a VM core dump is dead.

This is a really strange way to debug a Java application; I never
thought that it could work that way.
 
