horribly impossible debugging task

Ara.T.Howard

i've got 30 processes running on 30 machines, running jobs taken from an nfs mounted
queue. recently i started seeing random core dumps from them. i've isolated
the bit of code that causes the core dumps - it's this

class JobRunner
#{{{
  attr :job
  attr :jid
  attr :cid
  attr :shell
  attr :command
  def initialize job
  #{{{
    @job     = job
    @jid     = job['jid']
    @command = job['command']
    @shell   = job['shell'] || 'bash'
    @r,@w = IO.pipe
    @cid =
      Util::fork do
        @w.close
        STDIN.reopen @r

        if $want_to_core_dump
          keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
          256.times do |fd|
            next if keep.include? fd
            begin
              IO::new(fd).close
            rescue Errno::EINVAL, Errno::EBADF
            end
          end
        end

        if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
          exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
        else
          exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
        end
      end
    @r.close
  #}}}
  end
  def run
  #{{{
    @w.puts @command
    @w.close
  #}}}
  end
#}}}
end


now here's the tricky bit. the core dump doesn't happen here - it happens at
some random time later, and then again sometimes it doesn't. the context this
code executes in is complex, but here's the gist of it:


  - a sqlite database transaction is started - this opens some files like
    db-journal, etc.

  - a job is selected from the database

  - the job runner is forked - this closes open files except stdin, stdout,
    stderr, and the comm pipe

  - the job pid and other accounting is committed to the database
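
roughly, in code - this is a sketch only: JobRunner is the class above, but
the sqlite3 calls and the jobs table here are illustrative, not the actual rq
implementation:

require 'sqlite3'

db = SQLite3::Database.new 'q.db'

db.transaction do                        # sqlite opens db-journal, etc.
  row = db.get_first_row "select jid, command from jobs where state = 'pending' limit 1"
  job = { 'jid' => row[0], 'command' => row[1] }

  runner = JobRunner.new job             # the fork happens here, mid-transaction,
  runner.run                             # with sqlite's files still open

  db.execute "update jobs set state = 'running', pid = ? where jid = ?",
             runner.cid, job['jid']
end                                      # commit - the journal gets unlinked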


the reason i'm trying to close all the files in the first place is that the
parent eventually unlinks some of them while the child still has them open -
this causes nfs sillynames (.nfsxxxxxxxxx files) to appear when running on nfs.
this causes no harm, since the child never uses these fds - but with 30 machines
i end up with 90 or more .nfsxxxxxxx files lying around looking ugly. they
eventually go away when the child exits, but some of these children run for 4
or 5 or 10 days, so the ugliness is constantly in my face - sometimes growing
to be quite large.
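
fwiw, marking the descriptors close-on-exec in the child (in place of the
close loop above) would look something like this - a sketch only, not what the
code above actually does - so the kernel drops them at exec time without going
through IO#close at all:

require 'fcntl'

keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
256.times do |fd|
  next if keep.include? fd
  begin
    io = IO::new fd
    io.fcntl Fcntl::F_SETFD, io.fcntl(Fcntl::F_GETFD) | Fcntl::FD_CLOEXEC
  rescue Errno::EINVAL, Errno::EBADF
  end
end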

back to the core dump...

basically if i DO close all the filehandles i'll, maybe, core dump sometime
later IN THE PARENT. if i do NOT close them the parent never core dumps. the
core dumps are totally random and show nothing in common except one thing -
they all show a signal received in the stack trace - i'm guessing this is
SIGCHLD. i have some signal handlers set up for stopping/restarting that look
exactly like this:


trap('SIGHUP') do
  $signaled = $sighup = true
  warn{ "signal <SIGHUP>" }
end
trap('SIGTERM') do
  $signaled = $sigterm = true
  warn{ "signal <SIGTERM>" }
end
trap('SIGINT') do
  $signaled = $sigint = true
  warn{ "signal <SIGINT>" }
end

in my event loop i obviously take appropriate steps for the $sigXXX.
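
(schematically the loop does something like this - heavily simplified, and the
helper names are made up:)

loop do
  if $signaled
    if $sigterm || $sigint
      shutdown                    # made-up helper
      break
    end
    restart if $sighup            # made-up helper
    $signaled = $sighup = $sigterm = $sigint = nil
  end
  process_pending_jobs            # made-up helper
  sleep 1
end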

as i said, however, i don't think these are responsible since they don't
actually get run as these signals are not being sent. i DO fork for every job
though so that's why i'm guessing the signal is SIGCHLD.

so - here's the question: what kind of badness could closing fds be causing
in the PARENT? i'm utterly confused at this point and don't really know
where to look next... could this be a ruby bug, or am i just breaking some
unix law and getting bitten?

thanks for any advice.

kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Markus

Ara --

Random thoughts:

    * It could be a race condition of some sort
    * It could be that closing the file in the child closes it for the
      parent even though closing it for the parent does not close it
      for the child
    * It could be that you omitted a file from your keep list that the
      child actually needs. It tries to access it, goes boom,...
    * Can you make it happen in a simplified situation (e.g. one
      child, etc.)?
    * Is it possible to make nfs put the ugly files somewhere you
      can't see them? I know much of the software I run has lots of
      ugly files (e.g. the web browser cache), but they don't bother
      me because I don't look at them.
    * Instead of specifying the files you want to keep (STDIN, etc)
      could you list the ones you want to close, and narrow the
      problem down that way?

I don't know if any of these will help, but I can't see that they
could hurt (I used to say that "ideas can't hurt you" but I'm older
now).

-- MarkusQ



Ara.T.Howard

Ara --

Random thoughts:
    * It could be a race condition of some sort

yes - perhaps even in some library code i'm exercising - this is my current best
guess.

    * It could be that closing the file in the child closes it for the
      parent even though closing it for the parent does not close it
      for the child

hmmm - not that one:

harp:~ > ruby -e'f = open "f","w";fork{ f.close };Process.wait;f.puts 42'
harp:~ > cat f
42

    * It could be that you omitted a file from your keep list that the
      child actually needs. It tries to access it, goes boom,...

i do an exec of bash immediately after, so i think that's out - bash cannot
possibly require anything ruby or sqlite has open other than stdin, stdout,
and stderr.

    * Can you make it happen in a simplified situation (e.g. one
      child, etc.)?

yes. but not predictably either. it can run for days, or minutes.
unfortunately (for debugging) it's usually about 3 days before a core dump -
difficult to work with...
    * Is it possible to make nfs put the ugly files somewhere you
      can't see them? I know much of the software I run has lots of
      ugly files (e.g. the web browser cache), but they don't bother
      me because I don't look at them.

i handle that this way now:

def sillyclean dir = @dirname
#{{{
  glob = File.join dir, '.nfs*'
  orgsilly = Dir[glob]
  yield
  newsilly = Dir[glob]
  silly = newsilly - orgsilly
  silly.each{|path| FileUtils::rm_rf path}
#}}}
end

this code wraps ONLY the transaction/fork code. it is safe because i know any
silly file left over from a transaction was created due to sqlite not
setting close-on-exec on its tmp files. plus, removing a silly file cannot
hurt because they spring back into existence (by definition) if someone
actually still needs them. so, if the remove succeeds, no-one was actually
using them. this is indeed what happens - they are removed, never to return.
i just hate this sort of thing.
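
(usage is roughly this - the helpers named here are made up, and the real
transaction/fork code is more involved:)

sillyclean do
  db.transaction do                      # sqlite opens db-journal here
    job = next_pending_job db            # made-up helper
    runner = JobRunner.new job
    runner.run
    record_started db, job, runner.cid   # made-up helper
  end                                    # commit unlinks the journal -> possible .nfsXXXX
end                                      # any .nfsXXXX born inside the block is rm_rf'd here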

    * Instead of specifying the files you want to keep (STDIN, etc)
      could you list the ones you want to close, and narrow the
      problem down that way?

yes - i'm working on that. the problem is that i actually KNOW the filename
that gets unlinked and causes the sillyname - it's the 'db-journal' file (i
can see a .nfsXXXX file come into existence with its exact contents). the
problem is that the sqlite api opens this file and i have no file handle on
it. problem two is that ruby does not provide a way to get at this info that
i know of. you could

256.times do |fd|
  begin
    file = IO::new fd
    File::unlink file.path if file.path =~ %r/db-journal/o
  rescue Errno::EBADF, Errno::EINVAL
  end
end

__except__ that IO objects created this way do not have a path! (nor
respond_to?('path') for that matter) - at least on my ruby. i'm not sure if
this is a bug or not...
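
(on a pure linux setup one way around that is to skip ruby's IO#path entirely
and ask /proc which path each fd points at - a linux-specific sketch:)

Dir['/proc/self/fd/*'].each do |link|
  fd = File::basename(link).to_i
  begin
    path = File::readlink link
    IO::new(fd).close if path =~ %r/db-journal/o   # or unlink it, etc.
  rescue Errno::ENOENT, Errno::EBADF, Errno::EINVAL
  end
end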
    I don't know if any of these will help, but I can't see that they
    could hurt (I used to say that "ideas can't hurt you" but I'm older
    now).

funny. yeah - anything helps - i'm grasping at straws!

cheers.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Ara.T.Howard

At Fri, 17 Sep 2004 14:46:07 +0900, Ruben wrote:

    Sounds like the same thing valgrind does (for free). It might be
    interesting to try valgrind on this, if it's a memory related bug. The
    downside is that running the code through valgrind will give you a
    slowdown with a factor 30 to 60 (from personal experience). So, not
    really an option if the bug only shows up after a couple of days...

    Ruben

actually both are options, since the code in question simply manages a queue of
jobs and the cost is about 1/1000th of the actual work. i've used valgrind and
purify before with some success. i had a really hard to track down bug about
a year ago and ended up needing valgrind, purify, and dmalloc to track it
down. these are good suggestions, as i'd forgotten about them. it'll be
pretty tough to set up but possible.

this is getting a bit OT now so any responders should probably ping me offline
unless anyone has anything specific to ruby regarding closing all file
descriptors after a fork and related bugs.

kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Nathan Weston

Clifford Heath said:

    The world would be a better place if every developer used Purify on
    every release.

Also worth checking out (for linux/x86 only, I believe) is valgrind.
It also does very good memory/bounds checking, and is free.

Nathan
 
Lothar Scholz

Hello Ruben,


R> Sounds like the same thing valgrind does (for free). It might be
R> interesting to try valgrind on this, if it's a memory related bug. The
R> downside is that running the code through valgrind will give you a
R> slowdown with a factor 30 to 60 (from personal experience). So, not

And now you see the difference between good working (and expensive)
commercial tools and freeware tools like valgrind.

But to be honest you make valgrind worse than it is. The slowdown
should be a factor of 10 to 20.
 
Ruben

Lothar,

    And now you see the difference between good working (and expensive)
    commercial tools and freeware tools like valgrind.

I've heard before that Purify is good, but I don't have any experience
with it myself, and it might not be an option for everyone because of
the cost.

(besides, I don't think that commercial tools are necessarily bad and
free tools are necessarily good, or the other way around...)

    But to be honest you make valgrind worse than it is. The slowdown
    should be a factor of 10 to 20.

Ah.. that's probably because I used 'callgrind' recently, which is also
a skin for valgrind and probably more expensive than the memcheck
skin. I guess it also depends on the kind of code that's run.

Ruben
 
nobu.nokada

Hi,

At Fri, 17 Sep 2004 03:54:52 +0900,
Ara.T.Howard wrote in [ruby-talk:112814]:
    @cid =
      Util::fork do

    trap('SIGHUP') do
      $signaled = $sighup = true
      warn{ "signal <SIGHUP>" }

What are these, "Util::fork" and "warn" with block?
 
Ara.T.Howard

    Hi,

    At Fri, 17 Sep 2004 03:54:52 +0900,
    Ara.T.Howard wrote in [ruby-talk:112814]:

        @cid =
          Util::fork do

        trap('SIGHUP') do
          $signaled = $sighup = true
          warn{ "signal <SIGHUP>" }

    What are these, "Util::fork" and "warn" with block?

Util::fork is simply a 'quiet' fork:

module Util
#{{{
  class << self
    def export sym
    #{{{
      sym = "#{ sym }".intern
      module_function sym
      public sym
    #}}}
    end
    def append_features c
    #{{{
      super
      c.extend Util
    #}}}
    end
  end

  ...

  def fork(*a, &b)
  #{{{
    begin
      verbose = $VERBOSE
      $VERBOSE = nil
      Process::fork(*a, &b)
    ensure
      $VERBOSE = verbose
    end
  #}}}
  end
  export 'fork'

  ...

#}}}
end


warn with block delegates to a Logger object:

class Main
#{{{

  ...

  %w( debug info warn error fatal ).each do |m|
    eval "def #{ m }(*a,&b);@logger.#{ m }(*a,&b);end"
  end

  ...

#}}}
end


regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Ara.T.Howard

    I could be way off here, but are you opening your SQLite database over NFS?

oh yeah - definitely, from many machines at once! ;-)

    I think this can often lead to problems due to the locking not working, so
    maybe something is going wrong inside the sqlite library code?

the locking is fcntl based - so it's nfs safe on any decent (not sun) nfs
implementation. ours is pure linux on both server and client nodes.

    You might want to look at section 7 on http://www.sqlite.org/faq.html.

i have. ;-)

essentially i am not relying on sqlite's locking exclusively: my code has an
additional 'lock file' (an empty file to which i apply nfs safe locks - see my
posixlock module on the raa) which i use to ensure single writer / multiple
reader semantics on a __file__ level (sqlite guarantees this on a
__byte_range__ level). in addition i am using an nfs safe lockfile class (my
lockfile package on the raa) to assist with certain touchy operations. in
summary, i am manually coordinating access to the database in a way that is
safe and transactionally protected. the access is logically this:

  - acquire separate lock of read or write type
  - open database
  - begin a transaction
  - execute sql
  - end transaction
  - close database
  - release separate lock of read or write type

this is wrapped with code that autodetects and recovers from several
potential errors, such as a failed lockd server or failed io operations.
although i can force these to happen and my code handles them, i have never
actually seen it happen in practice.
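
schematically the wrapper looks like this - a sketch only, with File#flock
standing in for the posixlock/lockfile machinery and the sqlite3 api standing
in for the real binding:

require 'sqlite3'

def with_queue qpath, mode = :read
  File::open(File::join(qpath, 'lock'), 'r+') do |lock|
    lock.flock(mode == :write ? File::LOCK_EX : File::LOCK_SH)   # acquire lock
    db = SQLite3::Database.new File::join(qpath, 'db')           # open database
    begin
      db.transaction{|t| yield t}                                # begin/sql/end
    ensure
      db.close                                                   # close database
    end
  end                                                            # release lock
end

# e.g.  with_queue('/nfs/q', :write){|db| db.execute "update jobs set ..."}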

the code in question is a system that allows scientists to configure a linux
cluster to work on a huge stack of work in under a minute with zero sysad
intervention. at this point we've run about 3 million jobs through the system
without incident in the face of two power outages, dozens of reboots, and
steady extreme (load > 30) nfs load.

here's a shot of one of our clusters now:

yacht:~/shared > rq queue status
---
pending : 5875
running : 36
finished : 1108
dead : 0

yacht:~/shared > rq queue list running | head -20
---
-
jid: 1324
priority: 0
state: running
submitted: 2004-09-20 09:16:39.449169
started: 2004-09-22 03:55:24.914682
finished:
elapsed:
submitter: jib.ngdc.noaa.gov
runner: redfish.ngdc.noaa.gov
pid: 11519
exit_status:
command: /dmsp/moby-1-1/cfadmin/shared/jobs/wavgjob /dmsp/moby-1-1/conf/avg_dn/filelists/F142000.included F142000.cloud2.light1.tile8 /dmsp/moby-1-1/conf/avg_dn/cloud2.light1.tile8.conf cfd2://cfd2-3/F142000/
-
jid: 1325
priority: 0
state: running
submitted: 2004-09-20 09:16:39.449169
started: 2004-09-22 04:12:32.758249


this stack of work will take about a week to complete using 18 nodes.



from the man page of the main commandline program 'rq':


NAME
rq v0.1.2

SYNOPSIS
rq [queue] mode [mode_args]* [options]*

DESCRIPTION
rq is an __experimental__ tool used to manage nfs mounted work
queues. multiple instances of rq on multiple hosts can work from
these queues to distribute processing load to 'n' nodes - bringing many dozens
of otherwise powerful cpus to their knees with a single blow. clearly this
software should be kept out of the hands of radicals, SETI enthusiasts, and
one mr. jeff safran.

rq operates in one of the modes create, submit, feed, list, delete,
query, or help. depending on the mode of operation and the options used, the
meaning of mode_args may change, sometimes wildly and unpredictably (i jest, of
course).


MODES

modes may be abbreviated to uniqueness, therefore the following shortcuts
apply :

c => create
s => submit
f => feed
l => list
d => delete
q => query
h => help

create, c :

creates a queue. the queue MUST be located on an nfs mounted file system
visible from all nodes intended to run jobs from it.

examples :

0) to create a queue
~ > rq q create
or simply
~ > rq q c

list, l :

show combinations of pending, running, dead, or finished jobs. for this
command mode_args must be one of pending, running, dead, finished, or all.
the default is all.

mode_args may be abbreviated to uniqueness, therefore the following
shortcuts apply :

p => pending
r => running
f => finished
d => dead
a => all

examples :

0) show everything in q
~ > rq q list all
or
~ > rq q l all
or
~ > export RQ_Q=q
~ > rq l

0) show q's pending jobs
~ > rq q list pending

1) show q's running jobs
~ > rq q list running

2) show q's finished jobs
~ > rq q list finished


submit, s :

submit jobs to a queue to be processed by any feeding node. any mode_args
are taken as the command to run. note that mode_args are subject to shell
expansion - if you don't understand what this means do not use this feature.

when running in submit mode a file may be specified as a list of commands to
run using the '--infile, -i' option. this file is taken to be a newline
separated list of commands to submit; blank lines and comments (#) are
allowed. if submitting a large number of jobs the input file method is MUCH
more efficient. if no commands are specified on the command line rq
automatically reads them from STDIN. yaml formatted files are also allowed
as input (http://www.yaml.org/) - note that the output of nearly all rq
commands is valid yaml and may, therefore, be piped as input into the submit
command.

the '--priority, -p' option can be used here to determine the priority of
jobs. priorities may be any number in [0, 9]; therefore 9 is the maximum
priority. submitting a high priority job will NOT supplant currently
running low priority jobs, but higher priority jobs will always migrate
above lower priority jobs in the queue in order that they be run sooner.
note that constant submission of high priority jobs may create a starvation
situation whereby low priority jobs are never allowed to run. avoiding this
situation is the responsibility of the user.

examples :

0) submit the job ls to run on some feeding host

~ > rq q s ls

1) submit the job ls to run on some feeding host, at priority 9

~ > rq -p9 q s ls

2) submit 42000 jobs (quietly) to run from a command file.

~ > wc -l cmdfile
42000
~ > rq q s -q < cmdfile

3) submit 42 jobs to run at priority 9 from a command file.

~ > wc -l cmdfile
42
~ > rq -p9 q s < cmdfile

4) re-submit all finished jobs

~ > rq q l f | rq q s


feed, f :

take jobs from the queue and run them on behalf of the submitter. jobs are
taken from the queue in an 'oldest highest priority' order.

feeders can be run from any number of nodes allowing you to harness the CPU
power of many nodes simultaneously in order to more effectively clobber
your network.

the most useful method of feeding from a queue is to do so in daemon mode, so
that the process loses its controlling terminal and will not exit when
you exit your terminal session. use the '--daemon, -d' option to accomplish
this. by default only one feeding process per host per queue is allowed to
run at any given moment. because of this it is acceptable to start a feeder
at some regular interval from a cron entry since, if a feeder is already
running, the process will simply exit, and otherwise a new feeder will be
started. in this way you may keep a feeder process running even across
machine reboots.


examples :

0) feed from a queue verbosely for debugging purposes, using a minimum and
maximum polling time of 2 and 4 respectively

~ > rq q feed -v4 -m2 -M4

1) feed from a queue in daemon mode logging into /home/ahoward/rq.log

~ > rq q feed -d -l/home/ahoward/rq.log

2) use something like this sample crontab entry to keep a feeder running
forever (it attempts to (re)start every fifteen minutes)

#
# your crontab file
#

*/15 * * * * /full/path/to/bin/rq /full/path/to/nfs/mounted/q f -d -l/home/user/rq.log

log rolling while running in daemon mode is automatic.


delete, d :

delete combinations of pending, running, finished, dead, or specific jobs.
the delete mode is capable of parsing the output of list mode, making it
possible to create filters to delete jobs meeting very specific conditions.

mode_args are the same as for 'list', including 'running'. note that it is
possible to 'delete' a running job, but there is no way to actually STOP it
mid execution, since the node doing the deleting has no way to communicate
this information to the (possibly) remote execution host. therefore you
should use the 'delete running' feature with care and only for housekeeping
purposes or to prevent future jobs from being scheduled.

examples :

0) delete all pending, running, and finished jobs from a queue

~ > rq q d all

1) delete all pending jobs from a queue

~ > rq q d p

2) delete all finished jobs from a queue

~ > rq q d f

3) delete jobs via hand crafted filter program

~ > rq q list | filter_prog | rq q d

query, q :

query exposes the database more directly to the user, evaluating the where
clause specified on the command line (or from STDIN). this feature can be
used to make a fine grained selection of jobs for reporting or as input into
the delete command. you must have a basic understanding of SQL syntax to
use this feature, but it is fairly intuitive in this capacity.

examples:

0) show all jobs submitted within a specific 10 minute range

~ > rq q query "started >= '2004-06-29 22:51:00' and started < '2004-06-29 22:51:10'"

1) shell quoting can be tricky here so input on STDIN is also allowed

~ > cat contraints
started >= '2004-06-29 22:51:00' and
started < '2004-06-29 22:51:10'

~ > rq q query < contraints
or (same thing)

~ > cat contraints | rq q query

2) this query output may then be used to delete specific jobs

~ > cat contraints | rq q query | rq q d

3) show all jobs which are either finished or dead

~ > rq q q state=finished or state=dead


NOTES
- realize that your job is going to be running on a remote host and this has
implications. paths, for example, should be absolute, not relative.
specifically, the submitted job must be visible from all hosts currently
feeding from a q.

- you need to consider __CAREFULLY__ what the ramifications of having multiple
instances of your program all running at the same time will be. it is
beyond the scope of rq to ensure multiple instances of a program
will not overwrite each other's output files, for instance. coordination of
programs is left entirely to the user.

- the list of finished jobs will grow without bound unless you sometimes
delete some (all) of them. the reason for this is that rq cannot
know when the user has collected the exit_status, etc. from a job and so
keeps this information in the queue until instructed to delete it.

- if you are using the crontab feature to maintain an immortal feeder on a
host then that feeder will be running in the environment provided by cron.
this is NOT the same environment found in a login shell and you may be
surprised at the range of commands which do not function. if you want
submitted jobs to behave as closely as possible to their behaviour when
typed interactively you'll need to wrap each job in a shell script that
looks like the following:

#!/bin/bash --login
commands_for_your_job

and submit that script


ENVIRONMENT
RQ_Q: full path to queue

the queue argument to all commands may be omitted if, and only if, the
environment variable 'RQ_Q' contains the full path to the q. eg.

~ > export RQ_Q=/full/path/to/my/q

this feature can save a considerable amount of typing for those weak of wrist


DIAGNOSTICS
success => $? == 0
failure => $? != 0


AUTHOR
(e-mail address removed)


BUGS
1 < bugno && bugno <= 42


OPTIONS


-f, --feed=appetite
-p, --priority=priority
--name
-d, --daemon
-q, --quiet
-e, --select
-i, --infile=infile
-M, --max_sleep=seconds
-m, --min_sleep=seconds
-l, --log=path
-v=0-4|debug|info|warn|error|fatal
--verbosity
--log_age=log_age
--log_size=log_size
-c, --config=path
--template=template
-h, --help


so far it looks like the solution to my problem was to close the database after
forking (if it was open), but i'm still testing this approach.
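
(i.e. something along these lines on the child side of the fork - '@db' here is
a hypothetical handle on whatever connection was open at fork time, not the
exact code:)

@cid =
  Util::fork do
    @db.close if @db and not @db.closed?   # drop the inherited sqlite state
    @w.close
    STDIN.reopen @r
    exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
  end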

kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Ara.T.Howard

    Ah...yeah, I suspected I was just stating the obvious :)

better to assume nothing when debugging though - i AM grasping at straws, so
i'm overlooking nothing. i went back and re-read the docs at your suggestion
- now i'm re-reading the sqlite_close code.

    Good luck with the solution, though.

luck would be nice.

regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
Ara.T.Howard

    Just had one other suggestion (hopefully more useful than the last :)

    Could you separate out the db-related code into a little 'proxy' app, to run
    on the same machine as where the db files are, and have your clients connect
    to it (to read the job, submit the pid etc)? It might help solve any
    potential locking hassles (if that's even the problem), since the only thing
    touching the database would be local. And hey, if nothing else, it could be
    interesting to find out which side of the code coredumps :)

i'm now looking at using detach.rb, which creates a drb object out of any
existing object. basically it would be a little servlet for the daemon's use
only. i think this may be the way to go. thanks for the idea.
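
(the shape of the idea in plain drb, which detach.rb automates - 'QueueProxy'
and the uri here are made up for illustration:)

require 'drb'

class QueueProxy
  def initialize dbpath
    @dbpath = dbpath
  end
  def next_job        # runs locally, right next to the db files
    # select a pending job, mark it running, return it ...
  end
  def finish jid, exit_status
    # record accounting ...
  end
end

# on the host that owns the db files:
#   DRb.start_service 'druby://dbhost:9999', QueueProxy.new('/path/to/q/db')
#   DRb.thread.join
#
# on each feeder node:
#   q = DRbObject.new nil, 'druby://dbhost:9999'
#   job = q.next_job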

regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 
