horribly impossible debugging task

Discussion in 'Ruby' started by Ara.T.Howard, Sep 16, 2004.

  1. Ara.T.Howard

    Ara.T.Howard Guest

    i've got 30 processes on 30 machines running jobs taken from an nfs mounted
    queue. recently i started seeing random core dumps from them. i've isolated
    the bit of code that causes the core dumps to occur - it's this:

    class JobRunner
    #{{{
      attr :job
      attr :jid
      attr :cid
      attr :shell
      attr :command

      def initialize job
      #{{{
        @job = job
        @jid = job['jid']
        @command = job['command']
        @shell = job['shell'] || 'bash'
        @r,@w = IO.pipe
        @cid =
          Util::fork do
            @w.close
            STDIN.reopen @r

            if $want_to_core_dump
              # close every inherited descriptor except stdin/stdout/stderr
              # and the read end of the com pipe
              keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
              256.times do |fd|
                next if keep.include? fd
                begin
                  IO::new(fd).close
                rescue Errno::EINVAL, Errno::EBADF
                end
              end
            end

            # exec a login shell which reads the job's command from the pipe
            if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
              exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
            else
              exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
            end
          end
        @r.close
      #}}}
      end

      def run
      #{{{
        @w.puts @command
        @w.close
      #}}}
      end
    #}}}
    end
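
    for orientation, a hypothetical driver for this class might look like the
    following (the job hash keys match what initialize reads; the values and the
    waitpid call are illustration only, not the daemon's real event loop):

    job = { 'jid' => 42, 'command' => 'ls -l /tmp', 'shell' => 'bash' }
    runner = JobRunner.new job                    # forks and execs the login shell
    runner.run                                    # feed the command down the pipe
    pid, status = Process::waitpid2 runner.cid    # reap the shell when the job ends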


    now here's the tricky bit. the core dump doesn't happen here - it happens at
    some random time later, and then again sometimes it doesn't. the context this
    code executes in is complex, but here's the gist of it:


    sqlite database transaction started - this opens some files like db-journal,
    etc.

    a job is selected from database

    fork job runner - this closes open files except stdin, stdout, stderr, and
    com pipe

    the job pid and other accounting is committed to database


    the reason i'm trying to close all the files in the first place is that the
    parent eventually unlinks some of them while the child still has them open -
    this causes nfs silly renames to appear when running on nfs (.nfsxxxxxxxxx files).
    this causes no harm, as the child never uses these fds - but with 30 machines
    i end up with 90 or more .nfsxxxxxxx files lying around looking ugly. these
    eventually go away when the child exits, but some of these children run for 4
    or 5 or 10 days so the ugliness is constantly in my face - sometimes growing
    to be quite large.
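
    (for anyone unfamiliar with the effect, here's a hypothetical demonstration -
    not from the original post - which assumes the current directory sits on an
    nfs mount:

    f = open 'scratch', 'w'
    fork{ exec 'sleep', '60' }      # the exec'd sleep inherits the open descriptor
    f.close
    File::unlink 'scratch'          # unlinked while another local process holds it open
    puts Dir['.nfs*'].inspect       # a .nfsXXXXXXXX entry lingers until sleep exits

    the child never touches the descriptor, yet the nfs client keeps the silly
    name alive until the last holder goes away.)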

    back to the core dump...

    basically if i DO close all the filehandles i'll, maybe, core dump sometime
    later IN THE PARENT. if i do NOT close them the parent never core dumps. the
    core dumps are totally random and show nothing in common except one thing -
    they all show a signal received in the stack trace - i'm guessing this is
    SIGCHLD. i have some signal handlers set up for stopping/restarting that look
    exactly like this:


    trap('SIGHUP') do
      $signaled = $sighup = true
      warn{ "signal <SIGHUP>" }
    end
    trap('SIGTERM') do
      $signaled = $sigterm = true
      warn{ "signal <SIGTERM>" }
    end
    trap('SIGINT') do
      $signaled = $sigint = true
      warn{ "signal <SIGINT>" }
    end

    in my event loop i obviously take appropriate steps for the $sigXXX.

    as i said, however, i don't think these are responsible since they don't
    actually get run as these signals are not being sent. i DO fork for every job
    though so that's why i'm guessing the signal is SIGCHLD.
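
    one cheap way to test the SIGCHLD guess might be to trap it explicitly
    alongside the handlers above and log each delivery; if the logged deliveries
    line up with the later core dumps, that supports the theory. a minimal sketch
    in the same style (warn here is the logger-backed helper used above):

    trap('SIGCHLD') do
      $sigchld = true
      warn{ "signal <SIGCHLD>" }
    end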

    so - here's the question: what kind of badness could closing fds be causing
    in the PARENT? i'm utterly confused at this point and don't really know
    where to look next... could this be a ruby bug or am i just breaking some
    unix law and getting bitten?

    thanks for any advice.

    kind regards.

    -a
    --
    ===============================================================================
    | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
    | PHONE :: 303.497.6469
    | A flower falls, even though we love it;
    | and a weed grows, even though we do not love it.
    | --Dogen
    ===============================================================================
     
    Ara.T.Howard, Sep 16, 2004
    #1

  2. Ara.T.Howard

    Markus Guest

    Ara --

    Random thoughts:

    * It could be a race condition of some sort
    * It could be that closing the file in the child closes it for the
    parent even though closing it for the parent does not close it
    for the child
    * It could be that you omitted a file from your keep list that the
    child actually needs. It tries to access it, goes boom,...
    * can you make it happen in a simplified situation (e.g. one
    child, etc.)
    * is it possible to make nfs put the ugly files somewhere you
    can't see them? I know much of the software I run has lots of
    ugly files (e.g. the web browser cache), but they don't bother
    me because I don't look at them.
    * Instead of specifying the files you want to keep (STDIN, etc)
    could you list the ones you want to close, and narrow the
    problem down that way?

    I don't know if any of these will help, but I can't see that they
    could hurt (I used to say that "ideas can't hurt you" but I'm older
    now).

    -- MarkusQ



    On Thu, 2004-09-16 at 11:54, Ara.T.Howard wrote:
    > i've got 30 processes on 30 machines running jobs taken from an nfs mounted
    > queue. recently i started seeing random core dumps from them. i've isolated
    > the bit of code that causes the core dumps to occur...
    > [...]
     
    Markus, Sep 16, 2004
    #2

  3. Ara.T.Howard

    Guest

    On Fri, 17 Sep 2004, Markus wrote:

    > Ara --
    >
    > Random thoughts:
    >


    > * It could be a race condition of some sort


    yes - perhaps even in some library code i'm exercising - this is my current
    best guess.

    > * It could be that closing the file in the child closes it for the
    > parent even though closing it for the parent does not close it
    > for the child


    hmmm - not that one:

    harp:~ > ruby -e'f = open "f","w";fork{ f.close };Process.wait;f.puts 42'
    harp:~ > cat f
    42


    > * It could be that you omitted a file from your keep list that the
    > child actually needs. It tries to access it, goes boom,...


    i do an exec of bash immediately after so i think that's out, since bash cannot
    possibly require anything ruby or sqlite has open other than stdin, stdout,
    and stderr.

    > * can you make it happen in a simplified situation (e.g. one
    > child, etc.)


    yes. but not predictably either. it can run for days, or minutes.
    unfortunately (for debugging) it usually takes about 3 days before a core dump -
    difficult to work with...

    > * is it possible to make nfs put the ugly files somewhere you
    > can't see them? I know much of the software I run has lots of
    > ugly files (e.g. the web browser cache), but they don't bother
    > me because I don't look at them.


    i handle that this way now:

    def sillyclean dir = @dirname
    #{{{
      glob = File.join dir, '.nfs*'
      orgsilly = Dir[glob]
      yield
      newsilly = Dir[glob]
      silly = newsilly - orgsilly
      silly.each{|path| FileUtils::rm_rf path}
    #}}}
    end

    this code wraps ONLY the transaction/fork code. it is safe because i know any
    silly file left over from a transaction was created due to sqlite not
    setting close-on-exec on its tmp files. plus, removing a silly file cannot
    hurt because they spring back into existence (by definition) if someone
    actually still needs them. so, if the remove succeeds then no-one was actually
    using them. this is indeed what happens - they are removed never to return.
    i just hate this sort of thing.
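
    a hypothetical call site for sillyclean, just to show the shape of the
    wrapping - the transaction and job-selection names below are made up, the
    real ones live in the daemon:

    sillyclean do
      transaction do                  # hypothetical transaction helper
        job = next_pending_job        # hypothetical - select a job from the db
        JobRunner.new(job).run
      end
    end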


    > * Instead of specifying the files you want to keep (STDIN, etc)
    > could you list the ones you want to close, and narrow the
    > problem down that way?


    yes - i'm working on that. the problem is that i actually KNOW the filename
    that gets unlinked and causes the silly name - it's the 'db-journal' file (i
    can see a .nfsXXXX file come into existence with its exact contents). the
    problem is that the sqlite api opens this file and i have no file handle on
    it. problem two is that ruby does not provide a way to get at this info that
    i know of. you could

    256.times do |fd|
      begin
        file = IO::new fd
        File::unlink file.path if file.path =~ %r/db-journal/o
      rescue Errno::EBADF, Errno::EINVAL
      end
    end

    __except__ that File objects created this way do not have a path! (nor
    respond_to?('path') for that matter) - at least on my ruby. i'm not sure if
    this is a bug or not...
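
    on linux one workaround might be to sidestep ruby's IO objects entirely and
    read the per-process fd table under /proc instead; a hedged sketch, assuming
    /proc is available:

    # each entry in /proc/self/fd is a symlink named after an open descriptor,
    # pointing at the path (or pipe/socket description) behind it
    Dir['/proc/self/fd/*'].each do |link|
      fd = File::basename(link).to_i
      begin
        path = File::readlink link
      rescue Errno::ENOENT
        next
      end
      puts "fd <#{ fd }> -> <#{ path }>"
    end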

    > I don't know if any of these will help, but I can't see that they
    > could hurt (I used to say that "ideas can't hurt you" but I'm older
    > now).


    funny. yeah - anything helps - i'm grasping at straws!

    cheers.

    -a
     
    , Sep 16, 2004
    #3
  4. Ara.T.Howard

    Ara.T.Howard Guest

    On Fri, 17 Sep 2004, Ruben wrote:

    > At Fri, 17 Sep 2004 14:46:07 +0900,
    > Clifford Heath wrote:
    >>
    >> Ara.T.Howard wrote:
    >>> i've isolated the bit of code that causes the core dumps to occur
    >>> ...
    >>> now heres the tricky bit. the core dump doesn't happen here - it
    >>> happens at some random time later, and then again sometimes it doesn't.

    >>
    >> This sort of scenario is almost always caused by a memory corruption,
    >> either an array-out-of-bounds write causing corruption of the memory
    >> allocation arena, or a reference to an object that's been deleted.
    >> I've chased dozens of these up until five or so years ago, when I got
    >> my hands on a copy of Purify. It catches the corruption *at source*,
    >> and has become quite simply indispensable for this (and many other)
    >> tasks.
    >>
    >> The world would be a better place if every developer used Purify on
    >> every release. Note that it's *not* the same as most "bounds checker"
    >> type tools; it actually maintains a parallel table of markers for
    >> every memory location, and *rewrites the machine instructions* for every
    >> memory reference so it can also check validity against the marker table.
    >>
    >> It's quite simply the bee's knees if you must write in C or a similarly
    >> primitive language :).
    >>
    >> Clifford Heath.

    >
    > Sounds like the same thing valgrind does (for free). It might be
    > interesting to try valgrind on this, if it's a memory related bug. The
    > downside is that running the code through valgrind will give you a
    > slowdown with a factor 30 to 60 (from personal experience). So, not
    > really an option if the bug only shows up after a couple of days...
    >
    > Ruben


    actually both are options since the code in question simply manages a queue of
    jobs and the cost is about 1/1000th of the actual work. i've used valgrind and
    purify before with some success. i had a really hard to track down bug about
    a year ago and ended up needing valgrind, purify, and dmalloc to track it
    down. these are good suggestions as i'd forgotten about them. it'll be
    pretty tough to set up but possible.

    this is getting a bit OT now so any responders should probably ping me offline
    unless anyone has anything specific to ruby regarding closing all file
    descriptors after a fork and related bugs.

    kind regards.

    -a
     
    Ara.T.Howard, Sep 17, 2004
    #4
  5. Clifford Heath <> wrote in message news:<>...

    > The world would be a better place if every developer used Purify on
    > every release.


    Also worth checking out (for linux/x86 only, I believe) is valgrind.
    It also does very good memory/bounds checking, and is free.

    Nathan
     
    Nathan Weston, Sep 17, 2004
    #5
  6. Hello Ruben,


    R> Sounds like the same thing valgrind does (for free). It might be
    R> interesting to try valgrind on this, if it's a memory related bug. The
    R> downside is that running the code through valgrind will give you a
    R> slowdown with a factor 30 to 60 (from personal experience). So, not

    And now you see the difference between good working (and expensive)
    commercial tools and freeware tools like valgrind.

    But to be honest you make valgrind sound worse than it is. The slowdown
    should be a factor of 10 to 20.


    --
    Best regards, emailto: scholz at scriptolutions dot com
    Lothar Scholz http://www.ruby-ide.com
    CTO Scriptolutions Ruby, PHP, Python IDE 's
     
    Lothar Scholz, Sep 17, 2004
    #6
  7. Ara.T.Howard

    Ruben Guest

    Lothar,

    > And now you see the difference between good working (and expensive)
    > commercial tools and freeware tools like valgrid.


    I've heard before that Purify is good, but I don't have any experience
    with it myself, and it might not be an option for everyone because of
    the cost.

    (besides, I don't think that commercial tools are necessarily bad and
    free tools are necessarily good, or the other way around...)

    > But to be honest you make valgrind sound worse than it is. The slowdown
    > should be a factor of 10 to 20.


    Ah.. that's probably because I used 'callgrind' recently which is also
    a skin for valgrind and probably more expensive than the memcheck
    skin. I guess it also depends on the kind of code that's run.

    Ruben
     
    Ruben, Sep 17, 2004
    #7
  8. Ara.T.Howard

    Guest

    Hi,

    At Fri, 17 Sep 2004 03:54:52 +0900,
    Ara.T.Howard wrote in [ruby-talk:112814]:
    > @cid =
    > Util::fork do


    > trap('SIGHUP') do
    > $signaled = $sighup = true
    > warn{ "signal <SIGHUP>" }


    What are these, "Util::fork" and "warn" with block?

    --
    Nobu Nakada
     
    , Sep 22, 2004
    #8
  9. Ara.T.Howard

    Guest

    On Wed, 22 Sep 2004 wrote:

    > Hi,
    >
    > At Fri, 17 Sep 2004 03:54:52 +0900,
    > Ara.T.Howard wrote in [ruby-talk:112814]:
    >> @cid =
    >> Util::fork do

    >
    >> trap('SIGHUP') do
    >> $signaled = $sighup = true
    >> warn{ "signal <SIGHUP>" }

    >
    > What are these, "Util::fork" and "warn" with block?


    Util::fork is simply a 'quiet' fork:

    module Util
    #{{{
      class << self
        def export sym
        #{{{
          sym = "#{ sym }".intern
          module_function sym
          public sym
        #}}}
        end
        def append_features c
        #{{{
          super
          c.extend Util
        #}}}
        end
      end

      ...

      def fork(*a, &b)
      #{{{
        begin
          verbose = $VERBOSE
          $VERBOSE = nil
          Process::fork(*a, &b)
        ensure
          $VERBOSE = verbose
        end
      #}}}
      end
      export 'fork'

      ...

    #}}}
    end


    warn with a block delegates to a Logger object:

    class Main
    #{{{

      ...

      %w( debug info warn error fatal ).each do |m|
        eval "def #{ m }(*a,&b);@logger.#{ m }(*a,&b);end"
      end

      ...

    #}}}
    end


    regards.

    -a
     
    , Sep 22, 2004
    #9
  10. Ara.T.Howard

    Guest

    On Wed, 22 Sep 2004, Kevin McConnell wrote:

    > I could be way off here, but are you opening your SQLite database over NFS?


    oh yeah - definitely, from many machines at once! ;-)

    > I think this can often lead to problems due to the locking not working, so
    > maybe something is going wrong inside the sqlite library code?


    the locking is fcntl based - so it's nfs safe on any decent (not sun) nfs
    implementation. ours is pure linux on both server and client nodes.

    > You might want to look at the section 7 on http://www.sqlite.org/faq.html.


    i have. ;-)

    essentially i am not relying on sqlite's locking exclusively: my code has an
    additional 'lock file' (an empty file to which i apply nfs safe locks - see my
    posixlock module on the raa) which i use to ensure single writer / multiple
    reader semantics at a __file__ level (sqlite guarantees this at a
    __byte_range__ level). in addition i am using an nfs safe lockfile class (my
    lockfile package on the raa) to assist with certain touchy operations. in
    summary i am manually coordinating access to the database in a way that is
    safe and transactionally protected. the access is logically this:

    aquire separate lock of read or write type

    open database

    begin a transaction

    execute sql

    end transaction

    close database

    release separate lock of read or write type

    this is wrapped with code that autodetects and recovers from several
    potential errors such as a failed lockd server or failed io operations.
    although i can force these to happen and my code handles them, i have never
    actually seen it happen in practice.
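
    a rough sketch of the shape of that access pattern in ruby - every name here
    is a hypothetical stand-in, since the real daemon uses my posixlock/lockfile
    packages and the sqlite binding directly:

    def guarded_transaction mode = :write
      acquire_queue_lock mode          # separate nfs-safe lock, read or write
      begin
        db = open_database             # open the sqlite database
        begin
          db_transaction(db) do        # begin transaction ... end transaction
            yield db                   # execute sql
          end
        ensure
          close_database db            # close database before dropping the lock
        end
      ensure
        release_queue_lock mode        # release the separate lock
      end
    end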

    the code in question is a system that allows scientists to configure a linux
    cluster to work on a huge stack of work in under a minute with zero sysad
    intervention. at this point we've run about 3 million jobs through the system
    without incident in the face of two power outages, dozens of reboots, and
    steady extreme (load > 30) nfs load.

    here's a shot of one of our clusters now:

    yacht:~/shared > rq queue status
    ---
    pending : 5875
    running : 36
    finished : 1108
    dead : 0

    yacht:~/shared > rq queue list running | head -20
    ---
    -
    jid: 1324
    priority: 0
    state: running
    submitted: 2004-09-20 09:16:39.449169
    started: 2004-09-22 03:55:24.914682
    finished:
    elapsed:
    submitter: jib.ngdc.noaa.gov
    runner: redfish.ngdc.noaa.gov
    pid: 11519
    exit_status:
    command: /dmsp/moby-1-1/cfadmin/shared/jobs/wavgjob /dmsp/moby-1-1/conf/avg_dn/filelists/F142000.included F142000.cloud2.light1.tile8 /dmsp/moby-1-1/conf/avg_dn/cloud2.light1.tile8.conf cfd2://cfd2-3/F142000/
    -
    jid: 1325
    priority: 0
    state: running
    submitted: 2004-09-20 09:16:39.449169
    started: 2004-09-22 04:12:32.758249


    this stack of work will take about a week to complete using 18 nodes.



    from the man page of the main commandline program 'rq':


    NAME
    rq v0.1.2

    SYNOPSIS
    rq [queue] mode [mode_args]* [options]*

    DESCRIPTION
    rq is an __experimental__ tool used to manage nfs mounted work
    queues. multiple instances of rq on multiple hosts can work from
    these queues to distribute processing load to 'n' nodes - bringing many dozens
    of otherwise powerful cpus to their knees with a single blow. clearly this
    software should be kept out of the hands of radicals, SETI enthusiasts, and
    one mr. jeff safran.

    rq operates in one of the modes create, submit, feed, list, delete,
    query, or help. depending on the mode of operation and the options used the
    meaning of mode_args may change, sometimes wildly and unpredictably (i jest, of
    course).


    MODES

    modes may be abbreviated to uniqueness, therefore the following shortcuts
    apply :

    c => create
    s => submit
    f => feed
    l => list
    d => delete
    q => query
    h => help

    create, c :

    creates a queue. the queue MUST be located on an nfs mounted file system
    visible from all nodes intended to run jobs from it.

    examples :

    0) to create a queue
    ~ > rq q create
    or simply
    ~ > rq q c

    list, l :

    show combinations of pending, running, dead, or finished jobs. for this
    command mode_args must be one of pending, running, dead, finished, or all.
    the default is all.

    mode_args may be abbreviated to uniqueness, therefore the following
    shortcuts apply :

    p => pending
    r => running
    f => finished
    d => dead
    a => all

    examples :

    0) show everything in q
    ~ > rq q list all
    or
    ~ > rq q l all
    or
    ~ > export RQ_Q=q
    ~ > rq l

    0) show q's pending jobs
    ~ > rq q list pending

    1) show q's running jobs
    ~ > rq q list running

    2) show q's finished jobs
    ~ > rq q list finished


    submit, s :

    submit jobs to a queue to be processed by any feeding node. any mode_args
    are taken as the command to run. note that mode_args are subject to shell
    expansion - if you don't understand what this means do not use this feature.

    when running in submit mode a file may be specified as a list of commands to
    run using the '--infile, -i' option. this file is taken to be a newline
    separated list of commands to submit; blank lines and comments (#) are
    allowed. if submitting a large number of jobs the input file method is MUCH
    more efficient. if no commands are specified on the command line rq
    automatically reads them from STDIN. yaml formatted files are also allowed
    as input (http://www.yaml.org/) - note that the output of nearly all rq
    commands is valid yaml and may, therefore, be piped as input into the submit
    command.

    the '--priority, -p' option can be used here to determine the priority of
    jobs. priorities may be any number in [0, 10); therefore 9 is the maximum
    priority. submitting a high priority job will NOT supplant currently
    running low priority jobs, but higher priority jobs will always migrate
    above lower priority jobs in the queue in order that they be run sooner.
    note that constant submission of high priority jobs may create a starvation
    situation whereby low priority jobs are never allowed to run. avoiding this
    situation is the responsibility of the user.

    examples :

    0) submit the job ls to run on some feeding host

    ~ > rq q s ls

    1) submit the job ls to run on some feeding host, at priority 9

    ~ > rq -p9 q s ls

    2) submit 42000 jobs (quietly) to run from a command file.

    ~ > wc -l cmdfile
    42000
    ~ > rq q s -q < cmdfile

    3) submit 42 jobs to run at priority 9 from a command file.

    ~ > wc -l cmdfile
    42
    ~ > rq -p9 q s < cmdfile

    4) re-submit all finished jobs

    ~ > rq q l f | rq q s


    feed, f :

    take jobs from the queue and run them on behalf of the submitter. jobs are
    taken from the queue in an 'oldest highest priority' order.

    feeders can be run from any number of nodes allowing you to harness the CPU
    power of many nodes simultaneously in order to more effectively clobber
    your network.

    the most useful method of feeding from a queue is to do so in daemon mode so
    that the process loses its controlling terminal and will not exit when
    you exit your terminal session. use the '--daemon, -d' option to accomplish
    this. by default only one feeding process per host per queue is allowed to
    run at any given moment. because of this it is acceptable to start a feeder
    at some regular interval from a cron entry since, if a feeder is already
    running, the process will simply exit and otherwise a new feeder will be
    started. in this way you may keep a feeder process running even across
    machine reboots.


    examples :

    0) feed from a queue verbosely for debugging purposes, using a minimum and
    maximum polling time of 2 and 4 respectively

    ~ > rq q feed -v4 -m2 -M4

    1) feed from a queue in daemon mode logging into /home/ahoward/rq.log

    ~ > rq q feed -d -l/home/ahoward/rq.log

    2) use something like this sample crontab entry to keep a feeder running
    forever (it attempts to (re)start every fifteen minutes)

    #
    # your crontab file
    #

    */15 * * * * /full/path/to/bin/rq /full/path/to/nfs/mounted/q f -d -l/home/user/rq.log

    log rolling while running in daemon mode is automatic.


    delete, d :

    delete combinations of pending, running, finished, dead, or specific jobs.
    the delete mode is capable of parsing the output of list mode, making it
    possible to create filters to delete jobs meeting very specific conditions.

    mode_args are the same as for 'list', including 'running'. note that it is
    possible to 'delete' a running job, but there is no way to actually STOP it
    mid execution since the node doing the deleting has no way to communicate
    this information to the (possibly) remote execution host. therefore you
    should use the 'delete running' feature with care and only for housekeeping
    purposes or to prevent future jobs from being scheduled.

    examples :

    0) delete all pending, running, and finished jobs from a queue

    ~ > rq q d all

    1) delete all pending jobs from a queue

    ~ > rq q d p

    2) delete all finished jobs from a queue

    ~ > rq q d f

    3) delete jobs via hand crafted filter program

    ~ > rq q list | filter_prog | rq q d

    query, q :

    query exposes the database more directly to the user, evaluating the where
    clause specified on the command line (or from STDIN). this feature can be
    used to make a fine grained selection of jobs for reporting or as input into
    the delete command. you must have a basic understanding of SQL syntax to
    use this feature, but it is fairly intuitive in this capacity.

    examples:

    0) show all jobs submitted within a specific 10 minute range

    ~ > rq q query "started >= '2004-06-29 22:51:00' and started < '2004-06-29 22:51:10'"

    1) shell quoting can be tricky here so input on STDIN is also allowed

    ~ > cat constraints
    started >= '2004-06-29 22:51:00' and
    started < '2004-06-29 22:51:10'

    ~ > rq q query < constraints
    or (same thing)

    ~ > cat constraints | rq q query

    2) this query output may then be used to delete specific jobs

    ~ > cat constraints | rq q query | rq q d

    3) show all jobs which are either finished or dead

    ~ > rq q q state=finished or state=dead


    NOTES
    - realize that your job is going to be running on a remote host and this has
    implications. paths, for example, should be absolute, not relative.
    specifically, the submitted job must be visible from all hosts currently
    feeding from a q.

    - you need to consider __CAREFULLY__ what the ramifications of having multiple
    instances of your program all running at the same time will be. it is
    beyond the scope of rq to ensure multiple instances of a program
    will not overwrite each other's output files, for instance. coordination of
    programs is left entirely to the user.

    - the list of finished jobs will grow without bound unless you sometimes
    delete some (all) of them. the reason for this is that rq cannot
    know when the user has collected the exit_status, etc. from a job and so
    keeps this information in the queue until instructed to delete it.

    - if you are using the crontab feature to maintain an immortal feeder on a
    host then that feeder will be running in the environment provided by cron.
    this is NOT the same environment found in a login shell and you may be
    surprised at the range of commands which do not function. if you want
    submitted jobs to behave as closely as possible to their behaviour when
    typed interactively you'll need to wrap each job in a shell script that
    looks like the following:

    #!/bin/bash --login
    commands_for_your_job

    and submit that script


    ENVIRONMENT
    RQ_Q: full path to queue

    the queue argument to all commands may be omitted if, and only if, the
    environment variable 'RQ_Q' contains the full path to the q. eg.

    ~ > export RQ_Q=/full/path/to/my/q

    this feature can save a considerable amount of typing for those weak of wrist


    DIAGNOSTICS
    success => $? == 0
    failure => $? != 0


    AUTHOR



    BUGS
    1 < bugno && bugno <= 42


    OPTIONS


    -f, --feed=appetite
    -p, --priority=priority
    --name
    -d, --daemon
    -q, --quiet
    -e, --select
    -i, --infile=infile
    -M, --max_sleep=seconds
    -m, --min_sleep=seconds
    -l, --log=path
    -v=0-4|debug|info|warn|error|fatal
    --verbosity
    --log_age=log_age
    --log_size=log_size
    -c, --config=path
    --template=template
    -h, --help


    so far it looks like the solution to my problem is to close the database after
    forking (if it was open), but i'm still testing this approach.
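
    for concreteness, the shape of that fix inside the fork block might be
    something like the following - 'db' and its close call stand in for whatever
    handle the sqlite binding actually hands back:

    @cid =
      Util::fork do
        @w.close
        STDIN.reopen @r
        db.close if db                 # drop the inherited sqlite handle in the child
        exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
      end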

    kind regards.

    -a
     
    , Sep 22, 2004
    #10
  11. Ara.T.Howard

    Guest

    On Wed, 22 Sep 2004, Kevin McConnell wrote:

    > wrote:
    >
    >>> You might want to look at the section 7 on
    >>> http://www.sqlite.org/faq.html.

    >>
    >>
    >> i have. ;-)

    >
    > Ah...yeah, I suspected I was just stating the obvious :)


    better to assume nothing when debugging though - i AM grasping at straws so
    i don't want to overlook anything. i went back and re-read the docs at your
    suggestion - now i'm re-reading the sqlite_close code.

    > Good luck with the solution, though.


    luck would be nice.

    regards.

    -a
     
    , Sep 22, 2004
    #11
  12. Ara.T.Howard

    Guest

    On Wed, 22 Sep 2004, Kevin McConnell wrote:

    > Just had one other suggestion (hopefully more useful than the last :)
    >
    > Could you separate out the db-related code into a little 'proxy' app, to run
    > on the same machine as where the db files are, and have your clients connect
    > to it (to read the job, submit the pid etc)? It might help solve any
    > potential locking hassles (if that's even the problem), since the only thing
    > touching the database would be local. And hey, if nothing else, it could be
    > interesting to find out which side of the code coredumps :)


    i'm now looking at using detach.rb, which creates a drb object out of any
    existing object. basically it would be a little servlet for the daemon's use
    only. i think this may be the way to go. thanks for the idea.
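
    for the record, the underlying drb plumbing is only a few lines; a hedged
    sketch with made-up class name, host, port, and paths:

    require 'drb'

    # on the host that owns the database files: serve a queue object locally
    # (QueueDB is a hypothetical stand-in for the real sqlite-backed class)
    DRb.start_service 'druby://dbhost:9999', QueueDB.new('/path/to/q/db')
    DRb.thread.join

    # on each feeding node: talk to the proxy instead of opening sqlite over nfs
    #   queue = DRbObject.new_with_uri 'druby://dbhost:9999'
    #   job = queue.next_pending_job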

    regards.

    -a
     
    , Sep 22, 2004
    #12
