extremely strange segfault

Discussion in 'Ruby' started by Luke A. Kanies, Dec 15, 2003.

  1. Hi all,

    I'm trying to use ruby on AIX 5.1 to generate Nagios configuration files.
    It's all been working fine until today, when I started having very, very
    odd segfaults.

    The thing that makes them odd is that they appear and disappear depending
    on lines I add or remove in my scripts, and (here's the weird part) those
    lines can be comments or "puts" statements.

    I tried adding output statements to track down the segfaults, and they
    went away. So, I tried commenting the statements out, and the faults were
    still gone. Okay, so I delete them, and now the faults are back.
    Depending on where my print statements are and where the comments are and
    other completely random things, the segfaults also appear at different
    lines in the script.

    This is in a 1200 line script, using an ldap.so compiled against OpenLDAP
    2.something and digest/md5. Ooh, except I just discovered that I'm not
    actually using digest/md5, but removing it causes another segfault.

    Yes, I'm relatively convinced that this is something I'm doing, but, well,
    you've seen where my debugging has gotten me: very confused.

    The ruby is one I compiled myself using gcc 3.3.1 on AIX 5.1. The only
    configure flag I used was -qmaxmem=32768, but I did run the following
    after configuring:

    find . -name Makefile -exec perl -pi -e 's/ -brtl$//' {} \;

    perl -pi -e 's/^.+RSTRING.+$//' ext/syck/emitter.c

    These were necessary to compile on AIX. The -brtl fix is because
    apparently -brtl must appear at the beginning of a DLDFLAGS declaration on
    AIX. I know I should have submitted this bug, but I (obviously) have not.
    How should I?

    The emitter change was because emitter refused to compile based on a
    problem with that line. The strange thing? It was commented with '//'.
    It mentioned (in the comment) a variable not used anywhere else, so gcc
    (stupidly?) complained about the missing declaration. I removed the
    comment, and all went ok.

    Let me guess: My build environment is very broken, so there's not much
    help for me, right?

    Well, if anyone has any pointers on how I might track this problem down,
    I'd appreciate it. I'm willing to just upgrade, if necessary, but I would
    prefer to actually understand what is happening.

    I am also willing to share the code, but it's long enough that I didn't
    want to just spam the list with it.

    Thanks,
    Luke

    --
    I don't want the world, I just want your half.
    Luke A. Kanies, Dec 15, 2003
    #1

  2. Sorry, forgot to mention: I'm using ruby 1.8.0.

    --
    Meeting, n.:
    An assembly of people coming together to decide what person or
    department not represented in the room must solve a problem.
    Luke A. Kanies, Dec 15, 2003
    #2

  3. Hi,

    In message "extremely strange segfault"
    on 03/12/16, "Luke A. Kanies" <> writes:

    |These were necessary to compile on AIX. The -brtl fix is because
    |apparently -brtl must appear at the beginning of a DLDFLAGS declaration on
    |AIX. I know I should have submitted this bug, but I (obviously) have not.
    |How should I?

    Posting the patch to the ruby-core mailing list is most convenient for me.

    |I am also willing to share the code, but it's long enough that I didn't
    |want to just spam the list with it.

    If you can put the script (and data) needed to reproduce the error on the
    web, that's the best way. Otherwise, send it to me directly.

    matz.
    Yukihiro Matsumoto, Dec 15, 2003
    #3
  4. On Tue, 16 Dec 2003, Yukihiro Matsumoto wrote:

    > In message "extremely strange segfault"
    > on 03/12/16, "Luke A. Kanies" <> writes:
    >
    > |These were necessary to compile on AIX. The -brtl fix is because
    > |apparently -brtl must appear at the beginning of a DLDFLAGS declaration on
    > |AIX. I know I should have submitted this bug, but I (obviously) have not.
    > |How should I?
    >
    > Posting the patch to the ruby-core mailing list is most convenient for me.


    Okay, I'll do that.

    > |I am also willing to share the code, but it's long enough that I didn't
    > |want to just spam the list with it.
    >
    > If you can put the script (and data) needed to reproduce the error on the
    > web, that's the best way. Otherwise, send it to me directly.


    Hmmm. I can probably do that (I'll have to check with my employer) but I have
    to warn you: It's about 370 hosts in LDAP using a custom schema.

    Before I get to that, let me try to spend some time further isolating the
    problem. It's not exactly straightforward testing, but let me at least
    see if I can reproduce the problem without the LDAP data.

    Luke

    --
    Should I say "I believe in physics", or "I know that physics is true"?
    -- Ludwig Wittgenstein, On Certainty, 602.
    Luke A. Kanies, Dec 15, 2003
    #4
  5. Is it feasible to call GC.disable in your app? That would at least tell
    you if it is a mark/free related bug.
    Joel VanderWerf, Dec 15, 2003
    #5
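A minimal sketch of the suggestion above, assuming the script can wrap its main work in a block. `GC.disable` and `GC.enable` are core Ruby methods; the `without_gc` helper and the stand-in workload are hypothetical and not taken from the original script.

```ruby
# Hypothetical helper: run a block with the garbage collector switched off,
# then restore whatever state the collector was in before.
def without_gc
  was_disabled = GC.disable   # returns true if GC was already disabled
  yield
ensure
  GC.enable unless was_disabled
end

result = without_gc do
  # Stand-in for the LDAP processing loop (assumption, not the real code).
  (1..1000).map { |i| "host#{i}" }.size
end
puts result
```

If the crash disappears with the collector off, that points toward an object being marked or freed incorrectly, though it does not by itself say which component is at fault.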
  6. On Tue, 16 Dec 2003, Joel VanderWerf wrote:

    >
    > Is it feasible to call GC.disable in your app? That would at least tell you
    > if it is a mark/free related bug.


    I can't test it until I get back to work, but how would I do that? There
    certainly is a decent amount of memory shunting involved, since I'm doing
    an ldap query, creating a bunch of objects based on the results, and then
    storing the objects in various and sundry groups, along with some
    self-rolled autovivification.

    Basically, I'm pulling a host list from ldap, converting each host into an
    object, using some basic logic to store that host in groups for which I've
    set up pseudo-autovivification, and then looping across each host to do some
    other stuff.

    I've had some strange issues with ldap, and I miss the all-perl Net::LDAP
    with SSL, but it basically worked until I started with the autovivification
    and adding the objects to lots of groups. That's why I figure I can
    isolate the problem some, but I got sidetracked into trying to make my
    lexer perform better (it's taking about 12 seconds just to tokenize a 99k
    file, which seemed high, so...).

    Hopefully tomorrow I'll be able to isolate this problem at least into a
    specific component, but so far it's been, um, strange.

    Luke

    --
    You can't have everything. Where would you put it?
    -- Stephen Wright
    Luke A. Kanies, Dec 16, 2003
    #6
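For readers unfamiliar with the term, here is a rough sketch of what autovivified groups can look like in Ruby, using a `Hash` with a default block. The `Host` class, its attributes, and the group names are all hypothetical; the original script was not posted, so this is only an illustration of the general pattern described above.

```ruby
# Hypothetical Host class; attribute names are illustrative only.
class Host
  attr_reader :name, :os

  def initialize(name, os)
    @name = name
    @os = os
  end
end

# Autovivifying hash: looking up a missing group creates an empty member list.
groups = Hash.new { |hash, key| hash[key] = [] }

hosts = [
  Host.new("web01", "aix"),
  Host.new("db01", "aix"),
  Host.new("mail01", "linux")
]

hosts.each do |host|
  groups[host.os] << host   # the group springs into existence on first use
  groups["all"] << host
end

groups.each do |name, members|
  puts "#{name}: #{members.map { |h| h.name }.join(', ')}"
end
```

Each host ends up referenced from several arrays, which is the kind of object graph the collector has to walk during its mark phase.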
  7. >>>>> "L" == Luke A Kanies <> writes:

    L> I tried adding output statements to track down the segfaults, and they
    L> went away. So, I tried commenting the statements out, and the faults were
    L> still gone. Okay, so I delete them, and now the faults are back.
    L> Depending on where my print statements are and where the comments are and
    L> other completely random things, the segfaults also appear at different
    L> lines in the script.

    Try to give a backtrace when it segfaults.


    Guy Decoux
    ts, Dec 16, 2003
    #7
  8. On Tue, 16 Dec 2003, Joel VanderWerf wrote:

    >
    > Is it feasible to call GC.disable in your app? That would at least tell you
    > if it is a mark/free related bug.


    Well, I can't precisely say that it was a problem with GC, but I can't
    reproduce the problem with GC disabled.

    This is just about the strangest problem I've ever had, because it will
    appear if I comment a print statement out, but then disappear if I just
    delete the print statement. That's why I can't clearly say it was a
    problem with GC, even though disabling GC seems to fix it: It could be the
    extra line in the file that fixes it or something silly like that.

    The segfault consistently comes around an .each iteration I have
    associated with some LDAP entries and some class definitions. I get
    different line numbers for the segfault every time, but it is consistently
    somewhere in my processing of the LDAP information. This makes me think
    it is a problem with the ldap.so somehow, although I don't know if
    loaded libraries can kill Ruby -- I assume so.

    If there are any other tests you would like me to try, please let me know.

    Luke

    --
    Due to circumstances beyond your control, you are master of your fate
    and captain of your soul.
    Luke A. Kanies, Dec 16, 2003
    #8
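One possible way to make a crash like this less dependent on incidental edits is to force a collection at a known point instead of waiting for the heap to fill. The sketch below assumes the processing happens in an `.each` loop as described above; `GC.start` is a core Ruby method, while the `entries` data is a made-up stand-in for the LDAP results.

```ruby
# Stand-in for the LDAP result set (assumption); the real script would
# iterate over search results instead.
entries = (1..50).map { |i| { "cn" => "host#{i}" } }

processed = []
entries.each do |entry|
  processed << entry["cn"].upcase
  GC.start   # force a full collection now rather than whenever the heap fills
end
puts processed.size
```

If an extension is leaving an invalid object where the collector can find it, collecting on every iteration may make the fault show up at a consistent place, which is easier to bracket than one that moves around.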
  9. On Tue, 16 Dec 2003, ts wrote:

    > Try to give a backtrace when it segfaults.


    (Here I delve into unknown territory...)

    wzd4845@naadmd02(134) $ gdb /usr/local/bin/ruby
    GNU gdb 6.0
    Copyright 2003 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "powerpc-ibm-aix5.1.0.0"...
    (gdb) run -S /home/wzd4845/bin/naghosts
    Starting program: /usr/local/bin/ruby -S /home/wzd4845/bin/naghosts

    Program received signal SIGSEGV, Segmentation fault.
    0x1003a09c in rb_gc_mark ()
    (gdb) bt
    #0 0x1003a09c in rb_gc_mark ()
    #1 0x1003a580 in rb_gc_mark_children ()
    #2 0x1003a168 in rb_gc_mark ()
    #3 0x1003a778 in rb_gc_mark_children ()
    #4 0x1003a168 in rb_gc_mark ()
    #5 0x10039d20 in mark_locations_array ()
    #6 0x10039db0 in rb_gc_mark_locations ()
    #7 0x1003baa0 in rb_gc ()
    #8 0x1003960c in rb_newobj ()
    #9 0x100147d4 in new_blktag ()
    #10 0x1001b098 in rb_eval ()
    #11 0x10023ae0 in rb_call0 ()
    #12 0x10024100 in rb_call ()
    #13 0x100244ec in rb_funcall2 ()
    #14 0x1002838c in rb_obj_call_init ()
    #15 0x10048f00 in rb_class_new_instance ()
    #16 0x10036a48 in call_cfunc ()
    #17 0x1002347c in rb_call0 ()
    #18 0x10024100 in rb_call ()
    #19 0x1001c530 in rb_eval ()
    #20 0x10023ae0 in rb_call0 ()
    #21 0x10024100 in rb_call ()
    #22 0x1001c530 in rb_eval ()
    #23 0x1001d4c8 in rb_eval ()
    #24 0x10020834 in rb_yield_0 ()
    #25 0x10020ca0 in rb_yield ()
    #26 0x10022580 in rb_ensure ()
    #27 0xd1ae7f90 in rb_ldap_conn_search_b ()
    from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so
    #28 0x10022580 in rb_ensure ()
    ---Type <return> to continue, or q <return> to quit---
    #29 0xd1ae80e8 in rb_ldap_conn_search_s ()
    from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so
    #30 0x10036a48 in call_cfunc ()
    #31 0x1002347c in rb_call0 ()
    #32 0x10024100 in rb_call ()
    #33 0x1001c530 in rb_eval ()
    #34 0x1001b1d8 in rb_eval ()
    #35 0x1001b808 in rb_eval ()
    #36 0x10015c34 in eval_node ()
    #37 0x10016708 in ruby_exec ()
    #38 0x10016850 in ruby_run ()
    #39 0x10000570 in main ()

    Hopefully that tells you something...

    Do you need anything else?

    Luke

    --
    "The leader of Jamestown was "John Smith" (not his real name), under
    whose direction the colony engaged in a number of activities,
    primarily related to starving. -- Dave Barry, "Dave Barry Slept Here"
    Luke A. Kanies, Dec 16, 2003
    #9
  10. >>>>> "L" == Luke A Kanies <> writes:

    L> #4 0x1003a168 in rb_gc_mark ()
    L> #5 0x10039d20 in mark_locations_array ()
    L> #6 0x10039db0 in rb_gc_mark_locations ()
    L> #7 0x1003baa0 in rb_gc ()

    You have a problem with the GC; it probably finds an invalid object on the
    stack.

    The best approach is probably to first verify the extensions that you use;
    one of them may have a bug.


    Guy Decoux
    ts, Dec 16, 2003
    #10
  11. >>>>> "L" == Luke A Kanies <> writes:

    L> #29 0xd1ae80e8 in rb_ldap_conn_search_s ()
    L> from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so

    What is your version of ruby_ldap?

    If it's 0.8.1, try 0.7.2: there is a bug in rb_ldap_conn_search_s()
    in 0.8.1 (see [ruby-talk:85228])


    Guy Decoux
    ts, Dec 16, 2003
    #11
  12. On Wed, 17 Dec 2003, ts wrote:

    > >>>>> "L" == Luke A Kanies <> writes:
    >
    > L> #29 0xd1ae80e8 in rb_ldap_conn_search_s ()
    > L> from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so
    >
    > What is your version of ruby_ldap?
    >
    > If it's 0.8.1, try 0.7.2: there is a bug in rb_ldap_conn_search_s()
    > in 0.8.1 (see [ruby-talk:85228])


    Yes, it's 0.8.1. I'll try 0.7.2 when I get a chance.

    Luke

    --
    My favorite was a professor at a University I Used To Be Associated With
    who claimed that our requirement of a non-alphabetic character in our
    passwords was an abridgement of his freedom of speech.
    -- Jacob Haller
    Luke A. Kanies, Dec 16, 2003
    #12
  13. "Luke A. Kanies" <> writes:

    > On Tue, 16 Dec 2003, Joel VanderWerf wrote:
    >
    > >
    > > Is it feasible to call GC.disable in your app? That would at least tell you
    > > if it is a mark/free related bug.
    >
    > Well, I can't precisely say that it was a problem with GC, but I can't
    > reproduce the problem with GC disabled.
    >
    > This is just about the strangest problem I've ever had, because it will
    > appear if I comment a print statement out, but then disappear if I just
    > delete the print statement.


    If this is the strangest bug you ever had, you are doing pretty well.

    What you have come across is a "Heisenbug":

    http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?heisenbug

    Happy hunting!

    d.k.


    --
    Daniel Kelley - San Jose, CA
    For email, replace the first dot in the domain with an at.
    Daniel Kelley, Dec 16, 2003
    #13
  14. On Wed, 17 Dec 2003, Daniel Kelley wrote:

    > "Luke A. Kanies" <> writes:
    >
    > > This is just about the strangest problem I've ever had, because it will
    > > appear if I comment a print statement out, but then disappear if I just
    > > delete the print statement.

    >
    > If this is the strangest bug you ever had, you are doing pretty well.
    >
    > What you have come across is a "Heisenbug":
    >
    > http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?heisenbug


    That's what this was, though. I would encounter the bug, so then I'd add
    some print statements to try to bracket the bug and it would go away.
    Okay, then I'd just comment the statements out, maybe their execution
    fixed it; still no bug. Okay, delete the statements entirely; now the bug
    is back.

    The Heisenberg nature of the bug did not stand up to deeper scrutiny, but
    it certainly fooled my initial (usually sufficient) debugging.

    Luke

    --
    "I think that's how Chicago got started. A bunch of people in New York
    said, 'Gee, I'm enjoying the crime and the poverty, but it just isn't
    cold enough. Let's go west.' "
    --Richard Jeni
    Luke A. Kanies, Dec 16, 2003
    #14