Most Interesting Bug Track Down

Discussion in 'C Programming' started by Frederick Gotham, Nov 24, 2006.

  1. I thought it might be interesting to share experiences of tracking down a
    subtle or mysterious bug. I myself haven't much experience with tracking
    down bugs, but there's one in particular which comes to mind.

    I was writing usable which dealt with strings. As per usual with my code, I
    made it efficient to the extreme. One thing I did was replace, where
    possible, any usages of "strlen" with something like:

    struct PtrAndLen {
    char *p;
    size_t len;
    };

    This could be initialised with a string as follows:

    struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

    From that point forward in the code, "pal.len" would be used in place of
    strlen.

    The code grew though, and at one stage I needed to store info about two
    strings in one of these structures. To do this, I used a null separator,
    e.g.:

    PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

    All of this, however, was expanded by macros, so I actually had something
    like:

    MAKE_STR_INFO("Hello\0Bonjour")

    The problem with this, however, was that "strlen" and "pal.len" had
    different values, because strlen only read as far as the first null
    terminator. Anyway, I had to read through the code in detail before the bug
    jumped out at me.
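
    For illustration, a minimal sketch of the mismatch described above. The names packed, packed_len, first_len and second_string are made up for this example and do not come from the original program.

```c
#include <stddef.h>
#include <string.h>

/* Two strings packed into one literal, separated by an embedded '\0'. */
static const char packed[] = "Hello\0Bonjour";

/* Bytes in the array, not counting the final terminator the compiler appends. */
static size_t packed_len(void) { return sizeof packed - 1; }

/* What strlen() reports: it stops at the first '\0'. */
static size_t first_len(void) { return strlen(packed); }

/* Start of the second string, found by skipping past the first terminator. */
static const char *second_string(void) { return packed + strlen(packed) + 1; }
```

    Any code that mixes the stored length with strlen() on the same data will disagree about where the data ends.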

    I'm sure several regulars here have more interesting stories... :)

    --

    Frederick Gotham
    Frederick Gotham, Nov 24, 2006
    #1

  2. Eric Sosman

    Frederick Gotham wrote:

    > I thought it might be interesting to share experiences of tracking down a
    > subtle or mysterious bug. I myself haven't much experience with tracking
    > down bugs, but there's one in particular which comes to mind.
    >
    > I was writing usable which dealt with strings. As per usual with my code, I
    > made it efficient to the extreme. One thing I did was replace, where
    > possible, any usages of "strlen" with something like:
    >
    > struct PtrAndLen {
    > char *p;
    > size_t len;
    > };
    >
    > This could be initialised with a string as follows:
    >
    > struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
    >
    > From that point forward in the code, "pal.len" would be used in place of
    > strlen.
    >
    > The code grew though, and at one stage I needed to store info about two
    > strings in one of these structures. To do this, I used a null separator,
    > e.g.:
    >
    > PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};
    >
    > All of this, however, was expanded by macros, so I actually had something
    > like:
    >
    > MAKE_STR_INFO("Hello\0Bonjour")
    >
    > The problem with this, however, was that "strlen" and "pal.len" had
    > different values, because strlen only read as far as the first null
    > terminator. Anyway, I had to read through the code in detail before the bug
    > jumped out at me.


    Looks like it may not have jumped high enough: The bug is there
    even without embedded '\0' characters.

    --
    Eric Sosman
    Eric Sosman, Nov 24, 2006
    #2

  3. Jack Klein

    On Fri, 24 Nov 2006 22:12:15 GMT, Frederick Gotham
    wrote in comp.lang.c:

    >
    > I thought it might be interesting to share experiences of tracking down a
    > subtle or mysterious bug. I myself haven't much experience with tracking
    > down bugs, but there's one in particular which comes to mind.
    >
    > I was writing usable which dealt with strings. As per usual with my code, I
    > made it efficient to the extreme. One thing I did was replace, where
    > possible, any usages of "strlen" with something like:


    Why "as usual"? Once you had the application working correctly, was
    it too slow? Did you profile or otherwise test to prove that this
    was a bottleneck?

    > struct PtrAndLen {
    > char *p;
    > size_t len;
    > };
    >
    > This could be initialised with a string as follows:
    >
    > struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
    >
    > From that point forward in the code, "pal.len" would be used in place of
    > strlen.
    >
    > The code grew though, and at one stage I needed to store info about two
    > strings in one of these structures. To do this, I used a null separator,
    > e.g.:
    >
    > PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};


    So now, thanks to premature optimization, you have violated your
    design. You are initializing with something that is not a string, or
    more precisely a string plus something else.

    > All of this, however, was expanded by macros, so I actually had something
    > like:
    >
    > MAKE_STR_INFO("Hello\0Bonjour")
    >
    > The problem with this, however, was that "strlen" and "pal.len" had
    > different values, because strlen only read as far as the first null
    > terminator. Anyway, I had to read through the code in detail before the bug
    > jumped out at me.


    How could that be a problem? You just said you eliminated all use of
    strlen().

    > I'm sure several regulars here have more interesting stories... :)


    Sounds like a poor design, aggravated by violating its constraints in
    use.

    --
    Jack Klein
    Home: http://JK-Technology.Com
    FAQs for
    comp.lang.c http://c-faq.com/
    comp.lang.c++ http://www.parashift.com/c++-faq-lite/
    alt.comp.lang.learn.c-c++
    http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
    Jack Klein, Nov 24, 2006
    #3
  4. Eric Sosman:

    > Looks like it may not have jumped high enough: The bug is there
    > even without embedded '\0' characters.


    Ah yes, I should have mentioned that I took the "sizeof "Hello" - 1" into
    account.
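
    The thread never shows the macro itself, so the following is only a sketch of what a definition with the "- 1" correction might look like. The macro body, the const on the pointer and the variable hello are assumptions.

```c
#include <stddef.h>

struct PtrAndLen {
    const char *p;   /* const added here: string literals must not be modified */
    size_t len;      /* length excluding the terminating '\0' */
};

/* Hypothetical definition; the real macro is not shown in the thread.
   Only valid for string literals (or arrays), never for char pointers,
   because sizeof on a pointer yields the pointer size instead. */
#define MAKE_STR_INFO(lit) { lit, sizeof (lit) - 1 }

static const struct PtrAndLen hello = MAKE_STR_INFO("Hello");
```

    Here sizeof "Hello" is 6 because it counts the terminator, so hello.len comes out as 5, matching strlen.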

    --

    Frederick Gotham
    Frederick Gotham, Nov 24, 2006
    #4
  5. Jack Klein:

    >> As per usual with my
    >> code, I made it efficient to the extreme. One thing I did was replace,
    >> where possible, any usages of "strlen" with something like:

    >
    > Why "as usual"? Once you had the application working correctly, was
    > it too slow? Did you profile or otherwise test to prove that this
    > was a bottleneck?



    I have never written a program for monetary gain. I program purely for the
    enjoyment of programming. If I achieve a certain objective, I am not satisfied
    -- I want to achieve the objective as efficiently as is possible.


    >> struct PtrAndLen {
    >> char *p;
    >> size_t len;
    >> };
    >>
    >> This could be initialised with a string as follows:
    >>
    >> struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
    >>
    >> From that point forward in the code, "pal.len" would be used in place
    >> of strlen.
    >>
    >> The code grew though, and at one stage I needed to store info about two
    >> strings in one of these structures. To do this, I used a null
    >> separator, e.g.:
    >>
    >> PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

    >
    > So now, thanks to premature optimization, you have violated your
    > design. You are initializing with something that is not a string, or
    > more precisely a string plus something else.



    I can live with the minor complication though, given that my code runs
    several orders of magnitude faster than the "play it safe" equivalent.

    --

    Frederick Gotham
    Frederick Gotham, Nov 24, 2006
    #5
  6. CBFalconer

    Frederick Gotham wrote:
    >

    .... snip ...
    >
    > I was writing usable which dealt with strings. As per usual with my
    > code, I made it efficient to the extreme. One thing I did was
    > replace, where possible, any usages of "strlen" with something like:
    >
    > struct PtrAndLen {
    > char *p;
    > size_t len;
    > };
    >
    > This could be initialised with a string as follows:
    >
    > struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
    >
    > From that point forward in the code, "pal.len" would be used in
    > place of strlen.
    >
    > The code grew though, and at one stage I needed to store info about
    > two strings in one of these structures. To do this, I used a null
    > separator, e.g.:
    >
    > PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};


    Ugh. You deserved anything that happened to you.

    --
    Chuck F (cbfalconer at maineline dot net)
    Available for consulting/temporary embedded and systems.
    <http://cbfalconer.home.att.net>
    CBFalconer, Nov 25, 2006
    #6
  7. >I have never written a program for monetary gain. I program purely for the
    >enjoyment of programming. If I achieve a certain objective, I am not satisfied
    >-- I want to achieve the objective as efficiently as is possible.


    If it doesn't have to work correctly, any program can run in 0 time
    and 0 bytes.

    Even programs like the OS/360's IEFBR14, or the C equivalent:

    int main(void)
    {
    return 0;
    }

    can be made to work faster if they don't have to run correctly.
    Gordon Burditt, Nov 25, 2006
    #7
  8. Frederick Gotham wrote:
    > I thought it might be interesting to share experiences of tracking down a
    > subtle or mysterious bug. I myself haven't much experience with tracking
    > down bugs, but there's one in particular which comes to mind.
    >

    <snip>
    >
    > I'm sure several regulars here have more interesting stories... :)


    Ok, I'll bite.

    One of the most unusual things I've tracked down was a production-level
    C program that suddenly failed with the addition of a single comment in
    a single .c file. Mind you, this was an application deployed on custom
    hardware and in active use at thousands of sites worldwide. It wasn't
    even a new comment-- it was the addition of about 30 characters to an
    existing comment.

    The usual suspects were considered. Perhaps it was a build-order
    issue, I thought. I've run into those before. If there are data
    corruption problems in the code, changes in the order that files are
    compiled and linked can cause the linker to allocate different storage
    locations for different variables. This can cause a "more critical"
    (i.e. pointer) variable to be clobbered rather than another (i.e.
    statistics counter) when corruption occurs.

    No dice. My original build, a clean rebuild, and a rebuild of a clean
    build with several hundred random files touched all failed. Hmmm. At
    best this was inconclusive.

    Now this was an embedded product running the VxWorks RTOS. In the bad
    old days, on the bad old MMU-less platforms, it could take a real
    effort to get debugging information out of a production board if it
    crashed hard. Unfortunately, I couldn't track down any boards with
    debug facilities, so I had to disassemble the chassis and have a rework
    tech solder a DB-9 on it so I could get some serial output from it (my
    soldering isn't the greatest).

    Much to my chagrin, I quickly realized that any substantial code
    modifications other than that comment line caused the code to spring
    back to life. "If you can't beat 'em, diff 'em", I thought. I
    reverted the original change and rebuilt the file, except this time
    saving the assembly and preprocessed output. I then re-added the
    problematic comment and rebuilt, again saving the assembly and
    preprocessed output. On each build I set the randomization seeds to
    the same values, so any pseudo-random numbers used by the compiler
    would have the same values on both builds.

    I fired up the graphical diff program. No changes to the preprocessed
    files except the line numbers, which differed by one. No changes to
    the assembly files, except...

    .... A single additional line of assembly code. Apparently a
    preprocessor macro in a header file far, far away had decided to use
    the value of __LINE__ for field debugging of production code. Since
    the CPU was an ARM7, and the ARM instruction set can only load a 12 bit
    immediate IIRC, and (shockingly) the __LINE__ value was exactly on the
    boundary, the compiler had generated a "load immediate; increment by 1"
    instruction sequence instead of just a "load immediate".
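
    A small sketch of why editing a comment can change generated code when a macro captures __LINE__. The names note_line, DEBUG_HERE, first_site and second_site are hypothetical; the actual macro in that code base is not shown.

```c
/* Hypothetical field-debug hook: remembers the last source line that reported in. */
static int last_line_seen;

static void note_line(int line) { last_line_seen = line; }

/* Expands to a call carrying the line number of the place where it is used. */
#define DEBUG_HERE() note_line(__LINE__)

static int first_site(void)  { DEBUG_HERE(); return last_line_seen; }
/* Adding or removing lines above a use changes the constant compiled into it. */
static int second_site(void) { DEBUG_HERE(); return last_line_seen; }
```

    Each use compiles to a load of a different integer constant, so lengthening a comment earlier in the file shifts every constant below it, and on some targets that changes which instructions are needed to load it.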

    But who cares!?! The instruction sequence was correct, anyways.
    Perhaps an assembler bug? This did not seem to be likely, given "The
    Two General Rules of Compiler and Assembler Bugs":

    1. It's not a compiler or assembler bug.
    2. See rule #1*.

    * Unless you can prove it.

    I verified with objdump and the ARM Architecture Reference Manual that
    the opcodes being generated by the assembler were indeed correct.
    Hmmm.

    Every code modification I could think of caused things to start working
    again. Things were seeming to me to point toward a hardware bug of
    some kind. But given "The Two General Rules of Hardware Bugs": ...

    1. It's not a hardware bug.
    2. See rule #1*.

    * Unless you can prove it.

    .... I now had to prove it, especially since the units were already in
    the field. On a hunch I looked at the assembly code directly after
    this new additional instruction. I saw a load of a value from memory
    that was added to another value and used as a pointer.

    I inserted a jump to a routine I wrote in assembly that moved the
    freshly-loaded value to a known memory location. It then jumped to
    some C code in another translation unit that dumped the value, and then
    reloaded and re-dumped it.

    The value of the original read was 0, and the reload of the same
    location generated a valid pointer value. Obviously they should have
    been equal (I had already ruled out concurrent access problems
    previously).
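
    The original check was written in assembly. As a rough C analogue only, and assuming nothing else writes to the location between the two reads, a read-twice comparison could be sketched like this (reads_agree is a made-up name):

```c
#include <stdint.h>

/* Reads the same location twice and reports whether the two reads agree.
   volatile stops the compiler from merging the two loads into one. */
static int reads_agree(const volatile uintptr_t *addr,
                       uintptr_t *first, uintptr_t *second)
{
    *first = *addr;
    *second = *addr;
    return *first == *second;
}
```

    On healthy hardware this always returns nonzero; a mismatch with no concurrent writer points below the software.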

    Turns out the memory subsystem was laid out by a junior designer at
    another company who had incorrectly chained some of the RAM clock
    traces (IIRC), where they should have all had equal lengths. This
    particular instruction sequence, combined with the peculiar cache
    behavior of the application had caused it to issue memory reads that
    exposed the RAM subsystem timing problems. Rumor had it that somebody
    at the other company had done a quickie board spin to increase the
    amount of RAM without properly reviewing the change. Amazingly, the
    rest of the half-million lines of code seemed to run just fine. How, I
    will never know.

    The entire thing was so unlikely that I couldn't help but think,
    "Great. I'll probably be hit with a meteorite on the way to my car
    tonight." Fortunately the bad luck stopped with a new spin of the
    board, and I'm still here to tell the tale.


    Mark F. Haigh
    Mark F. Haigh, Nov 25, 2006
    #8
  9. Paul Hsieh

    Frederick Gotham wrote:
    > I thought it might be interesting to share experiences of tracking down a
    > subtle or mysterious bug. I myself haven't much experience with tracking
    > down bugs, but there's one in particular which comes to mind.
    >
    > I was writing usable which dealt with strings. As per usual with my code, I
    > made it efficient to the extreme. One thing I did was replace, where
    > possible, any usages of "strlen" with something like:
    >
    > struct PtrAndLen {
    > char *p;
    > size_t len;
    > };
    >
    > This could be initialised with a string as follows:
    >
    > struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
    >
    > From that point forward in the code, "pal.len" would be used in place of
    > strlen.
    >
    > The code grew though, and at one stage I needed to store info about two
    > strings in one of these structures. To do this, I used a null separator,
    > e.g.:
    >
    > PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};
    >
    > All of this, however, was expanded by macros, so I actually had something
    > like:
    >
    > MAKE_STR_INFO("Hello\0Bonjour")
    >
    > The problem with this, however, was that "strlen" and "pal.len" had
    > different values, because strlen only read as far as the first null
    > terminator. Anyway, I had to read through the code in detail before the bug
    > jumped out at me.


    First of all (sizeof "inline string") is 1+strlen ("inline string").
    So I assume you compensated for this in your macro.

    Second of all, in "The Better String Library", which does the same
    thing, this is not a bug but, in fact, the correct behavior. '\0' is a
    legitimate character, not a string terminator. Where the semantics
    coincide (which is most of the time, when dealing with pure text data)
    you can assume strlen(b->data) is the same as b->slen. In
    Bstrlib, you would never try to mash two strings together using some
    kind of hacked representation such as "string1\0string2", that would
    make no sense. Because Bstrlib is more consistent in this respect,
    these sorts of bugs are far less likely.
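
    To illustrate the length-counted view of a string, here is a minimal sketch. It deliberately does not use Bstrlib's API; struct counted and counted_equal are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative type only; not the Bstrlib structure. */
struct counted { const char *data; size_t len; };

/* Compare by stored length so embedded '\0' bytes are treated as ordinary data. */
static int counted_equal(struct counted a, struct counted b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

    With this comparison, "ab\0cd" and "ab\0ce" are different values, whereas strcmp sees both as "ab".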

    > I'm sure several regulars here have more interesting stories... :)


    Oh sure:

    1) if (a < 0) a = -a; b = sqrt (a);

    2) Anything involving a stack overrun with stack checking turned off.
    You just have to be inspired to imagine that this is your problem. The
    standard is worthless for helping you here.

    3) Assuming that vararg parameters were passed by value and could be
    "reset" by retrieving its original value. (No debugger or compiler
    diagnostic can help you figure out what is going wrong here.)

    4) Watching Microsoft Visual C++ barf on struct tagbstring b = {
    sizeof("string")-1, -__LINE__, "string" }; because MS's preprocessor
    emitted something like _line+425 for __LINE__, and it complained that
    it was not a compile time constant.

    5) Adventures with WATCOM C/C++ v11.x's optimizations with "-ol" turned
    on. It just fails to build correct code for about 10% of the source
    I've written. These are real fun to track down. Like the stack
    checking thing, you just have to be inspired to try turning the flag
    off to see if it fixes the problem.
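
    On item 3 above: since C99, the supported way to walk a variable argument list more than once is va_copy. The post does not show the failing code, so this is only a sketch of the safe pattern, with sum_twice as a made-up example function.

```c
#include <stdarg.h>

/* Sums `count` ints twice: the second pass needs its own va_list. */
static int sum_twice(int count, ...)
{
    va_list first, second;
    int i, total = 0;

    va_start(first, count);
    va_copy(second, first);      /* C99: take the copy before consuming anything */

    for (i = 0; i < count; i++)
        total += va_arg(first, int);
    for (i = 0; i < count; i++)
        total += va_arg(second, int);

    va_end(second);
    va_end(first);
    return total;
}
```

    Saving a va_list by plain assignment and reusing it later is not portable, because the type may be an array or a pointer into state that va_arg mutates.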

    Then there's the standard "I forgot I made assumption X in function Y
    then passed it parameters which technically violated X even though it
    wasn't obvious that it was". Unfortunately, in the C language, these
    assumptions often take the form of "allocated at least some certain
    amount of space" or "the parameter is a well-formed non-empty linked
    list" etc, and the error is usually undefined behavior.

    I don't do a lot of heap or stack smashing anymore these days, as I
    generally wrap things in rigorous enough abstractions, and I just
    generally use debug heaps while developing. But there can still be
    problems of convention. A hash table I implemented has an iterator
    mechanism, and I made the termination condition when the index was
    greater than the current hash table size -- the problem is that when I
    came back to reuse this code after more than a year, I forgot my
    convention for termination and thought it was when the index was < 0.
    So I walked off the end of the hash array nicely because I did not
    sufficiently document the convention. The problem is that I was using
    -1 as the start-up index (since 0 may or may not be a valid entry, and
    you *have* to perform an increment on every call to the iterator
    incrementor) and so could not use < 0 as the terminator condition. But
    it meant that my intuition conflicted with what was necessary. I fixed
    this by creating an "isDone" macro for the iterator.
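
    A sketch of how such an iterator might look, so the termination rule lives in exactly one place. The table layout, the fixed TABLE_SIZE, the ">=" test and all names here are assumptions made for illustration, not the poster's code.

```c
#include <stddef.h>

#define TABLE_SIZE 8

struct table { const char *slot[TABLE_SIZE]; };   /* NULL marks an empty slot */
struct iter  { const struct table *t; long index; };

/* The one place that knows the termination rule. */
#define ITER_IS_DONE(it) ((it)->index >= TABLE_SIZE)

static void iter_begin(struct iter *it, const struct table *t)
{
    it->t = t;
    it->index = -1;            /* start before slot 0 */
}

/* Advance to the next occupied slot, or to TABLE_SIZE if none remain. */
static void iter_next(struct iter *it)
{
    do {
        it->index++;
    } while (!ITER_IS_DONE(it) && it->t->slot[it->index] == NULL);
}
```

    Callers then write for (iter_next(&it); !ITER_IS_DONE(&it); iter_next(&it)) and never need to remember whether "done" means negative or past the end.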

    With multithreaded errors, I already know a priori that they are
    difficult. When I can, and I detect such a bug, I will spend a short
    amount of time trying to track it down. If I can't get it, I junk the
    contentious code and start over. It's just a matter of productivity --
    these bugs can be so hard, that it will take longer to track them down
    than to rewrite the code. Sometimes I don't learn/figure out what I
    did wrong, but life is too short.

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
    Paul Hsieh, Nov 25, 2006
    #9
  10. Gordon Burditt:

    >>I have never written a program for monetary gain. I program purely for
    >>the enjoyment of programming. If I achieve a certain objective, I am not
    >>satisfied -- I want to achieve the objective as efficiently as is
    >>possible.

    >
    > If it doesn't have to work correctly, any program can run in 0 time
    > and 0 bytes.



    Which is a tremendous argument, only that my programs do run right in the
    end.

    --

    Frederick Gotham
    Frederick Gotham, Nov 25, 2006
    #10
  11. Eric Sosman

    [OT] Re: Most Interesting Bug Track Down

    Gordon Burditt wrote:

    >>I have never written a program for monetary gain. I program purely for the
    >>enjoyment of programming. If I achieve a certain objective, I am not satisfied
    >>-- I want to achieve the objective as efficiently as is possible.

    >
    >
    > If it doesn't have to work correctly, any program can run in 0 time
    > and 0 bytes.
    >
    > Even programs like the OS/360's IEFBR14, or the C equivalent:
    >
    > int main(void)
    > {
    > return 0;
    > }
    >
    > can be made to work faster if they don't have to run correctly.


    <off-topic>

    Wikipedia's article on IEFBR14 makes amusing reading. The
    original one-instruction program had a bug, which must set some
    kind of unenviable standard for "fault density:" one bug per
    machine instruction! Not only that, but the fix had a bug, and
    the fix to the fix had a bug, and it wasn't until the fourth
    version of IEFBR14 that the program "did nothing" correctly.

    The final version had three times as many lines of source as
    the first, executed three times as many instructions, and occupied
    eight times as much memory. Code bloat wasn't invented in Redmond.

    </off-topic>

    --
    Eric Sosman
    Eric Sosman, Nov 25, 2006
    #11
  12. Frederick Gotham wrote:
    > I thought it might be interesting to share experiences of tracking down a
    > subtle or mysterious bug. I myself haven't much experience with tracking
    > down bugs, but there's one in particular which comes to mind.
    >
    > I was writing usable which dealt with strings. As per usual with my code, I
    > made it efficient to the extreme. One thing I did was replace, where
    > possible, any usages of "strlen" with something like:
    >
    > struct PtrAndLen {
    > char *p;
    > size_t len;
    > };
    >
    > This could be initialised with a string as follows:
    >
    > struct PtrAndLen const pal = { "Hello", sizeof "Hello" };
    >
    > From that point forward in the code, "pal.len" would be used in place of
    > strlen.
    >
    > The code grew though, and at one stage I needed to store info about two
    > strings in one of these structures. To do this, I used a null separator,
    > e.g.:
    >
    > PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};
    >
    > All of this, however, was expanded by macros, so I actually had something
    > like:
    >
    > MAKE_STR_INFO("Hello\0Bonjour")
    >
    > The problem with this, however, was that "strlen" and "pal.len" had
    > different values, because strlen only read as far as the first null
    > terminator. Anyway, I had to read through the code in detail before the bug
    > jumped out at me.
    >
    > I'm sure several regulars here have more interesting stories... :)
    >




    The ugliest bug (though by no means the hardest)
    I ever had to fix was

    strcmp(search_string, TARGET_STRING)

    where TARGET_STRING was #defined
    as a string literal. I had a problem
    that even though search_string was the same as the
    define, it was not matching.

    Turns out the code had been changed from using
    us_satellite1 and europe_satellite to using
    us_satellite1, us_satellite2 and europe_satellite.
    Because the same thing was done for both
    us_satellite1 and us_satellite2, it was only necessary
    to check the us_satellite part. Enter string_strip, which,
    in addition to taking off whitespace, was modified to
    remove trailing numbers. Now all that was needed
    was to strip the string, then compare. To get
    the comparison string you strip TARGET_STRING.
    You only need to do this once as the name does
    not change.
    (The contract on the programmer responsible is still
    open. It requires proof of a slow agonizing death.)
    Of course, if the compiler had refused to modify
    a string literal this would not have worked.
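
    Modifying a string literal is undefined behaviour in C. A minimal sketch of a safer arrangement strips a writable copy instead. The helper names are hypothetical, the literal value is assumed, and only the trailing-digit part of the stripping is shown.

```c
#include <ctype.h>
#include <string.h>

#define TARGET_STRING "us_satellite1"   /* assumed value, for illustration */

/* Removes trailing digits in place; the caller must pass writable storage. */
static void strip_trailing_digits(char *s)
{
    size_t n = strlen(s);
    while (n > 0 && isdigit((unsigned char)s[n - 1]))
        s[--n] = '\0';
}

/* Compares against a stripped private copy, leaving the literal untouched. */
static int matches_target(const char *stripped_search)
{
    char target[] = TARGET_STRING;   /* array copy, safe to modify */
    strip_trailing_digits(target);
    return strcmp(stripped_search, target) == 0;
}
```

    Because the array is initialised from the literal on every call, every other use of TARGET_STRING still sees the original text.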

    Three bugs that are also cautionary tales about
    undefined behaviour.

    One large program started failing at odd moments. We traced the
    problem to some changes made by one of the programmers. However,
    these changes were made to a completely different part of the program,
    and they were only intended to add another input format. Furthermore,
    the changes looked fine and worked correctly.

    On looking at the code I noticed that a standard library function
    (I don't recall exact details) was called in a non-standard way.
    Knowing that this couldn't possibly be the problem, I changed the call
    to conform to the standard. The bug disappeared.
    Undefined behaviour includes appearing to do exactly what you
    want, but causing subtle problems elsewhere.

    Another bug had to do with the equivalent of

    a[i] = i++

    This was in a piece of code we had used for years, on many platforms
    and many compilers. One day the code didn't work correctly on
    the SGI (this is also an argument for test suites with known results,
    the failure mode was a small loss of accuracy, not a complete
    failure.).
    Yep, turned out that a compiler upgrade was the culprit. Undefined
    behaviour can lie dormant for years.
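
    Assuming the expression was of the form a[i] = i++ (the forum software appears to have swallowed the subscript), the usual repair is to separate the use of i from its modification. A sketch, with fill_indices as a made-up name:

```c
/* Stores each index into its own slot, then advances. The read of i used for
   the subscript and the increment are in separate statements, so the order
   is fixed and no longer depends on the compiler. */
static void fill_indices(int *a, int n)
{
    int i = 0;
    while (i < n) {
        a[i] = i;
        i++;
    }
}
```

    The rewritten loop states which value of i is used as the subscript, which is the thing the one-line version left open.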

    A third, fairly easy to find bug, had to do with the use of
    the nonstandard strdup. The code refused to compile on one
    platform, even though this platform had a strdup function. Turned
    out that the culprit here was a #define POSIX statement. If
    POSIX was defined, the strdup was not available. So
    the argument "feature X is available on almost all machines"
    has a problem. Yes feature X is available but it may only
    be available in a nonstandard mode.
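
    One way to sidestep the question of whether strdup is declared is to carry a small replacement built only from standard C functions. This is a sketch of that alternative, not what the poster's code did; dup_string is a made-up name.

```c
#include <stdlib.h>
#include <string.h>

/* Portable replacement: returns a malloc'd copy of s, or NULL on failure.
   The caller frees the result. */
static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;      /* include the terminator */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```

    Using a different name avoids clashing with a platform strdup that may or may not be visible depending on the compilation mode.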

    - William Hughes
    William Hughes, Nov 25, 2006
    #12
  13. Ian Collins

    Mark F. Haigh wrote:
    >
    > Turns out the memory subsystem was laid out by a junior designer at
    > another company who had incorrectly chained some of the RAM clock
    > traces (IIRC), where they should have all had equal lengths. This
    > particular instruction sequence, combined with the peculiar cache
    > behavior of the application had caused it to issue memory reads that
    > exposed the RAM subsystem timing problems. Rumor had it that somebody
    > at the other company had done a quickie board spin to increase the
    > amount of RAM without properly reviewing the change. Amazingly, the
    > rest of the half-million lines of code seemed to run just fine. How, I
    > will never know.
    >

    That reminds me of one of mine - "your driver doesn't work in the afternoon".

    About 2 o'clock each day, an HDLC driver I had written started to
    experience random crashes, no amount of debug code or simulation could
    catch the problem. As a contractor, I was under immense pressure from
    the hardware people to fix "my broken software". In the end I came in
    over the weekend so I could pinch a decent scope and logic analyser,
    hooked up a couple of dozen probes and waited.

    Sure enough, mid afternoon the board fell over. But this time I had the
    evidence, a late data acknowledge strobe. Turns out the hardware
    designer had used a 47K rather than a 4K7 pull-up resistor. Which just
    about pulled the strobe high in time until the lab temperature got above
    20C. It was lucky for us we were testing in the summer.

    Needless to say, I enjoyed the team meeting on the following Monday morning!

    --
    Ian Collins.
    Ian Collins, Nov 25, 2006
    #13
  14. Ian Collins wrote:

    <snip>

    > That reminds me of one of mine - "your driver doesn't work in the afternoon".
    >
    > About 2 o'clock each day, an HDLC driver I had written started to
    > experience random crashes, no amount of debug code or simulation could
    > catch the problem. As a contractor, I was under immense pressure from
    > the hardware people to fix "my broken software". In the end I came in
    > over the weekend so I could pinch a decent scope and logic analyser,
    > hooked up a couple of dozen probes and waited.


    HDLC, hmm? Did you ever get stuck dealing with the Intel IXPs? They
    might have looked great on paper, but those things were complete turds
    to program. Stitching buggy software together with buggy microcode,
    ugh. At least with a PCI device you get a ~50% correct datasheet and a
    T-shirt from the vendor, but with the IXP that kind of thing seems like
    a utopian dream.

    I suppose a topical conclusion can be drawn-- always write bomb-proof C
    so you can prove it's a hardware problem.

    >
    > Sure enough, mid afternoon the board fell over. But this time I had the
    > evidence, a late data acknowledge strobe. Turns out the hardware
    > designer had used a 47K rather than a 4K7 pull-up resistor. Which just
    > about pulled the strobe high in time until the lab temperature got above
    > 20C. It was lucky for us we were testing in the summer.


    Here's a funny one. A prototype high-end system that I was working on
    was slated to get some new high-CFM (cubic feet / minute) fans because
    of a new processor's increased heat output. These particular fans had
    tachometer / RPM-sensing support. The plan was to use it to detect
    failed fans and alert the user to order a replacement. The system
    could run properly at normal room temperature with 2 failed fans.

    It was a prototype board, so I had no chassis to mount the fans to. I
    just put them on my desk, plugged them in, and powered up. The fans
    went airborne. I quickly powered down and duct-taped the fans to my
    desk.

    Unfortunately the wires to the fans were quite short and the fans were
    uncomfortably close to the power supply switch. I would just have to
    be careful, I thought.

    After a couple of days, I had all of the fan driver functionality
    implemented. It could give you instantaneous RPM, low / high RPM
    watermarks, and it was tied in to the rest of the system.

    Near the end of the second day, I was distracted and reached over
    carelessly to power the system down. Big mistake. The highly curved
    and very sharp blades of one of the fans reached out and cut the very
    tip of my right thumb off.

    As my thumb bled into a paper towel, I thought "OK, time for a beer",
    and headed across the street to grab a couple of pints. I eventually
    got it all clotted up and bandaged. I headed back to my desk and
    powered up the system...

    Sure enough, the system had logged an alert for a possible fan failure
    due to low RPM for the period during which it was de-skinning the tip
    of my thumb.

    >
    > Needless to say, I enjoyed the team meeting on the following Monday morning!
    >


    Not me. Bandage was still on my thumb, and everybody still thought it
    was hilarious (including me).

    Ok, ok, sorry. Yes, it's off topic. Yes, I'll shut up now.


    Mark F. Haigh
    Mark F. Haigh, Nov 26, 2006
    #14
  15. Ian Collins

    Mark F. Haigh wrote:
    >
    > HDLC, hmm? Did you ever get stuck dealing with the Intel IXPs? They
    > might have looked great on paper, but those things were complete turds
    > to program.
    >

    No, MC68360, nice to program, not too buggy microcode!

    > Not me. Bandage was still on my thumb, and everybody still thought it
    > was hilarious (including me).
    >
    > Ok, ok, sorry. Yes, it's off topic. Yes, I'll shut up now.
    >

    Amusing nonetheless.

    --
    Ian Collins.
    Ian Collins, Nov 26, 2006
    #15
  16. Frederick Gotham

    goose Guest

    Frederick Gotham wrote:

    <snipped jumping bug>

    >
    > I'm sure several regulars here have more interesting stories... :)
    >


    On the (seriously drain-bamaged) platform I am working on
    right now, there are calls to open a file, read data from a file,
    write data to a file and rename a file. Yup ... no close file function!

    I ran across an interesting bug last week; when a file has been
    opened for writing, but no data gets written to it and then the
    file is renamed ... the file contains junk data of a fairly random
    length.

    Try explaining *that* bug to your boss. Luckily, it seems
    that the newer models of this particular device will (when
    we get them in 2007/2008) be running ARM Linux, and not
    this current POS-specially-written-OS...

    goose,
    goose, Nov 26, 2006
    #16
  17. "William Hughes" <> writes:
    [snip]
    > A third, fairly easy to find bug, had to do with the use of
    > the nonstandard strdup. The code refused to compile on one
    > platform, even though this platform had a strdup function. Turned
    > out that the culprit here was a #define POSIX statement. If
    > POSIX was defined, the strdup was not available. So
    > the argument "feature X is available on almost all machines"
    > has a problem. Yes feature X is available but it may only
    > be available in a nonstandard mode.


    Are you sure about that? I'd expect strdup() to be available only if
    POSIX *is* defined.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
    Keith Thompson, Nov 27, 2006
    #17
  18. Frederick Gotham

    Ben Pfaff Guest

    Keith Thompson <> writes:

    > "William Hughes" <> writes:
    > [snip]
    >> A third, fairly easy to find bug, had to do with the use of
    >> the nonstandard strdup. The code refused to compile on one
    >> platform, even though this platform had a strdup function. Turned
    >> out that the culprit here was a #define POSIX statement. If
    >> POSIX was defined, the strdup was not available. So
    >> the argument "feature X is available on almost all machines"
    >> has a problem. Yes feature X is available but it may only
    >> be available in a nonstandard mode.

    >
    > Are you sure about that? I'd expect strdup() to be available only if
    > POSIX *is* defined.


    According to the SUSv3 "Change History" for strdup, it was
    originally part of the X/OPEN UNIX extension, so it seems
    believable that a strict POSIX mode might cause it to be omitted.

    Perhaps it is worth commenting that #define is not a statement.
    --
    Bite me! said C.
    Ben Pfaff, Nov 27, 2006
    #18
  19. Keith Thompson wrote:
    > "William Hughes" <> writes:
    > [snip]
    > > A third, fairly easy to find bug, had to do with the use of
    > > the nonstandard strdup. The code refused to compile on one
    > > platform, even though this platform had a strdup function. Turned
    > > out that the culprit here was a #define POSIX statement. If
    > > POSIX was defined, the strdup was not available. So
    > > the argument "feature X is available on almost all machines"
    > > has a problem. Yes feature X is available but it may only
    > > be available in a nonstandard mode.

    >
    > Are you sure about that? I'd expect strdup() to be available only if
    > POSIX *is* defined.
    >


    No, this was from memory. However, a quick check suggests that strdup
    is not a POSIX function. The problem was definitely that strdup was
    available on the machine, but not if one of our standardization defines
    was used. I seem to recall this was some version of #define POSIX.
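
    A minimal sketch of one way around this kind of visibility problem,
    assuming nothing beyond standard C: carry a private duplicate function
    rather than depending on whether the headers expose strdup under a
    given feature macro. The name dup_string is hypothetical and does not
    come from the code being discussed.

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper (not from the original code): duplicates s using
       only standard C, so it does not depend on strdup being declared.
       Returns NULL if s is NULL or if malloc fails; the caller frees the
       result. */
    char *dup_string(const char *s)
    {
        size_t n;
        char *copy;

        if (s == NULL)
            return NULL;
        n = strlen(s) + 1;      /* include the terminating null */
        copy = malloc(n);
        if (copy != NULL)
            memcpy(copy, s, n);
        return copy;
    }
    ```

    The cost is one more function to maintain, but the code then compiles
    the same way whichever standardization macro is in effect.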

    - William Hughes
    William Hughes, Nov 28, 2006
    #19
  20. Mark F. Haigh wrote:
    > > I'm sure several regulars here have more interesting stories... :)

    >
    > Ok, I'll bite.
    > ...


    Great story! You should post this in alt.folklore.computers.
    Most of my best bug stories also involve hardware, e.g.
    http://james.fabpedigree.com/bug22.htm

    I'll mention one C-language bug that cost me several
    hours of debugging: it seemed like a problem with a controller
    chip, but was actually a violation of C rules on my part.

    I had a function like
    send_to_chip(char *dat, int cnt); /* send dat[0] ... dat[cnt-1] to chip */

    and another function
    whatever( ..., short cmd);
    which needed, among other things, to send the 2-byte cmd
    to the chip. Endianness was not an issue, as my program
    was specific to MC680x0.

    I tried to send the bytes with
    send_to_chip(&cmd, 2);
    but it didn't work. Eventually I found that "&cmd" was pointing
    2 bytes *before* the 2-byte cmd on the stack.
    I don't remember which compiler this was 20 years ago --
    I think it was based on Plauger's and I deduced that the
    peculiar "&cmd" value was a result of porting a little-endian
    compiler to a big-endian machine.

    I consider the compiler behavior clearly flawed, but concede
    now that taking the address of a function argument was
    a violation.
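
    A sketch of one way to avoid passing the address of the parameter at
    all: split the value into a local byte array and send that. It assumes
    a 16-bit short and that the chip wants the high byte first (matching
    the big-endian MC680x0 layout mentioned above). The body of
    send_to_chip here is a hypothetical stand-in that only records bytes,
    and send_cmd is an invented name.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for the real driver routine: it only records
       the bytes it is given so the result can be inspected. */
    static unsigned char sent[8];
    static size_t sent_len = 0;

    void send_to_chip(char *dat, int cnt)
    {
        int i;

        assert(cnt >= 0 && sent_len + (size_t)cnt <= sizeof sent);
        for (i = 0; i < cnt; i++)
            sent[sent_len++] = (unsigned char)dat[i];
    }

    /* Invented name: sends the 2-byte cmd high byte first, without taking
       the address of the parameter itself. */
    void send_cmd(short cmd)
    {
        unsigned int u = (unsigned short)cmd;  /* avoid shifting a negative value */
        unsigned char buf[2];

        buf[0] = (unsigned char)((u >> 8) & 0xFFu);
        buf[1] = (unsigned char)(u & 0xFFu);
        send_to_chip((char *)buf, 2);
    }
    ```

    Building the bytes by shifting also makes the byte order explicit, so
    the same code would behave identically on a little-endian host.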

    > The entire thing was so unlikely that I couldn't help but think,
    > "Great. I'll probably be hit with a meteorite on the way to my car
    > tonight." Fortunately the bad luck stopped with a new spin of the
    > board, and I'm still here to tell the tale.


    Glad you made it!

    James Dow Allen
    James Dow Allen, Nov 28, 2006
    #20