Stuck with a strange core (C code) -- Please help.


SZ

Hi,

I've hit this core dump multiple times with an application (written
in C) running on a BSD OS. Opening the core in gdb shows:

.....
(gdb) bt
#0  0x8541fa3 in mTimerQUnlink (t=0xd7f5a51, head=0x8fec13c)
    at ../../rsvp/eventwheel.c:160
#1  0x854288d in mTimerInsert (t=0xd7f5a51,
    callback=0x873ce88 <retran_unacked>, param=0xd7f5a01, time=50,
    type=1)
    at ../../rsvp/eventwheel.c:536
#2  0x873d266 in retran_unacked (unacked_cb=0xd7f5a00)
    at ../../rsvp/rrrmsgi3.c:1425
....


Notice that the address (0xd7f5a00) of "unacked_cb" (a pointer to a
struct) in frame #2 is passed to mTimerInsert as "param". But
"param" shows up as 0xd7f5a01 (a "1" has been added at the end). All the
subsequent processing is based on this wrong address, which causes the core.
The application is compiled with gcc using "-O2" along with some other
options.

I know the address here cannot be an odd number. This is a weird
problem. The stack shown by the bt command in gdb seems to be complete.

Does anyone know which direction I should take for troubleshooting?

Thanks a lot

-SZ
 

Michael Mair

SZ said:
[snip]
Does anyone know which direction I should take for troubleshooting?

Towards a system or application specific newsgroup or mailing list,
for example, if this is not your own code.
You did not provide us with C code, let alone a minimal example.
Is it something you cooked up yourself? What does the structure
look like? What does the function call look like? ...

It may be an easy-to-spot error, but not without source.
Quality crystal balls don't come cheap.


Cheers
Michael
 

jacob navia

If I do:

void foo(struct foo *p)
{
    char *a = (char *)p;
    a++;
    ...
}

I increment the address by one.

Two possible problems spring to mind:
1:
You are passing a wrong parameter, or the routine is
expecting a char pointer that gets incremented in the
code of the routine.
2:
There is a memory overwrite somewhere.

Procedure:

Follow the called routine (in assembler if you do
not have the source) and see where the value is being changed.

jacob
 

SZ

Thanks, Jacob, for your response.

The address was not purposely incremented in the C code. It is
not a char * either. Incrementing it would result in something much
larger, since the structure is at least 40 bytes.
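
The size argument can be checked with a small sketch. The struct demo_cb below is a hypothetical stand-in (the real RRR_SENT_UNACKED_CB layout was not posted), and the helper names are made up for illustration. It only shows that adding 1 to a struct pointer moves it by the size of the struct, while adding 1 to a char pointer moves it by a single byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real control block; only its size matters. */
struct demo_cb {
    int   fields[8];
    void *links[2];
};

/* Two adjacent objects so that "p + 1" stays inside a real array. */
static struct demo_cb pool[2];

/* Byte distance covered by "p + 1" when p is a struct pointer. */
size_t step_as_struct(void)
{
    struct demo_cb *p = &pool[0];
    struct demo_cb *next = p + 1;   /* advances by sizeof(struct demo_cb) */
    return (size_t)((char *)next - (char *)p);
}

/* Byte distance covered by "a + 1" when the same address is a char pointer. */
size_t step_as_char(void)
{
    char *a = (char *)&pool[0];
    char *next = a + 1;             /* advances by exactly one byte */
    return (size_t)(next - a);
}
```

So a change of exactly one byte, as seen in the backtrace, does not look like ordinary arithmetic on the struct pointer.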

NOTE: the same routine runs many times and the problem
does not occur every time. I do have the source code available. I've checked
the source code many times and did not find anywhere that the address is purposely
changed before it gets passed to the next function.

So, I tend to believe this is a memory violation of some sort. I once added
an assert in the related routine to catch an odd (bad) pointer, but it
then seemed to core somewhere else.

I do not know if there is a good way to catch such a memory violation. I would
appreciate it if anyone knows a way to catch it.
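
For reference, an "odd pointer" assert of the kind mentioned above might look like the following minimal sketch. The struct and the names cb_is_aligned and ASSERT_CB_ALIGNED are assumptions for illustration, and _Alignof requires a C11 compiler (older gcc versions offer __alignof__ instead). The check inspects only the numeric value of the pointer and never dereferences it. It can only report that a pointer is already bad; it does not say what damaged it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the real control block. */
struct demo_cb {
    int   attempts;
    void *next;
};

/* A sample object used to demonstrate the check. */
static struct demo_cb sample_cb;

/* Returns 1 if p is suitably aligned for struct demo_cb, else 0.
   The pointer is converted to an integer and never dereferenced. */
int cb_is_aligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(struct demo_cb)) == 0;
}

/* Convenience wrapper: aborts as soon as a misaligned pointer is seen. */
#define ASSERT_CB_ALIGNED(p) assert(cb_is_aligned(p))
```

Placing ASSERT_CB_ALIGNED(unacked_cb) before and after each call narrows down the call during which the value goes bad.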

Thanks again

-SZ
 

SZ

Thanks for your response, Michael.

Sorry, I did not make it clear. The function is actually run
many times, and it does not core every time. The code
never purposely increments the address. The source is just like the
following (the original source is too long to put here).


int retran_unacked (unacked_cb=0xd7f5a00)
{
    int sub = unacked_cb->sub;   <---- value of sub is correct here.

    other_func(unacked_cb);

    ...                          <-- some local operations.

    mTimerInsert (&unacked_cb->time, callback_func, unacked_cb, time, type);

    ...
}

The function itself is not so complicated and the function (other_func())
cannot change "unacked_cb". So, I have to assume there must be some
kind of memory violation. But I am not sure if there is a good general
way to track down such a problem. I am not even sure whether the violation is
nearby or at some totally irrelevant place.

I can mail you the source if needed.

Thanks again.

-SZ
 

Manuel Petit

SZ said:
int retran_unacked (unacked_cb=0xd7f5a00)

Since you said you were using some flavor of BSD, that address seems to
refer to some stack... so my guess is that the unacked_cb was declared
as an automatic variable, and in this case...
{
int sub = unacked_cb->sub; <---- value of sub is correct here.

other_func(unacked_cb);

... <-- some local operations.

mTimerInsert (&unacked_cb->time, callback_func, unacked_cb, time, type);

.... this line is wrong. You are passing a pointer to a field of an
automatic variable (&unacked_cb->time) to a function that probably
remembers it and will try to modify the pointed-to integer at some later
point in time, when the automatic object is no longer live.


manuel,
 

Michael Mair

Hello SZ,


one thing first: Please do not top-post.

Sorry, I did not make it clear. The function is actually run
many times, and it does not core every time. The code
never purposely increments the address.

Okay, so this really speaks of stack corruption or some other
flavour of pointer trouble.

The source is just like the
following (the original source is too long to put here).

Note: Many people who post code look-alikes leave out the critical
parts so the original code is really necessary if you cannot
break it down to a minimal example.
int retran_unacked (unacked_cb=0xd7f5a00)
{
    int sub = unacked_cb->sub;   <---- value of sub is correct here.

    other_func(unacked_cb);

    ...                          <-- some local operations.

    mTimerInsert (&unacked_cb->time, callback_func, unacked_cb, time, type);

    ...
}

The function itself is not so complicated and the function (other_func())
cannot change "unacked_cb".

Is the place in the parameter lists where you pass unacked_cb
or the address of unacked_cb->time const qualified? If not,
I would say try it. Your compiler may tell you interesting
things about it.

So, I have to assume there must be some
kind of memory violation. But I am not sure if there is a good general
way to track down such a problem. I am not even sure whether the violation is
nearby or at some totally irrelevant place.

Okay, this is somewhat offtopic:
<OT>
You claim to have tried finding the error by just about every other means,
so I have only one suggestion now:
Find out the addresses where things are stored and watch the contents
with hardware watchpoints. Example:
--------------
(gdb) print &object
$1 = (struct objtype *) 0xdeadbeef
(gdb) watch *((struct objtype *)0xdeadbeef)
--------------
It is important to use the actual address; otherwise the watchpoint
will cease to exist after leaving the function where it was defined.
You have to delete the hardware watchpoints before a new run.
</OT>

I can mail you the source if needed.

Don't. If you have some webspace, put it up and post the URL.
There are many people here who know more than I do, or who at least
have time to take a peek at it sooner than I do.


Cheers
Michael
 

SZ

Thanks, Michael, please see the following:


....
Note: Many people who post code look-alikes leave out the critical
parts so the original code is really necessary if you cannot
break it down to a minimal example.
Ok, I've put the source of the problematic function at the end.

Is the place in the parameter lists where you pass unacked_cb
or the address of unacked_cb->time const qualified? Otherwise
I would say try it. Your compiler may tell you interesting
things about it.
I had not used the const qualifier. But I just did, and it compiled without
any warning or error. NOTE: it is not the contents of "unacked_cb" that
get changed; the pointer itself gets incremented unexpectedly.
I tried: retran_unacked(RRR_SENT_UNACKED_CB * const unacked_cb CCXT_T CXT)
and the compiler gave no warning.
[snip hardware watchpoint suggestion]

The problem here is that unacked_cb is a pointer to dynamically
allocated memory, and there are many such pointers. Before the problem occurs,
I don't know which one will have the problem. Therefore, it is not possible
to type the debug command beforehand. I hope the "watch" function can be
coded into the C code to ensure that no such pointer gets changed
during its life cycle.
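
One way to code such a check into the program is to keep a table of every pointer handed out by the allocator and compare against it before use. This is only a sketch under assumptions: the names track_add, track_remove and track_is_live, the fixed capacity and the demo block are all hypothetical, and a real program with many control blocks would probably want a hash table rather than a linear scan. The lookup compares pointer values and never dereferences the pointer being tested, so it is safe to call on a pointer that may already be bad.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LIVE 1024               /* hypothetical capacity */

static const void *live[MAX_LIVE];  /* pointers currently considered valid */
static size_t live_count;

/* Stand-in for a dynamically allocated control block in this demo. */
static char demo_block[64];

/* Record a pointer right after allocation. Returns 0, or -1 if full. */
int track_add(const void *p)
{
    if (live_count == MAX_LIVE)
        return -1;
    live[live_count++] = p;
    return 0;
}

/* Forget a pointer just before it is freed. */
void track_remove(const void *p)
{
    for (size_t i = 0; i < live_count; i++) {
        if (live[i] == p) {
            live[i] = live[--live_count];  /* fill the hole with the last entry */
            return;
        }
    }
}

/* Returns 1 if p is exactly one of the recorded pointers, else 0. */
int track_is_live(const void *p)
{
    for (size_t i = 0; i < live_count; i++) {
        if (live[i] == p)
            return 1;
    }
    return 0;
}
```

A call such as assert(track_is_live(unacked_cb)) at the top of retran_unacked and before mTimerInsert would flag a pointer that is off by one byte, whichever block it belongs to.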

Notice the remarks I put behind the "<--------" signs. The
RRR_SENT_UNACKED_CB struct really has nothing special in it. It has some pointers and
ints defined inside. But in this case, I guess it is irrelevant, since we are talking
about a change to the pointer itself, not to its contents.

Thanks a lot.

-SZ


Here is the source code:

void
retran_unacked(RRR_SENT_UNACKED_CB *unacked_cb CCXT_T CXT)
{
    RRR_NEXT_HOP_CB *next_hop = unacked_cb->next_hop_cb;
    int frr_rc;
    PSB *psbp;

    next_hop->nh_event_usage_count++;   /* <---- next_hop is correct,
                                           so unacked_cb is ok here. */
    if (RPL_RL && (unacked_cb->sa_resend_attempts >= RPL_RL))
    {
        if ((unacked_cb->sa_pkt_msg_type == PATH) ||
            (unacked_cb->sa_pkt_msg_type == RESV))
        {
            frr_rc = RRR_FRR_CONTINUE;
            if (RRR_EX_FAST_REROUTE())
            {
                frr_rc = rrr_frr_proc_sent_unacked_msg(
                    unacked_cb->sa_pkt_msg_type,
                    unacked_cb->spi_msg_id_info->msg_id_psb_parent CCXT);
            }

            if (frr_rc == RRR_FRR_CONTINUE)
            {
                rrr_maybe_send_error_packet(unacked_cb->sa_pkt_msg_type,
                                            unacked_cb->sa_state_handle,
                                            unacked_cb->sa_upstrm_lih_set,
                                            unacked_cb->sa_upstrm_lih
                                            CCXT);
            }
        }

        rrr_delete_sent_unacked_cb(unacked_cb, next_hop CCXT);
    }
    else
    {
        rrr_resend_packet(unacked_cb, next_hop CCXT);

        if (next_hop->nh_use_msg_ids == ATG_NO)
        {
            rrr_delete_sent_unacked_cb(unacked_cb, next_hop CCXT);
        }
        else
        {
            unacked_cb->sa_resend_attempts++;
            if (next_hop->nh_rr_decay != 100) {
                unacked_cb->sa_retrans_interval *=
                    (100 + next_hop->nh_rr_decay);
                unacked_cb->sa_retrans_interval /= 100;
            } else {
                unacked_cb->sa_retrans_interval =
                    unacked_cb->sa_retrans_interval << 1;
            }
            if (unacked_cb->sa_pkt_msg_type == PATH_TEAR) {
                if (!unacked_cb->spi_msg_id_info) {
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                }
                psbp = sent_unacked_cb->spi_msg_id_info->msg_id_psb_parent;
                if (!psbp) {
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                }
                if (psbp->rapid_retran & RPL_PATH_TEAR) {
                    psbp->rapid_retran &= ~(RPL_PATH_TEAR|RPL_PATH_ERROR);
                    rrr_delete_retries_for_psb(psbp,
                                               RRR_KILL_RETRY_FOR_ALL_MSGS CCXT);
                    deferred_kill_PSB(psbp CCXT);
                }
                goto EXIT_LABEL;
            }
            if (unacked_cb->sa_retrans_interval >= RPL_RM) {
                if (unacked_cb->sa_pkt_msg_type == PATH_TEAR) {
                    if (!unacked_cb->spi_msg_id_info) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    psbp = sent_unacked_cb->spi_msg_id_info->msg_id_psb_parent;
                    if (!psbp) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    if (psbp->rapid_retran & RPL_PATH_TEAR) {
                        psbp->rapid_retran &=
                            ~(RPL_PATH_TEAR|RPL_PATH_ERROR);
                        rrr_delete_retries_for_psb(psbp,
                                                   RRR_KILL_RETRY_FOR_ALL_MSGS CCXT);
                        deferred_kill_PSB(psbp CCXT);
                    }
                    goto EXIT_LABEL;
                } else if (unacked_cb->sa_pkt_msg_type == PATH_ERR){
                    if (!unacked_cb->spi_msg_id_info) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    psbp = sent_unacked_cb->spi_msg_id_info->msg_id_psb_parent;
                    if (!psbp) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    if (psbp->rapid_retran & RPL_PATH_ERROR) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        psbp->rapid_retran &= ~RPL_PATH_ERROR;
                        frr_rc = rrr_frr_proc_path_tear(psbp,
                                                        ATG_MPLS_XC_REL_REAS_IF_DOWN,
                                                        TRUE
                                                        CCXT);
                        if (frr_rc == RRR_FRR_CONTINUE) {
                            psbp->ps_rel_reason = ATG_MPLS_XC_REL_REAS_IF_DOWN;
                            tear_or_kill_PSB(psbp CCXT);
                        }
                    }
                    goto EXIT_LABEL;
                } else if (unacked_cb->sa_pkt_msg_type == RESV_TEAR){
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                } else if (unacked_cb->sa_pkt_msg_type == RESV_ERR) {
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                }
                unacked_cb->sa_retrans_interval = RPL_RF;
            }
            unacked_cb->sa_resend_time =
                unacked_cb->sa_retrans_interval;

            mTimerInsert(&unacked_cb->m_timer, (vfcnptr_2)retran_unacked,
                         unacked_cb,unacked_cb->sa_resend_time, TTYPE_ONESHOT);
            /* <------ The stack shows that both &unacked_cb->m_timer
               and unacked_cb are added by 1. */

        }
    }
EXIT_LABEL:
    next_hop->nh_event_usage_count--;
    rrr_maybe_free_next_hop(next_hop CCXT);

    return;

} /* retran_unacked */
 

Peter Slootweg

see inline


[snip]

#include <stdio.h>

/* NOTE: both pointer parameters are void * because one of them may not
   point to a valid object - i.e. off by one byte */
void CheckPointers(const void *p1, const void *p2, const char *context)
{
    if (p1 != p2)
    {
        /* set a break point here */
        fprintf(stderr, "%s %p!=%p\n", context, p1, p2);
    }
}
void
retran_unacked(RRR_SENT_UNACKED_CB *unacked_cb CCXT_T CXT)
{
    RRR_NEXT_HOP_CB *next_hop = unacked_cb->next_hop_cb;
    int frr_rc;
    PSB *psbp;
    RRR_SENT_UNACKED_CB *unacked_cb_back=unacked_cb;
    next_hop->nh_event_usage_count++;   /* <---- next_hop is correct,
                                           so unacked_cb is ok here. */
    if (RPL_RL && (unacked_cb->sa_resend_attempts >= RPL_RL))
    {
        if ((unacked_cb->sa_pkt_msg_type == PATH) ||
            (unacked_cb->sa_pkt_msg_type == RESV))
        {
            frr_rc = RRR_FRR_CONTINUE;
            if (RRR_EX_FAST_REROUTE())
            {
                frr_rc = rrr_frr_proc_sent_unacked_msg(
                    unacked_cb->sa_pkt_msg_type,
                    unacked_cb->spi_msg_id_info->msg_id_psb_parent CCXT);
            }

            if (frr_rc == RRR_FRR_CONTINUE)
            {
                rrr_maybe_send_error_packet(unacked_cb->sa_pkt_msg_type,
                                            unacked_cb->sa_state_handle,
                                            unacked_cb->sa_upstrm_lih_set,
                                            unacked_cb->sa_upstrm_lih
                                            CCXT);
            }
        }

        rrr_delete_sent_unacked_cb(unacked_cb, next_hop CCXT);
    }
    else
    {
        rrr_resend_packet(unacked_cb, next_hop CCXT);

        if (next_hop->nh_use_msg_ids == ATG_NO)
        {
            rrr_delete_sent_unacked_cb(unacked_cb, next_hop CCXT);
        }
        else
        {
            unacked_cb->sa_resend_attempts++;
            if (next_hop->nh_rr_decay != 100) {
                unacked_cb->sa_retrans_interval *=
                    (100 + next_hop->nh_rr_decay);
                unacked_cb->sa_retrans_interval /= 100;
            } else {
                unacked_cb->sa_retrans_interval =
                    unacked_cb->sa_retrans_interval << 1;
            }
            if (unacked_cb->sa_pkt_msg_type == PATH_TEAR) {
                if (!unacked_cb->spi_msg_id_info) {
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                }
                psbp = sent_unacked_cb->spi_msg_id_info->msg_id_psb_parent;
                if (!psbp) {
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                }
                if (psbp->rapid_retran & RPL_PATH_TEAR) {
                    psbp->rapid_retran &= ~(RPL_PATH_TEAR|RPL_PATH_ERROR);
                    rrr_delete_retries_for_psb(psbp,
                                               RRR_KILL_RETRY_FOR_ALL_MSGS CCXT);
                    deferred_kill_PSB(psbp CCXT);
                }
                goto EXIT_LABEL;
            }
            if (unacked_cb->sa_retrans_interval >= RPL_RM) {
                if (unacked_cb->sa_pkt_msg_type == PATH_TEAR) {
                    if (!unacked_cb->spi_msg_id_info) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    psbp = sent_unacked_cb->spi_msg_id_info->msg_id_psb_parent;
                    if (!psbp) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    if (psbp->rapid_retran & RPL_PATH_TEAR) {
                        psbp->rapid_retran &=
                            ~(RPL_PATH_TEAR|RPL_PATH_ERROR);
                        rrr_delete_retries_for_psb(psbp,
                                                   RRR_KILL_RETRY_FOR_ALL_MSGS CCXT);
                        deferred_kill_PSB(psbp CCXT);
                    }
                    goto EXIT_LABEL;
                } else if (unacked_cb->sa_pkt_msg_type == PATH_ERR){
                    if (!unacked_cb->spi_msg_id_info) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    psbp = sent_unacked_cb->spi_msg_id_info->msg_id_psb_parent;
                    if (!psbp) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        goto EXIT_LABEL;
                    }
                    if (psbp->rapid_retran & RPL_PATH_ERROR) {
                        rrr_delete_sent_unacked_cb(unacked_cb,
                                                   next_hop CCXT);
                        psbp->rapid_retran &= ~RPL_PATH_ERROR;
                        frr_rc = rrr_frr_proc_path_tear(psbp,
                                                        ATG_MPLS_XC_REL_REAS_IF_DOWN,
                                                        TRUE
                                                        CCXT);
                        if (frr_rc == RRR_FRR_CONTINUE) {
                            psbp->ps_rel_reason = ATG_MPLS_XC_REL_REAS_IF_DOWN;
                            tear_or_kill_PSB(psbp CCXT);
                        }
                    }
                    goto EXIT_LABEL;
                } else if (unacked_cb->sa_pkt_msg_type == RESV_TEAR){
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                } else if (unacked_cb->sa_pkt_msg_type == RESV_ERR) {
                    rrr_delete_sent_unacked_cb(unacked_cb,
                                               next_hop CCXT);
                    goto EXIT_LABEL;
                }
                unacked_cb->sa_retrans_interval = RPL_RF;
            }
            unacked_cb->sa_resend_time =
                unacked_cb->sa_retrans_interval;
            /* If unacked_cb is bad below, then it should be bad above.
               Work back from here, adding calls to CheckPointers. */
            mTimerInsert(&unacked_cb->m_timer, (vfcnptr_2)retran_unacked,
                         unacked_cb,unacked_cb->sa_resend_time, TTYPE_ONESHOT);
            /* <------ The stack shows that both &unacked_cb->m_timer
               and unacked_cb are added by 1. */

        }
    }
EXIT_LABEL:
    next_hop->nh_event_usage_count--;
    rrr_maybe_free_next_hop(next_hop CCXT);

    return;

} /* retran_unacked */
see top and inline
Add copious calls to CheckPointers(unacked_cb,unacked_cb_back,"Context
String"); throughout retran_unacked to figure out where unacked_cb is being
corrupted.
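
A small variation on the same idea, as a sketch only: wrapping the comparison in a macro lets the preprocessor fill in the file and line, so the context strings do not have to be written by hand. The names check_pointers_at and CHECK_PTR, and the demo buffer, are hypothetical. Note that the saved copy lives on the same stack as the original, so a stray write could in principle hit the copy instead; a mismatch still shows that something wrote where it should not have.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in object so the check can be demonstrated. */
static char demo_buf[8];

/* Returns 1 if the pointers match; logs the location and returns 0 if not. */
int check_pointers_at(const void *now, const void *saved,
                      const char *file, int line)
{
    if (now != saved) {
        /* set a break point here */
        fprintf(stderr, "%s:%d: pointer changed: %p != %p\n",
                file, line, now, saved);
        return 0;
    }
    return 1;
}

/* Fills in the source location automatically at each call site. */
#define CHECK_PTR(now, saved) \
    check_pointers_at((now), (saved), __FILE__, __LINE__)
```

In retran_unacked this would be used as CHECK_PTR(unacked_cb, unacked_cb_back); after each call that receives unacked_cb.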
 

SZ

I think the problem is found:

there is a function, a couple of layers down from
rrr_resend_packet(), that does something like a[i]++ with a[] defined
as a local and i uninitialized. This causes it to randomly pick
a place (on the stack) and increment it.
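
How a stray increment of one byte can turn 0xd7f5a00 into 0xd7f5a01 can be shown without relying on undefined behaviour by simulating the stack frame as a plain byte buffer. Everything here is an assumption made for illustration: the frame layout (a 4-byte array directly followed by a saved pointer), the names, and the idea that the compiler placed the real variables this way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated frame layout: a small local array, then a saved pointer. */
#define ARRAY_BYTES 4
#define FRAME_BYTES (ARRAY_BYTES + sizeof(uintptr_t))

/* Stores ptr_value right after the array, increments the byte at index i
   the way a stray a[i]++ on a byte array would, and returns the pointer
   value read back from the simulated frame. */
uintptr_t simulate_stray_increment(uintptr_t ptr_value, size_t i)
{
    unsigned char frame[FRAME_BYTES] = {0};
    uintptr_t result;

    memcpy(frame + ARRAY_BYTES, &ptr_value, sizeof ptr_value);
    if (i < FRAME_BYTES)       /* keep the simulation itself in bounds */
        frame[i]++;            /* the stray increment */
    memcpy(&result, frame + ARRAY_BYTES, sizeof result);
    return result;
}
```

With an index inside the array the saved pointer comes back unchanged. With i equal to ARRAY_BYTES the increment lands on the first byte of the pointer, which on a little-endian machine is its lowest byte, so an address ending in 00 comes back ending in 01.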

Geez, I hope there is a better, systematic way to catch memory
violations like this.

Thanks everyone for your help.

-SZ
 

Richard Bos

there is a function, a couple of layers down from
rrr_resend_packet(), that does something like a[i]++ with a[] defined
as a local and i uninitialized. This causes it to randomly pick
a place (on the stack) and increment it.

Geez, I hope there is a better, systematic way to catch memory
violations like this.


Well, in this case, a good way to catch it would probably have been not
to rely on a memory tracker and your OS alone, but to turn up the
warning level on your compiler until the knob nearly comes off, and then
turning it one stop back. All compilers I've ever worked with were
capable of telling you that you're using an uninitialised variable,
which would've caught this bug before it got out.
(The reason for the one stop back, btw, is that most compilers' "really
anal-retentive" settings are _too_ strict and pick up things that are,
e.g., only a problem if you're compiling for an embedded system, which
you probably aren't; at least one edition of one compiler famously
complains about its own system headers when set to the highest warning
level. Nevertheless, below the extreme, stricter warnings are better.)
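
As a concrete illustration of the kind of code involved, here is a minimal sketch. The function and its names are hypothetical and are not taken from the posted source. The buggy shape is kept in a comment; with GCC, -Wall includes -Wuninitialized, and on older GCC versions that warning was only produced when optimization was enabled, so something like "gcc -O2 -Wall -Wextra" is a reasonable starting point. The version below assigns the index before every use and range-checks it.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8

/* Buggy shape described in the thread (left as a comment on purpose):
 *
 *     int a[NBUCKETS];
 *     int i;        // never assigned
 *     a[i]++;       // increments whatever happens to sit at a + i
 */

/* Counts how many values fall into each bucket.
   Returns the number of values that were in range. */
size_t count_into_buckets(const int *values, size_t n, int buckets[NBUCKETS])
{
    size_t counted = 0;

    for (size_t b = 0; b < NBUCKETS; b++)
        buckets[b] = 0;

    for (size_t k = 0; k < n; k++) {
        int i = values[k];               /* index is assigned before use */
        if (i >= 0 && i < NBUCKETS) {    /* and checked against the bounds */
            buckets[i]++;
            counted++;
        }
    }
    return counted;
}

/* Fixed-input driver used as a usage example. */
size_t demo_count(void)
{
    static const int values[] = { 1, 7, -3, 42, 1 };
    int buckets[NBUCKETS];

    return count_into_buckets(values, sizeof values / sizeof values[0], buckets);
}
```

Of the five sample values, only 1, 7 and 1 are valid indices, so demo_count() reports three counted values and nothing outside the array is touched.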

Richard
 
