Help! SIGBUS (object specific hardware error) when calling getline

wqyuwss

Hi,

We have several core dumps in our product. They can be reproduced at
the same place every time: inside the system library function
std::basic_istream<char,std::char_traits<char>>::getline. The pstack
output for the core dump is:

pstack core | c++filt


core 'core of 12214: ../bin/QBE_V5 -X 30017
ffffffff7b944318 __type_0 std::__find_if<const
(__type_0,__type_0,__type_1,const std::random_access_iterator_tag&) (ffffffff7a000bc7, ffffffff7a0014fd, a00000000000001, ffffffff7a0014fd, a, 936) + 20


ffffffff7b952c18 long
,std::_Scan_for_char_val<std::char_traits<char> > >(std::basic_istream<__type_0,__type_1>*,std::basic_streambuf<__type_0,__type_1>*,long,__type_0*,__type_2,__type_3,bool,bool,bool) (ffffffff7a000bc7, 1001e7170, 3ff, 100243bd0, 0, a00000000000005) + 84


ffffffff7b9537c4 std::istream &std::istream::getline(char*,long,char)
(1001e7160, 100243bd0, 400, a, 1001e7160, 0) + 7c
ffffffff7e6b905c int Service::readObj(std::ifstream
&,XEPersistentObj*&) (1001e7160, ffffffff7ffff2a8, 2, 188d04,
ffffffff7bac93b8, 10) + 54
000000010001df08 int initialize(unsigned) (1001ce030, 0, 1001ce1d0,
1001ce1d0, 3400, 1) + 660
000000010002c37c main (3, ffffffff7ffffac8, ffffffffffecb898,
1001bb020, 1001959e8, 134400) + ac
000000010001b6dc _start (0, 0, 0, 0, 0, 0) + 17c
When we debug it with dbx, dbx reports an object specific hardware
error, i.e. a SIGBUS. The dbx output is:
t@null (l@1) program terminated by signal BUS (object specific hardware error)
0xffffffff7b944318: __find_if+0x0020: ldsb [%o4], %o0
(dbx) regs
current thread: t@null
current frame: [1]
g0-g1 0x0000000000000000 0xffffffff7b953748
g2-g3 0x0000000000000000 0x000000010022ae0c
g4-g5 0x0000000000000001 0x0000000000000936
g6-g7 0x0000000000000000 0xffffffff7de02000
o0-o1 0xffffffff7a000bc7 0xffffffff7a0014fd
o2-o3 0x000000000000000a 0xffffffff7fffebbe
o4-o5 0xffffffff7a000bc7 0x000000000000024d
o6-o7 0xffffffff7fffe221 0xffffffff7b9442e8
l0-l1 0xffffffff7de02000 0x0000000000000000
l2-l3 0x000000010023ef40 0xffffffff7b3ebec4
l4-l5 0x0000000000000000 0x0000000000000000
l6-l7 0x0000000000000001 0x0000000000000000
i0-i1 0xffffffff7a000bc7 0xffffffff7a0014fd
i2-i3 0x0a00000000000001 0xffffffff7a0014fd
i4-i5 0x000000000000000a 0x0000000000000936
i6-i7 0xffffffff7fffe3c1 0xffffffff7b952c18
y 0x0000000000000000
ccr 0x0000000000000044
pc 0xffffffff7b944318:__find_if+0x20 ldsb [%o4], %o0
npc 0xffffffff7b94431c:__find_if+0x24 cmp %o0, %o2
(dbx) examine $o4 /s
dbx: warning: unknown language, 'c' assumed
0xffffffff7a000bc7: "EngineCkptInput 99 f " ...

int Service::readObj( ifstream& strm, XEPersistentObj*& retObj )
{
char* tmp=0;
tmp = new char [BUFSIZ];
strm.getline(tmp,BUFSIZ);
Looking at the code, we cannot find anything suspicious. It is a plain
call into the system library. I searched Google for similar cases and
found two links:
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread...

and
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread....

A Sun engineer said, "It's an error returned by software somewhere deep
down the VM system's hat layer; without knowledge of the mapping at the
address, how it was accessed, it's hard to tell what really is the
matter. Basically, the HAT layer is very low-level part of the virtual
memory system. HAT information describes how a memory page is mapped
on the physical side of the VM (i.e. RAM)." He also suggested, "Start
by finding out which address is giving the problem, which instruction
is using the address and how."

In the implementation of getline, a large buffer is allocated and data
is loaded into it. The data is then compared byte by byte with the
requested delimiter character. The ldsb instruction loads a byte from
the buffer address held in register o4 into register o0. The cmp that
follows compares o0 with o2 (0x0a, the delimiter) to check whether the
condition is met.
According to the SPARC instruction set, ldsb loads a signed byte from
memory into a register, so it cannot fault because of address
alignment. The dbx "examine" command also shows that the faulting
address holds the correct content loaded from services.dat. So we
really don't know why the core dump happened.

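The byte scan described above can be sketched in portable C++. This is only a hypothetical model of what the ldsb/cmp pair appears to do, not the actual Sun library source; the names scan_for_delim and dump_range_length are made up for illustration. The second function assumes that o0/o4 and o1 in the register dump are the start and end of the scanned range, which fits the first two arguments shown for __find_if in the pstack output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the byte scan; NOT the actual library source.
// Returns a pointer to the first byte equal to delim, or last if none is found.
const char* scan_for_delim(const char* first, const char* last, char delim) {
    for (const char* p = first; p != last; ++p) {
        // Roughly what the dump shows: ldsb [%o4], %o0 ; cmp %o0, %o2
        if (*p == delim) {
            return p;
        }
    }
    return last;
}

// Length of the range taken from the register dump, assuming
// %o0/%o4 is the start and %o1 is the end of the scanned buffer.
std::uint64_t dump_range_length() {
    const std::uint64_t first = 0xffffffff7a000bc7ULL;  // %o0 / %o4
    const std::uint64_t last  = 0xffffffff7a0014fdULL;  // %o1
    return last - first;
}
```

Under that assumption the range is 0x936 bytes long, the same value that appears in g5 and i5 in the dump. The fault was reported on the very first byte of the range (o4 still equals o0).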

Our product will be delivered to the customer in a few days, so this is
very urgent for us. Any input or help would be highly appreciated.


P.S. The OS version is Solaris 10, 64-bit.
% /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Netra t 1400/1405 (4 X
UltraSPARC-II 440MHz)
System clock frequency: 110 MHz
Memory size: 4096 Megabytes


Best Regards
Leslie
 
Gianni Mariani

This is WAY off topic here.

OK OK - here is the hint.

ffffffff7a000bc7 - is not aligned. I have no idea why "ldsb" would
cause an issue with this location, however it appears to be the address
of the basic_istream structure (which would be very odd - no pun
intended). It may be caused by the previous instruction.


 
wqyuwss

Can you explain it more? Why do you think it's due to another
instruction? I think the core dump file always records the exact
position of the fault.
 
Gianni Mariani

Perhaps, perhaps not.

Still, you have an unaligned struct. Why?
 
wqyuwss

No. The unaligned struct is not in my code. It's inside the Solaris
library call getline. When I call getline, the address of the input
parameter is 100243bd0, which is aligned. Inside getline, characters
are compared one by one, so the pointer will land on odd addresses,
which only look unaligned.
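
The alignment argument can be checked with a few lines of C++. This is a minimal sketch; is_aligned is a hypothetical helper written for this illustration, and the two addresses are copied from the pstack and dbx output earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if addr is a multiple of alignment.
// alignment is assumed to be a power of two.
bool is_aligned(std::uint64_t addr, std::uint64_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

// A char only needs 1-byte alignment, so a byte-wise scan may
// legitimately visit odd addresses.
static_assert(alignof(char) == 1, "char must have 1-byte alignment");
```

The buffer passed to getline (0x100243bd0) is 8-byte aligned, while the faulting address (0xffffffff7a000bc7) is odd. An odd address is still valid for a single-byte load, so the odd address alone does not explain the SIGBUS.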
 
