[PATCH] Subtle bug in bignum.c

M

Markus

All --

I'm posting this here because ruby-core and the bug
form do not seem to be working for me.

-- MQR


I have found a rather subtle bug in the bitwise operators defined in
bignum.c, caused, I believe, by uninitialized bits in the
RBIGNUM()->sign. I have a patch which fixes the problem but, as with my
previous patches, someone cleverer than I am (hi Matz!) may have a much
better way of fixing it.

TO DEMONSTRATE THE PROBLEM:

Run the following script. It should produce no output (and occasionally
this is what happens). But frequently, the values of a and b differ.
This should not happen, but it does. I have reproduced the problem on
several different machines, running several different flavors of linux.

The fact that the problem doesn't occur all the time, or even every time
through the loop, and that the extent to which it occurs it can be
affected by running other programs to "rough up" memory between tests
led me to believe that it was an uninitializateion problem.

10.times {
(0x10000..0x20000).each { |i|
a = (i*-12345) & 0xFFFFFFFF
b = (i*-12345) & 0xFFFFFFFF
print "#{i.to_s(2)} #{a.to_s(2)} #{b.to_s(2)}\n" if a != b
}
}

A similar problem can b shown with bitwise-or


10.times {
(0x10000..0x20000).each { |i|
a = (i*-12345) | 0
b = (i*-12345) | 0
print "#{i.to_s(2)} #{a.to_s(2)} #{{b.to_s(2)}\n" if a != b
}
}


But the problem does NOT occur with bitwise-xor:

10.times {
(0x10000..0x20000).each { |i|
a = (i*-12345) ^ 0
b = (i*-12345) ^ 0
print "#{i.to_s(2)} #{a.to_s(2)} #{{b.to_s(2)}\n" if a != b
}
}


MY PATCH:

Looking at the code for the three bitwise ops, I noticed that xor
protected itself from RBIGNUM()->sign having values other than 0 or 1,
but the other ops did not. So I tried this patch (copying code from the
bitwise-xor routine to bitwise-and and bitwise-or):

--- ruby-1.8.1.org/bignum.c 2003-12-22 02:25:49.000000000 -0700
+++ ruby-1.8.1/bignum.c 2004-08-25 11:33:08.000000000 -0700
@@ -1645,6 +1645,8 @@
ds2 = BDIGITS(y);
sign = RBIGNUM(x)->sign;
}
+ RBIGNUM(x)->sign = RBIGNUM(x)->sign?1:0;
+ RBIGNUM(y)->sign = RBIGNUM(y)->sign?1:0;
z = bignew(l2, RBIGNUM(x)->sign || RBIGNUM(y)->sign);
zds = BDIGITS(z);

@@ -1701,6 +1703,8 @@
ds2 = BDIGITS(y);
sign = RBIGNUM(x)->sign;
}
+ RBIGNUM(x)->sign = RBIGNUM(x)->sign?1:0;
+ RBIGNUM(y)->sign = RBIGNUM(y)->sign?1:0;
z = bignew(l2, RBIGNUM(x)->sign && RBIGNUM(y)->sign);
zds = BDIGITS(z);


This fixed the problem, at least in so far as the symptoms went away.
It may fix it correctly, if everything else only looks at the lowest bit
of RBIGNUM()->sign, or it might be a false step if the real design
intent is for RBIGNUM()->sign to always be either 0 or 1.

In any case, someone more knowledgeable than I am in these matters
should look at it (someone who at least knows C?). I have not, for
example, figured out why the || and && don't "mask away" the problem, as
I thought they were always supposed to return either 0 or 1--meaning
that it might be an optimizer problem.

I used GCC 3.2.2 & 3.3.2 when compiling ruby, from otherwise unmodified
1.8.1 sources. I looked in CVS and saw no post 1.8.1 changes that
appeared relevant.

-- Markus (MQR) Roberts
 
T

ts

M> I have found a rather subtle bug in the bitwise operators defined in

It's really subtle :)

M> + RBIGNUM(x)->sign = RBIGNUM(x)->sign?1:0;
M> + RBIGNUM(y)->sign = RBIGNUM(y)->sign?1:0;
M> z = bignew(l2, RBIGNUM(x)->sign || RBIGNUM(y)->sign);
M> zds = BDIGITS(z);

[...]

M> This fixed the problem, at least in so far as the symptoms went away.
M> It may fix it correctly, if everything else only looks at the lowest bit
M> of RBIGNUM()->sign, or it might be a false step if the real design
M> intent is for RBIGNUM()->sign to always be either 0 or 1.

it's a GC problem : your patch make the variable `x' (or `y') available
for the mark phase, without this line `x' can be gc'ed.

Don't ask me why ruby do this : I've not yet look at the generated
assembler, and why it can't find the right `x' on the stack.


Guy Decoux
 
T

ts

t> It's really subtle :)

really, really subtle :)

Breakpoint 4, rb_big_and (x=1074241584, y=1074241584) at bignum.c:1637
1637 y = rb_to_int(y);
(gdb) info register ebx
ebx 0x4007a030 1074241584
(gdb) printf "0x%x\n", x
0x4007a030
(gdb)

the variable `x' is in ebx

1647 get2comp(x, Qtrue);
(gdb) info register ebx
ebx 0x4007a008 1074241544
(gdb) printf "0x%x\n", x
0x4007a008
(gdb) p *(struct RBignum *)x
$8 = {basic = {flags = 13, klass = 1074419244}, sign = 0 '\0', len = 1,
digits = 0x8117870}
(gdb)

it create a new Bignum in rb_big_clone(), `x' is still in ebx

1663 z = bignew(l2, RBIGNUM(x)->sign || RBIGNUM(y)->sign);
(gdb) info register ebx
ebx 0x8117700 135362304
(gdb)

it will call bignew() which will call rb_gc(), but in ebx it has
RBIGNUM(x)->digits. This mean that when ruby will mark the registers it
will not mark the Bignum created by rb_big_clone() and the GC remove the
Bignum created.



Guy Decoux
 
T

ts

t> 1663 z = bignew(l2, RBIGNUM(x)->sign || RBIGNUM(y)->sign);
t> (gdb) info register ebx
t> ebx 0x8117700 135362304
t> (gdb)

t> it will call bignew() which will call rb_gc(), but in ebx it has
t> RBIGNUM(x)->digits. This mean that when ruby will mark the registers it
^^^^^^^^^^^^^^^^^^
t> will not mark the Bignum created by rb_big_clone() and the GC remove the
t> Bignum created.

In reality this is RBIGNUM(y)->digits which is is in ebx (the result is
the same)

(gdb) p *(struct RBignum *)y
$13 = {basic = {flags = 13, klass = 1074419244}, sign = 1 '\001', len = 1,
digits = 0x8117700}
(gdb)


Guy Decoux
 
M

Markus

t> It's really subtle :)

really, really subtle :)

...

...it will call bignew() which will call rb_gc(), but in ebx it has
RBIGNUM(x)->digits. This mean that when ruby will mark the registers it
will not mark the Bignum created by rb_big_clone() and the GC remove the
Bignum created.


Guy --

That makes sense (and I'd agree with the "really, really subtle
assessment). So although my original patch "fixes" the problem, it does
so in a way that is not obvious. To that end, I'd suggest (remember, I
am NOT a C programmer; this compiles and it works, but that's all I can
claim for it) something along the lines of:

--- ruby-1.8.1.org/bignum.c 2003-12-22 02:25:49.000000000 -0700
+++ ruby-1.8.1/bignum.c 2004-08-30 11:06:33.000000000 -0700
@@ -1615,6 +1615,8 @@
VALUE x, y;
{
VALUE z;
+ VALUE *y_gc = NULL;
+ VALUE *x_gc = NULL;
BDIGIT *ds1, *ds2, *zds;
long i, l1, l2;
char sign;
@@ -1624,11 +1626,13 @@
y = rb_int2big(FIX2LONG(y));
}
if (!RBIGNUM(y)->sign) {
- y = rb_big_clone(y);
+ y = rb_big_clone(y);
+ rb_gc_register_address(x_gc = &y);
get2comp(y, Qtrue);
}
if (!RBIGNUM(x)->sign) {
- x = rb_big_clone(x);
+ x = rb_big_clone(x);
+ rb_gc_register_address(x_gc = &x);
get2comp(x, Qtrue);
}
if (RBIGNUM(x)->len > RBIGNUM(y)->len) {
@@ -1655,6 +1659,8 @@
zds = sign?0:ds2;
}
if (!RBIGNUM(z)->sign) get2comp(z, Qfalse);
+ if (y_gc) rb_gc_unregister_address(y_gc);
+ if (x_gc) rb_gc_unregister_address(x_gc);
return bignorm(z);
}

@@ -1670,6 +1676,8 @@
VALUE x, y;
{
VALUE z;
+ VALUE *y_gc = NULL;
+ VALUE *x_gc = NULL;
BDIGIT *ds1, *ds2, *zds;
long i, l1, l2;
char sign;
@@ -1681,10 +1689,12 @@

if (!RBIGNUM(y)->sign) {
y = rb_big_clone(y);
+ rb_gc_register_address(x_gc = &y);
get2comp(y, Qtrue);
}
if (!RBIGNUM(x)->sign) {
x = rb_big_clone(x);
+ rb_gc_register_address(x_gc = &x);
get2comp(x, Qtrue);
}
if (RBIGNUM(x)->len > RBIGNUM(y)->len) {
@@ -1711,6 +1721,8 @@
zds = sign?ds2:(BIGRAD-1);
}
if (!RBIGNUM(z)->sign) get2comp(z, Qfalse);
+ if (y_gc) rb_gc_unregister_address(y_gc);
+ if (x_gc) rb_gc_unregister_address(x_gc);

return bignorm(z);
}


...which may not be ideal but at least is clearer in intent.

In the process of working up this patch, I began to wonder: how do we
know that other operations are not similarly afflicted? It would seem
(IMHO) that relying on temporaries to be findable in registers (if I
understand you correctly) would be fraught with opportunities for this
sort of errors--especially since the gc could be called by deeply nested
code that has no inkling of the use to which it being put.

This could also (again, if I'm understanding correctly) affect code
other than bignum--in fact, any code that makes multiple temporary
objects without registering them in some way. Is this something we
can/should do something about beyond dealing with Bignum.& and Bignum.|?

-- Markus
 
T

ts

M> That makes sense (and I'd agree with the "really, really subtle
M> assessment). So although my original patch "fixes" the problem, it does
M> so in a way that is not obvious.

Well, it fix the problem for *your* compiler and with *your* level of
optimization, probably a more "smart" compiler will be able to optimize
it and remove the variable from the register.

You must see in rb_big_and()

z = bignew(l2, RBIGNUM(x)->sign || RBIGNUM(y)->sign);
zds = BDIGITS(z);

for (i=0; i<l1; i++) {
zds = ds1 & ds2;
}
for (; i<l2; i++) {
zds = sign?0:ds2;
}
if (!RBIGNUM(z)->sign) get2comp(z, Qfalse);
return bignorm(z);

that a "smart" compiler can see that `x' and `y' are not used (in this
part) and remove these variables from the registers but if it do this,
ruby will free BDIGITS(x) (i.e ds1) and BDIGITS(y) (i.e. ds2) and this
why you see the bug.

A patch is *perhaps* to don't give to the compiler the possibility to
optimize these lines (hint "volatile")


Guy Decoux
 
M

Markus

G" == Guy Decoux ts said:
M> That makes sense (and I'd agree with the "really, really subtle
M> assessment). So although my original patch "fixes" the problem, it does
M> so in a way that is not obvious.

G> Well, it fix the problem for *your* compiler and with *your* level of
G> optimization, probably a more "smart" compiler will be able to optimize
G> it and remove the variable from the register....

Yes, of course...that is why I said "fixes" (in quotes). This usage may
be idiomatic, so perhaps I should have been more explicit:
M (paraphrased)>
So although my original patch makes the symptoms go away in the
case I am testing, it does so in a way that sheds no light on
the real problem, which it *hides*.



A patch is *perhaps* to don't give to the compiler the possibility to
optimize these lines (hint "volatile")

That was the intent of the patch in the rest of the post to which you
responded. Could you perhaps take a look at it and see what you think?

I did not do anything with "volatile," in part because I am not really a
C programmer, but mostly because (as I explained in the note after the
revised patch), I don't trust the idea of leaving things in registers in
the hopes that other, unrelated code will find them there.

-- MarkusQ
 
T

ts

M> I did not do anything with "volatile," in part because I am not really a
M> C programmer, but mostly because (as I explained in the note after the
M> revised patch), I don't trust the idea of leaving things in registers in
M> the hopes that other, unrelated code will find them there.

Well, probably I'm wrong but if you write it

VALUE
rb_big_and(x0, y0)
VALUE x0, y0;
{
volatile VALUE x = x0;
volatile VALUE y = y0;

I think that perhaps it will work.

Probably it best to do the same with rb_big_xor() and see if this case can
exist with other functions in bignum.c


Guy Decoux
 
M

Markus

I am confused again. I thought the problem was not with the
_parameters_ x & y, but rather with the clones of them that wind up
being stored in the same variables (if the value is negative, the
program modifies the parameter, replacing it with a modified clone of
the original).

I've always thought (but remember, I don't really know C) that volatile
was a hint to the compiler that code that _used_ something should _not_
hold it in a register, because something else might modify it at any
time (e.g. the system clock). So I am not sure why it would help.

In any case, is trying to force the addresses into registers better than
just temporarily marking them as not-to-be-reaped as I have done in the
revised patch? It seems to me that the whole trick of leaving things in
registers and hoping the garbage collector sees them is untrustworthy in
the first place. Couldn't the specter of better optimization inducing
bugs you raised in your previous post haunt us any time we depended on
the gc somehow knowing we still wanted something just because its
address was in a register?

Or am I missing something?

-- Markus
 
T

ts

M> I've always thought (but remember, I don't really know C) that volatile
M> was a hint to the compiler that code that _used_ something should _not_
M> hold it in a register, because something else might modify it at any
M> time (e.g. the system clock). So I am not sure why it would help.

Yes, you are right.

Now `x' and `y' will not be in a register, and ruby will find it on the
stack if the GC is called in bignew() *even* if a new Bignum is created by
rb_big_clone() and stored at this address.

The initial problem is that the new Bignum (created by rb_big_clone()) was
stored in a register (because `x' was in a register), then the content of
the register was replaced with another variable (RBIGNUM(y)->digits) when
the gc was called and this is why ruby was unable to mark the Bignum


Guy Decoux
 
M

Markus

M> I've always thought (but remember, I don't really know C) that volatile
M> was a hint to the compiler that code that _used_ something should _not_
M> hold it in a register, because something else might modify it at any
M> time (e.g. the system clock). So I am not sure why it would help.

Yes, you are right.

Now `x' and `y' will not be in a register, and ruby will find it on the
stack if the GC is called in bignew() *even* if a new Bignum is created by
rb_big_clone() and stored at this address.

The initial problem is that the new Bignum (created by rb_big_clone()) was
stored in a register (because `x' was in a register), then the content of
the register was replaced with another variable (RBIGNUM(y)->digits) when
the gc was called and this is why ruby was unable to mark the Bignum


Guy Decoux

Ah yes. Thank you. I had not realized (but should have guessed) that
the gc walked the stack to find things that were reachable. I will work
up & test a third (and I hope final) patch.

The only remaining question is: do you think there may be analogous bugs
lurking elsewhere in the code? I suppose it might be possible to hunt
for them in a semi-automated way, but before I embark on that is there
any reason to suppose that this situation is more or less common?

-- Markus
 
T

ts

M> The only remaining question is: do you think there may be analogous bugs
M> lurking elsewhere in the code? I suppose it might be possible to hunt
M> for them in a semi-automated way, but before I embark on that is there
M> any reason to suppose that this situation is more or less common?

No, this is really a very, very special case. Just look in bignum.c if
this case can exist.



Guy Decoux
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,043
Latest member
CannalabsCBDReview

Latest Threads

Top