simd-style coding

B

ben-goldberg

I've got some code which was written to make use of some integers as vectors of bits, which deals with only a couple of bits at a time.

I'd like to speed it up, and use whole words at a time... similar to how SIMD coding works.

Here's some pseudocode for how the original bit-at-a-time code works:

outer_loop_start:

advance_lfsr();
old_bits[0] = lfsr_lsb();
have_old[0] = true;
advance_lfsr();
new_bit = lfsr_lsb();
i = 0;

inner_loop_start:

next_bit = new_bit != old_bits;
if( next_bit ) output(new_bit);
new_bit = next_bit;
have_old = false;
i = i + 1;
if( i >= M ) goto outer_loop_start;

if( have_old ) goto inner_loop_start;

have_old = true;
old_bits = new_bit;

goto outer_loop_start;

An LFSR is a Linear Feedback Shift Register, a type of pseudo-random bit generator which produces statistically decent output. This pseudocode takes the bits and hopefully makes them into something cryptographically secure.

But if I can't make it go fast, then it's not worthwhile to try to cryptanalyse it, because there are so many fast, well studied (and believed to be secure) algorithms out there.

So, I want to completely eliminate the inner loop, by keeping an extra unsigned int, and replacing the current inner loop with a few operations which perform, in effect, 32 simultaneous passes through the original inner loop using ands, ors, nots, xors, etc.

What I have so far is:


static unsigned lfsr = 1, have_old = 0, old_bits = 0, new_bits = 0;

void output(unsigned which, unsigned thebits);

void make_some_bits() {
unsigned bit = lfsr & 1;
lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
old_bits += bit;
have_old += 1;
bit = lfsr & 1;
lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
new_bits += bit;

unsigned next_bit = new_bits ^ old_bits;
unsigned output_these = have_old & next_bit;
output( output_these, new_bits );
new_bits = next_bit << 1;

/* Help? */
}

void output(unsigned which, unsigned thebits) {
while( which ) {
int next = which & (which-1);
int lsb = which - next;
printf("%d", (thebits & lsb) ? 1 : 0 );
which = next;
}
}

But I'm not sure what to do next.
 
S

Shao Miller

I've got some code which was written to make use of some integers as vectors of bits, which deals with only a couple of bits at a time.

Ok.

I'd like to speed it up, and use whole words at a time... similar to how SIMD coding works.

C doesn't define the speed of operations. If you want extremes of
speed, you probably don't want to code in C. In C, we can talk about
minimizing the number of side effects and sequence points, but these
aren't exactly related to speed.
Here's some pseudocode for how the original bit-at-a-time code works:

outer_loop_start:

advance_lfsr();
old_bits[0] = lfsr_lsb();
have_old[0] = true;
advance_lfsr();
new_bit = lfsr_lsb();
i = 0;

inner_loop_start:

next_bit = new_bit != old_bits;
if( next_bit ) output(new_bit);
new_bit = next_bit;
have_old = false;
i = i + 1;
if( i >= M ) goto outer_loop_start;

if( have_old ) goto inner_loop_start;

have_old = true;
old_bits = new_bit;

goto outer_loop_start;

An LFSR is a Linear Feedback Shift Register, a type of pseudo-random bit generator which produces statistically decent output. This pseudocode takes the bits and hopefully makes them into something cryptographically secure.

But if I can't make it go fast, then it's not worthwhile to try to cryptanalyse it, because there are so many fast, well studied (and believed to be secure) algorithms out there.


Are you thinking about gathering statistics of the code's output and
scrutinizing those statistics? Is a mathematical analysis (without C
code involvement) not possible?
So, I want to completely eliminate the inner loop, by keeping an extra unsigned int, and replacing the current inner loop with a few operations which perform, in effect, 32 simultaneous passes through the original inner loop using ands, ors, nots, xors, etc.

You might be interested in "loop unrolling". Is that what you're
talking about?
What I have so far is:


static unsigned lfsr = 1, have_old = 0, old_bits = 0, new_bits = 0;

void output(unsigned which, unsigned thebits);

void make_some_bits() {
unsigned bit = lfsr & 1;
lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
old_bits += bit;
have_old += 1;
bit = lfsr & 1;
lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
new_bits += bit;

unsigned next_bit = new_bits ^ old_bits;
unsigned output_these = have_old & next_bit;
output( output_these, new_bits );
new_bits = next_bit << 1;

/* Help? */
}

void output(unsigned which, unsigned thebits) {
while( which ) {
int next = which & (which-1);
int lsb = which - next;
printf("%d", (thebits & lsb) ? 1 : 0 );
which = next;
}
}

But I'm not sure what to do next.

Is the goal to rewrite the original pseudo-code in C, or to speed up
some existing C code, or neither of these, or both?
 
G

glen herrmannsfeldt

C doesn't define the speed of operations. If you want extremes of
speed, you probably don't want to code in C. In C, we can talk about
minimizing the number of side effects and sequence points, but these
aren't exactly related to speed.

Well, yes, but if you do bit manipulation sizeof(int)*CHAR_BIT bits
at a time instead of one bit at a time, it is likely faster.

(snip)

The LFSRs used for CRC calculation are commonly done in software
one byte at a time. That is, the whole 32 bit CRC is processed
with a table lookup, one 32 bit XOR, and a shift. (Not necessarily
in that order.)

Here is a CRC calculating routine and table generating routine.
Though it is more usual to put in a static 256 entry table
initialized with constants.

#define POLYNOMIAL 0x04c11db7L
static unsigned long crc_table[256];
void gen_crc_table()
/* generate the table of CRC remainders for all possible bytes */
{ register int i, j; register unsigned long crc_accum;
for ( i = 0; i < 256; i++ )
{ crc_accum = ( (unsigned long) i << 24 );
for ( j = 0; j < 8; j++ )
{ if ( crc_accum & 0x80000000L )
crc_accum =
( crc_accum << 1 ) ^ POLYNOMIAL;
else
crc_accum =
( crc_accum << 1 ); }
crc_table = crc_accum; }
return; }

unsigned long update_crc(unsigned long crc_accum, char *data_blk_ptr,
int data_blk_size)
/* update the CRC on the data block one byte at a time */
{ register int i, j;
for ( j = 0; j < data_blk_size; j++ )
{ i = ( (int) ( crc_accum >> 24) ^ *data_blk_ptr++ ) & 0xff;
crc_accum = ( crc_accum << 8 ) ^ crc_table; }
return crc_accum; }

-- glen
 
B

ben-goldberg

C doesn't define the speed of operations. If you want extremes of
speed, you probably don't want to code in C. In C, we can talk about
minimizing the number of side effects and sequence points, but these
aren't exactly related to speed.

True, C doesn't define how long any particular operation takes.
But generally, fewer operations takes less time.

For example, suppose my original code were this:

for( int i = 0; i < sizeof(unsigned)*CHAR_BIT; ++i ) {
int bit_from_a = (a >> i) & 1;
int bit_from_b = (b >> i) & 1;
unsigned bit_for_c = bit_from_a ^ bit_from_b;
c = (c & ~(1<<i)) + (bit_for_c << i);
}

and I replaced it with this:
c = a ^ b;

The latter is unquestionably faster, because the compiler isn't smart
enough to convert the first into the second.
Here's some pseudocode for how the original bit-at-a-time code works: [snip]
An LFSR is a Linear Feedback Shift Register, a type of pseudo-random bit
generator which produces statistically decent output. This pseudocode
takes the bits and hopefully makes them into something cryptographically
secure.
But if I can't make it go fast, then it's not worthwhile to try to
cryptanalyse it, because there are so many fast, well studied (and believed
to be secure) algorithms out there.


Are you thinking about gathering statistics of the code's output and
scrutinizing those statistics?

Oh, I've already done that. I know that it's output is statistically good.

And while a statistically bad pseudorandom generator makes a bad cipher,
having a statistically good pseudorandom generator doesn't guarantee
cryptographic goodness.

For example, MT19937 produces statistically excellent output,
and it is very fast, but you can't use it (by itself) as a stream
cipher, since one can very easily use a known plaintext attack
compute MT's internal state.
Is a mathematical analysis (without C
code involvement) not possible?

Determining whether something is cryptographically secure does not require C code, you're quite right in your implication.

However, cryptographic analysis is difficult in general, and quite time consuming.

But even supposing I do analyse the algorithm, and find it to be secure, it's going to be fairly useless if I can't make it run faster than the BBS cipher.
You might be interested in "loop unrolling". Is that what you're
talking about?

Simple loop unrolling would convert the inner loop of the algorithm
into 32 or 64 repetitions of the inside of that loop.

What I want is to eliminate the inner loop entirely,
in the same way the example loop I showed earlier was
converted to the much, much, simpler code, c = a ^ b.

What I have so far is:
[snip]

Is the goal to rewrite the original pseudo-code in C, or to speed up
some existing C code, or neither of these, or both?

I already have some code in C, which does the same as the pseudo-code.

Here's a working code snippet:
http://codepad.org/LRZ5fmWR

But it's still much much much too slow to be a practical cipher.

I want to turn the entire inner loop into three
or four xors, ands, etc, with no loop at all, and
no branching in the code that replaces that loop.

I'm perfectly willing to have the output bits come
out in a different order, if that's necessary for the
code to be sped up.
 
P

Pierre Asselin

[ actual C question snipped ]

An LFSR is a Linear Feedback Shift Register, a type of pseudo-random
bit generator which produces statistically decent output. This
pseudocode takes the bits and hopefully makes them into something
cryptographically secure.

No! These things are not designed for cryptography.

For the C question I was going to suggest you look at CRC routines
on the net, but Glen already did --and provided code. If I remember
correctly there is also a good discussion in the Wikipedia article
(which one is for you to search !)
 
B

Bart van Ingen Schenau

I've got some code which was written to make use of some integers as
vectors of bits, which deals with only a couple of bits at a time.

I'd like to speed it up, and use whole words at a time... similar to how
SIMD coding works.

But I'm not sure what to do next.

As your algorithm works on a different number of bits each time around
the outer loop, your best bet for letting it operate on words (or bytes)
at a time is to use a helper table that stores the values of the masks
and shifts needed for each iteration to compare the right number of bits
and to insert the new bit in the right location.

Bart v Ingen Schenau
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top