Byte array compare - Speed

MrCoder · Jun 6, 2004

Hey guys, my first post on here so I'll just say "Hello everbody!"

Ok heres my question for you lot.

Is there a faster way to compare 1 byte array to another?

This is my current code

// Check for a match

match = true;

for ( int i = 0; i < arraylen; i++)

{

if( data != data2 )

{

match = false;

break;

}

}

Thanks in advance guys!!

JKop · Jun 6, 2004

MrCoder posted:

Hey guys, my first post on here so I'll just say "Hello everbody!"

Ok heres my question for you lot.

Is there a faster way to compare 1 byte array to another?

This is my current code

// Check for a match

match = true;

for ( int i = 0; i < arraylen; i++)

{

if( data != data2 )

{

match = false;

break;

}

}

Thanks in advance guys!!

If the array is of a variable length, as in your above example, then I would
say that Yes, your code is pretty efficent, although maybe there's some
memory functions I'm unaware of that could do the job better?

If the array is of constant length, I'd suggest the following:

Struct ChocolateArray
{
int monkey[50];
};

int main(void)
{
bool match;

ChocolateArray aa;

ChocolateArray bb;

//Mess with the arrays

if (aa == bb)
{
match = true;
}
else
{
match = false;
}
}

-JKop

ak · Jun 6, 2004

Hey guys, my first post on here so I'll just say "Hello everbody!"

Ok heres my question for you lot.

Is there a faster way to compare 1 byte array to another?

This is my current code

// Check for a match

match = true;

for ( int i = 0; i < arraylen; i++)

{

if( data != data2 )

{

match = false;

break;

}

}

Thanks in advance guys!!

Click to expand...

memcmp( data, data2, arraylen );

Gianni Mariani · Jun 6, 2004

MrCoder said:
Hey guys, my first post on here so I'll just say "Hello everbody!"

Ok heres my question for you lot.

Is there a faster way to compare 1 byte array to another?

This is my current code

// Check for a match

match = true;

for ( int i = 0; i < arraylen; i++)

{

if( data != data2 )

{

match = false;

break;

}

}

On most systems, memcmp may be nicely optimized, it may (on some
platforms) be inlined by the compiler.

i.e.

#include <cstring>
bool isequal( const char * a, const char * b, int len )
{
return ::memcmp( a, b, len ) == 0;
}

using gcc: g++ -mtune=i686 -O3 generates:
memcmp_tst.o: file format elf32-i386

Disassembly of section .text:

00000000 <_Z7isequalPKcS0_i>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 0c sub $0xc,%esp
6: 89 75 f8 mov %esi,0xfffffff8(%ebp)
9: 8b 4d 10 mov 0x10(%ebp),%ecx
c: 8b 75 08 mov 0x8(%ebp),%esi
f: 89 7d fc mov %edi,0xfffffffc(%ebp)
12: 8b 7d 0c mov 0xc(%ebp),%edi
15: fc cld
16: 39 c9 cmp %ecx,%ecx
18: f3 a6 repz cmpsb %es%edi),%ds%esi)
1a: 0f 94 c0 sete %al
1d: 8b 75 f8 mov 0xfffffff8(%ebp),%esi
20: 8b 7d fc mov 0xfffffffc(%ebp),%edi
23: 89 ec mov %ebp,%esp
25: 0f b6 c0 movzbl %al,%eax
28: 5d pop %ebp
29: c3 ret

(Notice no function calls !)

However, on some architectures (Risc - i.e. mips, sparc, hppa etc) there
are a couple of different fast paths and it depends alot on the
processor, cache and usage patterns.

On 64 bit cpu, you might be able to take advantage of reading in large
64 bit chunks at a time. Performance optimizations at this level can get
very involved but unless the profiling indicates that there is a huge
proportion of time spent in memcmp, it's just not worth your time and
effort to optimize it.

memcmp is probably the best choice for a balance of performance and
portability.

Siemel Naran · Jun 6, 2004

Gianni Mariani said:
MrCoder wrote:

match = true;

for ( int i = 0; i < arraylen; i++)

{

if( data != data2 )

{

match = false;

break;

}

Click to expand...

#include <cstring>
bool isequal( const char * a, const char * b, int len )
{
return ::memcmp( a, b, len ) == 0;
}

using gcc: g++ -mtune=i686 -O3 generates:
memcmp_tst.o: file format elf32-i386

Disassembly of section .text:

00000000 <_Z7isequalPKcS0_i>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 0c sub $0xc,%esp
6: 89 75 f8 mov %esi,0xfffffff8(%ebp)
9: 8b 4d 10 mov 0x10(%ebp),%ecx
c: 8b 75 08 mov 0x8(%ebp),%esi
f: 89 7d fc mov %edi,0xfffffffc(%ebp)
12: 8b 7d 0c mov 0xc(%ebp),%edi
15: fc cld
16: 39 c9 cmp %ecx,%ecx
18: f3 a6 repz cmpsb %es%edi),%ds%esi)
1a: 0f 94 c0 sete %al
1d: 8b 75 f8 mov 0xfffffff8(%ebp),%esi
20: 8b 7d fc mov 0xfffffffc(%ebp),%edi
23: 89 ec mov %ebp,%esp
25: 0f b6 c0 movzbl %al,%eax
28: 5d pop %ebp
29: c3 ret

How is the memcmp version different from the original version? Is it
faster? Sorry, I don't know how to read assembly well.

memcmp is probably the best choice for a balance of performance and
portability.

Click to expand...

An implementation may specialize std::equal(begin, end, begin2) to call
std::memcmp for POD types. I don't know how many actually do though. My
implementation (Borland 6) does not.

MrCoder · Jun 6, 2004

Thanks for the answers guys!

Im going to try it using memcmp shortly and see what sort of speed increase
I get.

Siemel Naran · Jun 6, 2004

JKop said:
If the array is of constant length, I'd suggest the following:

Struct ChocolateArray
{
int monkey[50];
};

int main(void)
{
bool match;

ChocolateArray aa;

ChocolateArray bb;

//Mess with the arrays

if (aa == bb)

There is no compiler defined operator== for structs or classes. There's
only operator=, copy constructor, and destructor.

Gianni Mariani · Jun 6, 2004

Siemel said:
How is the memcmp version different from the original version? Is it
faster? Sorry, I don't know how to read assembly well.

The disassembly shows that memcmp functionality is inlined - there is no
call to the memcmp function. It's replaced with the ia32 "repz" prefix
to the "cmps" instruction. This is a single instuction that is
equivalent to the entire for loop in the op's example.

isequal() equivalent to isequal2() below.

bool isequal2( const char * a, const char * b, int len )
{

while ( len -- )
{
if ( * (a++) != * (b++) )
{
return false;
}
}
return true;

}

memcmp_tst.o: file format elf32-i386

Disassembly of section .text:

00000000 <_Z8isequal2PKcS0_i>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 56 push %esi
4: 53 push %ebx
5: 8b 75 08 mov 0x8(%ebp),%esi
8: 8b 5d 0c mov 0xc(%ebp),%ebx
b: 8b 4d 10 mov 0x10(%ebp),%ecx
e: eb 0c jmp 1c <_Z8isequal2PKcS0_i+0x1c>
10: 0f b6 13 movzbl (%ebx),%edx
13: 43 inc %ebx
14: 0f b6 06 movzbl (%esi),%eax
17: 46 inc %esi
18: 38 d0 cmp %dl,%al
1a: 75 0f jne 2b <_Z8isequal2PKcS0_i+0x2b>
1c: 49 dec %ecx
1d: 83 f9 ff cmp $0xffffffff,%ecx
20: 75 ee jne 10 <_Z8isequal2PKcS0_i+0x10>
22: 5b pop %ebx
23: b8 01 00 00 00 mov $0x1,%eax
28: 5e pop %esi
29: 5d pop %ebp
2a: c3 ret
2b: 31 c0 xor %eax,%eax
2d: 5b pop %ebx
2e: 5e pop %esi
2f: 5d pop %ebp
30: c3 ret

As you can see, it has 2 conditional jumps "jne = jump if not equal".
Conditional jumps are costly in modern CPU's that do branch prediction
when the prediction is wrong. If the compiler was really really smart,
it *might* be able to figure out how to do the optimization to use "repz
cmp", but I have yet to see one !

An implementation may specialize std::equal(begin, end, begin2) to call
std::memcmp for POD types. I don't know how many actually do though. My
implementation (Borland 6) does not.

It appears that gcc also does not. Using std::equal

00000040 <_Z8isequal3PKcS0_i>:
40: 55 push %ebp
41: 89 e5 mov %esp,%ebp
43: 53 push %ebx
44: 8b 55 08 mov 0x8(%ebp),%edx
47: 8b 45 10 mov 0x10(%ebp),%eax
4a: 8b 4d 0c mov 0xc(%ebp),%ecx
4d: 89 d3 mov %edx,%ebx
4f: 01 c3 add %eax,%ebx
51: eb 09 jmp 5c <_Z8isequal3PKcS0_i+0x1c>
53: 0f b6 01 movzbl (%ecx),%eax
56: 38 02 cmp %al,(%edx)
58: 75 11 jne 6b <_Z8isequal3PKcS0_i+0x2b>
5a: 42 inc %edx
5b: 41 inc %ecx
5c: 39 da cmp %ebx,%edx
5e: 75 f3 jne 53 <_Z8isequal3PKcS0_i+0x13>
60: 5b pop %ebx
61: b8 01 00 00 00 mov $0x1,%eax
66: 0f b6 c0 movzbl %al,%eax
69: 5d pop %ebp
6a: c3 ret
6b: 5b pop %ebx
6c: 31 c0 xor %eax,%eax
6e: 0f b6 c0 movzbl %al,%eax
71: 5d pop %ebp
72: c3 ret

Note the 2 jne's - basically the same as the previous loop.

Paul Mensonides · Jun 7, 2004

MrCoder said:
Thanks for the answers guys!

Im going to try it using memcmp shortly and see what sort of speed
increase I get.

When you know the size, you can also unroll (or partially unroll) the loop.
This is a semi-generalized implementation:

#include <functional>
#include <iterator>

// identity

template<class T> struct identity {
typedef T type;
};

// map_integral

template<class T, T V> struct map_integral {
static const T value = V;
};

template<class T, T V> const T map_integral<T, V>::value;

// enable_if

template<bool, class R> struct enable_if { };
template<class R> struct enable_if<true, R> : identity<R> { };

// is_class

template<class T> class is_class {
private:
template<class U> static char check(int U::*);
template<class U> static char (& check(...))[2];
public:
static const bool value = sizeof(check<T>(0)) == 1;
};

template<class T> const bool is_class<T>::value;

// is_same

template<class T, class U> struct is_same : map_integral<bool, false> { };
template<class T> struct is_same<T, T> : map_integral<bool, true> { };

// is_pointer_to_function

template<class> struct is_pointer_to_function
: map_integral<bool, false> { };

template<> struct is_pointer_to_function<void*>
: map_integral<bool, false> { };

template<class T> struct is_pointer_to_function<T*>
: is_same<void (T), void (T*)> { };

// is_reference_to_function

template<class> struct is_reference_to_function
: map_integral<bool, false> { };

template<class T> struct is_reference_to_function<T&>
: is_same<void (T), void (T*)> { };

// compare

namespace detail {

template<class iterator, class predicate>
bool compare(
iterator a1, iterator a2,
iterator b1, iterator b2,
predicate pred, std::input_iterator_tag
) {
while (a1 != a2) {
if (b1 == b2 || !pred(*a1++, *b1++)) {
return false;
}
}
return b1 == b2;
}

template<class iterator, class predicate>
bool compare(
iterator a1, iterator a2,
iterator b1, iterator b2,
predicate pred, std::random_access_iterator_tag
) {
typedef typename std::iterator_traits<iterator>::difference_type
difference_type;
difference_type size(a2 - a1);
if (size != b2 - b1) {
return false;
}
// unrolling factor == 4
register difference_type n((size + 3) / 4);
#define body() \
if (!pred(*a1++, *b1++)) { \
return false; \
} \
/**/
switch (size % 4 - !size) {
case 0:
do {
body()
case 3: body()
case 2: body()
case 1: body()
} while (--n);
}
#undef body
return true;
}

} // detail

template<class iterator, class predicate>
inline bool compare(
iterator a1, iterator a2,
iterator b1, iterator b2,
predicate pred
) {
return detail::compare(
a1, a2, b1, b2,
pred, typename std::iterator_traits<iterator>::iterator_category()
);
}

template<class iterator>
inline bool compare(iterator a1, iterator a2, iterator b1, iterator b2) {
return compare(
a1, a2, b1, b2,
std::equal_to<typename std::iterator_traits<iterator>::value_type>()
);
}

template<class T, class predicate>
inline bool compare(const T* a, const T* b, unsigned size, predicate pred) {
return compare(a, a + size, b, b + size, pred);
}

template<class T, unsigned long s1, unsigned long s2, class predicate>
inline typename enable_if<
is_class<predicate>::value
|| is_pointer_to_function<predicate>::value

|| is_reference_to_function said:
::type compare(T (& a)[s1], T (& b)[s2], predicate pred) {

return compare(a, a + s1, b, b + s2, pred);
}

template<class T> inline bool compare(const T* a, const T* b, unsigned size) {
return compare(a, b, size, std::equal_to<const T>());
}

template<class T, unsigned long s1, unsigned long s2>
inline bool compare(T (& a)[s1], T (& b)[s2]) {
return compare(a, b, std::equal_to<T>());
}

Regards,
Paul Mensonides

Siemel Naran · Jun 7, 2004

Gianni Mariani said:
Siemel Naran wrote:

The disassembly shows that memcmp functionality is inlined - there is no
call to the memcmp function. It's replaced with the ia32 "repz" prefix
to the "cmps" instruction. This is a single instuction that is
equivalent to the entire for loop in the op's example.

isequal() equivalent to isequal2() below.

bool isequal2( const char * a, const char * b, int len )
{

while ( len -- )
{
if ( * (a++) != * (b++) )
{
return false;
}
}
return true;

}

memcmp_tst.o: file format elf32-i386

Disassembly of section .text:

00000000 <_Z8isequal2PKcS0_i>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 56 push %esi
4: 53 push %ebx
5: 8b 75 08 mov 0x8(%ebp),%esi
8: 8b 5d 0c mov 0xc(%ebp),%ebx
b: 8b 4d 10 mov 0x10(%ebp),%ecx
e: eb 0c jmp 1c <_Z8isequal2PKcS0_i+0x1c>
10: 0f b6 13 movzbl (%ebx),%edx
13: 43 inc %ebx
14: 0f b6 06 movzbl (%esi),%eax
17: 46 inc %esi
18: 38 d0 cmp %dl,%al
1a: 75 0f jne 2b <_Z8isequal2PKcS0_i+0x2b>
1c: 49 dec %ecx
1d: 83 f9 ff cmp $0xffffffff,%ecx
20: 75 ee jne 10 <_Z8isequal2PKcS0_i+0x10>
22: 5b pop %ebx
23: b8 01 00 00 00 mov $0x1,%eax
28: 5e pop %esi
29: 5d pop %ebp
2a: c3 ret
2b: 31 c0 xor %eax,%eax
2d: 5b pop %ebx
2e: 5e pop %esi
2f: 5d pop %ebp
30: c3 ret

As you can see, it has 2 conditional jumps "jne = jump if not equal".
Conditional jumps are costly in modern CPU's that do branch prediction
when the prediction is wrong. If the compiler was really really smart,
it *might* be able to figure out how to do the optimization to use "repz
cmp", but I have yet to see one !

It appears that gcc also does not. Using std::equal

00000040 <_Z8isequal3PKcS0_i>:
40: 55 push %ebp
41: 89 e5 mov %esp,%ebp
43: 53 push %ebx
44: 8b 55 08 mov 0x8(%ebp),%edx
47: 8b 45 10 mov 0x10(%ebp),%eax
4a: 8b 4d 0c mov 0xc(%ebp),%ecx
4d: 89 d3 mov %edx,%ebx
4f: 01 c3 add %eax,%ebx
51: eb 09 jmp 5c <_Z8isequal3PKcS0_i+0x1c>
53: 0f b6 01 movzbl (%ecx),%eax
56: 38 02 cmp %al,(%edx)
58: 75 11 jne 6b <_Z8isequal3PKcS0_i+0x2b>
5a: 42 inc %edx
5b: 41 inc %ecx
5c: 39 da cmp %ebx,%edx
5e: 75 f3 jne 53 <_Z8isequal3PKcS0_i+0x13>
60: 5b pop %ebx
61: b8 01 00 00 00 mov $0x1,%eax
66: 0f b6 c0 movzbl %al,%eax
69: 5d pop %ebp
6a: c3 ret
6b: 5b pop %ebx
6c: 31 c0 xor %eax,%eax
6e: 0f b6 c0 movzbl %al,%eax
71: 5d pop %ebp
72: c3 ret

Note the 2 jne's - basically the same as the previous loop.

Thanks for your answers!

Thomas Matthews · Jun 7, 2004

MrCoder said:
Hey guys, my first post on here so I'll just say "Hello everbody!"
Ok heres my question for you lot.
Is there a faster way to compare 1 byte array to another?
This is my current code
// Check for a match
match = true;
for ( int i = 0; i < arraylen; i++)
{
if( data != data2 )
{
match = false;
break;
}
}

Thanks in advance guys!!

First off, search the web and the newgroups for
"premature optimization". Only optimize when absolutely
necessary.

Loop Unrolling (also search the web on this one).
--------------
In most CPU architectures, branches (a.k.a. jumps) are
more expensive (timewise) than data processing instructions.
Some processors also allow for conditional execution of
instructions. Anyway, the fundamental issue here is to
perform more comparisons than branches. Try this:
bool equal(true);
for (unsigned int i = 0;
equal && (i < arraylen)
{
equal = data == data2[i++];
// Repeat the following line many times.
equal = equal && (i < array_length)
&& (data == data2[i++]);
}
The objective here is to use boolean logic and the
short-circuiting of the logical AND operator to reduce
the amount of branching. The more data operations per
branch is better.

Library Routines
----------------
As others have said, use library routines. Some
library routines may be optimized for your platform's
processor while a simple for loop may not be.

Customize In Assembly.
----------------------
You can write a custom memory compare function that
will take advantage of any processor special functions.
The processor I'm working with has the capability to
fetch much data from memory into registers with one
instruction. So for large sized arrays, I would
use a "block" method that would haul in a lot of
data into registers, then compare each register pair.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Siemel Naran · Jun 7, 2004

Loop Unrolling (also search the web on this one).
--------------
In most CPU architectures, branches (a.k.a. jumps) are
more expensive (timewise) than data processing instructions.
Some processors also allow for conditional execution of
instructions. Anyway, the fundamental issue here is to
perform more comparisons than branches. Try this:
bool equal(true);
for (unsigned int i = 0;
equal && (i < arraylen)
{
equal = data == data2[i++];
// Repeat the following line many times.
equal = equal && (i < array_length)
&& (data == data2[i++]);
}
The objective here is to use boolean logic and the
short-circuiting of the logical AND operator to reduce
the amount of branching. The more data operations per
branch is better.

Have you measured the speed increase?

Thomas Matthews · Jun 7, 2004

Siemel said:
message

Loop Unrolling (also search the web on this one).
--------------
In most CPU architectures, branches (a.k.a. jumps) are
more expensive (timewise) than data processing instructions.
Some processors also allow for conditional execution of
instructions. Anyway, the fundamental issue here is to
perform more comparisons than branches. Try this:
bool equal(true);
for (unsigned int i = 0;
equal && (i < arraylen)
{
equal = data == data2[i++];
// Repeat the following line many times.
equal = equal && (i < array_length)
&& (data == data2[i++]);
}
The objective here is to use boolean logic and the
short-circuiting of the logical AND operator to reduce
the amount of branching. The more data operations per
branch is better.

Click to expand...

Have you measured the speed increase?

Only on my assembly language version for an embbeded
system using the ARM processor at 30Mhz.

I've found that it is faster for larger chunks of data,
sizes of 32 bytes or more, than using a simple byte
comparison.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Dave Townsend · Jun 8, 2004

And nobody's mentioned Duff's device yet?

Packing byte fields and an array object into struct	4	Nov 13, 2013
An idea for heap allocation at near stack allocation speed	14	Feb 13, 2011
Survey details won't go through using php, ajax, Mysql	0	Oct 26, 2023
Qsort() is messing with my entire Code!!!	0	Apr 25, 2022
Help with Loop	0	Mar 30, 2023
Qsort() messing with my entire Code	0	Apr 25, 2022
unicode compare errors	3	Dec 10, 2010
GET NEIL DEGRASSES TYSON, I ripped a hole with this one...	0	Nov 10, 2022

Byte array compare - Speed

MrCoder

JKop

ak

Gianni Mariani

Siemel Naran

MrCoder

Siemel Naran

Gianni Mariani

Paul Mensonides

Siemel Naran

Thomas Matthews

Siemel Naran

Thomas Matthews

Dave Townsend

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads