Breakthrough


jacob navia

Hi

In a recent discussion ("Open letter to Ian Collins") we found out that
the main problem in the C containers library was the time taken by the
Sort function.

The reason for that is that it was a general sort function calling a
general comparison function that uses memcmp...

The first step to improve performance was to write data-type-specific
functions using a data-type-generic file. As a bonus, we get compile-time
checking and better syntax.

Still, the generic sort function was taking a lot of time.

I have then written a data-type-generic quicksort function that
receives as a parameter a macro that is used to compare two elements.
This vastly improves performance.

The times are:

Generic sort function without any specialized heap
(using malloc)
real 0m17.029s
user 0m16.654s
sys 0m0.368s

Generic sort function using a specialized heap
(reduces the malloc overhead)
real 0m11.759s
user 0m11.500s
sys 0m0.252s

"Templated" sort function using a macro parameter
(reduces comparisons to a simple expression)
real 0m6.453s
user 0m6.165s
sys 0m0.281s

The expression used to compare two list elements is:
#define COMPARE_EXPRESSION(A, B) \
((*B)->Data > (*A)->Data ? -1 : (*B)->Data != (*A)->Data)
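The expression yields a negative value, zero, or a positive value, just
like a qsort() comparison function, but it is expanded inline by the
preprocessor instead of being called through a function pointer. To make
the pattern concrete, here is a minimal, hypothetical sketch of the same
"macro as template parameter" idea; it is NOT the actual listgen.c code
(which works on list nodes), and all names are illustrative:

#include <stdio.h>
#include <stddef.h>

/* Generic sorter "instantiated" by a macro: CMP receives pointers to
   two elements and must evaluate to <0, 0 or >0. A simple insertion
   sort stands in for the real quicksort. */
#define DEFINE_SORT(NAME, TYPE, CMP)                           \
static void NAME(TYPE *v, size_t n)                            \
{                                                              \
    size_t i, j;                                               \
    for (i = 1; i < n; i++) {                                  \
        TYPE key = v[i];                                       \
        for (j = i; j > 0 && CMP((&key), (&v[j-1])) < 0; j--)  \
            v[j] = v[j-1];                                     \
        v[j] = key;                                            \
    }                                                          \
}

/* Instantiate for int: the comparison compiles to a plain expression,
   with no function-pointer call per comparison. */
#define INT_CMP(A, B) ((*(A) > *(B)) - (*(A) < *(B)))
DEFINE_SORT(intSort, int, INT_CMP)

int main(void)
{
    int a[5] = { 3, 1, 4, 1, 5 }, k;
    intSort(a, 5);
    for (k = 0; k < 5; k++)
        printf("%d ", a[k]);   /* prints: 1 1 3 4 5 */
    printf("\n");
    return 0;
}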

I thank Pete for that proposal; I wouldn't have come to it
alone :)

I would like to have the corresponding C++ times, but I fear that
if I write it, it would be unfair since I am not a C++ wizard...
Here is the C code for the example:

#include <stdio.h>
#include <stdlib.h>
#include "containers.h"
#include "intlist.h"
#define N 10000000
int main(void)
{
    intList *L1;
    size_t i;
    int d;
    long long sum=0,newSum=0;
    Iterator *it;
    int *pint;

    L1 = iintList.Create(sizeof(int));
    iintList.UseHeap(L1,NULL); // Use specialized Heap
    // Add N random numbers to the integer list
    for (i=0; i<N; i++) {
        d = rand();
        sum += d;
        iintList.Add(L1,d);
    }
    // Sort it
    iintList.Sort(L1);
    // Go through the sorted list using an iterator
    it = iintList.NewIterator(L1);
    for (pint = it->GetFirst(it); pint; pint = it->GetNext(it)) {
        newSum += *pint;
    }
    // Verify that both sums are identical
    if (sum != newSum)
        printf("Failed\n");
    else printf("Sum = %lld\n",sum);
    // Destroy the list
    iintList.Finalize(L1);
}
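It can be compiled against the library with something like (paths assume
the ccl sources one directory up; adjust to your setup):

gcc -O2 -I.. -o listexample-c tintlist.c ../libccl.a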

I have uploaded the latest code to:

http://code.google.com/p/ccl/

There you will find (in listgen.c) the "templated" version of quick sort
in C.
 

BartC

I have uploaded the latest code to:

http://code.google.com/p/ccl/

There you will find (in listgen.c) the "templated" version of quick sort
in C.

How do you build this thing? Typing make gives errors:

c:\lcc\bin\make says something about bitstrings.o.

Copying makefile.lcc to makefile then doing the same:

c:\lcc\bin\make now complains about something in lcc64.

(I never use 'make' so haven't got a clue. A readme file might be helpful
though.)
 

Paul

Juha said:
Your test is basically useless as a measurement of sorting speed because
the vast majority of the time will be spent on those push_back() calls.

If you want to test sorting speed and nothing else, you have to discard
the time spent in memory allocation.

If you're going to collect timing information, it should be
collected right around the working part of the test, inside
the code itself, rather than recording the runtime of the
entire executable, including all those rand() calls.

generate_random_numbers
...
get_time_before
do_the_sort
get_time_after
...
print ( get_time_after - get_time_before )
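In C, that might look like this minimal sketch (using the standard
clock() function; do_the_sort is a stand-in name for the operation
under test):

#include <stdio.h>
#include <time.h>

static void do_the_sort(void)
{
    /* stand-in for the code being measured */
}

int main(void)
{
    /* generate the random numbers here, outside the timed region */
    clock_t before = clock();
    do_the_sort();
    clock_t after = clock();
    printf("sort time %.3f\n",
           (double)(after - before) / CLOCKS_PER_SEC);
    return 0;
}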

Paul
 

jacob navia

On 04/06/12 18:30, BartC wrote:
How do you build this thing? Typing make gives errors:

c:\lcc\bin\make says something about bitstrings.o.

Copying makefile.lcc to makefile then doing the same:

c:\lcc\bin\make now complains about something in lcc64.

(I never use 'make' so haven't got a clue. A readme file might be
helpful though.)


Sorry, I did not test under Windows. I tested now, and you should
just type

make -f makefile.lcc

That compiles with 32-bit lcc.

make -f makefile.lcc64

That compiles with 64-bit lcc.
I updated the zip file. Please download again and tell me how it goes.

Thanks
 

BartC

jacob navia said:
On 04/06/12 18:30, BartC wrote:
Sorry, I did not test under Windows.

That explains the funny line endings in the files...
I tested now, and you should
just type

make -f makefile.lcc

Thanks. I downloaded the latest 32-bit lcc to make sure, but the build now
gets hung up on errors in wchar.h (first on line 521, then on line 509).
That compiles with 32-bit lcc.

make -f makefile.lcc64

I couldn't see a .lcc64 file (in the distribution downloaded after your
post).

Anyway, I've now given up on makefiles (I always have trouble with them.
Always.) I found I could build your test program by manually compiling and
linking these files:

intlist.c
buffer.c
vector.c
bitstrings.c
bloom.c
containererror.c
deque.c
dictionary.c
dlist.c
fgetline.c
generic.c
hashtable.c
heap.c
iMask.c
list.c
malloc_debug.c
pool.c
pooldebug.c
qsortex.c
queue.c
redblacktree.c
scapegoat.c
searchtree.c
strcollection.c
valarraydouble.c
valarrayint.c
valarrayuint.c
valarraysize_t.c
valarrayfloat.c
valarraylongdouble.c
valarraylonglong.c
valarrayulonglong.c
valarrayshort.c
observer.c
memorymanager.c
sequential.c
wstrcollection.c
 

jacob navia

On 04/06/12 20:18, BartC wrote:
That explains the funny line endings in the files..


Thanks. I downloaded the latest 32-bit lcc to make sure, but the build now
gets hung up on errors in wchar.h (first on line 521, then on line 509).

Ahhh, you are using the distribution that comes with the compiler?

That's not really up to date. Please download the ccl code from

http://code.google.com/p/ccl/
 

MelissA

Your test is basically useless as a measurement of sorting speed
because the vast majority of the time will be spent on those
push_back() calls.

If you want to test sorting speed and nothing else, you have to
discard the time spent in memory allocation.
Here is just sorting time, C version is significantly faster:

bmaxa@maxa:~/jacob/ccl$ ./testlist
sort time 3.140
Sum = 10738138201479754
bmaxa@maxa:~/jacob/ccl$ ./cpplist
sort time 5.350
Sum = 10738138201479754
bmaxa@maxa:~/jacob/ccl$
 

MelissA

Here is just sorting time, C version is significantly faster:

bmaxa@maxa:~/jacob/ccl$ ./testlist
sort time 3.140
Sum = 10738138201479754
bmaxa@maxa:~/jacob/ccl$ ./cpplist
sort time 5.350
Sum = 10738138201479754
bmaxa@maxa:~/jacob/ccl$

But watch out for this:
It is faster to populate a vector from the list, sort the vector,
clear the list, and then repopulate the list from the vector,
than it is to run the C version! ;)

bmaxa@maxa:~/jacob/ccl$ time ./cpplist
sort time 0.750
Sum = 10738138201479754

real 0m1.813s
user 0m1.660s
sys 0m0.152s
bmaxa@maxa:~/jacob/ccl$ time ./testlist
sort time 3.140
Sum = 10738138201479754

real 0m4.302s
user 0m4.208s
sys 0m0.096s
bmaxa@maxa:~/jacob/ccl$ cat cpplist.cpp
#include <list>
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <ctime>

#define N 10000000

int main()
{
    std::list<int> L1;
    long long sum=0,newSum=0;
    for(int i = 0; i < N; ++i)
    {
        int d = rand();
        sum+=d;
        L1.push_back(d);
    }
    std::vector<int> tmp(L1.begin(),L1.end());
    clock_t t = clock();
    std::sort(tmp.begin(),tmp.end());
    clock_t et = clock();
    L1.clear();
    L1.insert(L1.begin(),tmp.begin(),tmp.end());
    printf("sort time %.3f\n",float(et - t)/CLOCKS_PER_SEC);

    for(auto it : L1)
    {
        newSum+= it;
    }

    if (sum != newSum)
        printf("Failed\n");
    else printf("Sum = %lld\n",sum);
}
 

jacob navia

Confirmed.

real 0m5.112s
user 0m4.600s
sys 0m0.487s

That's faster than C by a small amount...
 

Ian Collins

jacob navia said:
[snip]
The times are:

Generic sort function without any specialized heap (using malloc):
real 0m17.029s

Generic sort function using a specialized heap (reduces the malloc overhead):
real 0m11.759s

"Templated" sort function using a macro parameter (reduces comparisons to a simple expression):
real 0m6.453s
[snip]
I would like to have the corresponding C++ times, but I fear that
if I write it, it would be unfair since I am not a C++ wizard...

I don't have the original code to hand, but looking back at the original
thread, the C++ list sort was 5x faster, so I guess from your numbers it
is now 2-3 times faster.
 

MelissA

jacob navia said:
Confirmed.

real 0m5.112s
user 0m4.600s
sys 0m0.487s

That's faster than C by a small amount...

This is with my in-place qsort with bidirectional
iterators:


bmaxa@maxa:~/jacob/ccl$ time ./cpplist1
sort time 1.480
Sum = 10738138201479754

real 0m2.105s
user 0m1.936s
sys 0m0.168s
bmaxa@maxa:~/jacob/ccl$ time ./testlist
sort time 3.130
Sum = 10738138201479754

real 0m4.265s
user 0m4.128s
sys 0m0.136s
bmaxa@maxa:~/jacob/ccl$ cat QSort.h
#include <algorithm>
#include <functional>   // std::less
#include <iterator>     // std::iterator_traits

namespace VLib{
using std::swap;

template <typename T,typename F>
void qsort(T begin, T end, unsigned size, F f)
{
    if(begin == end)return;

    T high = end;
    if(!size)
    {
        T tmp = begin;
        while(tmp!=end){ high=tmp++;++size; }
    }
    else
    {
        --high;
    }

    if(size == 1)return;

    T low = begin;
    ++low;

    unsigned counthigh = 0,countlow = 0;
    do
    {
        while(high != low && f(*begin,*high)){ --high;++counthigh; }
        while(low != high && f(*low,*begin)){ ++low;++countlow; }

        if(low != high && !f(*low,*begin) && !f(*begin,*low) && !f(*begin,*high) && !f(*high,*begin))
        {
            while(high != low && !f(*high,*low) && !f(*low,*high)){ --high;++counthigh; }
        }

        swap(*low,*high);
    }while(low != high);
    swap(*low,*begin);
    T i = low;

    while(++i != end && !f(*i,*low) && !f(*low,*i) )--counthigh;

    VLib::qsort(begin,low,countlow,f);
    VLib::qsort(i,end,counthigh,f);
}

template <typename T>
inline void qsort(T begin, T end, unsigned size = 0)
{
    VLib::qsort(begin,end,size,std::less<typename std::iterator_traits<T>::value_type>());
}

}
bmaxa@maxa:~/jacob/ccl$ cat cpplist1.cpp
#include <list>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include "QSort.h"

#define N 10000000

int main()
{
    std::list<int> L1;
    long long sum=0,newSum=0;
    for(int i = 0; i < N; ++i)
    {
        int d = rand();
        sum+=d;
        L1.push_back(d);
    }

    clock_t t = clock();
    VLib::qsort(L1.begin(),L1.end());
    clock_t et = clock();
    printf("sort time %.3f\n",float(et - t)/CLOCKS_PER_SEC);

    for(auto it : L1)
    {
        newSum+= it;
    }

    if (sum != newSum)
        printf("Failed\n");
    else printf("Sum = %lld\n",sum);
}
 

Ian Collins

On 04/06/12 17:44, Ike Naar wrote:

Yes, I see a noticeable delay between the printing of the
result sum and the end of the program.

Apparently cleaning up is quite expensive, since destructors must be
called 100 million times.

In the CCL you can ADD a destructor if you want but they are NOT
required as in C++.

So not leaking memory is optional :)
 

jacob navia

C: 6.6 s
C++: 14.76 s

See the analysis after this session transcript. Please try this on your
machine using the latest version of libccl.a.


~/stl/test $ cat listexample.cpp
#include <list>
#include <cstdio>
#include <cstdlib>

#define N 10000000

int main()
{
    std::list<int> L1;
    long long sum=0,newSum=0;
    int i;
    for( i = 0; i < N; ++i)
    {
        int d = rand();
        sum+=d;
        L1.push_back(d);
    }

    L1.sort();

    //for(auto it : L1)
    std::list<int>::iterator it;
    for(it=L1.begin(); it != L1.end(); ++it)
    {
        newSum+= *it;
    }

    if (sum != newSum)
        printf("Failed\n");
    else printf("Sum = %lld\n",sum);
}

~/stl/test $ g++ -O2 -o listexample-cpp listexample.cpp
~/stl/test $ time ./listexample-cpp
Sum = 10737818730605039

real 0m14.760s
user 0m14.438s
sys 0m0.320s
~/stl/test $ cat tintlist.c
#include <stdio.h>
#include <stdlib.h>
#include "containers.h"
#include "intlist.h"
#define N 10000000
int main(void)
{
    intList *L1;
    size_t i;
    int d;
    long long sum=0,newSum=0;
    Iterator *it;
    int *pint;

    L1 = iintList.Create(sizeof(int));
    iintList.UseHeap(L1,NULL); // Use specialized Heap
    // Add N random numbers to the integer list
    for (i=0; i<N; i++) {
        d = rand();
        sum += d;
        iintList.Add(L1,d);
    }
    // Sort it
    iintList.Sort(L1);
    // Go through the sorted list using an iterator
    it = iintList.NewIterator(L1);
    for (pint = it->GetFirst(it); pint; pint = it->GetNext(it)) {
        newSum += *pint;
    }
    // Verify that both sums are identical
    if (sum != newSum)
        printf("Failed\n");
    else printf("Sum = %lld\n",sum);
    // Destroy the list
    iintList.Finalize(L1);
}
~/stl/test $ gcc -O2 -I.. -o listexample-c tintlist.c ../libccl.a
~/stl/test $ time ./listexample-c
Sum = 10737818730605039

real 0m6.584s
user 0m6.319s
sys 0m0.258s
~/stl/test $

Analysis:
---------
The main problems in the C++ code are:

1) malloc overhead. There are several solutions to that, but none of
them is standard.
2) Apparently the algorithm for sorting a list in C++ is different
from the one I use. If I copy the list into a vector, sort the vector,
then copy the sorted result into a list again, it is much faster.

For example:
~/stl/test $ cat tintlist1.cpp
#include <list>
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <ctime>

#define N 10000000

int main()
{
    std::list<int> L1;
    long long sum=0,newSum=0;
    for(int i = 0; i < N; ++i)
    {
        int d = rand();
        sum+=d;
        L1.push_back(d);
    }
    std::vector<int> tmp(L1.begin(),L1.end());
    clock_t t = clock();
    std::sort(tmp.begin(),tmp.end());
    clock_t et = clock();
    L1.clear();
    L1.insert(L1.begin(),tmp.begin(),tmp.end());
    printf("sort time %.3f\n",float(et - t)/CLOCKS_PER_SEC);

    // for(auto it : L1)
    std::list<int>::iterator it;
    for(it=L1.begin(); it != L1.end(); ++it)
    {
        newSum+= *it;
    }

    if (sum != newSum)
        printf("Failed\n");
    else printf("Sum = %lld\n",sum);
}
This code takes only
~/stl/test $ time ./tintlist1-cpp
sort time 1.138
Sum = 10737818730605039

real 0m5.137s
user 0m4.656s
sys 0m0.476s

Nearly three times faster.

Of course now we aren't being fair to C, since I could have done a
vector sort there too and it would have been faster than the C++ anyway.

Conclusion:

Probably C and C++ go at the same speed.
 

jacob navia

On 05/06/12 00:23, Ian Collins wrote:
So not leaking memory is optional :)

No. You do a

iintList.Finalize(list);

The advantage is that you free ALL memory in a single call to a function
that destroys the heap, not individually (list element by list element)
but in blocks of 1000 list elements each, you see?

C++ calls destructors 100 million times. That takes time.
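The idea behind that specialized heap is roughly the following. This is
a hypothetical sketch with illustrative names, not the actual ccl code:

#include <stdlib.h>

#define NODES_PER_BLOCK 1000

typedef struct Block {
    struct Block *next;
    char data[];              /* NODES_PER_BLOCK * nodeSize bytes */
} Block;

typedef struct Heap {
    size_t nodeSize;
    size_t used;              /* nodes handed out from the newest block */
    Block *blocks;            /* newest block first */
} Heap;

void *HeapAlloc(Heap *h)
{
    if (h->blocks == NULL || h->used == NODES_PER_BLOCK) {
        /* one malloc per 1000 nodes instead of one per node */
        Block *b = malloc(sizeof(Block) + NODES_PER_BLOCK * h->nodeSize);
        if (b == NULL) return NULL;
        b->next = h->blocks;
        h->blocks = b;
        h->used = 0;
    }
    return h->blocks->data + h->used++ * h->nodeSize;
}

void HeapFinalize(Heap *h)
{
    /* one free per 1000 nodes; no per-element destructor calls */
    Block *b = h->blocks;
    while (b != NULL) {
        Block *next = b->next;
        free(b);
        b = next;
    }
    h->blocks = NULL;
    h->used = 0;
}

int main(void)
{
    Heap h = { sizeof(int), 0, NULL };
    int *p = HeapAlloc(&h);
    if (p != NULL) *p = 42;
    HeapFinalize(&h);         /* releases everything in a handful of calls */
    return 0;
}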
 

BartC

jacob navia said:
On 04/06/12 20:18, BartC wrote:

Ahhh, you are using the distribution that comes with the compiler?

No. I updated lcc in case it was out-of-date and causing the problem. The
CCL stuff is separate.

In any case, this project doesn't need a makefile as it's straightforward
(now I know what's needed); I will leave that to the experts.

I wanted to test the speed of container lists versus ordinary arrays (I know
containers are more sophisticated, but your test is simple so I'm looking at
just that).

The results of your test program weren't so interesting, because 95% of the
runtime seemed to be in sorting. So I got rid of that, and just tested
allocating and iterating over 100M (not 10M) int objects. This took about 8
seconds and seemed to use about 1.3GB of memory (is there a per-element
overhead?)

Using an ordinary allocated array took only 1.7 seconds, and used the
expected 0.4GB of memory.

However, this was *preallocated*, a possible advantage. (I ran the same test
in a dynamic language, which took 16 seconds whether the int array was
preallocated or not. Interestingly (I need to verify this), using a
preallocated generic list took only 10 seconds, not far off the CCL timing
for a non-generic list.)

The conclusion, if my results are valid, seems to be that when the
requirements are very simple, ordinary C array handling should be
considered where speed and memory are important. Maybe you might want to put
forward a more elaborate test that isn't so easy to rewrite in standard C...
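A minimal sketch of that array test might look like this (the loop shape,
N and the timing calls are assumptions based on the description above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000000   /* 100M ints, the expected ~0.4GB */

int main(void)
{
    long long sum = 0, newSum = 0;
    size_t i;
    clock_t t0, t1;
    int *a = malloc(N * sizeof *a);   /* preallocated, unlike the list */
    if (a == NULL) return EXIT_FAILURE;
    t0 = clock();
    for (i = 0; i < N; i++) {         /* fill phase */
        a[i] = rand();
        sum += a[i];
    }
    for (i = 0; i < N; i++)           /* iterate phase */
        newSum += a[i];
    t1 = clock();
    printf("fill+iterate time %.3f\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("%s\n", sum == newSum ? "OK" : "Failed");
    free(a);
    return 0;
}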
 

Luca Risolia

On 05/06/12 00:23, Ian Collins wrote:

No. You do a

iintList.Finalize(list);

The advantage is that you free ALL memory in a single call to a function
that destroys the heap, not individually (list element by list element)
but in blocks of 1000 list elements each, you see?

C++ calls destructors 100 million times. That takes time.

Not necessarily. Memory deallocation and object destruction are two
separate things usually controlled by an allocator. With regard to the
standard containers an allocator is always an (optional) template
parameter. If you really want, you can provide your preferred
deallocation strategy for a given type T to std::list, so that it frees
the memory in blocks of 1000 and doesn't destroy any object.
 

Martin Shobe

Luca said:
Not necessarily. Memory deallocation and object destruction are two
separate things usually controlled by an allocator. With regard to the
standard containers an allocator is always an (optional) template
parameter. If you really want, you can provide your preferred
deallocation strategy for a given type T to std::list, so that it frees
the memory in blocks of 1000 and doesn't destroy any object.

In the code that led to this, the list was a list of ints. Those don't
even have destructors to call, so the time certainly wasn't spent
calling them.

Martin Shobe
 

MelissA

Final QSort.h (corrected errors; this one is a little bit slower):

#include <algorithm>
#include <functional>   // std::less
#include <iterator>     // std::iterator_traits

namespace VLib{
using std::swap;

template <typename T,typename F>
void qsort(T begin, T end, unsigned size, F f)
{
    if(begin == end)return;

    T high = end;
    if(!size)
    {
        T tmp = begin;
        while(tmp!=end){ high=tmp++;++size; }
    }
    else
    {
        --high;
    }

    if(size == 1)return;
    T low = begin;
    unsigned count = 0;
    while(++count<size/2)++low;   // advance low towards the middle element
    // printf("size %u\n",size);
    swap(*low,*begin);            // the middle element becomes the pivot
    low = begin;
    ++low;

    unsigned counthigh = 0,countlow = 1;
    do
    {
        while(high != low && f(*begin,*high)){ --high;++counthigh; }
        while(low != high && f(*low,*begin)){ ++low;++countlow; }

        if(low != high && !f(*low,*begin) && !f(*begin,*low) && !f(*begin,*high) && !f(*high,*begin))
        {
            while(high != low && !f(*high,*low) && !f(*low,*high)){ --high;++counthigh; }
        }

        swap(*low,*high);
    }while(low != high);
    swap(*low,*begin);
    T i = low;

    while(++i != end && !f(*i,*low) && !f(*low,*i) )--counthigh;

    VLib::qsort(begin,low,countlow,f);
    VLib::qsort(i,end,counthigh,f);
}

template <typename T>
inline void qsort(T begin, T end, unsigned size = 0)
{
    VLib::qsort(begin,end,size,std::less<typename std::iterator_traits<T>::value_type>());
}

}
 

Juha Nieminen

In comp.lang.c++ jacob navia said:
Yes, but what I want to test is the speed of doing ALL the tasks mentioned above

Then you will be comparing C++'s default memory allocator speed to whatever
that C list was using.
 
