C Containers Library vs STL

J

jacob navia

Hi

I would like to compare the C containers library (written in C for C
programmers) against the STL.

Here is the code for the CCL. Maybe one C++ wizard would solve the
same problem using the STL?

I would be very ingterested in comparing code size, complexity, etc.


Thanks in advance, and here is the C part:
--------------------------------------------------------------
Unique
------

Given a text file, print in standard output the lines that are
unique in it, i.e. filtering all duplicated lines.

Algorithm:
---------

Normally this involves keeping a sorted list/array of lines
and testing if a line is in the set or not.

Solution using the CCL.
----------------------

1 #include <containers.h>
2 int main(int argc,char *argv[])
3 {
4 FILE *f;
5 int i=1,r;
6 Dictionary *dict;
7 char buf[8192];
8
9 if (argc < 2) {
10 fprintf(stderr,"%s <file name>\n",argv[0]);
11 return -1;
12 }
13 f = fopen(argv[1],"r");
14 if (f == NULL)
15 return -1;
16 dict = iDictionary.Create(0,500);
17 if (dict == NULL)
18 return -1;
19 while (fgets(buf,sizeof(buf),f)) {
20 r= iDictionary.Add(dict,buf,NULL);
21 if (r > 0)
22 printf("[%3d] %s",i,buf);
23 else if (r < 0) break;
24 i++;
25 }
26 iDictionary.Finalize(dict);
27 fclose(f);
28 }

Algorithm
---------
A hash table will be used to determine if a line is a duplicate
or not.

Commentary
----------
We use the following local variables (lines 4-7):
Name Usage
f Input stream bound to the file to read
i Counter for lines read
r Result of adding a line
dict Dictionary (Hash table)
buf Line buffer limited to 8K per line

Lines 9-15 are concerned with opening the input file, with some error
checking.

In line 16 we create a dictionary, requesting a size of zero for the
data associated with the key since we aren't storing any data, just the
key, and we suppose that the table will contain more or less 500
entries. If the file contains much more lines performance could
suffer but the algorithm would still work.

Lines 19-25 are the main loop of the program. We read each line into
the buffer and add it to then dictionary. If the "Add" API returns
a positive number the line wasn't there, if it returns zero the
line was already in the dictionary. If the result is negative it
is an error code and we stop the loop aborting the operation. Failure
can be provoked only by lack of memory.

If the result is positive we print the line.

Cleanup is performed in lines 26 and 27: we dispose of the dictionary
and close the file.

The C containers library web page is:
http://code.google.com/p/ccl/
You will find there the source code and the documentation.

-----------------------------------------------------------------------
 
I

Ian Collins

Hi

I would like to compare the C containers library (written in C for C
programmers) against the STL.

Here is the code for the CCL. Maybe one C++ wizard would solve the
same problem using the STL?

I would be very ingterested in comparing code size, complexity, etc.


Thanks in advance, and here is the C part:
--------------------------------------------------------------
Unique
------

Given a text file, print in standard output the lines that are
unique in it, i.e. filtering all duplicated lines.

Algorithm:
---------

Normally this involves keeping a sorted list/array of lines
and testing if a line is in the set or not.

Solution using the CCL.
----------------------

1 #include<containers.h>
2 int main(int argc,char *argv[])
3 {
4 FILE *f;
5 int i=1,r;
6 Dictionary *dict;
7 char buf[8192];
8
9 if (argc< 2) {
10 fprintf(stderr,"%s<file name>\n",argv[0]);
11 return -1;
12 }
13 f = fopen(argv[1],"r");
14 if (f == NULL)
15 return -1;
16 dict = iDictionary.Create(0,500);
17 if (dict == NULL)
18 return -1;
19 while (fgets(buf,sizeof(buf),f)) {
20 r= iDictionary.Add(dict,buf,NULL);
21 if (r> 0)
22 printf("[%3d] %s",i,buf);
23 else if (r< 0) break;
24 i++;
25 }
26 iDictionary.Finalize(dict);
27 fclose(f);
28 }

Algorithm

This is probably the closest equivalent (sticking to a similar layout
style):

#include <set>
#include <string>
#include <iostream>
#include <fstream>

typedef std::set<std::string> Lines;

int main( int argc, char** argv )
{
std::ifstream in( argv[1] );

Lines unique;

while( in ) {
std::string line;

std::getline( in, line );

if( unique.insert(line).second ) {
std::cout << line << std::endl;
}
}
}

Commentary
----------
We use the following local variables (lines 4-7):
Name Usage
f Input stream bound to the file to read
i Counter for lines read
r Result of adding a line
dict Dictionary (Hash table)
buf Line buffer limited to 8K per line

It would have been much easier to give the variables meaningful names!
 
J

jacob navia

Le 04/08/11 00:38, Robert Wessel a écrit :
On Wed, 03 Aug 2011 17:36:54 -0500, Robert Wessel
I'd point out that unlike Jacob's original, you're not displaying a
line number, although that's trivial. You'd probably change to cout
to something like

std::cout<< "["<< setw(3)<< linenumber<< "] "<< line<< std:endl;

And add the appropriate code to maintain linenumber.


errr... std::setw(3)

Thanks to you and to Ian.

More important than the line number, is the behavior of the STL
container in case of errors, for instance when there is no more
memory. Does it throw an exception? In that case it would be
fully equivalent of the C code that tests for the return code
being less than zero.

In my Macintosh I compile using gcc the code to
-rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
-rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C

both are compiled using -Os (size optimization) using the same commpiler
~/Documents/stl $ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
gcc version 4.2.1 (Apple Inc. build 5664)


I think that the codes are really equivalent, and I am pleased that
the CCL doesn't come all that bad. What performance is concerned
I doubt that there would be any differences since the programs are
bound by I/O anyway


Thanks again for your answers.

jacob
 
I

Ian Collins

Thanks to you and to Ian.

More important than the line number, is the behavior of the STL
container in case of errors, for instance when there is no more
memory. Does it throw an exception? In that case it would be
fully equivalent of the C code that tests for the return code
being less than zero.

In my Macintosh I compile using gcc the code to
-rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
-rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C

both are compiled using -Os (size optimization) using the same commpiler
~/Documents/stl $ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
gcc version 4.2.1 (Apple Inc. build 5664)


I think that the codes are really equivalent, and I am pleased that
the CCL doesn't come all that bad. What performance is concerned
I doubt that there would be any differences since the programs are
bound by I/O anyway

Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s
 
I

Ian Collins

Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s

A quick profile shows the C version spends most of its time in strcmp
while the C++ version spends most of its time in I/O. So I guess the
performance difference is mainly between the r-b tree used by std::set
and your hashing algorithm.

So if I change your code to use std::set in order to remove the
difference in I/O libraries:

#include <set>
#include <string>
#include <stdio.h>

int main(int argc,char *argv[])
{
FILE* f = fopen(argv[1],"r");
if (f == NULL)
return -1;

std::set<std::string> unique;
char buf[8192];
int i = 0;

while (fgets(buf,sizeof(buf),f)) {
if( unique.insert(buf).second ) {
printf("[%3d] %s",i,buf);
}
i++;
}
fclose(f);
}

the run time drops to 0.66 seconds.
 
J

Juha Nieminen

Ian Collins said:
the run time drops to 0.66 seconds.

If you wanted to reduce the running time even further, and assuming the
entire input file fits in RAM, what you could do is to read the entire
file into RAM with one single fread() call and then use a std::set of
char* (and the proper comparator), each pointer pointing to this single
data block (which needs newlines replaced with null characters).

In this case the (probably significant) speedup comes from removing the
need to make a memory allocation for each line.

(And if that's not enough, then the next step is to not to use a
std::set, but a std::vector<char*> instead. Once it has been initialized,
sort it and make the elements unique (both one-liners). That way we avoid
making another memory allocation per line.)
 
I

Ian Collins

If you wanted to reduce the running time even further, and assuming the
entire input file fits in RAM, what you could do is to read the entire
file into RAM with one single fread() call and then use a std::set of
char* (and the proper comparator), each pointer pointing to this single
data block (which needs newlines replaced with null characters).

Running more than once should be sufficient to cache the file in RAM.
In this case the (probably significant) speedup comes from removing the
need to make a memory allocation for each line.

My point was the choice of container was the dominant player in this
comparison. All of the above optimisations can be applied to either
version.
 
J

jacob navia

Le 04/08/11 02:34, Ian Collins a écrit :
Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s

Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

jacob
 
I

Ian Collins

Le 04/08/11 02:34, Ian Collins a écrit :

Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

That does make a very significant difference:

0.51s

Slightly quicker than using std::set.
 
J

jacob navia

Le 04/08/11 01:40, Robert Wessel a écrit :
Yes, you'd get an exception if you ran out of memory. The handling of
that is a bit different than your program (you quietly return -1,
where as the C++ program would probably die with an "unhandled
exception" message, neither of which is really acceptable, although
both programs could obviously be improved).

The default behavior in the CCL is to print an error message in the
standard error stream. In this case the user would get a "No memory
left" message. That is why I do not print any error message and
return -1.

You can obviously change this behavior by another one if you wish.
(For instance aborting the program).
 
M

Marc

jacob said:
Le 04/08/11 02:34, Ian Collins a écrit :

Comparing a hash table to a binary tree is not really fair, try with
hash_set (common extension) or unordered_set (C++0x).
Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

Standard C++ hash tables automatically increase their number of buckets
when the number of objects is large enough.

Why should 135000 buckets use much more memory if there are enough
objects to fill most buckets?
 
J

jacob navia

Le 04/08/11 14:47, Marc a écrit :
Comparing a hash table to a binary tree is not really fair, try with
hash_set (common extension) or unordered_set (C++0x).


Standard C++ hash tables automatically increase their number of buckets
when the number of objects is large enough.

I am thinking about adding such an extension to the Dictionary
container. I have already such an extension for the general hash
table container, it resizes automatically. The Dictionary container
was thought as a light weight hash table, with character keys. The
general hash table can have keys of any type but its usage is
more cumbersome, that is why I splitted it into a dictionary
(good for 80% of applications) and a general hash table
(when you need resisizing, and keys that aren't character data.


Now, I am wondering if that interface is not MORE complicated than
a more capable dictionary... Anyway if the resizing is automatic I can
keep the same interface. The code is bulkier though.

Why should 135000 buckets use much more memory if there are enough
objects to fill most buckets?

Because there is more overhead per bucket. In my implementation
we have a table of <n> slots, and each slot with a single linked
list of objects that map to the same hash code.

Space overhead is:
(1) overhead for the linked list: 2*sizeof(void *) for each object
(2) Overhead of the table for the slots: sizeof(void *)/ number of
objects per slot.

You see? The larger the table, the more overhead it makes since the
occupancy decreases (and hence the speed of search increases)

jacob
 
J

Juha Nieminen

Ian Collins said:
Running more than once should be sufficient to cache the file in RAM.

That's not the point. The point was to avoid needless memory allocations.
 
J

jacob navia

Le 04/08/11 16:16, Juha Nieminen a écrit :
That's not the point. The point was to avoid needless memory allocations.

In the C containers library the container ALWAYS owns the memory used
by the objects/keys. This means the given buffers are always copied
when you add an item.

What is the behavior in C++, it is different?
 
N

Nobody

In the C containers library the container ALWAYS owns the memory used
by the objects/keys. This means the given buffers are always copied
when you add an item.

What is the behavior in C++, it is different?

No. But if you have a container which contains pointers, it
literally contains the pointers, not the data which they point to. E.g.
std::vector<char*>::push_back() will just add the supplied pointer, it
won't perform a strdup() (char* doesn't necessarily imply a NUL-terminated
string).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,902
Latest member
Elena68X5

Latest Threads

Top