C Containers Library vs STL

Discussion in 'C++' started by jacob navia, Aug 3, 2011.

  1. jacob navia

    jacob navia Guest

    Hi

    I would like to compare the C containers library (written in C for C
    programmers) against the STL.

    Here is the code for the CCL. Maybe one C++ wizard would solve the
    same problem using the STL?

    I would be very interested in comparing code size, complexity, etc.


    Thanks in advance, and here is the C part:
    --------------------------------------------------------------
    Unique
    ------

    Given a text file, print to standard output the lines that are
    unique in it, i.e. filter out all duplicated lines.

    Algorithm:
    ---------

    Normally this involves keeping a sorted list/array of lines
    and testing if a line is in the set or not.

    Solution using the CCL.
    ----------------------

     1  #include <containers.h>
     2  int main(int argc, char *argv[])
     3  {
     4      FILE *f;
     5      int i = 1, r;
     6      Dictionary *dict;
     7      char buf[8192];
     8
     9      if (argc < 2) {
    10          fprintf(stderr, "%s <file name>\n", argv[0]);
    11          return -1;
    12      }
    13      f = fopen(argv[1], "r");
    14      if (f == NULL)
    15          return -1;
    16      dict = iDictionary.Create(0, 500);
    17      if (dict == NULL)
    18          return -1;
    19      while (fgets(buf, sizeof(buf), f)) {
    20          r = iDictionary.Add(dict, buf, NULL);
    21          if (r > 0)
    22              printf("[%3d] %s", i, buf);
    23          else if (r < 0) break;
    24          i++;
    25      }
    26      iDictionary.Finalize(dict);
    27      fclose(f);
    28  }

    Algorithm
    ---------
    A hash table will be used to determine if a line is a duplicate
    or not.

    Commentary
    ----------
    We use the following local variables (lines 4-7):

    Name    Usage
    ----    -----
    f       Input stream bound to the file to read
    i       Counter for lines read
    r       Result of adding a line
    dict    Dictionary (hash table)
    buf     Line buffer, limited to 8K per line

    Lines 9-15 are concerned with opening the input file, with some error
    checking.

    In line 16 we create a dictionary, requesting a size of zero for the
    data associated with the key (since we aren't storing any data, just
    the key) and hinting that the table will contain roughly 500 entries.
    If the file contains many more lines, performance could suffer, but
    the algorithm would still work.

    Lines 19-25 are the main loop of the program. We read each line into
    the buffer and add it to the dictionary. If the "Add" API returns
    a positive number, the line wasn't there yet; if it returns zero, the
    line was already in the dictionary. If the result is negative, it is
    an error code and we break out of the loop, aborting the operation.
    Failure can be caused only by lack of memory.

    If the result is positive, we print the line.

    Cleanup is performed in lines 26 and 27: we dispose of the dictionary
    and close the file.

    The C containers library web page is:
    http://code.google.com/p/ccl/
    You will find there the source code and the documentation.

    -----------------------------------------------------------------------
     
    jacob navia, Aug 3, 2011
    #1

  2. Ian Collins

    Ian Collins Guest

    On 08/ 4/11 08:40 AM, jacob navia wrote:
    > Hi
    >
    > I would like to compare the C containers library (written in C for C
    > programmers) against the STL.
    >
    > Here is the code for the CCL. Maybe one C++ wizard would solve the
    > same problem using the STL?
    >
    > [problem statement and CCL code snipped]
    >
    > Algorithm
    > ---------
    > A hash table will be used to determine if a line is a duplicate
    > or not.


    This is probably the closest equivalent (sticking to a similar layout
    style):

    #include <set>
    #include <string>
    #include <iostream>
    #include <fstream>

    typedef std::set<std::string> Lines;

    int main( int argc, char** argv )
    {
        std::ifstream in( argv[1] );

        Lines unique;

        std::string line;

        // Loop on getline itself so a failed read at EOF isn't processed.
        while( std::getline( in, line ) ) {
            if( unique.insert(line).second ) {
                std::cout << line << std::endl;
            }
        }
    }


    > Commentary
    > ----------
    > We use the following local variables (lines 4-7):
    >
    > Name    Usage
    > ----    -----
    > f       Input stream bound to the file to read
    > i       Counter for lines read
    > r       Result of adding a line
    > dict    Dictionary (hash table)
    > buf     Line buffer, limited to 8K per line


    It would have been much easier to give the variables meaningful names!

    --
    Ian Collins
     
    Ian Collins, Aug 3, 2011
    #2

  3. jacob navia

    jacob navia Guest

    On 04/08/11 00:38, Robert Wessel wrote:
    > On Wed, 03 Aug 2011 17:36:54 -0500, Robert Wessel
    > <> wrote:
    >


    >>
    >> I'd point out that unlike Jacob's original, you're not displaying a
    >> line number, although that's trivial. You'd probably change the cout
    >> to something like
    >>
    >> std::cout << "[" << setw(3) << linenumber << "] " << line << std::endl;
    >>
    >> And add the appropriate code to maintain linenumber.

    >
    >
    > errr... std::setw(3)
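
    Putting the two corrections together (std::setw lives in <iomanip>),
    a minimal self-contained sketch, with linenumber and line as
    placeholder variables:

    #include <iomanip>
    #include <iostream>
    #include <string>

    int main()
    {
        int linenumber = 1;
        std::string line = "some line";
        // std::setw(3) pads the number to 3 columns, like printf's "%3d"
        std::cout << "[" << std::setw(3) << linenumber << "] "
                  << line << std::endl;
    }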


    Thanks to you and to Ian.

    More important than the line number is the behavior of the STL
    container in case of errors, for instance when there is no more
    memory. Does it throw an exception? In that case it would be
    fully equivalent to the C code that tests for the return code
    being less than zero.
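
    If it does (the default allocator throws std::bad_alloc on
    exhaustion), the check can be made explicit. A minimal sketch
    mirroring the C version's negative-return path; the error message
    is a placeholder:

    #include <set>
    #include <string>
    #include <iostream>
    #include <fstream>
    #include <new>

    int main( int argc, char** argv )
    {
        if ( argc < 2 )
            return -1;

        std::ifstream in( argv[1] );
        std::set<std::string> unique;
        std::string line;

        try {
            while ( std::getline( in, line ) ) {
                if ( unique.insert(line).second )
                    std::cout << line << std::endl;
            }
        } catch ( const std::bad_alloc& ) {
            // Out of memory: report and fail, like the C code's r < 0 branch.
            std::cerr << "out of memory\n";
            return -1;
        }
    }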

    On my Macintosh I compile the code with gcc; the resulting object sizes:
    -rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
    -rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C

    Both are compiled with -Os (size optimization) using the same compiler:
    ~/Documents/stl $ gcc -v
    Using built-in specs.
    Target: i686-apple-darwin10
    gcc version 4.2.1 (Apple Inc. build 5664)


    I think that the codes are really equivalent, and I am pleased that
    the CCL doesn't come out all that bad. As far as performance is
    concerned, I doubt there would be any difference, since the programs
    are I/O-bound anyway.


    Thanks again for your answers.

    jacob
     
    jacob navia, Aug 4, 2011
    #3
  4. Ian Collins

    Ian Collins Guest

    On 08/ 4/11 11:30 AM, jacob navia wrote:
    > [snip]
    >
    > I think that the codes are really equivalent, and I am pleased that
    > the CCL doesn't come out all that bad. As far as performance is
    > concerned, I doubt there would be any difference, since the programs
    > are I/O-bound anyway.


    Um, I tried both versions with the same optimisations (cc/CC -fast) on
    my Solaris box and the run times (over several runs) for a 350K line
    file with 214K unique lines were:

    C: 5.027s
    CC: 1.835s

    --
    Ian Collins
     
    Ian Collins, Aug 4, 2011
    #4
  5. Ian Collins

    Ian Collins Guest

    On 08/ 4/11 12:34 PM, Ian Collins wrote:
    > On 08/ 4/11 11:30 AM, jacob navia wrote:
    >> [snip]
    >
    > Um, I tried both versions with the same optimisations (cc/CC -fast) on
    > my Solaris box and the run times (over several runs) for a 350K line
    > file with 214K unique lines were:
    >
    > C: 5.027s
    > CC: 1.835s


    A quick profile shows the C version spends most of its time in strcmp
    while the C++ version spends most of its time in I/O. So I guess the
    performance difference is mainly between the red-black tree used by
    std::set and your hashing algorithm.

    So if I change your code to use std::set in order to remove the
    difference in I/O libraries:

    #include <set>
    #include <string>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        FILE *f = fopen(argv[1], "r");
        if (f == NULL)
            return -1;

        std::set<std::string> unique;
        char buf[8192];
        int i = 0;

        while (fgets(buf, sizeof(buf), f)) {
            if (unique.insert(buf).second) {
                printf("[%3d] %s", i, buf);
            }
            i++;
        }
        fclose(f);
    }

    the run time drops to 0.66 seconds.

    --
    Ian Collins
     
    Ian Collins, Aug 4, 2011
    #5
  6. Juha Nieminen

    Juha Nieminen Guest

    Ian Collins <> wrote:
    > the run time drops to 0.66 seconds.


    If you wanted to reduce the running time even further, and assuming the
    entire input file fits in RAM, what you could do is to read the entire
    file into RAM with one single fread() call and then use a std::set of
    char* (and the proper comparator), each pointer pointing to this single
    data block (which needs newlines replaced with null characters).

    In this case the (probably significant) speedup comes from removing the
    need to make a memory allocation for each line.

    (And if that's not enough, then the next step is not to use a
    std::set, but a std::vector<char*> instead. Once it has been initialized,
    sort it and make the elements unique (both one-liners). That way we avoid
    making another memory allocation per line.)
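
    A sketch of that last variant, under the stated assumptions: the
    whole file fits in RAM, the output order becomes sorted rather than
    first-occurrence, and the line numbering is dropped. The strcmp
    functor below is also the "proper comparator" the std::set<char*>
    version would need:

    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // strcmp-based predicates so pointers are ordered/compared by content.
    struct CstrLess {
        bool operator()(const char *a, const char *b) const
        { return std::strcmp(a, b) < 0; }
    };
    struct CstrEq {
        bool operator()(const char *a, const char *b) const
        { return std::strcmp(a, b) == 0; }
    };

    int main(int argc, char *argv[])
    {
        if (argc < 2) return -1;
        std::FILE *f = std::fopen(argv[1], "rb");
        if (!f) return -1;

        // Read the whole file into one block with a single fread().
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        std::vector<char> data(size + 1);
        long got = (long)std::fread(&data[0], 1, size, f);
        std::fclose(f);
        data[got] = '\0';

        // One pointer per line; newlines become terminators in place,
        // so no per-line allocation is ever made.
        std::vector<char *> lines;
        for (char *p = &data[0]; *p; ) {
            lines.push_back(p);
            char *nl = std::strchr(p, '\n');
            if (!nl) break;
            *nl = '\0';
            p = nl + 1;
        }

        // The two one-liners: sort, then drop adjacent duplicates.
        std::sort(lines.begin(), lines.end(), CstrLess());
        lines.erase(std::unique(lines.begin(), lines.end(), CstrEq()),
                    lines.end());

        for (size_t i = 0; i < lines.size(); ++i)
            std::printf("%s\n", lines[i]);
    }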
     
    Juha Nieminen, Aug 4, 2011
    #6
  7. Ian Collins

    Ian Collins Guest

    On 08/ 4/11 07:02 PM, Juha Nieminen wrote:
    > Ian Collins<> wrote:
    >> the run time drops to 0.66 seconds.

    >
    > If you wanted to reduce the running time even further, and assuming the
    > entire input file fits in RAM, what you could do is to read the entire
    > file into RAM with one single fread() call and then use a std::set of
    > char* (and the proper comparator), each pointer pointing to this single
    > data block (which needs newlines replaced with null characters).


    Running more than once should be sufficient to cache the file in RAM.

    > In this case the (probably significant) speedup comes from removing the
    > need to make a memory allocation for each line.


    My point was that the choice of container was the dominant factor in
    this comparison. All of the above optimisations can be applied to
    either version.

    --
    Ian Collins
     
    Ian Collins, Aug 4, 2011
    #7
  8. jacob navia

    jacob navia Guest

    On 04/08/11 02:34, Ian Collins wrote:
    >
    > Um, I tried both versions with the same optimisations (cc/CC -fast) on
    > my Solaris box and the run times (over several runs) for a 350K line
    > file with 214K unique lines were:
    >
    > C: 5.027s
    > CC: 1.835s
    >


    Did you change the number passed to the Create API?

    The code as shown is optimized for a file of 500 lines or so.

    Changing that 500 to (say 135000) would be significantly
    better but would use much more memory.

    jacob
     
    jacob navia, Aug 4, 2011
    #8
  9. Ian Collins

    Ian Collins Guest

    On 08/ 4/11 07:42 PM, jacob navia wrote:
    > [timing results snipped]
    >
    > Did you change the number passed to the Create API?
    >
    > The code as shown is optimized for a file of 500 lines or so.
    >
    > Changing that 500 to (say 135000) would be significantly
    > better but would use much more memory.


    That does make a very significant difference:

    0.51s

    Slightly quicker than using std::set.

    --
    Ian Collins
     
    Ian Collins, Aug 4, 2011
    #9
  10. jacob navia

    jacob navia Guest

    On 04/08/11 01:40, Robert Wessel wrote:
    >
    >
    > Yes, you'd get an exception if you ran out of memory. The handling of
    > that is a bit different than your program (you quietly return -1,
    > whereas the C++ program would probably die with an "unhandled
    > exception" message, neither of which is really acceptable, although
    > both programs could obviously be improved).


    The default behavior in the CCL is to print an error message to the
    standard error stream. In this case the user would get a "No memory
    left" message. That is why I do not print any error message and
    just return -1.

    You can obviously replace this behavior with another one if you wish
    (for instance, aborting the program).
     
    jacob navia, Aug 4, 2011
    #10
  11. Marc

    Marc Guest

    jacob navia wrote:

    > On 04/08/11 02:34, Ian Collins wrote:
    >> [snip]
    >>
    >> C: 5.027s
    >> CC: 1.835s


    Comparing a hash table to a binary tree is not really fair, try with
    hash_set (common extension) or unordered_set (C++0x).

    > Did you change the number passed to the Create API?
    >
    > The code as shown is optimized for a file of 500 lines or so.
    >
    > Changing that 500 to (say 135000) would be significantly
    > better but would use much more memory.


    Standard C++ hash tables automatically increase their number of buckets
    when the number of objects is large enough.

    Why should 135000 buckets use much more memory if there are enough
    objects to fill most buckets?
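
    For reference, a minimal sketch of the hashed variant suggested
    above; it assumes a C++0x compiler shipping <unordered_set> (older
    compilers would spell it <tr1/unordered_set> or hash_set):

    #include <cstdio>
    #include <string>
    #include <unordered_set>  // C++0x; adjust the header for older compilers

    int main(int argc, char *argv[])
    {
        if (argc < 2) return -1;
        std::FILE *f = std::fopen(argv[1], "r");
        if (!f) return -1;

        // A hash table, like the CCL Dictionary; buckets grow automatically.
        std::unordered_set<std::string> unique;
        char buf[8192];
        int i = 1;

        while (std::fgets(buf, sizeof buf, f)) {
            if (unique.insert(buf).second)
                std::printf("[%3d] %s", i, buf);
            i++;
        }
        std::fclose(f);
    }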
     
    Marc, Aug 4, 2011
    #11
  12. jacob navia

    jacob navia Guest

    On 04/08/11 14:47, Marc wrote:
    > jacob navia wrote:
    >> [snip]
    >
    > Comparing a hash table to a binary tree is not really fair, try with
    > hash_set (common extension) or unordered_set (C++0x).
    >
    >> Changing that 500 to (say 135000) would be significantly
    >> better but would use much more memory.

    >
    > Standard C++ hash tables automatically increase their number of buckets
    > when the number of objects is large enough.
    >


    I am thinking about adding such an extension to the Dictionary
    container. I already have such an extension for the general hash
    table container; it resizes automatically. The Dictionary container
    was conceived as a lightweight hash table with character keys. The
    general hash table can have keys of any type, but its usage is more
    cumbersome; that is why I split it into a dictionary (good for 80%
    of applications) and a general hash table (for when you need
    resizing, and keys that aren't character data).

    Now I am wondering whether that interface is not MORE complicated
    than a more capable dictionary... Anyway, if the resizing is
    automatic I can keep the same interface. The code is bulkier, though.


    > Why should 135000 buckets use much more memory if there are enough
    > objects to fill most buckets?


    Because there is more overhead per bucket. In my implementation
    we have a table of <n> slots, each slot holding a singly linked
    list of the objects that map to the same hash code.

    Space overhead is:
    (1) overhead for the linked list: 2*sizeof(void *) for each object;
    (2) overhead of the slot table: sizeof(void *) divided by the number
        of objects per slot.

    You see? The larger the table, the more overhead it incurs, since
    the occupancy per slot decreases (and hence the speed of search
    increases).
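
    (A back-of-envelope example, assuming 64-bit pointers: a table of
    135000 slots costs 135000 * 8 bytes, roughly 1 MB, for the slot
    array alone, whether or not the slots are ever occupied, plus 16
    bytes of list overhead for each line actually stored.)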

    jacob
     
    jacob navia, Aug 4, 2011
    #12
  13. Juha Nieminen

    Juha Nieminen Guest

    Ian Collins <> wrote:
    > On 08/ 4/11 07:02 PM, Juha Nieminen wrote:
    >> [snip]
    >
    > Running more than once should be sufficient to cache the file in RAM.


    That's not the point. The point was to avoid needless memory allocations.
     
    Juha Nieminen, Aug 4, 2011
    #13
  14. jacob navia

    jacob navia Guest

    On 04/08/11 16:16, Juha Nieminen wrote:
    > Ian Collins<> wrote:
    >> Running more than once should be sufficient to cache the file in RAM.
    >
    > That's not the point. The point was to avoid needless memory allocations.


    In the C containers library the container ALWAYS owns the memory used
    by the objects/keys. This means the given buffers are always copied
    when you add an item.

    What is the behavior in C++? Is it different?
     
    jacob navia, Aug 4, 2011
    #14
  15. Nobody

    Nobody Guest

    On Thu, 04 Aug 2011 16:25:19 +0200, jacob navia wrote:

    > In the C containers library the container ALWAYS owns the memory used
    > by the objects/keys. This means the given buffers are always copied
    > when you add an item.
    >
    > What is the behavior in C++, it is different?


    No. But if you have a container which contains pointers, it
    literally contains the pointers, not the data which they point to. E.g.
    std::vector<char*>::push_back() will just add the supplied pointer; it
    won't perform a strdup() (char* doesn't necessarily imply a NUL-terminated
    string).
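
    A tiny self-contained sketch of the distinction:

    #include <string>
    #include <vector>

    int main()
    {
        char buf[] = "hello";

        std::vector<char *> ptrs;
        ptrs.push_back(buf);  // stores only the pointer; "hello" is not copied
        buf[0] = 'H';         // ptrs[0] now sees "Hello"

        std::vector<std::string> strs;
        strs.push_back(buf);  // constructs a std::string: the bytes ARE copied
        buf[0] = 'J';         // strs[0] still holds "Hello", buf is "Jello"
    }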
     
    Nobody, Aug 4, 2011
    #15