jacob navia
Any container will add *some* complexity and overhead to the data it stores.
A container object in C will never be as fast as managing each individual
datum by hand, i.e. giving each datum an address and managing it
individually, as is done in assembly language.
A container allows for scalability precisely by simplifying data management.
So, we pay an overhead.
How much of an overhead?
Let's take lcc-win's strings. They encapsulate a String object, reduced
to the bare essentials:
typedef struct _StringA {
    size_t count;     // Elements
    char *content;
    size_t capacity;  // Allocated space
} String;
On a 64-bit system, an 8-character string will take:
24+8 --> 32 bytes, i.e. an overhead of 300%.
On a 32-bit system we would have
12+8 --> 20 bytes, an overhead of 150%.
Let's compare this to Java's string class. There, a character string of 8
characters will take 64 bytes, and there is nothing that the programmer
can do about it.
In C++ the size of a string of 8 characters appears to be 40 bytes. I
used the following program:
D:\temp>type str.cpp
#include <string>
#include <iostream>
using namespace std;

int main(void)
{
    string m("12345678");
    cout << sizeof(m) << endl;
}
This outputs
40
In a C container library, it is possible to design a "small string" type
(maybe in Java too, I do not know). That type can be restricted to
strings shorter than 65535 characters (probably 99.999999% of the
strings used in a program). If we do that, we can reduce the overhead from
24 to 16 bytes. The alignment requirements in 64 bits still haunt us.
If we get rid of the pointer, however, and store the characters in a
structure with a variable "tail" as introduced by C99 (a flexible array
member), we can curtail the alignment requirements and reduce the
overhead to just 4 bytes: two unsigned shorts that specify the length
and the capacity, followed by the actual data.
In C99 we would have
typedef struct _smallString {
    unsigned short count;     // Elements
    unsigned short capacity;  // Allocated space
    char contents[];
} smString;
In this case the overhead is 4 bytes, i.e. only 50%.
By making the workings of the container apparent, C has the advantage
of making programmers aware of what they are paying for each
container. The big problem in Java and other Java-like languages
(like C#) is that programmers are not used to designing their own data
types (and are even not supposed to); they reuse some package that will
do the job, without caring about possible overhead costs.
For a container library in C, fighting overhead and providing smaller
versions of data types with less overhead will be an important point.
In a typical Java heap we have tens or hundreds of thousands, even
millions, of live collections. [1]
Java heaps have grown from 500 MB to 2-3 GB now, without supporting
more features or users. It is increasingly common to require 1 GB of
memory just to support a few hundred users: saving 500 KB of session
state PER USER, requiring 2 MB for a text index per simple document, or
creating 100K temporary objects per web hit.
The consequences are clear: scalability disappears. At several
thousand users, the Java solution will require more than the 16 GB of
installed RAM and will start swapping, killing performance. Power
usage goes up, and more machines need to be bought to support the
bloat. With more machines come more communications, more overhead,
etc.
It is common to claim here that C can't be used for normal
workstation applications. I am convinced that this is wrong. C
can be used for web servers and web applications, and with a reasonable
container library it would have three "killer arguments" for its use:
scalability, low overhead, and performance.
jacob
(yes, I am biased)
[1]
http://domino.research.ibm.com/comm...ILE/oopsla08 memory-efficient java slides.pdf