concatenated strings


Boris Gambon

Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

There may be nuls inside this char array. Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of one string (limited to 255
chars, so each size fits in one byte). Pointers can be created and set
by adding each size as an offset to the previous pointer, starting at
the beginning of the char [] array.

Using ints instead of chars would make this per-string size array
four times larger.

Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

Thanks!
 

Eric Sosman

Boris said:
Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant.

Okay, you've lost me: If the string values are known at
compile time, in what sense are they not constant? To put
it another way, what causes the string values to change?
A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

There may be nuls inside this char array. Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of the string (limited to 256
chars). Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

Using int's in stead of chars would make the array for pointers to
each string 4 times larger.

Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

It's not entirely clear what you're trying to do, but
it *may* be that you're trying to recreate the old "xstr"
utility. This program read a collection of C source files
and replaced their string literals with references to a
big array (which xstr created), the idea being to store
just one copy of "Hello, world!" instead of one copy for
each source file that used it (or even one copy for each
appearance, in some older compilers).

Before xstr:

printf ("x = %d\n", x);
printf ("%d\n", y);

After xstr:

const char strings[] = {
'x', ' ', '=', ' ', '%', 'd', '\n', 0 };
printf (&strings[0], x);
printf (&strings[4], y);

You can probably find xstr floating around the Net
somewhere, or there may even be a copy on your system
already.
 

Boris Gambon

Eric, thanks for the response.

Eric said:
Okay, you've lost me: If the string values are known at
compile time, in what sense are they not constant? To put
it another way, what causes the string values to change?

The strings need to be decoded (in place, same size), so they cannot
be declared constant:
A large number of strings should be decoded and then
referenced through an [n] array.
The executable size should be small and the overhead during runtime
should be small as well (high speed).

There may be nuls inside this char array. Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of the string (limited to 256
chars). Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

Using int's in stead of chars would make the array for pointers to
each string 4 times larger.

Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

It's not entirely clear what you're trying to do, but
it *may* be that you're trying to recreate the old "xstr"
utility. This program read a collection of C source files
and replaced their string literals with references to a
big array (which xstr created), the idea being to store
just one copy of "Hello, world!" instead of one copy for
each source file that used it (or even one copy for each
appearance, in some older compilers).

That sounds like an interesting tool, but it's not what I'm looking
for. Strings in an array use more space in the executable (compiled
with gcc) than one large char array containing strings.

The strings are created with an encoding tool and then imported into a
single C file using #include. The strings don't have the same content.
 

Jean-Christophe

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.
Ok

The executable size should be small and the overhead during runtime
should be small as well (high speed).

Of course it will!
Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

What about using some compression algorithm?
Some have a slow compression speed for a good compression ratio,
while being very efficient about decompression speed.

In the past, I had almost the same problem to solve:
on an embedded system with limited memory,
I needed to minimize the ROM space used by ASCII strings,
so I implemented Huffman compression; it worked well enough.
 

Boris Gambon

Jean-Christophe said:
I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.
Ok

The executable size should be small and the overhead during runtime
should be small as well (high speed).

Of course it will !
Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

What about using some compression algorithm ?
Some have a slow compression speed for a good compression ratio,
while being very efficient about decompression speed.

It would certainly reduce the size of the executable by much, but the
overhead of decompression code is not acceptable. Even simple xor
decoding used now is already putting a strain on cpu time.
The code will be executed countless times (no way around that). I'm
always looking for ways to optimize execution time.

The current decoding is done in place, per byte; it does not require
extra memory.
In the past, I had almost the same problem to solve :
on an embedded system with a limited memory,
I needed to minimize ASCII strings ROM space,
so I implemented Huffman compression, worked well enough.

Thanks for the idea, Jean-Christophe.

B.
 

BartC

Boris Gambon said:
Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

So you need to access the beginning of the Nth string, and perhaps to
find the length of this string (if it's not terminated).

You already need the bytes to represent each string, you can't get around
that (without compression or sharing common substrings or whatever). So the
extra overheads needed per string range from:

0 bytes (marking string end with a high bit, and scanning all strings to
find the Nth string)
1 byte (using a terminator and again scanning the strings)
1 byte (using an 8-bit array of sizes, and scanning this array, which I
think was your idea)
2 bytes (using an 8-bit size array, plus terminators between strings)
2 bytes (using a 16-bit offset/index array, and no terminator or use
high-bit marker)
3 bytes (16-bit index array plus a terminator).

The last two will give fast access to the start of a string and its size
(subtract the Nth offset from the next to find the size), but assume less
than 64K bytes in all the strings. The arrays need not be permanent; where
the strings already have a separator, the index arrays can be built as
needed when there is a lot of string access.

What overhead figures were you hoping to achieve? Is access to the strings
random, or serial?

What's the decoding about anyway; is sufficient to just decode the whole lot
once and for all?
 

Flash Gordon

Boris said:
Jean-Christophe said:
I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array. Ok

The executable size should be small and the overhead during runtime
should be small as well (high speed).
Of course it will !
Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?
What about using some compression algorithm ?
Some have a slow compression speed for a good compression ratio,
while being very efficient about decompression speed.

It would certainly reduce the size of the executable by much, but the
overhead of decompression code is not acceptable. Even simple xor
decoding used now is already putting a strain on cpu time.
The code will be executed countless times (no way around that). I'm
always looking for ways to optimize execution time.

<snip>

If xor'ing is good enough you could try doing it a "word" at a time (you
will have to make sure the char arrays are suitably aligned, and it will
be non-portable). Alternatively you are doing it badly in some other
way, or you just don't have the time to do it during execution. Look at
the generated code to see what it is doing.

You could decode all the strings once at program startup (if you are
allowed to slow down boot-time to get execution time up).

Finally, something as simple as xor'ing the char arrays won't stop a
determined person from decoding them if that is your intent.
 

Nick Keighley

Eric, thanks for the response.

I think we need to know what you are doing with your strings. And
in particular what "convert" means.

I think Eric is using "constant" in the English-language sense
of unchanging. So the strings are in some way "compiled" into the
program?
The strings need to be decoded (in place, same size), so they cannot
be declared constant:

but they aren't "const" in the C sense. They cannot be C strings
as the consequences of modifying a string literal are not defined.
You have Undefined Behaviour.

Arrays of char then?

char encoded[] = {'h', 'e', 'l', 'l', 'o', 0};

A large number of strings should be decoded and then
referenced through an [n] array.

through a what?

again, they ain't C strings

or make the first char in the array a count. This is the way Pascal
often implements its strings. This is all a pain if you are doing it
by hand (Hollerith strings, anyone?). But it's OK if you are generating
your encoded strings programmatically.

Perhaps your source can look like this

"he\0llo"

and your output like this

{7, 'h', 'e', 0, 'l', 'l', 'o', 0}
Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

I've no idea what you mean

so use chars. chars are perfectly ok to represent small integers

seems ok. But again we don't know enough about what you are doing with
them

Strings in an array use more space in the executable (compiled
with gcc) than one large char array containing strings.

what? why?

The strings are created with a encoding tool and then imported into a
single c file using #include.
ok

The strings don't have the same content

the same contents as what?


--
Nick Keighley

Unpredictability may be exciting, but I don't believe it constitutes
good programming practice.
Richard Heathfield
 

Boris Gambon

BartC said:
Boris Gambon said:
Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

So you need to be access the beginning of the Nth string, and perhaps to
find the length of this string (if it's not terminated).

Yes. Length checks are in the code, so memcmp could be used as well,
but currently it's mostly strncmp.
The ability to treat the chars as strings can be regarded as a
requirement.
You already need the bytes to represent each string, you can't get around
that (without compression or sharing common substrings or whatever). So the
extra overheads needed per string range from:

0 bytes (marking string end with a high bit, and scanning all strings to
find the Nth string)
1 byte (using a terminator and again scanning the strings)
1 byte (using an 8-bit array of sizes, and scanning this array, which I
think was your idea)
Yes.

2 bytes (using an 8-bit size array, plus terminators between strings)
2 bytes (using a 16-bit offset/index array, and no terminator or use
high-bit marker)
3 bytes (16-bit index array plus a terminator).

Right. 16 bits would allow a total size of up to 65,535 bytes, but I
think all strings will be less than 256 chars.
The last two will give fast access to the start of a string and it's size
(subtract the Nth offset from the next to find the size), but assume less
than 64K bytes in all the strings. The arrays need not be permanent; where
the strings already have a separator, the index arrays can be built as
needed when there is a lot of string access.

The index will have to be built for each run. And because they're
referenced in the code, they need to have the same index number
between builds. The order cannot be changed.
What overhead figures were you hoping to achieve? Is access to the strings
random, or serial?
Random.

What's the decoding about anyway; is sufficient to just decode the whole lot
once and for all?

Yes, that's the idea, decode all at once and then reference randomly
throughout the code. The number of strings is about 2,000. The length
varies from 2 chars to 250.

I'm not sure using an index causes surrounding data to be loaded in
registers as well on 386. That would probably take more time.

Thanks for the insights

B.
 

Boris Gambon

Nick said:
I think we need to know what you are doing with your strings. And
in particular what "convert" means.

The strings change when the executable is run, because of the decoding.
I think Eric is using "constant" in the english language sense
of unchanging. So the string are in some way "compiled" into the
program?

Yes, but they will be changed when the program is run.
but they aren't "const" in the C sense. They cannot be C strings
as the consequences of modifying a string literal are not defined.
You have Undefined Behaviour.

Are you saying strings must be const? That's news to me. :)
Arrays of char then?

char encoded[] = {'h', 'e', 'l', 'l', 'o', 0};

A large number of strings should be decoded and then
referenced through an [n] array.

through a what?

An array with n elements.
again, they ain't C strings

I know. It does not matter. Some may be used as strings other will not
be used as strings.
or make the first char in the array a count. This the Pascal
often implements it's strings. This is all a pain if you are doing it
by hand (hollerith strings anyone?). But ok if you are generating
your encoded strings programatically.

The encoded strings are generated from input before compiling.
Perhaps your source can look like this

"he\0llo"

and your output like this

{7, 'h', 'e', 0, 'l', 'l', 'o', 0}

Or the lengths may be in another char array, so that array does not
have to be built at runtime.
Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

I've no idea what you mean

Each string can be referenced by a pointer. This pointer is part of an
array.

e.g. array[2] points to the third string.
so use chars. chars are perfectly ok to represent small integers


seems ok. But again we don't know enough about what you are doing with
them

They're used throughout the code, mainly for comparisons.
what? why?

I don't know why, but separate strings occupy more space in the
executable. Possibly because they're aligned individually.
the same contents as what?

Same as other strings, there is little overlap.

B.
 

Nick Keighley

The string changes when the executable is run because of decoding.

OK, the "strings" (they really aren't strings in the C sense!) are
compiled into the program in encoded form and decoded when the
program runs.

Are you saying strings must be const? That's news to me. :)

No, I didn't. I said string literals cannot be modified. This allows
the implementor the option of putting strings in ROM if he wants to.
Your compiler *may* allow you to modify string literals, but this
is not generally portable.
Arrays of char then?
char encoded[] = {'h', 'e', 'l', 'l', 'o', 0};
or make the first char in the array a count. This the [way] Pascal
often implements [its] strings. This is all a pain if you are doing it
by hand (Hollerith strings anyone?). But ok if you are generating
your encoded strings programmatically.

The encoded strings are generated from input before compiling.
Perhaps your source can look like this

and your output like this
{7, 'h', 'e', 0, 'l', 'l', 'o', 0}

Or the length may be in a another char array so that array does not
have to be built at runtime.

it doesn't have to be built at run time if you do it my way either

int main (void)
{
    char encoded[] = {7, 'h', 'e', 0, 'l', 'l', 'o', 0};
    decode (encoded);
    return 0;
}

there is no run time building going on.
You don't have to do it my way it was just a suggestion but
you should reject it for real reasons not mistaken ones.

I don't know why, but separate strings occupy more space in the
executable. Possibly because they're aligned individually.

that makes no sense. You need to examine the assembler

<snip>
 

Nick Keighley

oops I just worked out what you are doing. Ignore some bits
of previous post


Nick said:
Eric, thanks for the response.
Eric Sosman wrote:
Boris Gambon wrote:
Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of the string (limited to 256
chars).
or make the first char in the array a count. This the [way] Pascal
often implements [its] strings. This is all a pain if you are doing it
by hand (hollerith strings anyone?). But ok if you are generating
your encoded strings programatically.
The encoded strings are generated from input before compiling.
Or the length may be in a another char array so that array does not
have to be built at runtime.

it doesn't have to be built at run time if you do it my way either

this is wrong.

you are doing something like this:
char ptrs[] = {0, 7};
char encoded[] = {'h', 'e', 0, 'l', 'l', 'o', 0, 'w', 'o', 'r',
'l', 'd', 0};
 
