concatenated strings


Boris Gambon

Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

There may be nuls inside this char array. Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of one string (limited to 255
chars, so each size fits in one byte). Pointers can be created and set
by adding each size as an offset to the previous pointer, starting at
the beginning of the char [] array.

Using ints instead of chars would make this per-string size array
four times larger.

Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

Thanks!
 

Eric Sosman

Boris said:
Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant.

Okay, you've lost me: If the string values are known at
compile time, in what sense are they not constant? To put
it another way, what causes the string values to change?
A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

There may be nuls inside this char array. Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of the string (limited to 256
chars). Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

Using int's in stead of chars would make the array for pointers to
each string 4 times larger.

Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

It's not entirely clear what you're trying to do, but
it *may* be that you're trying to recreate the old "xstr"
utility. This program read a collection of C source files
and replaced their string literals with references to a
big array (which xstr created), the idea being to store
just one copy of "Hello, world!" instead of one copy for
each source file that used it (or even one copy for each
appearance, in some older compilers).

Before xstr:

printf ("x = %d\n", x);
printf ("%d\n", y);

After xstr:

const char strings[] = {
'x', ' ', '=', ' ', '%', 'd', '\n', 0 };
printf (&strings[0], x);
printf (&strings[4], y);

You can probably find xstr floating around the Net
somewhere, or there may even be a copy on your system
already.
 

Boris Gambon

Eric, thanks for the response.

Eric said:
Okay, you've lost me: If the string values are known at
compile time, in what sense are they not constant? To put
it another way, what causes the string values to change?

The strings need to be decoded (in place, same size), so they cannot
be declared constant:
A large number of strings should be decoded and then
referenced through an [n] array.
The executable size should be small and the overhead during runtime
should be small as well (high speed).

There may be nuls inside this char array. Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of the string (limited to 256
chars). Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

Using int's in stead of chars would make the array for pointers to
each string 4 times larger.

Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

It's not entirely clear what you're trying to do, but
it *may* be that you're trying to recreate the old "xstr"
utility. This program read a collection of C source files
and replaced their string literals with references to a
big array (which xstr created), the idea being to store
just one copy of "Hello, world!" instead of one copy for
each source file that used it (or even one copy for each
appearance, in some older compilers).

That sounds like an interesting tool, but it's not what I'm looking
for. Strings in an array use more space in the executable (compiled
with gcc) than one large char array containing strings.

The strings are created with an encoding tool and then imported into a
single C file using #include. The strings don't have the same content.
 

Jean-Christophe

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.
Ok

The executable size should be small and the overhead during runtime
should be small as well (high speed).

Of course it will!
Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

What about using some compression algorithm?
Some have a slow compression speed for a good compression ratio,
while being very efficient about decompression speed.

In the past, I had almost the same problem to solve:
on an embedded system with limited memory,
I needed to minimize the ROM space used by ASCII strings,
so I implemented Huffman compression; it worked well enough.
 

Boris Gambon

Jean-Christophe said:
I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.
Ok

The executable size should be small and the overhead during runtime
should be small as well (high speed).

Of course it will !
Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?

What about using some compression algorithm ?
Some have a slow compression speed for a good compression ratio,
while being very efficient about decompression speed.

It would certainly reduce the size of the executable by much, but the
overhead of decompression code is not acceptable. Even simple xor
decoding used now is already putting a strain on cpu time.
The code will be executed countless times (no way around that). I'm
always looking for ways to optimize execution time.

The current decoding is done in place, per byte; it does not require
extra memory.
In the past, I had almost the same problem to solve :
on an embedded system with a limited memory,
I needed to minimize ASCII strings ROM space,
so I implemented Huffman compression, worked well enough.

Thanks for the idea, Jean-Christophe.

B.
 

BartC

Boris Gambon said:
Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

So you need to access the beginning of the Nth string, and perhaps to
find the length of this string (if it's not terminated).

You already need the bytes to represent each string, you can't get around
that (without compression or sharing common substrings or whatever). So the
extra overheads needed per string range from:

0 bytes (marking string end with a high bit, and scanning all strings to
find the Nth string)
1 byte (using a terminator and again scanning the strings)
1 byte (using an 8-bit array of sizes, and scanning this array, which I
think was your idea)
2 bytes (using an 8-bit size array, plus terminators between strings)
2 bytes (using a 16-bit offset/index array, and no terminator or use
high-bit marker)
3 bytes (16-bit index array plus a terminator).

The last two will give fast access to the start of a string and its size
(subtract the Nth offset from the next to find the size), but assume less
than 64K bytes in all the strings. The arrays need not be permanent; where
the strings already have a separator, the index arrays can be built as
needed when there is a lot of string access.

What overhead figures were you hoping to achieve? Is access to the strings
random, or serial?

What's the decoding about anyway; is sufficient to just decode the whole lot
once and for all?
 

Flash Gordon

Boris said:
Jean-Christophe said:
I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array. Ok

The executable size should be small and the overhead during runtime
should be small as well (high speed).
Of course it will !
Is this a good method or are there better methods? How can this be
implemented as efficiently as possible in terms of executable size and
processing speed?
What about using some compression algorithm ?
Some have a slow compression speed for a good compression ratio,
while being very efficient about decompression speed.

It would certainly reduce the size of the executable by much, but the
overhead of decompression code is not acceptable. Even simple xor
decoding used now is already putting a strain on cpu time.
The code will be executed countless times (no way around that). I'm
always looking for ways to optimize execution time.

<snip>

If xor'ing is good enough you could try doing it a "word" at a time (you
will have to make sure the char arrays are suitably aligned, and it will
be non-portable). Alternatively you are doing it badly in some other
way, or you just don't have the time to do it during execution. Look at
the generated code to see what it is doing.

You could decode all the strings once at program startup (if you are
allowed to slow down boot-time to get execution time up).

Finally, something as simple as xor'ing the char arrays won't stop a
determined person from decoding them if that is your intent.
 

Nick Keighley

Eric, thanks for the response.

I think we need to know what you are doing with your strings. And
in particular what "convert" means.

I think Eric is using "constant" in the English-language sense
of unchanging. So the strings are in some way "compiled" into the
program?
The strings need to be decoded (in place, same size), so they cannot
be declared constant:

but they aren't "const" in the C sense. They cannot be C strings
as the consequences of modifying a string literal are not defined.
You have Undefined Behaviour.

Arrays of char then?

char encoded[] = {'h', 'e', 'l', 'l', 'o', 0};

A large number of strings should be decoded and then
referenced through an [n] array.

through a what?

again, they ain't C strings

or make the first char in the array a count. This is the way Pascal
often implements its strings. This is all a pain if you are doing it
by hand (Hollerith strings, anyone?). But it's OK if you are generating
your encoded strings programmatically.

Perhaps your source can look like this

"he\0llo"

and your output like this

{7, 'h', 'e', 0, 'l', 'l', 'o', 0}
Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

I've no idea what you mean

so use chars. chars are perfectly ok to represent small integers

seems ok. But again we don't know enough about what you are doing with
them

Strings in an array use more space in the executable (compiled
with gcc) than one large char array containing strings.

what? why?

The strings are created with a encoding tool and then imported into a
single c file using #include.
ok

The strings don't have the same content

the same contents as what?


--
Nick Keighley

Unpredictability may be exciting, but I don't believe it constitutes
good programming practice.
Richard Heathfield
 

Boris Gambon

BartC said:
Boris Gambon said:
Hi

I'm looking for an efficient way to store, read and convert strings.
The strings and string sizes are known at compile time, but they are
not constant. A large number of strings should be decoded and then
referenced through an [n] array.

The executable size should be small and the overhead during runtime
should be small as well (high speed).

So you need to be access the beginning of the Nth string, and perhaps to
find the length of this string (if it's not terminated).

Yes. Length checks are in the code, so memcmp could be used as well,
but currently it's mostly strncmp.
The ability to treat the chars as strings can be regarded as a
requirement.
You already need the bytes to represent each string, you can't get around
that (without compression or sharing common substrings or whatever). So the
extra overheads needed per string range from:

0 bytes (marking string end with a high bit, and scanning all strings to
find the Nth string)
1 byte (using a terminator and again scanning the strings)
1 byte (using an 8-bit array of sizes, and scanning this array, which I
think was your idea)
Yes.

2 bytes (using an 8-bit size array, plus terminators between strings)
2 bytes (using a 16-bit offset/index array, and no terminator or use
high-bit marker)
3 bytes (16-bit index array plus a terminator).

Right. 16 bits would allow a total size of up to 65,535 bytes, but I
think all strings will be less than 256 chars.
The last two will give fast access to the start of a string and it's size
(subtract the Nth offset from the next to find the size), but assume less
than 64K bytes in all the strings. The arrays need not be permanent; where
the strings already have a separator, the index arrays can be built as
needed when there is a lot of string access.

The index will have to be built for each run. And because they're
referenced in the code, they need to have the same index number
between builds. The order cannot be changed.
What overhead figures were you hoping to achieve? Is access to the strings
random, or serial?
Random.

What's the decoding about anyway; is sufficient to just decode the whole lot
once and for all?

Yes, that's the idea, decode all at once and then reference randomly
throughout the code. The number of strings is about 2,000. The length
varies from 2 chars to 250.

I'm not sure using an index causes surrounding data to be loaded in
registers as well on 386. That would probably take more time.

Thanks for the insights

B.
 

Boris Gambon

Nick said:
I think we need to know what you are doing with your strings. And
in particular what "convert" means.

The strings change when the executable is run, because of the decoding.
I think Eric is using "constant" in the english language sense
of unchanging. So the string are in some way "compiled" into the
program?

Yes, but they will be changed when the program is run.
but they aren't "const" in the C sense. They cannot be C strings
as the consequences of modifying a string literal are not defined.
You have Undefined Behaviour.

Are you saying strings must be const? That's news to me. :)
Arrays of char then?

char encoded[] = {'h', 'e', 'l', 'l', 'o', 0};

A large number of strings should be decoded and then
referenced through an [n] array.

through a what?

An array with n elements.
again, they ain't C strings

I know. It does not matter. Some may be used as strings other will not
be used as strings.
or make the first char in the array a count. This the Pascal
often implements it's strings. This is all a pain if you are doing it
by hand (hollerith strings anyone?). But ok if you are generating
your encoded strings programatically.

The encoded strings are generated from input before compiling.
Perhaps your source can look like this

"he\0llo"

and your output like this

{7, 'h', 'e', 0, 'l', 'l', 'o', 0}

Or the lengths may be in another char array, so that array does not
have to be built at runtime.
Pointers can be created and set by adding offset to the
previous pointer, starting with the beginning of the char [] array.

I've no idea what you mean

Each string can be referenced by a pointer. This pointer is part of an
array.

e.g. array[2] points to the third string.
so use chars. chars are perfectly ok to represent small integers


seems ok. But again we don't know enough about what you are doing with
them

They're used throughout the code, mainly for comparisons.
what? why?

I don't know why, but separate strings occupy more space in the
executable. Possibly because they're aligned individually.
the same contents as what?

Same as other strings, there is little overlap.

B.
 

Nick Keighley

The string changes when the executable is run because of decoding.

OK, the "strings" (they really aren't strings in the C sense!) are
compiled into the program in encoded form and decoded when the
program runs.

Are you saying strings must be const? That's news to me. :)

No, I didn't. I said string literals cannot be modified. This allows
the implementor the option of putting strings in ROM if he wants to.
Your compiler *may* allow you to modify string literals, but this
is not generally portable.
Arrays of char then?
char encoded[] = {'h', 'e', 'l', 'l', 'o', 0};
or make the first char in the array a count. This the [way] Pascal
often implements [its] strings. This is all a pain if you are doing it
by hand (Hollerith strings anyone?). But ok if you are generating
your encoded strings programmatically.

The encoded strings are generated from input before compiling.
Perhaps your source can look like this

and your output like this
{7, 'h', 'e', 0, 'l', 'l', 'o', 0}

Or the length may be in a another char array so that array does not
have to be built at runtime.

it doesn't have to be built at run time if you do it my way either

int main (void)
{
    char encoded[] = {7, 'h', 'e', 0, 'l', 'l', 'o', 0};
    decode (encoded);
    return 0;
}

there is no run time building going on.
You don't have to do it my way it was just a suggestion but
you should reject it for real reasons not mistaken ones.

I don't know why, but separate strings occupy more space in the
executable. Possibly because they're aligned individually.

that makes no sense. You need to examine the assembler

<snip>
 

Nick Keighley

oops I just worked out what you are doing. Ignore some bits
of previous post


Nick said:
Eric, thanks for the response.
Eric Sosman wrote:
Boris Gambon wrote:
Even so, each string should
be nul terminated. Rather than using nuls to indicate the end of
strings, perhaps it's efficient to provide another char array, each
element of which contains the size of the string (limited to 256
chars).
or make the first char in the array a count. This the [way] Pascal
often implements [its] strings. This is all a pain if you are doing it
by hand (hollerith strings anyone?). But ok if you are generating
your encoded strings programatically.
The encoded strings are generated from input before compiling.
Or the length may be in a another char array so that array does not
have to be built at runtime.

it doesn't have to be built at run time if you do it my way either

this is wrong.

you are doing something like this:
char ptrs[] = {0, 7};
char encoded[] = {'h', 'e', 0, 'l', 'l', 'o', 0, 'w', 'o', 'r',
'l', 'd', 0};
 
