String in programming languages that are based off C

Ben Bacarisse · Feb 17, 2014

James Kuyper said:

In context, it looks pretty peculiar to me, too, and I'm pretty far from
being a beginner. As an expert C programmer, I can imagine a less
experienced programmer writing

((char*)ts+1)[l] = '\0';

I would strongly recommend against writing code that way rather than the
alternative that was also given:

ts->string[l] = '\0';

but there's at least a chance that those two alternatives do the same
thing, which is not the case for what was actually given:

((char *)(ts + 1))[l] = '\0';

If "those two alternatives" refer to the first two statements quoted,

It does.

then it is very hard to see them doing the same thing.

They will do the same thing if the first member of the struct is exactly
one byte long, and there is no padding between that member and the one
named "string". That's very unlikely to be the case,

Right. I meant hard to see them doing the same thing in the given
context. But then, freed from the context in question, the third
could also do the same thing.
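
To make the difference between the first and third statements concrete, here is a minimal sketch. It assumes a hypothetical `struct header` (not the real Lua type) and two hypothetical helper functions that report how far each expression moves the pointer:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real header type; not Lua's TString. */
struct header {
    size_t len;
    unsigned int hash;
};

/* Byte distance from ts to ((char *)ts + 1): the cast is applied first,
   so the pointer advances by one char. */
static size_t offset_cast_then_add(struct header *ts)
{
    return (size_t)(((char *)ts + 1) - (char *)ts);
}

/* Byte distance from ts to (char *)(ts + 1): the addition is done on a
   struct header *, so the pointer advances by sizeof(struct header). */
static size_t offset_add_then_cast(struct header *ts)
{
    return (size_t)((char *)(ts + 1) - (char *)ts);
}
```

The two only agree when the struct is exactly one byte long, which is not the case for any header that stores a length.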

<snip>

Kaz Kylheku · Feb 17, 2014

Hello All,

I found the below string design pattern a bit too hard to absorb. I just
noticed that same pattern was used by two different languages. Now, I am
thinking why can't they use something like this:

ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts.string, str, l*sizeof(char));

You mean ts->string or (*ts).string. (We cannot apply the -> operator to an
identifier, and then in the same scope apply the . operator; it's
contradictory.)

ts.string[l] = '\0'; /* ending 0 */

Instead of the below, and what is the advantage of this pattern.
ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts+1, str, l*sizeof(char));
((char *)(ts+1))[l] = '\0'; /* ending 0 */

These two could be similar.

The luaC_newobj function could initialize the "string" pointer inside "ts" to
be to an area past the "ts" structure, sacrificing a storage location for the
sake of terser code.

Furthermore, sacrificing space is not necessary because "ts" could be using the
famous "C struct hack", such that string is actually an array at the end:

struct ts_struct {
    /* ... */
    char string[1];
};

If N bytes of memory are allocated, where N >= sizeof (ts_struct), then
there are N - offsetof(struct ts_struct, string) bytes available in string[],
effectively.
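
A minimal sketch of that allocation arithmetic follows. The `len` member and the `ts_new` function are hypothetical names for illustration, not anything from Lua. Strictly speaking the standard does not bless indexing past the declared bound of `string[1]`, although the idiom is long established in practice:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct illustrating the pre-C99 "struct hack". */
struct ts_struct {
    size_t len;
    char string[1];   /* really the start of a longer, over-allocated array */
};

/* Allocate one block holding the header plus len characters and a '\0'.
   Returns NULL if the allocation fails. */
static struct ts_struct *ts_new(const char *str, size_t len)
{
    size_t n = offsetof(struct ts_struct, string) + len + 1;
    struct ts_struct *ts;

    if (n < sizeof *ts)
        n = sizeof *ts;   /* keep N >= sizeof (struct ts_struct) */
    ts = malloc(n);
    if (ts == NULL)
        return NULL;
    ts->len = len;
    memcpy(ts->string, str, len);
    ts->string[len] = '\0';
    return ts;
}
```

With this layout the caller writes ts->string[i] directly and releases everything with a single free(ts).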

Malcolm McLean · Feb 18, 2014

I assume this means that Lua programs have no access to system-specific
functions (or graphics). Which makes it essentially useless as a
programming environment.

I knew there was a reason I never got around to learning it.

It's not designed for stand alone programs, but for adding scripting to video
games implemented in C/C++.

Malcolm McLean · Feb 18, 2014

Hello All,

I found the below string design pattern a bit too hard to absorb. I just
noticed that same pattern was used by two different languages. Now, I am
thinking why can't they use something like this:

ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts.string, str, l*sizeof(char));
ts.string[l] = '\0'; /* ending 0 */

Instead of the below, and what is the advantage of this pattern.

ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts+1, str, l*sizeof(char));
((char *)(ts+1))[l] = '\0'; /* ending 0 */

Memory games.
ts is a pointer to some sort of internal structure used by luaC_newobj.
The space behind this structure is being used to hold a string. It's rather
a dangerous thing to do, and usually indicates poor design, which is why C
makes the syntax a bit tricky. However it probably isn't a bug - the Lua
system likely knows that that space is not used for anything else and is big
enough to hold the biggest legal string.

Ben Bacarisse · Feb 18, 2014

Malcolm McLean said:
ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts+1, str, l*sizeof(char));
((char *)(ts+1))[l] = '\0'; /* ending 0 */

Memory games.
ts is a pointer to some sort of internal structure used by luaC_newobj.
The space behind this structure is being used to hold a string. It's rather
a dangerous thing to do,

What's dangerous about it?

and usually indicates poor design, which is why C
makes the syntax a bit tricky.

So why did C99 provide a specific syntax to simplify doing this?

<snip>

Kaz Kylheku · Feb 18, 2014

I assume this means that Lua programs have no access to system-specific
functions (or graphics).

I wouldn't assume any such thing from the above statement, but rather interpret
the statement as being about the language dialect only, not about the use of
libraries.

Of course Lua programs can have bindings to APIs that are not in the
Library section of 1989 ANSI C.

http://lua-users.org/wiki/LibrariesAndBindings

Presumably, it has a core that doesn't fail to build if some of these are
unavailable.

janus · Feb 18, 2014

Lua is probably the largest open source project I know of which is almost
entirely implemented in pure C89. (They have optional non-compliant bits to
speed up some conversions, to support loadable modules, etc). The very latest
Lua implementation (5.2) still compiles with Turbo C 1.0. They take ANSI/ISO
compliance very seriously so as to remain portable to as many hosted and
non-hosted environments as reasonable.

I think the OP munged some stuff, as is evident from the apparently varying
type of `ts'. I can't find the posted code in either the Lua 5.1 or Lua 5.2
implementations, although maybe it's derived from an earlier version.

Because Lua strives to be very strict C, I think this group is as good as
any others. The Lua mailing-list is mostly devoted to usage of the Lua
language and the Lua C API. The posted code is derived from the
implementation. Most discussions concerning the implementation revolve around
actual and practical C89 compliance, because using extensions is usually out
of the question unless it's an optional, compiler-specific optimization.

Check out this link for the code, http://www.lua.org/source/5.2/lstring.c.html

janus · Feb 18, 2014

Hello All,

I found the below string design pattern a bit too hard to absorb. I just
noticed that same pattern was used by two different languages. Now, I am
thinking why can't they use something like this:

ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts.string, str, l*sizeof(char));

You mean ts->string or (*ts).string. (We cannot apply the -> operator to an
identifier, and then in the same scope apply the . operator; it's
contradictory.)

ts.string[l] = '\0'; /* ending 0 */

Instead of the below, and what is the advantage of this pattern.
ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts+1, str, l*sizeof(char));
((char *)(ts+1))[l] = '\0'; /* ending 0 */

These two could be similar.

The luaC_newobj function could initialize the "string" pointer inside "ts" to
be to an area past the "ts" structure, sacrificing a storage location for the
sake of terser code.

Furthermore, sacrificing space is not necessary because "ts" could be using the
famous "C struct hack", such that string is actually an array at the end:

struct ts_struct {
    /* ... */
    char string[1];
};

If N bytes of memory are allocated, where N >= sizeof (ts_struct), then
there are N - offsetof(struct ts_struct, string) bytes available in string[],
effectively.

My bad, was thinking of ts->string and not ts.string

Keith Thompson · Feb 18, 2014

[131 double-spaced lines deleted]

Check out this link for the code, http://www.lua.org/source/5.2/lstring.c.html

Please use a real newsreader to post here rather than the horribly
broken Google Groups web interface. GG, for some reason, likes to
double-space, and sometimes quadruple-space, quoted text. Articles
should have actual line breaks to keep them below 80 columns, preferably
about 72 columns. I use news.eternal-september.org as my news server
(it's free) and Gnus, which runs under Emacs, as my newsreader;
Thunderbird also includes a decent newsreader.

If you must use GG, please copy-and-paste your article into a decent text
editor, edit out the added blank lines, and trim the quoted text down to just
what's necessary for your followup to make sense; you don't need to quote all
of a 100+-line article to add a one-line reply. (But do keep some context.)

James Kuyper · Feb 18, 2014

On 02/18/2014 01:12 PM, janus wrote:
....

Check out this link for the code, http://www.lua.org/source/5.2/lstring.c.html

That's a great improvement. First of all, it includes the code that you
gave in your first message, code which was entirely missing from more
complete code that you showed in your second message. That code now
appears in context, and that context confirms Ben Bacarisse's
explanation of the use of ts+1, which I misunderstood. That code
over-allocates space for the struct, and ts+1 points to the first byte
of excess space, which is where the actual contents of the string is stored.
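
A minimal sketch of that layout, to show what ts+1 is doing. The `header` type and the `header_new` and `header_str` functions are hypothetical simplifications, not the real Lua definitions, and overflow of the size computation is ignored for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header, loosely modeled on the fields in the posted code;
   it is not the real Lua TString. */
typedef struct header {
    size_t len;
    unsigned int hash;
} header;

/* Allocate the header and the string contents as one block.  The
   characters start at the first byte after the header, i.e. at ts + 1. */
static header *header_new(const char *str, size_t l)
{
    size_t totalsize = sizeof(header) + (l + 1) * sizeof(char);
    header *ts = malloc(totalsize);

    if (ts == NULL)
        return NULL;
    ts->len = l;
    ts->hash = 0;                      /* placeholder; no hashing here */
    memcpy(ts + 1, str, l * sizeof(char));
    ((char *)(ts + 1))[l] = '\0';      /* ending 0 */
    return ts;
}

/* Recover the string contents from the header pointer. */
static const char *header_str(const header *ts)
{
    return (const char *)(ts + 1);
}
```

Nothing in the struct names the string; every access has to go through the ts + 1 cast.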

Now, I'm finally prepared to answer your original question. The approach
used in the actual Lua code has one key advantage: it has defined
behavior even when using C90.

However, since I don't believe in catering to old standards (C99 is
already 14 years old), I would favor taking advantage of C99's concept
of flexible array members. Your suggested alternative code:

ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts.string, str, l*sizeof(char));
ts.string[l] = '\0'; /* ending 0 */

is inherently wrong because of the use of ts.string instead of ts->string.
Even with that correction, it's still wrong, given that "ts" has the
type TString*, and TString is a union type, not a struct type. However,
if TString were modified as follows:

typedef union TString {
L_Umaxalign dummy; /* ensures maximum alignment for strings */
struct {
CommonHeader;
lu_byte reserved;
unsigned int hash;
size_t len; /* number of characters in string */
char string[];
} tsv;
} TString;

then your code would be correct if you replaced ts.string with
ts->tsv.string, and I would strongly favor using that approach instead
of the one used in the actual Lua implementation.
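
For comparison, here is a minimal sketch of allocating and filling a struct with a flexible array member. The `fam_string` type and `fam_string_new` function are hypothetical and much simpler than the real TString; they only illustrate the C99 mechanism:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified struct; the real TString has more fields. */
typedef struct fam_string {
    unsigned int hash;
    size_t len;        /* number of characters in string */
    char string[];     /* C99 flexible array member */
} fam_string;

/* Allocate the struct and its characters as one block.
   Returns NULL if the allocation fails. */
static fam_string *fam_string_new(const char *str, size_t l)
{
    /* sizeof *ts does not count the flexible array member, so add room
       for l characters plus the terminating '\0'. */
    fam_string *ts = malloc(sizeof *ts + l + 1);

    if (ts == NULL)
        return NULL;
    ts->hash = 0;      /* placeholder; no hashing here */
    ts->len = l;
    memcpy(ts->string, str, l);
    ts->string[l] = '\0';
    return ts;
}
```

The string is reached through a named member of declared type char[], so no cast is needed at the point of use.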

However, with the original code, both ts and ts+1 are guaranteed (by
whoever typedefs L_Umaxalign - the C standard guarantees no such thing)
to be maximally aligned. With the above modification, ts is guaranteed
to be maximally aligned, but ts->tsv.string is not. The phrase "ensures
maximum alignment for strings" is ambiguous - I'm not sure if it would
be considered to apply to ts->tsv.string, or only to ts.

If use of C2011 were permitted, there would be no need for "dummy", and
therefore no need for a union (making things a bit simpler), in order to
ensure that ts->string was maximally aligned:

#include <stddef.h> // for max_align_t

typedef struct TString {
CommonHeader;
lu_byte reserved;
unsigned int hash;
size_t len; /* number of characters in string */
    _Alignas(max_align_t) char string[];
} TString;
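
The alignment claim can be checked at compile time. The sketch below uses a hypothetical `aligned_string` struct with only two of the fields, and assumes a C11 compiler that accepts an alignment specifier on a flexible array member:

```c
#include <assert.h>
#include <stddef.h>     /* max_align_t, offsetof */

/* Hypothetical reduced version of the struct, for illustration only. */
typedef struct aligned_string {
    unsigned int hash;
    size_t len;
    _Alignas(max_align_t) char string[];
} aligned_string;

/* The member must start on a maximally aligned offset ... */
_Static_assert(offsetof(aligned_string, string) % _Alignof(max_align_t) == 0,
               "string is not maximally aligned within the struct");

/* ... and the struct itself must be at least that strictly aligned. */
_Static_assert(_Alignof(aligned_string) >= _Alignof(max_align_t),
               "struct alignment is weaker than max_align_t");

/* Run-time accessor for the same offset, handy for printing or testing. */
static size_t string_offset(void)
{
    return offsetof(aligned_string, string);
}
```

Because memory from malloc is suitably aligned for any object, an aligned offset within the struct gives an aligned address for string.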

Kaz Kylheku · Feb 18, 2014

On 02/18/2014 01:12 PM, janus wrote:
...

That's a great improvement. First of all, it includes the code that you
gave in your first message, code which was entirely missing from more
complete code that you showed in your second message. That code now
appears in context, and that context confirms Ben Bacarisse's
explanation of the use of ts+1, which I misunderstood. That code
over-allocates space for the struct, and ts+1 points to the first byte
of excess space, which is where the actual contents of the string is stored.

Now, I'm finally prepared to answer your original question. The approach
used in the actual Lua code has one key advantage: it has defined
behavior even when using C90.

However, since I don't believe in catering to old standards (C99 is
already 14 years old), I would favor taking advantage of C99's concept
of flexible array members.

So you think that the ages old, reliable array-[1]-at-the-end-of-a-struct hack
suddenly does not work in C99 compilers when they are operated in C90 mode?

Keith Thompson · Feb 18, 2014

James Kuyper said:
On 02/18/2014 01:12 PM, janus wrote:
...

That's a great improvement. First of all, it includes the code that you
gave in your first message, code which was entirely missing from more
complete code that you showed in your second message. That code now
appears in context, and that context confirms Ben Bacarisse's
explanation of the use of ts+1, which I misunderstood. That code
over-allocates space for the struct, and ts+1 points to the first byte
of excess space, which is where the actual contents of the string is stored.

Now, I'm finally prepared to answer your original question. The approach
used in the actual Lua code has one key advantage: it has defined
behavior even when using C90.

However, since I don't believe in catering to old standards (C99 is
already 14 years old), I would favor taking advantage of C99's concept
of flexible array members.

[...]

I don't believe that's an option in this case. The Lua
implementation apparently is very carefully written to conform to
the C89/C90 standard (and also to compile as C++) to maximize the
number of compilers that can be used to compile it. It's optimized
for portability over modernity.

[...]

Malcolm McLean · Feb 18, 2014

What's dangerous about it?

So why did C99 provide a specific syntax to simplify doing this?

You're writing to a block of memory in an uncontrolled way.
Buffers (reserved areas of memory) are a valid concept in C, but they should
normally be of only one type of object. Otherwise it's tempting to say
"this buffer can hold a hundred size_ts or two hundred sint16s".
Let's say that the struct has a member added or subtracted. Will this break the
code? How would you find out? Let's say we move to wchar_t for our strings.
Will adding a byte member to the struct break the code now? How would you find out?

There are answers, of course. Sometimes you have to do these things. But
often it's a sign of bad programming, micro-optimisation which impacts the
maintainability of the code. Most IT projects don't fail because the
program fragments memory too much. They fail because the interactions between
the various components get too complicated for the programmers to understand,
additional development causes unexpected bugs elsewhere, and becomes too
expensive and error-prone to be viable.

glen herrmannsfeldt · Feb 18, 2014

(snip, someone wrote)

You're writing to a block of memory in an uncontrolled way.
Buffers (reserved areas of memory) are a valid concept in C, but
they should normally be of only one type of object.
Otherwise it's tempting to say "this buffer can hold a hundred
size_ts or two hundred sint16s".

Let's say that the struct has a member added or subtracted.
Will this break the code? How would you find out? Let's say we
move to wchar_t for our strings.
Will adding a byte member to the struct break the code now?
How would you find out?

If you write and read back in the same program, then there should
be no problem. But yes, if you want to read on a different system,
where there might be different size or byte order, then it is
a problem.

In the days of smaller computers, it used to be much more common
to write out temporary files and read them back again.

-- glen

Ben Bacarisse · Feb 18, 2014

Malcolm McLean said:
You're writing to a block of memory in an uncontrolled way. Buffers
(reserved areas of memory) are a valid concept in C, but they should
normally be of only one type of object. Otherwise it's tempting to say
"this buffer can hold a hundred size_ts or two hundred sint16s".
Let's say that the struct has a member added or subtracted. Will this
break the code? How would you find out? Let's say we move to wchar_t
for our strings. Will adding a byte member to the struct break the
code now? How would you find out?

Maybe this is a difference in the use of the term dangerous. Some data
structures need more care than others, but I don't call it dangerous.

There are answers, of course. Sometimes you have to do these things. But
often it's a sign of bad programming, micro-optimisation which impacts the
maintainability of the code. Most IT projects don't fail because the
program fragments memory too much. They fail because the interactions between
the various components get too complicated for the programmers to understand,
additional development causes unexpected bugs elsewhere, and becomes too
expensive and error-prone to be viable.

Sure, but (a) I think this is a case where you want to do this sort of
thing, and (b) I don't think using the space beyond the declared members
is, itself, a source of complexity. The way it is accessed might
be (and there I think a flexible array member is a great help) but the
alternative -- usually to allocate a separate area -- also adds some
complexity to the code.

James Kuyper · Feb 19, 2014

However, since I don't believe in catering to old standards (C99 is
already 14 years old), I would favor taking advantage of C99's concept
of flexible array members.

[...]

I don't believe that's an option in this case. The Lua
implementation apparently is very carefully written to conform to
the C89/C90 standard (and also to compile as C++) to maximize the
number of compilers that can be used to compile it. It's optimized
for portability over modernity.

It might not be an option for Lua code; but janus was asking about the
reasons for their approach. If he ever needs to do something like this
in a context where compatibility with C90 is not an issue, he should
consider the benefits of using the flexible array member approach.
Instead of accessing the array through a cast that is not type-safe, he
can access it through a named struct member of a specific declared type,
which seems much safer to me.
