Strings in programming languages based on C

janus

Hello All,

I found the below string design pattern a bit too hard to absorb. I just noticed that same pattern was used by two different languages. Now, I am thinking why can't they use something like this:

ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts.string, str, l*sizeof(char));
ts.string[l] = '\0'; /* ending 0 */

Instead of the below, and what is the advantage of this pattern.
ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
ts->tsv.len = l;
ts->tsv.hash = h;
ts->tsv.reserved = 0;
memcpy(ts+1, str, l*sizeof(char));
((char *)(ts+1))[l] = '\0'; /* ending 0 */

Regards, Janus
 
James Kuyper

janus said:
Hello All,

I found the below string design pattern a bit too hard to absorb. I just noticed that same pattern was used by two different languages. Now, I am thinking why can't they use something like this:

I was starting to write up an answer that actually addressed your
question, when I realized that there was something odd about your code:

....
ts->tsv.reserved = 0;

In order for ts->tsv to be a valid expression, ts must be a pointer to
an object of struct or union type.

memcpy(ts.string, str, l*sizeof(char));

In order for ts.string to be a valid expression, ts must be an lvalue of
struct or union type. Which line is correct, and what should the
corrected version of the other line look like?


This isn't just a matter of nit-picking. The answer I was putting
together depends upon some assumptions about what "ts" is, and the
answer is different in those two cases.
 
janus

James Kuyper said:
I was starting to write up an answer that actually addressed your
question, when I realized that there was something odd about your code:

...

In order for ts->tsv to be a valid expression, ts must be a pointer to
an object of struct or union type.

In order for ts.string to be a valid expression, ts must be an lvalue of
struct or union type. Which line is correct, and what should the
corrected version of the other line look like?

This isn't just a matter of nit-picking. The answer I was putting
together depends upon some assumptions about what "ts" is, and the
answer is different in those two cases.



James,

Everything is below.. It is actually copied from Lua language.


TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
  GCObject *o;
  unsigned int h = cast(unsigned int, l);  /* seed */
  size_t step = (l>>5)+1;  /* if string is too long, don't hash all its chars */
  size_t l1;
  for (l1=l; l1>=step; l1-=step)  /* compute hash */
    h = h ^ ((h<<5)+(h>>2)+cast(unsigned char, str[l1-1]));
  for (o = G(L)->strt.hash[lmod(h, G(L)->strt.size)];
       o != NULL;
       o = gch(o)->next) {
    TString *ts = rawgco2ts(o);
    if (h == ts->tsv.hash &&
        ts->tsv.len == l &&
        (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
      if (isdead(G(L), o))  /* string is dead (but was not collected yet)? */
        changewhite(o);  /* resurrect it */
      return ts;


typedef union TString {
  L_Umaxalign dummy;  /* ensures maximum alignment for strings */
  struct {
    CommonHeader;
    lu_byte reserved;
    unsigned int hash;
    size_t len;  /* number of characters in string */
  } tsv;
} TString;
 
jacob navia

On 17/02/2014 05:17, janus wrote:


Strings are defined as a structure followed by the actual characters.
The expression (ts+1) makes a pointer to that region immediately
following the structure.

In C, these kinds of structures have been recognized by the standard
since C99, 15 years ago. You declare a flexible structure like that as
follows:

struct string {
    int a;
    int b;
    int c;         // Fixed fields
    char string[]; // Variable field
};

Then you can write your code as you propose. But if you do not want C99,
you write it using a pointer and casting, making the whole thing
completely fucked up.
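
As a minimal sketch of the C99 approach described above, assuming a made-up `struct fstring` and helper `fstring_new` (neither appears in the Lua sources), the header and the characters can share one allocation while the code still writes through a named member:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string type: a fixed header followed by the characters. */
struct fstring {
    size_t len;        /* number of characters, excluding the terminator */
    unsigned int hash; /* placeholder for a precomputed hash value */
    char string[];     /* C99 flexible array member; must be the last member */
};

/* Allocate header and characters in one block. Returns NULL on failure. */
struct fstring *fstring_new(const char *str, size_t l, unsigned int h)
{
    struct fstring *fs;

    assert(str != NULL);
    fs = malloc(sizeof *fs + l + 1); /* +1 for the ending 0 */
    if (fs == NULL)
        return NULL;
    fs->len = l;
    fs->hash = h;
    memcpy(fs->string, str, l);
    fs->string[l] = '\0'; /* ending 0 */
    return fs;
}
```

With a layout like this, an expression in the style of `ts->string[l] = '\0'` is valid, and a single `free()` releases both the header and the characters. It does require a C99 compiler, which is presumably why a code base targeting ANSI C would avoid it.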
 
James Kuyper

....
Everything is below.. It is actually copied from Lua language.

The code you posted looks a lot like C, but it clearly relies upon a
number of features of Lua that work differently from C. I don't know
Lua, I recognize those features only as things that look like errors
from a "C" perspective; you need a response from someone who knows
precisely how those features work.

You'll get better responses by asking your question in a forum
specializing in Lua. My news server doesn't list any newsgroups with
"lua" in their name, so you'll have to find some other kind of forum: a
chat room, a mailing list, a bulletin board, a Facebook page - but since
I know there's a fair number of people working with Lua, I'm sure you'll
be able to find one.

Also, the key difference between the code fragments you posted in your
original message was in the call to memcpy() and the line which
terminates the string, both of which are completely missing from this
message, which supposedly contains "Everything". What is the connection
between those code fragments and this piece of code? Don't tell me the
answer to that question - but when you find a suitable Lua forum, you
should make that connection clear to them.
 
BartC

James Kuyper said:
The code you posted looks a lot like C, but it clearly relies upon a
number of features of Lua that work differently from C. I don't know
Lua, I recognize those features only as things that look like errors
from a "C" perspective; you need a response from someone who knows
precisely how those features work.

Which bits of the code aren't valid C? Obviously there is a lot missing that
declares the identifiers, but it looks fine to me (except maybe the end of
the function appears to be missing). It's certainly not Lua anyway!
 
Ben Bacarisse

James Kuyper said:
The code you posted looks a lot like C, but it clearly relies upon a
number of features of Lua that work differently from C. I don't know
Lua, I recognize those features only as things that look like errors
from a "C" perspective; you need a response from someone who knows
precisely how those features work.

You'll get better responses by asking your question in a forum
specializing in Lua.

But the question was about C, not Lua. It was about why the code uses a
single data block with that string following the data that describes the
object, rather than using a separate member that points to the string
data. That's a C question, or at least a question about designing C
data structures.

The code also had what to a beginner would be a peculiar bit of C:

((char *)(ts + 1))[l] = 0; /* roughly, I don't recall exactly */

rather than using the more modern flexible array member syntax of C99
(as discussed by Jacob). That's about C too.

<snip>
 
James Kuyper

Which bits of the code aren't valid C? Obviously there is a lot missing that
declares the identifiers, but it looks fine to me (except maybe the end of
the function appears to be missing). It's certainly not Lua anyway!

As I pointed out in my first message, ts.string and ts+1 can't both be
valid C expressions using the same definition of ts. Also, ts+1 is used
in two places where (char*)ts+1 is what I would have expected in C code.
Without the (char*), interpreted as C code, it doesn't make much sense.
When I noticed that fact, I scrapped the answer I was writing, and
started asking questions instead.

Assuming that those were typos, the rest of it could be C code, but if
so, lots of additional explanation is needed. "Everything is below" is
far from true.

TString, GCObject, L_Umaxalign, and lu_byte could, in principle, be C
typedefs - but the types they are typedefs for would need to be defined
in order to answer any detailed questions about this code. It seems
likely to me that use of GCObject is somehow meant to automatically
invoke a garbage collection system, which is not a standard C feature.

The way "cast" is used suggests that it could be a keyword in some other
language. If this is C, "cast" must be the name of a function-like
macro, since a function cannot take a type name as one of its arguments.
If so, a definition for that macro is needed in order to clearly
understand this code. The "obvious" definition for that macro would be

#define cast(type, expression) ((type)(expression))

but I didn't want to automatically assume that something that stupid had
been done.

G(), isdead(), and changewhite() are used in this code without any
definition provided. The comments around the use of the latter two
functions suggest that they might be connected to the garbage collection.
 
Keith Thompson

James Kuyper said:
The code you posted looks a lot like C, but it clearly relies upon a
number of features of Lua that work differently from C. I don't know
Lua, I recognize those features only as things that look like errors
from a "C" perspective; you need a response from someone who knows
precisely how those features work.

I think janus means that the code is copied from the Lua
*implementation*, which is written in C. In fact, it's from
lstring.c in the Lua-5.2.0 sources (that function was simplified
in 5.2.1). It depends on some declarations that weren't shown,
but apart from that it appears to be standard C. (Lua source code
is different enough from C that one is not likely to be mistaken
for the other.)

According to http://www.lua.org/download.html :

Lua is implemented in pure ANSI C and compiles unmodified in
all platforms that have an ANSI C compiler. Lua also compiles
cleanly as C++.

[...]
 
Keith Thompson

Ken Brody said:
On 2/16/2014 11:17 PM, janus wrote:
[...]
memcpy(ts+1, str, l*sizeof(char));
[...]

Nit: "sizeof(char)" is guaranteed to be 1.

6.5.3.4p3:
When applied to an operand that has type char, unsigned char, or
signed char, (or a qualified version thereof) the result is 1.

The Lua implementation source code itself uses sizeof(char). Yes, it's
redundant, but it's not the OP's mistake.
 
Kenny McCormack

Keith Thompson said:
Lua is implemented in pure ANSI C and compiles unmodified in
all platforms that have an ANSI C compiler. Lua also compiles
cleanly as C++.

I assume this means that Lua programs have no access to system-specific
functions (or graphics). Which makes it essentially useless as a
programming environment.

I knew there was a reason I never got around to learning it.
 
James Kuyper

James Kuyper said:
The code you posted looks a lot like C, but it clearly relies upon a
number of features of Lua that work differently from C. I don't know
Lua, I recognize those features only as things that look like errors
from a "C" perspective; you need a response from someone who knows
precisely how those features work.

You'll get better responses by asking your question in a forum
specializing in Lua.

But the question was about C, not Lua. It was about why the code uses a
single data block with that string following the data that describes the
object, rather than using a separate member that points to the string
data. That's a C question, or at least a question about designing C
data structures.

The code also had what to a beginner would be a peculiar bit of C:

((char *)(ts + 1))[l] = 0; /* roughly, I don't recall exactly */

The original code used '\0', rather than 0 - otherwise, you recall
correctly.

rather than using the more modern flexible array member syntax of C99
(as discussed by Jacob). That's about C too.

In context, it looks pretty peculiar to me, too, and I'm pretty far from
being a beginner. As an expert C programmer, I can imagine a less
experienced programmer writing

((char*)ts+1)[l] = '\0';

I would strongly recommend against writing code that way rather than the
alternative that was also given:

ts->string[l] = '\0';

but there's at least a chance that those two alternatives do the same
thing, which is not the case for what was actually given:

((char *)(ts + 1))[l] = '\0';
 
James Kuyper

On 02/17/2014 11:59 AM, Keith Thompson wrote:
....
I think janus means that the code is copied from the Lua
*implementation*, which is written in C. ....
According to http://www.lua.org/download.html :

Lua is implemented in pure ANSI C and compiles unmodified in
all platforms that have an ANSI C compiler. Lua also compiles
cleanly as C++.

That can't be the case for the original message, which contained
ts->tsv, ts.string, and ts+1 in a context where (char*)ts+1 seems more
plausible. If the original compiled cleanly and performed correctly,
then the copying must have been done manually, with transcription errors.
 
Keith Thompson

James Kuyper said:
On 02/17/2014 11:59 AM, Keith Thompson wrote:
...

That can't be the case for the original message, which contained
ts->tsv, ts.string, and ts+1 in a context where (char*)ts+1 seems more
plausible. If the original compiled cleanly and performed correctly,
then the copying must have been done manually, with transcription errors.

Here's janus's message at the top of this thread:

| I found the below string design pattern a bit too hard to absorb. I just
| noticed that same pattern was used by two different languages. Now, I am
| thinking why can't they use something like this:
|
| ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
| ts->tsv.len = l;
| ts->tsv.hash = h;
| ts->tsv.reserved = 0;
| memcpy(ts.string, str, l*sizeof(char));
| ts.string[l] = '\0'; /* ending 0 */
|
| Instead of the below, and what is the advantage of this pattern.
| ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
| ts->tsv.len = l;
| ts->tsv.hash = h;
| ts->tsv.reserved = 0;
| memcpy(ts+1, str, l*sizeof(char));
| ((char *)(ts+1))[l] = '\0'; /* ending 0 */

The second chunk of code is quoted from the Lua sources, and is valid C
(given the missing declarations). The first chunk is janus's suggested
replacement, and that code is buggy. (And as I mentioned elsethread,
the superfluous `*sizeof(char)` is in the Lua source code.)

I haven't studied it closely enough to construct a plausible replacement
that fixes the bugs while preserving janus's ideas.
 
James Kuyper

On 02/17/2014 01:59 PM, Keith Thompson wrote:
....
On 02/17/2014 11:59 AM, Keith Thompson wrote:
Here's janus's message at the top of this thread: ....
| ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
| ts->tsv.len = l;
| ts->tsv.hash = h;
| ts->tsv.reserved = 0;
| memcpy(ts+1, str, l*sizeof(char));
| ((char *)(ts+1))[l] = '\0'; /* ending 0 */

The second chunk of code is quoted from the Lua sources, and is valid C
(given the missing declarations).

So, do you think ts+1 points at the location intended by that code?
 
Ben Bacarisse

James Kuyper said:
In context, it looks pretty peculiar to me, too, and I'm pretty far from
being a beginner. As an expert C programmer, I can imagine a less
experienced programmer writing

((char*)ts+1)[l] = '\0';

I'm getting confused. This is not a correct alternative unless there is
something very odd going on.

I would strongly recommend against writing code that way rather than the
alternative that was also given:

ts->string[l] = '\0';

There are three possible situations here and they are all less than
ideal:

1) string is a C99 flexible array member, but Lua is supposed to be
written in ANSI C.

2) string is a char * member, in which case an extra allocation is called
for (or some horrid hack to point it just after the "header" struct).

3) string is a 1-character char array -- the old ANSI C hack to do what
C99 flexible array members do better. You have to keep remembering
to adjust the allocation size.

but there's at least a chance that those two alternatives do the same
thing, which is not the case for what was actually given:

((char *)(ts + 1))[l] = '\0';

If "those two alternatives" refer to the first two statements quoted,
then it is very hard to see them doing the same thing.

This last alternative is a common idiom in ANSI C when the 1-char array
option is eschewed (and some people, like me, always hated it). It's
not identical, because it can waste padding at the end of the struct,
but the allocation is simple: malloc(sizeof *ts + string_size).
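
To illustrate the padding remark, here is a small sketch with two hypothetical structs, `struct hdr` and `struct hdr_fam` (invented for this example, not taken from Lua). It stores the characters at `(char *)(h + 1)` and reports how many bytes that skips compared with where a flexible array member would start. The actual numbers depend on the platform's ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header; a char after a size_t usually leaves trailing padding. */
struct hdr {
    size_t len;
    char tag;
};

/* Same fields plus a C99 flexible array member, for comparison only. */
struct hdr_fam {
    size_t len;
    char tag;
    char data[];
};

/* ANSI C idiom: the characters start right after the whole header. */
char *hdr_chars(struct hdr *h)
{
    return (char *)(h + 1);
}

/* One allocation for header plus characters. Returns NULL on failure. */
struct hdr *hdr_new(const char *str, size_t n)
{
    struct hdr *h;

    assert(str != NULL);
    h = malloc(sizeof *h + n + 1); /* +1 for the ending 0 */
    if (h == NULL)
        return NULL;
    h->len = n;
    h->tag = 0;
    memcpy(hdr_chars(h), str, n);
    hdr_chars(h)[n] = '\0';
    return h;
}

/* Bytes between where a flexible array member would begin and where
   (h + 1) points; typically 7 on a 64-bit ABI, but not guaranteed. */
size_t hdr_wasted_padding(void)
{
    return sizeof(struct hdr) - offsetof(struct hdr_fam, data);
}
```

The trade-off is the one stated above: the size computation is just `sizeof *h` plus the string size, at the cost of whatever trailing padding the header happens to have.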
 
Keith Thompson

James Kuyper said:
On 02/17/2014 01:59 PM, Keith Thompson wrote:
...
On 02/17/2014 11:59 AM, Keith Thompson wrote:
Here's janus's message at the top of this thread: ...
| ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
| ts->tsv.len = l;
| ts->tsv.hash = h;
| ts->tsv.reserved = 0;
| memcpy(ts+1, str, l*sizeof(char));
| ((char *)(ts+1))[l] = '\0'; /* ending 0 */

The second chunk of code is quoted from the Lua sources, and is valid C
(given the missing declarations).

So, do you think ts+1 points at the location intended by that code?

Probably. ts is a pointer to a TString, which is a union type.
I'd have to explore the code more to be sure of what's going on.
 
Alain Ketterlin

James Kuyper said:
On 02/17/2014 09:23 AM, Ben Bacarisse wrote:
The code also had what to a beginner would be a peculiar bit of C:

((char *)(ts + 1))[l] = 0; /* roughly, I don't recall exactly */

The original code used '\0', rather than 0 - otherwise, you recall
correctly.

rather than using the more modern flexible array member syntax of C99
(as discussed by Jacob). That's about C too.

In context, it looks pretty peculiar to me, too, and I'm pretty far from
being a beginner. As an expert C programmer, I can imagine a less
experienced programmer writing

((char*)ts+1)[l] = '\0';

I would strongly recommend against writing code that way

The original code is at http://www.lua.org/source/5.2/lstring.c.html (in
function newlstr).

rather than the alternative that was also given:

ts->string[l] = '\0';

The "string" field is janus' invention/speculation. There is no such
field in the original structure (nor is there a flexible array member).
These guys are just piling data after the structure.

but there's at least a chance that those two alternatives do the same
thing, which is not the case for what was actually given:

((char *)(ts + 1))[l] = '\0';

I think they really mean to place data after sizeof(TString). The whole
region is allocated with a size of:

totalsize = sizeof(TString) + ((l + 1) * sizeof(char));

(their code).
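
A simplified sketch of that layout, assuming hypothetical names `MaxAlign`, `MyString`, `my_getstr` and `my_newlstr`, and using plain `malloc` in place of Lua's collector-aware allocator, shows why `ts + 1` lands on the first character: pointer arithmetic on a `MyString *` advances by `sizeof(MyString)` bytes, which is exactly the header part of `totalsize`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an alignment union such as L_Umaxalign. */
typedef union {
    double d;
    void *p;
    long l;
} MaxAlign;

/* Simplified header modeled on the TString union quoted in this thread;
   the garbage-collector fields are left out. */
typedef union MyString {
    MaxAlign dummy; /* forces the union to the strictest listed alignment */
    struct {
        unsigned char reserved;
        unsigned int hash;
        size_t len; /* number of characters in string */
    } tsv;
} MyString;

/* The characters live immediately after the header. */
#define my_getstr(ts) ((char *)((ts) + 1))

/* Allocate header and characters together. Returns NULL on failure. */
MyString *my_newlstr(const char *str, size_t l, unsigned int h)
{
    size_t totalsize = sizeof(MyString) + ((l + 1) * sizeof(char));
    MyString *ts;

    assert(str != NULL);
    ts = malloc(totalsize);
    if (ts == NULL)
        return NULL;
    ts->tsv.len = l;
    ts->tsv.hash = h;
    ts->tsv.reserved = 0;
    memcpy(ts + 1, str, l * sizeof(char));
    ((char *)(ts + 1))[l] = '\0'; /* ending 0 */
    return ts;
}
```

Because `ts` has type `MyString *` rather than `char *`, `ts + 1` and `(char *)ts + 1` are different addresses; only the former skips the entire header.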

-- Alain.
 
James Kuyper

James Kuyper said:
In context, it looks pretty peculiar to me, too, and I'm pretty far from
being a beginner. As an expert C programmer, I can imagine a less
experienced programmer writing

((char*)ts+1)[l] = '\0';

I'm getting confused. This is not a correct alternative unless there is
something very odd going on.

I would strongly recommend against writing code that way rather than the
alternative that was also given:

ts->string[l] = '\0';

There are three possible situations here and they are all less than
ideal:

1) string is a C99 flexible array member, but Lua is supposed to be
written in ANSI C.

2) string is a char * member, in which case an extra allocation is called
for (or some horrid hack to point it just after the "header" struct).

3) string is a 1-character char array -- the old ANSI C hack to do what
C99 flexible array members do better. You have to keep remembering
to adjust the allocation size.

but there's at least a chance that those two alternatives do the same
thing, which is not the case for what was actually given:

((char *)(ts + 1))[l] = '\0';

If "those two alternatives" refer to the first two statements quoted,

It does.

then it is very hard to see them doing the same thing.

They will do the same thing if the first member of the struct is exactly
one byte long, and there is no padding between that member and the one
named "string". That's very unlikely to be the case, but if it happened
to be the case, code written to rely upon that fact would work. I
believe some languages actually use a similar layout for their built-in
strings, with a one-byte size followed by the contents of the string.
The expression (char*)ts+1 would be unnecessarily more sensitive to
modifications in the layout than ts->string, which is why I denigrated
that option - but it could work.

This last alternative is a common idiom in ANSI C when the 1-char array
option is eschewed (and some people, like me, always hated it). It's
not identical, because it can waste padding at the end of the struct,
but the allocation is simple: malloc(sizeof *ts + string_size).

I've never used that approach. I've written code that uses the struct
hack, and code which uses C99 flexible array members (which I greatly
prefer). In both of those cases, ts->string would have been preferable
to (char*)(ts+1), which is why I didn't consider that possibility.
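
The condition described earlier in this post, a one-byte first member with no padding before "string", can be checked directly. The sketch below uses two hypothetical layouts, `struct short_str` and `struct wide_str` (invented for illustration), and compares `(char *)p + 1` with `p->string`. Whether padding is inserted is up to the implementation, so the result for the first layout is typical rather than guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical length-prefixed layout: the first member is one byte long. */
struct short_str {
    unsigned char len;
    char string[1]; /* struct-hack style trailing array */
};

/* Hypothetical layout with a wider first member. */
struct wide_str {
    size_t len;
    char string[1];
};

/* Returns 1 if (char *)p + 1 and p->string refer to the same byte. */
int short_coincides(struct short_str *p)
{
    assert(p != NULL);
    return (char *)p + 1 == p->string;
}

/* Same comparison for the wider layout. */
int wide_coincides(struct wide_str *p)
{
    assert(p != NULL);
    return (char *)p + 1 == p->string;
}
```

On common ABIs `short_coincides` returns 1 and `wide_coincides` returns 0, which matches the point that `(char*)ts+1` is far more sensitive to layout changes than `ts->string`.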
 
