String in programming languages that are based off C

Discussion in 'C Programming' started by janus, Feb 17, 2014.

  1. janus

    janus Guest

    Hello All,

    I found the below string design pattern a bit too hard to absorb. I just noticed that same pattern was used by two different languages. Now, I am thinking why can't they use something like this:

    ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
    ts->tsv.len = l;
    ts->tsv.hash = h;
    ts->tsv.reserved = 0;
    memcpy(ts.string, str, l*sizeof(char));
    ts.string[l] = '\0'; /* ending 0 */

    Instead of the below, and what is the advantage of this pattern.
    ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
    ts->tsv.len = l;
    ts->tsv.hash = h;
    ts->tsv.reserved = 0;
    memcpy(ts+1, str, l*sizeof(char));
    ((char *)(ts+1))[l] = '\0'; /* ending 0 */

    Regards, Janus
     
    janus, Feb 17, 2014
    #1

  2. janus

    James Kuyper Guest

    I was starting to write up an answer that actually addressed your
    question, when I realized that there was something odd about your code:

    ....
    In order for ts->tsv to be a valid expression, ts must be a pointer to
    an object of struct or union type.
    In order for ts.string to be a valid expression, ts must be an lvalue of
    struct or union type. Which line is correct, and what should the
    corrected version of the other line look like?


    This isn't just a matter of nit-picking. The answer I was putting
    together depends upon some assumptions about what "ts" is, and the
    answer is different in those two cases.
     
    James Kuyper, Feb 17, 2014
    #2

  3. janus

    janus Guest




    James,

    Everything is below.. It is actually copied from Lua language.


    TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
      GCObject *o;
      unsigned int h = cast(unsigned int, l);  /* seed */
      size_t step = (l>>5)+1;  /* if string is too long, don't hash all its chars */
      size_t l1;
      for (l1=l; l1>=step; l1-=step)  /* compute hash */
        h = h ^ ((h<<5)+(h>>2)+cast(unsigned char, str[l1-1]));
      for (o = G(L)->strt.hash[lmod(h, G(L)->strt.size)];
           o != NULL;
           o = gch(o)->next) {
        TString *ts = rawgco2ts(o);
        if (h == ts->tsv.hash &&
            ts->tsv.len == l &&
            (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
          if (isdead(G(L), o))  /* string is dead (but was not collected yet)? */
            changewhite(o);  /* resurrect it */
          return ts;


    typedef union TString {
      L_Umaxalign dummy;  /* ensures maximum alignment for strings */
      struct {
        CommonHeader;
        lu_byte reserved;
        unsigned int hash;
        size_t len;  /* number of characters in string */
      } tsv;
    } TString;
     
    janus, Feb 17, 2014
    #3
  4. janus

    jacob navia Guest

    On 17/02/2014 05:17, janus wrote:


    Strings are defined as a structure followed by the actual characters.
    The expression (ts+1) makes a pointer to that region immediately
    following the structure.

    In C, this kind of structure has been recognized by the standard since
    C99, 15 years ago. You declare such a flexible structure as follows:
    struct string {
        int a;
        int b;
        int c;          // Fixed fields
        char string[];  // Variable field
    };

    Then you can write your code as you propose. But if you do not want C99,
    you write it using a pointer and casting, making the whole thing
    completely fucked up.
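
As a minimal sketch of the C99 approach described above (not taken from any real code base; the type `fam_string`, its fields, and the helper `fam_string_new` are hypothetical names chosen for illustration), the header and the characters can be allocated as one block and the characters reached through a named member:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header-plus-characters string; the names are illustrative. */
struct fam_string {
    size_t len;     /* number of characters, excluding the terminator */
    char   data[];  /* C99 flexible array member: storage follows the header */
};

/* Allocate the header and the characters as one block.
 * Returns NULL if the allocation fails. */
static struct fam_string *fam_string_new(const char *src, size_t len)
{
    /* sizeof *s does not count the flexible member, so add len + 1. */
    struct fam_string *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->len = len;
    memcpy(s->data, src, len);
    s->data[len] = '\0';    /* ending 0 */
    return s;
}
```

A caller would write something like `struct fam_string *s = fam_string_new("hello", 5);` and release the whole object with a single `free(s)`.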
     
    jacob navia, Feb 17, 2014
    #4
  5. janus

    James Kuyper Guest

    The code you posted looks a lot like C, but it clearly relies upon a
    number of features of Lua that work differently from C. I don't know
    Lua, I recognize those features only as things that look like errors
    from a "C" perspective; you need a response from someone who knows
    precisely how those features work.

    You'll get better responses by asking your question in a forum
    specializing in Lua. My news server doesn't list any newsgroups with
    "lua" in their name, so you'll have to find some other kind of forum: a
    chat room, a mailing list, a bulletin board, a Facebook page - but since
    I know there's a fair number of people working with Lua, I'm sure you'll be
    able to find one.

    Also, the key difference between the code fragments you posted in your
    original message was in the call to memcpy() and the line which
    terminates the string, both of which are completely missing from this
    message, which supposedly contains "Everything". What is the connection
    between those code fragments and this piece of code? Don't tell me the
    answer to that question - but when you find a suitable Lua forum, you
    should make that connection clear to them.
     
    James Kuyper, Feb 17, 2014
    #5
  6. janus

    BartC Guest

    Which bits of the code aren't valid C? Obviously there is a lot missing that
    declares the identifiers, but it looks fine to me (except maybe the end of
    the function appears to be missing). It's certainly not Lua anyway!
     
    BartC, Feb 17, 2014
    #6
  7. But the question was about C, not Lua. It was about why the code uses a
    single data block with that string following the data that describes the
    object, rather than using a separate member that points to the string
    data. That's a C question, or at least a question about designing C
    data structures.

    The code also had what to a beginner would be a peculiar bit of C:

    ((char *)(ts + 1))[l] = 0; /* roughly, I don't recall exactly */

    rather than using the more modern flexible array member syntax of C99
    (as discussed by Jacob). That's about C too.

    <snip>
     
    Ben Bacarisse, Feb 17, 2014
    #7
  8. janus

    Ken Brody Guest

    On 2/16/2014 11:17 PM, janus wrote:
    [...]
    [...]

    Nit: "sizeof(char)" is guaranteed to be 1.

    6.5.3.4p3:
     
    Ken Brody, Feb 17, 2014
    #8
  9. janus

    James Kuyper Guest

    As I pointed out in my first message, ts.string and ts+1 can't both be
    valid C expressions using the same definition of ts. Also, ts+1 is used
    in two places where (char*)ts+1 is what I would have expected in C code.
    Without the (char*), interpreted as C code, it doesn't make much sense.
    When I noticed that fact, I scrapped the answer I was writing, and
    started asking questions instead.

    Assuming that those were typos, the rest of it could be C code, but if
    so, lots of additional explanation is needed. "Everything is below" is
    far from true.

    TString, GCObject, L_Umaxalign, and lu_byte could, in principle, be C
    typedefs - but the types they are typedefs for would need to be defined
    in order to answer any detailed questions about this code. It seems
    likely to me that use of GCObject is somehow meant to automatically
    invoke a garbage collection system, which is not a standard C feature.

    The way "cast" is used suggests that it could be a keyword in some other
    language. If this is C, "cast" must be the name of a function-like
    macro, since a function cannot take a type name as one of its arguments.
    If so, a definition for that macro is needed in order to clearly
    understand this code. The "obvious" definition for that macro would be

    #define cast(type, expression) ((type)(expression))

    but I didn't want to automatically assume that something that stupid had
    been done.
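
For illustration only, here is how a function-like macro of that shape behaves. The macro body is the "obvious" definition quoted in the post; whether Lua's headers define it exactly this way is not shown in this thread, and the helper `byte_value` is a hypothetical name.

```c
#include <assert.h>

/* The "obvious" definition from the post; assumed, not confirmed here. */
#define cast(type, expression) ((type)(expression))

/* Hypothetical helper: convert a possibly signed char to a non-negative
 * value, which is what the hash loop needs before mixing in a character. */
static unsigned int byte_value(char c)
{
    return cast(unsigned char, c);
}
```

A macro can accept a type name as an argument because the preprocessor only substitutes text; `cast(int, 3.9)` simply expands to `((int)(3.9))`.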

    G(), isdead(), and changewhite() are used in this code without any
    definition provided. The comments around the use of the latter two
    functions suggest that they might be connected to the garbage collection.
     
    James Kuyper, Feb 17, 2014
    #9
  10. I think janus means that the code is copied from the Lua
    *implementation*, which is written in C. In fact, it's from
    lstring.c in the Lua-5.2.0 sources (that function was simplified
    in 5.2.1). It depends on some declarations that weren't shown,
    but apart from that it appears to be standard C. (Lua source code
    is different enough from C that one is not likely to be mistaken
    for the other.)

    According to http://www.lua.org/download.html :

    Lua is implemented in pure ANSI C and compiles unmodified in
    all platforms that have an ANSI C compiler. Lua also compiles
    cleanly as C++.

    [...]
     
    Keith Thompson, Feb 17, 2014
    #10
  11. The Lua implementation source code itself uses sizeof(char). Yes, it's
    redundant, but it's not the OP's mistake.
     
    Keith Thompson, Feb 17, 2014
    #11
  12. I assume this means that Lua programs have no access to system-specific
    functions (or graphics). Which makes it essentially useless as a
    programming environment.

    I knew there was a reason I never got around to learning it.
     
    Kenny McCormack, Feb 17, 2014
    #12
  13. janus

    James Kuyper Guest

    The original code used '\0', rather than 0 - otherwise, you recall
    correctly.
    In context, it looks pretty peculiar to me, too, and I'm pretty far from
    being a beginner. As an expert C programmer, I can imagine a less
    experienced programmer writing

    ((char*)ts+1)[l] = '\0';

    I would strongly recommend against writing code that way rather than the
    alternative that was also given:

    ts->string[l] = '\0';

    but there's at least a chance that those two alternatives do the same
    thing, which is not the case for what was actually given:

    ((char *)(ts + 1))[l] = '\0';
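
To make the difference between the two casts concrete, here is a small sketch. The struct `header` is a made-up stand-in, not Lua's TString, and both function names are hypothetical. Casting first and then adding 1 advances by one byte; adding 1 first and then casting advances by the size of the whole header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header type; the fields are invented for this sketch. */
struct header {
    size_t len;
    unsigned int hash;
};

/* Byte distance from the start of the header to (char *)h + 1. */
static size_t offset_cast_then_add(struct header *h)
{
    return (size_t)(((char *)h + 1) - (char *)h);
}

/* Byte distance from the start of the header to (char *)(h + 1). */
static size_t offset_add_then_cast(struct header *h)
{
    return (size_t)((char *)(h + 1) - (char *)h);
}
```

The first function always returns 1; the second returns `sizeof(struct header)`, so only the second form points past the header.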
     
    James Kuyper, Feb 17, 2014
    #13
  14. janus

    James Kuyper Guest

    On 02/17/2014 11:59 AM, Keith Thompson wrote:
    ....
    That can't be the case for the original message, which contained
    ts->tsv, ts.string, and ts+1 in a context where (char*)ts+1 seems more
    plausible. If the original compiled cleanly and performed correctly,
    then the copying must have been done manually, with transcription errors.
     
    James Kuyper, Feb 17, 2014
    #14
  15. Here's janus's message at the top of this thread:

    | I found the below string design pattern a bit too hard to absorb. I just
    | noticed that same pattern was used by two different languages. Now, I am
    | thinking why can't they use something like this:
    |
    | ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
    | ts->tsv.len = l;
    | ts->tsv.hash = h;
    | ts->tsv.reserved = 0;
    | memcpy(ts.string, str, l*sizeof(char));
    | ts.string[l] = '\0'; /* ending 0 */
    |
    | Instead of the below, and what is the advantage of this pattern.
    | ts = &luaC_newobj(L, LUA_TSTRING, totalsize, list, 0)->ts;
    | ts->tsv.len = l;
    | ts->tsv.hash = h;
    | ts->tsv.reserved = 0;
    | memcpy(ts+1, str, l*sizeof(char));
    | ((char *)(ts+1))[l] = '\0'; /* ending 0 */

    The second chunk of code is quoted from the Lua sources, and is valid C
    (given the missing declarations). The first chunk is janus's suggested
    replacement, and that code is buggy. (And as I mentioned elsethread,
    the superfluous `*sizeof(char)` is in the Lua source code.)

    I haven't studied it closely enough to construct a plausible replacement
    that fixes the bugs while preserving janus's ideas.
     
    Keith Thompson, Feb 17, 2014
    #15
  16. janus

    James Kuyper Guest

    On 02/17/2014 01:59 PM, Keith Thompson wrote:
    ....
    So, do you think ts+1 points at the location intended by that code?
     
    James Kuyper, Feb 17, 2014
    #16
  17. I'm getting confused. This is not a correct alternative unless there is
    something very odd going on.
    There are three possible situations here and they are all less than
    ideal:

    1) string is a C99 flexible array member, but Lua is supposed to be
    written in ANSI C.

    2) string is char * member, in which case an extra allocation is called
    for (or some horrid hack to point it just after the "header" struct).

    3) string is a 1-character char array -- the old ANSI C hack to do what
    C99 flexible array members do better. You have to keep remembering
    to adjust the allocation size.
    If "those two alternatives" refer to the first two statements quoted,
    then it is very hard to see them doing the same thing.

    This last alternative is a common idiom in ANSI C when the 1-char array
    option is eschewed (and some people, like me, always hated it). It's
    not identical, because it can waste padding at the end of the struct,
    but the allocation is simple: malloc(sizeof *ts + string_size).
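
Situations 2 and 3 from the list above can be sketched as follows. All type and function names here are hypothetical, and the one-element-array version relies on the traditional "struct hack", which is widely supported but which the standard does not strictly guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Situation 2: a pointer member, so the characters need their own block. */
struct ptr_string {
    size_t len;
    char  *string;
};

/* Situation 3: the pre-C99 struct hack with a one-element array. */
struct hack_string {
    size_t len;
    char   string[1];
};

static struct ptr_string *ptr_string_new(const char *src, size_t len)
{
    struct ptr_string *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->string = malloc(len + 1);    /* the extra allocation */
    if (s->string == NULL) {
        free(s);
        return NULL;
    }
    s->len = len;
    memcpy(s->string, src, len);
    s->string[len] = '\0';
    return s;
}

static struct hack_string *hack_string_new(const char *src, size_t len)
{
    /* sizeof *s already counts one char, so add len rather than len + 1.
     * This is the size adjustment that is easy to get wrong. */
    struct hack_string *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    memcpy(s->string, src, len);    /* deliberately runs past string[1] */
    s->string[len] = '\0';
    return s;
}
```

The pointer version needs two calls to `free` (first `s->string`, then `s`), while the struct hack version is released with one.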
     
    Ben Bacarisse, Feb 17, 2014
    #17
  18. Probably. ts is a pointer to a TString, which is a union type.
    I'd have to explore the code more to be sure of what's going on.
     
    Keith Thompson, Feb 17, 2014
    #18
  19. The original code is at http://www.lua.org/source/5.2/lstring.c.html (in
    function newlstr)
    The "string" field is janus' invention/speculation. There is no such
    field in the original structure (nor is there a flexible array member).
    These guys are just piling data after the structure.
    I think they really mean to place data after sizeof(TString). The whole
    region is allocated with a size of:

    totalsize = sizeof(TString) + ((l + 1) * sizeof(char));

    (their code).
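
A stripped-down sketch of that layout, assuming a much smaller header than the real TString: `MiniString`, `mini_getstr`, and `mini_new` are hypothetical names, `double` merely stands in for L_Umaxalign, and plain `malloc` replaces Lua's allocator and garbage collector.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for TString; the real header has more fields. */
typedef union MiniString {
    double dummy;               /* assumption: stands in for L_Umaxalign */
    struct {
        unsigned int hash;
        size_t len;             /* number of characters in string */
    } tsv;
} MiniString;

/* The characters live immediately after the header. */
#define mini_getstr(ts) ((char *)((ts) + 1))

static MiniString *mini_new(const char *str, size_t l)
{
    size_t totalsize = sizeof(MiniString) + (l + 1) * sizeof(char);
    MiniString *ts = malloc(totalsize);
    if (ts == NULL)
        return NULL;
    ts->tsv.len = l;
    ts->tsv.hash = 0;           /* hashing omitted in this sketch */
    memcpy(ts + 1, str, l * sizeof(char));
    mini_getstr(ts)[l] = '\0';  /* ending 0 */
    return ts;
}
```

Because `ts + 1` steps over the whole union, the characters start exactly `sizeof(MiniString)` bytes into the block, which matches the way `totalsize` is computed.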

    -- Alain.
     
    Alain Ketterlin, Feb 17, 2014
    #19
  20. janus

    James Kuyper Guest

    It does.
    They will do the same thing if the first member of the struct is exactly
    one byte long, and there is no padding between that member and the one
    named "string". That's very unlikely to be the case, but if it happened
    to be the case, code written to rely upon that fact would work. I
    believe some languages actually use a similar layout for their built-in
    strings, with a one-byte size followed by the contents of the string.
    The expression (char*)ts+1 would be unnecessarily more sensitive to
    modifications in the layout than ts->string, which is why I denigrated
    that option - but it could work.
    I've never used that approach. I've written code that uses the struct
    hack, and code which uses C99 flexible array members (which I greatly
    prefer). In both of those cases, ts->string would have been preferable
    to (char*)(ts+1), which is why I didn't consider that possibility.
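
The condition described above can be checked with a short sketch. Both struct layouts and both function names are invented for illustration: one has a one-byte length in front of the characters, the other a `size_t` length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical length-prefixed layout: one length byte, then characters. */
struct short_string {
    unsigned char len;
    char string[16];
};

/* A layout whose first member is normally wider than one byte. */
struct wide_string {
    size_t len;
    char string[16];
};

/* Nonzero when (char *)ts + 1 and ts->string name the same address. */
static int same_place_short(struct short_string *ts)
{
    return (char *)ts + 1 == ts->string;
}

static int same_place_wide(struct wide_string *ts)
{
    return (char *)ts + 1 == ts->string;
}
```

The two expressions coincide exactly when `offsetof` reports that `string` sits at byte 1, which is what one would expect for `short_string` on typical compilers but not for `wide_string`.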
     
    James Kuyper, Feb 17, 2014
    #20
