validcstring function

Discussion in 'C Programming' started by Malcolm McLean, Sep 28, 2013.

  1. Are all the i's dotted and t's crossed with this?



    #include <ctype.h>
    #include <string.h>

    /*
    test if a string is a valid C string - can it go in the place
    of a C string literal?
    */
    int validcstring(const char *str)
    {
      size_t len;
      size_t i, j;
      size_t start;
      size_t end;

      len = strlen(str);
      if(len < 2)
        return 0;
      /* skip leading whitespace */
      for(start=0; start < len; start++)
        if(!isspace( (unsigned char) str[start]))
          break;
      /* skip trailing whitespace */
      end = len;
      while(end--)
        if(!isspace((unsigned char) str[end]))
          break;
      /* both ends must be double quotes */
      if(start == end || str[start] != '\"' || str[end] != '\"')
        return 0;
      start++;
      end--;

      for(i=start;i<end;i++)
      {
        /* backslash followed by a recognised escape character */
        if(str[i] == '\\')
        {
          if(strchr("aftbrnvxo01234567\?\\\'\"", str[i+1]))
          {
            if(str[i+1] == 'o')
            {
              if(!isdigit((unsigned char) str[i+2]) ||
                 !isdigit((unsigned char) str[i+3]) ||
                 !isdigit((unsigned char) str[i+4]) )
                return 0;
            }
            else if(str[i+1] == 'x')
            {
              if(!isxdigit((unsigned char) str[i+2]) ||
                 !isxdigit((unsigned char) str[i+3]))
                return 0;
            }
            else
              i++;
            continue;
          }
        }

        /* an embedded quote must be followed by optional whitespace
           and another quote (adjacent literals) */
        if(str[i] == '\"')
        {
          for(j=i+1;j<end;j++)
            if(!isspace((unsigned char) str[j]))
              break;
          if(str[j] == '\"')
            i = j;
          else
            return 0;
        }
        if(!isgraph( (unsigned char) str[i]) && str[i] != ' ')
          return 0;
      }

      return 1;
    }
     
    Malcolm McLean, Sep 28, 2013
    #1

  2. Alan Curry Guest

    It accepts "\"

    Missing \u for Unicode

    What is \o supposed to be? I haven't heard of it, and gcc doesn't like it
    either.

    Is it right to reject a string containing an actual tab character instead of
    backslash t? gcc accepts these even in pedantic mode. But it also accepts
    most control characters so I guess you have to draw a line somewhere.
     
    Alan Curry, Sep 28, 2013
    #2

  3. C literals include L"...". Recent versions permit universal character
    names. You may have excluded these by design, but then the comment is
    misleading.
    It incorrectly accepts "\o123" whilst correctly rejecting "\o". Then
    again, it incorrectly accepts "\p". It incorrectly rejects "\x1x". It
    incorrectly accepts "\". It incorrectly accepts """" (Are you, perhaps,
    trying to implement the concatenation of string literals? If so the
    comment needs further adjusting.)

    It looks untested to me. The code is rather laboured -- by which I mean
    the structure of it does not reflect the structure in the spec. That
    might explain a bug or two.

    <snip code>
     
    Ben Bacarisse, Sep 28, 2013
    #3
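A minimal sketch of a checker whose structure follows the grammar in the spec, for comparison with the cases listed above. It is not the thread author's code. The names escape_length and is_single_literal are made up for illustration, and the sketch assumes exactly one unprefixed literal with no surrounding whitespace, no concatenation, no trigraphs and no range checks on hex or universal character name values.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the escape sequence starting at the backslash s[0],
   or 0 if it is not a well-formed escape sequence. */
static size_t escape_length(const char *s)
{
    size_t n;

    /* simple escapes; the s[1] test keeps strchr from matching the NUL */
    if (s[1] != '\0' && strchr("'\"?\\abfnrtv", s[1]))
        return 2;
    if (s[1] >= '0' && s[1] <= '7') {       /* octal: one to three digits */
        n = 2;
        while (n < 4 && s[n] >= '0' && s[n] <= '7')
            n++;
        return n;
    }
    if (s[1] == 'x') {                      /* hex: one or more digits */
        n = 2;
        while (isxdigit((unsigned char)s[n]))
            n++;
        return n > 2 ? n : 0;
    }
    if (s[1] == 'u' || s[1] == 'U') {       /* \uXXXX or \UXXXXXXXX */
        size_t want = (s[1] == 'u') ? 4 : 8;
        for (n = 0; n < want; n++)
            if (!isxdigit((unsigned char)s[2 + n]))
                return 0;
        return 2 + want;
    }
    return 0;                               /* e.g. \p, \o, lone backslash */
}

/* Returns 1 if str is exactly one unprefixed string literal, else 0. */
int is_single_literal(const char *str)
{
    size_t i = 0, n;

    if (str[i++] != '"')
        return 0;
    while (str[i] != '"') {
        if (str[i] == '\0' || str[i] == '\n')
            return 0;                       /* unterminated literal */
        if (str[i] == '\\') {
            n = escape_length(str + i);
            if (n == 0)
                return 0;
            i += n;
        } else {
            i++;
        }
    }
    return str[i + 1] == '\0';              /* nothing after the closing quote */
}
```

With this shape, "\" and "\p" and "\o123" are rejected because no escape rule matches, while "\x1x" is accepted as a one-digit hex escape followed by an ordinary x.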
  4. Yes, it started off as a trivial function testing for an opening and closing
    quote. But I realised that people might want constructs like

    char *str = "Fred "
    "Bloggs";

    (It's for an automatic code generator).
    The idea is that people write in their scripts
    <string> "This is a C literal\n" </string>
    or
    <string> This is embedded text
    with a newline </string>

    and it intelligently distinguishes between the two, escaping the embedded
    text.

    It sort of steadily ballooned until I wasn't sure I had something that actually
    met the spec or not.
     
    Malcolm McLean, Sep 28, 2013
    #4
  5. C permits other things between concatenated string literals. Most
    notably, comments.

    <snip>
     
    Ben Bacarisse, Sep 28, 2013
    #5
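If concatenation were to be supported, the gap between two adjacent literals would need to allow comments as well as whitespace. A rough sketch of such a gap skipper follows. The name skip_gap is hypothetical, and the sketch ignores backslash-newline line splicing and anything the preprocessor might do.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip whitespace, block comments and // comments starting at s.
   Returns a pointer to the first character that is none of these,
   or NULL if a block comment is never closed. */
const char *skip_gap(const char *s)
{
    for (;;) {
        while (isspace((unsigned char)*s))
            s++;
        if (s[0] == '/' && s[1] == '*') {
            s += 2;
            /* scan for the closing star-slash pair */
            while (s[0] != '\0' && !(s[0] == '*' && s[1] == '/'))
                s++;
            if (s[0] == '\0')
                return NULL;
            s += 2;
        } else if (s[0] == '/' && s[1] == '/') {
            /* line comment (C99 onwards) runs to the end of the line */
            while (*s != '\0' && *s != '\n')
                s++;
        } else {
            return s;
        }
    }
}
```

A caller would check that the returned pointer is at a double quote before treating what follows as the next literal.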
  6. That's a point.
    I don't want to throw half a C parser at this problem.
    But it's going to be very difficult for the user to enter newlines in
    single line strings if I don't allow escaped C syntax.
    He'll have to write

    <string>this is a line
    </string>
    with no trailing whitespace before the newline. That isn't really acceptable.
     
    Malcolm McLean, Sep 28, 2013
    #6
  7. You might consider throwing a *whole* C parser at the problem. Find an
    existing open-source C parser and modify it so it accepts string
    literals rather than translation units, and strip out everything that
    recognizes things that can't be part of a string literal.
     
    Keith Thompson, Sep 28, 2013
    #7
  8. It's for the Baby X resource compiler.

    It's a medium-weight program, it's got quite elaborate source to read in
    images in various formats, and I've pulled in the freetype library to
    rasterise fonts. I thought that adding strings would be a trivial bit
    of code, and it is pretty simple to write something that works acceptably.

    But ideally you want to support all string literal syntax. I haven't even
    decided what to do about Unicode yet - I decided it's too big a problem
    to handle for now, though I appreciate that by doing that I'm storing up
    compatibility issues for the future.

    So a full parser wouldn't be that unreasonable. I'm still a bit reluctant
    to complicate the source like that, however.
     
    Malcolm McLean, Sep 28, 2013
    #8
  9. BartC Guest

    Are the people writing the input to this program (which might include C
    literals) expert C coders?

    If not, then they probably won't know every feature to do with C string
    literals, so just document what is supported by your recogniser. Possibly
    even make up your own features (of how to deal with new lines for example)
    which can then be trivially converted to valid C. (Because the output needs
    to be valid C, but the input doesn't need to be, or it could be a subset.)
     
    BartC, Sep 28, 2013
    #9
  10. Baby X is a toolkit or widget library for the X window system.

    The Baby X resource compiler is going to be a companion program for
    embedding data into Baby X programs, but it will also be usable on
    its own. It takes a script file and outputs compilable C source.

    So I'd expect that most users would be pretty competent in C. Strings
    are a bit of an afterthought. But I think most users would expect
    a string type.
     
    Malcolm McLean, Sep 28, 2013
    #10
  11. BartC Guest

    OK, but ... the script file presumably has its own non-C syntax, so it isn't
    valid C either, so there's no reason for its string literals to be entirely
    C-compatible.

    Otherwise, it's been pointed out that verifying every possible input that can
    legally be a C string can be difficult. For example, users may want to
    construct some of their strings like this:

    STR(this_is_a_string)

    (which depends on a macro existing such as #define STR(a) #a), but you can't
    verify that without knowing the entire context in which that fragment is
    going to be compiled. Therefore you don't allow this, but that means you are
    creating a restriction. So just have a few more!
     
    BartC, Sep 29, 2013
    #11
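For reference, a small sketch of the stringizing macro mentioned above. The declaration and the function print_name are hypothetical. The point is that the fragment only becomes a string after preprocessing, so a checker that sees the script text alone cannot tell.

```c
#include <stdio.h>

/* The macro from the post: # turns the argument's tokens into a string literal. */
#define STR(a) #a

/* After preprocessing this reads: const char *name = "this_is_a_string"; */
const char *name = STR(this_is_a_string);

/* Hypothetical helper that prints the resulting string. */
void print_name(void)
{
    puts(name);
}
```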
  12. The script file is meant to be very simple, though of course technically it
    has a syntax. I don't want people constructing formal grammars for it, it's
    just a list of resources to be included in the program.

    So the basic string syntax is

    <string name = "fred">Fred</string>

    That creates the output

    char *fred = "Fred";

    The other basic case is

    <string name = "fred", src = "/texts/fred.txt"></string>

    which pulls in the file fred.txt, converts it to a C-parseable string, and
    gives it the name "fred".

    So what should happen if the user gives a non-legal C identifier as name?
    Currently, the program just passes it, on the grounds that if he writes
    "1fred?", or, more subtly, "stranger", he knows what he's doing. But there's
    a case for generating warnings at least when this happens.

    The other problem is that people may want to embed newlines or other control
    characters in strings. So really you have to allow

    <string name = "fred">"Fred\n"</string>

    Now that creates problems. An automatic script generator might construct

    <string name = "address">"Four score and seven years ago our fathers brought
    forth on this continent a new nation, conceived in liberty, and dedicated to
    the proposition that all men are created equal."</string>

    expecting the output

    char *address = "\"Four score ..."

    Strictly, you should have two tags, <string> and <stringlit>. But that makes
    the script files hard to write. People will forget when to use <string> and
    when to use <stringlit>.
     
    Malcolm McLean, Sep 29, 2013
    #12
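A sketch of the kind of check that could drive such warnings. The function names are made up, the keyword list is the C89 set only (an assumption, so later keywords such as inline would pass), and the character tests assume the "C" locale. The second function covers the "stranger" case: names that begin with str, mem or wcs followed by a lowercase letter are reserved for future library use as external names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if name has the shape of a C identifier and is not a C89 keyword. */
int is_c_identifier(const char *name)
{
    static const char *const keywords[] = {
        "auto", "break", "case", "char", "const", "continue", "default",
        "do", "double", "else", "enum", "extern", "float", "for", "goto",
        "if", "int", "long", "register", "return", "short", "signed",
        "sizeof", "static", "struct", "switch", "typedef", "union",
        "unsigned", "void", "volatile", "while"
    };
    size_t i;

    /* first character: letter or underscore (this also rejects "") */
    if (!(isalpha((unsigned char)name[0]) || name[0] == '_'))
        return 0;
    /* remaining characters: letters, digits or underscores */
    for (i = 1; name[i] != '\0'; i++)
        if (!(isalnum((unsigned char)name[i]) || name[i] == '_'))
            return 0;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(name, keywords[i]) == 0)
            return 0;
    return 1;
}

/* Returns 1 if name starts with str, mem or wcs plus a lowercase letter. */
int is_reserved_str_name(const char *name)
{
    return (strncmp(name, "str", 3) == 0 ||
            strncmp(name, "mem", 3) == 0 ||
            strncmp(name, "wcs", 3) == 0)
           && islower((unsigned char)name[3]);
}
```

Here "1fred?" fails the first check outright, while "stranger" passes it and is caught only by the second, which suits a warning rather than an error.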
  13. BartC Guest

    When I tried your validcstring() function, it seemed to need the quotes
    around the string, so Fred on its own wouldn't work. (I need to write
    validcstring("\"Fred\"") rather than validcstring("Fred").)

    If Fred *is* acceptable, then you are already adapting your stylised syntax
    to be C compatible! That's along the lines of what I've been saying.
    What does it do with characters that need to be escape codes in the C
    string? I can't imagine the entire file contents are surrounded with double
    quotes, nor that new lines are denoted by \n. It should just be a normal text
    file.

    Printing the contents of a string so that it looks like a string literal,
    with special characters converted to escape codes, is something I've done
    many times. So:

    This
    is "my" string

    with an embedded newline, is displayed as "This\nis \"my\" string".

    Sometimes I optionally allow string input to be quoted (in order to allow
    embedded spaces for example), but then this creates a problem of whether the
    quotes are to be part of the output or not; so, should:

    "Fred"

    be output as "Fred" or as "\"Fred\""? (Generally, the first set of quotes,
    if present, is not part of the contents.)
    Checking C identifiers is much simpler (unless perhaps the user wants to use
    advanced techniques to create the name such as macro calls). But as you say
    an error will be generated later. You might want to (if you don't already) add a
    comment to the generated C which refers to the line number in the input
    script file, to help trace back the error.
    (Now the script is generated automatically as well as the C output!)
    People will prefer to write the above just as it is, without special syntax,
    or perhaps to paste from another source without the tedium of converting
    newlines to "\n" or splitting into multiple strings. But it's *your* script
    language, and you can do what you like! Generally script languages are used
    because they're simpler than writing in something like C, not just as
    hard! (In fact harder, with real C syntax embedded in the script.)
     
    BartC, Sep 29, 2013
    #13
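A minimal sketch of the conversion described above, turning raw text into something that looks like a C string literal. The name escape_to_literal and its buffer interface are assumptions for illustration. Non-printable bytes are written as three-digit octal escapes so that a following digit cannot be absorbed into the escape. Trigraph sequences are not handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write text into out (capacity cap) as a quoted C string literal.
   Returns the number of characters written, or -1 if out is too small. */
int escape_to_literal(const char *text, char *out, size_t cap)
{
    size_t n = 0;
    const char *p;

    if (cap < 3)                    /* need room for "" and the NUL */
        return -1;
    out[n++] = '"';
    for (p = text; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        char buf[5];                /* backslash + 3 octal digits + NUL */
        size_t len;

        switch (c) {
        case '\n': strcpy(buf, "\\n");  break;
        case '\t': strcpy(buf, "\\t");  break;
        case '\r': strcpy(buf, "\\r");  break;
        case '"':  strcpy(buf, "\\\""); break;
        case '\\': strcpy(buf, "\\\\"); break;
        default:
            if (isprint(c)) {
                buf[0] = (char)c;
                buf[1] = '\0';
            } else {
                sprintf(buf, "\\%03o", (unsigned)c);
            }
            break;
        }
        len = strlen(buf);
        if (n + len + 2 > cap)      /* keep room for closing quote and NUL */
            return -1;
        memcpy(out + n, buf, len);
        n += len;
    }
    out[n++] = '"';
    out[n] = '\0';
    return (int)n;
}
```

Given the two-line text from the post, this produces "This\nis \"my\" string", and plain Fred becomes "Fred".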