validcstring function

Discussion in 'C Programming' started by Malcolm McLean, Sep 28, 2013.

  1. Are all the i's dotted and t's crossed with this?



    #include <ctype.h>
    #include <string.h>

    /*
    test if a string is a valid C string - can it go in the place
    of a C string literal?
    */
    int validcstring(const char *str)
    {
        size_t len;
        size_t i, j;
        size_t start;
        size_t end;

        len = strlen(str);
        if(len < 2)
            return 0;
        /* skip leading white space */
        for(start=0; start < len; start++)
            if(!isspace( (unsigned char) str[start]))
                break;
        /* skip trailing white space */
        end = len;
        while(end--)
            if(!isspace((unsigned char) str[end]))
                break;
        /* must open and close with a double quote */
        if(start == end || str[start] != '\"' || str[end] != '\"')
            return 0;
        start++;
        end--;

        for(i=start;i<end;i++)
        {
            /* escape sequences */
            if(str[i] == '\\')
            {
                if(strchr("aftbrnvxo01234567\?\\\'\"", str[i+1]))
                {
                    if(str[i+1] == 'o')
                    {
                        if(!isdigit((unsigned char) str[i+2]) ||
                           !isdigit((unsigned char) str[i+3]) ||
                           !isdigit((unsigned char) str[i+4]) )
                            return 0;
                    }
                    else if(str[i+1] == 'x')
                    {
                        if(!isxdigit((unsigned char) str[i+2]) ||
                           !isxdigit((unsigned char) str[i+3]))
                            return 0;
                    }
                    else
                        i++;
                    continue;
                }
            }

            /* a quote must be followed by white space and another quote */
            if(str[i] == '\"')
            {
                for(j=i+1;j<end;j++)
                    if(!isspace((unsigned char) str[j]))
                        break;
                if(str[j] == '\"')
                    i = j;
                else
                    return 0;
            }
            if(!isgraph( (unsigned char) str[i]) && str[i] != ' ')
                return 0;
        }

        return 1;
    }
    Malcolm McLean, Sep 28, 2013
    #1

  2. Alan Curry Guest

    In article <>,
    Malcolm McLean <> wrote:
    >Are all the i's dotted and t's crossed with this?
    >
    >/*
    > test if a string is a valid C string - can it go in the place
    > of a C string literal?
    > */
    >int validcstring(const char *str)
    >{


    It accepts "\"

    Missing \u for Unicode

    What is \o supposed to be? I haven't heard of it, and gcc doesn't like it
    either.

    Is it right to reject a string containing an actual tab character instead of
    backslash t? gcc accepts these even in pedantic mode. But it also accepts
    most control characters so I guess you have to draw a line somewhere.

    --
    Alan Curry
    Alan Curry, Sep 28, 2013
    #2

  3. Malcolm McLean <> writes:

    > Are all the i's dotted and t's crossed with this?


    No.

    > /*
    > test if a string is a valid C string - can it go in the place
    > of a C string literal?


    C literals include L"...". Recent versions permit universal character
    names. You may have excluded these by design, but then the comment is
    misleading.

    > */
    > int validcstring(const char *str)


    It incorrectly accepts "\o123" whilst correctly rejecting "\o". Then
    again, it incorrectly accepts "\p". It incorrectly rejects "\x1x". It
    incorrectly accepts "\". It incorrectly accepts """" (Are you, perhaps,
    trying to implement the concatenation of string literals? If so the
    comment needs further adjusting.)

    It looks untested to me. The code is rather laboured -- by which I mean
    the structure of it does not reflect the structure in the spec. That
    might explain a bug or two.
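One way to make the code follow the structure of the spec is to give each escape form its own branch. The sketch below is only an illustration under stated assumptions: `escape_length` and `is_single_string_literal` are hypothetical names, it handles one narrow literal with no prefix and no concatenation, and it does not check that a hex escape or universal character name has a permitted value.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the escape sequence starting at s[0] == '\\', or 0 if it is
   not a valid escape. One branch per form in the grammar: simple,
   octal (1 to 3 digits), hex (1 or more digits), \uXXXX, \UXXXXXXXX. */
static size_t escape_length(const char *s)
{
    size_t n;

    if (s[1] == '\0')
        return 0;
    if (strchr("'\"?\\abfnrtv", s[1]))          /* simple escapes */
        return 2;
    if (s[1] >= '0' && s[1] <= '7') {           /* octal escapes */
        n = 2;
        while (n < 4 && s[n] >= '0' && s[n] <= '7')
            n++;
        return n;
    }
    if (s[1] == 'x') {                          /* hex escapes */
        n = 2;
        while (isxdigit((unsigned char) s[n]))
            n++;
        return n > 2 ? n : 0;
    }
    if (s[1] == 'u' || s[1] == 'U') {           /* universal character names */
        size_t want = (s[1] == 'u') ? 4 : 8;
        for (n = 0; n < want; n++)
            if (!isxdigit((unsigned char) s[2 + n]))
                return 0;
        return 2 + want;
    }
    return 0;
}

/* Return 1 if str is exactly one narrow string literal, else 0. */
int is_single_string_literal(const char *str)
{
    size_t i = 1;

    if (str[0] != '"')
        return 0;
    while (str[i] != '\0' && str[i] != '"') {
        if (str[i] == '\n')                     /* no raw newline inside */
            return 0;
        if (str[i] == '\\') {
            size_t n = escape_length(str + i);
            if (n == 0)
                return 0;
            i += n;
        } else {
            i++;
        }
    }
    /* the first unescaped quote must also be the last character */
    return str[i] == '"' && str[i + 1] == '\0';
}
```

With this structure the cases listed above fall out of the grammar: `"\"`, `"\p"`, `"\o123"` and `""""` are rejected, and `"\x1x"` is accepted as a one-digit hex escape followed by an ordinary `x`.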

    <snip code>
    --
    Ben.
    Ben Bacarisse, Sep 28, 2013
    #3
  4. On Saturday, September 28, 2013 4:05:38 PM UTC+1, Ben Bacarisse wrote:
    > Malcolm McLean <> writes:
    >
    > > Are all the i's dotted and t's crossed with this?
    >
    > No.
    >
    > > /*
    > > test if a string is a valid C string - can it go in the place
    > > of a C string literal?
    >
    > C literals include L"...". Recent versions permit universal character
    > names. You may have excluded these by design, but then the comment is
    > misleading.
    >
    > > */
    > > int validcstring(const char *str)
    >
    > It incorrectly accepts "\o123" whilst correctly rejecting "\o". Then
    > again, it incorrectly accepts "\p". It incorrectly rejects "\x1x". It
    > incorrectly accepts "\". It incorrectly accepts """" (Are you, perhaps,
    > trying to implement the concatenation of string literals? If so the
    > comment needs further adjusting.)
    >
    > It looks untested to me. The code is rather laboured -- by which I mean
    > the structure of it does not reflect the structure in the spec. That
    > might explain a bug or two.

    Yes, it started off as a trivial function testing for an opening and closing
    quote. But I realised that people might want constructs like

    char *str = "Fred "
    "Bloggs";

    (It's for an automatic code generator).
    The idea is that people write in their scripts
    <string> "This is a C literal\n" </string>
    or
    <string> This is embedded text
    with a newline </string>

    and it intelligently distinguishes between the two, escaping the embedded
    text.

    It sort of steadily ballooned until I wasn't sure whether I had something
    that actually met the spec.
    Malcolm McLean, Sep 28, 2013
    #4
  5. Malcolm McLean <> writes:
    <snip>
    > Yes, it started off as a trivial function testing for an opening and closing
    > quote. But I realised that people might want constructs like
    >
    > char *str = "Fred "
    > "Bloggs";


    C permits other things between concatenated string literals. Most
    notably, comments.
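As a rough sketch of what "other things" means in practice, a recogniser for concatenated literals would have to skip white space and both comment styles between one closing quote and the next opening quote. The function name `skip_space_and_comments` is hypothetical, and the sketch ignores backslash-newline line splicing.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip white space and comments starting at s. Returns a pointer to the
   first character that is neither, or NULL if a block comment is never
   closed. */
const char *skip_space_and_comments(const char *s)
{
    for (;;) {
        while (isspace((unsigned char) *s))
            s++;
        if (s[0] == '/' && s[1] == '*') {           /* block comment */
            s += 2;
            while (s[0] != '\0' && !(s[0] == '*' && s[1] == '/'))
                s++;
            if (s[0] == '\0')
                return NULL;
            s += 2;
        } else if (s[0] == '/' && s[1] == '/') {    /* line comment */
            while (*s != '\0' && *s != '\n')
                s++;
        } else {
            return s;
        }
    }
}
```

A caller would invoke this after each closing quote and accept the input only if the result points at another `"` or at the end of the text.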

    <snip>
    --
    Ben.
    Ben Bacarisse, Sep 28, 2013
    #5
  6. On Saturday, September 28, 2013 6:06:49 PM UTC+1, Ben Bacarisse wrote:
    > Malcolm McLean <> writes:
    >
    > > Yes, it started off as a trivial function testing for an opening and closing
    > > quote. But I realised that people might want constructs like
    > >
    > > char *str = "Fred "
    > > "Bloggs";
    >
    > C permits other things between concatenated string literals. Most
    > notably, comments.

    That's a point.
    I don't want to throw half a C parser at this problem.
    But it's going to be very difficult for the user to enter newlines in
    single line strings if I don't allow escaped C syntax.
    He'll have to write

    <string>this is a line
    </string>
    with no trailing whitespace before the newline. That isn't really acceptable.
    Malcolm McLean, Sep 28, 2013
    #6
  7. Malcolm McLean <> writes:
    > On Saturday, September 28, 2013 6:06:49 PM UTC+1, Ben Bacarisse wrote:
    >> Malcolm McLean <> writes:
    >>
    >> > Yes, it started off as a trivial function testing for an opening
    >> > and closing quote. But I realised that people might want constructs
    >> > like
    >> >
    >> > char *str = "Fred "
    >> > "Bloggs";
    >>
    >> C permits other things between concatenated string literals. Most
    >> notably, comments.

    > That's a point.
    > I don't want to throw half a C parser at this problem.
    > But it's going to be very difficult for the user to enter newlines in
    > single line strings if I don't allow escaped C syntax.


    You might consider throwing a *whole* C parser at the problem. Find an
    existing open-source C parser and modify it so it accepts string
    literals rather than translation units, and strip out everything that
    recognizes things that can't be part of a string literal.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Working, but not speaking, for JetHead Development, Inc.
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Sep 28, 2013
    #7
  8. On Saturday, September 28, 2013 7:49:46 PM UTC+1, Keith Thompson wrote:
    > Malcolm McLean <> writes:
    >
    > You might consider throwing a *whole* C parser at the problem. Find an
    > existing open-source C parser and modify it so it accepts string
    > literals rather than translation units, and strip out everything that
    > recognizes things that can't be part of a string literal.
    >

    It's for the Baby X resource compiler.

    It's a medium-weight program, it's got quite elaborate source to read in
    images in various formats, and I've pulled in the freetype library to
    rasterise fonts. I thought that adding strings would be a trivial bit
    of code, and it is pretty simple to write something that works acceptably.

    But ideally you want to support all string literal syntax. I haven't even
    decided what to do about Unicode yet - I decided it's too big a problem
    to handle for now, though I appreciate that by doing that I'm storing up
    compatibility issues for the future.

    So a full parser wouldn't be that unreasonable. I'm still a bit reluctant
    to complicate the source like that, however.
    Malcolm McLean, Sep 28, 2013
    #8
  9. BartC Guest

    "Malcolm McLean" <> wrote in message
    news:...
    > On Saturday, September 28, 2013 7:49:46 PM UTC+1, Keith Thompson wrote:
    >> Malcolm McLean <> writes:
    >>
    >> You might consider throwing a *whole* C parser at the problem. Find an
    >> existing open-source C parser and modify it so it accepts string
    >> literals rather than translation units, and strip out everything that
    >> recognizes things that can't be part of a string literal.
    >>

    > It's for the Baby X resource compiler.
    >
    > It's a medium-weight program, it's got quite elaborate source to read in
    > images in various formats, and I've pulled in the freetype library to
    > rasterise fonts. I thought that adding strings would be a trivial bit
    > of code, and it is pretty simple to write something that works acceptably.


    Are the people writing the input to this program (which might include C
    literals) expert C coders?

    If not, then they probably won't know every feature to do with C string
    literals, so just document what is supported by your recogniser. Possibly
    even make up your own features (of how to deal with new lines for example)
    which can then be trivially converted to valid C. (Because the output needs
    to be valid C, but the input doesn't need to be, or it could be a subset.)

    --
    Bartc
    BartC, Sep 28, 2013
    #9
  10. On Saturday, September 28, 2013 10:53:37 PM UTC+1, Bart wrote:
    > "Malcolm McLean" <> wrote in message
    >
    > Are the people writing the input to this program (which might include C
    > literals) expert C coders?
    >

    Baby X is a toolkit or widget library for the X window system.

    The Baby X resource compiler is going to be a companion program for
    embedding data into Baby X programs, but it will also be usable on
    its own. It takes a script file and outputs compileable C source.

    So I'd expect that most users would be pretty competent in C. Strings
    are a bit of an afterthought. But I think most users would expect
    a string type.
    Malcolm McLean, Sep 28, 2013
    #10
  11. BartC Guest

    "Malcolm McLean" <> wrote in message
    news:...
    > On Saturday, September 28, 2013 10:53:37 PM UTC+1, Bart wrote:
    >> "Malcolm McLean" <> wrote in message
    >>
    >> Are the people writing the input to this program (which might include C
    >> literals) expert C coders?
    >>

    > Baby X is a toolkit or widget library for the X window system.
    >
    > The Baby X resource compiler is going to be a companion program for
    > embedding data into Baby X programs, but it will also be usable on
    > its own. It takes a script file and outputs compileable C source.
    >
    > So I'd expect that most users would be pretty competent in C. Strings
    > are a bit of an afterthought. But I think most users would expect
    > a string type.


    OK, but ... the script file presumably has its own non-C syntax, so it isn't
    valid C either, so there's no reason for its string literals to be entirely
    C-compatible.

    Otherwise, it's been pointed out that verifying every possible input that can
    legally be a C string can be difficult. For example, users may want to
    construct some of their strings like this:

    STR(this_is_a_string)

    (which depends on a macro existing such as #define STR(a) #a), but you can't
    verify that without knowing the entire context in which that fragment is
    going to be compiled. Therefore you don't allow this, but that means you are
    creating a restriction. So just have a few more!
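To illustrate why such a fragment cannot be checked in isolation, here is a small sketch. Only `STR` comes from the post; `XSTR`, `WIDTH` and the three functions are made up for the example. The same-looking call yields different strings depending on which macros are defined where it is compiled.

```c
#define STR(a) #a          /* from the post: stringify the argument as written */
#define XSTR(a) STR(a)     /* illustrative: expand the argument, then stringify */
#define WIDTH 80           /* illustrative macro that may or may not exist */

/* The argument is not a macro, so it is stringified as written. */
const char *plain(void)      { return STR(this_is_a_string); }

/* The operand of # is not macro-expanded, so this is "WIDTH". */
const char *unexpanded(void) { return STR(WIDTH); }

/* The extra level expands WIDTH first, so this is "80". */
const char *expanded(void)   { return XSTR(WIDTH); }
```

A recogniser that sees only the script text has no way of knowing whether `WIDTH`, or `STR` itself, is defined in the program the output ends up in.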

    --
    Bartc
    BartC, Sep 29, 2013
    #11
  12. On Sunday, September 29, 2013 9:12:37 AM UTC+1, Bart wrote:
    > "Malcolm McLean" <> wrote in message
    >
    >
    > > The Baby X resource compiler is going to be a companion program for
    > > embedding data into Baby X programs, but it will also be usable on
    > > its own. It takes a script file and outputs compileable C source.
    > >
    > > So I'd expect that most users would be pretty competent in C. Strings
    > > are a bit of an afterthought. But I think most users would expect
    > > a string type.
    >
    > OK, but ... the script file presumably has its own non-C syntax, so it isn't
    > valid C either, so there's no reason for its string literals to be entirely
    > C-compatible.
    >
    > Otherwise, it's been pointed out that verifying every possible input that can
    > legally be a C string can be difficult. For example, users may want to
    > construct some of their strings like this:
    >
    > STR(this_is_a_string)
    >
    > (which depends on a macro existing such as #define STR(a) #a), but you can't
    > verify that without knowing the entire context in which that fragment is
    > going to be compiled. Therefore you don't allow this, but that means you are
    > creating a restriction. So just have a few more!

    The script file is meant to be very simple, though of course technically it
    has a syntax. I don't want people constructing formal grammars for it, it's
    just a list of resources to be included in the program.

    So the basic string syntax is

    <string name = "fred">Fred</string>

    That creates the output

    char *fred = "Fred";

    The other basic case is

    <string name = "fred", src = "/texts/fred.txt"></string>

    which pulls in the file fred.txt, converts it to a C-parseable string, and
    gives it the name "fred".

    So what should happen if the user gives a non-legal C identifier as name?
    Currently, the program just passes it, on the grounds that if he writes
    "1fred?", or, more subtly, "stranger", he knows what he's doing. But there's
    a case for generating warnings at least when this happens.
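A warning pass along those lines might look like the sketch below. Both function names are hypothetical. It assumes the C locale and ASCII names, does not check for keywords, and covers only two reserved families: names with a leading underscore, and names starting with `str`, `mem` or `wcs` followed by a lowercase letter, which is why "stranger" is the subtle case.

```c
#include <ctype.h>
#include <string.h>

/* 1 if name has the form of a C identifier (letters, digits and
   underscores, not starting with a digit), else 0. */
int is_identifier_syntax(const char *name)
{
    size_t i;

    if (!(isalpha((unsigned char) name[0]) || name[0] == '_'))
        return 0;
    for (i = 1; name[i] != '\0'; i++)
        if (!(isalnum((unsigned char) name[i]) || name[i] == '_'))
            return 0;
    return 1;
}

/* 1 if name is well-formed but falls in one of the reserved families
   this sketch knows about, else 0. */
int is_reserved_name(const char *name)
{
    static const char *const prefixes[] = { "str", "mem", "wcs" };
    size_t i;

    if (name[0] == '_')
        return 1;
    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++)
        if (strncmp(name, prefixes[i], 3) == 0 &&
            islower((unsigned char) name[3]))
            return 1;
    return 0;
}
```

The tool could treat a failure of the first check as an error or warning ("1fred?") and a hit on the second as a milder warning ("stranger"), while still passing the name through.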

    The other problem is that people may want to embed newlines or other control
    characters in strings. So really you have to allow

    <string name = "fred">"Fred\n"</string>

    Now that creates problems. An automatic script generator might construct

    <string name = "address">"Four score and seven years ago our fathers brought
    forth on this continent a new nation, conceived in liberty, and dedicated to
    the proposition that all men are created equal."</string>

    expecting the output

    char *address = "\"Four score ..."

    Strictly, you should have two tags, <string> and <stringlit>. But that makes
    the script files hard to write. People will forget when to use <string> and
    when to use <stringlit>.
    Malcolm McLean, Sep 29, 2013
    #12
  13. BartC Guest

    "Malcolm McLean" <> wrote in message
    news:...

    > The script file is meant to be very simple, though of course technically
    > it has a syntax. I don't want people constructing formal grammars for it,
    > it's just a list of resources to be included in the program.
    >
    > So the basic string syntax is
    >
    > <string name = "fred">Fred</string>
    >
    > That creates the output
    >
    > char *fred = "Fred";


    When I tried your validcstring() function, it seemed to need the quotes
    around the string, so Fred on its own wouldn't work. (I need to write
    validcstring("\"Fred\"") rather than validcstring("Fred").)

    If Fred *is* acceptable, then you are already adapting your stylised syntax
    to be C compatible! That's along the lines of what I've been saying.

    > The other basic case is
    >
    > <string name = "fred", src = "/texts/fred.txt"></string>
    >
    > which pulls in the file fred.txt, converts it to a C-parseable string, and
    > gives it the name "fred".


    What does it do with characters that need to be escape codes in the C
    string? I can't imagine the entire file contents are surrounded with double
    quotes, nor that new lines are denoted by \n. It should just be a normal
    text file.

    Printing the contents of a string so that it looks like a string literal,
    with special characters converted to escape codes, is something I've done
    many times. So:

    This
    is "my" string

    with an embedded newline, is displayed as "This\nis \"my\" string".
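A minimal sketch of that conversion, assuming a caller-supplied buffer and the C locale. The name `escape_to_literal` is made up for the example. Anything that is not printable is written as a three-digit octal escape, so a digit that follows it cannot be absorbed into the escape. Trigraph sequences are not handled.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Write text into out as a quoted C string literal. Returns 1 on
   success, 0 if the buffer is too small. */
int escape_to_literal(char *out, size_t outsize, const char *text)
{
    static const char special[] = "\a\b\f\n\r\t\v\\\"";
    static const char letters[] = "abfnrtv\\\"";
    size_t used = 0;
    const char *hit;

    if (outsize < 3)                    /* room for "" and the terminator */
        return 0;
    out[used++] = '"';
    for (; *text != '\0'; text++) {
        unsigned char ch = (unsigned char) *text;

        /* worst case: 4 characters for \ooo, then closing quote and NUL */
        if (used + 6 > outsize)
            return 0;
        hit = strchr(special, ch);
        if (hit != NULL) {              /* has a simple escape letter */
            out[used++] = '\\';
            out[used++] = letters[hit - special];
        } else if (isprint(ch)) {       /* copy printable characters */
            out[used++] = (char) ch;
        } else {                        /* everything else as octal */
            used += (size_t) sprintf(out + used, "\\%03o", (unsigned) ch);
        }
    }
    out[used++] = '"';
    out[used] = '\0';
    return 1;
}
```

Given the two-line text above, the buffer ends up holding `"This\nis \"my\" string"`, quotes and backslashes included.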

    Sometimes I optionally allow string input to be quoted (in order to allow
    embedded spaces for example), but then this creates a problem of whether the
    quotes are to be part of the output or not; so, should:

    "Fred"

    be output as "Fred" or as "\"Fred\""? (Generally, the first set of quotes,
    if present, is not part of the contents.)

    > So what should happen if the user gives a non-legal C identifier as name?
    > Currently, the program just passes it, on the grounds that if he writes
    > "1fred?", or, more subtly, "stranger", he knows what he's doing. But
    > there's a case for generating warnings at least when this happens.


    Checking C identifiers is much simpler (unless perhaps the user wants to use
    advanced techniques to create the name such as macro calls). But as you say
    an error will be generated later. You might want to (if you don't already) add a
    comment to the generated C which refers to the line number in the input
    script file, to help trace back the error.
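The trace-back comment could be produced along these lines. This is only a sketch: `emit_string_decl`, its parameters and the comment layout are assumptions, not anything the resource compiler is known to do. It refuses a script name containing `*/`, since that would end the comment early.

```c
#include <stdio.h>
#include <string.h>

/* Format a declaration preceded by a comment naming the script file and
   line it came from. literal must already be a quoted, escaped C string.
   Returns 1 on success, 0 if the script name is unsuitable or the buffer
   is too small. */
int emit_string_decl(char *out, size_t outsize,
                     const char *script, unsigned long line,
                     const char *name, const char *literal)
{
    int n;

    if (strstr(script, "*/") != NULL)   /* would close the comment early */
        return 0;
    n = snprintf(out, outsize, "/* %s:%lu */\nchar *%s = %s;\n",
                 script, line, name, literal);
    return n >= 0 && (size_t) n < outsize;
}
```

An alternative with the same aim is to emit a standard `#line` directive before each declaration, so that the compiler's own diagnostics report the script's file name and line number.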

    > The other problem is that people may want to embed newlines or other
    > control characters in strings. So really you have to allow
    >
    > <string name = "fred">"Fred\n"</string>
    >
    > Now that creates problems. An automatic script generator might construct


    (Now the script is generated automatically as well as the C output!)

    > <string name = "address">"Four score and seven years ago our fathers
    > brought forth on this continent a new nation, conceived in liberty, and
    > dedicated to the proposition that all men are created equal."</string>


    > expecting the output
    >
    > char *address = "\"Four score ..."


    People will prefer to write the above just as it is, without special syntax,
    or perhaps to paste from another source without the tedium of converting
    newlines to "\n" or splitting into multiple strings. But it's *your* script
    language, and you can do what you like! Generally script languages are used
    because they're simpler than writing in something like C, not just as
    hard! (In fact harder, with real C syntax embedded in the script.)

    --
    Bartc
    BartC, Sep 29, 2013
    #13