validcstring function


Malcolm McLean

Are all the i's dotted and t's crossed with this?



/*
test if a string is a valid C string - can it go in the place
of a C string literal?
*/
int validcstring(const char *str)
{
  size_t len;
  size_t i, j;
  size_t start;
  size_t end;

  len = strlen(str);
  if(len < 2)
    return 0;
  for(start=0; start < len; start++)
    if(!isspace( (unsigned char) str[start]))
      break;
  end = len;
  while(end--)
    if(!isspace((unsigned char) str[end]))
      break;
  if(start == end || str[start] != '\"' || str[end] != '\"')
    return 0;
  start++;
  end--;

  for(i=start;i<end;i++)
  {
    if(str[i] == '\\')
    {
      if(strchr("aftbrnvxo01234567\?\\\'\"", str[i+1]))
      {
        if(str[i+1] == 'o')
        {
          if(!isdigit((unsigned char) str[i+2]) ||
             !isdigit((unsigned char) str[i+3]) ||
             !isdigit((unsigned char) str[i+4]) )
            return 0;
        }
        else if(str[i+1] == 'x')
        {
          if(!isxdigit((unsigned char) str[i+2]) ||
             !isxdigit((unsigned char) str[i+3]))
            return 0;
        }
        else
          i++;
        continue;
      }
    }

    if(str[i] == '\"')
    {
      for(j=i+1;j<end;j++)
        if(!isspace((unsigned char) str[j]))
          break;
      if(str[j] == '\"')
        i = j;
      else
        return 0;
    }
    if(!isgraph( (unsigned char) str[i]) && str[i] != ' ')
      return 0;
  }

  return 1;
}
 

Alan Curry

Malcolm McLean said:
Are all the i's dotted and t's crossed with this?

/*
test if a string is a valid C string - can it go in the place
of a C string literal?
*/
int validcstring(const char *str)
{

It accepts "\"

Missing \u for Unicode

What is \o supposed to be? I haven't heard of it, and gcc doesn't like it
either.

Is it right to reject a string containing an actual tab character instead of
backslash t? gcc accepts these even in pedantic mode. But it also accepts
most control characters so I guess you have to draw a line somewhere.
 
Ben Bacarisse

Malcolm McLean said:
Are all the i's dotted and t's crossed with this?

No.

/*
test if a string is a valid C string - can it go in the place
of a C string literal?
*/

C literals include L"...". Recent versions permit universal character
names. You may have excluded these by design, but then the comment is
misleading.

int validcstring(const char *str)

It incorrectly accepts "\o123" whilst correctly rejecting "\o". Then
again, it incorrectly accepts "\p". It incorrectly rejects "\x1x". It
incorrectly accepts "\". It incorrectly accepts """" (Are you, perhaps,
trying to implement the concatenation of string literals? If so the
comment needs further adjusting.)

It looks untested to me. The code is rather laboured -- by which I mean
the structure of it does not reflect the structure in the spec. That
might explain a bug or two.

<snip code>
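
To illustrate what "reflecting the structure in the spec" could look like, here is a minimal sketch that follows the escape-sequence grammar one production at a time. It is not the code from this thread: the names hex_digits, parse_escape and is_single_string_literal are hypothetical, and the sketch assumes a single unprefixed literal with no surrounding whitespace, no L prefix, no concatenation and no trigraphs. It does not check the value range of hex escapes or the restrictions on which universal character names are allowed.

```c
#include <ctype.h>
#include <stddef.h>

/* Consume exactly n hex digits starting at p.
   Returns a pointer past them, or NULL if fewer than n are present. */
static const char *hex_digits(const char *p, int n)
{
    while (n-- > 0) {
        if (!isxdigit((unsigned char)*p))
            return NULL;
        p++;
    }
    return p;
}

/* Parse one escape sequence; p points just after the backslash.
   Returns a pointer past the escape, or NULL if it is not valid. */
static const char *parse_escape(const char *p)
{
    int n;

    switch (*p) {
    case '\'': case '"': case '?': case '\\':
    case 'a': case 'b': case 'f': case 'n':
    case 'r': case 't': case 'v':
        return p + 1;                 /* simple escape */
    case 'x':                         /* \x plus one or more hex digits */
        p++;
        if (!isxdigit((unsigned char)*p))
            return NULL;
        while (isxdigit((unsigned char)*p))
            p++;
        return p;
    case 'u':                         /* \u plus exactly four hex digits */
        return hex_digits(p + 1, 4);
    case 'U':                         /* \U plus exactly eight hex digits */
        return hex_digits(p + 1, 8);
    default:                          /* octal: one to three octal digits */
        if (*p < '0' || *p > '7')
            return NULL;
        for (n = 0; n < 3 && *p >= '0' && *p <= '7'; n++)
            p++;
        return p;
    }
}

/* Returns 1 if str is exactly one unprefixed string literal, else 0. */
int is_single_string_literal(const char *str)
{
    const char *p = str;

    if (*p++ != '"')
        return 0;
    for (;;) {
        if (*p == '"')
            return p[1] == '\0';      /* closing quote must end the input */
        if (*p == '\0' || *p == '\n')
            return 0;                 /* unterminated literal */
        if (*p == '\\') {
            p = parse_escape(p + 1);
            if (p == NULL)
                return 0;
        } else {
            p++;
        }
    }
}
```

Under these assumptions the sketch rejects "\", "\o123", "\p" and """", and accepts "\x1x", which are the cases listed above.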
 
Malcolm McLean

Ben Bacarisse said:
C literals include L"...". Recent versions permit universal character
names. You may have excluded these by design, but then the comment is
misleading.

It incorrectly accepts "\o123" whilst correctly rejecting "\o". Then
again, it incorrectly accepts "\p". It incorrectly rejects "\x1x". It
incorrectly accepts "\". It incorrectly accepts """" (Are you, perhaps,
trying to implement the concatenation of string literals? If so the
comment needs further adjusting.)

It looks untested to me. The code is rather laboured -- by which I mean
the structure of it does not reflect the structure in the spec. That
might explain a bug or two.

Yes, it started off as a trivial function testing for an opening and closing
quote. But I realised that people might want constructs like

char *str = "Fred "
"Bloggs";

(It's for an automatic code generator).
The idea is that people write in their scripts
<string> "This is a C literal\n" </string>
or
<string> This is embedded text
with a newline </string>

and it intelligently distinguishes between the two, escaping the embedded
text.

It sort of steadily ballooned until I wasn't sure whether I had something
that actually met the spec.
 
Ben Bacarisse

Malcolm McLean said:
Yes, it started off as a trivial function testing for an opening and closing
quote. But I realised that people might want constructs like

char *str = "Fred "
"Bloggs";

C permits other things between concatenated string literals. Most
notably, comments.
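
A rough sketch of what skipping the material between two adjacent literals might involve, assuming only whitespace, /* */ comments and C99 // comments need handling. The function name skip_space_and_comments is hypothetical, and backslash-newline line splicing is deliberately ignored.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip whitespace and comments starting at p.
   Returns a pointer to the next significant character (possibly the
   terminating '\0'), or NULL if a block comment is never closed. */
const char *skip_space_and_comments(const char *p)
{
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (p[0] == '/' && p[1] == '*') {
            p += 2;
            while (p[0] != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (p[0] == '\0')
                return NULL;          /* unterminated block comment */
            p += 2;                   /* step over the closing star-slash */
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p != '\0' && *p != '\n')
                p++;                  /* line comment runs to end of line */
        } else {
            return p;
        }
    }
}
```

A validator for concatenated literals could call this after each closing quote and then expect either another opening quote or the end of the input.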

<snip>
 
Malcolm McLean

Ben Bacarisse said:
C permits other things between concatenated string literals. Most
notably, comments.

That's a point.
I don't want to throw half a C parser at this problem.
But it's going to be very difficult for the user to enter newlines in
single line strings if I don't allow escaped C syntax.
He'll have to write

<string>this is a line
</string>
with no trailing whitespace before the newline. That isn't really acceptable.
 
Keith Thompson

Malcolm McLean said:
That's a point.
I don't want to throw half a C parser at this problem.
But it's going to be very difficult for the user to enter newlines in
single line strings if I don't allow escaped C syntax.

You might consider throwing a *whole* C parser at the problem. Find an
existing open-source C parser and modify it so it accepts string
literals rather than translation units, and strip out everything that
recognizes things that can't be part of a string literal.
 
Malcolm McLean

Keith Thompson said:
You might consider throwing a *whole* C parser at the problem. Find an
existing open-source C parser and modify it so it accepts string
literals rather than translation units, and strip out everything that
recognizes things that can't be part of a string literal.

It's for the Baby X resource compiler.

It's a medium-weight program, it's got quite elaborate source to read in
images in various formats, and I've pulled in the freetype library to
rasterise fonts. I thought that adding strings would be a trivial bit
of code, and it is pretty simple to write something that works acceptably.

But ideally you want to support all string literal syntax. I haven't even
decided what to do about Unicode yet - I decided it's too big a problem
to handle for now, though I appreciate that by doing that I'm storing up
compatibility issues for the future.

So a full parser wouldn't be that unreasonable. I'm still a bit reluctant
to complicate the source like that, however.
 
BartC

Malcolm McLean said:
It's for the Baby X resource compiler.

It's a medium-weight program, it's got quite elaborate source to read in
images in various formats, and I've pulled in the freetype library to
rasterise fonts. I thought that adding strings would be a trivial bit
of code, and it is pretty simple to write something that works acceptably.

Are the people writing the input to this program (which might include C
literals) expert C coders?

If not, then they probably won't know every feature to do with C string
literals, so just document what is supported by your recogniser. Possibly
even make up your own features (of how to deal with new lines for example)
which can then be trivially converted to valid C. (Because the output needs
to be valid C, but the input doesn't need to be, or it could be a subset.)
 
Malcolm McLean

BartC said:
Are the people writing the input to this program (which might include C
literals) expert C coders?

Baby X is a toolkit or widget library for the X window system.

The Baby X resource compiler is going to be a companion program for
embedding data into Baby X programs, but it will also be usable on
its own. It takes a script file and outputs compileable C source.

So I'd expect that most users would be pretty competent in C. Strings
are a bit of an afterthought. But I think most users would expect
a string type.
 
BartC

Malcolm McLean said:
Baby X is a toolkit or widget library for the X window system.

The Baby X resource compiler is going to be a companion program for
embedding data into Baby X programs, but it will also be usable on
its own. It takes a script file and outputs compileable C source.

So I'd expect that most users would be pretty competent in C. Strings
are a bit of an afterthought. But I think most users would expect
a string type.

OK, but ... the script file presumably has its own non-C syntax, so it isn't
valid C either, so there's no reason for its string literals to be entirely
C-compatible.

Otherwise, it's been pointed out that verifying every possible input that can
legally be a C string can be difficult. For example, users may want to
construct some of their strings like this:

STR(this_is_a_string)

(which depends on a macro existing such as #define STR(a) #a), but you can't
verify that without knowing the entire context in which that fragment is
going to be compiled. Therefore you don't allow this, but that means you are
creating a restriction. So just have a few more!
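
For readers who haven't met the idiom, a small sketch of the stringizing macro mentioned above. The wrapper function stringized_example is hypothetical and exists only so the result can be inspected; the point is that the fragment only becomes a string literal after the preprocessor has run, which a standalone checker never sees.

```c
/* The # operator turns the macro argument into a string literal. */
#define STR(a) #a

/* Hypothetical wrapper: after preprocessing, the body is
   return "this_is_a_string"; */
const char *stringized_example(void)
{
    return STR(this_is_a_string);
}
```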
 
Malcolm McLean

BartC said:
OK, but ... the script file presumably has its own non-C syntax, so it isn't
valid C either, so there's no reason for its string literals to be entirely
C-compatible.

Otherwise, it's been pointed out that verifying every possible input that can
legally be a C string can be difficult. For example, users may want to
construct some of their strings like this:

STR(this_is_a_string)

(which depends on a macro existing such as #define STR(a) #a), but you can't
verify that without knowing the entire context in which that fragment is
going to be compiled. Therefore you don't allow this, but that means you are
creating a restriction. So just have a few more!

The script file is meant to be very simple, though of course technically it
has a syntax. I don't want people constructing formal grammars for it, it's
just a list of resources to be included in the program.

So the basic string syntax is

<string name = "fred">Fred</string>

That creates the output

char *fred = "Fred";

The other basic case is

<string name = "fred", src = "/texts/fred.txt"></string>

which pulls in the file fred.txt, converts it to a C-parseable string, and
gives it the name "fred".

So what should happen if the user gives a non-legal C identifier as name?
Currently, the program just passes it, on the grounds that if he writes
"1fred?", or, more subtly, "stranger", he knows what he's doing. But there's
a case for generating warnings at least when this happens.

The other problem is that people may want to embed newlines or other control
characters in strings. So really you have to allow

<string name = "fred">"Fred\n"</string>

Now that creates problems. An automatic script generator might construct

<string name = "address">"Four score and seven years ago our fathers brought
forth on this continent a new nation, conceived in liberty, and dedicated to
the proposition that all men are created equal."</string>

expecting the output

char *address = "\"Four score ..."

Strictly, you should have two tags, <string> and <stringlit>. But that makes
the script files hard to write. People will forget when to use <string> and
when to use <stringlit>.
 
BartC

Malcolm McLean said:
The script file is meant to be very simple, though of course technically it
has a syntax. I don't want people constructing formal grammars for it, it's
just a list of resources to be included in the program.

So the basic string syntax is

<string name = "fred">Fred</string>

That creates the output

char *fred = "Fred";

When I tried your validcstring() function, it seemed to need the quotes
around the string, so Fred on its own wouldn't work. (I need to write
validcstring("\"Fred\"") rather than validcstring("Fred").)

If Fred *is* acceptable, then you are already adapting your stylised syntax
to be C compatible! That's along the lines of what I've been saying.

Malcolm McLean said:
The other basic case is

<string name = "fred", src = "/texts/fred.txt"></string>

which pulls in the file fred.txt, converts it to a C-parseable string, and
gives it the name "fred".

What does it do with characters that need to be escape codes in the C
string? I can't imagine the entire file contents are surrounded with double
quotes, nor that new lines are denoted by \n. It should just be a normal text
file.

Printing the contents of a string so that it looks like a string literal,
with special characters converted to escape codes, is something I've done
many times. So:

This
is "my" string

with an embedded newline, is displayed as "This\nis \"my\" string".
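
A minimal sketch of that kind of conversion, under some assumptions: the function name make_c_literal is hypothetical, the input is treated as ASCII, anything outside the printable ASCII range is written as a three-digit octal escape (so a following digit cannot be absorbed into it), and trigraph sequences such as ??/ are not specially handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'd string holding a C string literal for text,
   or NULL if allocation fails. The caller frees the result. */
char *make_c_literal(const char *text)
{
    size_t len = strlen(text);
    /* Worst case: every byte becomes a 4-character octal escape,
       plus two quotes and the terminating '\0'. */
    char *out = malloc(len * 4 + 3);
    char *q = out;
    size_t i;

    if (out == NULL)
        return NULL;
    *q++ = '"';
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];

        switch (c) {
        case '\n': *q++ = '\\'; *q++ = 'n';  break;
        case '\t': *q++ = '\\'; *q++ = 't';  break;
        case '\r': *q++ = '\\'; *q++ = 'r';  break;
        case '"':  *q++ = '\\'; *q++ = '"';  break;
        case '\\': *q++ = '\\'; *q++ = '\\'; break;
        default:
            if (c >= 0x20 && c < 0x7f)
                *q++ = (char)c;                    /* printable ASCII */
            else
                q += sprintf(q, "\\%03o", (unsigned)c); /* e.g. \001 */
        }
    }
    *q++ = '"';
    *q = '\0';
    return out;
}
```

Given the two-line text above, this should produce the same "This\nis \"my\" string" form.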

Sometimes I optionally allow string input to be quoted (in order to allow
embedded spaces for example), but then this creates a problem of whether the
quotes are to be part of the output or not; so, should:

"Fred"

be output as "Fred" or as "\"Fred\""? (Generally, the first set of quotes,
if present, is not part of the contents.)

Malcolm McLean said:
So what should happen if the user gives a non-legal C identifier as name?
Currently, the program just passes it, on the grounds that if he writes
"1fred?", or, more subtly, "stranger", he knows what he's doing. But there's
a case for generating warnings at least when this happens.

Checking C identifiers is much simpler (unless perhaps the user wants to use
advanced techniques to create the name such as macro calls). But as you say
an error will be generated later. You might want to (if you don't already) add a
comment to the generated C which refers to the line number in the input
script file, to help trace back the error.
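
A sketch of how small such a check can be, with hypothetical names is_c_identifier and looks_reserved. It assumes the C99 keyword set without the underscore-prefixed keywords, the "C" locale, and no universal character names or extended characters in identifiers. The second function is only a warning heuristic: it flags a leading underscore, and the "stranger" case, where a name beginning with str followed by a lowercase letter falls into the space the standard reserves for future library functions.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if name has the form of a C identifier and is not a keyword. */
int is_c_identifier(const char *name)
{
    static const char *const keywords[] = {
        "auto", "break", "case", "char", "const", "continue", "default",
        "do", "double", "else", "enum", "extern", "float", "for", "goto",
        "if", "inline", "int", "long", "register", "restrict", "return",
        "short", "signed", "sizeof", "static", "struct", "switch",
        "typedef", "union", "unsigned", "void", "volatile", "while"
    };
    size_t i;

    /* First character: letter or underscore (this rejects "" too). */
    if (!isalpha((unsigned char)name[0]) && name[0] != '_')
        return 0;
    /* Remaining characters: letters, digits or underscores. */
    for (i = 1; name[i] != '\0'; i++)
        if (!isalnum((unsigned char)name[i]) && name[i] != '_')
            return 0;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(name, keywords[i]) == 0)
            return 0;
    return 1;
}

/* Heuristic: returns 1 if name is worth a warning because it may
   collide with names reserved for the implementation or library. */
int looks_reserved(const char *name)
{
    if (name[0] == '_')
        return 1;
    return strncmp(name, "str", 3) == 0 && islower((unsigned char)name[3]);
}
```

With this, "1fred?" fails the identifier check outright, while "stranger" passes it but triggers the warning heuristic.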

Malcolm McLean said:
The other problem is that people may want to embed newlines or other control
characters in strings. So really you have to allow

<string name = "fred">"Fred\n"</string>

Now that creates problems. An automatic script generator might construct

(Now the script is generated automatically as well as the C output!)

Malcolm McLean said:
<string name = "address">"Four score and seven years ago our fathers brought
forth on this continent a new nation, conceived in liberty, and dedicated to
the proposition that all men are created equal."</string>

expecting the output

char *address = "\"Four score ..."

People will prefer to write the above just as it is, without special syntax,
or perhaps to paste from another source without the tedium of converting
newlines to "\n" or splitting into multiple strings. But it's *your* script
language, and you can do what you like! Generally script languages are used
because they're simpler than writing in something like C, not just as
hard! (In fact harder, with real C syntax embedded in the script.)
 
