Parsing a string

A

arnuld

I wrote this program (parser ?) to extract two values from a string
separated by a comma. I extract the values without any problem. Error
checking is there with different kinds of input but in comparison to clc
folks my error checking always looks like Calvin and Hobbes plan for
saving themselves form world war 3.

As usual, any advice om improving the program ? (I wonder why the heck
the value of *pname never changes even when pointer pname is going to
next element for each loop)


/* A program that searches for 2 comma separated values inside a string:
name and number. Name can be anything while number can only be
* an integer greater than zero.
*
* VERSION 0.0
*/

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

enum { SIZE_NAME = 10 };

void get_values(const char*, char*, const int, int*);

int main(void)
{
const char* arr1 = "Arnul@23,1";
const char* arr2 = "00#$%^&@23,Arnuld";
const char* arr3 = "Arnuld,";
char name[SIZE_NAME+1] = {0};
int num = -1;

printf("arr: %s\tname: %s\tnum: %d\n", arr3, name, num);
printf("\n");

get_values(arr3, name, SIZE_NAME, &num);
printf("name = %s\tnum = %d\n", name, num);

return 0;
}



void get_values(const char* p, char* pname, const int pname_size, int*
pnum)
{
int idx;

for(idx = 0; (pname_size != idx) && *p && (',' != *p); ++p, ++idx, +
+pname)
{
printf("*pname = %c, *tmp = %c\n", *pname, *p);
*pname = *p;
}

errno = 0;
*pnum = (int) strtol(p+1, NULL, 10);
if(errno)
{
perror("***ERROR***, can not convert string \"Max Phone Lines\"
into integer\n");
}

}
====================== OUTPUT ============================
[arnuld@dune programs]$ gcc -ansi -pedantic -Wall -Wextra test.c
test.c: In function ‘main’:
test.c:19: warning: unused variable ‘arr2’
test.c:18: warning: unused variable ‘arr1’
[arnuld@dune programs]$ ./a.out
arr: Arnuld, name: num: -1

*pname = , *tmp = A
*pname = , *tmp = r
*pname = , *tmp = n
*pname = , *tmp = u
*pname = , *tmp = l
*pname = , *tmp = d
name = Arnuld num = 0
[arnuld@dune programs]$
 
A

arnuld

If you don't like strtok (and many don't), then

man strchr
man memcpy
Actually, from the comments in your program header, actually sscanf
might be a good choice.

I never ever use scanf() and its sibling sscanf(). Just scared of it.

Why not get it to return a value indicating success or failure?
fine.
ugh! When you drop out of here, it's not clear whether it's because you
found a comma, or you ran out of space. Consider the behaviour for
"myDogIsFat42", for example.

I will be using strchr() anyway but just wanted to know how about adding
if(',' == *p) for rest of the code after for loop ?

There's also issues with whether the string
you build is null-terminated correctly.

Its always null terminated. I am actually trying to understand cURL and
WriteMemoryCallback() makes sure that its null terminated before being
passed to my fucntion:

http://curl.haxx.se/libcurl/c/getinmemory.html
 
A

arnuld

If you don't like strtok (and many don't), then

man strchr
man memcpy


Here is the strchr() and memcpy() version, any issues in this:


/* A program that searches for 2 comma seperated values inside a string:
name and number. Name can be anything while number can only be
* an integer greater than zero.
*
* VERSION 0.1
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

enum { SIZE_NAME = 10 };


int main(void)
{
char arrdest[100+1] = {0};
int arrnum = -1;
const char* arr1 = "Arnuld,2";
const char* arr2 = "00#$%^&@23,Arnuld";
const char* arr3 = "Arnuld,";
const char* arr4 = "#$%^&@23,3";
const char* arr5 = ",";
const char* arr6 = "";
const char* arr7 = ",3";
const char* arr8 = "3,";

char* p = NULL;
int recvd_id_size = -1;
char separator = ',';
char arr_used[30+1] = {0};

strcpy(arr_used, arr8);

p = strchr(arr_used,separator);

if(NULL == p)
{
printf("can not find separator inside input\n");
exit(EXIT_FAILURE);
}

recvd_id_size = p - arr_used;

if(recvd_id_size < 1)
{
printf("Did not receive any ID\n");
exit(EXIT_FAILURE);
}

memcpy(arrdest, arr_used, recvd_id_size);

errno = 0;
arrnum = (int) strtol(p+1, NULL, 10);
if(errno)
{
perror("***ERROR***, can not convert string \"Max Phone Lines\"
into an integer\n");
}


printf("ID = %s, num = %d\n", arrdest, arrnum);


return 0;
}
 
E

Eric Sosman

Here is the strchr() and memcpy() version, any issues in this:


/* A program that searches for 2 comma seperated values inside a string:
name and number. Name can be anything while number can only be
* an integer greater than zero.
*
* VERSION 0.1
*/

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<errno.h>

enum { SIZE_NAME = 10 };

What's this for?
int main(void)
{
char arrdest[100+1] = {0};
int arrnum = -1;

Pointless initialization.
const char* arr1 = "Arnuld,2";
const char* arr2 = "00#$%^&@23,Arnuld";
const char* arr3 = "Arnuld,";
const char* arr4 = "#$%^&@23,3";
const char* arr5 = ",";
const char* arr6 = "";
const char* arr7 = ",3";
const char* arr8 = "3,";

char* p = NULL;

Pointless initialization.
int recvd_id_size = -1;

Pointless initialization.
char separator = ',';
char arr_used[30+1] = {0};

Pointless initialization.
strcpy(arr_used, arr8);

Since your parsing is non-destructive (it does not alter the
string it parses), why bother with making a copy and with worrying
about whether the copied-to area is big enough? Just parse `arr8'
directly, in situ.
p = strchr(arr_used,separator);

if(NULL == p)
{
printf("can not find separator inside input\n");
exit(EXIT_FAILURE);
}

recvd_id_size = p - arr_used;

if(recvd_id_size< 1)
{
printf("Did not receive any ID\n");
exit(EXIT_FAILURE);
}

memcpy(arrdest, arr_used, recvd_id_size);

errno = 0;
arrnum = (int) strtol(p+1, NULL, 10);

Probably better to leave it as a `long' and store it in a `long'
variable, lest Strange Things happen with very big numbers.
if(errno)
{
perror("***ERROR***, can not convert string \"Max Phone Lines\"
into an integer\n");
}

A little more checking would be good here. For example, the
code as it stands would find nothing wrong with "Arnuld,33XYZ" or
with "Arnuld,99,Bertram,44". One very useful thing the strtoxxx()
functions provide is a pointer to the first character they did not
convert, that is, a pointer to the tail that follows the converted
number. Checking whether the tail is empty would tell you whether
strtoxxx() had converted the entire string, or whether it found
something indigestible at the end.

Also, you stated that the number should be "greater than zero."
The code doesn't check for this.
printf("ID = %s, num = %d\n", arrdest, arrnum);

If all you want to do is print the ID, you don't need arrdest[]
at all:

printf ("ID = %.*s, num = %d\n",
recvd_id_size, arr8, arrnum);

Alternatively, since you've copied the string to the modifiable
array arr_used[], you could zero-terminate it there:

*p = '\0';
printf ("%ID = %s, num = %d\n", arr_used, arrnum);
return 0;
}

General: You're just assuming that arrdest[] and arr_used[]
will be big enough for the strings you copy into them. This is
true of the strings you have shown, but it's a *very* bad habit
to form. You may have heard the phrases "buffer overflow" or
"buffer overrun" or "stack-smashing for fun and profit;" the
style you're using is exactly the style that's vulnerable to all
of these.
 
A

arnuld

What's this for?

Eh.. leftover from VERSION 0.0 :-\

int main(void)
{
char arrdest[100+1] = {0};
int arrnum = -1;

Pointless initialization.

I never ever leave an array uninitialized because it may cause strange
and paranormal events to happen in program.


Pointless initialization.

Is it wrong to initialized any variable or pointer etc. ?

Since your parsing is non-destructive (it does not alter the
string it parses), why bother with making a copy and with worrying about
whether the copied-to area is big enough? Just parse `arr8' directly,
in situ.

I should have been clear. Input array is being used at 3 or more many
places below this strcpy() code. So rather than changing it 3 times for
every single change in input array name (8 input arrays for error
checking), I created this new array name and all I have to do is just one
change in name rather than 3. It serves the purpose.


Probably better to leave it as a `long' and store it in a `long'
variable, lest Strange Things happen with very big numbers.

The input will never have more than 4 digit number and hence will always
be an int. More important thing is I have a whole lot of code here in my
company which uses this arrnum value and at all those places it is an
int, hence can't help. I myself, do not like conversions like this.




A little more checking would be good here. For example, the
code as it stands would find nothing wrong with "Arnuld,33XYZ" or with
"Arnuld,99,Bertram,44". One very useful thing the strtoxxx() functions
provide is a pointer to the first character they did not convert, that
is, a pointer to the tail that follows the converted number. Checking
whether the tail is empty would tell you whether strtoxxx() had
converted the entire string, or whether it found something indigestible
at the end.

Humm.. just checked section 16.4 of H&S 5, you are right, strtoxxx()
functions store the value as ** in their 2nd argument. Okay, all I have
to do is to equate the double dereference the value and equate it to
null character.

Also, you stated that the number should be "greater than zero."
The code doesn't check for this.

If its zero then some error has occured and perror() will execute.


printf("ID = %s, num = %d\n", arrdest, arrnum);
If all you want to do is print the ID, you don't need arrdest[]
at all:


No, I need a lot more to do with ID, just wait for that :)

Alternatively, since you've copied the string to the modifiable array
arr_used[], you could zero-terminate it there:

*p = '\0';
printf ("%ID = %s, num = %d\n", arr_used, arrnum);

Did not get what you mean here.


General: You're just assuming that arrdest[] and arr_used[]
will be big enough for the strings you copy into them. This is true of
the strings you have shown, but it's a *very* bad habit to form. You
may have heard the phrases "buffer overflow" or "buffer overrun" or
"stack-smashing for fun and profit;" the style you're using is exactly
the style that's vulnerable to all of these.


Every program has some assumptions and in my program I know what size of
input will be (its size/values are filtered first from somewhere else
only then will be given to me at my workplace).

The style/habit you are talking about is I understand very well. What
other choice I have ? There is never ever any guarantee on the size of
the data being input to the program, so one should use malloc() all the
time ? Even malloc() depends on RAM and when RAM goes out then what ? All
I am asking is if my habit is wrong then whats the new habit I need to
replace it with ?
 
S

Stefan Ram

arnuld said:
I wrote this program (parser ?) to extract two values from a string
separated by a comma.

I don't see a grammar in your post. Obviously, one cannot
write a parser without a grammar. Well, at least I cannot
do this.
/* A program that searches for 2 comma separated values inside a string:
name and number. Name can be anything while number can only be
* an integer greater than zero.

Does »Name can be anything« mean that it can be, for
example, "2,2,\0,2\n,2,,,22\01000"?

Could you write an EBNF grammar for the strings that the
parser should accept, preferably one that yields at most
one possible way to parse such a string?
 
A

arnuld

Who you gonna call?

Call for initializing the array ? (whether initialized or not, no one can
call anyone, you have to still whole night in office and find out
yourself)
 
A

arnuld

..SNIP...
Does »Name can be anything« mean that it can be, for example,
"2,2,\0,2\n,2,,,22\01000"?
Right.


Could you write an EBNF grammar for the strings that the parser should
accept, preferably one that yields at most one possible way to parse
such a string?

Okay, I accept that my definition of parser (or my thinking about what a
parser is) was wrong.
 
A

arnuld

A little more checking would be good here. For example, the
code as it stands would find nothing wrong with "Arnuld,33XYZ" or with
"Arnuld,99,Bertram,44". One very useful thing the strtoxxx() functions
provide is a pointer to the first character they did not convert, that
is, a pointer to the tail that follows the converted number. Checking
whether the tail is empty would tell you whether strtoxxx() had
converted the entire string, or whether it found something indigestible
at the end.

H&S 5, section 16.4

long strtol(const char * restrict str, char ** restrict ptr, int base);

For all of these functions, str points to the string to be converted, and
ptr(if not null)designates a char ** that is set by the functions to
point to the first character in str immediately following the converted
part of the string. If ptr is null, then it is ignored.


Don't know but here the pointer is never set by strtol() as mentioned:



/* A program to understand how strtol() works
*
* VERSION 0.0
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>


int main(void)
{
int arrnum = -1;
char** p = NULL;
const char* arr0 = "2,Arnuld";

errno = 0;
arrnum = (int) strtol(arr0, p, 10);
if(errno)
{
perror("***ERROR***, can not convert string into an integer\n");
}

printf("p = %p\n", (void*)p);
printf("num = %d\n", arrnum);


return 0;
}

==================== OUTPUT [arnuld@dune programs]$ gcc -std=c99 -
pedantic -Wall -Wextra strtol.c
[arnuld@dune programs]$ ./a.out
p = (nil)
num = 2
[arnuld@dune programs]$
===========================
 
D

David Resnick

Call for initializing the array ? (whether initialized or not, no one can
call anyone, you have to still whole  night in office and find out
yourself)

--www.lispmachine.wordpress.com

The answer he was looking for was "Ghostbusters", a reference to the
lyrics in that movie in case you haven't seen it...
 
M

Mark Bluemel

H&S 5, section 16.4

long strtol(const char * restrict str, char ** restrict ptr, int base);

For all of these functions, str points to the string to be converted, and
ptr(if not null)designates a char ** that is set by the functions to
point to the first character in str immediately following the converted
part of the string. If ptr is null, then it is ignored.


Don't know but here the pointer is never set by strtol() as mentioned:

I don't think that is well expressed, especially with respect to the use
of "ptr" - ptr cannot be in any useful sense set to point to a
character. *ptr can...

The man pages on my system express it better :-

long int strtol(const char *nptr, char **endptr, int base);
....
If endptr is not NULL, strtol() stores the address of the first invalid
character in *endptr. If there were no digits at all, strtol()
stores
the original value of nptr in *endptr (and returns 0). In
particular,
if *nptr is not ’\0’ but **endptr is ’\0’ on return, the entire
string
is valid.


So endptr is the address of a char * pointer, into which strtol will
store data.
/* A program to understand how strtol() works
*
* VERSION 0.0
*/

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<errno.h>


int main(void)
{
int arrnum = -1;
char** p = NULL;

try "char *p = NULL;" here
const char* arr0 = "2,Arnuld";

errno = 0;
arrnum = (int) strtol(arr0, p, 10);

try "arrnum = (int) strtol(arr0, &p, 10);" here.
if(errno)
{
perror("***ERROR***, can not convert string into an integer\n");
}

printf("p = %p\n", (void*)p);

You might also find it instructive to have

if (p != NULL) {
printf("*p = %c\n", *p);
}
 
E

Eric Sosman

What's this for?

Eh.. leftover from VERSION 0.0 :-\

int main(void)
{
char arrdest[100+1] = {0};
int arrnum = -1;

Pointless initialization.

I never ever leave an array uninitialized because it may cause strange
and paranormal events to happen in program.

I bet you scatter salt whenever you sneeze, and avoid looking
at the full Moon over your left shoulder lest it strike you dead.

Okay, that's exaggeration, but the point remains: Superstition
is a poor foundation for coding practices. If you do something
"because it works" without understanding *why* it works, then you
don't know when or why it might cease to work. "Abracadabra," you
say to a piece of code you've struggle with too long, and lo! it
suddenly works! Unfortunately, this doesn't mean the "abracadabra"
was the reason; the real cure was your change from "<" to "<=" in
a few strategic places, and "abracadabra" merely distracts you from
noticing it. So next time you have a problem you shout "abracadabra"
with great confidence -- and nothing (good) happens. Find a better way.
Is it wrong to initialized any variable or pointer etc. ?

Not "wrong," just "pointless." As I said.
I should have been clear. Input array is being used at 3 or more many
places below this strcpy() code. So rather than changing it 3 times for
every single change in input array name (8 input arrays for error
checking), I created this new array name and all I have to do is just one
change in name rather than 3. It serves the purpose.

You've heard of pointers, it seems, but you haven't yet grasped
what they can do for you.
The input will never have more than 4 digit number and hence will always
be an int.

"The input will always be" is surely the seed of more (and worse)
bugs any other I have heard.
More important thing is I have a whole lot of code here in my
company which uses this arrnum value and at all those places it is an
int, hence can't help. I myself, do not like conversions like this.

Fine. Then take it as a `long', range-check against INT_MAX and
INT_MIN (or zero), and *then* convert to `int'.
Humm.. just checked section 16.4 of H&S 5, you are right, strtoxxx()
functions store the value as ** in their 2nd argument. Okay, all I have
to do is to equate the double dereference the value and equate it to
null character.



If its zero then some error has occured and perror() will execute.

No. If the input is "Arnuld,0" or "Arnuld,-42" you will wind up
with (1) no error at all and (2) a number not greater than zero.
General: You're just assuming that arrdest[] and arr_used[]
will be big enough for the strings you copy into them. This is true of
the strings you have shown, but it's a *very* bad habit to form. You
may have heard the phrases "buffer overflow" or "buffer overrun" or
"stack-smashing for fun and profit;" the style you're using is exactly
the style that's vulnerable to all of these.

Every program has some assumptions and in my program I know what size of
input will be (its size/values are filtered first from somewhere else
only then will be given to me at my workplace).

Even if every program has "some" assumptions, it does not follow
that every program should have "risky" assumptions. Size matters.
The style/habit you are talking about is I understand very well. What
other choice I have ? There is never ever any guarantee on the size of
the data being input to the program, so one should use malloc() all the
time ? Even malloc() depends on RAM and when RAM goes out then what ? All
time ? Even malloc() depends on RAM and when RAM goes out then what ?

If malloc() fails, you can at least do a controlled exit. Most
people who have been victimized by buffer overruns -- people who have
had keyloggers and worse planted on their systems because some lazy
coder just assumed that "the input will always be" -- those people
will come after you with torches and pitchforks. A nice "Out of
memory, shutting down" is *vastly* better than a buffer overrun and
a nice dose of SQL Slammer.
I am asking is if my habit is wrong then whats the new habit I need to
replace it with ?

"Don't trust the input." There are refinements and elaborations,
but I think this is the kernel of most of them. You may know that the
input file is formatted in thus-and-such a way, because you prepared
the file yourself. Yet your input routines should *still* check those
assumptions, because a simple command-line typo in starting the program
may feed it the wrong input file. "Oh, I meant `data.csv' when I typed
`data.raw', oh, well, too bad ..."

Truth be told, there are limits even to paranoia. But you are
not on the paranoiac side; you are on the Pollyanna side. Cross over.
 
E

Eric Sosman

arnuld wrote:
[...]
char arrdest[100+1] = {0};
int arrnum = -1;

Pointless initialization.

I never ever leave an array uninitialized because it may cause strange
and paranormal events to happen in program.

Oh, and by the way: My remark referred to the immediately
preceding line, not to its antecedent. Your use of arrays is
clumsy and error-prone, but that's not what the remark was about.
 
A

arnuld

Not "wrong," just "pointless." As I said.

If you don't initialize a variable then I see the only other way is to
make sure you assign some value to the variable before you use it. Is
that right way ?


You've heard of pointers, it seems, but you haven't yet grasped
what they can do for you.

Right, I still don't know anything much about pointers and know ZERO
percent about function pointers (never used them).

BTW, How can the pointers are helpful in passing an array 3 times here in
my code ? I don't see any way except using an extra array and using strcpy
().



Fine. Then take it as a `long', range-check against INT_MAX and
INT_MIN (or zero), and *then* convert to `int'.

Fine, will do that.



No. If the input is "Arnuld,0" or "Arnuld,-42" you will wind up
with (1) no error at all and (2) a number not greater than zero.

Its fine with that, any input with value zero or less will be discarded.
I don't see anything wrong here (except that program is no way portable
to other place with this kind of assumption and I *must* write this in
comments with big WARNING word)


"Don't trust the input." There are refinements and elaborations,
but I think this is the kernel of most of them. You may know that the
input file is formatted in thus-and-such a way, because you prepared the
file yourself. Yet your input routines should *still* check those
assumptions, because a simple command-line typo in starting the program
may feed it the wrong input file. "Oh, I meant `data.csv' when I typed
`data.raw', oh, well, too bad ..."

I agree with that and am accepting it into my style of coding. Still, if
an array is given as a function parameter and you don't know if its null
terminated or not or if its full of null characters (initialized with
{0}). If its not null terminated and you don't know its size then I don't
see any way to check it. Do you ?
 
A

arnuld

Oh, and by the way: My remark referred to the immediately
preceding line, not to its antecedent. Your use of arrays is clumsy and
error-prone, but that's not what the remark was about.

Thanks for clarification else I kept on thinking you were talking about
initializing the array. My use of arrays is clumsy ... hummm....... I
used arrays like this only:

(1) initialize
(2) pass it to function to do the job
(3) print it

So what is clumsy here ?
 
A

arnuld

I don't think that is well expressed, especially with respect to the use
of "ptr" - ptr cannot be in any useful sense set to point to a
character. *ptr can...

man page on my system (CentOS 5.5) is much more limited than even H&S :
( . This explains what your systems explains:

http://www.manpagez.com/man/3/strtol/



try "char *p = NULL;" here
..SNIP...
try "arrnum = (int) strtol(arr0, &p, 10);" here.

This has always confused me. When argument has to be a double pointer.
When char* p then passing &p works but when char** p then passing p does
not work (both are double pointers)
 
A

arnuld

...SNIP...
try "char *p = NULL;" here
..SNIP...
try "arrnum = (int) strtol(arr0, &p, 10);" here.


/* A program to understand how strtol() works
*
* VERSION 0.1
*/

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <limits.h>


int main(void)
{
long arrnum;
char* p;
const char* arr0 = "2"; /* It has integer inside */
const char separator = ',';

errno = 0;
arrnum = strtol(arr0, &p, 10);
if((ERANGE == errno && (LONG_MAX == arrnum || LONG_MIN == arrnum)) ||
(0 != errno && 0 == arrnum))
{
perror("***ERROR***, strtol() can not convert string into an integer
\n");
}
else if((0 == errno) && (NULL != p && separator == *p))
{
printf("Successful Conversion by strtol()\n");
}
else
{
printf("Something strane happened inside the program\n");
}


if((arrnum >= INT_MIN) && (arrnum <= INT_MAX))
{
printf("Integer range check successful\n");
printf("num = %d\n", (int)arrnum);
}
else
{
printf("Integer range check unsuccessful\n");
printf("num = %d\n", (int)arrnum);
}


return 0;
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top