sscanf parsing doubt

Simone Mehta

Hi all,
I am parsing a CSV file.
I want to read every row into a char array of reasonable size and then
extract strings from it.
<snippet>
char foo[128]="hello,world,bye,bye,world";
.....
sscanf(foo,"%s%*[,]%s%*[,]%s%*[,]%s%*[,]%s",s1,s2,s3,s4,s5);
<snippet/>
This is giving me junk.
I understand it is not finding a '\0' to end the (%s) strings,
but then I cannot use %c either.
I think I could use something like "%64c%*[,]%64c".
Please enlighten me as to the algorithm to use here. Am I doing it the
right way?

Thanks In Advance,
Simone Mehta.
 
Michael Mair

Hi,

> <snippet>
> char foo[128]="hello,world,bye,bye,world";
> ....
> sscanf(foo,"%s%*[,]%s%*[,]%s%*[,]%s%*[,]%s",s1,s2,s3,s4,s5);
> <snippet/>
> This is giving me junk .
> I understand it is not finding '\0' to scan (%s) strings.

Nope. It gives you junk because %s spans from white space to
white space. Commas are not white space, so s1 gets it all.

Check the return value of sscanf(); it tells you how many
input items were actually read.

Use a scanset: for example, you can scan for "%[^, \t]",
which stops at the first comma, blank or tab.
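
The difference between %s and a scanset can be sketched as follows. The helper names len_with_s and len_with_scanset and the 128-byte buffers are assumptions made for illustration, not part of the thread.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length of what "%s" stores from line.
 * %s only stops at white space, so commas are copied too. */
size_t len_with_s(const char *line)
{
    char buf[128] = "";
    sscanf(line, "%127s", buf);
    return strlen(buf);
}

/* Hypothetical helper: length of what the scanset stores from line.
 * %[^, \t] stops at the first comma, blank or tab. */
size_t len_with_scanset(const char *line)
{
    char buf[128] = "";
    sscanf(line, "%127[^, \t]", buf);
    return strlen(buf);
}
```

For the line "hello,world,bye" the first helper reports the whole line, while the second stops after "hello".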

> but then I cannot use %c also .
> I think i can use like "%64c%*[,]%64c" .

No. The c conversion specifier will not give you strings
but character arrays which can be nasty to handle.
Apart from that, the problem of the comma being gobbled
by %64c still persists.


Apart from that, using a field width for reading in the
strings to be stored in s1 through s5 is a Good Idea.
If a string before the last item is too long, the return value
of sscanf will tell you. For the last item, look up
Pop's Device here in the newsgroup to see how to get
rid of the rest of the line.
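
How the return value exposes an overlong item can be seen in a small sketch. The helper name scan_two and the deliberately tiny field width of 3 are assumptions chosen to keep the example short.

```c
#include <stdio.h>

/* Hypothetical helper: read two comma-separated items of at most
 * 3 characters each into a and b (at least 4 bytes each).
 * Returns the sscanf() result, i.e. the number of items stored. */
int scan_two(const char *line, char *a, char *b)
{
    /* The field width 3 leaves room for the terminating '\0'. */
    return sscanf(line, "%3[^,],%3[^,]", a, b);
}
```

With "ab,cd" this returns 2. With "hello,world" the first item stops after "hel", the literal comma in the format fails to match the next input character, and the result is 1. With "ab,world" the result is still 2 and b quietly holds "wor", which is why the last item needs separate care.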


Cheers
Michael


#include <stdio.h>
#include <stdlib.h>

/* Maximum number of characters per item, not counting the '\0' */
#define MAXITEMLEN 32

#define STRINGIZE(s) # s
#define XSTR(s) STRINGIZE(s)

#define DONTSCAN ", \t"
#define ITEMFORMAT "[^" DONTSCAN "]"
#define MAXITEMFORMAT XSTR(MAXITEMLEN) ITEMFORMAT

#define ONEITEM "%" MAXITEMFORMAT
#define SEP "%*[" DONTSCAN "]"

int main (void)
{
    char foo[128] = "hello,world, bye ,\tbye\t,world";
    /* +1: the field width does not include the terminating '\0' */
    char s0[MAXITEMLEN+1], s1[MAXITEMLEN+1], s2[MAXITEMLEN+1];
    char s3[MAXITEMLEN+1], s4[MAXITEMLEN+1];
    int rv;

    rv = sscanf(foo, " " ONEITEM SEP ONEITEM SEP ONEITEM SEP
                ONEITEM SEP ONEITEM, s0, s1, s2, s3, s4);

    /* Deliberate fall-through: print every item that was read */
    switch (rv) {
    case 5:
        fprintf(stdout,"s4: %s\n",s4);
    case 4:
        fprintf(stdout,"s3: %s\n",s3);
    case 3:
        fprintf(stdout,"s2: %s\n",s2);
    case 2:
        fprintf(stdout,"s1: %s\n",s1);
    case 1:
        fprintf(stdout,"s0: %s\n",s0);
    default:
        if (rv != 5) {
            fprintf(stderr, "Did not get all items!\n");
            exit(EXIT_FAILURE);
        }
    }

    return 0;
}
 
pete

Simone said:
hi All,
I am parsing a CSV file.
I want to read every row into a char array of reasonable size and then
extract strings from it.
<snippet>
char foo[128]="hello,world,bye,bye,world";
....
sscanf(foo,"%s%*[,]%s%*[,]%s%*[,]%s%*[,]%s",s1,s2,s3,s4,s5);
<snippet/>
This is giving me junk .
I understand it is not finding '\0' to scan (%s) strings.
but then I cannot use %c also .
I think i can use like "%64c%*[,]%64c" .
Please enlighten me as to the algo to be used here . Am i doing it the
right way ?

I think the simplest way is to read whole lines from the file
into strings, and then to process the strings in memory.

/* BEGIN output from new.c */

helloworldbyebyeworld

/* END output from new.c */



/* BEGIN new.c */

#include <stdio.h>
#include <string.h>

int main(void)
{
    char foo[128] = "hello,world,bye,bye,world";
    char *pointer;

    for (pointer = foo; *pointer != '\0'; ++pointer) {
        if (*pointer == ',') {
            memmove(pointer, pointer + 1, strlen(pointer));
        }
    }
    puts("\n/* BEGIN output from new.c */\n");
    puts(foo);
    puts("\n/* END output from new.c */");
    return 0;
}

/* END new.c */
 
Michael Mair

Hi pete,


it seems to me that you misunderstood the OP's question.
Note: The OP is already doing things line by line and
wants to set s1 through s5.
[snip! code <snippet> and questions to that]

> I think the simplest way is to read whole lines from the file
> into strings, and then to process the strings in memory.

Which is what the OP does, if I understood him/her correctly.

I would suggest the following modification:
> #include <stdio.h>
> #include <string.h>
#include <assert.h>

#define MAXNUMENTRIES 5
> int main(void)
> {
>     char foo[128] = "hello,world,bye,bye,world";
      char *pointer, *s[MAXNUMENTRIES+1];
      size_t i=0;

      s[i++] = foo;
>     for (pointer = foo; *pointer != '\0'; ++pointer) {
>         if (*pointer == ',') {
              *pointer = '\0';
              s[i++] = pointer+1;
              assert(i<=MAXNUMENTRIES);
>         }
>     }
      s[i] = NULL; /* Signify end of valid entries */
>     puts("\n/* BEGIN output from new.c */\n");
      for (i=0; s[i] != NULL; i++)
          puts(s[i]);
>     puts("\n/* END output from new.c */");
>     return 0;
> }

I did not test it, though; just wanted to make clear
how to do it :)


Cheers
Michael
 
Simone Mehta

Hi pete, Michael,
thanks for the useful replies.

Michael Mair said:
> it seems to me that you misunderstood the OP's question:

You are right, Michael, I want to scan line by line.

> I would suggest the following modification:
>
> #include <assert.h>
>
> #define MAXNUMENTRIES 5

I am able to get the same result using your program, Michael,
but the reason I need to go for sscanf is this:
CSV files have strings with quotes also,
like "hello",world,"foo",FSM,"comp,lang,c"
That being the case, I will have to maintain a small FSM to deal with
quotes, which can make things difficult.
So I wanted to train sscanf to identify strings with or without
quotes.
But sscanf seems to have a really bad man page, or maybe I am not able to
understand much from it.
In the above case I would be interested in
s1=hello
s2=world
s3=foo
s4=FSM
s5=comp,lang,c

If anyone has sscanf URLs/bookmarks that explain a little more, that would
be a great help. Google has helped me a lot, but not much on this one
though...

TIA,
Simone Mehta
 
Michael Mair

Hi Simone,
[Modified code, original code from pete]
> I am able to get the same using your program michael.
> but need to go for sscanf is that .
> csv files have strings with quotes also.
> like "hello",world,"foo",FSM,"comp,lang,c"
> so this being the case. I will have to maintain a small FSM when it
> comes to quote which can make things difficult.
> So i wanted to train sscanf to identify quotes or strings without
> them.

Hmmm, considering that, I would advise you to abandon sscanf
as a solution for the whole line -- you just cannot get that
in readable code. So, sscanf essentially will give you more
of a headache than it gains in (seeming) shortness and
conciseness.

> but sscanf seems to have a real bad man page or maybe I am not able to
> understand much from it. .....
> any sscanf URLs/bookmarks any one has, explaining a little more would
> be a great help. google has helped me a lot but not much on this one
> though...

Well, it is not very good, but the man pages at dinkumware.com
( http://www.dinkumware.com/refxc.html ) about formatted I/O may
help you a little bit more. Apart from that: Many people are
requesting scanf-format help around here, so maybe a google-search
through comp.lang.c archives can give you a better understanding
of what is happening.

> I would in the above case be interested in
> s1=hello
> s2=world
> s3=foo
> s4=FSM
> s5=comp,lang,c

If you know _beforehand_ in which places to expect quotation marks,
you can easily adjust the format in my example.
Otherwise, I would just go through the string in the way pete
has shown. If you encounter a '"' as the first character after
a comma (and zero or more white spaces), just search for the next '"'
instead of a terminating ',' and, after finding it, throw away
everything up to the next ','...
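
A minimal sketch of that walk through the string, under some assumptions: the function name split_csv and its interface are invented here, the line is modified in place, and doubled quotes inside a quoted field are not handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split line in place into at most max fields and
 * store pointers to them in fields. A field whose first non-blank
 * character is '"' runs to the next '"'; anything between that closing
 * quote and the next comma is thrown away.
 * Returns the number of fields stored. */
size_t split_csv(char *line, char **fields, size_t max)
{
    size_t n = 0;
    char *p = line;

    while (n < max) {
        char *end;

        while (*p == ' ' || *p == '\t')     /* skip leading blanks */
            ++p;
        if (*p == '"') {
            end = strchr(p + 1, '"');       /* look for the closing quote */
            fields[n++] = p + 1;
            if (end == NULL)                /* unterminated: keep the rest */
                break;
            *end = '\0';
            p = strchr(end + 1, ',');       /* discard up to the next comma */
            if (p == NULL)
                break;
            ++p;
        } else {
            end = strchr(p, ',');           /* plain field ends at a comma */
            fields[n++] = p;
            if (end == NULL)
                break;
            *end = '\0';
            p = end + 1;
        }
    }
    return n;
}
```

Applied to the line "hello",world,"foo",FSM,"comp,lang,c" with max set to 5, this yields the five fields hello, world, foo, FSM and comp,lang,c.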


Cheers
Michael
 
Dag Viken

Simone Mehta said:
hi All,
I am parsing a CSV file.
I want to read every row into a char array of reasonable size and then
extract strings from it.
<snippet>
char foo[128]="hello,world,bye,bye,world";
....
sscanf(foo,"%s%*[,]%s%*[,]%s%*[,]%s%*[,]%s",s1,s2,s3,s4,s5);
<snippet/>
This is giving me junk .
I understand it is not finding '\0' to scan (%s) strings.
but then I cannot use %c also .
I think i can use like "%64c%*[,]%64c" .
Please enlighten me as to the algo to be used here . Am i doing it the
right way ?

Thanks In Advance,
Simone Mehta.

You could use
sscanf(foo, "%[^,],%[^,],%[^,],%[^,],%[^,]", s1, s2, s3, s4, s5);
where s1,s2,s3,s4,s5 all point to string buffers;

You could also try this:

char foo[128] = "hello,world,bye,bye,world";
char* sep = ",";
char* str;
int n;
for (n=0, str=strtok(foo,sep); n++, str!=NULL; str=strtok(NULL,sep))
    printf("%d: %s\n", n, str);

which gives me the output:
1: hello
2: world
3: bye
4: bye
5: world

Note that strtok will replace the commas in foo with null characters. Also,
avoid strtok in multi-threaded applications since it uses static data to
preserve context.
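
One more caveat worth knowing: strtok treats a run of adjacent delimiters as a single separator, so the empty middle field in "a,,b" is skipped. A sketch of an alternative built on strcspn that keeps empty fields and needs no static state; the name split_keep_empty and its interface are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split line in place at every comma, keeping
 * empty fields. Stores at most max pointers in fields and returns
 * the number stored. */
size_t split_keep_empty(char *line, char **fields, size_t max)
{
    size_t n = 0;
    char *p = line;

    while (n < max) {
        size_t len = strcspn(p, ",");   /* length of the current field */

        fields[n++] = p;
        if (p[len] == '\0')             /* last field on the line */
            break;
        p[len] = '\0';                  /* overwrite the comma */
        p += len + 1;
    }
    return n;
}
```

For "a,,b" this returns 3, and the second field is the empty string.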

Dag
 
Dan Pop

Simone Mehta said:
> I am parsing a CSV file.
> I want to read every row into a char array of reasonable size and then
> extract strings from it.
> <snippet>
> char foo[128]="hello,world,bye,bye,world";
> ....
> sscanf(foo,"%s%*[,]%s%*[,]%s%*[,]%s%*[,]%s",s1,s2,s3,s4,s5);
> <snippet/>
> This is giving me junk .

What else can you expect from your brain dead sscanf call?

> I understand it is not finding '\0' to scan (%s) strings.

You appear to be completely clueless about how %s works.

> but then I cannot use %c also .

%c is useful only when you know in advance how many characters you want
to read. And it doesn't store its output as a properly terminated string.

> I think i can use like "%64c%*[,]%64c" .

%64c is hardly any better than %s. I'd say it's actually worse...

> Please enlighten me as to the algo to be used here . Am i doing it the
> right way ?

Nope. Which is to be expected, since you have obviously not bothered to
*carefully* read the specification of the sscanf function. The first rule
of programming: if you don't know what you're doing, don't do it at all.

A %s directive starts by skipping white space (if any) and then it
consumes everything until a white space character or the null character
terminating the input string is encountered. Your string has no white
space characters, so the first %s will store the whole string in s1.
So, %s is useless for your purpose. The right solution is:

rc = sscanf(foo, "%[^,],%[^,],%[^,],%[^,],%[^\n]", s1, s2, s3, s4, s5);

The last conversion specification can be %s if your fields cannot contain
white space. No need for %*[,] unless you want to skip multiple commas,
which doesn't make much sense (no point in skipping multiple commas if
you don't know their exact position inside the input string).

Always check the value of rc, instead of blindly assuming that all 5
fields were properly extracted from the input string.
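
A sketch of that call with field widths added and the result handed back to the caller. The name split5, the macros NFIELDS and FIELDLEN, and the 32-byte field size are assumptions made for this example.

```c
#include <stdio.h>

#define NFIELDS  5
#define FIELDLEN 32   /* assumed size: 31 characters plus the '\0' */

/* Hypothetical helper: split line into five comma-separated fields.
 * Returns the sscanf() result: 5 on success, fewer if fields are
 * missing or too long, EOF if the input is empty. */
int split5(const char *line, char f[NFIELDS][FIELDLEN])
{
    return sscanf(line, "%31[^,],%31[^,],%31[^,],%31[^,],%31[^\n]",
                  f[0], f[1], f[2], f[3], f[4]);
}
```

A caller would compare the result against NFIELDS before touching any of the fields; for the input "a,b" the result is 2 and only the first two fields are valid.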

Trivia quiz: why did I use %[^\n] for the last conversion?

Dan
 
Chris Torek

> csv files have strings with quotes also.
> like "hello",world,"foo",FSM,"comp,lang,c"
> so this being the case. I will have to maintain a small FSM when it
> comes to quote
> which can make things difficult.
> So i wanted to train sscanf to identify quotes or strings without
> them. ...

The scanf engine is less powerful than regular expressions, and
in this case, is not powerful enough to do what you want.

Note that even regular expressions -- which *can* match quotes,
at least in some RE systems -- cannot handle more-general parsing
tasks, such as matching parentheses. But clearly the scanf engine,
which does only literal matches without alternation, is not enough
by itself to handle both quoted and unquoted strings. The closest
you can get is a sort of "manual alternation" scheme:

    while (there is more to scan) {
        if (this item begins with a double quote) {
            run scanf engine on RE-subset "[^"]+", e.g.:

                ret = sscanf(&buf[offset], "\"%79[^\"]%c%n",
                    dequoted_string, &doublequote_char, &more_offset);
                if (ret != 2) ... handle error ...

            now doublequote_char is " and more_offset says how many
            characters were scanned. Note that this assumes the
            dequoted_string[] array has size 80 or more (%79 above).
        } else {
            run scanf engine on RE-subset [^,]+
        }
    }

This is still not good enough for "real" CSV files, which allow
quoting the quote marks (in various ways).
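
The manual alternation outlined above might look roughly like this as compilable C. The function name next_item and its return convention are assumptions; empty fields are reported as errors because a scanset must match at least one character, and doubled quotes are still not handled.

```c
#include <stdio.h>

/* Hypothetical helper: read one item starting at buf into item, which
 * must hold at least 80 bytes. Returns the number of characters
 * consumed, including one trailing comma if present, or -1 on error. */
int next_item(const char *buf, char *item)
{
    int used = 0;
    char close_char = '\0';

    if (buf[0] == '"') {
        /* Quoted item: match the opening quote, take everything up to
         * the next quote, then read the closing quote with %c. */
        if (sscanf(buf, "\"%79[^\"]%c%n", item, &close_char, &used) != 2)
            return -1;
        if (close_char != '"')          /* item was longer than 79 chars */
            return -1;
    } else {
        /* Unquoted item: take everything up to the next comma. */
        if (sscanf(buf, "%79[^,]%n", item, &used) != 1)
            return -1;
    }
    if (buf[used] == ',')               /* step over the separator */
        ++used;
    return used;
}
```

A caller would keep an offset into the line, call next_item(&line[offset], item) in a loop, and add each non-negative result to the offset until the end of the string is reached.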

I recommend writing a real (but ad-hoc) lexer (or finding one, e.g.,
via google search, and adapting it if needed).
 
Ravi Uday

Dan Pop said:
> Trivia quiz: why did I use %[^\n] for the last conversion?

Does it serve any purpose? sscanf would stop anyway when it
encounters the '\0', which is present in the OP's code.
 
Dan Pop

Ravi Uday said:
>> Trivia quiz: why did I use %[^\n] for the last conversion?
> Does it serve any purpose ? Because sscanf would terminate anyways if it
> encounters '\0' which in the OP
> code is present.

Try broadening your horizon, beyond the artificial example of the OP.
In real programs, where do such strings come from?
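
Presumably the hint points at lines read from a file: a line obtained with fgets usually still ends in '\n'. A small sketch of the effect on the last field; the helper names and the two-field format are assumptions chosen for brevity.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length of the second field when the last
 * conversion is %[^,]. A trailing '\n' is not a comma, so it is
 * stored as part of the field. */
size_t last_len_comma_set(const char *line)
{
    char a[32] = "", b[32] = "";
    sscanf(line, "%31[^,],%31[^,]", a, b);
    return strlen(b);
}

/* Hypothetical helper: same, but the last conversion is %[^\n],
 * which stops in front of the newline. */
size_t last_len_newline_set(const char *line)
{
    char a[32] = "", b[32] = "";
    sscanf(line, "%31[^,],%31[^\n]", a, b);
    return strlen(b);
}
```

For the line "hello,world\n" the first helper counts six characters because the newline is kept, while the second counts the five characters of "world".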

Dan
 
Aslam Sheikh Durrani

Dan Pop said:
I am parsing a CSV file.
I want to read every row into a char array of reasonable size and then
extract strings from it.
<snippet>
char foo[128]="hello,world,bye,bye,world";
....
sscanf(foo,"%s%*[,]%s%*[,]%s%*[,]%s%*[,]%s",s1,s2,s3,s4,s5);
<snippet/>
This is giving me junk .

What else can you expect from your brain dead sscanf call?
I understand it is not finding '\0' to scan (%s) strings.

You appear to be completely clueless about how %s works.
It appears you are in complete control of the situation; then pray give
the right answer and stop bullying the OP.
but then I cannot use %c also .

%c is useful only when you know in advance how many characters you want
to read. And it doesn't store its output as a properly terminated string.
I think i can use like "%64c%*[,]%64c" .

%64c is hardly any better than %s. I'd say it's actually worse...
Please enlighten me as to the algo to be used here . Am i doing it the
right way ?

Nope. Which is to be expected, since you have obviously not bothered to
*carefully* read the specification of the sscanf function. The first rule
of programming: if you don't know what you're doing, don't do it at all.
The OP has some confusion; that is why she has turned to the list.
Don't scare her. I am sure she must have tried the circumflex with
little success.
A %s directive starts by skipping white space (if any) and then it
consumes everything until a white space character or the null character
terminating the input string are encountered. Your string has no white
space characters, so the first %s will store the whole string in s1.
So, %s is useless for your purpose. The right solution is:

rc = sscanf(foo, "%[^,],%[^,],%[^,],%[^,],%[^\n]", s1, s2, s3, s4, s5);

The last conversion specification can be %s if your fields cannot contain
white space. No need for %*[,] unless you want to skip multiple commas,
which doesn't make much sense (no point in skipping multiple commas if
you don't know their exact position inside the input string).

Always check the value of rc, instead of blindly assuming that all 5
fields were properly extracted from the input string.
Please stop thinking people will paste complete code here. Some code
is always left out for clarity.
Trivia quiz: why did I use %[^\n] for the last conversion?

Dan

Your signature says you are looking for a job...
Such arrogance from you can only prolong the search.
 
