(f)scanf Question - Grab String of Spaces

NvrBst

I have a file full of data that I want to tokenize. My function works
as long as the data I want to grab isn't padded with whitespace;
however, I want to preserve that padding. Can I modify the fscanf
call to include it in the match?


---Example File---
MyKey1: INT, 3341, 1
MyKey2: STRING, Hello World, 1
MyKey3: STRING,     , 1

--Format is Like so "KEYWORD: TYPE, Data1, Data2"---
fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);

When it gets to "MyKey3" it fails to match P3 and thus returns 2
(two elements converted). I want P3 to be "    ". Shouldn't "%32[^,]"
match anything but ",", i.e. spaces as well? Is there a way around
this, or a different way I should be tokenizing such data?

Note: P1/P2/P3 are just "char[32+1]"'s. P4 is an int.

Thanks in advance; I'm using GNU GCC 4.3.2 on an Ubuntu machine with
the latest Eclipse CDT.
 
NvrBst

NvrBst said:
I have a file full of data that I want to tokenize.  My function works
as long as the data I want to grab doesn't have padded whitespaces,
however, I want to preserve the padded whitespaces. Can I modify
fscanf to include them in the match?
---Example File---
MyKey1: INT, 3341, 1
MyKey2: STRING, Hello World, 1
MyKey3: STRING,     , 1
--Format is Like so "KEYWORD: TYPE, Data1, Data2"---
fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);
When it gets to "MyKey3" it fails to match P3 thus returns 2
elements.  I want P3 to be "    ".  Shouldn't "%32[^,]" be matching
anything but ",", aka spaces as well?  A way around this?  Different
way I should be tokenizing such data?

     Yes, "%32[^,]" matches anything other than a comma, including
spaces.  But the spaces have already been swallowed by the " "
you put right before it.  If you want to preserve the spaces, don't
write a " " directive to gobble them up.  (If you want to gobble
exactly one space, try "%*1[ ]" instead.)

Ahh I didn't know " " gobbles more than one :) The %*1[ ] made
everything work perfectly. Thank you kindly
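
For readers following along, here is a minimal sketch of the fix described above. It assumes each line has already been read into a string; the helper name parse_line is made up for illustration, and the only change to the original format is that the " " before the third field is replaced by "%*1[ ]".

```c
#include <stdio.h>

/* Hypothetical helper: parses one "KEYWORD: TYPE, Data1, Data2" line.
   Returns the number of fields converted (4 on success). */
int parse_line(const char *line, char key[33], char type[33],
               char data[33], int *count)
{
    /* "%*1[ ]" reads and discards exactly one space after the comma,
       so any further spaces are left for "%32[^,]" to capture.
       A suppressed (*) conversion is not counted in the return value. */
    return sscanf(line, "%32[^:]: %32[^,],%*1[ ]%32[^,], %d",
                  key, type, data, count);
}
```

Note that a line whose data field consists of a single space (such as "MyKey3: STRING, , 1") still stops after two conversions, because "%*1[ ]" consumes that space and "%32[^,]" must match at least one character.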
 
Eric Sosman

NvrBst said:
[... concerning fscanf() ...]
Ahh I didn't know " " gobbles more than one :) The %*1[ ] made
everything work perfectly. Thank you kindly

For future reference, observe that *any* kind of white
space in the format string matches *any* kind of white space
in the input stream. For example, the format " " matches
the inputs " ", " ", "\n\n", "\t \n \t \n \f", and so on.
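
A small sketch of that rule, with a made-up helper name. It relies on the fact that %c never skips whitespace on its own, so the single space in the format is the only thing that can consume the whitespace between 'a' and 'b'.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if input is 'a', then any amount of
   whitespace (possibly none), then 'b'; returns 0 otherwise. */
int matches_a_ws_b(const char *input)
{
    char c = '\0';

    /* The " " directive skips zero or more whitespace characters of
       any kind; %c then stores the next character unconditionally. */
    return sscanf(input, "a %c", &c) == 1 && c == 'b';
}
```

With this, "a b", "a\n\n\t \fb" and even "ab" all match, since a whitespace directive is also satisfied by no whitespace at all.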
 
Guest

NvrBst said:
[... original question snipped ...]

I suggest writing your own overflow-free getline() function to read a
whole line from the file, something like this:

char* getline (FILE *fp) {
    char *line = NULL;
    char ch;
    unsigned int size=0;

    while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
        line = (char*) realloc(line,++size);
        line[size-1]=ch;
    }

    line[size]=0;
    return line;
}

You can then parse the resulting line with the regex.h functions to
match what you like, using "," as the separator.
 
Keith Thompson

I can suggest you to develop a self-made and overflow-free getline()
method to get the whole line in a file, something like this:

char* getline (FILE *fp) {
char *line = NULL;
char ch;
unsigned int size=0;

while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
line = (char*) realloc(line,++size);
line[size-1]=ch;
}

line[size]=0;
return line;
}
[...]

Something like that, but not exactly like it.

fgetc() returns an int, not a char, so ch should be of type int.

feof() doesn't do quite what you seem to be assuming it does. It can
be called *after* fgetc() returns EOF, to determine whether it did so
because it reached the end of the file or because it encountered an
error. You should check whether ch is equal to EOF (which is why it
needs to be an int) *instead* of calling feof(). See section 12 of
the comp.lang.c FAQ, <http://www.c-faq.com/>.

I don't think checking for '\r' makes much sense; if the file is in
text mode, and you're on a system where end-of-line is represented as
a CR LF pair, then that sequence will be converted to '\n' anyway. If
you're on a system where end-of-line is represented as '\n', you might
see '\r' in a text file copied from another system, but adding
special-case code for it is questionable.

Calling realloc() for each character read is likely to be inefficient,
and may cause heap fragmentation on some systems. A common scheme is
to double the allocated size when you run out of room; you can then do
a final realloc() to shrink it down to what's needed. The cast is
superfluous, and can mask errors.
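
Putting those points together, one possible shape for such a function is sketched below. This is only an illustration, not code from the thread: the name read_line, the initial capacity of 64 and the choice to return NULL at end of input are all assumptions. (The name getline is avoided because POSIX.1-2008 declares a getline() with a different signature in <stdio.h>.)

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical replacement for the posted function: returns a
   malloc'd line without its trailing '\n', or NULL when nothing
   could be read or memory ran out. The caller frees the result. */
char *read_line(FILE *fp)
{
    size_t cap = 64, len = 0;      /* starting capacity is arbitrary */
    char *line = malloc(cap);
    int ch;                        /* int, so EOF is distinguishable */

    if (line == NULL)
        return NULL;

    while ((ch = fgetc(fp)) != EOF && ch != '\n') {
        if (len + 1 >= cap) {      /* keep room for the terminator */
            char *tmp = realloc(line, cap * 2);   /* double, no cast */
            if (tmp == NULL) {
                free(line);
                return NULL;
            }
            line = tmp;
            cap *= 2;
        }
        line[len++] = (char)ch;
    }

    if (ch == EOF && len == 0) {   /* end of input or error, no data */
        free(line);
        return NULL;
    }
    line[len] = '\0';
    return line;
}
```

The final shrink-to-fit realloc() is left out to keep the sketch short; a caller that keeps many lines around may want to add it.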
 
Richard Tobin

Keith Thompson said:
Calling realloc() for each character read is likely to be inefficient,
and may cause heap fragmentation on some systems. A common scheme is
to double the allocated size when you run out of room; you can then do
a final realloc() to shrink it down to what's needed.

All the realloc() implementations I've checked recently effectively do
this internally anyway. They don't have the quadratic behaviour you'd
get if they had to copy each time, or every Nth time. (This might not
be true if other *alloc() calls were interleaved with the realloc()s;
I didn't test that.)

-- Richard
 
Keith Thompson

Eric Sosman said:
Keith said:
char* getline (FILE *fp) {
char *line = NULL;
char ch;
unsigned int size=0;

while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
line = (char*) realloc(line,++size);
line[size-1]=ch;
}

line[size]=0;
return line;
}
[...]

feof() doesn't do quite what you seem to be assuming it does. It can
be called *after* fgetc() returns EOF, to determine whether it did so
because it reached the end of the file or because it encountered an
error. [...]

Keith, you're about to kick yourself :) Thanks to the
sequence points accompanying the `&&' operators, feof() *is*
being called after fgetc() is called; it only looks "predictive"
because it's at the top of the loop.

True, but calling feof() still isn't the right way to check whether
you've run out of input. If there's an error, fgetc() will return EOF
and feof() will return 0 (but ferror() will return a true value).

You're right that feof() is called after fgetc() in the posted code,
and calling both feof() and ferror() would probably make the code
work, but checking whether fgetc() returned EOF is still better.

Eric Sosman said:
By the way, feof() and ferror() can be called at any time
on an open stream; there's no need to wait until some other
I/O function indicates an abnormality.

Right -- but if you just called fgetc() and it *didn't* return EOF,
then feof() and ferror() should both return 0, and there's no point in
calling them.
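
As a sketch of the pattern being recommended here: test the value fgetc() returns inside the loop, and consult ferror() (or feof()) only once EOF has been seen, to find out why. The helper name count_chars and its return convention are made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: counts the characters remaining on fp.
   Returns the count, or -1 if reading stopped because of an error. */
long count_chars(FILE *fp)
{
    long n = 0;
    int ch;

    while ((ch = fgetc(fp)) != EOF)   /* test the return value itself */
        n++;

    /* Only now do the indicators tell us anything new: which of the
       two conditions made fgetc() return EOF. */
    if (ferror(fp))
        return -1;
    return n;
}
```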
 
Keith Thompson

Eric Sosman said:
Keith said:
Eric Sosman said:
[...]
By the way, feof() and ferror() can be called at any time
on an open stream; there's no need to wait until some other
I/O function indicates an abnormality.

Right -- but if you just called fgetc() and it *didn't* return EOF,
then feof() and ferror() should both return 0, and there's no point in
calling them.

feof() and ferror() are "sticky:" if a transient failure bollixes
one I/O operation and then a subsequent operation succeeds, the success
of the second does not clear the stream's eof or error indicator. One
situation where this arises with some frequency is in handling input
from an interactive device that allows further input after an end-of-
input indication like ^Z or ^D: One fgetc() could return EOF due to
the transient end-of-input condition, and the next fgetc() could
succeed and return an actual input character. feof() would return
true even after the second fgetc() succeeded.

I don't think that's legal behavior for a conforming implementation.
C99 7.19.7.1p3:

If the end-of-file indicator for the stream is set, or if the
stream is at end-of-file, the end-of-file indicator for the stream
is set and the fgetc function returns EOF. Otherwise, the fgetc
function returns the next character from the input stream pointed
to by stream. If a read error occurs, the error indicator for the
stream is set and the fgetc function returns EOF.

In other words, the standard doesn't allow for a "transient
end-of-input condition", though you can explicitly reset it by calling
fseek(stream, 0, SEEK_CUR).

But a quick experiment shows that at least one implementation doesn't
behave as the standard specifies; fgetc() can return something other
than EOF even when the end-of-file indicator is set.

Eric Sosman said:
Perhaps a more usual case is to "summarize" the outcome of a lot
of I/O operations, as an alternative to testing each one for failure
at the time it's attempted. For example, a program might make a large
number of fprintf() calls from a large number of places in the code,
such that testing each individual fprintf()'s return value would be
cumbersome. As an alternative, the program could simply ignore the
returned values until the very end, finishing up with something like

if (ferror(stream)) {
    /* ... something went wrong ... */
}
else if (fclose(stream) != 0) {
    /* ... something else went wrong ... */
}
else {
    /* ... all is well ... */
}

Yes, that can work (but it's not what's going on in the posted code,
which calls feof() after each fgetc()).

Eric Sosman said:
(Of course, this technique is not a panacea. If the program's
fourth fprintf() call bumps up against "disk quota exceeded," it would
be nicer to discover the problem fairly promptly than to wait until
after another four million fprintf()'s had also failed ...)

It's also possible that one fprintf() call can hit a "disk quota
exceeded" error, but the next call, either because it produces less
output or because some disk space has been freed, might be successful,
so you could get gaps in your output. If the implementation is
conforming, the error flag should prevent any further fprintf() calls
from succeeding until the flag is reset, but if the implementation
doesn't conform -- well, then there are no guarantees anyway.
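
For concreteness, the "summarize at the end" technique might look roughly like the sketch below. The function name write_squares and the output it produces are invented for the example. Unlike the if/else outline quoted above, this version closes the stream on every path.

```c
#include <stdio.h>

/* Hypothetical example: writes n lines to path, ignoring the result
   of each fprintf() and checking the stream once at the end.
   Returns 0 on success, -1 if anything went wrong. */
int write_squares(const char *path, int n)
{
    FILE *out = fopen(path, "w");
    int i, failed;

    if (out == NULL)
        return -1;

    for (i = 0; i < n; i++)
        fprintf(out, "%d %d\n", i, i * i);   /* return value ignored */

    failed = ferror(out);     /* set if any earlier write failed */
    if (fclose(out) != 0)     /* flushing buffered data can fail too */
        failed = 1;
    return failed ? -1 : 0;
}
```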
 
lawrence.jones

Eric Sosman said:
feof() and ferror() are "sticky:" if a transient failure bollixes
one I/O operation and then a subsequent operation succeeds, the success
of the second does not clear the stream's eof or error indicator. One
situation where this arises with some frequency is in handling input
from an interactive device that allows further input after an end-of-
input indication like ^Z or ^D: One fgetc() could return EOF due to
the transient end-of-input condition, and the next fgetc() could
succeed and return an actual input character.

Not in a conforming C implementation. It's the underlying end-of-file
and error indicators that are sticky, not feof() and ferror(). And
fgetc() is required to fail if either of the indicators is set; it is
not allowed to return any subsequent input, even if there is some. If
you want to read past EOF (when that's possible), you have to call
clearerr() to reset the indicators first.
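
A short sketch of the end-of-file half of this: once fgetc() has returned EOF at end of file, feof() keeps reporting true until clearerr() is called. The function name is made up, and the sketch only exercises the end-of-file indicator on an ordinary file; it says nothing about what a particular implementation does with interactive devices.

```c
#include <stdio.h>

/* Hypothetical demonstration: returns 1 if the end-of-file indicator
   stays set after EOF and is reset by clearerr(), 0 otherwise. */
int eof_indicator_is_sticky(FILE *fp)
{
    int ch;

    while ((ch = fgetc(fp)) != EOF)
        ;                            /* drain the stream */

    if (!feof(fp))
        return 0;                    /* stopped on an error instead */
    if (fgetc(fp) != EOF || !feof(fp))
        return 0;                    /* indicator should still be set */

    clearerr(fp);                    /* resets both indicators */
    return !feof(fp) && !ferror(fp);
}
```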
 
CBFalconer

NvrBst said:
[... original question and Guest's suggested getline() snipped ...]

There are much simpler (and faster) ways to handle getline. Among
them is ggets, available at:

<http://cbfalconer.home.att.net/download/ggets.zip>

I don't know if I supplied this earlier, in this thread. At any
rate, try it out. #define TESTING to get a testable version.
Don't define TESTING for normal use.

/* ------- file tknsplit.h ----------*/
#ifndef H_tknsplit_h
# define H_tknsplit_h

# ifdef __cplusplus
extern "C" {
# endif

#include <stddef.h>

/* copy over the next tkn from an input string, after
skipping leading blanks (or other whitespace?). The
tkn is terminated by the first appearance of tknchar,
or by the end of the source string.

The caller must supply sufficient space in tkn to
receive any tkn, Otherwise tkns will be truncated.

Returns: a pointer past the terminating tknchar.

This will happily return an infinity of empty tkns if
called with src pointing to the end of a string. Tokens
will never include a copy of tknchar.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
revised 2007-05-26 (name)
*/

const char *tknsplit(const char *src, /* Source of tkns */
                     char tknchar,    /* tkn delimiting char */
                     char *tkn,       /* receiver of parsed tkn */
                     size_t lgh);     /* length tkn can receive */
                                      /* not including final '\0' */

# ifdef __cplusplus
}
# endif
#endif
/* ------- end file tknsplit.h ----------*/

/* ------- file tknsplit.c ----------*/
#include "tknsplit.h"

/* copy over the next tkn from an input string, after
skipping leading blanks (or other whitespace?). The
tkn is terminated by the first appearance of tknchar,
or by the end of the source string.

The caller must supply sufficient space in tkn to
receive any tkn, Otherwise tkns will be truncated.

Returns: a pointer past the terminating tknchar.

This will happily return an infinity of empty tkns if
called with src pointing to the end of a string. Tokens
will never include a copy of tknchar.

A better name would be "strtkn", except that is reserved
for the system namespace. Change to that at your risk.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
Revised 2006-06-13 2007-05-26 (name)
*/

const char *tknsplit(const char *src, /* Source of tkns */
                     char tknchar,    /* tkn delimiting char */
                     char *tkn,       /* receiver of parsed tkn */
                     size_t lgh)      /* length tkn can receive */
                                      /* not including final '\0' */
{
   if (src) {
      while (' ' == *src) src++;

      while (*src && (tknchar != *src)) {
         if (lgh) {
            *tkn++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tknchar == *src)) src++;
   }
   *tkn = '\0';
   return src;
} /* tknsplit */

#ifdef TESTING
#include <stdio.h>

#define ABRsize 6 /* length of acceptable tkn abbreviations */

/* ---------------- */

static void showtkn(int i, char *tok)
{
   putchar(i + '1'); putchar(':');
   puts(tok);
} /* showtkn */

/* ---------------- */

int main(void)
{
   char teststring[] = "This is a test, ,, abbrev, more";

   const char *t, *s = teststring;
   int i;
   char tkn[ABRsize + 1];

   puts(teststring);
   t = s;
   for (i = 0; i < 4; i++) {
      t = tknsplit(t, ',', tkn, ABRsize);
      showtkn(i, tkn);
   }

   puts("\nHow to detect 'no more tkns' while truncating");
   t = s; i = 0;
   while (*t) {
      t = tknsplit(t, ',', tkn, 3);
      showtkn(i, tkn);
      i++;
   }

   puts("\nUsing blanks as tkn delimiters");
   t = s; i = 0;
   while (*t) {
      t = tknsplit(t, ' ', tkn, ABRsize);
      showtkn(i, tkn);
      i++;
   }
   return 0;
} /* main */

#endif
/* ------- end file tknsplit.c ----------*/
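
One caveat for the original question: tknsplit skips leading blanks, so a field made up only of spaces comes back empty. A whitespace-preserving variant could look something like the sketch below. It is not part of tknsplit or ggets; the name next_field and its exact behaviour are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variant: copies the next field of src into out (at
   most lgh characters plus the final '\0'), keeping every blank.
   Returns a pointer just past the delimiter, or to the terminating
   '\0' of src if no delimiter was found. */
const char *next_field(const char *src, char delim, char *out, size_t lgh)
{
    const char *end = strchr(src, delim);
    size_t n;

    if (end == NULL)
        end = src + strlen(src);    /* last field: runs to the end */

    n = (size_t)(end - src);
    if (n > lgh)
        n = lgh;                    /* truncate over-long fields */
    memcpy(out, src, n);
    out[n] = '\0';

    return *end ? end + 1 : end;
}
```

Called repeatedly on "STRING,     , 1" with ',' as the delimiter, this yields "STRING", then five spaces (the separator space is kept as well), then " 1".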
 
Eric Sosman

Not in a conforming C implementation. It's the underlying end-of-file
and error indicators that are sticky, not feof() and ferror(). And
fgetc() is required to fail if either of the indicators is set, it is
not allowed to return any subsequent input, even if there is some. If
you want to read past EOF (when that's possible), you have to call
clearerr() to reset the indicators first.

I think you're right about feof(), because 7.19.7.1p2
says that fgetc() fails if the eof indicator is set, and all
the other input functions work "as if" by calling fgetc().
But I don't see any similar language about ferror() and the
error indicator. Can you offer a citation?
 
Guest

NvrBst said:
[... original question snipped ...]

Did you try using regex.h functions?
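
For what it's worth, a rough sketch of the regex.h route, using the POSIX regcomp()/regexec() interface. The pattern, the helper names and the 32-character field limit are assumptions modelled on the question; the line is expected to have no trailing newline.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Copies one capture group out of line, truncated to 32 characters. */
static void copy_group(const char *line, const regmatch_t *m, char out[33])
{
    size_t len = (size_t)(m->rm_eo - m->rm_so);

    if (len > 32)
        len = 32;
    memcpy(out, line + m->rm_so, len);
    out[len] = '\0';
}

/* Hypothetical helper: splits "KEYWORD: TYPE, Data1, Data2".
   Returns 0 on a match, -1 otherwise. */
int match_line(const char *line, char key[33], char type[33],
               char data[33], int *count)
{
    regex_t re;
    regmatch_t m[5];            /* whole match plus four groups */
    int rc;

    /* Exactly one space follows each separator; any further spaces
       belong to the third group, which may also be empty. */
    if (regcomp(&re, "^([^:]+): ([^,]+), ([^,]*), ([0-9]{1,9})$",
                REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, line, 5, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    copy_group(line, &m[1], key);
    copy_group(line, &m[2], type);
    copy_group(line, &m[3], data);
    return sscanf(line + m[4].rm_so, "%d", count) == 1 ? 0 : -1;
}
```

Compiling the pattern on every call keeps the sketch self-contained; real code would normally compile it once and reuse it.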
 
lawrence.jones

Eric Sosman said:
I think you're right about feof(), because 7.19.7.1p2
says that fgetc() fails if the eof indicator is set, and all
the other input functions work "as if" by calling fgetc().
But I don't see any similar language about ferror() and the
error indicator. Can you offer a citation?

No, I was mistaken. Sorry for the confusion.
 
