newbie fscanf %[ conversions, multipliers


Steven

Hi,

I am using fscanf() to read words. But I want to match alphanumeric
characters only. However the program, when using the conversion
specifier %255[a-z,A-Z] prints only spaces and other non-standard
ascii characters. I have listed a small example below. Can someone
please tell me what I am doing wrong or forgetting, with regards to
the conversion specifier? Thanks!

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MAXWORDLEN 256

int main(void) {
char word[MAXWORDLEN];
char *bigr[2];
int i = 0;

bigr[0] = calloc(MAXWORDLEN, sizeof(char));
bigr[1] = calloc(MAXWORDLEN, sizeof(char));

while(fscanf(stdin, "%255[a-z,A-Z]", word) != EOF) {
strcpy(bigr[i++], word);
if(i == 2) {
printf("%s %s\n", bigr[0], bigr[1]);
strcpy(bigr[0], bigr[1]);
i = 1;
}
}

return 0;
}
 

Chris Torek

I am using fscanf() to read words. But I want to match alphanumeric
characters only. ...
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MAXWORDLEN 256

int main(void) {
char word[MAXWORDLEN];
char *bigr[2];
int i = 0;

bigr[0] = calloc(MAXWORDLEN, sizeof(char));
bigr[1] = calloc(MAXWORDLEN, sizeof(char));

while(fscanf(stdin, "%255[a-z,A-Z]", word) != EOF) {

So far, not too bad, except that you did not store the return
value from fscanf(). There are three possible return values
for this particular call: EOF, 0, and 1 (representing "input
failure", "matching failure", and "success" respectively). This
code handles input failure, but cannot distinguish between
"matching failure" and "success".

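A minimal sketch of what storing the result buys you. It uses sscanf() on a string rather than fscanf() on stdin so that it is easy to try out; the helper name describe_scan() is made up for illustration, and the scanset assumes an ASCII machine and leaves out the comma.

```c
#include <stdio.h>

/* Hypothetical helper: applies a %255[ conversion to the string `input`
   and reports which of the three outcomes occurred. */
const char *describe_scan(const char *input)
{
    char word[256];
    int result = sscanf(input, "%255[a-zA-Z]", word);

    if (result == EOF)
        return "input failure";     /* ran out of input before any match */
    if (result == 0)
        return "matching failure";  /* first character not in the scanset */
    return "success";               /* result == 1: word[] was filled in */
}
```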
It also seems a little odd to me that you include a comma in the
scanset, and no digits, when you say "alphanumeric only". (It is
also worth pointing out that on an EBCDIC machine, such as some
IBM mainframes, %[A-Za-z] includes some punctuation and such, as
the alphabetic characters are not contiguous. It will also not
work well with some European character sets in ISO Latin 1, where
legitimate alphabetic characters like á will be excluded.)

The biggest problem, though, is what happens if the scanf engine
succeeds. The conversion specification here is %255[ and the
scanset is "a through z" plus "," plus "A through Z": lowercase
alphabetic, comma, and uppercase alphabetic, if the machine uses
ASCII. If the input begins with at least one alphabetic character
or comma, the conversion will succeed -- fscanf will return 1 --
and the converted characters will be stored in the array named
"word", which is in fact big enough (256 characters).

Something eventually causes the scan to stop. There are only
three possibilities: an attempt to read encounters EOF; the 255
character limit runs out; or -- most likely -- the next character
in the input stream is not in the scanset. It is the third case
that is the immediate problem. When the scanf engine stops
processing input directives, whatever character(s) are in the
input stream remain in the input stream. Assuming the first
directive stops because of a space or a newline, the space or
newline remains in the stream.

The %[ directive, unlike most directives, *does not skip initial
white space* (spaces, tabs, newlines, etc).

The code inside the loop also does not skip white space:
strcpy(bigr[i++], word);
if(i == 2) {
printf("%s %s\n", bigr[0], bigr[1]);
strcpy(bigr[0], bigr[1]);
i = 1;
}
}

Thus, on the next trip through the loop, the first character that
the fscanf() call encounters will be the whitespace left behind by
the previous fscanf(). This will cause a "matching failure", so
that the second fscanf() will return 0, leaving the "word" array
unmodified.
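
The effect can be reproduced in isolation. The following is only a sketch: second_scan_result() is a hypothetical helper, not part of the original program, and it takes a FILE * so that it can be pointed at a temporary file instead of stdin.

```c
#include <stdio.h>

/* Hypothetical helper: calls fscanf() twice on `fp` with the scanset from
   the original program and returns the result of the second call.
   `word` must have room for 256 chars. */
int second_scan_result(FILE *fp, char *word)
{
    int first = fscanf(fp, "%255[a-z,A-Z]", word);

    if (first != 1)
        return first;               /* nothing was read the first time */
    /* The character that stopped the first scan is still in the stream. */
    return fscanf(fp, "%255[a-z,A-Z]", word);
}
```

With the input "foo bar", the first call stores "foo" and the second returns 0 with "foo" still in the array, because the space between the words was never consumed.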

You could attempt to fix this by skipping whitespace inside the
loop:

#include <ctype.h> /* with the other #includes */
...
/* somewhere inside the loop */
int c;

while ((c = getc(stdin)) != EOF && isspace(c))
continue;
if (c != EOF)
ungetc(c, stdin);

but this is not quite correct. Suppose the scanf engine eventually
stops, but not because of whitespace, not because of EOF, and not
because the 255-character limit ran out: suppose it stops because
the next input character available is, e.g., '('. This is not
alphabetic but is also not whitespace, so isspace() will say "not
space".

In fact, what you need is "read and convert stuff that *is* part
of a word" interleaved between "read and discard stuff that is
*not* part of a word". The question then becomes whether the file
must begin with a "word", or will you allow "non-word" stuff to
come before the first "word".
return 0;
}

Good, main() needs a return value. :)

It *is* possible to do this with the scanf engine, but you will
need at least two calls to it unless the file *must* begin with a
word. In the latter case, you can do:

for (;;) {
/* fscanf(stdin, fmt, ...) == scanf(fmt, ...) */
result = scanf("%255[A-Za-z0-9]%*[^A-Za-z0-9]", word);
if (result == EOF)
break;
if (result == 0)
... do something ...

This scanf directive-pair means: "Read and convert stuff in the
character class, with an input failure if EOF occurs before any
input, or a matching failure if there are no characters in the
class. Then, if no failure, read and discard stuff (not) in the
character class, with an input failure if EOF occurs before any
further characters are input, or a matching failure if the next
input character is in the class." The return value will be EOF
if input failure occurred before any data were stored in the array
named word, 0 if a matching failure occurred before any data were
stored in the array, or 1 if data were stored in the array. (You
get no notice if the second directive fails, due to the assignment
suppression.)

The second character-class is negated because of the "^", hence
the (not). Note that either directive can fail if there is not at
least one character in (or not in) the class: %[ demands that at
least one character be read (and discarded for %*[, or assigned
for %[).

The scanf() above will "get stuck" if the file begins with a non-word
character. Suppose the first character is ':' (colon), for instance.
The first %[ directive will see the colon and fail with a matching
failure, terminating the scan, returning 0, leaving the colon in
the input stream. A subsequent trip through the loop will again
see the colon and again cause the scanf to terminate with a matching
failure, returning 0.
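
One way to fill in the "do something" case is to give up as soon as a matching failure shows that the input did not begin with a word. This is a sketch under that assumption; count_words_strict() and its -1 return convention are invented for the example, and it reads from a FILE * rather than stdin.

```c
#include <stdio.h>

/* Sketch: counts words on `fp` with the single-call format, assuming the
   input begins with a word. Returns the count, or -1 if a matching
   failure shows that assumption was wrong. */
int count_words_strict(FILE *fp)
{
    char word[256];
    int count = 0;

    for (;;) {
        int result = fscanf(fp, "%255[A-Za-z0-9]%*[^A-Za-z0-9]", word);

        if (result == EOF)
            return count;           /* clean end of input */
        if (result == 0)
            return -1;              /* stuck on a non-word character */
        count++;
    }
}
```

Returning instead of looping again is what keeps this version from spinning forever on a leading ':'.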

If you wish to discard "non-word" characters, but allow the case
of "no non-word characters", you can invoke scanf twice:

#define WORD_CLASS "A-Za-z0-9"

result = scanf("%*[^" WORD_CLASS "]");
/* XXX: throw above result away */

result = scanf("%255[" WORD_CLASS "]", word);
if (result == EOF)
break;
if (result == 0)
... panic -- this should never happen ...

Here, the first call is allowed to fail with a matching failure if
there is a "word-class" character. In this case, it leaves the
"word-class" character in the input stream, and the second scanf
will find it there. It is also allowed to fail (silently) with an
input failure, in the hopes that the second scanf will also
immediately encounter input failure (this is likely, but not
guaranteed -- if you want to avoid the situation, you could test
the first result). And of course, it is allowed to succeed,
eating up all "non-word" characters and leaving either EOF or
a "word" character for the second scanf.

You cannot combine these two calls into one, because if the stream
currently begins with a valid word character, the negated class
directive ("%[^...]") will cause the scanf call to fail, and return
without converting-and-assigning into the "word" array.
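
Put together as a function, the two-call approach might look like the sketch below. The name next_word() and the FILE * parameter are assumptions made for the example; the formats are the ones shown above.

```c
#include <stdio.h>

#define WORD_CLASS "A-Za-z0-9"

/* Sketch: reads the next word from `fp` into `word` (room for 256 chars).
   Returns 1 if a word was stored, 0 if the input ran out first. */
int next_word(FILE *fp, char *word)
{
    /* Discard any run of non-word characters. The result is ignored on
       purpose: a matching failure here just means there was nothing
       to skip. */
    (void)fscanf(fp, "%*[^" WORD_CLASS "]");

    /* Either EOF or a word character comes next. */
    return fscanf(fp, "%255[" WORD_CLASS "]", word) == 1;
}
```

A caller would then loop with while (next_word(stdin, word)) and handle each word in the loop body.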

Finally, two more notes.

First: suppose an input word exceeds 255 characters in length. A
loop of the form:

for (;;) {
/* read and discard any non-word characters, allowing none */
...
/* read and convert valid "word" characters, requiring 1 or
more but stopping after 255 even if there are more */
...
/* do something with the word */
}

will consider the remaining character(s) -- up to the next 255 --
as an additional, separate word, even though the two input "words"
were not separated by any non-word characters.

This may be what you want, or may not.
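
The splitting is easier to see with a much smaller limit. The sketch below uses %3[ as a stand-in for %255[ and works on a string with sscanf(), using %n to find out how many characters each call consumed; count_pieces() is a hypothetical helper.

```c
#include <stdio.h>

/* Sketch with a deliberately tiny limit (3, not 255). Returns how many
   separate "words" the skip-then-read loop finds in `input`. */
int count_pieces(const char *input)
{
    char piece[4];
    int count = 0;
    int used;

    for (;;) {
        /* Skip non-word characters, if any. %n is only reached when the
           directive before it succeeds, so `used` stays 0 otherwise. */
        used = 0;
        sscanf(input, "%*[^A-Za-z0-9]%n", &used);
        input += used;

        /* Read at most 3 word characters. */
        used = 0;
        if (sscanf(input, "%3[A-Za-z0-9]%n", piece, &used) != 1)
            break;
        input += used;
        count++;
    }
    return count;
}
```

Given "abcdefg hi", this finds four pieces ("abc", "def", "g", "hi") although the input contains only two words.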

Second: "alphanumeric" words often mean "words starting with an
alphabetic character, then allowing alphabetic or numeric characters"
(in programming languages, at least -- C among them -- identifiers
are alphanumeric words that cannot *begin* with digits). The
scanf engine is not very suited to such a job: its directives are
clumsier than typical regular-expression handlers (lex, perl and
awk REs, and the like). You can sort-of express this with:

char firstchar[2];
char rest[256 - 1];
int result;

result = scanf("%1[A-Za-z]%254[A-Za-z0-9]", firstchar, rest);

although to allow for EBCDIC, the "A-Z"s should be expanded out
as well:

#define ALPHABETIC "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
"abcdefghijklmnopqrstuvwxyz"
#define ALPHANUMERIC ALPHABETIC "0-9"

...

result = scanf("%1[" ALPHABETIC "]%254[" ALPHANUMERIC "]",
firstchar, rest);

(C guarantees that the digits are grouped "properly" so we can use
the shorthand for the digit part). In both cases, if "result" is
1, the input was just a single-character alphabetic-only "word";
if result is 2, the alphanumeric tail of the word is in "rest".
(We need a 2-character array to hold the first character because
the %[ directive always stores a C string, i.e., adds the '\0'.)
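
To get the whole word back in one piece, the two arrays have to be joined afterwards. A sketch, again using sscanf() on a string for easy experimenting and assuming an ASCII-like character set; scan_identifier() is a made-up name.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: scans an identifier-like word from the start of `input` and
   joins the two pieces in `out`, which must have room for 256 chars.
   Returns the sscanf() result: 2 or 1 on success, 0 or EOF otherwise. */
int scan_identifier(const char *input, char *out)
{
    char firstchar[2];
    char rest[256 - 1];
    int result;

    result = sscanf(input, "%1[A-Za-z]%254[A-Za-z0-9]", firstchar, rest);
    if (result >= 1)
        strcpy(out, firstchar);     /* the single leading letter */
    if (result == 2)
        strcat(out, rest);          /* the alphanumeric tail, if any */
    return result;
}
```

A leading digit gives a result of 0 and leaves `out` untouched, which is the "cannot begin with a digit" rule in action.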

The best solution is probably to ignore scanf entirely. In this
case, you can write a small "word reading" function that uses
isalpha() and isdigit() from <ctype.h>, and a corresponding
"word skipping" function that also uses isalpha() and isdigit().

As usual, scanf is a poor solution: for simple problems, it is too
complicated; for robust programs that do complicated jobs, it is
too simple.
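
For comparison, the word-reading and word-skipping functions suggested above could look roughly like this. It is a sketch, not a drop-in replacement: the names skip_nonword(), read_word() and is_word_char() are invented here, and an overlong word is still split at the buffer limit, just as with scanf.

```c
#include <ctype.h>
#include <stdio.h>

static int is_word_char(int c)
{
    return isalpha(c) || isdigit(c);
}

/* Reads and discards non-word characters, leaving the first word
   character (if any) in the stream. */
void skip_nonword(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != EOF && !is_word_char(c))
        continue;
    if (c != EOF)
        ungetc(c, fp);
}

/* Reads up to size - 1 word characters into buf and terminates it.
   Returns the number of characters stored (0 means no word here).
   `size` must be at least 1. */
size_t read_word(FILE *fp, char *buf, size_t size)
{
    size_t n = 0;
    int c;

    while (n + 1 < size && (c = getc(fp)) != EOF) {
        if (!is_word_char(c)) {
            ungetc(c, fp);          /* leave it for skip_nonword() */
            break;
        }
        buf[n++] = (char)c;
    }
    buf[n] = '\0';
    return n;
}
```

The main loop then alternates skip_nonword(stdin) with read_word(stdin, word, sizeof word) and stops when read_word() returns 0.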
 

Steven

Sorry for breaking the net etiquette by starting my reply at the top.

But thank you so much for the very complete reply!

I resorted to the scanf family for ease of use, but after reading your
reply I am not sure anymore `what easy actually is' :)

Especially for a beginner, even such a seemingly simple task as deriving
tokens from text data can be dangerous. For the future I hope that
this will turn into C power for me; until then I promise to keep
reading.

Thanks again for the complete reply!

Steven.


 

Christopher Benson-Manica

Steven said:
bigr[0] = calloc(MAXWORDLEN, sizeof(char));
bigr[1] = calloc(MAXWORDLEN, sizeof(char));

Far be it from me to nitpick the outstanding reply you already
received, but you should check the return value of calloc() before
continuing.
 
