newbie fscanf %[ conversions, multipliers

Discussion in 'C Programming' started by Steven, Dec 27, 2005.

  1. Steven

    Steven Guest

    Hi,

    I am using fscanf() to read words, and I want to match alphanumeric
    characters only. However, when I use the conversion specifier
    %255[a-z,A-Z], the program prints only spaces and other non-standard
    ASCII characters. I have listed a small example below. Can someone
    please tell me what I am doing wrong or forgetting with regard to
    the conversion specifier? Thanks!

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    #define MAXWORDLEN 256

    int main(void) {
        char word[MAXWORDLEN];
        char *bigr[2];
        int i = 0;

        bigr[0] = calloc(MAXWORDLEN, sizeof(char));
        bigr[1] = calloc(MAXWORDLEN, sizeof(char));

        while(fscanf(stdin, "%255[a-z,A-Z]", word) != EOF) {
            strcpy(bigr[i++], word);
            if(i == 2) {
                printf("%s %s\n", bigr[0], bigr[1]);
                strcpy(bigr[0], bigr[1]);
                i = 1;
            }
        }

        return 0;
    }
     
    Steven, Dec 27, 2005
    #1

  2. Chris Torek

    Chris Torek Guest

    Steven wrote:
    >I am using fscanf() to read words. But I want to match alphanumeric
    >characters only. ...


    >#include <stdio.h>
    >#include <string.h>
    >#include <stdlib.h>
    >
    >#define MAXWORDLEN 256
    >
    >int main(void) {
    > char word[MAXWORDLEN];
    > char *bigr[2];
    > int i = 0;
    >
    > bigr[0] = calloc(MAXWORDLEN, sizeof(char));
    > bigr[1] = calloc(MAXWORDLEN, sizeof(char));
    >
    > while(fscanf(stdin, "%255[a-z,A-Z]", word) != EOF) {


    So far, not too bad, except that you did not store the return
    value from fscanf(). There are three possible return values
    for this particular call: EOF, 0, and 1 (representing "input
    failure", "matching failure", and "success" respectively). This
    code handles input failure, but cannot distinguish between
    "matching failure" and "success".
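    To make the three return values concrete, here is a minimal sketch
    that runs the original format against strings with sscanf() instead
    of a stream. The helper names try_word() and describe() are made up
    for illustration, and the a-z ranges are assumed to behave as on an
    ASCII system.

```c
#include <stdio.h>

/* Hypothetical helper: one scan attempt with the original scanset.
   Returns EOF on input failure, 0 on matching failure, 1 on success. */
int try_word(const char *input, char word[256])
{
    return sscanf(input, "%255[a-z,A-Z]", word);
}

/* Hypothetical helper: name the three possible results. */
const char *describe(int result)
{
    if (result == EOF)
        return "input failure";
    if (result == 0)
        return "matching failure";
    return "success";
}
```

    Storing the result in an int and passing it to something like
    describe() lets a caller tell an empty input (EOF) apart from input
    that merely starts with a character outside the scanset (0).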

    It also seems a little odd to me that you include a comma in the
    scanset, and no digits, when you say "alphanumeric only". (It is
    also worth pointing out that on an EBCDIC machine, such as some
    IBM mainframes, %[A-Za-z] includes some punctuation and such, as
    the alphabetic characters are not contiguous. It will also not
    work well with some European character sets in ISO Latin 1, where
    legitimate alphabetic characters like á will be excluded.)

    The biggest problem, though, is what happens if the scanf engine
    succeeds. The conversion specification here is %255[ and the
    scanset is "a through z" plus "," plus "A through Z": lowercase
    alphabetic, comma, and uppercase alphabetic, if the machine uses
    ASCII. If the input begins with at least one alphabetic character
    or comma, the conversion will succeed -- fscanf will return 1 --
    and the converted characters will be stored in the array named
    "word", which is in fact big enough (256 characters).

    Something eventually causes the scan to stop. There are only
    three possibilities: an attempt to read encounters EOF; the 255
    character limit runs out; or -- most likely -- the next character
    in the input stream is not in the scanset. It is the third case
    that is the immediate problem. When the scanf engine stops
    processing input directives, whatever character(s) are in the
    input stream remain in the input stream. Assuming the first
    directive stops because of a space or a newline, the space or
    newline remains in the stream.

    The %[ directive, unlike most directives, *does not skip initial
    white space* (spaces, tabs, newlines, etc).
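    A small sketch of that difference, using sscanf() on a string that
    starts with blanks. The three helper names are hypothetical; each
    simply returns the sscanf() result.

```c
#include <stdio.h>

/* %s skips leading white space before converting. */
int scan_with_s(const char *input, char word[256])
{
    return sscanf(input, "%255s", word);
}

/* %[ does not skip anything: a leading blank is a matching failure. */
int scan_with_set(const char *input, char word[256])
{
    return sscanf(input, "%255[a-z]", word);
}

/* A white-space directive in the format skips any leading white space first. */
int scan_with_space_then_set(const char *input, char word[256])
{
    return sscanf(input, " %255[a-z]", word);
}
```

    Note that the leading space in the last format only handles white
    space; it does nothing for punctuation.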

    The code inside the loop also does not skip white space:

    > strcpy(bigr[i++], word);
    > if(i == 2) {
    > printf("%s %s\n", bigr[0], bigr[1]);
    > strcpy(bigr[0], bigr[1]);
    > i = 1;
    > }
    > }


    Thus, on the next trip through the loop, the first character that
    the fscanf() call encounters will be the whitespace left behind by
    the previous fscanf(). This will cause a "matching failure", so
    that the second fscanf() will return 0, leaving the "word" array
    unmodified.
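    The "stuck" behaviour can be observed with a sketch like the
    following, which writes some text to a temporary file and repeats
    the original fscanf() call a fixed number of times. The function
    name count_matching_failures() is invented for this illustration,
    and tmpfile() is assumed to succeed.

```c
#include <stdio.h>

/* Hypothetical helper: put `text` in a temporary file, call fscanf()
   `calls` times with the original format, and count how many of those
   calls report a matching failure (return 0). Returns -1 if no
   temporary file could be created. */
int count_matching_failures(const char *text, int calls)
{
    char word[256];
    int failures = 0;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;
    fputs(text, fp);
    rewind(fp);
    for (int i = 0; i < calls; i++) {
        /* After the first word, the blank stays in the stream,
           so every later call fails on the same character. */
        if (fscanf(fp, "%255[a-z,A-Z]", word) == 0)
            failures++;
    }
    fclose(fp);
    return failures;
}
```

    With the text "ab cd" and five calls, only the first call stores a
    word; the other four all stop at the same blank.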

    You could attempt to fix this by skipping whitespace inside the
    loop:

    #include <ctype.h> /* with the other #includes */
    ...
    /* somewhere inside the loop */
    int c;

    while ((c = getc(stdin)) != EOF && isspace(c))
        continue;
    if (c != EOF)
        ungetc(c, stdin);

    but this is not quite correct. Suppose the scanf engine eventually
    stops, but not because of whitespace, not because of EOF, and not
    because the 255-character limit ran out: suppose it stops because
    the next input character available is, e.g., '('. This is not
    alphabetic but is also not whitespace, so isspace() will say "not
    space".

    In fact, what you need is "read and convert stuff that *is* part
    of a word" interleaved between "read and discard stuff that is
    *not* part of a word". The question then becomes whether the file
    must begin with a "word", or will you allow "non-word" stuff to
    come before the first "word".

    > return 0;
    >}


    Good, main() needs a return value. :)

    It *is* possible to do this with the scanf engine, but you will
    need at least two calls to it unless the file *must* begin with a
    word. In the latter case, you can do:

    for (;;) {
        /* fscanf(stdin, fmt, ...) == scanf(fmt, ...) */
        result = scanf("%255[A-Za-z0-9]%*[^A-Za-z0-9]", word);
        if (result == EOF)
            break;
        if (result == 0)
            ... do something ...

    This scanf directive-pair means: "Read and convert stuff in the
    character class, with an input failure if EOF occurs before any
    input, or a matching failure if there are no characters in the
    class. Then, if no failure, read and discard stuff (not) in the
    character class, with an input failure if EOF occurs before any
    further characters are input, or a matching failure if the next
    input character is in the class." The return value will be EOF
    if input failure occurred before any data were stored in the array
    named word, 0 if a matching failure occurred before any data were
    stored in the array, or 1 if data were stored in the array. (You
    get no notice if the second directive fails, due to the assignment
    suppression.)

    The second character-class is negated because of the "^", hence
    the (not). Note that either directive can fail if there is not at
    least one character in (or not in) the class: %[ demands that at
    least one character be read (and discarded for %*[, or assigned
    for %[).
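    As a rough, testable version of the directive-pair loop, the sketch
    below counts words on a stream that is assumed to begin with a word.
    The names count_words_pair() and count_words_in_text() are
    hypothetical, and returning -1 on a matching failure is just one
    possible choice for the "do something" case.

```c
#include <stdio.h>

/* Count words using the directive pair. Assumes the stream starts with
   a word; returns -1 on a matching failure instead of looping forever. */
int count_words_pair(FILE *fp)
{
    char word[256];
    int count = 0;

    for (;;) {
        int result = fscanf(fp, "%255[A-Za-z0-9]%*[^A-Za-z0-9]", word);
        if (result == EOF)
            break;
        if (result == 0)
            return -1;
        count++;
    }
    return count;
}

/* Hypothetical wrapper so the loop can be tried on a string.
   Returns -2 if no temporary file could be created. */
int count_words_in_text(const char *text)
{
    FILE *fp = tmpfile();
    int n;

    if (fp == NULL)
        return -2;
    fputs(text, fp);
    rewind(fp);
    n = count_words_pair(fp);
    fclose(fp);
    return n;
}
```

    For example, count_words_in_text("foo, bar baz\n") finds three
    words, while a text that begins with ':' triggers the matching
    failure on the very first call.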

    The scanf() above will "get stuck" if the file begins with a non-word
    character. Suppose the first character is ':' (colon), for instance.
    The first %[ directive will see the colon and fail with a matching
    failure, terminating the scan, returning 0, leaving the colon in
    the input stream. A subsequent trip through the loop will again
    see the colon and again cause the scanf to terminate with a matching
    failure, returning 0.

    If you wish to discard "non-word" characters, but allow the case
    of "no non-word characters", you can invoke scanf twice:

    #define WORD_CLASS "A-Za-z0-9"

        result = scanf("%*[^" WORD_CLASS "]");
        /* XXX: throw above result away */

        result = scanf("%255[" WORD_CLASS "]", word);
        if (result == EOF)
            break;
        if (result == 0)
            ... panic -- this should never happen ...

    Here, the first call is allowed to fail with a matching failure if
    there is a "word-class" character. In this case, it leaves the
    "word-class" character in the input stream, and the second scanf
    will find it there. It is also allowed to fail (silently) with an
    input failure, in the hopes that the second scanf will also
    immediately encounter input failure (this is likely, but not
    guaranteed -- if you want to avoid the situation, you could test
    the first result). And of course, it is allowed to succeed,
    eating up all "non-word" characters and leaving either EOF or
    a "word" character for the second scanf.

    You cannot combine these two calls into one, because if the stream
    currently begins with a valid word character, the negated class
    directive ("%[^...]") will cause the scanf call to fail, and return
    without converting-and-assigning into the "word" array.
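    Put together, the two-call approach might look like the sketch
    below. The function names next_word() and collect_words() are
    illustrative only, the stream is a temporary file so the code can
    be tried on a string, and a result of 0 from the second call is
    treated like end of input rather than as a panic.

```c
#include <stdio.h>
#include <string.h>

#define WORD_CLASS "A-Za-z0-9"

/* Read the next word from fp. Returns 1 on success, EOF otherwise. */
int next_word(FILE *fp, char word[256])
{
    int result;

    /* Discard non-word characters. A matching failure (next character
       is already a word character) and EOF are both tolerated here. */
    (void)fscanf(fp, "%*[^" WORD_CLASS "]");

    result = fscanf(fp, "%255[" WORD_CLASS "]", word);
    return result == 1 ? 1 : EOF;
}

/* Hypothetical helper: collect the words of `text` into `out`,
   separated by single spaces. Returns the word count, or -1 on error. */
int collect_words(const char *text, char *out, size_t outsz)
{
    char word[256];
    int count = 0;
    FILE *fp;

    if (outsz == 0 || (fp = tmpfile()) == NULL)
        return -1;
    out[0] = '\0';
    fputs(text, fp);
    rewind(fp);
    while (next_word(fp, word) == 1) {
        if (strlen(out) + strlen(word) + 2 > outsz)
            break;              /* no room for a separator plus the word */
        if (count > 0)
            strcat(out, " ");
        strcat(out, word);
        count++;
    }
    fclose(fp);
    return count;
}
```

    Leading punctuation or blanks are now harmless: the first call eats
    them, or fails quietly when there is nothing to eat.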

    Finally, two more notes.

    First: suppose an input word exceeds 255 characters in length. A
    loop of the form:

    for (;;) {
        /* read and discard any non-word characters, allowing none */
        ...
        /* read and convert valid "word" characters, requiring 1 or
           more but stopping after 255 even if there are more */
        ...
        /* do something with the word */
    }

    will consider the remaining character(s) -- up to the next 255 --
    as an additional, separate word, even though the two input "words"
    were not separated by any non-word characters.

    This may be what you want, or may not.
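    The splitting effect is easier to see with a tiny field width. The
    sketch below uses a width of 3 as a stand-in for 255 and walks a
    string with sscanf() and %n; count_pieces() is a made-up name.

```c
#include <stdio.h>

/* Count how many "words" one run of lowercase letters turns into when
   each conversion stores at most 3 characters. */
int count_pieces(const char *input)
{
    char piece[4];
    int consumed = 0;
    int pieces = 0;

    /* %n records how many characters this call consumed, so the next
       call can resume where the previous one stopped. */
    while (sscanf(input, "%3[a-z]%n", piece, &consumed) == 1) {
        input += consumed;
        pieces++;
    }
    return pieces;
}
```

    The seven-letter input "abcdefg" comes back as three pieces ("abc",
    "def", "g") even though nothing separates them in the input.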

    Second: "alphanumeric" words often mean "words starting with an
    alphabetic character, then allowing alphabetic or numeric characters"
    (in programming languages, at least -- C among them -- identifiers
    are alphanumeric words that cannot *begin* with digits). The
    scanf engine is not very suited to such a job: its directives are
    clumsier than typical regular-expression handlers (lex, perl and
    awk REs, and the like). You can sort-of express this with:

    char firstchar[2];
    char rest[256 - 1];
    int result;

    result = scanf("%1[A-Za-z]%254[A-Za-z0-9]", firstchar, rest);

    although to allow for EBCDIC, the "A-Z"s should be expanded out
    as well:

    #define ALPHABETIC "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
                       "abcdefghijklmnopqrstuvwxyz"
    #define ALPHANUMERIC ALPHABETIC "0-9"

    ...

    result = scanf("%1[" ALPHABETIC "]%254[" ALPHANUMERIC "]",
                   firstchar, rest);

    (C guarantees that the digits are grouped "properly" so we can use
    the shorthand for the digit part). In both cases, if "result" is
    1, the input was just a single-character alphabetic-only "word";
    if result is 2, the alphanumeric tail of the word is in "rest".
    (We need a 2-character array to hold the first character because
    the %[ directive always stores a C string, i.e., adds the '\0'.)
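    Here is the same two-directive idea wrapped in a function that can
    be tried on strings. scan_identifier() is a hypothetical name; it
    clears "rest" first so that a result of 1 leaves an empty tail
    rather than stale data, and it uses the A-Z shorthand, so it
    assumes an ASCII-like character set.

```c
#include <stdio.h>

/* Returns the sscanf() result: 0 if the input does not start with a
   letter, 1 if only the first letter was stored, 2 if the alphanumeric
   tail was stored in rest as well. */
int scan_identifier(const char *input, char firstchar[2], char rest[255])
{
    rest[0] = '\0';
    return sscanf(input, "%1[A-Za-z]%254[A-Za-z0-9]", firstchar, rest);
}
```

    For the input "x1y2 z" this stores "x" and "1y2" and returns 2; for
    "x+" it returns 1, and for "9abc" it returns 0.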

    The best solution is probably to ignore scanf entirely. In this
    case, you can write a small "word reading" function that uses
    isalpha() and isdigit() from <ctype.h>, and a corresponding
    "word skipping" function that also uses isalpha() and isdigit().
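    One possible shape for those two functions is sketched below. The
    names skip_nonword() and read_word() and their return conventions
    are assumptions made for this example, not anything fixed by the
    discussion above.

```c
#include <ctype.h>
#include <stdio.h>

/* A "word" character here is a letter or a digit. */
static int is_word_char(int c)
{
    return isalpha(c) || isdigit(c);
}

/* Skip non-word characters. Returns the first word character found
   (after pushing it back onto the stream), or EOF if there is none. */
int skip_nonword(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != EOF && !is_word_char(c))
        continue;
    if (c != EOF)
        ungetc(c, fp);
    return c;
}

/* Read one word of at most size-1 characters into word and terminate
   it. Returns the word's length, which is 0 if no word starts here. */
size_t read_word(FILE *fp, char *word, size_t size)
{
    size_t len = 0;
    int c;

    while (len + 1 < size && (c = getc(fp)) != EOF) {
        if (!is_word_char(c)) {
            ungetc(c, fp);      /* leave the separator for the skipper */
            break;
        }
        word[len++] = (char)c;
    }
    if (size > 0)
        word[len] = '\0';
    return len;
}
```

    A main loop would then alternate the two: call skip_nonword(), stop
    if it returns EOF, otherwise call read_word() and use the result.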

    As usual, scanf is a poor solution: for simple problems, it is too
    complicated; for robust programs that do complicated jobs, it is
    too simple.
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
     
    Chris Torek, Dec 27, 2005
    #2

  3. Steven

    Steven Guest

    Sorry for breaking the net etiquette by starting my reply at the top.

    But thank you so much for the very complete reply!

    I resorted to the scanf family for ease of use, but after reading your
    reply I am no longer sure `what easy actually is' :)

    Especially for a beginner, even a seemingly simple task such as deriving
    tokens from text data can be dangerous. I hope that in the future
    this will turn into C power for me; until then, I promise to keep
    reading.

    Thanks again for the complete reply!

    Steven.


    On 27 Dec 2005 19:54:32 GMT, Chris Torek <> wrote:
    > [full quote of the previous reply snipped]
     
    Steven, Dec 27, 2005
    #3
  4. Christopher Benson-Manica

    Steven wrote:

    > bigr[0] = calloc(MAXWORDLEN, sizeof(char));
    > bigr[1] = calloc(MAXWORDLEN, sizeof(char));


    Far be it from me to nitpick the outstanding reply you already
    received, but you should check the return value of calloc() before
    continuing.
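    A minimal sketch of such a check, assuming the two buffers are set
    up together. alloc_bigram() is a hypothetical helper, not part of
    the original program, and it expects len to be greater than zero.

```c
#include <stdlib.h>

/* Allocate both zero-filled bigram buffers of len bytes each.
   Returns 0 on success; on failure returns -1 and leaves both
   pointers NULL with nothing allocated. */
int alloc_bigram(char *bigr[2], size_t len)
{
    bigr[0] = calloc(len, sizeof *bigr[0]);
    bigr[1] = calloc(len, sizeof *bigr[1]);
    if (bigr[0] == NULL || bigr[1] == NULL) {
        free(bigr[0]);          /* free(NULL) is a harmless no-op */
        free(bigr[1]);
        bigr[0] = bigr[1] = NULL;
        return -1;
    }
    return 0;
}
```

    In main() the call would be something like
    "if (alloc_bigram(bigr, MAXWORDLEN) != 0) return EXIT_FAILURE;"
    before the reading loop starts.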

    --
    Christopher Benson-Manica | I *should* know what I'm talking about - if I
    ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
     
    Christopher Benson-Manica, Dec 28, 2005
    #4
