(f)scanf Question - Grab String of Spaces

Discussion in 'C Programming' started by NvrBst, Apr 7, 2009.

  1. NvrBst

    NvrBst Guest

    I have a file full of data that I want to tokenize. My function works
    as long as the data I want to grab isn't padded with whitespace;
    however, I want to preserve the padding. Can I modify the fscanf
    format to include it in the match?


    ---Example File---
    MyKey1: INT, 3341, 1
    MyKey2: STRING, Hello World, 1
    MyKey3: STRING,     , 1

    --Format is Like so "KEYWORD: TYPE, Data1, Data2"---
    fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);

    When it gets to "MyKey3" it fails to match P3 and thus returns 2
    (two items assigned). I want P3 to be "    ". Shouldn't "%32[^,]" match
    anything but ",", including spaces? Is there a way around this, or a
    different way I should be tokenizing such data?

    Note: P1/P2/P3 are just "char[32+1]"'s. P4 is an int.

    Thanks in advance. I'm using GNU GCC 4.3.2 on an Ubuntu machine with
    the latest Eclipse CDT.
    NvrBst, Apr 7, 2009
    #1

  2. NvrBst

    NvrBst Guest

    On Apr 7, 2:13 pm, Eric Sosman <> wrote:
    > NvrBst wrote:
    > > I have a file full of data that I want to tokenize.  My function works
    > > as long as the data I want to grab doesn't have padded whitespaces,
    > > however, I want to preserve the padded whitespaces. Can I modify
    > > fscanf to include them in the match?

    >
    > > ---Example File---
    > > MyKey1: INT, 3341, 1
    > > MyKey2: STRING, Hello World, 1
    > > MyKey3: STRING,     , 1

    >
    > > --Format is Like so "KEYWORD: TYPE, Data1, Data2"---
    > > fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);

    >
    > > When it gets to "MyKey3" it fails to match P3 thus returns 2
    > > elements.  I want P3 to be "    ".  Shouldn't "%32[^,]" be matching
    > > anything but ",", aka spaces as well?  A way around this?  Different
    > > way I should be tokenizing such data?

    >
    >      Yes, "%32[^,]" matches anything other than a comma, including
    > spaces.  But the spaces have already been swallowed by the " "
    > you put right before it.  If you want to preserve the spaces, don't
    > write a " " directive to gobble them up.  (If you want to gobble
    > exactly one space, try "%*1[ ]" instead.)
    >
    > --
    >


    Ahh I didn't know " " gobbles more than one :) The %*1[ ] made
    everything work perfectly. Thank you kindly
    NvrBst, Apr 7, 2009
    #2

  3. Eric Sosman

    Eric Sosman Guest

    NvrBst wrote:
    > [... concerning fscanf() ...]
    > Ahh I didn't know " " gobbles more than one :) The %*1[ ] made
    > everything work perfectly. Thank you kindly


    For future reference, observe that *any* kind of white
    space in the format string matches *any* kind of white space
    in the input stream. For example, the format " " matches
    the inputs " ", "    ", "\n\n", "\t \n \t \n \f", and so on.

    --
    Eric Sosman
    Eric Sosman, Apr 8, 2009
    #3
  4. Guest

    Guest Guest

    NvrBst <> wrote:
    > I have a file full of data that I want to tokenize. My function works
    > as long as the data I want to grab doesn't have padded whitespaces,
    > however, I want to preserve the padded whitespaces. Can I modify
    > fscanf to include them in the match?
    >
    >
    > ---Example File---
    > MyKey1: INT, 3341, 1
    > MyKey2: STRING, Hello World, 1
    > MyKey3: STRING,     , 1
    >
    > --Format is Like so "KEYWORD: TYPE, Data1, Data2"---
    > fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);
    >
    > When it gets to "MyKey3" it fails to match P3 thus returns 2
    > elements. I want P3 to be "    ". Shouldn't "%32[^,]" be matching
    > anything but ",", aka spaces as well? A way around this? Different
    > way I should be tokenizing such data?
    >
    > Note: P1/P2/P3 are just "char[32+1]"'s. P4 is an int.
    >
    > Thanks in Advance; I'm using GNU GCC 4.3.2 on a Ubuntu Machine w/
    > Latest Eclipse CDT.


    I suggest writing your own overflow-free getline() function to read a
    whole line from a file, something like this:

    char* getline (FILE *fp) {
        char *line = NULL;
        char ch;
        unsigned int size=0;

        while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
            line = (char*) realloc(line,++size);
            line[size-1]=ch;
        }

        line[size]=0;
        return line;
    }

    Then you can parse the line obtained this way using regex.h functions
    to match what you like, using "," as separator.

    --
    -----BEGIN GEEK CODE BLOCK-----
    GCS/CM/CC/E/IT/LS/M d-(--) C++++$ UBL++++$ P++++ L+++++$ E--- W+++ w--
    PS+++ PE-- Y++ PGP+++ R++ tv-- b++>+++ D+ G>+++ e++>+++++ h* r++ z+++
    ------END GEEK CODE BLOCK------
    Guest, Apr 9, 2009
    #4
  5. Keith Thompson

    <> writes:
    [...]
    > I can suggest you to develop a self-made and overflow-free getline()
    > method to get the whole line in a file, something like this:
    >
    > char* getline (FILE *fp) {
    > char *line = NULL;
    > char ch;
    > unsigned int size=0;
    >
    > while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
    > line = (char*) realloc(line,++size);
    > line[size-1]=ch;
    > }
    >
    > line[size]=0;
    > return line;
    > }

    [...]

    Something like that, but not exactly like it.

    fgetc() returns an int, not a char, so ch should be of type int.

    feof() doesn't do quite what you seem to be assuming it does. It can
    be called *after* fgetc() returns EOF, to determine whether it did so
    because it reached the end of the file or because it encountered an
    error. You should check whether ch is equal to EOF (which is why it
    needs to be an int) *instead* of calling feof(). See section 12 of
    the comp.lang.c FAQ, <http://www.c-faq.com/>.

    I don't think checking for '\r' makes much sense; if the file is in
    text mode, and you're on a system where end-of-line is represented as
    a CR LF pair, then that sequence will be converted to '\n' anyway. If
    you're on a system where end-of-line is represented as '\n', you might
    see '\r' in a text file copied from another system, but adding
    special-case code for it is questionable.

    Calling realloc() for each character read is likely to be inefficient,
    and may cause heap fragmentation on some systems. A common scheme is
    to double the allocated size when you run out of room; you can then do
    a final realloc() to shrink it down to what's needed. The cast is
    superfluous, and can mask errors.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Apr 9, 2009
    #5
  6. Richard Tobin

    In article <>,
    Keith Thompson <> wrote:

    >Calling realloc() for each character read is likely to be inefficient,
    >and may cause heap fragmentation on some systems. A common scheme is
    >to double the allocated size when you run out of room; you can then do
    >a final realloc() to shrink it down to what's needed.


    All the realloc() implementations I've checked recently effectively do
    this internally anyway. They don't have the quadratic behaviour you'd
    get if they had to copy each time, or every Nth time. (This might not
    be true if other *alloc() calls were interleaved with the realloc()s;
    I didn't test that.)

    -- Richard
    --
    Please remember to mention me / in tapes you leave behind.
    Richard Tobin, Apr 9, 2009
    #6
  7. Keith Thompson

    Eric Sosman <> writes:
    > Keith Thompson wrote:
    >> <> writes:
    >> [...]
    >>>
    >>> char* getline (FILE *fp) {
    >>> char *line = NULL;
    >>> char ch;
    >>> unsigned int size=0;
    >>>
    >>> while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
    >>> line = (char*) realloc(line,++size);
    >>> line[size-1]=ch;
    >>> }
    >>>
    >>> line[size]=0;
    >>> return line;
    >>> }

    >> [...]
    >>
    >> feof() doesn't do quite what you seem to be assuming it does. It can
    >> be called *after* fgetc() returns EOF, to determine whether it did so
    >> because it reached the end of the file or because it encountered an
    >> error. [...]

    >
    > Keith, you're about to kick yourself :) Thanks to the
    > sequence points accompanying the `&&' operators, feof() *is*
    > being called after fgetc() is called; it only looks "predictive"
    > because it's at the top of the loop.


    True, but calling feof() still isn't the right way to check whether
    you've run out of input. If there's an error, fgetc() will return EOF
    and feof() will return 0 (but ferror() will return a true value).

    You're right that feof() is called after fgetc() in the posted code,
    and calling both feof() and ferror() would probably make the code
    work, but checking whether fgetc() returned EOF is still better.

    > By the way, feof() and ferror() can be called at any time
    > on an open stream; there's no need to wait until some other
    > I/O function indicates an abnormality.


    Right -- but if you just called fgetc() and it *didn't* return EOF,
    then feof() and ferror() should both return 0, and there's no point in
    calling them.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Apr 9, 2009
    #7
  8. Keith Thompson

    Eric Sosman <> writes:
    > Keith Thompson wrote:
    >> Eric Sosman <> writes:
    >>> [...]
    >>> By the way, feof() and ferror() can be called at any time
    >>> on an open stream; there's no need to wait until some other
    >>> I/O function indicates an abnormality.

    >>
    >> Right -- but if you just called fgetc() and it *didn't* return EOF,
    >> then feof() and ferror() should both return 0, and there's no point in
    >> calling them.

    >
    > feof() and ferror() are "sticky:" if a transient failure bollixes
    > one I/O operation and then a subsequent operation succeeds, the success
    > of the second does not clear the stream's eof or error indicator. One
    > situation where this arises with some frequency is in handling input
    > from an interactive device that allows further input after an end-of-
    > input indication like ^Z or ^D: One fgetc() could return EOF due to
    > the transient end-of-input condition, and the next fgetc() could
    > succeed and return an actual input character. feof() would return
    > true even after the second fgetc() succeeded.


    I don't think that's legal behavior for a conforming implementation.
    C99 7.19.7.1p3:

    If the end-of-file indicator for the stream is set, or if the
    stream is at end-of-file, the end-of-file indicator for the stream
    is set and the fgetc function returns EOF. Otherwise, the fgetc
    function returns the next character from the input stream pointed
    to by stream. If a read error occurs, the error indicator for the
    stream is set and the fgetc function returns EOF.

    In other words, the standard doesn't allow for a "transient
    end-of-input condition", though you can explicitly reset it by calling
    fseek(stream, 0, SEEK_CUR).

    But a quick experiment shows that at least one implementation doesn't
    behave as the standard specifies; fgetc() can return something other
    than EOF even when the end-of-file indicator is set.

    > Perhaps a more usual case is to "summarize" the outcome of a lot
    > of I/O operations, as an alternative to testing each one for failure
    > at the time it's attempted. For example, a program might make a large
    > number of fprintf() calls from a large number of places in the code,
    > such that testing each individual fprintf()'s return value would be
    > cumbersome. As an alternative, the program could simply ignore the
    > returned values until the very end, finishing up with something like
    >
    > if (ferror(stream)) {
    > ... something went wrong ...
    > }
    > else if (fclose(stream) != 0) {
    > ... something else went wrong ...
    > }
    > else {
    > ... all is well ...
    > }


    Yes, that can work (but it's not what's going on in the posted code,
    which calls feof() after each fgetc()).

    > (Of course, this technique is not a panacea. If the program's
    > fourth fprintf() call bumps up against "disk quota exceeded," it would
    > be nicer to discover the problem fairly promptly than to wait until
    > after another four million fprintf()'s had also failed ...)


    It's also possible that one fprintf() call can hit a "disk quota
    exceeded" error, but the next call, either because it produces less
    output or because some disk space has been freed, might be successful,
    so you could get gaps in your output. If the implementation is
    conforming, the error flag should prevent any further fprintf() calls
    from succeeding until the flag is reset, but if the implementation
    doesn't conform -- well, then there are no guarantees anyway.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Apr 9, 2009
    #8
  9. Larry Jones

    Guest

    Eric Sosman <> wrote:
    >
    > feof() and ferror() are "sticky:" if a transient failure bollixes
    > one I/O operation and then a subsequent operation succeeds, the success
    > of the second does not clear the stream's eof or error indicator. One
    > situation where this arises with some frequency is in handling input
    > from an interactive device that allows further input after an end-of-
    > input indication like ^Z or ^D: One fgetc() could return EOF due to
    > the transient end-of-input condition, and the next fgetc() could
    > succeed and return an actual input character.


    Not in a conforming C implementation. It's the underlying end-of-file
    and error indicators that are sticky, not feof() and ferror(). And
    fgetc() is required to fail if either of the indicators is set; it is
    not allowed to return any subsequent input, even if there is some. If
    you want to read past EOF (when that's possible), you have to call
    clearerr() to reset the indicators first.
    --
    Larry Jones

    Mom would be a lot more fun if she was a little more gullible. -- Calvin
    , Apr 9, 2009
    #9
  10. CBFalconer

    CBFalconer Guest

    wrote:
    >
    > NvrBst <> wrote:
    > > I have a file full of data that I want to tokenize. My function works
    > > as long as the data I want to grab doesn't have padded whitespaces,
    > > however, I want to preserve the padded whitespaces. Can I modify
    > > fscanf to include them in the match?
    > >
    > >
    > > ---Example File---
    > > MyKey1: INT, 3341, 1
    > > MyKey2: STRING, Hello World, 1
    > > MyKey3: STRING,     , 1
    > >
    > > --Format is Like so "KEYWORD: TYPE, Data1, Data2"---
    > > fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);
    > >
    > > When it gets to "MyKey3" it fails to match P3 thus returns 2
    > > elements. I want P3 to be "    ". Shouldn't "%32[^,]" be matching
    > > anything but ",", aka spaces as well? A way around this? Different
    > > way I should be tokenizing such data?
    > >
    > > Note: P1/P2/P3 are just "char[32+1]"'s. P4 is an int.
    > >
    > > Thanks in Advance; I'm using GNU GCC 4.3.2 on a Ubuntu Machine w/
    > > Latest Eclipse CDT.

    >
    > I can suggest you to develop a self-made and overflow-free getline()
    > method to get the whole line in a file, something like this:
    >
    > char* getline (FILE *fp) {
    > char *line = NULL;
    > char ch;
    > unsigned int size=0;
    >
    > while ((ch=fgetc(fp)) && ch!='\n' && ch!='\r' && !feof(fp)) {
    > line = (char*) realloc(line,++size);
    > line[size-1]=ch;
    > }
    >
    > line[size]=0;
    > return line;
    > }
    >
    > Then you can parse the line obtained this way using regex.h functions
    > to match what you like, using "," as separator.


    There are much simpler (and faster) ways to handle getline. Among
    them is ggets, available at:

    <http://cbfalconer.home.att.net/download/ggets.zip>

    I don't know if I supplied this earlier, in this thread. At any
    rate, try it out. #define TESTING to get a testable version.
    Don't define TESTING for normal use.

    /* ------- file tknsplit.h ----------*/
    #ifndef H_tknsplit_h
    # define H_tknsplit_h

    # ifdef __cplusplus
    extern "C" {
    # endif

    #include <stddef.h>

    /* copy over the next tkn from an input string, after
    skipping leading blanks (or other whitespace?). The
    tkn is terminated by the first appearance of tknchar,
    or by the end of the source string.

    The caller must supply sufficient space in tkn to
    receive any tkn, Otherwise tkns will be truncated.

    Returns: a pointer past the terminating tknchar.

    This will happily return an infinity of empty tkns if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tknchar.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    revised 2007-05-26 (name)
    */

    const char *tknsplit(const char *src, /* Source of tkns */
                         char tknchar,    /* tkn delimiting char */
                         char *tkn,       /* receiver of parsed tkn */
                         size_t lgh);     /* length tkn can receive */
                                          /* not including final '\0' */

    # ifdef __cplusplus
    }
    # endif
    #endif
    /* ------- end file tknsplit.h ----------*/

    /* ------- file tknsplit.c ----------*/
    #include "tknsplit.h"

    /* copy over the next tkn from an input string, after
    skipping leading blanks (or other whitespace?). The
    tkn is terminated by the first appearance of tknchar,
    or by the end of the source string.

    The caller must supply sufficient space in tkn to
    receive any tkn, Otherwise tkns will be truncated.

    Returns: a pointer past the terminating tknchar.

    This will happily return an infinity of empty tkns if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tknchar.

    A better name would be "strtkn", except that is reserved
    for the system namespace. Change to that at your risk.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    Revised 2006-06-13 2007-05-26 (name)
    */

    const char *tknsplit(const char *src, /* Source of tkns */
                         char tknchar,    /* tkn delimiting char */
                         char *tkn,       /* receiver of parsed tkn */
                         size_t lgh)      /* length tkn can receive */
                                          /* not including final '\0' */
    {
        if (src) {
            while (' ' == *src) src++;

            while (*src && (tknchar != *src)) {
                if (lgh) {
                    *tkn++ = *src;
                    --lgh;
                }
                src++;
            }
            if (*src && (tknchar == *src)) src++;
        }
        *tkn = '\0';
        return src;
    } /* tknsplit */

    #ifdef TESTING
    #include <stdio.h>

    #define ABRsize 6 /* length of acceptable tkn abbreviations */

    /* ---------------- */

    static void showtkn(int i, char *tok)
    {
        putchar(i + '1'); putchar(':');
        puts(tok);
    } /* showtkn */

    /* ---------------- */

    int main(void)
    {
        char teststring[] = "This is a test, ,, abbrev, more";

        const char *t, *s = teststring;
        int i;
        char tkn[ABRsize + 1];

        puts(teststring);
        t = s;
        for (i = 0; i < 4; i++) {
            t = tknsplit(t, ',', tkn, ABRsize);
            showtkn(i, tkn);
        }

        puts("\nHow to detect 'no more tkns' while truncating");
        t = s; i = 0;
        while (*t) {
            t = tknsplit(t, ',', tkn, 3);
            showtkn(i, tkn);
            i++;
        }

        puts("\nUsing blanks as tkn delimiters");
        t = s; i = 0;
        while (*t) {
            t = tknsplit(t, ' ', tkn, ABRsize);
            showtkn(i, tkn);
            i++;
        }
        return 0;
    } /* main */

    #endif
    /* ------- end file tknsplit.c ----------*/

    --
    [mail]: Chuck F (cbfalconer at maineline dot net)
    [page]: <http://cbfalconer.home.att.net>
    Try the download section.
    CBFalconer, Apr 10, 2009
    #10
  11. Eric Sosman

    Eric Sosman Guest

    wrote:
    > Eric Sosman <> wrote:
    >> feof() and ferror() are "sticky:" if a transient failure bollixes
    >> one I/O operation and then a subsequent operation succeeds, the success
    >> of the second does not clear the stream's eof or error indicator. One
    >> situation where this arises with some frequency is in handling input
    >> from an interactive device that allows further input after an end-of-
    >> input indication like ^Z or ^D: One fgetc() could return EOF due to
    >> the transient end-of-input condition, and the next fgetc() could
    >> succeed and return an actual input character.

    >
    > Not in a conforming C implementation. It's the underlying end-of-file
    > and error indicators that are sticky, not feof() and ferror(). And
    > fgetc() is required to fail if either of the indicators is set, it is
    > not allowed to return any subsequent input, even if there is some. If
    > you want to read past EOF (when that's possible), you have to call
    > clearerr() to reset the indicators first.


    I think you're right about feof(), because 7.19.7.1p2
    says that fgetc() fails if the eof indicator is set, and all
    the other input functions work "as if" by calling fgetc().
    But I don't see any similar language about ferror() and the
    error indicator. Can you offer a citation?

    --
    Eric Sosman
    Eric Sosman, Apr 10, 2009
    #11
  12. Guest

    Guest Guest

    NvrBst <> wrote:
    > I have a file full of data that I want to tokenize. My function works
    > as long as the data I want to grab doesn't have padded whitespaces,
    > however, I want to preserve the padded whitespaces. Can I modify
    > fscanf to include them in the match?
    >
    >
    > ---Example File---
    > MyKey1: INT, 3341, 1
    > MyKey2: STRING, Hello World, 1
    > MyKey3: STRING,     , 1
    >
    > --Format is Like so "KEYWORD: TYPE, Data1, Data2"---
    > fscanf(fFile, "%32[^:]: %32[^,], %32[^,], %d\n", P1, P2, P3, &P4);
    >
    > When it gets to "MyKey3" it fails to match P3 thus returns 2
    > elements. I want P3 to be "    ". Shouldn't "%32[^,]" be matching
    > anything but ",", aka spaces as well? A way around this? Different
    > way I should be tokenizing such data?
    >
    > Note: P1/P2/P3 are just "char[32+1]"'s. P4 is an int.
    >
    > Thanks in Advance; I'm using GNU GCC 4.3.2 on a Ubuntu Machine w/
    > Latest Eclipse CDT.


    Did you try using regex.h functions?

    --
    -----BEGIN GEEK CODE BLOCK-----
    GCS/CM/CC/E/IT/LS/M d-(--) C++++$ UBL++++$ P++++ L+++++$ E--- W+++ w--
    PS+++ PE-- Y++ PGP+++ R++ tv-- b++>+++ D+ G>+++ e++>+++++ h* r++ z+++
    ------END GEEK CODE BLOCK------
    Guest, Apr 10, 2009
    #12
  13. Larry Jones

    Guest

    Eric Sosman <> wrote:
    >
    > I think you're right about feof(), because 7.19.7.1p2
    > says that fgetc() fails if the eof indicator is set, and all
    > the other input functions work "as if" by calling fgetc().
    > But I don't see any similar language about ferror() and the
    > error indicator. Can you offer a citation?


    No, I was mistaken. Sorry for the confusion.
    --
    Larry Jones

    I'm so disappointed. -- Calvin
    , Apr 10, 2009
    #13
