parse two field file

Discussion in 'C Programming' started by Richard, Dec 17, 2006.

  1. Richard

    Richard Guest

    Which way would you guys recommend to best parse a multiline file which contains
    two fields separated by a tab? In this case it's the
    Linux /proc/filesystems file, a sample of which I have included below:

    nodev   usbfs
            ext3
    nodev   fuse
            vfat
            ntfs
    nodev   binfmt_misc
            udf
            iso9660

    The first field can be "empty" and consist of only a single tab
    character. The separator is a tab.

    Is sscanf best suited to this? Or use strtok/strtok_r?

    The field I am really interested in is the second one : any hints & tips
    appreciated as to do this in the most efficient manor.

    --
    Richard, Dec 17, 2006
    #1

  2. Richard

    CBFalconer Guest

    Richard wrote:
    >
    > Which way would you guys recommened to best parse a multiline file
    > which contains two fields seperated by a tab. In this case its the
    > linux/proc/filesystems file a sample of which I have included below:
    >
    > nodev usbfs
    > ext3
    > nodev fuse
    > vfat
    > ntfs
    > nodev binfmt_misc
    > udf
    > iso9660
    >
    > The first field can be "empty" and concist of only a single tab
    > character. The seperator is a tab.
    >
    > Is sscanf best suited to this? Or use strtok/strtok_r?
    >
    > The field I am really interested in is the second one : any hints
    > & tips appreciated as to do this in the most efficient manor.


    Use toksplit. Call with tokchar set to '\t'. Std C code follows:

    /* ------- file toksplit.h ----------*/
    #ifndef H_toksplit_h
    # define H_toksplit_h

    # ifdef __cplusplus
    extern "C" {
    # endif

    #include <stddef.h>

    /* copy over the next token from an input string, after
    skipping leading blanks (or other whitespace?). The
    token is terminated by the first appearance of tokchar,
    or by the end of the source string.

    The caller must supply sufficient space in token to
    receive any token, Otherwise tokens will be truncated.

    Returns: a pointer past the terminating tokchar.

    This will happily return an infinity of empty tokens if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tokchar.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    */

    const char *toksplit(const char *src, /* Source of tokens */
                         char tokchar,    /* token delimiting char */
                         char *token,     /* receiver of parsed token */
                         size_t lgh);     /* length token can receive */
                                          /* not including final '\0' */

    # ifdef __cplusplus
    }
    # endif
    #endif
    /* ------- end file toksplit.h ----------*/

    /* ------- file toksplit.c ----------*/
    #include "toksplit.h"

    /* copy over the next token from an input string, after
    skipping leading blanks (or other whitespace?). The
    token is terminated by the first appearance of tokchar,
    or by the end of the source string.

    The caller must supply sufficient space in token to
    receive any token, Otherwise tokens will be truncated.

    Returns: a pointer past the terminating tokchar.

    This will happily return an infinity of empty tokens if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tokchar.

    A better name would be "strtkn", except that is reserved
    for the system namespace. Change to that at your risk.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    Revised 2006-06-13
    */

    const char *toksplit(const char *src, /* Source of tokens */
                         char tokchar,    /* token delimiting char */
                         char *token,     /* receiver of parsed token */
                         size_t lgh)      /* length token can receive */
                                          /* not including final '\0' */
    {
        if (src) {
            while (' ' == *src) src++;

            while (*src && (tokchar != *src)) {
                if (lgh) {
                    *token++ = *src;
                    --lgh;
                }
                src++;
            }
            if (*src && (tokchar == *src)) src++;
        }
        *token = '\0';
        return src;
    } /* toksplit */

    #ifdef TESTING
    #include <stdio.h>

    #define ABRsize 6 /* length of acceptable token abbreviations */

    /* ---------------- */

    static void showtoken(int i, char *tok)
    {
        putchar(i + '1'); putchar(':');
        puts(tok);
    } /* showtoken */

    /* ---------------- */

    int main(void)
    {
        char teststring[] = "This is a test, ,, abbrev, more";

        const char *t, *s = teststring;
        int i;
        char token[ABRsize + 1];

        puts(teststring);
        t = s;
        for (i = 0; i < 4; i++) {
            t = toksplit(t, ',', token, ABRsize);
            showtoken(i, token);
        }

        puts("\nHow to detect 'no more tokens' while truncating");
        t = s; i = 0;
        while (*t) {
            t = toksplit(t, ',', token, 3);
            showtoken(i, token);
            i++;
        }

        puts("\nUsing blanks as token delimiters");
        t = s; i = 0;
        while (*t) {
            t = toksplit(t, ' ', token, ABRsize);
            showtoken(i, token);
            i++;
        }
        return 0;
    } /* main */

    #endif
    /* ------- end file toksplit.c ----------*/
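
    As a rough sketch of how toksplit might be applied to the file in question: the first call uses '\t' as tokchar to step over the (possibly empty) first field, and a second call uses '\n' so the newline left by fgets() is not copied. The helper name second_field_toksplit and the 32-byte scratch buffer are assumptions made for illustration; toksplit itself is repeated from the listing above only to keep the block self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* toksplit as posted above, repeated so this block compiles on its own. */
const char *toksplit(const char *src, char tokchar,
                     char *token, size_t lgh)
{
    if (src) {
        while (' ' == *src) src++;
        while (*src && (tokchar != *src)) {
            if (lgh) {
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;
    }
    *token = '\0';
    return src;
}

/* Hypothetical helper (not from the thread): copy the second
   tab-separated field of `line` into `out`, which holds `outsize` bytes.
   Returns 0 on success, -1 if no second field was found. */
int second_field_toksplit(const char *line, char *out, size_t outsize)
{
    char first[32];                 /* assumed scratch size for field 1 */
    const char *rest;

    assert(line != NULL && out != NULL && outsize > 0);
    rest = toksplit(line, '\t', first, sizeof first - 1);
    toksplit(rest, '\n', out, outsize - 1);   /* stop before the newline */
    return out[0] != '\0' ? 0 : -1;
}
```

    Because toksplit skips only leading blanks and not tabs, a line that starts with a tab yields an empty first token, which is the behaviour this file format needs.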

    --
    Chuck F (cbfalconer at maineline dot net)
    Available for consulting/temporary embedded and systems.
    <http://cbfalconer.home.att.net>
    CBFalconer, Dec 17, 2006
    #2

  3. Richard

    Malcolm Guest

    "Richard" <> wrote in message
    news:...
    >
    > Which way would you guys recommened to best parse a multiline file which
    > contains
    > two fields seperated by a tab. In this case its the
    > linux/proc/filesystems file a sample of which I have included below:
    >
    > nodev usbfs
    > ext3
    > nodev fuse
    > vfat
    > ntfs
    > nodev binfmt_misc
    > udf
    > iso9660
    >
    > The first field can be "empty" and concist of only a single tab
    > character. The seperator is a tab.
    >
    > Is sscanf best suited to this? Or use strtok/strtok_r?
    >
    > The field I am really interested in is the second one : any hints & tips
    > appreciated as to do this in the most efficient manor.
    >

    The input format is slightly quirky, so the best solution is to call fgets()
    to read a line and then parse it yourself.

    int checkheader(char *str)

    can check whether the string is a header or not by looking for the tab or
    counting whitespace.

    parseheader(char *str, char *field1, char *field2)

    will pull out the fields for you. Make sure you reject over-long strings.
    Then the data fields only contain one string.

    However

    void trim(char *str)

    which removes leading and trailing whitespace is a good function to have.

    so too is
    int checkblank(char *str)

    which checks for strings which consist entirely of whitespace characters.
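
    One possible shape for the two utility functions named above, as a minimal sketch. Only the names trim and checkblank come from the post; the bodies are an assumption, and checkblank takes a const pointer here since it does not modify its argument.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns nonzero if str is empty or consists entirely of white space. */
int checkblank(const char *str)
{
    assert(str != NULL);
    while (*str != '\0') {
        if (!isspace((unsigned char)*str))
            return 0;
        str++;
    }
    return 1;
}

/* Removes leading and trailing white space from str, in place. */
void trim(char *str)
{
    size_t start = 0, len;

    assert(str != NULL);
    len = strlen(str);
    while (len > 0 && isspace((unsigned char)str[len - 1]))
        len--;                              /* drop trailing white space */
    while (start < len && isspace((unsigned char)str[start]))
        start++;                            /* skip leading white space */
    memmove(str, str + start, len - start);
    str[len - start] = '\0';
}
```

    Calling trim() on a line such as "\text3\n" leaves just the file system name, which is enough when only the second field matters and the first is empty.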
    --
    www.personal.leeds.ac.uk/~bgy1mm
    freeware games to download.
    Malcolm, Dec 17, 2006
    #3
  4. Richard

    Richard Guest

    "Malcolm" <> writes:

    > "Richard" <> wrote in message
    > news:...
    >>
    >> Which way would you guys recommened to best parse a multiline file which
    >> contains
    >> two fields seperated by a tab. In this case its the
    >> linux/proc/filesystems file a sample of which I have included below:
    >>
    >> nodev usbfs
    >> ext3
    >> nodev fuse
    >> vfat
    >> ntfs
    >> nodev binfmt_misc
    >> udf
    >> iso9660
    >>
    >> The first field can be "empty" and concist of only a single tab
    >> character. The seperator is a tab.
    >>
    >> Is sscanf best suited to this? Or use strtok/strtok_r?
    >>
    >> The field I am really interested in is the second one : any hints & tips
    >> appreciated as to do this in the most efficient manor.
    >>

    > The input format is slightly quirky, so the best solution is to call fgets()
    > to read a line and then parse it yourself.
    >
    > int checkheader(char *str)
    >
    > ccan check whether the string is a header or not by looking for the tab or
    > counting whitespace.
    >
    > parseheader(char *str, char *field1, char *field2)
    >
    > will pull out the fields for you. make sure you reject over-long strings.
    > Then the data fields only contain one string.
    >
    > However
    >
    > void trim(char *str)
    >
    > which removes leading and trailing whitespace is a good function to have.
    >
    > so too is
    > int checkblank(char *str)
    >
    > which checks for strings which consist entirely of whitespace
    > characters.


    I just did sscanf("%s%s",f1,f2) in the end.
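
    A sketch of how that call might be made a little safer, assuming 64-byte buffers: the field widths bound each conversion, and the return value tells whether one or two fields were found. The helper name second_field_sscanf and the FIELD_MAX limit are illustrative, not from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 63   /* assumed limit; buffers need FIELD_MAX + 1 bytes */

/* Hypothetical helper: copy the second field of `line` into `out`
   (at least FIELD_MAX + 1 bytes). Returns 0 on success, -1 if the
   line contains no field at all. */
int second_field_sscanf(const char *line, char *out)
{
    char f1[FIELD_MAX + 1], f2[FIELD_MAX + 1];
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, "%63s%63s", f1, f2);   /* widths match FIELD_MAX */
    if (n == 2)
        strcpy(out, f2);        /* both fields present */
    else if (n == 1)
        strcpy(out, f1);        /* first field was empty */
    else
        return -1;              /* blank line: nothing converted */
    return 0;
}
```

    Since %s skips any leading white space, a line with an empty first field produces a single conversion, so the count is what distinguishes the two cases.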

    --
    Richard, Dec 17, 2006
    #4
  5. Richard

    Eric Sosman Guest

    Richard wrote:
    > Which way would you guys recommened to best parse a multiline file which contains
    > two fields seperated by a tab. In this case its the
    > linux/proc/filesystems file a sample of which I have included below:
    >
    > nodev usbfs
    > ext3
    > nodev fuse
    > vfat
    > ntfs
    > nodev binfmt_misc
    > udf
    > iso9660
    >
    > The first field can be "empty" and concist of only a single tab
    > character. The seperator is a tab.
    >
    > Is sscanf best suited to this? Or use strtok/strtok_r?


    strtok(..., "\t") will give the same result for "\tfoo"
    and "\t\tfoo\t" and "foo". If you *know* that the input has
    two tab-separated fields and that only the first (never the
    second) can be empty, you can get this to work: If strtok()
    finds two fields they are #1 and #2, but if it finds only
    one it is #2 with #1 empty.
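
    The field-counting idea could be sketched roughly as follows. The helper name second_field_strtok is made up for illustration, and '\n' is added to the delimiter set (an assumption beyond the text above) so that the newline kept by fgets() is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the second field of `line`,
   or NULL if the line holds no field at all. strtok() writes '\0'
   bytes into `line`, so the caller must pass a modifiable buffer. */
char *second_field_strtok(char *line)
{
    char *first, *second;

    assert(line != NULL);
    first = strtok(line, "\t\n");
    if (first == NULL)
        return NULL;                    /* blank line */
    second = strtok(NULL, "\t\n");
    /* Two tokens: they are #1 and #2. One token: it is #2, with #1 empty. */
    return second != NULL ? second : first;
}
```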

    However, it makes me queasy to put that much faith in an
    input source I don't control programmatically. Who knows?
    Maybe in six months somebody will extend the format, adding
    an optional third field. If that happened, then the field-
    counting approach would misinterpret "\tfoo\tbar" as if it
    were "foo\tbar". It would be better to adopt a method that
    would complain about "\tfoo\tbar" than to be fooled by it.

    fgets() plus sscanf() is a possibility, but it's a bit
    tricky to use: The obvious "%s\t%s" will not do what you
    want. (The first "%s" will skip any leading white space,
    leaving you in the same hole as the strtok() approach, and
    the "\t" will match any amount of any kind of white space,
    tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
    a little better, but still wouldn't be fully satisfactory:
    It would match the prefix of "foo\tbar baz goozle frobnitz"
    without any warning of the trailing junk. You could use
    "%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
    consumed the entire string ...

    ... but wouldn't it be simpler just to pick the line
    apart for yourself? Read it in with fgets(), use strchr()
    to find the first tab (syntax error if there isn't one), and
    the first (possibly empty) field is everything from the start
    to just before the tab. Then start just after the tab and use
    strchr() again to find the terminating '\n'; the second field
    is everything from just after the tab to just before the '\n'
    (syntax error if its length is zero). You can use strcspn()
    to check that the second field contains no white space and
    squawk if it does (somebody added a third field you don't
    understand).
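
    The steps above can be sketched as a single function. The name split_fs_line and the caller-supplied buffer sizes are assumptions for illustration; the sketch also uses strcspn() rather than strchr() to find the end of the line, so that a final line without a '\n' is still accepted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the thread: split one line as read by
   fgets() into two tab-separated fields.
   Returns 0 on success, -1 on a syntax error or if a field does not fit. */
int split_fs_line(const char *line,
                  char *f1, size_t f1size,
                  char *f2, size_t f2size)
{
    const char *tab, *second;
    size_t len1, len2;

    assert(line != NULL && f1 != NULL && f2 != NULL);

    tab = strchr(line, '\t');
    if (tab == NULL)
        return -1;                      /* no tab: syntax error */
    len1 = (size_t)(tab - line);        /* may be zero: empty first field */

    second = tab + 1;
    len2 = strcspn(second, "\n");       /* stop at '\n' or end of string */
    if (len2 == 0)
        return -1;                      /* empty second field */
    if (strcspn(second, " \t\r\n") != len2)
        return -1;                      /* white space inside field 2 */

    if (len1 >= f1size || len2 >= f2size)
        return -1;                      /* reject over-long fields */

    memcpy(f1, line, len1);
    f1[len1] = '\0';
    memcpy(f2, second, len2);
    f2[len2] = '\0';
    return 0;
}
```

    With this shape, a line such as "\tfoo\tbar\n" is rejected rather than silently read as two fields.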

    > The field I am really interested in is the second one : any hints & tips
    > appreciated as to do this in the most efficient manor.


    The "most efficient manor" is the house of Usher. Resist
    this unnecessary impulse for efficiency, lest your program meet
    the same fate as did that storied manse.

    (In other words: How long is this file, anyhow? How many
    times will you scan its contents? If you sped up the scanning
    by a factor of four hundred twenty gazillion, how much faster
    would the program as a whole run? If you give your SUV a coat
    of wax, will you improve its fuel economy by making it slipperier
    or harm it by adding weight?)

    --
    Eric Sosman
    Eric Sosman, Dec 17, 2006
    #5
  6. On Sun, 17 Dec 2006 01:10:16 +0100, Richard <> wrote:
    > Which way would you guys recommened to best parse a multiline file
    > which contains two fields seperated by a tab. In this case its the
    > linux/proc/filesystems file a sample of which I have included below:
    >
    > nodev usbfs
    > ext3
    > nodev fuse
    > vfat
    > ntfs
    > nodev binfmt_misc
    > udf
    > iso9660
    >
    > The first field can be "empty" and concist of only a single tab
    > character. The seperator is a tab.
    >
    > Is sscanf best suited to this? Or use strtok/strtok_r?


    strtok() is not so nice, because it tries to modify the string you pass
    to it. I would probably use strcspn() for this, with something like:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXLINE 256

    static void doline(char *buf, size_t bufsize);

    int
    main(void)
    {
        char buf[MAXLINE];
        FILE *fp;

        /*
         * Add code here that opens /proc/filesystems file, instead of using
         * `stdin' as the input file.
         */
        fp = stdin;

        clearerr(fp);
        while (fgets(buf, sizeof buf, fp) != NULL) {
            doline(buf, sizeof buf);
        }
        if (ferror(fp) != 0) {
            perror("fgets");
            exit(EXIT_FAILURE);
        }
        /*
         * Add code here that closes the open file referenced by `fp'.
         */

        return EXIT_SUCCESS;
    }

    static void
    doline(char *buf, size_t bufsize)
    {
        char *field;
        size_t pos, pos2, fieldsize;

        assert(buf != NULL && bufsize > 0);
        (void)bufsize;

        pos = strcspn(buf, "\t");
        if (buf[pos] == '\0') {
            fprintf(stderr,
                "warning: no TAB in `%s', skipping this line\n", buf);
            return;
        }
        pos2 = strcspn(buf + pos + 1, "\t");

        fieldsize = pos2 + 1;
        field = malloc(fieldsize);
        if (field == NULL) {
            perror("malloc");
            return;
        }
        strncpy(field, buf + pos + 1, fieldsize - 1);
        field[fieldsize - 1] = '\0';
        field[strcspn(field, "\n\r")] = '\0';
        printf("%s\n", field);
        free(field);
    }

    The trick is to use strcspn() to find the part of the original
    string you are interested in; then you can do whatever you like
    with that part. In this particular program, I temporarily allocate
    a new string buffer, copy the field into it, print the buffer, and
    release its memory. Any other way you can think of to use this
    substring is fine too :)
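
    As one such alternative, the field can be located without copying it at all. This is a sketch under the same assumptions as the program above; the helper name find_second_field is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: locate the second field inside `line` without
   copying. Stores its length in *len and returns a pointer to its first
   character, or NULL when the line has no tab. */
const char *find_second_field(const char *line, size_t *len)
{
    size_t pos;

    assert(line != NULL && len != NULL);
    pos = strcspn(line, "\t");
    if (line[pos] == '\0')
        return NULL;                            /* no TAB in this line */
    *len = strcspn(line + pos + 1, "\t\r\n");   /* field ends here */
    return line + pos + 1;
}
```

    The result is not NUL-terminated, so it has to be used together with its length, for example with printf("%.*s\n", (int)len, p).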
    Giorgos Keramidas, Dec 26, 2006
    #6
  7. On Sun, 17 Dec 2006 10:37:28 -0500, Eric Sosman
    <> wrote:

    > Richard wrote:
    > > Which way would you guys recommened to best parse a multiline file which contains
    > > two fields seperated by a tab. <snip>

    > strtok(..., "\t") will [lose empty fields]


    Right.

    > fgets() plus sscanf() is a possibility, but it's a bit
    > tricky to use: The obvious "%s\t%s" will not do what you
    > want. (The first "%s" will skip any leading white space,
    > leaving you in the same hole as the strtok() approach, and
    > the "\t" will match any amount of any kind of white space,
    > tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
    > a little better, but still wouldn't be fully satisfactory:


    Not enough better. If the first field is empty and thus the first
    %[^\t] matches nothing, *scanf stops and doesn't do the %*1[\t]s.

    This is effectively the same problem of the people who periodically
    try to use {,f}scanf to replace <ILLEGAL> fflush (input) </>.
    (Some people, including IIRC Dan Pop, have recommended e.g.
    if( scanf ("%*[^\n]%*1[\n]") < 2 ) getchar ();
    but I consider that too much uglier than the obvious, though slightly
    longer and possibly slightly less efficient
    while( (ch = getchar()) != EOF && ch != '\n' ) ;
    etc.)

    Plus unbounded %[...] or %s risks buffer overflow and resulting UB.
    You should specify a length at most one less than the buffer size.
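
    Both points can be seen in a small sketch that wraps the bounded form of the format and hands back sscanf()'s return value. The name scan_two_fields and the 64-byte buffers are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns sscanf()'s conversion count for the
   bounded form of the format under discussion. Each buffer must hold
   at least 64 bytes, matching the width of 63 in the format. */
int scan_two_fields(const char *line, char f1[64], char f2[64])
{
    assert(line != NULL && f1 != NULL && f2 != NULL);
    f1[0] = f2[0] = '\0';
    return sscanf(line, "%63[^\t]%*1[\t]%63s", f1, f2);
}
```

    For a line with both fields the count is 2. For a line that begins with a tab, the first %[^\t] matches nothing, the scan stops there, and the count is 0 with the second buffer left empty.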

    > It would match the prefix of "foo\tbar baz goozle frobnitz"
    > without any warning of the trailing junk. You could use
    > "%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
    > consumed the entire string ...
    >
    > ... but wouldn't it be simpler just to pick the line
    > apart for yourself? Read it in with fgets(), use strchr()
    > to find the first tab <snip>


    Yes.

    > The "most efficient manor" is the house of Usher. Resist
    > this unnecessary impulse for efficiency, lest your program meet
    > the same fate as did that storied manse.
    >

    Yes. Or even the hundred-year shay, IIRC grade school. <G>

    - David.Thompson1 at worldnet.att.net
    Dave Thompson, Jan 3, 2007
    #7
