Parsing two formatted text files

Discussion in 'C Programming' started by bfowlkes@gmail.com, Apr 1, 2006.

  1. Guest

    Hello,

    I am trying to parse two pre-formatted text files and write them to
    different files formatted in a different way. The story about this is I
    was hired along with about 20 other people and it seems we are trying
    to learn the whole C language in two weeks! To top it all off, I was an
    English Major, but I'm trying my best. Ok back to the program. So we
    have two files product_catalog.txt and sales_month.txt

    The info in product_catalog.txt looks like this:

    1010:CD drive external 32x :1MagiCopy:15.5:100
    1020:CD drive external 40x :20th Century Fox:16.74:130
    1030:CD drive external 48x :3COM:13.48:160
    1040:CD drive external 52x :4XEM:15.92:190

    We need to write it to another file that is going to look like this

    ID Number Description Provider Cost Stock Total
    1010 CD Drive 32x 1MagiCopy 15.50 100 1550.00

    Since the text file to be read from is preformatted I thought I could
    use fscanf() to parse each line and assign it into structure
    variables, but I am having problems.

    Here is my code to read the file:

    int readFile (char *filename, struct productData product[], size_t arrLen)
    /* Returns number of products read */
    {
        FILE *fp;

        if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
            printf( "File could not be opened.\n" );
        } /* end if */
        else
        {
            int i;
            for (i=0; i<arrLen && !feof(fp); i++)
            {
                if (5 != fscanf(fp, "%d %s %s %f %d",
                                &product[i].idnumber,
                                product[i].description,
                                product[i].provider,
                                &product[i].cost,
                                &product[i].stock))
                {
                    printf("Invalid file format\n");
                    fclose(fp);
                    return 0;
                }
            }
            fclose(fp);
            return i;
        }
    }

    The problem seems to be that each field I want to parse seems to be
    separated by a colon :)) Is there any way to tell fscanf() to parse up
    until you reach a colon and then stop and start scanning again, or
    should I give up this approach and try to tokenize the input stream?
    Any help is much appreciated.

    Brett
    , Apr 1, 2006
    #1

  2. Eric Sosman Guest

    wrote On 03/31/06 18:06,:
    > Hello,
    >
    > I am trying to parse two pre-formatted text files and write them to a
    > different files formatted in a different way. The story about this is I
    > was hired along with about 20 other people and it seems we are trying
    > to learn the whole C language in two weeks! To top it all off, I was an
    > English Major, but I'm trying my best. Ok back to the program. So we
    > have two files product_catalog.txt and sales_month.txt
    >
    > The info in product_catalog.txt looks like this:
    >
    > 1010:CD drive external 32x :1MagiCopy:15.5:100
    > 1020:CD drive external 40x :20th Century Fox:16.74:130
    > 1030:CD drive external 48x :3COM:13.48:160
    > 1040:CD drive external 52x :4XEM:15.92:190
    >
    > We need to write it to another file that is going to look like this
    >
    > ID Number Description Provider Cost Stock Total
    > 1010 CD Drive 32x 1MagiCopy 15.50 100 1550.00


    That's not just reformatting. There's a little bit
    of computation (deriving the 1550.00), which isn't hard.
    Harder -- potentially very hard -- is the translation
    that seems to be occurring: How did "drive" become "Drive,"
    and where did "external" disappear to, and what rules
    govern such transformations?
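
    The computation and the column layout can be sketched as below. This is only an illustration: the struct layout, the column widths, and the names line_total and format_row are assumptions, not something given in the thread.

```c
#include <stdio.h>

/* Hypothetical record layout: field names follow the original post,
   array sizes and the use of double for cost are assumptions. */
struct productData {
    int    idnumber;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* The derived "Total" column: unit cost times units in stock. */
double line_total(const struct productData *p)
{
    return p->cost * p->stock;
}

/* Format one report row into buf using fixed-width columns.
   Returns whatever snprintf returns (the length the full row needs). */
int format_row(char *buf, size_t size, const struct productData *p)
{
    return snprintf(buf, size, "%-9d %-24s %-20s %8.2f %6d %10.2f",
                    p->idnumber, p->description, p->provider,
                    p->cost, p->stock, line_total(p));
}
```

    For the first sample record (cost 15.5, stock 100) the last column comes out as 1550.00. How "CD drive external 32x" should turn into "CD Drive 32x" is a separate question that this sketch does not address.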

    > Since the text file to be read from is preformatted I thought I could
    > use the fscanf() to to parse each line and assign it into structure
    > variables, but I am having problems.
    >
    > Here is my code to read the file:
    >
    > int readFile (char *filename, struct productData product[], size_t
    > arrLen)
    > /* Returns number of products read */
    > {
    > FILE *fp;
    >
    > if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
    > printf( "File could not be opened.\n" );
    > } /* end if */
    >
    > else
    > {
    > int i;
    > for (i=0; i<arrLen && !feof(fp); i++)
    > {
    > if (5 != fscanf(fp, "%d %s %s %f %d",
    > &product.idnumber,
    > product.description,
    > product.provider,
    > &product.cost,
    > &product.stock))
    > {
    > printf("Invalid file format\n");
    > fclose(fp);
    > return 0;
    > }
    > }
    > fclose(fp);
    > return i;
    > }
    >
    >
    > }
    >
    > The problem seems to be that each field I want to parse seems to be
    > separated by a colon :)) Is there anyway to tell fscanf() to parse up
    > until you reach a colon and then stop and start scanning again, or
    > should I give up this approach and try to tokenize the input stream?
    > Any help is much appreciated.


    "%s" will skip leading white space, grab a string,
    and stop when it hits white space again. Hence, it's
    no good for your input format, where white spaces can
    occur as part of a data field.

    You could use "%[^:]" to look for colon-delimited
    fields, but the resulting program would be rather fragile.
    One lousy line with an extra colon or a missing colon,
    and you'll be out of step for the rest of the journey,
    or until you trip and fall, whichever comes first.
    (fscanf() is no respecter of line boundaries, and will
    happily cross them in search of more input.)

    Recommended approach: Use fgets() (but not gets()!!!)
    to read each line into a big char[] array, and then pick
    the line apart with other tools. sscanf() may be a choice
    you'd find familiar -- and since sscanf() cannot run off
    the end of its input array (and thus inadvertently bypass
    line boundaries), some of the infelicities of fscanf()
    disappear.
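
    A minimal sketch of that fgets()-then-sscanf() approach follows. The struct layout, the 64-byte field sizes, the 256-byte line buffer, and the names parse_product_line and read_products are assumptions made for illustration; cost is a double here (hence %lf), whereas the original code appears to use a float.

```c
#include <stdio.h>

/* Hypothetical record layout; sizes are assumptions. */
struct productData {
    int    idnumber;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* Parse one "id:description:provider:cost:stock" line.
   Returns 1 if the line has exactly five well-formed fields, else 0. */
int parse_product_line(const char *line, struct productData *p)
{
    char extra;  /* only filled if something follows the fifth field */
    int n = sscanf(line, "%d:%63[^:]:%63[^:]:%lf:%d %c",
                   &p->idnumber, p->description, p->provider,
                   &p->cost, &p->stock, &extra);
    return n == 5;
}

/* Read up to arrLen products, one line at a time. A malformed line is
   reported and skipped, so it cannot throw later lines out of step.
   Assumes no input line is longer than 255 characters. */
size_t read_products(FILE *fp, struct productData product[], size_t arrLen)
{
    char line[256];
    size_t count = 0;

    while (count < arrLen && fgets(line, sizeof line, fp) != NULL) {
        if (parse_product_line(line, &product[count]))
            count++;
        else
            fprintf(stderr, "Skipping malformed line: %s", line);
    }
    return count;
}
```

    Because each sscanf() call sees only one line, a missing or extra colon spoils that line alone. Note that the description keeps its trailing space ("CD drive external 32x "), since the space comes before the colon in the data.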

    --
    Eric Sosman, Apr 1, 2006
    #2

  3. Ben C Guest

    On 2006-03-31, <> wrote:
    > [...]
    > The info in product_catalog.txt looks like this:
    >
    > 1010:CD drive external 32x :1MagiCopy:15.5:100
    > 1020:CD drive external 40x :20th Century Fox:16.74:130
    > 1030:CD drive external 48x :3COM:13.48:160
    > 1040:CD drive external 52x :4XEM:15.92:190
    >
    > We need to write it to another file that is going to look like this
    >
    > ID Number Description Provider Cost Stock Total
    > 1010 CD Drive 32x 1MagiCopy 15.50 100 1550.00
    >
    > Since the text file to be read from is preformatted I thought I could
    > use the fscanf() to to parse each line and assign it into structure
    > variables, but I am having problems.
    >
    > Here is my code to read the file:
    >
    > int readFile (char *filename, struct productData product[], size_t
    > arrLen)
    > /* Returns number of products read */
    > {
    > FILE *fp;
    >
    > if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
    > printf( "File could not be opened.\n" );
    > } /* end if */
    >
    > else
    > {
    > int i;
    > for (i=0; i<arrLen && !feof(fp); i++)
    > {
    > if (5 != fscanf(fp, "%d %s %s %f %d",
    > &product.idnumber,
    > product.description,
    > product.provider,
    > &product.cost,
    > &product.stock))
    > {
    > printf("Invalid file format\n");
    > fclose(fp);
    > return 0;
    > }
    > }
    > fclose(fp);
    > return i;
    > }
    > }
    >
    > The problem seems to be that each field I want to parse seems to be
    > separated by a colon :)) Is there anyway to tell fscanf() to parse up
    > until you reach a colon and then stop and start scanning again, or
    > should I give up this approach and try to tokenize the input stream?


    You put the colons in the format string:

    if (5 != fscanf(fp, "%d:%s:%s:%f:%d" ...

    But this still won't work quite right, because %s makes fscanf stop at
    the spaces.

    You can use %[^:] to mean "series of non-colons" so:

    if (5 != fscanf(fp, "%d:%[^:]:%[^:]:%f:%d" ...

    should do the trick.

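    You also have to be careful that badly formatted input data can't
    overflow the arrays you're storing the data in. fscanf provides various
    format modifiers for this-- it can optionally scan up to a maximum
    length, or (as an extension in some libraries) allocate the buffers
    for you.

    e.g.:

    if (5 != fscanf(fp, "%d:%63[^:]:%63[^:]:%f:%d" ...

    if your buffers for description and provider were 64 bytes long. The
    width counts the characters stored and does not include the terminating
    '\0', so it has to be one less than the buffer size. An over-long field
    would be cut off at that width (and the ':' that follows would then fail
    to match), which might not be acceptable. In that case you could try
    %a[^:] (a GNU extension; see the fscanf manual).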

    The other point is that, if you have any choice in the matter, C is not
    the best language for this task. You'd be much better off with something
    else-- Python, Tcl, Perl, that kind of thing. Awk might be the perfect
    choice.
    Ben C, Apr 1, 2006
    #3
  4. CBFalconer Guest

    Eric Sosman wrote:
    > wrote On 03/31/06 18:06,:
    >>
    >> I am trying to parse two pre-formatted text files and write them to a
    >> different files formatted in a different way. The story about this is I
    >> was hired along with about 20 other people and it seems we are trying
    >> to learn the whole C language in two weeks! To top it all off, I was an
    >> English Major, but I'm trying my best. Ok back to the program. So we
    >> have two files product_catalog.txt and sales_month.txt
    >>
    >> The info in product_catalog.txt looks like this:
    >>
    >> 1010:CD drive external 32x :1MagiCopy:15.5:100
    >> 1020:CD drive external 40x :20th Century Fox:16.74:130
    >> 1030:CD drive external 48x :3COM:13.48:160
    >> 1040:CD drive external 52x :4XEM:15.92:190
    >>
    >> We need to write it to another file that is going to look like this
    >>
    >> ID Number Description Provider Cost Stock Total
    >> 1010 CD Drive 32x 1MagiCopy 15.50 100 1550.00

    >
    > That's not just reformatting. There's a little bit
    > of computation (deriving the 1550.00), which isn't hard.
    > Harder -- potentially very hard -- is the translation
    > that seems to be occurring: How did "drive" become "Drive,"
    > and where did "external" disappear to, and what rules
    > govern such transformations?
    >
    >> Since the text file to be read from is preformatted I thought I could
    >> use the fscanf() to to parse each line and assign it into structure
    >> variables, but I am having problems.
    >>
    >> Here is my code to read the file:
    >>
    >> int readFile (char *filename, struct productData product[], size_t
    >> arrLen)
    >> /* Returns number of products read */
    >> {
    >> FILE *fp;
    >>
    >> if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
    >> printf( "File could not be opened.\n" );
    >> } /* end if */
    >> else
    > > {
    > > int i;
    > > for (i=0; i<arrLen && !feof(fp); i++)
    > > {
    > > if (5 != fscanf(fp, "%d %s %s %f %d",
    > > &product.idnumber,
    > > product.description,
    > > product.provider,
    > > &product.cost,
    > > &product.stock))
    > > {
    > > printf("Invalid file format\n");
    > > fclose(fp);
    > > return 0;
    > > }
    > > }
    > > fclose(fp);
    > > return i;
    > > }
    > > }
    > >
    > > The problem seems to be that each field I want to parse seems to be
    > > separated by a colon :)) Is there anyway to tell fscanf() to parse up
    > > until you reach a colon and then stop and start scanning again, or
    > > should I give up this approach and try to tokenize the input stream?
    > > Any help is much appreciated.

    >
    > "%s" will skip leading white space, grab a string,
    > and stop when it hits white space again. Hence, it's
    > no good for your input format, where white spaces can
    > occur as part of a data field.
    >
    > You could use "%[^:]" to look for colon-delimited
    > fields, but the resulting program would be rather fragile.
    > One lousy line with an extra colon or a missing colon,
    > and you'll be out of step for the rest of the journey,
    > or until you trip and fall, whichever comes first.
    > (fscanf() is no respecter of line boundaries, and will
    > happily cross them in search of more input.)
    >
    > Recommended approach: Use fgets() (but not gets()!!!)
    > to read each line into a big char[] array, and then pick
    > the line apart with other tools. sscanf() may be a choice
    > you'd find familiar -- and since sscanf() cannot run off
    > the end of its input array (and thus inadvertently bypass
    > line boundaries), some of the infelicities of fscanf()
    > disappear.


    I would suggest he keep things as simple as possible. He could use
    my ggets() to input the lines, and my toksplit to parse them.
    toksplit was published here a few days ago, just search the group
    archives. ggets is available on my page at:

    <http://cbfalconer.home.att.net/download/ggets.zip>

    Then the code will look much like:

    char *ln, *tmp;
    int ix;
    char tok[MAXTOKEN + 1]; /* allow for '\0' always */

    while (0 == ggets(&ln)) {
        tmp = ln; ix = 0;
        while (*tmp) {
            tmp = toksplit(tmp, ':', tok, MAXTOKEN);
            ix++; /* just to keep track of which token in line */
            /* code to modify and output from tok */
            /* probably best isolated in a separate function */
        }
        free(ln);
    }

    Notice that the only configuration constants are MAXTOKEN and what
    the token delimiting character (':' here) actually is.
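
    Neither ggets nor toksplit is reproduced in this thread. For readers following along without them, here is a hypothetical stand-in for the splitting step only. It is not CBFalconer's toksplit; the name next_field and its exact behaviour are assumptions chosen to fit the loop above.

```c
#include <stddef.h>

/* Copy the next delim-terminated field from src into tok, storing at
   most max characters plus a terminating '\0' (tok needs max + 1 bytes).
   Characters beyond max are skipped, not stored.
   Returns a pointer to the start of the following field, or to the
   string's terminating '\0' when the input is exhausted. */
const char *next_field(const char *src, char delim, char *tok, size_t max)
{
    size_t n = 0;

    while (*src != '\0' && *src != delim) {
        if (n < max)
            tok[n++] = *src;
        src++;
    }
    tok[n] = '\0';

    if (*src == delim)
        src++;              /* step over the delimiter itself */
    return src;
}
```

    Unlike strtok(), a splitter written this way leaves the input untouched and reports an empty field between two adjacent delimiters instead of silently skipping it.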

    --
    "If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers." - Keith Thompson
    More details at: <http://cfaj.freeshell.org/google/>
    Also see <http://www.safalra.com/special/googlegroupsreply/>
    CBFalconer, Apr 1, 2006
    #4
  5. Guest

    I am making some progress, but not much unfortunately. Using these two
    code segments that I found from another post I was able to parse out
    each field as a text file, the output looks like this:

    Line number: 1
    Token: 1010
    Token: CD drive external 32x
    Token: 1MagiCopy
    Token: 15.5
    Token: 100

    Line number: 2
    Token: 1020
    Token: CD drive external 40x
    Token: 20th Century Fox
    Token: 16.74
    Token: 130


    size_t get_line( FILE *f , char *line, size_t len )
    {
        char *ptr;

        ptr = fgets( line, len, f );

        if( NULL == ptr ) {
            line[0] = '\0';
            return 0;
        }

        if( NULL != (ptr = strchr(line, DELIMITER)) ) *ptr = '\0';

        return strlen(line);
    }

    while( 0 != get_line( fp, data, sizeof(data)) ) {
        count++;
        printf( "Line number: %d\n", count );
        for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 = NULL )
            printf( "Token: %s\n", ptr1 );
        putchar( '\n' );
    }


    What I was going to do was assign each field value into an array of
    structures, but it gives me a segmentation fault. Is there another way
    to achieve the main objective?
    , Apr 1, 2006
    #5
  6. Ben C Guest

    On 2006-04-01, <> wrote:
    > I am making some progress, but not much unfortunately. Using these two
    > code segments that I found from another post I was able to parse out
    > each field as a text file, the output looks like this:
    >
    > Line number: 1
    > Token: 1010
    > Token: CD drive external 32x
    > Token: 1MagiCopy
    > Token: 15.5
    > Token: 100
    >
    > Line number: 2
    > Token: 1020
    > Token: CD drive external 40x
    > Token: 20th Century Fox
    > Token: 16.74
    > Token: 130
    >
    > size_t get_line( FILE *f , char *line, size_t len )
    > {
    > char *ptr;
    >
    >
    > ptr = fgets( line, len, f );
    >
    >
    > if( NULL == ptr ) {
    > line[0] = '\0';
    > return 0;
    > }
    >
    >
    > if( NULL != (ptr = strchr(line, DELIMITER)) ) *ptr = '\0';
    >
    >
    > return strlen(line);
    > }
    >
    > while( 0 != get_line( fp, data, sizeof(data)) ) {
    > count++;
    > printf( "Line number: %d\n", count );
    > for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 =
    > NULL )
    > printf( "Token: %s\n", ptr1 );
    > putchar( '\n' );
    > }


    > What I was going to do was assign each field value into an array of
    > structures, but it gives me a segmentation fault, is there another way
    > to achieve the main objective?


    If the main objective is just to print it all out again formatted
    differently, you can maybe do that in the loop, and avoid having to
    store the data.

    But you should be able to fix the segmentation fault! The error might be
    in part of the code we can't see-- it looks from "data, sizeof(data)"
    that data is an array; where do you declare it? And how's the array of
    structures created?

    In any case, you reuse the same buffer for each line, so you're going to
    have to actually copy the strings out somehow.

    Guessing, but the problem may be that you're just copying the pointers,
    but not duplicating the actual strings.

    for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 = NULL )

    records.name = ptr1; /* very likely to be wrong */
    records.name = strdup(ptr1); /* some chance of working */
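
    As an alternative to strdup(), the fields can be copied into fixed-size arrays inside each structure, so nothing points back into the reused line buffer. A minimal sketch, assuming a struct layout like the one in the first post (the array sizes and the name store_fields are made up for illustration):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; sizes are assumptions. */
struct productData {
    int    idnumber;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* Split one colon-delimited line in place and copy the fields into *p.
   Returns 1 if exactly five fields were found, 0 otherwise.
   The line buffer is modified by strtok(). */
int store_fields(char *line, struct productData *p)
{
    char *field[5];
    char *tok;
    int n = 0;

    line[strcspn(line, "\r\n")] = '\0';      /* drop the line ending */

    for (tok = strtok(line, ":"); tok != NULL; tok = strtok(NULL, ":")) {
        if (n == 5)
            return 0;                        /* too many fields */
        field[n++] = tok;
    }
    if (n != 5)
        return 0;                            /* too few fields */

    /* Copy the characters, not the pointers: field[] points into line,
       which the caller will overwrite with the next line. */
    p->idnumber = (int)strtol(field[0], NULL, 10);
    snprintf(p->description, sizeof p->description, "%s", field[1]);
    snprintf(p->provider, sizeof p->provider, "%s", field[2]);
    p->cost  = strtod(field[3], NULL);
    p->stock = (int)strtol(field[4], NULL, 10);
    return 1;
}
```

    Calling store_fields(data, &records[count]) inside the reading loop would fill one array element per line, provided records has room for it. Over-long text fields are truncated to fit rather than overflowing.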

    HTH
    Ben C, Apr 1, 2006
    #6
  7. CBFalconer Guest

    wrote:
    >
    > I am making some progress, but not much unfortunately. Using these
    > two code segments that I found from another post I was able to
    > parse out each field as a text file, the output looks like this:


    You reply to my posting, but ignore all that I suggested, and
    refuse to quote proper context. I see no point in anyone
    attempting to assist you further.

    CBFalconer, Apr 2, 2006
    #7
