Parsing two formatted text files


bfowlkes

Hello,

I am trying to parse two pre-formatted text files and write their contents
to different files in a different format. The story behind this is that I
was hired along with about 20 other people, and it seems we are trying
to learn the whole C language in two weeks! To top it all off, I was an
English major, but I'm trying my best. OK, back to the program. We
have two files, product_catalog.txt and sales_month.txt.

The info in product_catalog.txt looks like this:

1010:CD drive external 32x :1MagiCopy:15.5:100
1020:CD drive external 40x :20th Century Fox:16.74:130
1030:CD drive external 48x :3COM:13.48:160
1040:CD drive external 52x :4XEM:15.92:190

We need to write it to another file that is going to look like this:

ID Number  Description   Provider   Cost   Stock  Total
1010       CD Drive 32x  1MagiCopy  15.50  100    1550.00

Since the text file to be read is preformatted, I thought I could
use fscanf() to parse each line and assign the fields to structure
variables, but I am having problems.

Here is my code to read the file:

int readFile (char *filename, struct productData product[], size_t arrLen)
/* Returns number of products read */
{
    FILE *fp;

    if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
        printf( "File could not be opened.\n" );
    } /* end if */
    else
    {
        int i;
        for (i=0; i<arrLen && !feof(fp); i++)
        {
            if (5 != fscanf(fp, "%d %s %s %f %d",
                            &product[i].idnumber,
                            product[i].description,
                            product[i].provider,
                            &product[i].cost,
                            &product[i].stock))
            {
                printf("Invalid file format\n");
                fclose(fp);
                return 0;
            }
        }
        fclose(fp);
        return i;
    }
}

The problem seems to be that the fields I want to parse are separated
by colons :)) Is there any way to tell fscanf() to parse up to a colon,
stop, and then start scanning again, or should I give up this approach
and try to tokenize the input stream?
Any help is much appreciated.

Brett
 

Eric Sosman

That's not just reformatting. There's a little bit
of computation (deriving the 1550.00), which isn't hard.
Harder -- potentially very hard -- is the translation
that seems to be occurring: How did "drive" become "Drive,"
and where did "external" disappear to, and what rules
govern such transformations?
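
The computation part can be sketched in a few lines. This is only an illustration: struct product_row, format_product_row() and the column widths are assumptions made for the example, not anything given in the original post, and the description is printed unchanged because the renaming rules are unknown.

```c
#include <stdio.h>

/* Hypothetical record layout; the thread never shows struct productData. */
struct product_row {
    int    id;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* Writes one report line into buf and returns what snprintf returns:
   the length the full line needs, or a negative value on error. */
int format_product_row(char *buf, size_t size, const struct product_row *p)
{
    double total = p->cost * p->stock;   /* e.g. 15.5 * 100 */

    /* Column widths are arbitrary choices for this sketch. */
    return snprintf(buf, size, "%-10d %-28s %-20s %8.2f %6d %10.2f",
                    p->id, p->description, p->provider,
                    p->cost, p->stock, total);
}
```

The caller would pass a buffer and a filled-in record, then write the buffer to the output file with fputs() or fprintf().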
"%s" will skip leading white space, grab a string,
and stop when it hits white space again. Hence, it's
no good for your input format, where white spaces can
occur as part of a data field.

You could use "%[^:]" to look for colon-delimited
fields, but the resulting program would be rather fragile.
One lousy line with an extra colon or a missing colon,
and you'll be out of step for the rest of the journey,
or until you trip and fall, whichever comes first.
(fscanf() is no respecter of line boundaries, and will
happily cross them in search of more input.)

Recommended approach: Use fgets() (but not gets()!!!)
to read each line into a big char[] array, and then pick
the line apart with other tools. sscanf() may be a choice
you'd find familiar -- and since sscanf() cannot run off
the end of its input array (and thus inadvertently bypass
line boundaries), some of the infelicities of fscanf()
disappear.
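
A minimal sketch of the fgets()-then-sscanf() approach, under some assumptions: struct product is a stand-in for the poster's struct productData (which is never shown), the 64-byte field size and 256-byte line buffer are guesses, and the helper names parse_product_line() and read_products() are made up for the example.

```c
#include <stdio.h>

#define FIELD_LEN 64   /* assumed field size; the thread does not give one */

/* Hypothetical layout standing in for the poster's struct productData. */
struct product {
    int   idnumber;
    char  description[FIELD_LEN];
    char  provider[FIELD_LEN];
    float cost;
    int   stock;
};

/* Parses one "id:description:provider:cost:stock" line.
   The width 63 must stay one less than FIELD_LEN.
   Returns 1 on success, 0 if the line does not match. */
int parse_product_line(const char *line, struct product *p)
{
    int n = sscanf(line, "%d:%63[^:]:%63[^:]:%f:%d",
                   &p->idnumber, p->description, p->provider,
                   &p->cost, &p->stock);
    return n == 5;
}

/* Reads up to max records from fp, one line at a time.
   Malformed lines are reported and skipped, so one bad line
   cannot throw the following lines out of step. */
size_t read_products(FILE *fp, struct product *out, size_t max)
{
    char line[256];   /* assumed to be long enough for any input line */
    size_t count = 0;

    while (count < max && fgets(line, sizeof line, fp) != NULL) {
        if (parse_product_line(line, &out[count]))
            count++;
        else
            fprintf(stderr, "Skipping malformed line: %s", line);
    }
    return count;
}
```

Because each call to sscanf() sees only one line, a missing or extra colon spoils that line alone. A line longer than the buffer would still be split across two fgets() calls, so the buffer size needs to match the real data.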
 

Ben C

You put the colons in the format string:

if (5 != fscanf(fp, "%d:%s:%s:%f:%d" ...

But this still won't work quite right, because %s makes fscanf stop at
the first whitespace character.

You can use %[^:] to mean "series of non-colons" so:

if (5 != fscanf(fp, "%d:%[^:]:%[^:]:%f:%d" ...

should do the trick.

You also have to be careful that badly formatted input data can't
overflow the arrays you're storing the data in. fscanf provides format
modifiers for this: it can optionally scan up to a maximum length, or,
with some libraries, it can allocate the buffers for you.

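e.g.:

if (5 != fscanf(fp, "%d:%63[^:]:%63[^:]:%f:%d" ...

if your buffers for description and provider were 64 bytes long (the
width does not count the terminating '\0', so it must be one less than
the buffer size). A longer field would be cut off at 63 characters, and
the ':' that follows in the format would then fail to match, so fscanf
would return fewer than 5. If that is not acceptable, you could try
%a[^:], a glibc extension that allocates the buffer for you (POSIX.1-2008
spells it %m[^:]; see the fscanf manual).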
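
To see what a width limit does with an over-long field, here is a small sketch. It uses sscanf() on a string rather than fscanf() on a file, and a deliberately tiny 8-byte buffer so the effect shows up with short test data; scan_bounded() is a made-up name.

```c
#include <stdio.h>

/* Scans "id:name:stock" with the name limited to 7 characters
   plus the terminating '\0'. Returns the number of fields that
   sscanf() converted and stored. */
int scan_bounded(const char *line, int *id, char name[8], int *stock)
{
    return sscanf(line, "%d:%7[^:]:%d", id, name, stock);
}
```

With "1010:3COM:160" all three fields are stored. With "1010:1MagiCopy:100" the name stops after "1MagiCo", the next input character is 'p' rather than ':', and the call returns 2, which is how the caller can tell that the line did not fit.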

The other point is that, if you have any choice in the matter, C is not
the best language for this task. You'd be much better off with something
else: Python, Tcl, Perl, that kind of thing. Awk might be the perfect
choice.
 

CBFalconer

I would suggest he keep things as simple as possible. He could use
my ggets() to input the lines, and my toksplit to parse them.
toksplit was published here a few days ago, just search the group
archives. ggets is available on my page at:

<http://cbfalconer.home.att.net/download/ggets.zip>

Then the code will look much like:

char *ln, *tmp;
int ix;
char tok[MAXTOKEN + 1]; /* allow for '\0' always */

while (0 == ggets(&ln)) {
    tmp = ln; ix = 0;
    while (*tmp) {
        tmp = toksplit(tmp, ':', tok, MAXTOKEN);
        ix++; /* just to keep track of which token in line */
        /* code to modify and output from tok */
        /* probably best isolated in a separate function */
    }
    free(ln);
}

Notice that the only configuration constants are MAXTOKEN and what
the token delimiting character (':' here) actually is.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>
 

bfowlkes

I am making some progress, but not much, unfortunately. Using these two
code segments that I found in another post, I was able to parse out
each field as text. The output looks like this:

Line number: 1
Token: 1010
Token: CD drive external 32x
Token: 1MagiCopy
Token: 15.5
Token: 100

Line number: 2
Token: 1020
Token: CD drive external 40x
Token: 20th Century Fox
Token: 16.74
Token: 130


size_t get_line( FILE *f , char *line, size_t len )
{
    char *ptr;

    ptr = fgets( line, len, f );

    if( NULL == ptr ) {
        line[0] = '\0';
        return 0;
    }

    if( NULL != (ptr = strchr(line, DELIMITER)) ) *ptr = '\0';

    return strlen(line);
}

while( 0 != get_line( fp, data, sizeof(data)) ) {
    count++;
    printf( "Line number: %d\n", count );
    for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 = NULL )
        printf( "Token: %s\n", ptr1 );
    putchar( '\n' );
}


What I was going to do was assign each field value into an array of
structures, but that gives me a segmentation fault. Is there another way
to achieve the main objective?
 

Ben C

If the main objective is just to print it all out again formatted
differently, you can maybe do that in the loop, and avoid having to
store the data.

But you should be able to fix the segmentation fault! The error might be
in part of the code we can't see-- it looks from "data, sizeof(data)"
that data is an array; where do you declare it? And how's the array of
structures created?

In any case, you reuse the same buffer for each line, so you're going to
have to actually copy the strings out somehow.

Guessing, but the problem may be that you're just copying the pointers,
but not duplicating the actual strings.

for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 = NULL )

records.name = ptr1; /* very likely to be wrong */
records.name = strdup(ptr1); /* some chance of working */
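
To make the copying concrete, here is one possible sketch. It is a guess at what the missing code needs to do, not a fix for the poster's actual program: struct record, copy_field() and fill_record() are hypothetical, and the fields are copied into fixed-size arrays inside the struct instead of being duplicated with strdup(), so nothing has to be freed later.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record; the poster's actual struct is not shown. */
struct record {
    int    id;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* Copies src into a fixed-size field, truncating if necessary. */
static void copy_field(char *dst, size_t size, const char *src)
{
    snprintf(dst, size, "%s", src);
}

/* Splits line in place on ':' and copies each field into rec, so the
   record stays valid after the line buffer is reused for the next line.
   Returns 1 if exactly five fields were found, 0 otherwise. */
int fill_record(char *line, struct record *rec)
{
    char *tok;
    int field = 0;

    line[strcspn(line, "\n")] = '\0';   /* drop the newline, if any */

    for (tok = strtok(line, ":"); tok != NULL; tok = strtok(NULL, ":")) {
        switch (field++) {
        case 0: rec->id = (int)strtol(tok, NULL, 10); break;
        case 1: copy_field(rec->description, sizeof rec->description, tok); break;
        case 2: copy_field(rec->provider, sizeof rec->provider, tok); break;
        case 3: rec->cost = strtod(tok, NULL); break;
        case 4: rec->stock = (int)strtol(tok, NULL, 10); break;
        default: return 0;              /* more than five fields */
        }
    }
    return field == 5;
}
```

One caveat: strtok() treats a run of adjacent colons as a single separator, so an empty field would make the line come up short and fill_record() would reject it.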

HTH
 

CBFalconer

You reply to my posting, but ignore all that I suggested, and
refuse to quote proper context. I see no point in anyone
attempting to assist you further.

 
