parse two field file

Richard

Which way would you guys recommend to best parse a multiline file which contains
two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev usbfs
ext3
nodev fuse
vfat
ntfs
nodev binfmt_misc
udf
iso9660

The first field can be "empty", in which case the line starts with a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

The field I am really interested in is the second one : any hints & tips
appreciated as to do this in the most efficient manor.


CBFalconer

Richard said:
Which way would you guys recommend to best parse a multiline file
which contains two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev usbfs
ext3
nodev fuse
vfat
ntfs
nodev binfmt_misc
udf
iso9660

The first field can be "empty", in which case the line starts with a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

The field I am really interested in is the second one : any hints
& tips appreciated as to do this in the most efficient manor.

Use toksplit. Call with tokchar set to '\t'. Std C code follows:

/* ------- file toksplit.h ----------*/
#ifndef H_toksplit_h
# define H_toksplit_h

# ifdef __cplusplus
extern "C" {
# endif

#include <stddef.h>

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token, Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh);     /* length token can receive */
                                      /* not including final '\0' */

# ifdef __cplusplus
}
# endif
#endif
/* ------- end file toksplit.h ----------*/

/* ------- file toksplit.c ----------*/
#include "toksplit.h"

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token, Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

A better name would be "strtkn", except that is reserved
for the system namespace. Change to that at your risk.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
Revised 2006-06-13
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh)      /* length token can receive */
                                      /* not including final '\0' */
{
    if (src) {
        while (' ' == *src) src++;

        while (*src && (tokchar != *src)) {
            if (lgh) {
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;
    }
    *token = '\0';
    return src;
} /* toksplit */

#ifdef TESTING
#include <stdio.h>

#define ABRsize 6 /* length of acceptable token abbreviations */

/* ---------------- */

static void showtoken(int i, char *tok)
{
    putchar(i + '1'); putchar(':');
    puts(tok);
} /* showtoken */

/* ---------------- */

int main(void)
{
    char teststring[] = "This is a test, ,, abbrev, more";

    const char *t, *s = teststring;
    int i;
    char token[ABRsize + 1];

    puts(teststring);
    t = s;
    for (i = 0; i < 4; i++) {
        t = toksplit(t, ',', token, ABRsize);
        showtoken(i, token);
    }

    puts("\nHow to detect 'no more tokens' while truncating");
    t = s; i = 0;
    while (*t) {
        t = toksplit(t, ',', token, 3);
        showtoken(i, token);
        i++;
    }

    puts("\nUsing blanks as token delimiters");
    t = s; i = 0;
    while (*t) {
        t = toksplit(t, ' ', token, ABRsize);
        showtoken(i, token);
        i++;
    }
    return 0;
} /* main */

#endif
/* ------- end file toksplit.c ----------*/
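For the /proc/filesystems case, the two calls might look like the sketch below. The helper name split_fs_line and the FLDMAX limit are assumptions made for illustration, not part of toksplit; toksplit itself is repeated only so that the block compiles on its own. Note that toksplit does not strip the '\n' that fgets() leaves in the line, so the sketch removes it separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* toksplit as posted above, repeated so this sketch compiles alone. */
const char *toksplit(const char *src, char tokchar,
                     char *token, size_t lgh)
{
    if (src) {
        while (' ' == *src) src++;
        while (*src && (tokchar != *src)) {
            if (lgh) {
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;
    }
    *token = '\0';
    return src;
}

#define FLDMAX 63 /* assumed maximum field length, chosen for illustration */

/* Hypothetical helper: split one line into its two tab-separated
   fields. Each output buffer must hold FLDMAX + 1 chars. */
void split_fs_line(const char *line, char *flags, char *fsname)
{
    const char *t;

    assert(line != NULL && flags != NULL && fsname != NULL);
    t = toksplit(line, '\t', flags, FLDMAX);  /* first field, may be empty */
    (void)toksplit(t, '\t', fsname, FLDMAX);  /* second field */
    fsname[strcspn(fsname, "\n")] = '\0';     /* drop the '\n' kept by fgets */
}
```

A line with no tab at all ends up entirely in flags, with fsname empty, so a caller that cares about malformed input would want to check for an empty fsname.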
 
Malcolm

Richard said:
Which way would you guys recommend to best parse a multiline file which
contains two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev usbfs
ext3
nodev fuse
vfat
ntfs
nodev binfmt_misc
udf
iso9660

The first field can be "empty", in which case the line starts with a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

The field I am really interested in is the second one : any hints & tips
appreciated as to do this in the most efficient manor.

The input format is slightly quirky, so the best solution is to call fgets()
to read a line and then parse it yourself.

int checkheader(char *str)

can check whether the string is a header or not by looking for the tab or
counting whitespace.

parseheader(char *str, char *field1, char *field2)

will pull out the fields for you. Make sure you reject over-long strings.
Then the data fields only contain one string.

However

void trim(char *str)

which removes leading and trailing whitespace is a good function to have.

So too is

int checkblank(char *str)

which checks for strings which consist entirely of whitespace characters.
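A minimal sketch of those two helpers, under the assumption that "whitespace" means whatever isspace() accepts. Only the names and the trim() signature come from the post; the bodies are one possible implementation, and checkblank() takes a const pointer here because it does not modify the string.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Remove leading and trailing whitespace from str, in place. */
void trim(char *str)
{
    char *start, *end;

    assert(str != NULL);
    start = str;
    while (isspace((unsigned char)*start))
        start++;
    end = start + strlen(start);
    while (end > start && isspace((unsigned char)end[-1]))
        end--;
    memmove(str, start, (size_t)(end - start)); /* regions may overlap */
    str[end - start] = '\0';
}

/* Return 1 if str is empty or consists entirely of whitespace, else 0. */
int checkblank(const char *str)
{
    assert(str != NULL);
    while (isspace((unsigned char)*str))
        str++;
    return *str == '\0';
}
```

Applied to a line whose first field is empty, trim() leaves just the filesystem name. A line with both fields still has to be split at the tab.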
 
Richard

Malcolm said:
The input format is slightly quirky, so the best solution is to call fgets()
to read a line and then parse it yourself.

int checkheader(char *str)

can check whether the string is a header or not by looking for the tab or
counting whitespace.

parseheader(char *str, char *field1, char *field2)

will pull out the fields for you. make sure you reject over-long strings.
Then the data fields only contain one string.

However

void trim(char *str)

which removes leading and trailing whitespace is a good function to have.

so too is
int checkblank(char *str)

which checks for strings which consist entirely of whitespace
characters.

I just did sscanf("%s%s",f1,f2) in the end.
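Spelled out, that approach might look roughly like the sketch below. The function name, the FLDSIZE buffer size, and the return convention are assumptions for illustration. The field widths are added so that an over-long word cannot overflow the buffers, and the return value of sscanf() tells the two line shapes apart: two words means both fields are present, one word means the first field was empty.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FLDSIZE 64 /* assumed buffer size for each field */

/* Hypothetical helper. Returns 0 on success, -1 if the line holds no
   field at all. fsname receives the second field; flags receives the
   first field, or "" if it was empty. */
int scan_fs_line(const char *line, char flags[FLDSIZE], char fsname[FLDSIZE])
{
    char f1[FLDSIZE], f2[FLDSIZE];
    int n;

    assert(line != NULL && flags != NULL && fsname != NULL);
    n = sscanf(line, "%63s%63s", f1, f2); /* widths are FLDSIZE - 1 */
    if (n == 2) {
        strcpy(flags, f1);
        strcpy(fsname, f2);
    } else if (n == 1) {     /* one word only: the first field was empty */
        flags[0] = '\0';
        strcpy(fsname, f1);
    } else {
        return -1;           /* blank line or empty string */
    }
    return 0;
}
```

This counts whitespace-separated words rather than tabs, so it cannot tell a tab from a space and silently ignores anything after the second word.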


Eric Sosman

Richard said:
Which way would you guys recommend to best parse a multiline file which contains
two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev usbfs
ext3
nodev fuse
vfat
ntfs
nodev binfmt_misc
udf
iso9660

The first field can be "empty", in which case the line starts with a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

strtok(..., "\t") will give the same result for "\tfoo"
and "\t\tfoo\t" and "foo". If you *know* that the input has
two tab-separated fields and that only the first (never the
second) can be empty, you can get this to work: If strtok()
finds two fields they are #1 and #2, but if it finds only
one it is #2 with #1 empty.

However, it makes me queasy to put that much faith in an
input source I don't control programmatically. Who knows?
Maybe in six months somebody will extend the format, adding
an optional third field. If that happened, then the field-
counting approach would misinterpret "\tfoo\tbar" as if it
were "foo\tbar". It would be better to adopt a method that
would complain about "\tfoo\tbar" than to be fooled by it.

fgets() plus sscanf() is a possibility, but it's a bit
tricky to use: The obvious "%s\t%s" will not do what you
want. (The first "%s" will skip any leading white space,
leaving you in the same hole as the strtok() approach, and
the "\t" will match any amount of any kind of white space,
tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
a little better, but still wouldn't be fully satisfactory:
It would match the prefix of "foo\tbar baz goozle frobnitz"
without any warning of the trailing junk. You could use
"%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
consumed the entire string ...

... but wouldn't it be simpler just to pick the line
apart for yourself? Read it in with fgets(), use strchr()
to find the first tab (syntax error if there isn't one), and
the first (possibly empty) field is everything from the start
to just before the tab. Then start just after the tab and use
strchr() again to find the terminating '\n'; the second field
is everything from just after the tab to just before the '\n'
(syntax error if its length is zero). You can use strcspn()
to check that the second field contains no white space and
squawk if it does (somebody added a third field you don't
understand).

The field I am really interested in is the second one : any hints & tips
appreciated as to do this in the most efficient manor.

The "most efficient manor" is the house of Usher. Resist
this unnecessary impulse for efficiency, lest your program meet
the same fate as did that storied manse.

(In other words: How long is this file, anyhow? How many
times will you scan its contents? If you sped up the scanning
by a factor of four hundred twenty gazillion, how much faster
would the program as a whole run? If you give your SUV a coat
of wax, will you improve its fuel economy by making it slipperier
or harm it by adding weight?)
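The pick-it-apart approach described above could be sketched as follows. The function name, the pointer-plus-length outputs, and the single -1 error code are assumptions for illustration; the line is assumed to come from fgets() and therefore to end in '\n' when it fits the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. On success returns 0 and reports each field as a
   pointer into line plus a length; the first field may have length 0.
   Returns -1 on any syntax error. */
int pick_apart(const char *line,
               const char **f1, size_t *f1len,
               const char **f2, size_t *f2len)
{
    const char *tab, *nl;

    assert(line && f1 && f1len && f2 && f2len);
    tab = strchr(line, '\t');
    if (tab == NULL)
        return -1;              /* no tab: syntax error */
    nl = strchr(tab + 1, '\n');
    if (nl == NULL || nl == tab + 1)
        return -1;              /* unterminated line or empty second field */
    /* The first whitespace after the tab must be the terminating '\n';
       anything earlier suggests an extra field we do not understand. */
    if (strcspn(tab + 1, " \t\r\f\v\n") != (size_t)(nl - (tab + 1)))
        return -1;
    *f1 = line;
    *f1len = (size_t)(tab - line);
    *f2 = tab + 1;
    *f2len = (size_t)(nl - (tab + 1));
    return 0;
}
```

Because the fields are reported as spans into the caller's buffer, nothing is copied and no length limit is imposed; a caller wanting NUL-terminated strings would copy them out or overwrite the tab and the '\n' with '\0'.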
 
Giorgos Keramidas

Which way would you guys recommend to best parse a multiline file
which contains two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev usbfs
ext3
nodev fuse
vfat
ntfs
nodev binfmt_misc
udf
iso9660

The first field can be "empty", in which case the line starts with a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

strtok() is not so nice, because it modifies the string you pass
to it. I would probably use strcspn() for this, with something like:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 256

static void doline(char *buf, size_t bufsize);

int
main(void)
{
    char buf[MAXLINE];
    FILE *fp;

    /*
     * Add code here that opens /proc/filesystems file, instead of using
     * `stdin' as the input file.
     */
    fp = stdin;

    clearerr(fp);
    while (fgets(buf, sizeof buf, fp) != NULL) {
        doline(buf, sizeof buf);
    }
    if (ferror(fp) != 0) {
        perror("fgets");
        exit(EXIT_FAILURE);
    }
    /*
     * Add code here that closes the open file referenced by `fp'.
     */

    return EXIT_SUCCESS;
}

static void
doline(char *buf, size_t bufsize)
{
    char *field;
    size_t pos, pos2, fieldsize;

    assert(buf != NULL && bufsize > 0);
    (void)bufsize;

    pos = strcspn(buf, "\t");
    if (buf[pos] == '\0') {
        fprintf(stderr,
            "warning: no TAB in `%s', skipping this line\n", buf);
        return;
    }
    pos2 = strcspn(buf + pos + 1, "\t");

    fieldsize = pos2 + 1;
    field = malloc(fieldsize);
    if (field == NULL) {
        perror("malloc");
        return;
    }
    strncpy(field, buf + pos + 1, fieldsize - 1);
    field[fieldsize - 1] = '\0';
    field[strcspn(field, "\n\r")] = '\0';
    printf("%s\n", field);
    free(field);
}

The trick is to use strcspn() to find out the 'part' of the original
string which you are interested in, and then you can do whatever you
like with this part. In this particular program, I temporarily
allocate a new string buffer, copy the original contents into this new
buffer, print the buffer, and release its memory. Any other way you can
think about to use this substring is fine too :)
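One such other way, sketched here as an assumption rather than as part of the program above, is to skip the copy entirely: locate the field with strcspn() and print it with a "%.*s" precision. The helper names second_field and print_second_field are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: locate the second tab-separated field in buf.
   Returns a pointer into buf and stores the field length in *len,
   or returns NULL if buf contains no TAB. */
const char *second_field(const char *buf, size_t *len)
{
    size_t pos;

    assert(buf != NULL && len != NULL);
    pos = strcspn(buf, "\t");
    if (buf[pos] == '\0')
        return NULL;
    /* The field ends at the next TAB, CR, LF, or end of string. */
    *len = strcspn(buf + pos + 1, "\t\r\n");
    return buf + pos + 1;
}

/* Print the second field of buf to out, followed by a newline.
   Returns 0 on success, -1 if buf contains no TAB. */
int print_second_field(FILE *out, const char *buf)
{
    size_t len;
    const char *field = second_field(buf, &len);

    if (field == NULL)
        return -1;
    fprintf(out, "%.*s\n", (int)len, field); /* precision limits the output */
    return 0;
}
```

The cast to int is safe here only because the lines come from a small fixed-size buffer; for arbitrary input lengths fwrite() would be the more careful choice.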
 
Dave Thompson

Richard said:
Which way would you guys recommend to best parse a multiline file which contains
two fields separated by a tab. <snip>

strtok(..., "\t") will [lose empty fields]

Right.

fgets() plus sscanf() is a possibility, but it's a bit
tricky to use: The obvious "%s\t%s" will not do what you
want. (The first "%s" will skip any leading white space,
leaving you in the same hole as the strtok() approach, and
the "\t" will match any amount of any kind of white space,
tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
a little better, but still wouldn't be fully satisfactory:

Not enough better. If the first field is empty and thus the first
%[^\t] matches nothing, *scanf stops and never reaches the %*1[\t] or the %s.

This is effectively the same problem as that of the people who periodically
try to use {,f}scanf to replace <ILLEGAL> fflush (input) </>.
(Some people, including IIRC Dan Pop, have recommended e.g.
    if( scanf ("%*[^\n]%*1[\n]") < 2 ) getchar ();
but I consider that too much uglier than the obvious, though slightly
longer and possibly slightly less efficient
    while( (ch = getchar()) != EOF && ch != '\n' ) ;
etc.)

Plus unbounded %[...] or %s risks buffer overflow and resulting UB.
You should specify a length at most one less than the buffer size.
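Both points can be seen in a small sketch. The function names and the FLDSIZE buffer size are assumptions for illustration. The first function applies the scanset format quoted above, with widths added; the second works around the empty-first-field failure by testing for a leading tab before calling sscanf().

```c
#include <assert.h>
#include <stdio.h>

#define FLDSIZE 64 /* assumed buffer size; widths below are FLDSIZE - 1 */

/* Apply the scanset format with bounded widths and return sscanf's
   result, i.e. the number of fields actually assigned. */
int scanset_parse(const char *line, char f1[FLDSIZE], char f2[FLDSIZE])
{
    assert(line != NULL && f1 != NULL && f2 != NULL);
    f1[0] = f2[0] = '\0';
    return sscanf(line, "%63[^\t]%*1[\t]%63s", f1, f2);
}

/* Hypothetical workaround: treat a leading tab as an empty first field.
   Returns 0 if the second field was read, -1 otherwise. */
int bounded_parse(const char *line, char f1[FLDSIZE], char f2[FLDSIZE])
{
    assert(line != NULL && f1 != NULL && f2 != NULL);
    f1[0] = f2[0] = '\0';
    if (line[0] == '\t')
        return sscanf(line + 1, "%63s", f2) == 1 ? 0 : -1;
    return sscanf(line, "%63[^\t]%*1[\t]%63s", f1, f2) == 2 ? 0 : -1;
}
```

With a line such as "\text3\n", scanset_parse() assigns nothing and returns 0, because %[^\t] must match at least one character. bounded_parse() still shares the other weaknesses already mentioned: %s skips leading white space and trailing junk goes unreported.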

It would match the prefix of "foo\tbar baz goozle frobnitz"
without any warning of the trailing junk. You could use
"%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
consumed the entire string ...

... but wouldn't it be simpler just to pick the line
apart for yourself? Read it in with fgets(), use strchr()
to find the first tab <snip>

Yes.

The "most efficient manor" is the house of Usher. Resist
this unnecessary impulse for efficiency, lest your program meet
the same fate as did that storied manse.

Yes. Or even the hundred-year shay, IIRC grade school. <G>

- David.Thompson1 at worldnet.att.net
 
