Splitting a string and putting it into an array

Jeremy Yallop

Sidney said:
Hang on, I've lost you there... Here's a bunch of random characters:

!&^fgfctq9786h0%(*&%h2owyp[1238n

Are you saying that a correct C++ compiler may compile this successfully?
Same question for a correct C compiler?

Yes (for C, at least). Also, if it is "accepted" (whatever that may
mean) by a conforming implementation then it's a conforming program.
Ok, under the assumption that feeding

!&^fgfctq9786h0%(*&%h2owyp[1238n

...to a correct C compiler does not have to fail, what /does/ the
standard have to say on what can be expected of a compiler on such input?

The compiler must issue a diagnostic about the syntax error.

Jeremy.
 
CBFalconer

Sidney said:
.... snip ...

Hang on, I've lost you there... Here's a bunch of random characters:

!&^fgfctq9786h0%(*&%h2owyp[1238n

Are you saying that a correct C++ compiler may compile this
successfully? Same question for a correct C compiler?

Offhand the !&^ and &% sequences may be a problem, but with
patience and a few macros I believe we could translate and parse
the rest :)
 
Paul Hsieh

Daniel Haude said:
["Followup-To:" header set to comp.lang.c.]
On 15 Jan 2004 03:58:59 -0800,
in Msg. said:
The C library has a totally worthless complement of string parsing
functions.

As far as I'm aware, it only has one function that deserves the name
"parsing function", which is sscanf().

An interesting point of view -- considering that sscanf is probably
the least generic of all the parsing mechanisms. Actually strtok
*would* be a good function, if only it were more like strtok_r as is
implemented in Linux.
Anyway, it's simple to roll your own.

And even simpler to get it wrong, or create an inadequate solution as
we shall see.
[...] For delimiter-separated text data I always use this function:

char **dh_splitstring(char *line, int maxfields, int sep)
{
    char *a, *e, *s;
    int i;
    char **arr;

    if (NULL == (arr = malloc((maxfields+1) * sizeof *arr))) return NULL;

    e = line;
    for (i = 0; i < maxfields; i++) {
        /* skip leading WS */
        for (s = e; *s && isspace(*s) && *s != sep; s++) ;
        if (!*s) break;
        for (a = s; *a && *a != sep; a++) ;
        e = *a == sep ? a+1 : a;
        /* chop off trailing WS */
        for (a--; isspace(*a) && a > s; a--) ;
        if (i < maxfields-1) *++a = 0;
        arr[i] = s;
    }
    for ( ; i < maxfields+1; i++) {
        arr[i] = NULL;
    }
    return arr;
}


A reasonable effort, but it betrays the limited scope that is very
typical of C programmers.

1) This code is complicated. You have several loops, and state
machines, and I have difficulty knowing how to know that this code is
correct. In the middle there you've hidden a nice "a > s" comparison
.... I didn't know you could compare pointers like that in a portable
way.

2) The code only shows the inner loop -- the original request asks for
filling a two dimensional array.

3) It encodes the classic C worthlessness of requiring that you
specify the size of your containers (array) up front, before you know
the size of the data you require. The original poster also posted to
C++ -- using an STL vector to hold the result is probably the right
answer.
line: The string to be parsed

With or without the '\n' on the end? And do you compensate for a
potential '\r' in there? Are you doing the typical C thing of pushing
issues and complexities upward?
maxfields: max number of fields

It's also the minimum malloc size. BTW, what if you decide to pass a
negative value for maxfields?
sep: field separator char (i.e., '\t')

Tabs but not spaces? Tabs/spaces are chosen because of their human
readability/editability features. I mean, if you use them as
separators you use them interchangeably, in arbitrary numbers, and
with the possibility that one or the other might not be present. A
quick look at your algorithm makes it look like it will fail if one
pair of entries is only separated by spaces.
Returns: A newly-allocated, NULL-terminated array of pointers to strings
with maxfields+1 elements. All fields have leading and trailing whitespace
removed.

Warning: Modifies line. If this is not what you want, pass a copy.

Which makes it only barely better than strtok ...
The returned pointers point into different positions of 'line', so 'line'
must not be freed or modified as long as you want to use the result.

Ok ... this is another classic case of pushing complexity upwards.
Multiple separator tokens are not lumped together, but result in empty
tokens ("\0"). This makes using tab-separated data files dangerous because
there may be different numbers of tabs between columns. I always use
semicolons.

Ok, I'm not sure the OP was asking that you change the policy they
have decided upon. Using arbitrary space or tab delimiters allows for
easy human editability, which might be a fairly high concern. For
humans it's easy to miss a ";" and it's also easy to confuse one for a
":".
 
Paul Hsieh

Default User said:
Oh yes. It's completely impossible to create any application using the
string-parsing utilities in C.

I never said that. However, attempting to do so ends up contributing
to the coffers of Security Focus, CERT, Reasoning, and other late/post
production software failure analysis companies.
Unless you are a competent programmer.

With a masochistic bent, of course. For this particular task it's
actually best to completely leave the entire C string library behind
-- it just doesn't offer anything useful that can't be done better
using some other way.
You know, there are *other* things besides strtok() available. It's
really not that hard to use strchr() and your own state machine for
parsing if strtok() doesn't fit your needs.

Actually, strcspn(), and strspn() would probably be the better choices
if you really wanted to push it. But the point is that if you are
forced to implement your own state machines to hand hold the parsing
anyway, then you might as well roll your own right down to the raw
characters. A big for-switch statement with a few counters and flags
-- it will be just as readable.
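
As a rough illustration of the strcspn() approach mentioned above (not code from anyone in the thread), here is a minimal sketch of a field splitter. The name split_fields and its parameters are hypothetical. It assumes single-character separators taken from a set, keeps empty fields, and modifies the line in place:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): split `line` in place on any
 * character in `seps`, keeping empty fields. Stores at most `max` field
 * pointers in `out` and returns the total number of fields found, which
 * may be larger than `max`.
 */
size_t split_fields(char *line, const char *seps, char **out, size_t max)
{
    size_t n = 0;
    char *p = line;

    assert(line != NULL && seps != NULL);
    for (;;) {
        size_t len = strcspn(p, seps);  /* length of the current field */
        int last = (p[len] == '\0');    /* reached the end of the line? */

        if (n < max)
            out[n] = p;
        n++;
        if (last)
            break;
        p[len] = '\0';                  /* terminate the field in place */
        p += len + 1;                   /* step past the separator */
    }
    return n;
}
```

Because the return value counts every field, a caller can compare it with max to detect a row that had more columns than expected.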

Compare this to my Bstrlib based solution -- can you find a state
machine of any kind in there? If anywhere, it would be hidden away in
the thoroughly tested Bstrlib functions and is implemented in a
generically usable manner.
Don't get me wrong, std::string has a lot of nice features, but to poo-poo
the C string capabilities portrays either a lack of experience, or lack
of skill.

As demonstrated by your impressive proposal to solve the OP's problem.
 
Peter Nilsson

Jeremy Yallop said:
In C89 an #error directive causes the issuance of a diagnostic
message; it isn't required to cause translation failure.

Ah yes, I see that in my copy of the C89 draft. Can I ask, what did C95 have
to say on the subject of #error?
 
pete

Paul said:
I never said that. However, attempting to do so ends up contributing
to the coffers of Security Focus, CERT, Reasoning, and other late/post
production software failure analysis companies.


With a masochistic bent, of course. For this particular task it's
actually best to completely leave the entire C string library behind
-- it just doesn't offer anything useful that can't be done better
using some other way.


Actually, strcspn(), and strspn() would probably be the better choices
if you really wanted to push it.

There's
my head.

There's a beat in my head.

/* BEGIN str_tok_r.c */

#include <stdio.h>

char *str_tok_r (char *, const char *, char **);
size_t str_spn (const char *, const char *);
size_t str_cspn (const char *, const char *);
char *str_chr (const char *, int);
char *str_cpy(char *, const char *);
char *squeeze (char *, const char *);

#define STRING "\tThere's\n a\r beat in \r\tmy head. \n"

int main(void)
{
    const char *const original = STRING;
    char s1[sizeof STRING];

    puts(original);
    puts(squeeze(str_cpy(s1, original), "\n\r\t"));
    return 0;
}

char *str_tok_r(char *s1, const char *s2, char **p1)
{
    if (s1 != NULL) {
        *p1 = s1;
    }
    s1 = *p1 + str_spn(*p1, s2);
    if (*s1 == '\0') {
        return NULL;
    }
    *p1 = s1 + str_cspn(s1, s2);
    if (**p1 != '\0') {
        *(*p1)++ = '\0';
    }
    return s1;
}

size_t str_spn(const char *s1, const char *s2)
{
    const char *const p1 = s1;

    while (*s1 != '\0' && str_chr(s2, *s1) != NULL) {
        ++s1;
    }
    return s1 - p1;
}

size_t str_cspn(const char *s1, const char *s2)
{
    const char *const p1 = s1;

    while (str_chr(s2, *s1) == NULL) {
        ++s1;
    }
    return s1 - p1;
}

char *str_chr(const char *s, int c)
{
    while (*s != (char)c) {
        if (!*s) {
            return NULL;
        }
        ++s;
    }
    return (char *)s;
}

char *str_cpy(char *s1, const char *s2)
{
    char *const p1 = s1;

    do {
        *s1 = *s2++;
    } while (*s1++ != '\0');
    return p1;
}

char *squeeze(char *s1, const char *s2)
{
    char *p3;
    char const *p2;
    char *const p1 = s1;

    p2 = str_tok_r(s1, s2, &p3);
    while (p2 != NULL) {
        while (*p2 != '\0') {
            *s1++ = *p2++;
        }
        p2 = str_tok_r(NULL, s2, &p3);
    }
    *s1 = '\0';
    return p1;
}

/* END str_tok_r.c */
 
Daniel Haude

["Followup-To:" header set to comp.lang.c.]
On 16 Jan 2004 02:50:32 -0800,
in Msg. said:
And even simpler to get it wrong, or create an inadequate solution as
we shall see.

I've never claimed that my solution was adequate in that it solved the
OP's problem by 100%.
1) This code is complicated.

It's a mere 23 lines. You want to see complicated code?
You have several loops,

Just two, non-nested, and only one of them non-trivial.
and state machines,

zero state machines
and I have difficulty knowing how to know that this code is
correct.

Like with any third-party function you've got to either trust the
documentation or write your own.
In the middle there you've hidden a nice "a > s" comparison
... I didn't know you could compare pointers like that in a portable
way.

You can as long as they point into the same array.
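
To make the same-array rule concrete, here is a small sketch of a trailing-whitespace trimmer. The name rtrim is hypothetical and this is not the code under discussion. Both pointers are derived from the same string, so the relational comparison is well defined, and testing the bound before the dereference keeps every read inside the array:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper: strip trailing whitespace from `s` in place.
 * `end` always points into the array that `s` points to (or at its
 * terminating null), so `end > s` compares pointers into one object.
 */
char *rtrim(char *s)
{
    char *end;

    assert(s != NULL);
    end = s + strlen(s);
    /* Check the bound first, then look at the character before `end`. */
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}
```

The cast to unsigned char is there because passing a negative char value to isspace() is undefined.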
2) The code only shows the inner loop -- the original request asks for
filling a two dimensional array.

Again, I never claimed to've been trying to solve the OP's problem. I was
merely giving an example of how to parse a string in C.
3) It encodes the classic C worthlessness of requiring that you
specify the size of your containers (array) up front, before you know
the size of the data you require.

Easily fixed by a reallocing mechanism that I didn't bother with. The
function I gave is specifically geared to parsing csv tables where the
number of columns is usually known.
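
One possible shape for such a reallocating mechanism, as a sketch only: the function name split_grow, the starting capacity of 4 and the doubling policy are all assumptions, and whitespace trimming is left out to keep the growth logic visible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: split `line` in place on the single character `sep`
 * and return a malloc'ed, NULL-terminated array of pointers into `line`.
 * The array is doubled with realloc() whenever it fills up.
 * Returns NULL on allocation failure. The caller frees the array only.
 */
char **split_grow(char *line, int sep, size_t *count)
{
    size_t cap = 4, n = 0;
    char **arr, **tmp;
    char *p = line;

    assert(line != NULL && sep != '\0');
    arr = malloc(cap * sizeof *arr);
    if (arr == NULL)
        return NULL;

    for (;;) {
        char *q = strchr(p, sep);   /* next separator, or NULL at the end */

        if (n + 2 > cap) {          /* need room for this field and NULL */
            cap *= 2;
            tmp = realloc(arr, cap * sizeof *arr);
            if (tmp == NULL) {
                free(arr);
                return NULL;
            }
            arr = tmp;
        }
        arr[n++] = p;
        if (q == NULL)
            break;
        *q = '\0';                  /* terminate the field in place */
        p = q + 1;
    }
    arr[n] = NULL;
    if (count != NULL)
        *count = n;
    return arr;
}
```

The caller no longer has to guess a field count up front; the price is a few realloc() calls on wide rows.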
The original poster also posted to
C++ -- using an STL vector to hold the result is probably the right
answer.

In C++ it would be.
With or without the '\n' on the end? And do you compensate for a
potential '\r' in there?

The "doc" (my comments) states that leading and trailing whitespace gets
chopped off all tokens.
Are you doing the typical C thing of pushing
issues and complexities upward?

No, but you're doing the typical thing of not reading the documentation.
It's also the minimum malloc size. BTW, what if you decide to pass a
negative value for maxfields?

UB, obviously. Trivially fixed with a single line. A bug.
tabs but not spaces? Tabs/spaces are chosen because of their human
readability/editability features.

Yes, but it's difficult to parse tables that contain empty cells or
elements with whitespace in them.
I mean, if you use them as
separators you use them interchangeably, in arbitrary numbers, and
with the possibility that one or the other might not be present. A
quick look at your algorithm makes it look like it will fail if one
pair of entries is only separated by spaces.

You're right: My algorithm gets tripped up when anything that's
isspace() is used as a separator.
Which makes it only barely better than strtok ...

It's re-entrant, and in the usual file-parsing situation (reading csv
data) each line is typically only used once.

I specifically didn't want to allocate additional memory for the field's
contents in the function because it's 1) an unnecessary waste of memory
and performance most of the time, and 2) it's easily provided by 2 extra
standard function calls outside my routine.
Ok ... this is another classic case of pushing complexity upwards.

It's better to push complexity upwards than implementing it downstairs
where it, although unneeded in most cases, may lead to resource and
performance penalties. Especially when the "complexity" involves nothing
but one call to strdup() and another one to free().
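
For readers who want to keep the original line intact, the copy step might look like the sketch below. strdup() comes from POSIX rather than C89/C99, so this uses a hypothetical stand-in named dup_string built from malloc() and memcpy():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for strdup(): return a malloc'ed copy of `s`,
 * or NULL if the allocation fails. The caller frees the copy.
 */
char *dup_string(const char *s)
{
    size_t len;
    char *copy;

    assert(s != NULL);
    len = strlen(s) + 1;            /* include the terminating null */
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

The idea is to split the copy instead of the original, keep the copy alive for as long as the returned field pointers are in use, and free() it afterwards.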
Ok, I'm not sure the OP was asking that you change the policy they
have decided upon.

Hey, I have better things to do than writing code for other people. All I
did was give a simple example of how to do efficient string parsing with a
few lines in C.

--Daniel
 
Default User

Paul said:
I never said that.

Yes, you did, you said the string parsing functions were worthless. You
can't use worthless things to create things of worth. I would buy
"tricky" or "able to blow your foot off", but not "worthless".
However, attempting to do so ends up contributing
to the coffers of Security Focus, CERT, Reasoning, and other late/post
production software failure analysis companies.

Nonsense. You are saying that any attempt to use them, no matter the
dedication and skill level of the programmer, will fail. That is
manifestly untrue. While there have been products with problems, there
are products without problems as well.
With a masochistic bent, of course.

If you are working in C, then the choices are limited. While using these
functions takes some time to learn, once you become comfortable it's not
particularly onerous.
For this particular task it's
actually best to completely leave the entire C string library behind
-- it just doesn't offer anything useful that can't be done better
using some other way.

That's your opinion, not one that I share. I've been using these
functions for a long time, and find them to be useful. For C++
programming, I would (and do) use std::string.
Actually, strcspn(), and strspn() would probably be the better choices
if you really wanted to push it. But the point is that if you are
forced to implement your own state machines to hand hold the parsing
anyway, then you might as well roll your own right down to the raw
characters. A big for-switch statement with a few counters and flags
-- it will be just as readable.

No, go the other way. Encapsulate this and make your own version of
strtok() that is safe. Then you have that in your personal library. Or
find one of the many ones already available.
As demonstrated by your impressive proposal to solve the OP's problem.

I made no attempt to solve said problem. The "problem" I was addressing
was your post.




Brian Rodenborn
 
Jerry Coffin

Hello,

I am a newbie and I have to solve this problem as fast as I can. But
at this time I don't have a lot of success.
Can anybody help me (and understand my English :))?

I have a .txt-file in which the data is structured in that way:
Project-Nr. ID name lastname
33 9 Lars Lundel
33 12 Emil Korla
34 19 Lara Keuler
33 13 Thorsten Lammert

These data have to be read out row by row.
Every row has to be split (delimiter is TAB) and has to be saved in

Under the circumstances, I would NOT use a 2D array -- instead, I'd use
a struct (or in C++ a class). I'd then create an array of those structs
(or in C++, a map or perhaps a set).

I'm not going to post code since you've cross-posted to c.l.c and
c.l.c++, and any code that's well-written for one will be off-topic in
the other.
 
