How to tokenize string without using strtok

B

bubunia2000

Hi all,
I heard that strtok is not thread safe. So I want to write a
sample program which will tokenize string without using strtok.

Can I get a sample source code for the same.

For exp:
0.0.0.0--->I want to tokenize the string using delimiter as as dot.

Regards
Sujeet
 
F

Fred Kleinschmidt

Hi all,
I heard that strtok is not thread safe. So I want to write a
sample program which will tokenize string without using strtok.

Can I get a sample source code for the same.

For exp:
0.0.0.0--->I want to tokenize the string using delimiter as as dot.

Regards
Sujeet

One way is to write a function similar to strtok() but keep track of the
last place that a NULL was placed. In general this will probably mean two
extra parameters: the index in the array, and what the character there was.
Something like:

char *mystrtok( char *string, char *delimiters, int *index, char *c )
{
/* write the real code for your homework here to find n and to set index
and c */
return string+n; /* the token */
}
 
W

Walter Roberson

I heard that strtok is not thread safe. So I want to write a
sample program which will tokenize string without using strtok.

Correct, strtok is not thread safe, but C itself is not thread safe
so the matter is not within the perview of the C standard.
Can I get a sample source code for the same.

If your system provides thread facilities, it may well also
provide strtok_r(), a common name for a thread-safe work-alike to
strtok() .
For exp:
0.0.0.0--->I want to tokenize the string using delimiter as as dot.

Use strcspn(), then one of the counted-size memory copy routines.

But after that you'll have to figure out what you want to do if there
are multiple delimeters -- for example, how would you want to parse

0.0..255.255

Is that 0.0.0.255 with an extra .255 at the end, or is it
0.0.255.255 with the extra delimeter skipped, or is it
0.0 then an indicator that you've reached the end of the first grouping,
which is then followed by a second grouping 255.255, or is
it to be parsed as indicating the range 0.0 through 255.255 ?
 
J

Jordan Abel

Correct, strtok is not thread safe, but C itself is not thread safe
so the matter is not within the perview of the C standard.

strtok is also otherwise unsafe for the same reason - you can't start a
tokenization on a string while you're in the middle of another one and
then come back to the first.
 
B

Ben Pfaff

Jordan Abel said:
strtok is also otherwise unsafe for the same reason - you can't start a
tokenization on a string while you're in the middle of another one and
then come back to the first.

There are multiple other reasons to avoid strtok():

* It merges adjacent delimiters. If you use a comma as
your delimiter, then "a,,b,c" is three tokens, not
four. This is often the wrong thing to do. In fact,
it is only the right thing to do, in my experience,
when the delimiter set is limited to white space.

* The identity of the delimiter is lost, because it is
changed to a null terminator.

* It modifies the string that it tokenizes. This is bad
because it forces you to make a copy of the string if
you want to use it later. It also means that you can't
tokenize a string literal with it; this is not
necessarily something you'd want to do all the time but
it is surprising.
 
W

websnarf

Hi all,
I heard that strtok is not thread safe.

Well, technically, its not "thread-well defined". You *can* make it
threadsafe by using a thread local static for its implicit storage, but
this means that strtok doesn't literally follow its claimed
functionality if you try to tokenize across multiple threads (which is
a marginal case).

The issue is that strtok is not *re-entrant* which is a far worse
problem. It means you cannot even *nest* strtoks in two loops. For
example, it is very typical to read a textfile and want to break it
into lines, then from each line you would like to break it into
sub-tokens like words of some kind. So what you would *like* to do is
to write a two level nested loop where you strtok for lines on the
outer loop, and for words in the inner loop. You can't do that because
successive calls to strtok *have* to perform all the tokenization
before the next string is processed, which means nested processing is
not possible.
[...] So I want to write a
sample program which will tokenize string without using strtok.

The good news is that strspn and strcspn perform basically equivalent
functionality, except that you have to do the little bit of pointer
tracking yourself ...
Can I get a sample source code for the same.

.... the bad news is that nobody ever uses these functions. And so
example code is hard to come by.
For exp:
0.0.0.0--->I want to tokenize the string using delimiter as as dot.

Well, by itself, this doesn't require nesting, so you can just use
strtok directly. But I would avoid strtok anyways, since it would
otherwise prevent you from extending the inner loop of your code to do
further or other parsing.

For example, lets say you write some code using strtok that works and
does the IP address parsing you want. Then you want to see if the IP
address matches a given mask, which also has to be parsed. A neat, and
clever way to do it would be to simply simultaneously parse through the
IP address and the IP mask, interleaved, then match each address byte
as you go. The point being that if it mismatches early, then you don't
have to parse all the way to know there is a mismatch (much like strcmp
doesn't have to go down the whole string so long as there is an early
mismatch.)

But strtok screws you, because you can't bob back and forth between two
strings as you tokenize with it.

Its interesting that there have been great outside efforts to improve
strncpy and strncat via the functions strlcpy and strlcat, but that
similar effort has not been made for strtok.

In any event, if you want to avoid all these issues, you can use "The
Better String Library" (Bstrlib) given in the URL at the very bottom of
this post. As a rule of thumb, Bstrlib never has these or any kind of
similar issue, and the length of code to perform any non-trivial string
manipulation is typically half that required for equivalent C-library
code. In this case, the main functionality is one line of code:

struct bstrList * pl = bsplit (ipstr, '.');
 
D

Default User

Ben Pfaff wrote:

There are multiple other reasons to avoid strtok():

* It merges adjacent delimiters. If you use a comma as
your delimiter, then "a,,b,c" is three tokens, not
four. This is often the wrong thing to do. In fact,
it is only the right thing to do, in my experience,
when the delimiter set is limited to white space.

I will disagree slightly. If you are parsing text sentences into words
and don't care about punctuation, then merging is also correct.



Brian
 
B

Ben Bacarisse

... but C itself is not thread safe so the
matter is not within the perview of the C standard.

You keep saying thing that stop me short! I am interesting in the general
view of this. Are you saying that standard C does not provide enough
guarantees to implement a shared-memory mutex using varaibale testing and
setting alone?

I can see that how the threads get there in the first place is OT, but I
always thought the standard said enough to allow one to implement such a
thing if one were forced to need it.

Appoligies if this is OT, but I thought that what all the minimum
"abstract machine" requirements for "volatile" were about!
 
B

Ben Pfaff

Default User said:
I will disagree slightly. If you are parsing text sentences into words
and don't care about punctuation, then merging is also correct.

Good point. I'll revise my boilerplate text. How about this:

* It merges adjacent delimiters. If you use a comma as your
delimiter, then "a,,b,c" will be divided into three tokens,
not four. This is often the wrong thing to do. In fact, it
is only the right thing to do, in my experience, when the
delimiter set contains white space (for dividing a string
into "words") or it is known in advance that there will be
no adjacent delimiters.
 
W

Walter Roberson

You keep saying thing that stop me short! I am interesting in the general
view of this. Are you saying that standard C does not provide enough
guarantees to implement a shared-memory mutex using varaibale testing and
setting alone?

C has no idea what shared memory is, and provides no
operations that can test-and-set any variable atomically.
It does have sig_atomic_t

which is the integral type of an object that can be accessed
as an atomic entity, even in the presence of asychronous interrupts

The relevant section of C89 does not define "access", but the
C89 Rationale indicates,

The Committee concluded that about the only thing a strictly
conforming program can do in a signal handler is to assign a value
to a volatile static variable which can be written uninterruptedly
and promptly return. (THe header <signal.h> specifies a type
sig_atomic_t which can be so written.)

This clarifies that "access" includes at least -writing- to the
sig_atomic_t variable, but it does not tell us about -reading- the
sig_atomic_t variable within the signal handler.

Nothing in C89 gives test-and-set assurances, so although
you could possibly read the sig_atomic_t variable, test its value, and
write a new value based upon what you found there, you don't know
that you won't have had an interrupt between the time of the read and
the time of the write, so you might clobber a mutex lock that something
else has written there while you blinked.

The closest that you might be able to get is to mask off all
maskable signals except one, and then when you want to operate on
your mutex, raise() the signal and then rely upon the signal handler
mechanism to not allow anything else to raise() the same signal while
you are in the handler. That might work if your threading is single
processor with single dispatcher, but if your threading is multiprocessor
or multidispatcher then you can't be sure that no other copies of
the signal handler are in play.

I can see that how the threads get there in the first place is OT, but I
always thought the standard said enough to allow one to implement such a
thing if one were forced to need it.

No, because the standard is written to encompass processors which
have no test-and-set functionality. It is not uncommon for small
processors to not have that or equivilent operations, and to instead
rely upon interrupt levels to manage concurrancy. Obviously forcing
the existance of interrupt levels or test-and-set operations is beyond
what the standard should concern itself with.

In a case like this, it is what the standard does NOT say that is
important. The standard does NOT say that you cannot have an
implementation call that will implement mutexes (somehow). The
standard does not prohibit shared memory. But neither does the
C89 standard provide any certainty at all that the environment
is even -able- to support mutexes (or threads) through -any- means.
 
D

Default User

Ben said:
Good point. I'll revise my boilerplate text. How about this:

* It merges adjacent delimiters. If you use a comma as your
delimiter, then "a,,b,c" will be divided into three tokens,
not four. This is often the wrong thing to do. In fact, it
is only the right thing to do, in my experience, when the
delimiter set contains white space (for dividing a string
into "words") or it is known in advance that there will be
no adjacent delimiters.


I'm not sure the text captured what I was going for, I'll give an
example:

"In fact, it is only the right thing to do, in my experience, when the
delimiter set contains white space."

When parsing the text above into words, we want to count the ", "
following "fact", "do", or "experience" as a single delimiter. The
merging behavior of strtok() does that.




Brian
 
J

Joe Wright

Ben said:
Good point. I'll revise my boilerplate text. How about this:

* It merges adjacent delimiters. If you use a comma as your
delimiter, then "a,,b,c" will be divided into three tokens,
not four. This is often the wrong thing to do. In fact, it
is only the right thing to do, in my experience, when the
delimiter set contains white space (for dividing a string
into "words") or it is known in advance that there will be
no adjacent delimiters.
It seems to me that strtok() is a solution in search of a problem. I
learned it several years ago when I began C programming but have never
used it in any 'real' program since. What good is it?
 
D

Default User

Joe Wright wrote:

It seems to me that strtok() is a solution in search of a problem. I
learned it several years ago when I began C programming but have
never used it in any 'real' program since. What good is it?


It's a pretty compact way to tokenize a string, if you can handle the
limitations and are comfortable with the delimiter merging. It forms
the core of the parser for my text-adventure game, for instance, that's
why I was right on that "breaking strings into words" business.




Brian
 
B

Ben Pfaff

Joe Wright said:
It seems to me that strtok() is a solution in search of a problem. I
learned it several years ago when I began C programming but have never
used it in any 'real' program since. What good is it?

It's dangerous. I never use it, except in toy programs. I use
replacements that don't suffer from recursion issues, such as (in
programs where it is available) strtok_r.

I write a lot of programs that do various kinds of parsing of
strings. If you don't, then you might never find the need for
such a thing.
 
B

Ben Bacarisse

Nothing in C89 gives test-and-set assurances, so although you could
possibly read the sig_atomic_t variable, test its value, and write a new
value based upon what you found there, you don't know that you won't have
had an interrupt between the time of the read and the time of the write,
so you might clobber a mutex lock that something else has written there
while you blinked.

I think I see what you are talking about. I thought you were making a
more general point -- that one could not, even in principle, implement a
mutual exclusion algorithm (such as Peterson's) with what little
assurances standard C provides. I find this part of the standard obtuse
so I would not be happy to assert that one can (although my money would be
on a "yes" rather than a "no" at the moment!).
 
C

CBFalconer

Default said:
I'm not sure the text captured what I was going for, I'll give an
example:

"In fact, it is only the right thing to do, in my experience,
when the delimiter set contains white space."

When parsing the text above into words, we want to count the ", "
following "fact", "do", or "experience" as a single delimiter. The
merging behavior of strtok() does that.

I think a more useful general approach is:

Absorb leading blanks, followed by parsing the item up to
the first occurence of the delimiter. Absorb that delimiter.
This can detect an empty item.

Repeat



--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
 
J

Jordan Abel

Joe Wright wrote:




It's a pretty compact way to tokenize a string, if you can handle the
limitations and are comfortable with the delimiter merging. It forms
the core of the parser for my text-adventure game, for instance, that's
why I was right on that "breaking strings into words" business.

You can also use it to parse any of the half-dozen or so unix-ish file
formats that have ':'-delimited fields.
 
C

Chris Torek

(This is true for C99 as well. On every real machine that really
supports multiprocessing and/or threads, one can write a suitable
mutual exclusion routine, callable from C code, but it is rare that
one can write a good one even in machine-specific C. It usually
requires some kind of assembly code: either inline __asm__, or just
a simple C-callable assembly routine.)

I think I see what you are talking about. I thought you were making a
more general point -- that one could not, even in principle, implement a
mutual exclusion algorithm (such as Peterson's) with what little
assurances standard C provides.

You cannot, because on some real multiprocessor machines, memory
synchronization instructions are required to obtain store/load
sequencing guarantees. Consider the following "obvious" code:

# cpu 4 # cpu 7
store reg,[addr] nop
nop store reg,[addr]
load reg,[addr] nop

where instructions written on the same source line are executed in
the same clock tick, and [addr] is the same on both machines. You
would expect that the "load" on CPU 4 would see the value stored
by the "store" on CPU 7. But in fact, it may not (this depends on
unpredictable events like interrupts, making debugging difficult).
To make the result predictable, and make cpu 4 "see" cpu 7's store,
we have to replace the "nop" no-ops with "memory synchronize"
instructions (possibly in several flavors, and maybe only one is
required on each CPU; it depends quite a bit on the CPU architecture).

While the Java language has built-in thread and locking support,
it turns out that the locking was broken on some architectures,
such as the Alpha. Getting it right was tricky; a committee spent
years defining the memory model. See
<http://www.cs.umd.edu/~pugh/java/memoryModel>.
 
B

Ben Bacarisse

n article said:
I think I see what you are talking about. I thought you were making a
more general point -- that one could not, even in principle, implement a
mutual exclusion algorithm (such as Peterson's) with what little
assurances standard C provides.

You cannot, because on some real multiprocessor machines, memory
synchronization instructions are required to obtain store/load sequencing
guarantees. Consider the following "obvious" code:

# cpu 4 # cpu 7
store reg,[addr] nop
nop store reg,[addr] load reg,[addr]
nop

where instructions written on the same source line are executed in the
same clock tick, and [addr] is the same on both machines. You would
expect that the "load" on CPU 4 would see the value stored by the "store"
on CPU 7. But in fact, it may not (this depends on unpredictable events
like interrupts, making debugging difficult). To make the result
predictable, and make cpu 4 "see" cpu 7's store, we have to replace the
"nop" no-ops with "memory synchronize" instructions (possibly in several
flavors, and maybe only one is required on each CPU; it depends quite a
bit on the CPU architecture).

OK, I see that problem. I had assumed that declaring a variable volatile
would oblige the compiler to insert such synchronisation operations at
least at each sequence point prior to accessing it. Looking at the latest
ISO draft, I can't say that I find any support for this asssumption, but
then I can't see any wording that says that such synchronisation is not
required.
 
C

Chris Torek

OK, I see that problem. I had assumed that declaring a variable volatile
would oblige the compiler to insert such synchronisation operations at
least at each sequence point prior to accessing it.

There are those who argue that it should. But consider the V9 SPARC,
for instance. There are six commonly-used memory barrier instructions:

membar #LoadLoad
membar #LoadStore
membar #StoreLoad
membar #StoreStore
membar #MemIssue
membar #Sync

Which of these should the compiler use? Which one(s) do you actually
need?
Looking at the latest ISO draft, I can't say that I find any support
for this asssumption, but then I can't see any wording that says that
such synchronisation is not required.

Presumably, the C99 folks (wisely) decided that this area was too
complex to tackle, and simply left it undefined, so that those
dealing with the systems can define them.

The problem with standardizing behavior is that you are then stuck
with it, even if it is not the right behavior. (Of course, the problem
with *not* standarizing behavior is that everyone runs in different
directions. But sometimes this is what you want anyway.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,043
Latest member
CannalabsCBDReview

Latest Threads

Top