Christian said:
This definitely does too many comparisons. If *s1 == *s2 then *s1 and *s2
are either both zero, or neither of them is zero, so it is absolutely
pointless to check both of them.
Instead of worrying about the performance of strncmp, you should ask
yourself why you make so many calls to it that it matters at all.
What are you doing that requires millions of calls to strncmp?
This is indeed the key question. I have had a case where I had a large
number of names and had to repeatedly look for names in that list and
came up with a very different solution: I made an index over the list
and "matched" the names I was looking for. The point is, I didn't simply
iterate over the list, use a tree search, or hash the names: I built a
finite state machine (FSM). For fixed strings such as names it is very easy
to build a 'deterministic finite automaton' (DFA), a class of FSMs. What you
do is create an FSM in which each state represents the substring matched
so far from the beginning, the start state representing nothing matched yet.
Each state that represents a complete string from the original list is marked
an 'accept' state. The matching algorithm is very simple: you start
from the start state and for each character of the name you are looking
for you find a transition to another state. If no transition is found the
name is not in the list and the search terminates. Otherwise, you follow
the transition to the next state and the process repeats until you don't
find a transition for the current character or you are at the end of the
name you are looking for. If the latter happens and you are in an accept
state, you have matched the name.
The advantage of this algorithm: the time complexity is O(n) where n is
the length of the name you are looking for. You might be tempted to say
it's O(n * t) where t is the number of transitions you have to examine at
each state, but t is bounded by the size |T| of the character set, which is
a constant, so we get O(n * |T|) = O(n).
If you have to, say, find x names in a list of y names the total complexity
becomes O(x * n) where n is now the average length of the x names, compared
with O(x * y * n) for linear search (the factor n then stems from strcmp) or
O(x * log y * n) for tree search. Hashing should have O(x * n) in the best
case. You probably don't get any faster than an FSM.
Building the FSM is just about as simple: for each name that you add to the
FSM you follow the FSM built so far, similarly to the matching algorithm
described above. When you don't find a transition, you create a new state,
add a transition to it, and move there; from that point on the algorithm
continues. The state you are in when you reach the end of the added string
is marked 'accept'. You can probably figure out for yourself how you would
remove names from the DFA, it's not a big deal.
Note that this is a very simplistic form of DFA: in general they can be much
more powerful and match patterns with repetitions (think of * in Unix file
name patterns). These are called regular expressions. To learn how to construct
such DFAs, read for instance 'Compilers: Principles, Techniques, and Tools'
by Aho, Sethi and Ullman (the 'dragon book'). Or you can generate them with e.g. lex.
Another note: you can use such an FSM as a real index, pointing back to your
list of names by turning the 'accept' field into a pointer to the corresponding
list item (and NULL in states that are not accept states).
Finally, if it has advantages, it probably has disadvantages as well. That's
right: it consumes more memory and you had better take some care to do your
memory allocation efficiently: allocate states and transitions 1024 at a time
or some such.
--
ir. H.J.H.N. Kenter ^^
Electronic Design & Tools oo ) Philips Research Labs
Building WAY 3.23 =x= \ (e-mail address removed)
Prof. Holstlaan 4 (WAY31) | \ tel. +31 40 27 45334
5656 AA Eindhoven /|__ \ tfx. +31 40 27 44626
The Netherlands (____)_/
http://www.kenter.demon.nl/
Famous last words: Segmentation Fault (core dumped)