FindFirstIn

Discussion in 'C Programming' started by jacob navia, May 22, 2014.

  1. jacob navia

    jacob navia Guest

    Task:

    given a text, and a set of characters, find the first occurrence of any
    of the characters in the set.

    Interface

    There are two possibilities:
    1)
    size_t strFindFirst(char *text,char *set);

    The function would return the index of the first character in the string.

    2)
    char *strFindFirst(char *text,char *set);

    The function would return a pointer to a character in the middle of the
    string.

    Discussion
    ----------
    The first solution is OK if you are programing in C++ where counted
    strings are the norm. Lcc-win also provides you with counted strings
    using operator overloading and the first solution is choosen in that
    context.

    For a pure "standard C" solution however, the second solution could be
    more interesting: you have immediately the wanted character with just
    dereferencing a pointer.

    Actually of course both solutions are completely equivalent since you
    can add to the start of the string the index returned by the first
    solution and obtain a pointer that is equivalent to the second solution.

    A first approach:

    1 #include <string.h>
    2 char *strFindFirstIn(char *str,char *set)
    3 {
    4 char *pSet;
    5 int c = *str;
    6 while (*str) {
    7 pSet = set;
    8 while (*pSet) {
    9 if (c == *pSet++) return str;
    10 }
    11 c = *++str;
    12 }
    13 return str;
    14 }

    For each character in the text we search if it matches any of the
    characters in the string. If we have N characters in the text and M
    characters in the set, we make N x M character comparisons.

    A more sophisticated approach:

    1 char *strFindFirstIn1(char *str,char *set)
    2 {
    3 int c = *str;
    4 unsigned char tab[256];
    5 unsigned char *p=str;
    6
    7 if (*p==0) return p;
    8 memset(tab,0,sizeof(tab));
    9 while (*set) {
    10 tab[*(unsigned char *)set++]=1;
    11 }
    12 while (*p) {
    13 if (tab[*p]) return p;
    14 p++;
    15 }
    16 return p;
    17 }

    We make a table of 256 positions filled initially with zeroes. For each
    character in the set we fill the table with a "1" value. Then, we go
    through the text and see for each character if the position contans a
    zero (no match) or a one (match).

    Note that for repeated characters in the set, the first solution will
    again test for the character... but not in the second solution.

    Of course the second solution has a drawback if the text is very short
    and the set very small. It needs a zeroing of 256 positions, then the
    filling of the occupied positions.

    But in the case of a very long text and a longuish set, the second
    soution wins since it does only one comparison per character!

    How about you?

    Can you give a better solution?

    Comments welcome.

    Here is a test program.
    ----------------------

    1 #include <stdio.h>
    2 int main(void)
    3 {
    4 char *s=
    "This sentence will be changed to replace vowels by asterisks";
    5 char *p = strFindFirstIn1(s,"aeiou");
    6 while (*p) {
    7 *p++ = '*';
    8 p = strFindFirstIn(p,"aeiou");
    9 }
    10 printf("%s\n",s);
    11 }

    jacob
     
    jacob navia, May 22, 2014
    #1
    1. Advertisements

  2. jacob navia

    BartC Guest

    If the same set if being used to match against millions of short strings,
    then possibly creating the set can be a bottleneck. In this case, perhaps
    allow for the caller to supply the set (maybe using an extra function to
    create it separately). Then that overhead can be largely eliminated.

    Also, when searching the same (long) string many times, for a different set
    each time, it might be worth creating an index (table of char*) of the 256
    characters with the initial position in the string (with NULL if a character
    does not occur).

    Then search time for subsequent finds depends only on the length of the set
    (maximum 256). For a single character set, that would take no time at all.
     
    BartC, May 22, 2014
    #2
    1. Advertisements

  3. jacob navia

    Kaz Kylheku Guest

    #include <string.h>

    char *first_occurrence = strpbrk(str, set);

    Obscure trivia: the "break" in strpbrk and "span" in strspn come from Snobol.
     
    Kaz Kylheku, May 22, 2014
    #3
  4. Or

    size_t idx = strcspn(str, set);


    As far as complexity analysis goes, the first solution has a worst case of M x N operations, whereas the second one has 256 * M operations.

    - Anand
     
    Anand Hariharan, May 22, 2014
    #4
  5. jacob navia

    jacob navia Guest

    Le 22/05/2014 20:21, Kaz Kylheku a écrit :
    Well, who cares about that?

    How would you write that?

    That was my aim at posting the message. Something for people that start
    programming to see different ways of doing stuff.

    Thanks for your input.
     
    jacob navia, May 22, 2014
    #5
  6. jacob navia

    jacob navia Guest

    Le 22/05/2014 20:41, Anand Hariharan a écrit :
    The standard says about that, that
    "... computes the length of the maximum initial segment of the string
    pointed to by s1 which consists entirely of characters not from the
    string pointed to by s2."

    How clear. Instead of saying "returns the index of the first character
    of a set in a text"

    But anyway the goal was to show how that could be implemented.
     
    jacob navia, May 22, 2014
    #6
  7. They say different things so the authors could not have used your words
    instead. To compare two definitions for clarity they must say the same
    thing.
     
    Ben Bacarisse, May 22, 2014
    #7
  8. jacob navia

    jacob navia Guest

    Le 22/05/2014 22:44, Ben Bacarisse a écrit :
    Probably

    That standard is so well written that anybody understands what it says
    but me.
     
    jacob navia, May 22, 2014
    #8
  9. UB. You want char s[] = ...;
    Are you deliberately calling two separate versions?

    To avoid this (and to avoid duplicating the set) I'd write one call:

    p = s;
    while (*(p = strFindFirstIn(p, "aeiou"))) *p++ = '*';
     
    Ben Bacarisse, May 22, 2014
    #9
  10. I don't think it's hard. What is the index of the first character in
    the set "x" in the string "y"? I don't know. What is the length of the
    maximum initial segment of the string "y" which consists entirely of
    characters not in the string "x"? Clearly 1.
     
    Ben Bacarisse, May 22, 2014
    #10
  11. jacob navia

    jacob navia Guest

    Le 22/05/2014 23:49, Ben Bacarisse a écrit :

    Sure. To say that

    "If no character is found it returns the length of the string."

    that's way too complicated. I understand now. They had to choose a
    simpler way.

    GREAT!
     
    jacob navia, May 22, 2014
    #11
  12. jacob navia

    jacob navia Guest

    Le 22/05/2014 23:44, Ben Bacarisse a écrit :

    yes, thanks
    No, it is a typo
    Looks ugly. I prefer two calls...
     
    jacob navia, May 22, 2014
    #12
  13. jacob navia

    jacob navia Guest

    Le 22/05/2014 18:42, jacob navia a écrit :
    I rewrote the second function in assembler. Only 77 bytes!

    gcc (with -Os) is 350.
     
    jacob navia, May 22, 2014
    #13
  14. I can't argue with that, except to say I prefer the version where the
    error I pointed out is impossible! Anyway, I don't expect to persuade,
    just to put it out there. It's helpful, I think, to see options.
     
    Ben Bacarisse, May 23, 2014
    #14
  15. The sarcasm makes me *want* to disagree! But you get my point, I think.
    Had you presented specifications that covered the same cases in the same
    detail, an assessment could be made if one was indeed overly complex.
     
    Ben Bacarisse, May 23, 2014
    #15
  16. jacob navia

    Ike Naar Guest

    Another interesting metric would be the number of bugs.
     
    Ike Naar, May 23, 2014
    #16
  17. jacob navia

    Ian Collins Guest

    There are rather a lot of type mismatches there! Not least the return
    type and the variable returned...
     
    Ian Collins, May 23, 2014
    #17
  18. jacob navia

    jacob navia Guest

    Le 23/05/2014 08:13, Ian Collins a écrit :
    Since I am using the characters as an index into a table I have to avoid
    indexing using negative numbers. I am forced to use *unsigned* chars.

    Since in the interface I do not want to use unsigned char I am stuck
    with implicit conversions.

    I do not see any other way to get out of this problem. Do you?
     
    jacob navia, May 23, 2014
    #18
  19. jacob navia

    jacob navia Guest

    Le 23/05/2014 07:45, Ike Naar a écrit :
    Go ahead. If your remark is not just bad faith tell me why this program
    would be wrong
    1 .text 64
    2 ; char *strFindFirst(char *str,char *set)
    3 strFindFirst:
    4 ; Make space for a table of 256 positions
    5 subq $256,%rsp
    6 ; if (*str==0) return p;
    7 cmpb $0,(%rcx)
    8 je SetResultAndExit
    9 ; memset(tab,0,sizeof(tab));
    10 ; Save the registers rdi and rcx into r10 and r11
    11 movq %rdi,%r10
    12 movq %rcx,%r11
    13 ; Prepare for the stosq loop: Count is 32 (256/8),rax is zero
    14 ; and destination is rsp, stored in rdi.
    15 movl $32,%ecx
    16 xorl %eax,%eax
    17 movq %rsp,%rdi
    18 rep
    19 stosq
    20 movq %r10,%rdi
    21 movq %r11,%rcx
    22 jmp TestFillTableLoopCondition
    23 FillTable:
    24 ; tab[*(unsigned char *)set++]=1;
    25 incb (%rsp,%rax)
    26 incq %rdx
    27 TestFillTableLoopCondition:
    28 movb (%rdx),%al
    29 testb %al,%al
    30 jne FillTable
    31 SearchCharLoop:
    32 ; if (tab[*p]) return p;
    33 movb (%rcx),%al ; move char to al
    34 cmpb $0,(%rsp,%rax) ; index table with it
    35 jne SetResultAndExit ; test if nonzero
    36 incq %rcx ; next char
    37 cmpb $0,(%rcx) ; test if zero
    38 jne SearchCharLoop
    39 SetResultAndExit:
    40 movq %rcx,%rax ; set result
    41 addq $256,%rsp ; restore stack
    42 ret
     
    jacob navia, May 23, 2014
    #19
  20. Yes. You could write

    bool table[256];
    bool *tab = table - CHAR_MIN;

    (I used bool because that's logically the right type and you are a fan
    of C99, but there might be a reason you chose char.)

    You should certainly swap which pointer you do *(unsigned char *) on.
    If you make a local unsigned char *pset, change p to be char * and use
    the cast on tab[*p] instead, you get far fewer type mismatches with
    pretty much the same code complexity. (And you can remove c, it's not
    used.)

    But, since you return a pointer to the null byte when none of the
    non-null characters is found, logically the null is in the set of
    searched-for characters. The code gets a lot simpler if you treat the
    null just like the others:

    bool table[256] = {0};
    bool *tab = table - CHAR_MIN;
    do tab[*set] = true; while (*set++);

    while (!tab[*str]) ++str;
    return str;

    gcc compiles this to 20/21 instructions (depending on optimising for
    speed or size). That's fewer than your hand-coded assembler version,
    though it may, of course, be either more bytes, or slower, or both.
     
    Ben Bacarisse, May 23, 2014
    #20
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.
Similar Threads
There are no similar threads yet.
Loading...