# FindFirstIn

K

#### Kaz Kylheku

Or

size_t idx = strcspn(str, set);

As far as complexity analysis goes, the first solution has a worst case of M
x N operations, whereas the second one has 256 * M operations.

How do you figure that? Either one can be written in terms of the other:

char *xstrpbrk(char *str, char *set)
{
size_t len = strcspn(str, set);
return str[len] == 0 ? 0 : str + len;
}

and:

size_t xstrcspn(char *str, char *set)
{
char *brk = strpbrk(str, set);
return brk == 0 ? strlen(str) : brk - str;
}

K

#### Kaz Kylheku

Le 22/05/2014 20:21, Kaz Kylheku a Ã©crit :

How would you write that?

Like this:

#include <string.h>

char *first_occurrence = strpbrk(str, set);

M

#### Martin Shobe

jacob navia said:
Le 23/05/2014 08:13, Ian Collins a Ã©crit :
jacob navia wrote:

A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

There are rather a lot of type mismatches there! Not least the return
type and the variable returned...
Since I am using the characters as an index into a table I have to
avoid indexing using negative numbers. I am forced to use *unsigned*
chars.

Since in the interface I do not want to use unsigned char I am stuck
with implicit conversions.

I do not see any other way to get out of this problem. Do you?

Yes. You could write

bool table[256];
bool *tab = table - CHAR_MIN;

or possibly

bool table[CHAR_MAX - CHAR_MIN];
bool * tab = table - CHAR_MIN;

(You might want to put in a sanity check as well in the off chance that
this results in a table larger than you want to use (like when CHAR_BIT
is set to 32).)

[snip]

Martin Shobe

J

#### Johannes Bauer

Le 23/05/2014 07:45, Ike Naar a écrit :
would be wrong

Ah, so you write code and everyone else has to be your code monkey and
debug it for you, yes?

Well, let me be your code monkey: When I try it, it immediately
segfaults. Wrong calling convention. You expect something in rcx that my
compiler puts in either rdi or rsi.

Maybe you've just discovered two new metrics which would be interesting:
Portability. Maintainability. Because I can create a program that
segfaults in definitely less than 77 bytes.

Cheers,
Johannes

--
Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <[email protected]>

B

#### Ben Bacarisse

Martin Shobe said:
jacob navia said:
Le 23/05/2014 08:13, Ian Collins a Ã©crit :
jacob navia wrote:

A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

There are rather a lot of type mismatches there! Not least the return
type and the variable returned...

Since I am using the characters as an index into a table I have to
avoid indexing using negative numbers. I am forced to use *unsigned*
chars.

Since in the interface I do not want to use unsigned char I am stuck
with implicit conversions.

I do not see any other way to get out of this problem. Do you?

Yes. You could write

bool table[256];
bool *tab = table - CHAR_MIN;

or possibly

bool table[CHAR_MAX - CHAR_MIN];
bool * tab = table - CHAR_MIN;

Yes, that's better, and since I was tidying the original up I should
have gone that step further.
(You might want to put in a sanity check as well in the off chance
that this results in a table larger than you want to use (like when
CHAR_BIT is set to 32).)

I agree about the sanity check because the scheme only makes sense for
reasonably small ranges for char.

K

#### Keith Thompson

jacob navia said:
Le 23/05/2014 08:13, Ian Collins a Ã©crit :
jacob said:
A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

There are rather a lot of type mismatches there! Not least the return
type and the variable returned...
Since I am using the characters as an index into a table I have to avoid
indexing using negative numbers. I am forced to use *unsigned* chars.

Since in the interface I do not want to use unsigned char I am stuck
with implicit conversions.

There is no implicit conversion from unsigned char* to char* or vice
versa in standard C.

B

#### Ben Bacarisse

jacob navia said:
Le 22/05/2014 18:42, jacob navia a Ã©crit :
I rewrote the second function in assembler. Only 77 bytes!

gcc (with -Os) is 350.

I think that may be measuring other things. I get 120 bytes for your
second version (gcc -Os, version 4.8.2) and 69 for my re-write of it.
Maybe you have a particularly old or bad version of gcc?

I, too, can't use your assembler code, despite having the right hardware
and an assembler that understands the syntax (except for the comments).
The vastly more portable C wins every time, for me.

J

#### jacob navia

Le 23/05/2014 16:56, Johannes Bauer a écrit :
Ah, so you write code and everyone else has to be your code monkey and
debug it for you, yes?

Well, let me be your code monkey: When I try it, it immediately
segfaults. Wrong calling convention. You expect something in rcx that my
compiler puts in either rdi or rsi.

How clever of you.

As you know, it is for windows calling conventions. For a linux version
it would be even simpler since rsi doesn't need to be saved.
Maybe you've just discovered two new metrics which would be interesting:
Portability.

It is portable to:

Mac OS X
Linux
Solaris
Windows
The Surface tablet
and a long etc.

Yes, the calling conventions change

J

#### jacob navia

Le 23/05/2014 14:56, Kaz Kylheku a Ã©crit :
Like this:

#include <string.h>

char *first_occurrence = strpbrk(str, set);

Great.

You win. Very clever

J

#### jacob navia

Le 23/05/2014 13:47, Ben Bacarisse a Ã©crit :
bool table[256] = {0};
bool *tab = table - CHAR_MIN;
do tab[*set] = true; while (*set++);

while (!tab[*str]) ++str;
return str;

gcc compiles this to 20/21 instructions (depending on optimising for
speed or size). That's fewer than your hand-coded assembler version,
though it may, of course, be either more bytes, or slower, or both.

No, at least not in my Mac.
/tmp \$ cat strfind.c
#include <limits.h>
#include <stdbool.h>
char *strfind(char *str,char *set)
{
bool table[256] = {0};
bool *tab = table - CHAR_MIN;
do tab[*set] = true; while (*set++);

while (!tab[*str]) ++str;
return str;
}
/tmp \$ gcc -Os -c -std=c99 strfind.c
/tmp \$ objdump strfind.o
strfind.o:
code: 194 bytes <<<<<<<<<<<<<<<<-------<<<<<<<<<<<<<<<<<
debug: 64 bytes
/tmp \$ gcc -Os -c -S -std=c99 strfind.c
/tmp \$ cat strfind.s
.section __TEXT,__text,regular,pure_instructions
.globl _strfind
_strfind: ## @strfind
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
subq \$272, %rsp ## imm = 0x110
movq [email protected](%rip), %rax
movq (%rax), %rcx
movq %rcx, -8(%rbp)
xorps %xmm0, %xmm0
movaps %xmm0, -32(%rbp)
movaps %xmm0, -48(%rbp)
movaps %xmm0, -64(%rbp)
movaps %xmm0, -80(%rbp)
movaps %xmm0, -96(%rbp)
movaps %xmm0, -112(%rbp)
movaps %xmm0, -128(%rbp)
movaps %xmm0, -144(%rbp)
movaps %xmm0, -160(%rbp)
movaps %xmm0, -176(%rbp)
movaps %xmm0, -192(%rbp)
movaps %xmm0, -208(%rbp)
movaps %xmm0, -224(%rbp)
movaps %xmm0, -240(%rbp)
movaps %xmm0, -256(%rbp)
movaps %xmm0, -272(%rbp)
leaq -272(%rbp), %rcx
LBB0_1: ## =>This Inner Loop Header: Depth=1
movsbq (%rsi), %rdx
movb \$1, 128(%rdx,%rcx)
cmpb \$0, (%rsi)
leaq 1(%rsi), %rsi
jne LBB0_1
decq %rdi
Depth=1
movsbq 1(%rdi), %rdx
incq %rdi
cmpb \$0, 128(%rdx,%rcx)
je LBB0_3
## BB#4:
movq (%rax), %rax
cmpq -8(%rbp), %rax
jne LBB0_6
## BB#5:
movq %rdi, %rax
addq \$272, %rsp ## imm = 0x110
popq %rbp
ret
LBB0_6:
callq ___stack_chk_fail
.cfi_endproc

..subsections_via_symbols
/tmp \$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.2.0

Analysis:

Instead of using the stosq instruction, gcc repeats 32 times the store
of 16 bytes using xmm0. In my opinion this is a misguided optimisation.

J

#### jacob navia

Here is a gcc compatible version.
Tested under Mac OS X

As you can see, you just replace rcx, rdx by rdi, rsi

.globl _strFindFirst
_strFindFirst:
subq \$256,%rsp
cmpb \$0,(%rdi)
je _\$50
movq %rdi,%r10
movl \$32,%ecx
xorl %eax,%eax
movq %rsp,%rdi
rep
stosq
movq %r10,%rdi
jmp _\$17
_\$16:
incb (%rsp,%rax)
incq %rsi
_\$17:
movb (%rsi),%al
testb %al,%al
jne _\$16
_\$19:
movb (%rdi),%al
cmpb \$0,(%rsp,%rax)
jne _\$50
incq %rdi
_\$20:
cmpb \$0,(%rdi)
jne _\$19
_\$50:
movq %rdi,%rax
ret

K

#### Kaz Kylheku

Le 23/05/2014 16:56, Johannes Bauer a Ã©crit :

How clever of you.

As you know, it is for windows calling conventions. For a linux version
it would be even simpler since rsi doesn't need to be saved.

You will *get* RSI typing all that code.

J

#### jacob navia

Le 23/05/2014 17:39, Keith Thompson a Ã©crit :
There is no implicit conversion from unsigned char* to char* or vice
versa in standard C.

char *foo(char *str)
{
unsigned char *p = (unsigned char *)str;
return p; // This one
}

I thought that is an implicit conversion.

This is a bug in the language. Characters are not small integers. They
are represented by small integers but they aren't integers and as such
they have no sign.

K

#### Kaz Kylheku

Le 23/05/2014 17:39, Keith Thompson a Ã©crit :

char *foo(char *str)
{
unsigned char *p = (unsigned char *)str;
return p; // This one
}

I thought that is an implicit conversion.

What? That may be implicit, but it requires a diagnostic.

The implicit conversion from unsigned char * to char * which doesn't
require a diagnostic is this one:

char *foo(unsigned char *str)
{
void *v = str;
return v;
}
This is a bug in the language. Characters are not small integers. They
are represented by small integers but they aren't integers and as such
they have no sign.

Then again, there is hardware that gives us signed and unsigned operations on
bytes, and signed and unsigned chars map to that.

Characters may not be small integers, but C chars are small integers.

I

#### Ike Naar

Le 23/05/2014 07:45, Ike Naar a ?crit :

would be wrong

It's not bad faith. I just have too little x86-64 expertise to verify
that the code is correct.
1 .text 64
2 ; char *strFindFirst(char *str,char *set)
3 strFindFirst:
4 ; Make space for a table of 256 positions
5 subq \$256,%rsp
6 ; if (*str==0) return p;
7 cmpb \$0,(%rcx)
8 je SetResultAndExit

If you are coding for minimal size you could do away with
the code on lines 7 and 8.
9 ; memset(tab,0,sizeof(tab));
10 ; Save the registers rdi and rcx into r10 and r11
11 movq %rdi,%r10
12 movq %rcx,%r11
13 ; Prepare for the stosq loop: Count is 32 (256/8),rax is zero
14 ; and destination is rsp, stored in rdi.
15 movl \$32,%ecx
16 xorl %eax,%eax
17 movq %rsp,%rdi
18 rep
19 stosq
20 movq %r10,%rdi
21 movq %r11,%rcx
22 jmp TestFillTableLoopCondition
23 FillTable:
24 ; tab[*(unsigned char *)set++]=1;
25 incb (%rsp,%rax)
26 incq %rdx
27 TestFillTableLoopCondition:
28 movb (%rdx),%al
29 testb %al,%al
30 jne FillTable
31 SearchCharLoop:
32 ; if (tab[*p]) return p;
33 movb (%rcx),%al ; move char to al
34 cmpb \$0,(%rsp,%rax) ; index table with it
35 jne SetResultAndExit ; test if nonzero
36 incq %rcx ; next char
37 cmpb \$0,(%rcx) ; test if zero
38 jne SearchCharLoop
39 SetResultAndExit:
40 movq %rcx,%rax ; set result
41 addq \$256,%rsp ; restore stack
42 ret

B

#### Ben Bacarisse

jacob navia said:
Le 23/05/2014 13:47, Ben Bacarisse a Ã©crit :
bool table[256] = {0};
bool *tab = table - CHAR_MIN;
do tab[*set] = true; while (*set++);

while (!tab[*str]) ++str;
return str;

gcc compiles this to 20/21 instructions (depending on optimising for
speed or size). That's fewer than your hand-coded assembler version,
though it may, of course, be either more bytes, or slower, or both.

No, at least not in my Mac.
/tmp \$ gcc -v
Configured with:
--prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.2.0

You'll have to help me out here. That does not look like a gcc version
at all, but I am not a Mac person. That output references LLVM and
clang which I find odd. Is there any correspondence with gcc version
numbers on other systems?

Oddly, I get the same code when I use clang rather than gcc on my Linux
box. Is that just a coincidence?

<snip>

I

#### Ike Naar

It's not bad faith. I just have too little x86-64 expertise to verify
that the code is correct.

If you are coding for minimal size you could do away with
the code on lines 7 and 8.

This is not correct.
Which proves that I'm lacking the x86-64 expertise.

K

#### Keith Thompson

jacob navia said:
Le 23/05/2014 17:39, Keith Thompson a Ã©crit :

char *foo(char *str)
{
unsigned char *p = (unsigned char *)str;
return p; // This one
}

I thought that is an implicit conversion.

It would be if it weren't a constraint violation.

char and unsigned char are distinct types (even if they have the same
representation), and they are not *compatible types* in the sense that
the standard defines that phrase.

There are implicit conversions between char and unsigned char, as there
are between any two arithmetic types, but there are no such conversions
for char* and unsigned char*.

See N1570 6.5.4p3:

Conversions that involve pointers, other than where permitted by the
constraints of 6.5.16.1, shall be specified by means of an explicit
cast.

and 6.5.16.1, which constrains the operands of a simple assignment (the
same constraints apply to return statements).

gcc with "-pedantic" or "-pedantic-errors" diagnoses the return
statement. Does the compiler you're using not do so?
This is a bug in the language. Characters are not small integers. They
are represented by small integers but they aren't integers and as such
they have no sign.

In C, char, signed char, and unsigned char are all small integer types.
I won't argue whether whether that's a good idea, but it's definitely a
deliberate design feature.