C can be very literal.


Seebs

In this code line is the start of the line buffer, pe is the end of
the line buffer+1, and p is the start of the area to be moved down.
These are all of type wchar_t *.
memmove(line, p ((pe - line) * sizeof(wchar_t)) - ((p - line) *
sizeof(wchar_t));

.... Why not (pe - p) * sizeof(wchar_t)?
It wasn't exactly following my code, but it was doing unnecessary
multiplying and dividing, even if by a power of two.

I am not sure that "by a power of two" matters at all in this case.
So I changed it
to this: (uint is a typedef'd unsigned int.)
memmove(line, p, (((uint)pe - (uint)line) - ((uint)p - (uint)line)));

This isn't safe, don't do that. There is no guarantee that a pointer
will fit inside a uint without losing information which could result
in tragic failures.
More typing, but shorter, faster code and fewer temps.

Also code that no longer makes sense.

Me, I'd suggest:
1. Consider using (pe - p) * sizeof(wchar_t).
2. Consider using wmemmove() and just using (pe - p).
3. Or just use (((char *) pe) - ((char *) p)) as the character count.
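
The three suggestions can be sketched side by side. This is only an illustration, not the original poster's code: the function names are hypothetical, and line, p and pe are assumed to point into the same wchar_t array with line <= p <= pe. wmemmove() is declared in <wchar.h> on libraries that provide the C95 wide-character additions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Suggestion 1: element count times element size gives the byte count. */
void move_down_sizeof(wchar_t *line, const wchar_t *p, const wchar_t *pe)
{
    assert(line <= p && p <= pe);
    memmove(line, p, (size_t)(pe - p) * sizeof(wchar_t));
}

/* Suggestion 2: wmemmove() counts in wide characters, so no scaling. */
void move_down_wmemmove(wchar_t *line, const wchar_t *p, const wchar_t *pe)
{
    assert(line <= p && p <= pe);
    wmemmove(line, p, (size_t)(pe - p));
}

/* Suggestion 3: subtracting char pointers yields the byte count directly. */
void move_down_bytes(wchar_t *line, const wchar_t *p, const wchar_t *pe)
{
    assert(line <= p && p <= pe);
    memmove(line, p, (size_t)((const char *)pe - (const char *)p));
}
```

All three move the same bytes; they differ only in where the scaling by sizeof(wchar_t) is written down.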

-s
 

DSF

Hello,

A little peek "under the hood."

C can be very literal in its translation from C source to the final
executable code. Whether that's a blessing or a curse depends on what
you're doing. Take this short example:

(The variables were made static to give them some "substance."
Otherwise the compiler would either optimize the whole thing away or,
if I gave the variables "something to do," such as feeding them to
printf, pile them into as many registers as it can.)

#include <stdlib.h>

void foo(void);
void bar(void);

int main(void)
{
    foo();
    bar();

    return EXIT_SUCCESS;
}

void foo(void)
{
    static int a, b, c, d, e, f, g, h, i;

    a = 0;
    b = 0;
    c = 0;
    d = 0;
    e = 0;
    f = 0;
    g = 0;
    h = 0;
    i = 0;
}

void bar(void)
{
    static int a, b, c, d, e, f, g, h, i;

    a = b = c = d = e = f = g = h = i = 0;
}

Both foo and bar do the same thing, but the resulting code is quite
different.

To keep things non-platform specific, pseudo-code for foo:

set register "A" to 0
store in memory location "a"
set register "B" to 0
store in memory location "b"
set register "C" to 0
store in memory location "c"
set register "A" to 0
store in memory location "d"
set register "B" to 0
store in memory location "e"
set register "C" to 0
store in memory location "f"
set register "A" to 0
store in memory location "g"
set register "B" to 0
store in memory location "h"
set register "C" to 0
store in memory location "i"

Pseudo-code for bar:
set register "A" to 0
store in memory location "a"
store in memory location "b"
store in memory location "c"
store in memory location "d"
store in memory location "e"
store in memory location "f"
store in memory location "g"
store in memory location "h"
store in memory location "i"

This is with size optimization enabled. Note that in foo, the same
three registers are set to zero repeatedly.

So, at least with my current compiler, ganging up equals (variables
of the same type, of course) produces shorter and faster code.

A practical example. I'd written some code to delete leading
whitespace on a line, the end result being moving X characters of a
wide character string downward.

In this code line is the start of the line buffer, pe is the end of
the line buffer+1, and p is the start of the area to be moved down.
These are all of type wchar_t *.

memmove(line, p ((pe - line) * sizeof(wchar_t)) - ((p - line) *
sizeof(wchar_t));
This was my second to last coding. The resulting pseudo-code:

temp1 = pe - line
temp1 = temp1 / 2 //The pointers represent characters, not bytes.
temp2 = p - line
temp2 = temp2 / 2
temp3 = temp1 - temp2
temp3 = temp3 * 2
(Result in temp3)

It wasn't exactly following my code, but it was doing unnecessary
multiplying and dividing, even if by a power of two. So I changed it
to this: (uint is a typedef'd unsigned int.)
memmove(line, p, (((uint)pe - (uint)line) - ((uint)p - (uint)line)));
This produced:
temp1 = pe - line
temp2 = p - line
temp1 = temp1 - temp2
(Result in temp1)

More typing, but shorter, faster code and fewer temps.

Just my two cents.
"'Later' is the beginning of what's not to be."
D.S. Fiscus
 

Eric Sosman

[...]
A practical example. I'd written some code to delete leading
whitespace on a line, the end result being moving X characters of a
wide character string downward.

In this code line is the start of the line buffer, pe is the end of
the line buffer+1, and p is the start of the area to be moved down.
These are all of type wchar_t *.

memmove(line, p ((pe - line) * sizeof(wchar_t)) - ((p - line) *
sizeof(wchar_t));
This was my second to last coding. The resulting pseudo-code:

temp1 = pe - line
temp1 = temp1 / 2 //The pointers represent characters, not bytes.
temp2 = p - line
temp2 = temp2 / 2
temp3 = temp1 - temp2
temp3 = temp3 * 2
(Result in temp3)

It wasn't exactly following my code, but it was doing unnecessary
multiplying and dividing, even if by a power of two. So I changed it
to this: (uint is a typedef'd unsigned int.)
memmove(line, p, (((uint)pe - (uint)line) - ((uint)p - (uint)line)));
This produced:
temp1 = pe - line
temp2 = p - line
temp1 = temp1 - temp2
(Result in temp1)

More typing, but shorter, faster code and fewer temps.

Also fewer guarantees of correctness. Quoth 6.3.2.3p6:
"Any pointer type may be converted to an integer type. Except
as previously specified [null pointer constant], the result is
implementation-defined." Converting the pointers to integers
and then subtracting is not guaranteed to give the same result
as subtracting the pointers and multiplying by the sizeof.
(Bentley and McIlroy's classic "Engineering a Sort Function"
mentions one machine for which your rewrite would definitely
*not* produce the right result.)

What sort of generated code do you get if you simplify the
size expression to `(pe - p) * sizeof(wchar_t)' instead of
writing out all those unnecessary operators?

Also, if you're so concerned about speed: Are you moving
too much data? Is pe ("the end of the line buffer+1") really
the proper endpoint, or ought you to be using a pointer just
past the terminator and perhaps well short of buffer's end?
Yes, it may take some extra work to find that terminator -- but
that's the sort of thing you might well have already ...
 

Johannes Bauer

On 01.11.2013 19:53, DSF wrote:
So, at least with my current compiler, ganging up equals (variables
of the same type, of course) produces shorter and faster code.

If your current compiler misses such an obvious optimization, it is
quite frankly a piece of shit.

Just for reference, gcc 4.7 produces the expected:

080484a0 <foo>:
80484a0: c7 05 20 a0 04 08 00 movl $0x0,0x804a020
80484a7: 00 00 00
80484aa: c7 05 24 a0 04 08 00 movl $0x0,0x804a024
80484b1: 00 00 00
80484b4: c7 05 28 a0 04 08 00 movl $0x0,0x804a028
80484bb: 00 00 00
80484be: c7 05 2c a0 04 08 00 movl $0x0,0x804a02c
80484c5: 00 00 00
80484c8: c7 05 30 a0 04 08 00 movl $0x0,0x804a030
80484cf: 00 00 00
80484d2: c7 05 34 a0 04 08 00 movl $0x0,0x804a034
80484d9: 00 00 00
80484dc: c7 05 38 a0 04 08 00 movl $0x0,0x804a038
80484e3: 00 00 00
80484e6: c7 05 3c a0 04 08 00 movl $0x0,0x804a03c
80484ed: 00 00 00
80484f0: c7 05 40 a0 04 08 00 movl $0x0,0x804a040
80484f7: 00 00 00
80484fa: c3 ret
80484fb: 90 nop
80484fc: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi

08048500 <bar>:
8048500: c7 05 44 a0 04 08 00 movl $0x0,0x804a044
8048507: 00 00 00
804850a: c7 05 48 a0 04 08 00 movl $0x0,0x804a048
8048511: 00 00 00
8048514: c7 05 4c a0 04 08 00 movl $0x0,0x804a04c
804851b: 00 00 00
804851e: c7 05 50 a0 04 08 00 movl $0x0,0x804a050
8048525: 00 00 00
8048528: c7 05 54 a0 04 08 00 movl $0x0,0x804a054
804852f: 00 00 00
8048532: c7 05 58 a0 04 08 00 movl $0x0,0x804a058
8048539: 00 00 00
804853c: c7 05 5c a0 04 08 00 movl $0x0,0x804a05c
8048543: 00 00 00
8048546: c7 05 60 a0 04 08 00 movl $0x0,0x804a060
804854d: 00 00 00
8048550: c7 05 64 a0 04 08 00 movl $0x0,0x804a064
8048557: 00 00 00
804855a: c3 ret
804855b: 66 90 xchg %ax,%ax
804855d: 66 90 xchg %ax,%ax
804855f: 90 nop

Regards,
Joe
 

DSF

Oops! A comma got lost in translation, after the first p. I'm
surprised no one caught it since memmove takes three arguments. It
should be:

memmove(line, p, ((pe - line) * sizeof(wchar_t)) - ((p - line) *
sizeof(wchar_t));

... Why not (pe - p) * sizeof(wchar_t)?

I am not sure that "by a power of two" matters at all in this case.

Because sizeof(wchar_t) is two here, the multiplication compiles to
"add register to itself" instead of an actual multiply instruction.
Higher multipliers that are powers of two can simply shift left.
Division can shift right.
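
The equivalences being described can be written out as follows. This is a sketch with hypothetical function names, restricted to unsigned operands; whether a given compiler actually emits an add or a shift for the multiply and divide forms is up to that compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Three ways to double an unsigned value; all give the same result. */
size_t times_two_mul(size_t n)   { return n * 2; }
size_t times_two_add(size_t n)   { return n + n; }
size_t times_two_shift(size_t n) { return n << 1; }

/* Two ways to halve an unsigned value; both truncate. */
size_t half_div(size_t n)   { return n / 2; }
size_t half_shift(size_t n) { return n >> 1; }

/* Sanity check that the forms agree for one input. */
void check_forms_agree(size_t n)
{
    assert(times_two_mul(n) == times_two_add(n));
    assert(times_two_mul(n) == times_two_shift(n));
    assert(half_div(n) == half_shift(n));
}
```

For signed operands the picture is less tidy: division by two rounds toward zero, and right-shifting a negative value is implementation-defined, so the compiler may need an extra correction step there.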
This isn't safe, don't do that. There is no guarantee that a pointer
will fit inside a uint without losing information which could result
in tragic failures.

In this case, a pointer is the same size as an unsigned int. If I
had to rewrite the function for a different platform, this would be
the least of my problems.
Also code that no longer makes sense.

I admit it's a truckload of parentheses, but I wouldn't go that far.
Me, I'd suggest:
1. Consider using (pe - p) * sizeof(wchar_t).
It's fine, but it bugs me that it becomes: subtract two numbers,
divide the result by two, multiply the result of the division by two.
2. Consider using wmemmove() and just using (pe - p).
No such thing 'round these here parts as wmemmove.
3. Or just use (((char *) pe) - ((char *) p)) as the character count.

That one hit the spot! Does it all in one instruction. (#1 takes
five, for all the dividing and multiplying.) And within memmove, can
be trimmed somewhat:

memmove(line, p, (char *)pe - (char *)p);
"'Later' is the beginning of what's not to be."
D.S. Fiscus
 

DSF

[...]
A practical example. I'd written some code to delete leading
whitespace on a line, the end result being moving X characters of a
wide character string downward.

In this code line is the start of the line buffer, pe is the end of
the line buffer+1, and p is the start of the area to be moved down.
These are all of type wchar_t *.

memmove(line, p ((pe - line) * sizeof(wchar_t)) - ((p - line) *
sizeof(wchar_t));
This was my second to last coding. The resulting pseudo-code:

temp1 = pe - line
temp1 = temp1 / 2 //The pointers represent characters, not bytes.
temp2 = p - line
temp2 = temp2 / 2
temp3 = temp1 - temp2
temp3 = temp3 * 2
(Result in temp3)

It wasn't exactly following my code, but it was doing unnecessary
multiplying and dividing, even if by a power of two. So I changed it
to this: (uint is a typedef'd unsigned int.)
memmove(line, p, (((uint)pe - (uint)line) - ((uint)p - (uint)line)));
This produced:
temp1 = pe - line
temp2 = p - line
temp1 = temp1 - temp2
(Result in temp1)

More typing, but shorter, faster code and fewer temps.

Also fewer guarantees of correctness. Quoth 6.3.2.3p6:
"Any pointer type may be converted to an integer type. Except
as previously specified [null pointer constant], the result is
implementation-defined." Converting the pointers to integers
and then subtracting is not guaranteed to give the same result
as subtracting the pointers and multiplying by the sizeof.
(Bentley and McIlroy's classic "Engineering a Sort Function"
mentions one machine for which your rewrite would definitely
*not* produce the right result.)

Portability is a non-issue here. Anyway, I was just trying to
illustrate how closely the compiler code and final code match.
What sort of generated code do you get if you simplify the
size expression to `(pe - p) * sizeof(wchar_t)' instead of
writing out all those unnecessary operators?

It's fine, but it bugs me that it becomes: subtract two numbers,
divide the result by two, multiply the result of the division by two.

Seebs came up with memmove(line, p, (char *)pe - (char *)p);
This pointer subtraction compiles to one instruction.
Also, if you're so concerned about speed: Are you moving
too much data? Is pe ("the end of the line buffer+1") really
the proper endpoint, or ought you to be using a pointer just
past the terminator and perhaps well short of buffer's end?
Yes, it may take some extra work to find that terminator -- but
that's the sort of thing you might well have already ...

That was a descriptive error on my part. pe points one object
(in this case, a wide character) past the end-of-string
terminator, and not to the end of the line buffer. pe's very existence
is to mark that point. Its creation earlier in the function allowed
me to replace a costly strlen() with pointer arithmetic.
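
A minimal sketch of the routine as described, assuming pe has already been computed and points one wide character past the terminator. The name strip_leading_ws, the use of iswspace() to decide what counts as whitespace, and the return value are illustrative choices, not the poster's actual code. memmove() is used rather than wmemmove() since the latter is reported missing on the poster's platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: remove leading whitespace from the string in
   line, where pe points one element past its L'\0' terminator.
   Returns the new length, not counting the terminator. */
size_t strip_leading_ws(wchar_t *line, wchar_t *pe)
{
    wchar_t *p = line;

    assert(line != NULL && pe > line && pe[-1] == L'\0');

    /* The terminator is not whitespace, so this stops inside the string. */
    while (iswspace((wint_t)*p))
        p++;

    /* Move the remainder, terminator included, down to the start. */
    if (p != line)
        memmove(line, p, (size_t)(pe - p) * sizeof *p);

    return (size_t)(pe - p) - 1;
}
```

Because pe is known, neither the move nor the returned length needs a separate scan of the string.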

"'Later' is the beginning of what's not to be."
D.S. Fiscus
 

DSF

On 01.11.2013 19:53, DSF wrote:


If your current compiler misses such an obvious optimization, it is
quite frankly a piece of shit.

I would not disagree as far as code generation goes.
Just for reference, gcc 4.7 produces the expected:

080484a0 <foo>:
80484a0: c7 05 20 a0 04 08 00 movl $0x0,0x804a020
80484a7: 00 00 00
80484aa: c7 05 24 a0 04 08 00 movl $0x0,0x804a024
80484b1: 00 00 00
80484b4: c7 05 28 a0 04 08 00 movl $0x0,0x804a028
80484bb: 00 00 00
80484be: c7 05 2c a0 04 08 00 movl $0x0,0x804a02c
80484c5: 00 00 00
80484c8: c7 05 30 a0 04 08 00 movl $0x0,0x804a030
80484cf: 00 00 00
80484d2: c7 05 34 a0 04 08 00 movl $0x0,0x804a034
80484d9: 00 00 00
80484dc: c7 05 38 a0 04 08 00 movl $0x0,0x804a038
80484e3: 00 00 00
80484e6: c7 05 3c a0 04 08 00 movl $0x0,0x804a03c
80484ed: 00 00 00
80484f0: c7 05 40 a0 04 08 00 movl $0x0,0x804a040
80484f7: 00 00 00
80484fa: c3 ret
80484fb: 90 nop
80484fc: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi

08048500 <bar>:
8048500: c7 05 44 a0 04 08 00 movl $0x0,0x804a044
8048507: 00 00 00
804850a: c7 05 48 a0 04 08 00 movl $0x0,0x804a048
8048511: 00 00 00
8048514: c7 05 4c a0 04 08 00 movl $0x0,0x804a04c
804851b: 00 00 00
804851e: c7 05 50 a0 04 08 00 movl $0x0,0x804a050
8048525: 00 00 00
8048528: c7 05 54 a0 04 08 00 movl $0x0,0x804a054
804852f: 00 00 00
8048532: c7 05 58 a0 04 08 00 movl $0x0,0x804a058
8048539: 00 00 00
804853c: c7 05 5c a0 04 08 00 movl $0x0,0x804a05c
8048543: 00 00 00
8048546: c7 05 60 a0 04 08 00 movl $0x0,0x804a060
804854d: 00 00 00
8048550: c7 05 64 a0 04 08 00 movl $0x0,0x804a064
8048557: 00 00 00
804855a: c3 ret
804855b: 66 90 xchg %ax,%ax
804855d: 66 90 xchg %ax,%ax
804855f: 90 nop

Regards,
Joe

If we're going to be specific, here's what it compiles to for me:

Foo:
00404429 55 PUSH EBP
0040442A 8BEC MOV EBP,ESP
0040442C 33C0 XOR EAX,EAX
0040442E A3 D0264200 MOV DWORD PTR DS:[4226D0],EAX
00404433 33D2 XOR EDX,EDX
00404435 8915 D4264200 MOV DWORD PTR DS:[4226D4],EDX
0040443B 33C9 XOR ECX,ECX
0040443D 890D D8264200 MOV DWORD PTR DS:[4226D8],ECX
00404443 33C0 XOR EAX,EAX
00404445 A3 DC264200 MOV DWORD PTR DS:[4226DC],EAX
0040444A 33D2 XOR EDX,EDX
0040444C 8915 E0264200 MOV DWORD PTR DS:[4226E0],EDX
00404452 33C9 XOR ECX,ECX
00404454 890D E4264200 MOV DWORD PTR DS:[4226E4],ECX
0040445A 33C0 XOR EAX,EAX
0040445C A3 E8264200 MOV DWORD PTR DS:[4226E8],EAX
00404461 33D2 XOR EDX,EDX
00404463 8915 EC264200 MOV DWORD PTR DS:[4226EC],EDX
00404469 33C9 XOR ECX,ECX
0040446B 890D F0264200 MOV DWORD PTR DS:[4226F0],ECX
00404471 5D POP EBP
00404472 C3 RETN

Bar:
00404473 55 PUSH EBP
00404474 8BEC MOV EBP,ESP
00404476 33C0 XOR EAX,EAX
00404478 A3 14274200 MOV DWORD PTR DS:[422714],EAX
0040447D A3 10274200 MOV DWORD PTR DS:[422710],EAX
00404482 A3 0C274200 MOV DWORD PTR DS:[42270C],EAX
00404487 A3 08274200 MOV DWORD PTR DS:[422708],EAX
0040448C A3 04274200 MOV DWORD PTR DS:[422704],EAX
00404491 A3 00274200 MOV DWORD PTR DS:[422700],EAX
00404496 A3 FC264200 MOV DWORD PTR DS:[4226FC],EAX
0040449B A3 F8264200 MOV DWORD PTR DS:[4226F8],EAX
004044A0 A3 F4264200 MOV DWORD PTR DS:[4226F4],EAX
004044A5 5D POP EBP
004044A6 C3 RETN
 

Johannes Bauer

On 05.11.2013 05:01, DSF wrote:
If we're going to be specific, here's what it compiles to for me:
Foo:
00404429 55 PUSH EBP
0040442A 8BEC MOV EBP,ESP
0040442C 33C0 XOR EAX,EAX
0040442E A3 D0264200 MOV DWORD PTR DS:[4226D0],EAX
00404433 33D2 XOR EDX,EDX
00404435 8915 D4264200 MOV DWORD PTR DS:[4226D4],EDX
0040443B 33C9 XOR ECX,ECX
0040443D 890D D8264200 MOV DWORD PTR DS:[4226D8],ECX
00404443 33C0 XOR EAX,EAX
00404445 A3 DC264200 MOV DWORD PTR DS:[4226DC],EAX
0040444A 33D2 XOR EDX,EDX
0040444C 8915 E0264200 MOV DWORD PTR DS:[4226E0],EDX
00404452 33C9 XOR ECX,ECX
00404454 890D E4264200 MOV DWORD PTR DS:[4226E4],ECX
0040445A 33C0 XOR EAX,EAX
0040445C A3 E8264200 MOV DWORD PTR DS:[4226E8],EAX
00404461 33D2 XOR EDX,EDX
00404463 8915 EC264200 MOV DWORD PTR DS:[4226EC],EDX
00404469 33C9 XOR ECX,ECX
0040446B 890D F0264200 MOV DWORD PTR DS:[4226F0],ECX
00404471 5D POP EBP
00404472 C3 RETN

Bar:
00404473 55 PUSH EBP
00404474 8BEC MOV EBP,ESP
00404476 33C0 XOR EAX,EAX
00404478 A3 14274200 MOV DWORD PTR DS:[422714],EAX
0040447D A3 10274200 MOV DWORD PTR DS:[422710],EAX
00404482 A3 0C274200 MOV DWORD PTR DS:[42270C],EAX
00404487 A3 08274200 MOV DWORD PTR DS:[422708],EAX
0040448C A3 04274200 MOV DWORD PTR DS:[422704],EAX
00404491 A3 00274200 MOV DWORD PTR DS:[422700],EAX
00404496 A3 FC264200 MOV DWORD PTR DS:[4226FC],EAX
0040449B A3 F8264200 MOV DWORD PTR DS:[4226F8],EAX
004044A0 A3 F4264200 MOV DWORD PTR DS:[4226F4],EAX
004044A5 5D POP EBP
004044A6 C3 RETN

Wow, this is *really* bad code.

Entry points are completely unaligned and the compiler even forgets that
registers are cleared that it cleared *itself* two instructions
beforehand. Also it seems not to do well in lifetime analysis of the
registers (since it switches around eax, edx, ecx thinking those values
are going to be needed later on).

Still you make no mention what compiler you're using and if you have
optimizations turned on (I had in my example, obviously). This is the
interesting part. The generation of stackframes leads me to believe that
you haven't turned them on (because any halfway decent compiler does
not generate stackframes at a certain optimization level, having one
register more to fiddle around with).

Regards,
Joe
 

Stephen Sprunk

I would not disagree as far as code generation goes.


If we're going to be specific, here's what it compiles to for me:
...

If you're going to make vague complaints about what an unspecified
compiler produces with unspecified settings, you should expect to see
folks chime in with more specific examples to discuss.

For instance, GCC 4.2.4 for Linux/x86 produces this with -O0:

foo:
pushl %ebp
movl %esp, %ebp
movl $0, a.2059
movl $0, b.2060
movl $0, c.2061
movl $0, d.2062
movl $0, e.2063
movl $0, f.2064
movl $0, g.2065
movl $0, h.2066
movl $0, i.2067
popl %ebp
ret
....
bar:
pushl %ebp
movl %esp, %ebp
movl $0, i.2079
movl i.2079, %eax
movl %eax, h.2078
movl h.2078, %eax
movl %eax, g.2077
movl g.2077, %eax
movl %eax, f.2076
movl f.2076, %eax
movl %eax, e.2075
movl e.2075, %eax
movl %eax, d.2074
movl d.2074, %eax
movl %eax, c.2073
movl c.2073, %eax
movl %eax, b.2072
movl b.2072, %eax
movl %eax, a.2071
popl %ebp
ret

The latter is definitely suboptimal. However, GCC is well-known to
produce ridiculously inefficient (but completely literal) code when
optimization is disabled. When I switch to -O3, which is what I
normally use, I get this:

foo:
pushl %ebp
movl %esp, %ebp
popl %ebp
movl $0, a.2114
movl $0, b.2115
movl $0, c.2116
movl $0, d.2117
movl $0, e.2118
movl $0, f.2119
movl $0, g.2120
movl $0, h.2121
movl $0, i.2122
ret
....
bar:
pushl %ebp
movl %esp, %ebp
popl %ebp
movl $0, i.2134
movl $0, h.2133
movl $0, g.2132
movl $0, f.2131
movl $0, e.2130
movl $0, d.2129
movl $0, c.2128
movl $0, b.2127
movl $0, a.2126
ret

I'm a little curious why the latter has the order reversed, but the
final result (and efficiency) is identical, which is as expected.

S
 

Philip Lantz

Stephen said:
foo:
pushl %ebp
movl %esp, %ebp
popl %ebp
movl $0, a.2114
movl $0, b.2115
movl $0, c.2116
movl $0, d.2117
movl $0, e.2118
movl $0, f.2119
movl $0, g.2120
movl $0, h.2121
movl $0, i.2122
ret
...
bar:
pushl %ebp
movl %esp, %ebp
popl %ebp
movl $0, i.2134
movl $0, h.2133
movl $0, g.2132
movl $0, f.2131
movl $0, e.2130
movl $0, d.2129
movl $0, c.2128
movl $0, b.2127
movl $0, a.2126
ret

I'm a little curious why the latter has the order reversed, but the
final result (and efficiency) is identical, which is as expected.

Presumably because that's the order the assignments appear in the code.
The compiler doesn't have to do them in the same order as in the code,
but in this case there's clearly no reason not to.

The original source was:

void foo(void)
{
    static int a, b, c, d, e, f, g, h, i;

    a = 0;
    b = 0;
    c = 0;
    d = 0;
    e = 0;
    f = 0;
    g = 0;
    h = 0;
    i = 0;
}

void bar(void)
{
    static int a, b, c, d, e, f, g, h, i;

    a = b = c = d = e = f = g = h = i = 0;
}
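
The ordering follows from how a chained assignment groups: a = b = c = 0 parses as a = (b = (c = 0)), so the value flows from right to left. A small sketch with hypothetical function names, which also shows why the "same type" caveat matters: each link passes on the value after conversion to its own left operand's type.

```c
#include <assert.h>
#include <limits.h>

/* Same type throughout: every variable receives the same value. */
int chained_same_type(int v)
{
    int a, b, c;

    a = b = c = v;          /* parsed as a = (b = (c = v)) */
    return a + b + c;
}

/* Mixed types: i receives the value of c after the assignment,
   i.e. v already reduced to unsigned char range, not v itself. */
int chained_through_uchar(int v)
{
    unsigned char c;
    int i;

    i = c = v;              /* parsed as i = (c = v) */
    return i;
}
```

With 8-bit chars, chained_through_uchar(300) yields 44 rather than 300, because the int is assigned from the already-narrowed unsigned char.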
 

James Kuyper

In this case, a pointer is the same size as an unsigned int. If I
had to rewrite the function for a different platform, this would be
the least of my problems.

There's no guarantee that (uint)pe - (uint)line calculates any
meaningful number. In particular, there's no guarantee that it
calculates the same number calculated by the correct code:
It's fine, but it bugs me that it becomes: subtract two numbers,
divide the result by two, multiply the result of the division by two.

If that kind of thing worries you, and you're not willing to count on
the compiler to optimize it away, you shouldn't be writing in C - it
doesn't provide the level of control you need to stay happy. Try
assembler instead. Keep in mind that there's no guarantee that your code
doesn't also compile to "subtract two numbers, divide by 2, multiply by
2". Only trust in the competence of your compiler's designers allows you
to assume it hasn't been pessimized that way. The level of incompetence
needed to design a compiler that translates (pe-p)*sizeof(*p) into
anything other than the equivalent of (char*)pe - (char*)p is pretty
substantial.
 

Tim Rentsch

Seebs said:
In this code line is the start of the line buffer, pe is the end of
the line buffer+1, and p is the start of the area to be moved down.
These are all of type wchar_t *.
memmove(line, p ((pe - line) * sizeof(wchar_t)) - ((p - line) *
sizeof(wchar_t));

... Why not (pe - p) * sizeof(wchar_t)?
It wasn't exactly following my code, but it was doing unnecessary
multiplying and dividing, even if by a power of two.

I am not sure that "by a power of two" matters at all in this case.
So I changed it
to this: (uint is a typedef'd unsigned int.)
memmove(line, p, (((uint)pe - (uint)line) - ((uint)p - (uint)line)));

This isn't safe, don't do that. There is no guarantee that a pointer
will fit inside a uint without losing information which could result
in tragic failures.
More typing, but shorter, faster code and fewer temps.

Also code that no longer makes sense.

Me, I'd suggest:
1. Consider using (pe - p) * sizeof(wchar_t). [snip 2&3]

Normally I would rather see this as (pe - p) * sizeof *p .
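
One reason to prefer sizeof *p is that it follows the pointer's type automatically if that type ever changes. A sketch, in which the macro BYTES_BETWEEN and the two wrapper functions are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes between two pointers into the
   same array. sizeof *(p) is taken from the pointee type, so the
   expression needs no edit if the element type changes. */
#define BYTES_BETWEEN(pe, p) ((size_t)((pe) - (p)) * sizeof *(p))

size_t span_wide(const wchar_t *p, const wchar_t *pe)
{
    assert(p <= pe);
    return BYTES_BETWEEN(pe, p);
}

size_t span_narrow(const char *p, const char *pe)
{
    assert(p <= pe);
    return BYTES_BETWEEN(pe, p);
}
```

The same macro text serves both element types, whereas a hard-coded sizeof(wchar_t) would silently go stale.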
 

Jorgen Grahn

Still you make no mention what compiler you're using and if you have
optimizations turned on (I had in my example, obviously).

He did write "with size optimization enabled". But yeah, I don't see
why the identity of the compiler has to be kept secret ... and as long
as it is, this thread is rather irrelevant from my point of view.

/Jorgen
 

DSF

On 05.11.2013 05:01, DSF wrote:
If we're going to be specific, here's what it compiles to for me:
Foo:
00404429 55 PUSH EBP
0040442A 8BEC MOV EBP,ESP
0040442C 33C0 XOR EAX,EAX
0040442E A3 D0264200 MOV DWORD PTR DS:[4226D0],EAX
00404433 33D2 XOR EDX,EDX
00404435 8915 D4264200 MOV DWORD PTR DS:[4226D4],EDX
0040443B 33C9 XOR ECX,ECX
0040443D 890D D8264200 MOV DWORD PTR DS:[4226D8],ECX
00404443 33C0 XOR EAX,EAX
00404445 A3 DC264200 MOV DWORD PTR DS:[4226DC],EAX
0040444A 33D2 XOR EDX,EDX
0040444C 8915 E0264200 MOV DWORD PTR DS:[4226E0],EDX
00404452 33C9 XOR ECX,ECX
00404454 890D E4264200 MOV DWORD PTR DS:[4226E4],ECX
0040445A 33C0 XOR EAX,EAX
0040445C A3 E8264200 MOV DWORD PTR DS:[4226E8],EAX
00404461 33D2 XOR EDX,EDX
00404463 8915 EC264200 MOV DWORD PTR DS:[4226EC],EDX
00404469 33C9 XOR ECX,ECX
0040446B 890D F0264200 MOV DWORD PTR DS:[4226F0],ECX
00404471 5D POP EBP
00404472 C3 RETN

Bar:
00404473 55 PUSH EBP
00404474 8BEC MOV EBP,ESP
00404476 33C0 XOR EAX,EAX
00404478 A3 14274200 MOV DWORD PTR DS:[422714],EAX
0040447D A3 10274200 MOV DWORD PTR DS:[422710],EAX
00404482 A3 0C274200 MOV DWORD PTR DS:[42270C],EAX
00404487 A3 08274200 MOV DWORD PTR DS:[422708],EAX
0040448C A3 04274200 MOV DWORD PTR DS:[422704],EAX
00404491 A3 00274200 MOV DWORD PTR DS:[422700],EAX
00404496 A3 FC264200 MOV DWORD PTR DS:[4226FC],EAX
0040449B A3 F8264200 MOV DWORD PTR DS:[4226F8],EAX
004044A0 A3 F4264200 MOV DWORD PTR DS:[4226F4],EAX
004044A5 5D POP EBP
004044A6 C3 RETN

Wow, this is *really* bad code.

Entry points are completely unaligned and the compiler even forgets that
registers are cleared that it cleared *itself* two instructions
beforehand. Also it seems not to do well in lifetime analysis of the
registers (since it switches around eax, edx, ecx thinking those values
are going to be needed later on).

Still you make no mention what compiler you're using and if you have
optimizations turned on (I had in my example, obviously). This is the
interesting part. The generation of stackframes leads me to believe that
you haven't turned them on (because any halfway decent compiler does
not generate stackframes at a certain optimization level, having one
register more to fiddle around with).

Regards,
Joe

Sorry that I forgot to mention the compiler, I thought I had. It's
Borland C++ 5.01A. Yes, I know it was around when dinosaurs walked
the planet. I should use something newer, but it's like changing word
processors, but worse! I'd still want to use an IDE. So most editing
commands will have changed and half my code will probably need to be
rewritten to even compile, let alone run properly. I'm just
*dreading* the time it will take to get to the comfort/experience
point I'm at now.

I did say that I had optimize for size turned on.

I must say, I find a lot of inefficient code when I'm assembly level
debugging. Some can be blamed on the age, since processor
manufacturers have changed which instructions to optimize over the
years, and efficient code in 1995 isn't necessarily so in 2013. With
that said, I've seen code that sets EAX to 3 and jumps to an
instruction that sets EAX to 3, instead of the following instruction.

I found "Standard Stack Frame" under debugging and turned it off. It
did eliminate the EBP manipulation. Optimization choices are pretty
slim.

On the bright side, the assembly code correlates very closely with
the C code, and sometimes it is useful to have the compiler do *exactly*
what you want instead of what its designers think is best. That said, if I
could put on a futuristic "learning cap" and in fifteen minutes
completely understand the modern compiler of my choice, I wouldn't
hesitate for a second! :o)

"'Later' is the beginning of what's not to be."
D.S. Fiscus
 

DSF

There's no guarantee that (uint)pe - (uint)line calculates any
meaningful number. In particular, there's no guarantee that it
calculates the same number calculated by the correct code:

I assume the above statement is made on the basis of portability,
since on my compiler, it produces the same results (in bytes) as
subtracting two unsigned integers.
If that kind of thing worries you, and you're not willing to count on
the compiler to optimize it away, you shouldn't be writing in C - it
doesn't provide the level of control you need to stay happy. Try
assembler instead. Keep in mind that there's no guarantee that your code
doesn't also compile to "subtract two numbers, divide by 2, multiply by
2". Only trust in the competence of your compiler's designers allows you
to assume it hasn't been pessimized that way. The level of incompetence
needed to design a compiler that translates (pe-p)*sizeof(*p) into
anything other than the equivalent of (char*)pe - (char*)p is pretty
substantial.

The compiler (I thought for sure I mentioned it) is Borland C++
5.01A. For the reason why, see my other post in this thread. So I
*know* it won't optimize it away. :o)

I have stepped through enough assembly code whilst debugging to
state that I cannot trust the compiler to produce an efficient version
of my self-optimized C code. I have duplicated over half of the
string/memory manipulation functions of the RTL. Partly to support
16-bit characters, but also for speed and efficiency. To be honest, a
lot of the code is inefficient and slow because of the changes in
processor design over the years, but not all of it. I'm talking about
assembly here, not C, so their code was written by people, not their
compiler.

To brag a little (as much as one can about besting almost
20-year-old code) one of my string routines (I can't remember which
one, it's been years) had an average timing of 300 times faster than
the RTL code. That's times, not percent. And I have no delusions I
could do the same against a modern compiler's RTL.
 

Seebs

I assume the above statement is made on the basis of portability,
since on my compiler, it produces the same results (in bytes) as
subtracting two unsigned integers.

Yes. But there are lots of platforms where pointers cast to uint will
have lost some of their bits, and where you might occasionally see
strange behavior.

Assume ints are 32-bit, and pointers 64-bit, and consider what you
get from (uint) 0x100000010 - (uint) 0x0FFFFFFF0.
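
The arithmetic can be worked through with fixed-width integers. In this sketch uint64_t stands in for a 64-bit pointer value and uint32_t for a 32-bit uint; the function name is hypothetical, and a flat address space is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Mimics (uint)hi - (uint)lo when uint is 32 bits wide and the
   "pointers" are 64 bits wide: each cast keeps only the low half. */
uint32_t truncated_diff(uint64_t hi, uint64_t lo)
{
    uint32_t h = (uint32_t)hi;   /* upper 32 bits discarded */
    uint32_t l = (uint32_t)lo;   /* upper 32 bits discarded */

    return (uint32_t)(h - l);    /* wraps modulo 2^32 */
}
```

For the pair above, the casts give 0x00000010 and 0xFFFFFFF0, so the truncated "end" now compares below the truncated "start". The wrapped subtraction still comes out to 0x20 here only because the true distance fits in 32 bits; for 0x200000010 and 0x10 it yields 0 where the true distance is 0x200000000.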
I have stepped through enough assembly code whilst debugging to
state that I cannot trust the compiler to produce an efficient version
of my self-optimized C code.

In that case, I think it's a safe bet that you would spend an order
of magnitude less time converting to a modern compiler than you are
spending writing code that's less maintainable and more likely to
have subtle bugs than what a modern compiler would do.
To brag a little (as much as one can about besting almost
20-year-old code) one of my string routines (I can't remember which
one, it's been years) had an average timing of 300 times faster than
the RTL code. That's times, not percent. And I have no delusions I
could do the same against a modern compiler's RTL.

Which sort of renders the entire exercise a little silly, no?

-s
 

James Kuyper

I assume the above statement is made on the basis of portability,
since on my compiler, it produces the same results (in bytes) as
subtracting two unsigned integers.

I was talking about guarantees provided by the C standard. There's no
limit on the number and variety of guarantees provided by other sources.
Which is one of the reasons I don't see much point in discussing those
other sources of guarantees, except in a forum specific to the
particular source.
The compiler (I thought for sure I mentioned it) is Borland C++

I can find no mention of it in any of the messages before yesterday.
5.01A. For the reason why, see my other post in this thread. So I
*know* it won't optimize it away. :o) ....
To brag a little (as much as one can about besting almost
20-year-old code) ...

There may be legitimate reasons for worrying about and complaining about
the inadequacies of 20-year old compilers, though those reasons don't
apply to me. However, those inadequacies should be attributed to the age
of the compiler, not to the language it compiles. Your Subject: header
should have been "Borland C++ can be very literal".
 

DSF

On Tue, 05 Nov 2013 02:10:21 -0600, Stephen Sprunk

Sorry this is a tiny bit late. Been busy.

If you're going to make vague complaints about what an unspecified
compiler produces with unspecified settings, you should expect to see
folks chime in with more specific examples to discuss.

For instance, GCC 4.2.4 for Linux/x86 produces this with -O0:

foo:
pushl %ebp
movl %esp, %ebp
movl $0, a.2059
movl $0, b.2060
movl $0, c.2061
movl $0, d.2062
movl $0, e.2063
movl $0, f.2064
movl $0, g.2065
movl $0, h.2066
movl $0, i.2067
popl %ebp
ret
...
bar:
pushl %ebp
movl %esp, %ebp
movl $0, i.2079
movl i.2079, %eax
movl %eax, h.2078
movl h.2078, %eax
movl %eax, g.2077
movl g.2077, %eax
movl %eax, f.2076
movl f.2076, %eax
movl %eax, e.2075
movl e.2075, %eax
movl %eax, d.2074
movl d.2074, %eax
movl %eax, c.2073
movl c.2073, %eax
movl %eax, b.2072
movl b.2072, %eax
movl %eax, a.2071
popl %ebp
ret

The latter is definitely suboptimal.

The former suffers from an inefficiency as well.
movl $0, a.2059
I'm not proficient in AT&T syntax, but I assume it is the equivalent of:
mov dword ptr a, 0

A move of 0 to a static address translates to:
C705F0D0410000000000 mov [0x41D0F0], 0x00000000
10 bytes per store.

A stack-relative address is a little better:
C7450400000000 mov [ebp+4], 0
7 bytes per store.

As compared to:
33C0 xor eax, eax
A3F0D04100 mov [0x41d0f0], eax
7 bytes for initial store, 5 for each additional store.
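
Taking the byte counts above at face value, the totals for the nine stores in foo and bar work out as follows. The enum and function names are hypothetical; the sizes are the ones quoted in this post for 32-bit x86 with absolute addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction sizes in bytes, as quoted in the post. */
enum {
    MOV_IMM32_TO_ABS = 10,  /* C7 05 <addr32> <imm32> */
    XOR_REG_REG      = 2,   /* 33 C0                  */
    MOV_EAX_TO_ABS   = 5    /* A3 <addr32>            */
};

/* n stores of an immediate zero straight to memory. */
size_t bytes_immediate_stores(size_t n)
{
    return n * MOV_IMM32_TO_ABS;
}

/* One register clear, then n stores of that register. */
size_t bytes_register_stores(size_t n)
{
    return n == 0 ? 0 : XOR_REG_REG + n * MOV_EAX_TO_ABS;
}
```

For n = 9 this gives 90 bytes against 47, which matches the listings earlier in the thread: the gcc 4.7 stores in foo span 0x80484a0 to 0x80484fa (90 bytes), and the Borland Bar body spans 0x404476 to 0x4044A5 (47 bytes). Size is only one measure, of course; it says nothing by itself about speed.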

"'Later' is the beginning of what's not to be."
D.S. Fiscus
 

DSF

On Tue, 05 Nov 2013 02:10:21 -0600, Stephen Sprunk

Sorry this is a tiny bit late. Been busy.

Busy meaning sick. Sooo...see below.
{snipped}
The former suffers from an inefficiency as well.
movl $0, a.2059
I'm not proficient in AT&T syntax, but I assume it is the equivalent of:
mov dword ptr a, 0

A move of 0 to a static address translates to:
C705F0D0410000000000 mov [0x41D0F0], 0x00000000
10 bytes per store.

A stack-relative address is a little better:
C7450400000000 mov [ebp+4], 0
7 bytes per store.

As compared to:
33C0 xor eax, eax
A3F0D04100 mov [0x41d0f0], eax
7 bytes for initial store, 5 for each additional store.
Sorry for the C-syntax hex numbers!

DSF
"'Later' is the beginning of what's not to be."
D.S. Fiscus
 
