Memset is faster than simple loop?

A

AndersWang

Hi,

dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!

In an int array scenario:

int array[10];

for(int i=0;i<10;i++) //ten loops
array=0;

or

memset(array,0,sizeof(array));

So, what will memset do inside? Here is a snippet from MS c-run-time
codes:

void * __cdecl memset (
void *dst,
int val,
size_t count
)
{
void *start = dst;

#if defined (_M_MRX000) || defined (_M_ALPHA) || defined (_M_PPC) ||
defined (_M_IA64)
{
extern void RtlFillMemory( void *, size_t count, char );

RtlFillMemory( dst, count, (char)val );
}
#else /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */
while (count--) { //Watch
here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
*(char *)dst = (char)val;
dst = (char *)dst + 1;
}
#endif /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */

return(start);
}

memset initializes a block of memory byte by byte. So, the while loop
will be executed 4*10 times!


I got confused. Why people still believe memset is faster than loop
and pertains most of scenarios.


Thank you for your help!
 
C

Chris Dollin

dose anybody here explain to me why memset would be faster than a
simple loop.

`memset` is provided by the implementation. It may use
lots of Cunning Implementation Tricks that -- for whatever
reason -- might not be applied by default to ordinary
C code.

Whether or not it's /actually/ faster will depend on all
sorts of things.
I doubt about it!

Doubt is good. I think. Well, I'm not sure.
In an int array scenario:

int array[10];

for(int i=0;i<10;i++) //ten loops
array=0;

or

memset(array,0,sizeof(array));

So, what will memset do inside?


That depends.
Here is a snippet from MS c-run-time
codes:

(fx:snip #ifdef)
extern void RtlFillMemory( void *, size_t count, char );

RtlFillMemory( dst, count, (char)val );

(fx:snip #else)
while (count--) { //Watch
here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
*(char *)dst = (char)val;
dst = (char *)dst + 1;
}

(fx:snip #endif)
memset initializes a block of memory byte by byte. So, the while loop
will be executed 4*10 times!

Urm, did you not notice the call to RtlFillMemory which happends
on suitable architectures?

Also you're /assuming/ that the compiler doesn't implement
`memset` with inline code, which it's allowed to. Perhaps
the original call
memset(array,0,sizeof(array));

was replaced by 10 move-integer instructions.
I got confused. Why people still believe memset is faster than loop
and pertains most of scenarios.

The essence of the truth here is measureument as opposed to
speculation.
 
C

Chris Dollin

Chris said:
Also you're /assuming/ that the compiler doesn't implement
`memset` with inline code, which it's allowed to. Perhaps
the original call


was replaced by 10 move-integer instructions.

Make that 10 clear-integer instructions.
 
J

Jens Thoms Toerring

dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!
In an int array scenario:
int array[10];
for(int i=0;i<10;i++) //ten loops
array=0;
memset(array,0,sizeof(array));

So, what will memset do inside?


There's no guarantee that memset() is faster, it's probably just some
observation a number of people have made. And if it is faster it is
probably due to the implementation using some carefully tuned method
that exploits some features of the processor the program is running
on which the compiler may not be able to find when it compiles the loop.
But that doesn't has to be the case, it's a question of how good the
implementation of memset() is on the one hand and how god the compiler
is on the other hand (and the compiler could even be clever enough to
replace the loop by a single call of memset() and there goes all your
difference in speed;-).
Here is a snippet from MS c-run-time

void * __cdecl memset (
void *dst,
int val,
size_t count
)
{
void *start = dst;
#if defined (_M_MRX000) || defined (_M_ALPHA) || defined (_M_PPC) ||
defined (_M_IA64)
{
extern void RtlFillMemory( void *, size_t count, char );
RtlFillMemory( dst, count, (char)val );
}
#else /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */
while (count--) { //Watch
here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
*(char *)dst = (char)val;
dst = (char *)dst + 1;
}
#endif /* defined (_M_MRX000) || defined (_M_ALPHA) || defined
(_M_PPC) || defined (_M_IA64) */

memset initializes a block of memory byte by byte. So, the while loop
will be executed 4*10 times!

That's just one of many implementations and you can't deduce any
general statement from looking at a certain one. And, as you will
notice when you have a close look, memset() seems to be implemented
in a different way depending on the architecture, so you can't
even say how memset() is implemented for "MS c-run-time" but only
for "MS c-run-time" on a certain architecture.
I got confused. Why people still believe memset is faster than loop
and pertains most of scenarios.

It isn't a question of believes. You have to carefully measure the
behaviour for the implementation and architecture you are using.

Regards, Jens
 
A

AndersWang

Thank you for yr reply. Actually, I am working on measurement and I
have to agree the fact that you are right, memset is fater than loop
even in the example I wrote above.
 
E

Eric Sosman

dose anybody here explain to me why memset would be faster than a
simple loop. I doubt about it!

It might be faster, it might be slower, or there
might be no difference. The C language Standard says
nothing about the speeds of language constructs or
library functions, and those speeds -- and relative
speeds -- will be different in different implementations
of C.
[...] Here is a snippet from MS c-run-time
codes: [...]

Compilers play many tricks in pursuit of speedier or
smaller code. One of the tricks often played with the
Standard functions (which the compiler can "know about")
is to generate in-line machine instructions instead of
generating an actual function call. What looks like a
call on the fabs() function might actually produce an
FABS instruction in the generated code, and no function
call at all.

memset() is a fairly simple function -- not like
printf(), say -- and many compilers will play this kind
of game with it. If you write what looks like a call
to memset(), the program might actually use a block-clear
instruction or instruction sequence instead of a call.

But even if the implementation plays this kind of
game with memset() or sqrt() or whatever, it must still
provide an actual, callable function that does the same
thing (possibly at a different speed). This enables a
program to use a function pointer to call a library
function, without necessarily being able to predict
what function will be called until run-time:

#include <string.h> /* for memset */

void shortset(void *s, int c, size_t n) {
short *sp = s;
for (n /= sizeof(short); n > 0; --n)
*sp++ = c;
}

void intset(void *s, int c, size_t n) {
int *ip = s;
for (n /= sizeof(int); n > 0; --n)
*ip++ = c;
}

void (*fptr)(void*, int, size_t);
...
switch (something_unpredictable) {
default: fptr = memset; break;
case 1: fptr = shortset; break;
case 2: fptr = intset; break;
}
fptr(buffer, 42, sizeof buffer);

In short, the code you have found (MS disclosed
their source? Surprising, but not astonishing) may
be used only in oddball circumstances like the above,
while "ordinary" memset() calls wind up using another
mechanism altogether. You'll need to dig deeper.

... and you'll need to remember that all such
tricks and timings vary from one implementation to
the next; they aren't the province of the language
as such.
 
R

Racaille

Eric said:
In short, the code you have found (MS disclosed
their source? Surprising, but not astonishing) may

The source code to MSVCRT.DLL (the C library)
is available as part of their SDK, which is free to download
from microsoft itself.
 
R

Randy Howard

Thank you for yr reply. Actually, I am working on measurement and I
have to agree the fact that you are right, memset is fater than loop
even in the example I wrote above.

Well, just keep in mind that what's true for you today, on a particular
system with a particular development environment installed, may not be
true anywhere else, or even on your own system 6 months from now. You
can not extrapolate "rules" for this sort of thing and actually know
anything at all about what will happen in general.
 
A

AndersWang

Racaille said:
The source code to MSVCRT.DLL (the C library)
is available as part of their SDK, which is free to download
from microsoft itself.

You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder
 
E

Eric Sosman

Thank you for yr reply. Actually, I am working on measurement and I
have to agree the fact that you are right, memset is fater than loop
even in the example I wrote above.

An observation: If the time spent in memset() is a
significant or even a noticeable fraction of your program's
running time, it is highly unlikely that the cure is to use
a faster memset() equivalent. (After all, memset() generates
almost no "new information," and does not "advance the state
of the computation" by very much.) Rather, the cure is to
consider what it is about the program that makes the memset()
necessary, and to rearrange things so it isn't. For example,

char buffer[BIGSIZE];
...
memset (buffer, 0, sizeof buffer);
if (arriving)
strcat (buffer, "Hello,");
else
strcat (buffer, "Goodbye, cruel");
strcat (buffer, " world!");

.... doesn't really need the full effect of memset(), but just
its effect on buffer[0]. With that in mind, the code can be

char buffer[BIGSIZE];
...
buffer[0] = 0;
if (arriving)
... etc ...

Still better:

char buffer[BIGSIZE];
...
if (arriving)
strcpy (buffer, "Hello,");
else
strcpy (buffer, "Goodbye, cruel");
strcat (buffer, " world!");

And, of course, there are lots more alternatives. My point is
that a very large fraction of memset() calls can be eliminated
by transformations not much more complex than these, and when
you get rid of an operation altogether you get an infinite
improvement in its speed. It is said that "the fastest I/O is
the one you don't do," and this generalizes to "the fastest X
is the one you don't do."

Just about the only time memset() speed is crucial is when
one needs to "destroy" information. For example, an O/S may want
to ensure that a new memory page given to Program X doesn't still
contain data left in it by Previous Program Y, and so uses memset()
or an equivalent to clobber Y's data. The speed of memset() in
this kind of setting can indeed be important -- but in most user
programs, memset() should be a negligible fraction of the running
time, and even an infinite speedup makes a negligible overall
improvement.
 
A

AndersWang

Eric Sosman,

I appreciate your replies. I have to say I partially agree with you:)
An observation: If the time spent in memset() is a
significant or even a noticeable fraction of your program's
running time, it is highly unlikely that the cure is to use
a faster memset() equivalent.

That's right! I agree:)
(After all, memset() generates
almost no "new information," and does not "advance the state
of the computation" by very much.) Rather, the cure is to
consider what it is about the program that makes the memset()
necessary, and to rearrange things so it isn't.

It really depands on what you are gonna do. Let say, I have an
initialized integer array with ten thousand elements,all elements
inside are zero, and then I will fill out this array by user input,
actually I don't know how many elements will be assigned new values
other than 0.(assuming 0 is not valid from user input).

So, if I want to retrieve these values, I have to loop the array until
seeing the first zero element. In this scenario,"new information" is
also very important. If I can not initialize the array with 0 , no
expected result will happen next. memset() becomes crucial to my
program.
Just about the only time memset() speed is crucial is when
one needs to "destroy" information. For example, an O/S may want
to ensure that a new memory page given to Program X doesn't still
contain data left in it by Previous Program Y, and so uses memset()
or an equivalent to clobber Y's data. The speed of memset() in
this kind of setting can indeed be important -- but in most user
programs, memset() should be a negligible fraction of the running
time, and even an infinite speedup makes a negligible overall
improvement.

If the size of the memory block is very huge, it'll be a different
story,isn't it?
 
E

Eric Sosman

Eric Sosman,
[...]
(After all, memset() generates
almost no "new information," and does not "advance the state
of the computation" by very much.) Rather, the cure is to
consider what it is about the program that makes the memset()
necessary, and to rearrange things so it isn't.

It really depands on what you are gonna do. Let say, I have an
initialized integer array with ten thousand elements,all elements
inside are zero, and then I will fill out this array by user input,
actually I don't know how many elements will be assigned new values
other than 0.(assuming 0 is not valid from user input).

So, if I want to retrieve these values, I have to loop the array until
seeing the first zero element. In this scenario,"new information" is
also very important. If I can not initialize the array with 0 , no
expected result will happen next. memset() becomes crucial to my
program.

Probably not. Compared to the time taken by the input
itself, the time for memset() will likely be negligible. Ten
thousand elements, you say? Let's suppose we can get them all
with one disk read, which will take perhaps ten milliseconds.
In that ten milliseconds, your CPU's clock will tick ten to
thirty million times. With a dual-issue instruction pipeline
that's twenty to sixty million instruction opportunities, forty
to one-twenty with a quad-issue. ("Instruction opportunities"
are not necessarily "instructions executed" because the CPU will
eventually need to wait for memory to catch up to it, but you
get the idea: The speed difference between the CPU and the I/O
is about six or seven decimal orders of magnitude.)

"File system buffer cache," I hear you cry. Fine, but you've
told us this is some kind of sparse array, which means you can't
just read the data "in place." Instead, you need to read it into
some kind of buffer, then examine what you find in the buffer,
decide where it belongs in the big array, and deposit it there.
Do you think that will be faster or slower than the memset()?
And then you've got to scan the whole array afterwards, searching
out the data you deposited there -- faster than memset(), or
slower?

"But it's a *very* sparse array, and I'll only need to take
the decide-and-deposit step a few times, whereas memset() needs
to clear the entire thing." If the array is *that* sparse, it's
probably the wrong data structure: Stick the data on a linked
list, for example, and memset() will be entirely unnecessary.
If the size of the memory block is very huge, it'll be a different
story,isn't it?

Yes, clearing a large memory area to all-bytes-constant
will probably take a noticeable amount of time. My point is
that the clearing itself is often a waste, and could be avoided
altogether by rearranging the program.

An old friend of mine used to characterize this sort of thing
as a laudable effort to clean the beach of bottle caps and other
kinds of trash, to make the sand nice and neat around the whale
carcases.
 
A

AndersWang

Eric,

Now I understand you:)

Don't put too much energy on this sort of tiny things, try to figure
out how to get rid of the whale carcases. Don't use memset(),if not
necessary. If have to, memset() won't be ur concern on performance,
unless the purpose of the program is to test the speed of memset()
itself:)
 
J

Jens Thoms Toerring

Don't put too much energy on this sort of tiny things, try to figure
out how to get rid of the whale carcases. Don't use memset(),if not
necessary. If have to, memset() won't be ur concern on performance,
unless the purpose of the program is to test the speed of memset()
itself:)

If I understood Eric correctly he wasn't advising against the use
of memset() but against worrying about possible speed differences
between using memset() and a loop. Actually there are arguments for
using memset() instead of a loop: it's shorter (just one line of
code instead of 2 or 3), no need for a loop variable and it might
be easier to understand what you're doing there (and it might even
be a bit faster;-). On the other hand you have to know where you
can't use memset() (e.g. for zeroing out an array of doubles or
pointers).
Regards, Jens
 
U

user923005

An old friend of mine used to characterize this sort of thing
as a laudable effort to clean the beach of bottle caps and other
kinds of trash, to make the sand nice and neat around the whale
carcases.

Too bad it needs context, or that would be a Jim-Dandy Sig.

Duff's device makes for a faster memset, except when it makes things
slower (Dik Winter provided an example IIRC).

Typically, library memsets use CPU specific assembly instructions to:
Fill in register wide words at a time,
then with modulus remainder -- fill in bytes.

Alternatively, some memset variants will use fancy-pants registers
like the MMX stuff or other CPU specific instuctions.

At any rate, trying to speed up memset() is a tempest in a teapot if
there ever was one.

Once again, I am forced to cough up Mike Lee's mantra:
"The number one rule of code optimization is: Don't do it.
The number two rule of code optimization (for experts only) is: Don't
do it yet."

IMO-YMMV.
 
E

Eric Sosman

Jens said:
If I understood Eric correctly he wasn't advising against the use
of memset() but against worrying about possible speed differences
between using memset() and a loop. Actually there are arguments for
using memset() instead of a loop: it's shorter (just one line of
code instead of 2 or 3), no need for a loop variable and it might
be easier to understand what you're doing there (and it might even
be a bit faster;-). On the other hand you have to know where you
can't use memset() (e.g. for zeroing out an array of doubles or
pointers).

Right: that's my argument, with the further point that if
you're doing so much memory-clearing that memset() becomes a
bottleneck, you're doing far too much memory-clearing and should
reconsider your design.
 
E

Eric Sosman

user923005 said:
[...]
Duff's device makes for a faster memset, except when it makes things
slower (Dik Winter provided an example IIRC).

FWIW, Duff's original Device was not memset-like.

http://en.wikipedia.org/wiki/Duff's_device
[...]
Once again, I am forced to cough up Mike Lee's mantra:
"The number one rule of code optimization is: Don't do it.
The number two rule of code optimization (for experts only) is: Don't
do it yet."

s/Mike Lee/Michael A. Jackson/

Since I harp on this theme at such nauseating length, some
people have accused me of being in favor of slow code. I reject
the accusation: I am in favor of optimization, but only when it
makes economic sense. If I spend five minutes on an optimization
that shaves a microsecond off a piece of code that will run one
million times, I have made a bad bargain. But if the code will
run a million times a minute twenty-four by seven, or if the
microsecond savings is the difference between keeping the video
streaming smoothly or suffering a glitch, that's another matter.

Like Theodore Roosevelt's Big Stick, optimization skills
should be carefully honed, well maintained, and rarely used.
 
U

user923005

user923005 said:
[...]
Duff's device makes for a faster memset, except when it makes things
slower (Dik Winter provided an example IIRC).

FWIW, Duff's original Device was not memset-like.

http://en.wikipedia.org/wiki/Duff's_device
[snip]
It's also a C-FAQ, but (of course) the idea is used to unroll loops
and (ostensibly) make them faster.
20.35: What is "Duff's Device"?

A: It's a devastatingly deviously unrolled byte-copying loop,
devised by Tom Duff while he was at Lucasfilm. In its "classic"
form, it looks like:

register n = (count + 7) / 8; /* count > 0 assumed */
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}

where count bytes are to be copied from the array pointed to by
from to the memory location pointed to by to (which is a memory-
mapped device output register, which is why to isn't
incremented). It solves the problem of handling the leftover
bytes (when count isn't a multiple of 8) by interleaving a
switch statement with the loop which copies bytes 8 at a time.
(Believe it or not, it *is* legal to have case labels buried
within blocks nested in a switch statement like this. In his
announcement of the technique to C's developers and the world,
Duff noted that C's switch syntax, in particular its "fall
through" behavior, had long been controversial, and that "This
code forms some sort of argument in that debate, but I'm not
sure whether it's for or against.")

But appearances can be deceiving. Despite that wonderfully arcane
switch goo, it might not be faster (and could even be slower).

Most optimizing compilers can unroll loops automatically anyway.
 
R

Randy Howard

[duff's device]
But appearances can be deceiving. Despite that wonderfully arcane
switch goo, it might not be faster (and could even be slower).

Most optimizing compilers can unroll loops automatically anyway.

It turns out that isn't always a true. There was an interesting
article on this in DDJ in the last year or two that goes into some
depth on it and some variations on the same theme.
 
K

Kevin D. Quitt

You are also able to find the c run-time library source codes in your
VSStudio/vc_version/crt/src folder

I've checked my debian system all over and I can't find any such directory.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top