Memset is faster than simple loop?

Keith Thompson

Kevin D. Quitt said:
I've checked my debian system all over and I can't find any such directory.

<OT><SILLY>
Well, there's your problem; you need to look for a folder, not a
directory.
</SILLY></OT>
 
Anthony Irwin

Kevin said:
I've checked my debian system all over and I can't find any such directory.

I think you will find most of your headers in /usr/include

Kind Regards,
Anthony Irwin
 
Flash Gordon

Anthony Irwin wrote, On 28/03/07 05:38:
I think you will find most of your headers in /usr/include

I think you missed at least two points.

1) It is the source for the libraries, not the headers, that is being talked about.

2) Saying that the sources are in a specific directory is wrong in general, because it depends on lots of things, including the OS. The same, of course, applies to headers; in both cases they may not even be available as text files.
 
Tor Rustad

Hi,

Can anybody here explain to me why memset would be faster than a
simple loop? I doubt it!


This is a compiler QoI (quality of implementation) issue, and has nothing
to do with the C language itself. It also depends on which hardware you
run the compiled program on: speedup tricks that work for one CPU might
stop working when you run your program on a different model of the same
CPU family.

How fast a memcpy() implementation executes may also depend on what kind
of load/store you want to do, e.g.

L1 to L1
L1 to L2
L1 to main memory
etc.

The first speedup rule is to make sure the data is aligned before moving
big chunks of it. Then there are further possibilities such as pre-warming
the cache, loop unrolling, etc.
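
As a rough illustration of those two tricks, here is a minimal sketch of an aligned, unrolled fill routine. It is a hypothetical example written for this discussion, not anyone's actual library code; a real implementation would also have to worry about aliasing rules and the platform's actual word size.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: fill a buffer by writing unaligned leading bytes
 * one at a time, then storing word-sized chunks with a 4x unrolled loop,
 * then finishing the tail byte by byte. Not production code. */
void *fill_bytes(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)c;

    /* Step 1: align the destination to the word size. */
    while (n > 0 && (uintptr_t)p % sizeof(size_t) != 0) {
        *p++ = byte;
        n--;
    }

    /* Build a word whose every byte is 'byte'. */
    size_t word = byte;
    for (size_t shift = 8; shift < sizeof(size_t) * 8; shift *= 2)
        word |= word << shift;

    /* Step 2: 4x unrolled word-sized stores. */
    size_t *wp = (size_t *)(void *)p;
    while (n >= 4 * sizeof(size_t)) {
        wp[0] = word;
        wp[1] = word;
        wp[2] = word;
        wp[3] = word;
        wp += 4;
        n  -= 4 * sizeof(size_t);
    }

    /* Step 3: remaining whole words, then the tail bytes. */
    p = (unsigned char *)wp;
    while (n >= sizeof(size_t)) {
        *(size_t *)(void *)p = word;
        p += sizeof(size_t);
        n -= sizeof(size_t);
    }
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}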

When the Pentium II came out, I wrote a memcpy() replacement in C which
used all the tricks I knew. When I checked it against the asm
implementation by Microsoft, my replacement ran 2x faster.

According to the Intel manuals, really fast memory transfers were possible
via the MMX registers, but I didn't check that out, since it would have
required inline assembler.

Of course, when I measured my memcpy() implementation a couple of years
later on a P3, the latest Microsoft version in the standard library ran
2x faster!!! :)
 
Clearing and copying blocks of memory is a critical and common operation in any BIOS or OS. Just about all major processors(1) provide some combination of special purpose instructions, registers, and co-processors for block copying and filling that are much faster than looping through normal instructions.

The reason that memset and memcpy will always(2) be much faster than a loop on a typical processor is the special importance of these types of 'block' operations and the fact that memset will detect the type of processor and use the appropriate feature. These features are not compiler optimizations such as loop unrolling or bit alignment, but intrinsic platform capabilities.

If you look at the code sample from the OP, you will notice that if any of the major processors are defined it will call RTFillMemory. RTFillMemory will be platform dependent and take advantage of a block operation. Only when building for an unsupported processor will memset devolve into a loop.
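
The OP's code sample is not reproduced here, so the following is only a hypothetical sketch of that kind of dispatch; the TARGET_* macros and the RTFillMemory prototype are assumptions made for illustration, not the OP's actual code.

#include <stddef.h>

/* Assumed prototype for the platform-specific block fill (hypothetical). */
void RTFillMemory(void *dst, size_t count, unsigned char value);

/* Hypothetical dispatch: hand off to a platform block fill when one of
 * the supported processors is targeted, otherwise fall back to a loop. */
void *my_memset(void *dst, int c, size_t n)
{
#if defined(TARGET_X86) || defined(TARGET_PPC) || defined(TARGET_ARM)
    /* Platform-dependent fill using a block operation (e.g. string
     * instructions, cache-line stores, SIMD). */
    RTFillMemory(dst, n, (unsigned char)c);
#else
    /* Unsupported processor: devolve into a plain byte loop. */
    unsigned char *p = dst;
    while (n-- > 0)
        *p++ = (unsigned char)c;
#endif
    return dst;
}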

Even if written in optimized assembly language, looping through a sequence of instructions adds a significant overhead. Loops move the instruction pointer either through branches or jumps that chew up clock cycles every time through the loop. If the code being looped is trivial such as a fill or copy, you may spend more clock cycles looping than processing. Block instructions internalize the loop to microcode inside a single instruction. They take a fraction of the time, sometimes a small fraction.

Block instructions are not only faster, but can sometimes operate in parallel with surrounding CPU instructions. They may use DMA or a co-processor or special purpose registers. In some situations, an optimizer may be able to entirely eliminate the time-cost of a memset.

Compilers (including JIT compilers) are better able to take advantage of processor-specific block operations if the programmer calls the language's block function, such as memset, instead of writing a loop. Even if the compiler could detect that a loop could be replaced by a memset, this would be a really presumptuous optimization.
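
For illustration, the two functions below do the same thing, but the second states the intent directly, so the library and compiler are free to use whatever block fill the target provides (a trivial, hypothetical example):

#include <string.h>

#define BUF_SIZE 4096

static char buf[BUF_SIZE];

/* Explicit loop: the compiler has to recognize this as a block fill
 * before it can substitute anything faster. */
void clear_with_loop(void)
{
    for (size_t i = 0; i < BUF_SIZE; i++)
        buf[i] = 0;
}

/* memset: the intent is stated up front. */
void clear_with_memset(void)
{
    memset(buf, 0, sizeof buf);
}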

Examples:
SIMD on many modern processors
the REP prefix and CX register on x86 (see the sketch after this list)
loop mode on the 68010
the Blitter co-processor on the Amiga
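
As a concrete sketch of the x86 example, the routine below fills memory with REP STOSB through GCC-style inline assembly. This is non-standard C, x86/x86-64 only, and given purely as an illustration of how the count lands in CX/ECX/RCX and the destination in DI/EDI/RDI.

#include <stddef.h>

/* Illustrative only: fill 'count' bytes at 'dst' with 'value' using the
 * x86 string instruction REP STOSB (GCC-style inline asm, non-portable).
 * Assumes the direction flag is clear, as the usual ABIs require. */
static void rep_stosb_fill(void *dst, unsigned char value, size_t count)
{
    __asm__ volatile ("rep stosb"
                      : "+D" (dst), "+c" (count)  /* destination in (R/E)DI, count in (R/E)CX */
                      : "a" (value)               /* byte value in AL */
                      : "memory");
}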


(1) There may be RISC processors or some esoteric processors that give no special consideration to block operations.
(2) Of course there is no such thing as *always* but close enough so it makes no difference.

--arizonace
 
