Canonical way to copy an array

F

Frederick Gotham

What's the canonical way to copy an array in C++?

If we're copying a POD, we can use memcpy (but there could be a more
efficient alternative if we know that the blocks are suitably aligned).

Regardless of whether the array consists of POD's, we could use a loop such
as:

#include <cassert>

template<class T>
void CopyArray(T *pdest, T const *psource, T const *const psover)
{
assert(pdest);
assert(psource);
assert(psover); /* Contraversey here! */
assert(psover > psource);

do
{
*pdest++ = *psource++;
} while(psource != psover);
}

#define ARRLEN(arr) (sizeof(arr)/sizeof*(arr))

#define POVER(arr) ((arr) + ARRLEN((arr)))

int main()
{
int array1[5] = {1,2,3,4,5};

int array2[5];

CopyArray(array2,array1,POVER(array1));
}

However, I presume that there's a Standard Library function for doing this?
What's the most efficient method?

Also, how do you go about copy-constructing an array? Seems to me that
you'd have to use "aligned_storage", and then do a loop which uses
"placement new". I wonder why the following is disallowed:

int main()
{
int arr1[] = {1,2,3,4,5};

int arr2[] = arr1;
}
 
T

Thomas J. Gritzan

Frederick said:
What's the canonical way to copy an array in C++? [loop that copies elements]
However, I presume that there's a Standard Library function for doing this?
std::copy

What's the most efficient method?

Not defined by the standard.

You could use std::copy and when measured, that some way is faster for a
specific type, you could overload/specialize std::copy using the "faster"
method.
 
P

Philip Potter

Frederick Gotham said:
What's the canonical way to copy an array in C++?

If we're copying a POD, we can use memcpy (but there could be a more
efficient alternative if we know that the blocks are suitably aligned).

I doubt it. memcpy() normally knows all sorts of grubby details about
properties of memory addresses; if the addresses are suitably aligned,
memcpy() will normally use the more efficient implementation anyway.

Philip
 
C

Clark S. Cox III

What's the canonical way to copy an array in C++?
std::copy

If we're copying a POD, we can use memcpy

On a decent implementation, std::copy optimizes away to a call to
memcpy when appropriate (I know that GCC's implementation does this)
(but there could be a more efficient alternative if we know that the
blocks are suitably aligned).

Again, on a decent implementation, I doubt it. memcpy is likely to be
highly optimized (i.e. to take advantage of copying large, aligned
blocks, etc.)
Regardless of whether the array consists of POD's, we could use a loop such as:

[snip CopyArray implementation]

This is essentially what std::copy does when it's parameters are
pointers to non-POD objects. An example implementation of std::copy:

template< typename InputIterator, typename OutputIterator>
OutputIterator copy(InputIterator first, InputIterator last,
OutputIterator result)
{
while(first != last)
{
*result++ = *first++
}
return result;
}
However, I presume that there's a Standard Library function for doing
this? What's the most efficient method?

Also, how do you go about copy-constructing an array? Seems to me that
you'd have to use "aligned_storage", and then do a loop which uses
"placement new". I wonder why the following is disallowed:

int main()
{
int arr1[] = {1,2,3,4,5};

int arr2[] = arr1;
}

It's probably disallowed for no reason other than "that's the way C did it".
 
F

Frederick Gotham

Philip Potter posted:
I doubt it. memcpy() normally knows all sorts of grubby details about
properties of memory addresses; if the addresses are suitably aligned,
memcpy() will normally use the more efficient implementation anyway.

There would be the runtime overhead of checking whether the addresses are
suitably aligned, for instance:

(I've only written the following code in the last half-hour, so it's likely
to contain errors:)

#include <cstddef>
#include <limits>

using std::size_t;
using std::numeric_limits;

#define alignof(T) sizeof(T)
typedef long unsigned IntP; /* Integer type to store memory address */
#define STATIC_ASSERT(x) /* Nothing*/

template<class T>
void SpecialisedCopy(T *pdest,T const *psrc,size_t num)
{
while(num--) *pdest++ = *psrc++;
}

void *memcpy(void *const pdest,void const *const psrc,size_t const num)
{
STATIC_ASSERT(!numeric_limits<unsigned>::traps);
STATIC_ASSERT(!numeric_limits<unsigned short>::traps);

unsigned const dist_dest = (IntP)pdest % alignof(int);
unsigned const dist_src = (IntP)psrc % alignof(int);

if(dist_dest != dist_src)
return SpecialisedCopy<char unsigned>(
static_cast<char unsigned*>(pdest),
static_cast<char unsigned const*>(psrc), num), pdest;


switch(dist_dest)
{
case 0:
return SpecialisedCopy<unsigned>(
static_cast<unsigned*>(pdest),
static_cast<unsigned const*>(psrc), num/sizeof(int)), pdest;

case alignof(short):
return SpecialisedCopy<short unsigned>(
static_cast<short unsigned*>(pdest),
static_cast<short unsigned const*>(psrc), num/sizeof(short)),
pdest;

default:
return SpecialisedCopy<char unsigned>(
static_cast<char unsigned*>(pdest),
static_cast<char unsigned const*>(psrc), num), pdest;
}
}

int main()
{
int iarray1[16] = {},iarray2[16] = {};
short sarray1[16] = {},sarray2[16] = {};
char carray1[16] = {},carray2[16] = {};
float farray1[16] = {},farray2[16] = {};
double darray1[16] = {},darray2[16] = {};

memcpy(iarray2,iarray1,sizeof iarray2);
memcpy(sarray2,sarray1,sizeof sarray2);
memcpy(carray2,carray1,sizeof carray2);
memcpy(farray2,farray1,sizeof farray2);
memcpy(darray2,darray1,sizeof darray2);
}
 
G

Greg Comeau

I doubt it. memcpy() normally knows all sorts of grubby details about
properties of memory addresses; if the addresses are suitably aligned,
memcpy() will normally use the more efficient implementation anyway.

Maybe I've been asleep at the wheel, but I'm curious which
implementations of memcpy() knows all the grubby details
and actually uses them?
 
P

Philip Potter

Frederick Gotham said:
Philip Potter posted:

There would be the runtime overhead of checking whether the addresses are
suitably aligned

Oh noes! A bitwise-and, a check-for-zero, and a conditional branch of
overhead, in a function that typically copies large arrays of memory! The
extra gains you can get for using memcpy(), which will have been written
carefully, under controlled conditions, reviewed, rewritten, and tested
until it is as fast as is humanly possible, far outweigh this
"inefficiency".

Also note that in order to avoid these checks for alignment, you need to add
explicit function calls to aligned_memcpy() or equally_unaligned_memcpy() or
unequally_unaligned_memcpy(), none of which are substitutable for each
other. This makes your code much harder to maintain. (I don't really want to
keep track of alignments of all my arrays in order to use the right copy
function.)

See below for how memcpy() often beats hand-coded memory copying routines.
template<class T>
void SpecialisedCopy(T *pdest,T const *psrc,size_t num)
{
while(num--) *pdest++ = *psrc++;
}

You're relying greatly on your compiler here to unroll this loop well.
(Quite possibly your reliance is well-placed; I don't know.) Often memcpy()
will copy in blocks of 8 or 16 (using 8 or 16 registers) for the inner loop,
reducing problems of memory latency greatly. Of course it needs prologue and
epilogue code to handle the beginning and end of the array.
if(dist_dest != dist_src)
return SpecialisedCopy<char unsigned>(
static_cast<char unsigned*>(pdest),
static_cast<char unsigned const*>(psrc), num), pdest;

If gcd(dist_dest-dist_src,sizeof(machine_register_type)) > 1
(for example, if the arrays are misaligned by 16 bits on a machine with
32-bit registers), you can still copy shorts and be faster than copying
chars.
switch(dist_dest)
{
case 0:
return SpecialisedCopy<unsigned>(
static_cast<unsigned*>(pdest),
static_cast<unsigned const*>(psrc), num/sizeof(int)), pdest;

Assumes that num % (sizeof(unsigned)) == 0.
case alignof(short):
return SpecialisedCopy<short unsigned>(
static_cast<short unsigned*>(pdest),
static_cast<short unsigned const*>(psrc), num/sizeof(short)),

Assumes that num % (sizeof(unsigned short)) == 0.

Also, you'd get faster results by copying ONE short, then copying the rest
using unsigned (as you did above), then at the end copying shorts or chars
to make up the rest.
pdest;

default:
return SpecialisedCopy<char unsigned>(
static_cast<char unsigned*>(pdest),
static_cast<char unsigned const*>(psrc), num), pdest;
}

Similarly here, you can get better results by copying chars until the arrays
are aligned, then using unsigned.

Also note that your code becomes less efficient still if
sizeof(unsigned)!=sizeof(short)*2, and sizeof(short)!=sizeof(char)*2. (On a
machine with 64-bit integer registers, you may need to consider copying 64,
32, 16 or 8 bits at a time).

Philip
 
P

Philip Potter

Greg Comeau said:
Maybe I've been asleep at the wheel, but I'm curious which
implementations of memcpy() knows all the grubby details
and actually uses them?

I can't give concrete examples; I just know the theory, not who has put it
into practice. Of course, the particular "grubby details" I refer to are
things such as:

* The size of an integer register
* The number of available registers during a function call
* The size of any larger registers that may actually be faster for the
job[1]
* The latency of a cache access (since we're accessing contiguous memory,
assume cache hits are very likely) for aligned and unaligned (if possible)
fetches
* Whether the machine has any specialized hardware for memcpy()ing!

And the uses they get put to are:
* How deeply the loop can be unrolled (constrained by the number of
registers available)
* How much data can be fetched from memory at once (constrained by the size
of a machine register times the available registers)
* How deeply the loop needs to be unrolled to eliminate memory latency
(constrained by... memory latency).
* Whether the specialized hardware can do it faster!

It would surprise me if many serious memcpy() implementations didn't take
advantage of these sorts of things.

Philip

[1] I have heard that the Pentium's MMX registers are better for copying
memory than its integer registers.
 
J

Jerry Coffin

[ ... ]
Also, how do you go about copy-constructing an array?

You don't. Copy-constructing requires a copy-contructor, and built-in
arrays don't have them.

You can copy-construct a vector, or you can construct a vector as a copy
of the data in an array.
Seems to me that
you'd have to use "aligned_storage", and then do a loop which uses
"placement new".

I believe that's pretty much how vector's copy ctor is usually
implemented. Constructing a vector from an array probably does pretty
much the same thing, though I haven't checked to be sure.

If you want something that doesn't use dynamic storage, you might
consider TR1's array class.
 
J

Jerry Coffin

[email protected] says... said:
Maybe I've been asleep at the wheel, but I'm curious which
implementations of memcpy() knows all the grubby details
and actually uses them?

I'm not sure I've ever seen one that did.

OTOH, I've seen some compilers that were good at recognizing and
optimizing block copies. Just for one obvious example, for quite a while
now, Visual C++ has been able to recognize a loop like:

for (int i=0; i<size; i++)
whatever = whatever2;

and turn it into code like:

mov ecx, size
lea esi, dword ptr whatever2
lea edi, dword ptr whatever
rep movsd

In theory, you can do a little better than this using SSE2 instructions,
in but in fact it all comes out about the same unless (nearly) all of
the data is in the cache to start with -- otherwise, the memory bus is
the bottleneck.

I suppose it's possible that MS' current memmove and/or memcpy do a bit
more though -- normally they just forward to RtlMemoryMove and
RtlMemoryCopy respectively, and I haven't looked through either of them
recently to figure out how fancy they get. The last time I looked, they
weren't particularly impressive, but that was quite a while ago (around
NT 4 or so).
 
F

Frederick Gotham

Philip Potter posted:
(On a machine with 64-bit integer registers, you may need to consider
copying 64, 32, 16 or 8 bits at a time).


One thing which I considerd was the following (but I don't think it's
strictly allowed). It will only work if the array length is known at
compile time. (Alternatively, you could get it to work with a runtime
length, but it would involve thousands of template specialisations and a
switch statement.)

#include <cstddef>

using std::size_t;

template<class T, size_t len>
void CopyArray(T (&dst)[len], T const (&src)[len])
{
struct Copiable {
T arr[len];
};

reinterpret_cast<Copiable&>(dst) =
reinterpret_cast<Copiable const &>(src);
}

int main()
{
int iarray1[16] = {},iarray2[16] = {};
short sarray1[16] = {},sarray2[16] = {};
char carray1[16] = {},carray2[16] = {};
float farray1[16] = {},farray2[16] = {};
double darray1[16] = {},darray2[16] = {};

CopyArray(iarray2,iarray1);
CopyArray(sarray2,sarray1);
CopyArray(carray2,carray1);
CopyArray(farray2,farray1);
CopyArray(darray2,darray1);
}

Do you think could this be more efficient than memcpy?
 
C

Clark S. Cox III

Frederick said:
Philip Potter posted:
(On a machine with 64-bit integer registers, you may need to consider
copying 64, 32, 16 or 8 bits at a time).


One thing which I considerd was the following (but I don't think it's
strictly allowed). It will only work if the array length is known at
compile time. (Alternatively, you could get it to work with a runtime
length, but it would involve thousands of template specialisations and a
switch statement.)

#include <cstddef>

using std::size_t;

template<class T, size_t len>
void CopyArray(T (&dst)[len], T const (&src)[len])
{
struct Copiable {
T arr[len];
};

reinterpret_cast<Copiable&>(dst) =
reinterpret_cast<Copiable const &>(src);
}

int main()
{
int iarray1[16] = {},iarray2[16] = {};
short sarray1[16] = {},sarray2[16] = {};
char carray1[16] = {},carray2[16] = {};
float farray1[16] = {},farray2[16] = {};
double darray1[16] = {},darray2[16] = {};

CopyArray(iarray2,iarray1);
CopyArray(sarray2,sarray1);
CopyArray(carray2,carray1);
CopyArray(farray2,farray1);
CopyArray(darray2,darray1);
}

Do you think could this be more efficient than memcpy?

As with any question of efficiency, the answer is a definite "Maybe,
maybe not". There's no way to say in the general case; it may be faster
on one platform, slower on another, or it may actually be a memcpy call
under the hood.
 
C

Clark S. Cox III

Greg said:
Maybe I've been asleep at the wheel, but I'm curious which
implementations of memcpy() knows all the grubby details
and actually uses them?

I know that Apple's implementation of memcpy/memmove for MacOSX takes
advantage of several grubby details and actually has a few separate
internal implementations (for different alignments, for copies less than
32 bytes, etc.)
 
G

Greg

Clark said:
I know that Apple's implementation of memcpy/memmove for MacOSX takes
advantage of several grubby details and actually has a few separate
internal implementations (for different alignments, for copies less than
32 bytes, etc.)

The PowerPC implementation of memcpy on Mac OS X uses floating point
registers to copy 64 bits of memory with each operation.

Greg
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top