Containers and sorting objects vs. pointers

M

Matthias Kaeppler

Hi,

in my program, I have to sort containers of objects which can be 2000
items big in some cases. Since STL containers are based around copying
and since I need to sort these containers quite frequently, I thought
it'd be a better idea to manage additional containers which are
initialized with pointers to the objects in the primary containers and
sort those (only pointers have to be copied around then).

However, that also means if I have 2000 objects, I have to create *two*
containers holding 2000 objects each, the first with the real objects
and the second with pointers to those.

My question is:
If those objects I'm talking about are between 8 and 16 bytes, would it
be faster to simply perform the sort algos on the objects themselves or
is it better to create that additional container with pointers and
continue working with that?
 
P

pven

Hi,

I think the best approach is to sort with the primary container itself
with your object structures.

Also, if you are planning to use an additional container that stores
the pointer to the objects in the primary container, ensure that the
objects in the primary container are on the heap.

The reason for this is, When a container rearranges itself (lets say
you keep adding objects and the container needs to move the objects to
a new memory location which is bigger), it will shift objects around by
calling the copy constructor of your object.

So after such an operation, your pointers in the other container might
become invalid.
 
P

pven

Also, if you are planning to use an additional container that stores
the pointer to the objects in the primary container, ensure that the
objects in the primary container are on the heap.

The reason for this is, When a container rearranges itself (lets say
you keep adding objects and the container needs to move the objects to
a new memory location which is bigger), it will shift objects around by
calling the copy constructor of your object.

So after such an operation, your pointers in the other container might
become invalid.

My bad,

The above logic applies only if you are using a vector container.

After a little thought, I suggest you use a list container for storing
the objects, Sorting and re-arranging will be efficent. List is similar
to linked list datastructure..
 
M

Mercator

Matthias said:
My question is:
If those objects I'm talking about are between 8 and 16 bytes, would it
be faster to simply perform the sort algos on the objects themselves or
is it better to create that additional container with pointers and
continue working with that?

C++ has time functions which help you finding the answer. Performance
assertions without measurement are moot. See also:
http://www.informit.com/guides/content.asp?g=cplusplus&seqNum=156
 
M

Matthias Kaeppler

pven said:
My bad,

The above logic applies only if you are using a vector container.

After a little thought, I suggest you use a list container for storing
the objects, Sorting and re-arranging will be efficent. List is similar
to linked list datastructure..

Yep, that's exactly how I do it right now; objects
in a linked list, pointers in an std::vector (the
latter mostly because IIRC std::partition doesn't
work on lists).
 
M

Matthias Kaeppler

Mercator said:
C++ has time functions which help you finding the answer. Performance
assertions without measurement are moot. See also:
http://www.informit.com/guides/content.asp?g=cplusplus&seqNum=156

That's great, thanks. I have been looking for
something like this for quite some time.
Probably that'll answer my question most precisely
(whereas I tend to agree with the other poster
that it's probably less overhead to sort the real
objects, since they can be copied with a trivial
copy constructor and are quite lean).

Regards,
Matthias
 
B

Bart

Matthias said:
Hi,

in my program, I have to sort containers of objects which can be 2000
items big in some cases. Since STL containers are based around copying
and since I need to sort these containers quite frequently, I thought
it'd be a better idea to manage additional containers which are
initialized with pointers to the objects in the primary containers and
sort those (only pointers have to be copied around then).

However, that also means if I have 2000 objects, I have to create *two*
containers holding 2000 objects each, the first with the real objects
and the second with pointers to those.

My question is:
If those objects I'm talking about are between 8 and 16 bytes, would it
be faster to simply perform the sort algos on the objects themselves or
is it better to create that additional container with pointers and
continue working with that?

As with anything performance-related, there's no point in discussing
this without performing some tests, but I suspect the sorting algorithm
would only copy the objects when required. Why do you think you would
incur any penalty over copying them yourself?
 
S

Stuart MacMartin

Sorry, I'm not impressed with the answers so far.

2000 is a small number. Try just sorting the container (use std::sort
and provide either an operator< or a predicate). If this is not a
performance problem, don't worry about it.

If it is a performance problem, first check that it's implemented
right:

1. Make sure a < b < c ==> a < c and !(b < a). f your < operator or
predicate is at all complicated you should add debugging tests to be
sure. Otherwise std::sort() might underrun and give the impression of
performance problem.

2. Make sure you aren't running on Windows with condition of a sorted
list + one unsorted item at the end. With only 2000 items this
shouldn't be an issue, but if you had 100X this amount you may find
yourself waiting minutes or hours for the sort to finish.

3. If the implementation is correct and you still have a problem, there
are two places your sort will be spending time:
a) comparing
b) swapping.

(b) can be significant with large objects, especially if the object
owns lots of additional data. There can be lots of swapping while you
sort. As a general rule I never sort anything other than vectors of
pointers or trivial objects, but I tend to deal with large numbers of
objects (10s of thousands to millions). Here's where that heap comment
someone else made came into play:

If you add an item to a vector, every item in the vector can be moved.
If you're going to have an auxiliary list of pointers, you must do one
of four things:

1) Allocate all items and never add another, or at least reserve() the
required space. Now the pointers are fixed.
2) Every time you add an item, toss your sorted list, and create it
again as needed
3) Use a linked list or other structure to hold your master list of
objects. This keeps your objects from moving around
4) Allocate objects on the heap, and only have vectors of pointers.

I assume you don't have a problem copying objects in general because
you're willing to have a vector of them by value. But this is the one
time when you might consider using a linked list over some other data
structure.

Stuart
 
S

Stuart MacMartin

Three other little points:

1. Sorting a singly linked list is really inefficent.
Depending on the sort algorithm, sorting a
doubly linked list can be inefficient

2. If the objects are tight enough that you haven't used a heap
but just have them in a vector, chances are sorting the vector
will be good enough. If not, revisit your other decisions.

3. The comparison operator makes a big difference
the runtime of your sort:
a) Make it as tight as possible
b) Use sort() with an inlined operator< or a function object
to make sure the comparison is inlined
c) Never treat two objects as equal in your ordering.
If you can have several equivalent objects, like "red" vs.
"black",
your comparison should return &obj1 < &obj2 if they are
both red or both black.

Stuart
 
M

Matthias Kaeppler

Thanks Stuart, that was quite insightful.
I am now simply holding a vector with the objects directly. Previously I
had used a doubly linked list to hold the objects and a vector to hold
the pointers.

Here is a predicate function for sorting File objects by name ascending:

struct FileLessNameA
: std::binary_function<const File&,const File&,bool> {
bool operator() (const File& lhs, const File& rhs) const {
std::string path1( lhs.get_name().c_str() );
std::string path2( rhs.get_name().c_str() );
boost::algorithm::to_lower( path1 );
boost::algorithm::to_lower( path2 );
return path1 < path2;
}
};

I am converting to all lowercase because I don't want FOO to be less
than foo.

PS: A File object is actually nothing more than a
boost::filesystem::path and a reference counted smart pointer to a
Gnome::Vfs::FileInfo object. I think it's maybe 8-16 bytes in size.
 
M

Mercator

Matthias said:
Here is a predicate function for sorting File objects by name ascending:

struct FileLessNameA
: std::binary_function<const File&,const File&,bool> {
bool operator() (const File& lhs, const File& rhs) const {
std::string path1( lhs.get_name().c_str() );
std::string path2( rhs.get_name().c_str() );

Oops, you copy (duplicate!) strings here all the time! Certainly a
performance bottleneck. Better use something like stricmp:

return stricmp (lhs.get_name().c_str(),rhs.get_name().c_str()) < 0;

(assuming File::get_name() returns a reference to string)
boost::algorithm::to_lower( path1 );
boost::algorithm::to_lower( path2 );
return path1 < path2;
}
};

I am converting to all lowercase because I don't want FOO to be less
than foo.

PS: A File object is actually nothing more than a
boost::filesystem::path and a reference counted smart pointer to a
Gnome::Vfs::FileInfo object. I think it's maybe 8-16 bytes in size.

BTW, do you really need _that_ complexity: Gnome, smart pointer, Boost?
 
M

Matthias Kaeppler

Mercator said:
Oops, you copy (duplicate!) strings here all the time! Certainly a
performance bottleneck. Better use something like stricmp:

return stricmp (lhs.get_name().c_str(),rhs.get_name().c_str()) < 0;

What do you mean? I only want the lowercase conversion for comparison
purposes, I don't want to modify the original strings (I am viewing
directory contents, no way I would modify the filenames!).
I /have/ to work with copies here.

And where does that stricmp call take into account converting to all
lowercase?
(assuming File::get_name() returns a reference to string)

It returns by value.
BTW, do you really need _that_ complexity: Gnome, smart pointer, Boost?

Yes. What should I say? I am developing a filemanager. I want an
abstraction from the real filesystem, and that's what Gnome::Vfs does
for me.
I also want lambda expressions, smart pointers, easy working with paths
and other convenience stuff, so I need Boost. For example:
Boost.filesystem has functionality for determining root paths, easy
concatenation of paths etc, which I found to be lacking in Gnome::Vfs.
They combine quite nicely. Boost.Filesystem however has no decent
functionality whatsoever for actually working with files (concurrent
transfers with means to listen for progress, determining file types and
so on).
 
M

Mercator

Matthias said:
What do you mean? I only want the lowercase conversion for comparison
purposes, I don't want to modify the original strings (I am viewing
directory contents, no way I would modify the filenames!).
I /have/ to work with copies here.

stricmp doesn't change the string. It just compares strings case
insensitive (of course, without creating/allocating new strings). BTW,
you can easily write your own stricmp (with std::tolower) if your
platform doesn't have one.
And where does that stricmp call take into account converting to all
lowercase?


It returns by value.
Oops.


Yes. What should I say? I am developing a filemanager. I want an
abstraction from the real filesystem, and that's what Gnome::Vfs does
for me.
I also want lambda expressions, smart pointers, easy working with paths
and other convenience stuff, so I need Boost. For example:
Boost.filesystem has functionality for determining root paths, easy
concatenation of paths etc, which I found to be lacking in Gnome::Vfs.
They combine quite nicely. Boost.Filesystem however has no decent
functionality whatsoever for actually working with files (concurrent
transfers with means to listen for progress, determining file types and
so on).

http://c2.com/cgi/wiki?KeepItSimple
http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork
 
M

Matthias Kaeppler

Mercator said:
stricmp doesn't change the string. It just compares strings case
insensitive (of course, without creating/allocating new strings). BTW,
you can easily write your own stricmp (with std::tolower) if your
platform doesn't have one.

Is stricmp platform independent? I don't think so! But if you know a
platform independent wrapper for stricmp, let me know.

Honestly, I don't have the impression you even read what I wrote :-/
Those "patterns", despite their poor expressiveness quite correct, are
mostly bladdering about things which should be pretty obvious to anybody.

So why not give concrete examples where you think my code could be made
simpler without losing its level of abstraction and ease of use? Your
suggestion for using stricmp may result in more efficient, but less
portable code, so it's not a serious alternative.

I'm also eager to hear what you suggest as a replacement for Gnome::Vfs
and boost such that your solution is a) more efficient and b) still
provides the same functionality.

I don't have the impression that my code is overly complex, but I'm
always open for /constructive/ feedback (those two articles were
certainly not) which proves me wrong.

Regards,
Matthias
 
S

Stuart MacMartin

I did say keep your comparison function lean.
In particular, anything allocating memory on the heap (like string
copies) is a really bad idea.

Isn't stricmp available everywhere?
Whether it is or not, you're using std::string and I assume there's a
standard method or function for comparing strings case insensitive.
(Sorry, I don't use it so I don't know). Since you need it now, find
out and you'll always know.

BTW: using extra utility classes can make things simpler or more
complex, depending on the situation. I didn't read those links either,
but I assume they're a more long-winded way of quoting Einstein:
"Everything should be made as simple as possible, but not simpler"

Stuart
 
M

Matthias Kaeppler

Stuart said:
I did say keep your comparison function lean.
In particular, anything allocating memory on the heap (like string
copies) is a really bad idea.

How do you know that std::string allocates memory on the heap? I can't
remember standard C++ demanding that. AFAIK, this is up to the
implementation.
Isn't stricmp available everywhere?

That's not the point.

You may want to read this article by Matt Austern dealing with the
problems of string comparison:
http://lafstern.org/matt/col2_new.pdf
Whether it is or not, you're using std::string and I assume there's a
standard method or function for comparing strings case insensitive.

No, there's not. Strings compare case sensitive. Read the article to
learn why lexicographical_compare has its subtle problems.
BTW: using extra utility classes can make things simpler or more
complex, depending on the situation. I didn't read those links either,
but I assume they're a more long-winded way of quoting Einstein:
"Everything should be made as simple as possible, but not simpler"

Let me answer this with a quote from Austerns paper:
"For every complex problem there is a solution that is simple, neat, and
wrong."
Just like the suggestion using stricmp or std::lexicographical_compare.
boost::algorithm::to_lower doesn't suffer from this problem, because it
takes locales into account.

However, instead of implementing the approach presented in Austerns
paper on my own, I rather ask on the boost list if it's not already in
their library (or something equivalent). Though I couldn't find it yet,
I'd be surprised if it's not in there (if I'm not mistaken, Matt Austern
is a long time contributor to the boost libraries).
 
M

Matthias Kaeppler

Matthias said:
How do you know that std::string allocates memory on the heap? I can't
remember standard C++ demanding that. AFAIK, this is up to the
implementation.

.... and my C++ implementation indeed seems to allocate the string data
on the heap. So much about that :D
 
M

Mercator

Matthias said:
How do you know that std::string allocates memory on the heap? I can't
remember standard C++ demanding that. AFAIK, this is up to the
implementation.

How would you implement a string template the holds (almost) arbitrary
long strings without using the heap? (BTW, yes, know about 'small
string optimization').
That's not the point.

You may want to read this article by Matt Austern dealing with the
problems of string comparison:
http://lafstern.org/matt/col2_new.pdf

The point is that you need not allocate any new strings in order to
compare two strings case insensitively. Things get a little more
complicated if (and only if) you want to be 'locale aware'. But that's
not the point ...
 
M

Matthias Kaeppler

Mercator said:
The point is that you need not allocate any new strings in order to
compare two strings case insensitively. Things get a little more
complicated if (and only if) you want to be 'locale aware'. But that's
not the point ...

Yes, that's correct.
I have already raised the point on boost users mailing list. We came to
the conclusion that a functor similar to is_iequal for comparing using
"less than" semantics would be a useful addition.

I have slightly modified compare.hpp to feature this functor. Pavol
Droba, the man behind the string algo library, has already agreed to add
such a functor to the library if he finds the time. It will probably
look something like this:

struct is_iless
{
//! Constructor
/*!
\param Loc locales used for comparison
*/
is_iless( const std::locale& Loc=std::locale() ) :
m_Loc( Loc ) {}

//! Function operator
/*!
Compare two operands. Case is ignored.
*/
template< typename T1, typename T2 >
bool operator ()( const T1& Arg1, const T2& Arg2 ) const
{
return std::toupper(Arg1,m_Loc)<std::toupper(Arg2,m_Loc);
}

private:
std::locale m_Loc;
};

It works just fine and is exactly what I need. I can then use it as a
predicate for std::lexicographical_compare to test two strings for a
'less than' relation. Thus I can also get rid of the two copies without
losing any of the functionality.

Regards,
Matthias
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,535
Members
45,007
Latest member
obedient dusk

Latest Threads

Top