list vs. set speed

Henrik Goldman · Jun 27, 2007

I have a collection of data consting of a time_t timestamp, a std::string
username and a hostname.

This information is read from a text file which can easily be up to 100 mb.
This means that there is a lot of records to be processed.

Until now I've been using a std::list but it seems to be too slow.
The reason why is that duplicates (e.g. same time and user and host) are not
allowed. So until now a linear search has been done to reject all duplicate
elements when inserting. This can take a lot of time though.
In order to speed it up I'm looking for good ideas. I was thinking about a
multimap which consist of time_t as first and username+hostname as second.
However it seems akward since the data is really one entity.

Maybe set is better? However it can only be better if the search speed is
faster.

Thanks.

-- Henrik

Victor Bazarov · Jun 27, 2007

Henrik said:
I have a collection of data consting of a time_t timestamp, a
std::string username and a hostname.

This information is read from a text file which can easily be up to
100 mb. This means that there is a lot of records to be processed.

Order of one million records?

Until now I've been using a std::list but it seems to be too slow.
The reason why is that duplicates (e.g. same time and user and host)
are not allowed. So until now a linear search has been done to reject
all duplicate elements when inserting. This can take a lot of time
though. In order to speed it up I'm looking for good ideas. I was thinking
about a multimap which consist of time_t as first and
username+hostname as second. However it seems akward since the data
is really one entity.
Maybe set is better? However it can only be better if the search
speed is faster.

With as significant number of elements as yours, the difference
between the linear search and logarithmic search becomes exreme.
However, I'd probably try a slightly different approach -- insert
all of them in a vector, then sort it, then remove duplicates by
using 'std::unique'.

Just like with any other performance problem, you won't know which
one is better until you actually measure.

V

joe · Jun 27, 2007

I have a collection of data consting of a time_t timestamp, a std::string
username and a hostname.

This information is read from a text file which can easily be up to 100 mb.
This means that there is a lot of records to be processed.

Until now I've been using a std::list but it seems to be too slow.
The reason why is that duplicates (e.g. same time and user and host) are not
allowed. So until now a linear search has been done to reject all duplicate
elements when inserting. This can take a lot of time though.
In order to speed it up I'm looking for good ideas. I was thinking about a
multimap which consist of time_t as first and username+hostname as second.
However it seems akward since the data is really one entity.

Maybe set is better? However it can only be better if the search speed is
faster.

Thanks.

-- Henrik

The set would get you O(log2(N)) search time, which is bunches better
than std::list for lot of data.

Do you know anything about the data? Is it totally random, or are
entries chronological? If they are, then you could use a vector<> to
store the data, std::lower_bound() to find the first record with the
correct time and a linear search from there to determine if the data
is already present. This may give you better performance than either
list or set.

joe

Henrik Goldman · Jun 27, 2007

The set would get you O(log2(N)) search time, which is bunches better

than std::list for lot of data.

Do you know anything about the data? Is it totally random, or are
entries chronological? If they are, then you could use a vector<> to
store the data, std::lower_bound() to find the first record with the
correct time and a linear search from there to determine if the data
is already present. This may give you better performance than either
list or set.

The records are not guarrantied to be in correct order. They should be
chronological. However there may be multiple input data sources involved so
that you get an overlapping dataset. E.g. input A might be newer or older
then B. For this reason it's important that the data is sorted in the
correct order.
I think I'll give it a try initially with the set and see how it goes.

Thanks.

-- Henrik

Henrik Goldman · Jun 27, 2007

This information is read from a text file which can easily be up to

Order of one million records?

That might be too much. Not all the lines contain information. However
several thousands of records is a sure bet.

With as significant number of elements as yours, the difference
between the linear search and logarithmic search becomes exreme.
However, I'd probably try a slightly different approach -- insert
all of them in a vector, then sort it, then remove duplicates by
using 'std::unique'.

Just like with any other performance problem, you won't know which
one is better until you actually measure.

Thanks for the suggestion. I don't think it will work out though... See my
other post.

-- Henrik

Victor Bazarov · Jun 27, 2007

Henrik said:
That might be too much. Not all the lines contain information. However
several thousands of records is a sure bet.

Thanks for the suggestion.
Suggestions.

I don't think it will work out though...

Really? Why?

See my other post.

Well, see my other posts on performance of containers, then.

V

Henrik Goldman · Jun 27, 2007

It doesn't seem to work fully as expected.

Here is the set class:

class CUser

{

bool operator<(const CLicenseUser &rhs) const

{

return (StrLower(m_strUsername) == StrLower(rhs.m_strUsername)) ?

(StrLower(m_strHostname) < StrLower(rhs.m_strHostname)) :

(StrLower(m_strUsername) < StrLower(rhs.m_strUsername));

}

string m_strUsername;

string m_strHostname;

};

class CDeniedUser : public CUser
{
public:
CDeniedUser()
{
m_tDeny = 0;
}

time_t m_tDeny;

bool operator<(const CDeniedUser &rhs) const
{
return (m_tDeny < rhs.m_tDeny && ((CLicenseUser) *this < (CLicenseUser)
rhs));
}
};

Unfortunatly it doesn't get all the items inserted:

set<CDeniedUser> S;

CDeniedUser U1;
U1.m_strHostname = "A";
U1.m_strUsername = "A";
U1.m_tDeny = 1;

bool b = S.insert(U1).second;

S.size();

b = S.insert(U1).second;

U1.m_tDeny = 0;

b = S.insert(U1).second;

S.size();

printf("%d\n", itFeature->m_UserSet.begin()->m_tDeny);

The problem here is that the item with tDeny = 0 doesn't get inserted.
However the less function should be ok.

What am I missing?

-- Henrik

Victor Bazarov · Jun 27, 2007

Henrik said:
It doesn't seem to work fully as expected.

Here is the set class:

class CUser

{

bool operator<(const CLicenseUser &rhs) const

What's "CLicenseUser"?

{

return (StrLower(m_strUsername) == StrLower(rhs.m_strUsername)) ?

(StrLower(m_strHostname) < StrLower(rhs.m_strHostname)) :

(StrLower(m_strUsername) < StrLower(rhs.m_strUsername));

}

string m_strUsername;

string m_strHostname;

};

class CDeniedUser : public CUser
{
public:
CDeniedUser()
{
m_tDeny = 0;

FAQ 10.6.

}

time_t m_tDeny;

bool operator<(const CDeniedUser &rhs) const
{
return (m_tDeny < rhs.m_tDeny && ((CLicenseUser) *this <
(CLicenseUser) rhs));

Are you sure this implementation provides strict weak ordering?
Are you sure you need '&&' there? When there are two parts of
some structure that should be compared and one is more important,
the model is usually

return important < rhs.important ||
(important == rhs.important && lessimportant < lessimportant);

}
};

Unfortunatly it doesn't get all the items inserted:

set<CDeniedUser> S;

CDeniedUser U1;
U1.m_strHostname = "A";
U1.m_strUsername = "A";
U1.m_tDeny = 1;

bool b = S.insert(U1).second;

S.size();

b = S.insert(U1).second;

U1.m_tDeny = 0;

b = S.insert(U1).second;

S.size();

printf("%d\n", itFeature->m_UserSet.begin()->m_tDeny);

The problem here is that the item with tDeny = 0 doesn't get inserted.
However the less function should be ok.

What am I missing?

Proper less function, most likely.

V

Henrik Goldman · Jun 27, 2007

What's "CLicenseUser"?

Sorry about that. It should have been CUser. I renamed some of the names
when posting.

FAQ 10.6.

Thank you ... I'll keep that in mind. However as the FAQ says:

"Note: There is no performance difference if the type of x_ is some
built-in/intrinsic type, such as int or char* or float. But even in these
cases, my personal preference is to set those data members in the
initialization list rather than via assignment for consistency."

time_t is of type int so it won't give any performance benifits.

Are you sure this implementation provides strict weak ordering?
Are you sure you need '&&' there?

No... This is why I'm asking. Sorry for asking then!
Prior to asking I did my homework by searching for examples on google.
However there were no good examples found.

When there are two parts of
some structure that should be compared and one is more important,
the model is usually

return important < rhs.important ||
(important == rhs.important && lessimportant < lessimportant);

Why would you compare "lessimportant < lessimportant"?

Proper less function, most likely.

Thanks I actually figured out that one on my own.

Maybe you could explain why set<> has a less function? Wouldn't it be more
natural to have an absolute compare to see if two items are unique or not.
Naturally it's an issue of speed but what if A is not less then B. It
doesn't tell anything about if A is equal or not to B. Maybe it has
something to do with how the less function is used?
E.g. set<int>: 1 is less then 2. If I insert 1 and then 1 I'd expect only 1
insert. If I would then insert 2 it would be more then 1. Can I then be
guarrantied of the order? What if 2 would be inserted before 1?

Thanks.

-- Henrik

Bo Persson · Jun 27, 2007

Henrik Goldman wrote:
:::: bool operator<(const CLicenseUser &rhs) const
:::
::: What's "CLicenseUser"?
:::
::
:: Sorry about that. It should have been CUser. I renamed some of the
:: names when posting.
::
:::: m_tDeny = 0;
:::
::: FAQ 10.6.
:::
::
:: Thank you ... I'll keep that in mind. However as the FAQ says:
::
:: "Note: There is no performance difference if the type of x_ is some
:: built-in/intrinsic type, such as int or char* or float. But even
:: in these cases, my personal preference is to set those data
:: members in the initialization list rather than via assignment for
:: consistency."
::
:: time_t is of type int so it won't give any performance benifits.
::
:::: bool operator<(const CDeniedUser &rhs) const
:::: {
:::: return (m_tDeny < rhs.m_tDeny && ((CLicenseUser) *this <
:::: (CLicenseUser) rhs));
:::
::: Are you sure this implementation provides strict weak ordering?
::: Are you sure you need '&&' there?
::
:: No... This is why I'm asking. Sorry for asking then!
:: Prior to asking I did my homework by searching for examples on
:: google. However there were no good examples found.
::
::: When there are two parts of
::: some structure that should be compared and one is more important,
::: the model is usually
:::
::: return important < rhs.important ||
::: (important == rhs.important && lessimportant <
::: lessimportant);
:::
::
:: Why would you compare "lessimportant < lessimportant"?

Thst would be

lessimportant < rhs.lessimportant

Think last name, first name.

::
::: Proper less function, most likely.
:::
::
:: Thanks I actually figured out that one on my own.
::
:: Maybe you could explain why set<> has a less function? Wouldn't it
:: be more natural to have an absolute compare to see if two items
:: are unique or not. Naturally it's an issue of speed but what if A
:: is not less then B. It doesn't tell anything about if A is equal
:: or not to B. Maybe it has something to do with how the less
:: function is used?

Yes. There can be an order between objects, whether they are unique or
not.

struct person
{
std::string last_name;
std::string first_name;
int age;
};

person x = {"Smith", "John", 32};
person y = {"Smith", "John", 47};

When sorted by name, x and y are identical. When sorted by age, they
are not.

Bo Persson

Victor Bazarov · Jun 27, 2007

Henrik said:
Sorry about that. It should have been CUser. I renamed some of the
names when posting.

Thank you ... I'll keep that in mind. However as the FAQ says:

"Note: There is no performance difference if the type of x_ is some
built-in/intrinsic type, such as int or char* or float. But even in
these cases, my personal preference is to set those data members in
the initialization list rather than via assignment for consistency."

time_t is of type int so it won't give any performance benifits.

Nowhere in the Standard is it guaranteed to be "of type int".

No... This is why I'm asking. Sorry for asking then!
Prior to asking I did my homework by searching for examples on google.
However there were no good examples found.

Why would you compare "lessimportant < lessimportant"?

If you cannot decide based on more inportant (if they are the same).

Thanks I actually figured out that one on my own.

Maybe you could explain why set<> has a less function?

The set<> does not have a less function. It requires the elements
to be sortable, and by default uses 'std::less' as the sorting
predicate. You're free to provide your own predicate.

Wouldn't it be
more natural to have an absolute compare to see if two items are
unique or not.

I don't know what you mean. What's "absolute compare"? How would
you define uniqueness? The Standard says that two keys (items for
'std::set') 'a' and 'b' are *not* unique iff (!(a < b) && !(b < a))
yields 'true'.

Naturally it's an issue of speed but what if A is not
less then B. It doesn't tell anything about if A is equal or not to
B. Maybe it has something to do with how the less function is used?

Maybe. I am not sure what you mean by "how the less function is used".

E.g. set<int>: 1 is less then 2. If I insert 1 and then 1 I'd expect
only 1 insert. If I would then insert 2 it would be more then 1. Can
I then be guarrantied of the order?

Of what order?

What if 2 would be inserted
before 1?

What's the meaning of "inserted before" here? The final order is not
1-2 but 2-1 or the insertion process first adds (inserts) 2 and then
1?

What book are you reading that doesn't explain all that, BTW?

V

Victor Bazarov · Jun 27, 2007

Bo said:
Thst would be

lessimportant < rhs.lessimportant

Ah, yes, good catch, both of you! I was too hasty and forgot to
type 'rhs.' there.

[..]

V

=?ISO-8859-1?Q?Erik_Wikstr=F6m?= · Jun 27, 2007

[snip]

Proper less function, most likely.

Click to expand...

Thanks I actually figured out that one on my own.

Maybe you could explain why set<> has a less function? Wouldn't it be more
natural to have an absolute compare to see if two items are unique or not.
Naturally it's an issue of speed but what if A is not less then B. It
doesn't tell anything about if A is equal or not to B. Maybe it has
something to do with how the less function is used?

Yes, since std::set cares primarily about the ordering of elements and
only then if two elements are equal. Remember that if a < b and b < a
are both false it must mean (mathematically) that a == b. So if you can
determine which of two objects is the lesser one then you can also
determine if they are equal (provided that the operator< gives a strict
ordering). The reverse is not true.

E.g. set<int>: 1 is less then 2. If I insert 1 and then 1 I'd expect only 1
insert. If I would then insert 2 it would be more then 1. Can I then be
guarrantied of the order? What if 2 would be inserted before 1?

std::set is ordered by the comparison function provided (by default
std::less) so if that's correctly implemented in your data-type you can
depend on the ordering in the set.

Merging a list of similar items	6	Jun 13, 2007
std::priority_queue: Abysmally bad performance versus std::set, can'tfigure out why.	5	Nov 30, 2012
surprising template size and speed	45	Dec 12, 2006
I'm tempted to quit out of frustration	1	Aug 13, 2023
Interning own classes like strings for speed and size?	11	Dec 27, 2010
generating unique set of dicts from a list of dicts	0	Jan 10, 2012
Efficient searching through stl map?	3	Apr 9, 2006
Weird Behavior with Rays in C and OpenGL	4	Feb 13, 2024

list vs. set speed

Henrik Goldman

Victor Bazarov

joe

Henrik Goldman

Henrik Goldman

Victor Bazarov

Henrik Goldman

Victor Bazarov

Henrik Goldman

Bo Persson

Victor Bazarov

Victor Bazarov

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads