list vs. set speed

H

Henrik Goldman

I have a collection of data consting of a time_t timestamp, a std::string
username and a hostname.

This information is read from a text file which can easily be up to 100 mb.
This means that there is a lot of records to be processed.

Until now I've been using a std::list but it seems to be too slow.
The reason why is that duplicates (e.g. same time and user and host) are not
allowed. So until now a linear search has been done to reject all duplicate
elements when inserting. This can take a lot of time though.
In order to speed it up I'm looking for good ideas. I was thinking about a
multimap which consist of time_t as first and username+hostname as second.
However it seems akward since the data is really one entity.

Maybe set is better? However it can only be better if the search speed is
faster.

Thanks.

-- Henrik
 
V

Victor Bazarov

Henrik said:
I have a collection of data consting of a time_t timestamp, a
std::string username and a hostname.

This information is read from a text file which can easily be up to
100 mb. This means that there is a lot of records to be processed.

Order of one million records?
Until now I've been using a std::list but it seems to be too slow.
The reason why is that duplicates (e.g. same time and user and host)
are not allowed. So until now a linear search has been done to reject
all duplicate elements when inserting. This can take a lot of time
though. In order to speed it up I'm looking for good ideas. I was thinking
about a multimap which consist of time_t as first and
username+hostname as second. However it seems akward since the data
is really one entity.
Maybe set is better? However it can only be better if the search
speed is faster.

With as significant number of elements as yours, the difference
between the linear search and logarithmic search becomes exreme.
However, I'd probably try a slightly different approach -- insert
all of them in a vector, then sort it, then remove duplicates by
using 'std::unique'.

Just like with any other performance problem, you won't know which
one is better until you actually measure.

V
 
J

joe

I have a collection of data consting of a time_t timestamp, a std::string
username and a hostname.

This information is read from a text file which can easily be up to 100 mb.
This means that there is a lot of records to be processed.

Until now I've been using a std::list but it seems to be too slow.
The reason why is that duplicates (e.g. same time and user and host) are not
allowed. So until now a linear search has been done to reject all duplicate
elements when inserting. This can take a lot of time though.
In order to speed it up I'm looking for good ideas. I was thinking about a
multimap which consist of time_t as first and username+hostname as second.
However it seems akward since the data is really one entity.

Maybe set is better? However it can only be better if the search speed is
faster.

Thanks.

-- Henrik

The set would get you O(log2(N)) search time, which is bunches better
than std::list for lot of data.

Do you know anything about the data? Is it totally random, or are
entries chronological? If they are, then you could use a vector<> to
store the data, std::lower_bound() to find the first record with the
correct time and a linear search from there to determine if the data
is already present. This may give you better performance than either
list or set.

joe
 
H

Henrik Goldman

The set would get you O(log2(N)) search time, which is bunches better
than std::list for lot of data.

Do you know anything about the data? Is it totally random, or are
entries chronological? If they are, then you could use a vector<> to
store the data, std::lower_bound() to find the first record with the
correct time and a linear search from there to determine if the data
is already present. This may give you better performance than either
list or set.

The records are not guarrantied to be in correct order. They should be
chronological. However there may be multiple input data sources involved so
that you get an overlapping dataset. E.g. input A might be newer or older
then B. For this reason it's important that the data is sorted in the
correct order.
I think I'll give it a try initially with the set and see how it goes.

Thanks.

-- Henrik
 
H

Henrik Goldman

This information is read from a text file which can easily be up to
Order of one million records?

That might be too much. Not all the lines contain information. However
several thousands of records is a sure bet.
With as significant number of elements as yours, the difference
between the linear search and logarithmic search becomes exreme.
However, I'd probably try a slightly different approach -- insert
all of them in a vector, then sort it, then remove duplicates by
using 'std::unique'.

Just like with any other performance problem, you won't know which
one is better until you actually measure.

Thanks for the suggestion. I don't think it will work out though... See my
other post.

-- Henrik
 
V

Victor Bazarov

Henrik said:
That might be too much. Not all the lines contain information. However
several thousands of records is a sure bet.


Thanks for the suggestion.
Suggestions.

I don't think it will work out though...

Really? Why?
See my other post.

Well, see my other posts on performance of containers, then.

V
 
H

Henrik Goldman

It doesn't seem to work fully as expected.

Here is the set class:

class CUser

{

bool operator<(const CLicenseUser &rhs) const

{

return (StrLower(m_strUsername) == StrLower(rhs.m_strUsername)) ?

(StrLower(m_strHostname) < StrLower(rhs.m_strHostname)) :

(StrLower(m_strUsername) < StrLower(rhs.m_strUsername));

}

string m_strUsername;

string m_strHostname;

};

class CDeniedUser : public CUser
{
public:
CDeniedUser()
{
m_tDeny = 0;
}

time_t m_tDeny;

bool operator<(const CDeniedUser &rhs) const
{
return (m_tDeny < rhs.m_tDeny && ((CLicenseUser) *this < (CLicenseUser)
rhs));
}
};

Unfortunatly it doesn't get all the items inserted:

set<CDeniedUser> S;

CDeniedUser U1;
U1.m_strHostname = "A";
U1.m_strUsername = "A";
U1.m_tDeny = 1;

bool b = S.insert(U1).second;

S.size();

b = S.insert(U1).second;

U1.m_tDeny = 0;

b = S.insert(U1).second;

S.size();

printf("%d\n", itFeature->m_UserSet.begin()->m_tDeny);



The problem here is that the item with tDeny = 0 doesn't get inserted.
However the less function should be ok.

What am I missing?


-- Henrik
 
V

Victor Bazarov

Henrik said:
It doesn't seem to work fully as expected.

Here is the set class:

class CUser

{

bool operator<(const CLicenseUser &rhs) const

What's "CLicenseUser"?
{

return (StrLower(m_strUsername) == StrLower(rhs.m_strUsername)) ?

(StrLower(m_strHostname) < StrLower(rhs.m_strHostname)) :

(StrLower(m_strUsername) < StrLower(rhs.m_strUsername));

}

string m_strUsername;

string m_strHostname;

};

class CDeniedUser : public CUser
{
public:
CDeniedUser()
{
m_tDeny = 0;

FAQ 10.6.
}

time_t m_tDeny;

bool operator<(const CDeniedUser &rhs) const
{
return (m_tDeny < rhs.m_tDeny && ((CLicenseUser) *this <
(CLicenseUser) rhs));

Are you sure this implementation provides strict weak ordering?
Are you sure you need '&&' there? When there are two parts of
some structure that should be compared and one is more important,
the model is usually

return important < rhs.important ||
(important == rhs.important && lessimportant < lessimportant);
}
};

Unfortunatly it doesn't get all the items inserted:

set<CDeniedUser> S;

CDeniedUser U1;
U1.m_strHostname = "A";
U1.m_strUsername = "A";
U1.m_tDeny = 1;

bool b = S.insert(U1).second;

S.size();

b = S.insert(U1).second;

U1.m_tDeny = 0;

b = S.insert(U1).second;

S.size();

printf("%d\n", itFeature->m_UserSet.begin()->m_tDeny);



The problem here is that the item with tDeny = 0 doesn't get inserted.
However the less function should be ok.

What am I missing?

Proper less function, most likely.

V
 
H

Henrik Goldman

What's "CLicenseUser"?

Sorry about that. It should have been CUser. I renamed some of the names
when posting.
FAQ 10.6.

Thank you ... I'll keep that in mind. However as the FAQ says:

"Note: There is no performance difference if the type of x_ is some
built-in/intrinsic type, such as int or char* or float. But even in these
cases, my personal preference is to set those data members in the
initialization list rather than via assignment for consistency."

time_t is of type int so it won't give any performance benifits.
Are you sure this implementation provides strict weak ordering?
Are you sure you need '&&' there?

No... This is why I'm asking. Sorry for asking then!
Prior to asking I did my homework by searching for examples on google.
However there were no good examples found.
When there are two parts of
some structure that should be compared and one is more important,
the model is usually

return important < rhs.important ||
(important == rhs.important && lessimportant < lessimportant);

Why would you compare "lessimportant < lessimportant"?
Proper less function, most likely.

Thanks I actually figured out that one on my own.

Maybe you could explain why set<> has a less function? Wouldn't it be more
natural to have an absolute compare to see if two items are unique or not.
Naturally it's an issue of speed but what if A is not less then B. It
doesn't tell anything about if A is equal or not to B. Maybe it has
something to do with how the less function is used?
E.g. set<int>: 1 is less then 2. If I insert 1 and then 1 I'd expect only 1
insert. If I would then insert 2 it would be more then 1. Can I then be
guarrantied of the order? What if 2 would be inserted before 1?

Thanks.

-- Henrik
 
B

Bo Persson

Henrik Goldman wrote:
:::: bool operator<(const CLicenseUser &rhs) const
:::
::: What's "CLicenseUser"?
:::
::
:: Sorry about that. It should have been CUser. I renamed some of the
:: names when posting.
::
:::: m_tDeny = 0;
:::
::: FAQ 10.6.
:::
::
:: Thank you ... I'll keep that in mind. However as the FAQ says:
::
:: "Note: There is no performance difference if the type of x_ is some
:: built-in/intrinsic type, such as int or char* or float. But even
:: in these cases, my personal preference is to set those data
:: members in the initialization list rather than via assignment for
:: consistency."
::
:: time_t is of type int so it won't give any performance benifits.
::
:::: bool operator<(const CDeniedUser &rhs) const
:::: {
:::: return (m_tDeny < rhs.m_tDeny && ((CLicenseUser) *this <
:::: (CLicenseUser) rhs));
:::
::: Are you sure this implementation provides strict weak ordering?
::: Are you sure you need '&&' there?
::
:: No... This is why I'm asking. Sorry for asking then!
:: Prior to asking I did my homework by searching for examples on
:: google. However there were no good examples found.
::
::: When there are two parts of
::: some structure that should be compared and one is more important,
::: the model is usually
:::
::: return important < rhs.important ||
::: (important == rhs.important && lessimportant <
::: lessimportant);
:::
::
:: Why would you compare "lessimportant < lessimportant"?

Thst would be

lessimportant < rhs.lessimportant

Think last name, first name.

::
::: Proper less function, most likely.
:::
::
:: Thanks I actually figured out that one on my own.
::
:: Maybe you could explain why set<> has a less function? Wouldn't it
:: be more natural to have an absolute compare to see if two items
:: are unique or not. Naturally it's an issue of speed but what if A
:: is not less then B. It doesn't tell anything about if A is equal
:: or not to B. Maybe it has something to do with how the less
:: function is used?

Yes. There can be an order between objects, whether they are unique or
not.

struct person
{
std::string last_name;
std::string first_name;
int age;
};

person x = {"Smith", "John", 32};
person y = {"Smith", "John", 47};

When sorted by name, x and y are identical. When sorted by age, they
are not.


Bo Persson
 
V

Victor Bazarov

Henrik said:
Sorry about that. It should have been CUser. I renamed some of the
names when posting.


Thank you ... I'll keep that in mind. However as the FAQ says:

"Note: There is no performance difference if the type of x_ is some
built-in/intrinsic type, such as int or char* or float. But even in
these cases, my personal preference is to set those data members in
the initialization list rather than via assignment for consistency."

time_t is of type int so it won't give any performance benifits.

Nowhere in the Standard is it guaranteed to be "of type int".
No... This is why I'm asking. Sorry for asking then!
Prior to asking I did my homework by searching for examples on google.
However there were no good examples found.


Why would you compare "lessimportant < lessimportant"?

If you cannot decide based on more inportant (if they are the same).
Thanks I actually figured out that one on my own.

Maybe you could explain why set<> has a less function?

The set<> does not have a less function. It requires the elements
to be sortable, and by default uses 'std::less' as the sorting
predicate. You're free to provide your own predicate.
Wouldn't it be
more natural to have an absolute compare to see if two items are
unique or not.

I don't know what you mean. What's "absolute compare"? How would
you define uniqueness? The Standard says that two keys (items for
'std::set') 'a' and 'b' are *not* unique iff (!(a < b) && !(b < a))
yields 'true'.
Naturally it's an issue of speed but what if A is not
less then B. It doesn't tell anything about if A is equal or not to
B. Maybe it has something to do with how the less function is used?

Maybe. I am not sure what you mean by "how the less function is used".
E.g. set<int>: 1 is less then 2. If I insert 1 and then 1 I'd expect
only 1 insert. If I would then insert 2 it would be more then 1. Can
I then be guarrantied of the order?

Of what order?
What if 2 would be inserted
before 1?

What's the meaning of "inserted before" here? The final order is not
1-2 but 2-1 or the insertion process first adds (inserts) 2 and then
1?

What book are you reading that doesn't explain all that, BTW?

V
 
?

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

[snip]
Proper less function, most likely.

Thanks I actually figured out that one on my own.

Maybe you could explain why set<> has a less function? Wouldn't it be more
natural to have an absolute compare to see if two items are unique or not.
Naturally it's an issue of speed but what if A is not less then B. It
doesn't tell anything about if A is equal or not to B. Maybe it has
something to do with how the less function is used?

Yes, since std::set cares primarily about the ordering of elements and
only then if two elements are equal. Remember that if a < b and b < a
are both false it must mean (mathematically) that a == b. So if you can
determine which of two objects is the lesser one then you can also
determine if they are equal (provided that the operator< gives a strict
ordering). The reverse is not true.
E.g. set<int>: 1 is less then 2. If I insert 1 and then 1 I'd expect only 1
insert. If I would then insert 2 it would be more then 1. Can I then be
guarrantied of the order? What if 2 would be inserted before 1?

std::set is ordered by the comparison function provided (by default
std::less) so if that's correctly implemented in your data-type you can
depend on the ordering in the set.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,077
Latest member
SangMoor21

Latest Threads

Top