Merging a list of similar items

H

Henrik Goldman

Hello,

I have a dataset which consist of a string username and string hostname as a
key and then an integer representing a count as the matching "second" value
in a pair.

So far I've used std::list to store this information by inserting objects
containing all 3 pieces of information. However now some users has datasets
in the range of 12000 records and my list seems to be taking like forever to
complete inserting and finding.
For this reason I'm looking for ways to speed this up.

Here is a bit of code from the existing solution (which works but is too
slow):

iterator Find(const string &strUsername, const string &strHostname)
{
iterator it;

for (it = begin(); it != end(); it++)
{
if (StrNoCaseCompare(it->m_strUsername, strUsername) &&
StrNoCaseCompare(it->m_strHostname, strHostname))
return it;
}

return end();
}

void Insert(const string &strUsername, const string &strHostname, int
nCount)
{
CDailyUsage DU;

iterator it = Find(strUsername, strHostname);

if (it == end())
{
DU.m_strUsername = strUsername;
DU.m_strHostname = strHostname;
DU.m_nCount = nCount;

PushBack(DU);
}
else
{
it->m_nCount += nCount;
}
}

I'm thinking about replacing it with map<pair<username,hostname>, ncount>
but I don't know how much this will give me.
Are anyone up for good ideas?

Thanks.

-- Henrik
 
G

Ge Chunyuan

hi:

stl::map is much more suitable for your requirement.

As you mentioned the total size of dataset is around 12000,
it is a very big number and usually the stl::list is implemented in
linkedlink.
It will travers from the begin to the end, this means the searching
time complexity is O(n).

But map is always implemented in RB tree, the complexity is O(lg(n))

It is much fast if the size is bigger
 
H

Henrik Goldman

But map is always implemented in RB tree, the complexity is O(lg(n))
It is much fast if the size is bigger

Do you know if the construction I wrote is valid?

E.g. how would I do inserts and lookups in a map with a pair as 'first'.
Does pair<> have a proper compare function?

-- Henrik
 
Z

Zeppe

Henrik said:
Do you know if the construction I wrote is valid?

E.g. how would I do inserts and lookups in a map with a pair as 'first'.
Does pair<> have a proper compare function?

No, it hasn't. Actually, you could write one for your problem, but at
this point it's better to wrap the information in a small class (or
struct, if you do not really need a class for this). For the map to
work, the operator< is needed. The simplest way is:

struct Account
{
std::string username;
std::string hostname;
bool operator<(const Account& a) const
{
return (username == a.username) ?
(hostname < hostname) :
(username < username);
}
};

and then...

std::map<Account, std::string> passwords;

will work fine.

Regards,

Zeppe
 
J

Jim Langston

Henrik Goldman said:
Hello,

I have a dataset which consist of a string username and string hostname as
a key and then an integer representing a count as the matching "second"
value in a pair.

So far I've used std::list to store this information by inserting objects
containing all 3 pieces of information. However now some users has
datasets in the range of 12000 records and my list seems to be taking like
forever to complete inserting and finding.
For this reason I'm looking for ways to speed this up.

Here is a bit of code from the existing solution (which works but is too
slow):

iterator Find(const string &strUsername, const string &strHostname)
{
iterator it;

for (it = begin(); it != end(); it++)
{
if (StrNoCaseCompare(it->m_strUsername, strUsername) &&
StrNoCaseCompare(it->m_strHostname, strHostname))
return it;
}

return end();
}

void Insert(const string &strUsername, const string &strHostname, int
nCount)
{
CDailyUsage DU;

iterator it = Find(strUsername, strHostname);

if (it == end())
{
DU.m_strUsername = strUsername;
DU.m_strHostname = strHostname;
DU.m_nCount = nCount;

PushBack(DU);
}
else
{
it->m_nCount += nCount;
}
}

I'm thinking about replacing it with map<pair<username,hostname>, ncount>
but I don't know how much this will give me.
Are anyone up for good ideas?

A std::map should give you a whole lot. A list is unindexed and for every
search you are iteratirng though every possible record. A map is indexed (I
believe they use binary trees or better) and searches are a lot faster, a
WHOLE lot faster. The difficulty comes in your strNoCase compare, but you
can get around that by making the key a class with an operator<..

The way you are using this you probably want a set anyway since your data is
in your class/structure.

Output of following program showing your type of functions for List, Map and
Set is:

*** List ***
Inserting group 0 100 unique entries to list: 94ns
Inserting group 1 100 unique entries to list: 266ns
Inserting group 2 100 unique entries to list: 343ns
Inserting group 3 100 unique entries to list: 469ns
Inserting group 4 100 unique entries to list: 609ns
Inserting group 5 100 unique entries to list: 719ns
Inserting group 6 100 unique entries to list: 860ns
Inserting group 7 100 unique entries to list: 1000ns
Inserting group 8 100 unique entries to list: 1171ns
Inserting group 9 100 unique entries to list: 1250ns
Search for non existant entry in List: 16ns
List size:1000

*** Map ***
Inserting group 0 100 unique entries to map: 47ns
Inserting group 1 100 unique entries to map: 62ns
Inserting group 2 100 unique entries to map: 63ns
Inserting group 3 100 unique entries to map: 78ns
Inserting group 4 100 unique entries to map: 78ns
Inserting group 5 100 unique entries to map: 94ns
Inserting group 6 100 unique entries to map: 94ns
Inserting group 7 100 unique entries to map: 94ns
Inserting group 8 100 unique entries to map: 78ns
Inserting group 9 100 unique entries to map: 94ns
Search for non existant entry in Map: 0.485ns
Map size:1000

*** Set ***
Inserting group 0 100 unique entries to set: 46ns
Inserting group 1 100 unique entries to set: 79ns
Inserting group 2 100 unique entries to set: 78ns
Inserting group 3 100 unique entries to set: 78ns
Inserting group 4 100 unique entries to set: 78ns
Inserting group 5 100 unique entries to set: 78ns
Inserting group 6 100 unique entries to set: 93ns
Inserting group 7 100 unique entries to set: 94ns
Inserting group 8 100 unique entries to set: 94ns
Inserting group 9 100 unique entries to set: 94ns
Search for non existant entry in Set: 0.531ns
Set size:1000

The main thing to notice is the search times. For your list it's a linear
search, takes longer and longer the more entries you have. For map and set,
they use indexes, so QUITE a bit faster. Here is program: (Please note
that followoing program is not optmized, I just tried to follow the way you
were doing things with maps and sets also).

#include <iostream>
#include <string>
#include <ctime>
#include <cstring>
#include <utility>
#include <sstream>

#include <list>
#include <map>
#include <set>

std::string Lowercase( const std::string& MixedString )
{
std::string ReturnValue = "";
for ( std::string::size_type i = 0; i < MixedString.size(); ++i )
ReturnValue += static_cast<char>( tolower( MixedString ) );
return ReturnValue;
}

bool StrNoCaseCompare( const std::string& Str1, const std::string& Str2 )
{
if ( Lowercase( Str1 ) == Lowercase( Str2 ) )
return true;

return false;
}


class UserInfo
{
public:
UserInfo( const std::string& Username, const std::string& Hostname ):
Username( Username ), Hostname( Hostname ) {}
std::string Username;
std::string Hostname;
int count; // Only used in set
};

bool operator<( const UserInfo& lhs, const UserInfo& rhs )
{
// case insensitve compare
// use your fastest method, this one just thrown together
std::string Userlhs = Lowercase( lhs.Username );
std::string Hostlhs = Lowercase( lhs.Hostname );
std::string Userrhs = Lowercase( rhs.Username );
std::string Hostrhs = Lowercase( rhs.Hostname );

if ( Userlhs == Userrhs )
if ( Hostlhs < Hostrhs )
return true;
else
return false;
else if ( Userlhs < Userrhs )
return true;
else
return false;
}


typedef std::list<std::pair<UserInfo, int> > DataList;
typedef std::map<UserInfo, int> DataMap;
typedef std::set<UserInfo> DataSet;

DataList::iterator FindInList(const std::string &strUsername, const
std::string &strHostname, DataList& Data)
{
DataList::iterator it;

for (DataList::iterator it = Data.begin(); it != Data.end(); ++it)
{
if (StrNoCaseCompare(it->first.Username, strUsername) &&
StrNoCaseCompare(it->first.Hostname, strHostname))
return it;
}

return Data.end();
}

void InsertToList(const std::string &strUsername, const std::string
&strHostname, int nCount, DataList& Data)
{
DataList::iterator it = FindInList(strUsername, strHostname, Data);

if (it == Data.end())
{
Data.push_back(std::make_pair( UserInfo( strUsername, strHostname ),
nCount ));
}
else
{
it->second += nCount;
}
}

DataMap::iterator FindInMap(const std::string &strUsername, const
std::string &strHostname, DataMap& Data)
{
return Data.find( UserInfo( strUsername, strHostname ));
}


void InsertToMap(const std::string &strUsername, const std::string
&strHostname, int nCount, DataMap& Data)
{
UserInfo Entry( strUsername, strHostname );
DataMap::iterator it = Data.find( Entry );

if (it == Data.end())
{
Data.insert(std::make_pair( Entry, nCount ));
}
else
{
it->second += nCount;
}
}

DataSet::iterator FindInSet(const std::string &strUsername, const
std::string &strHostname, DataSet& Data)
{
return Data.find( UserInfo( strUsername, strHostname ));
}

void InsertToSet(const std::string &strUsername, const std::string
&strHostname, int nCount, DataSet& Data)
{
UserInfo Entry( strUsername, strHostname );
Entry.count = nCount;

DataSet::iterator it = Data.find( Entry );

if (it == Data.end())
{
Data.insert(Entry);
}
else
{
it->count += nCount;
}
}

std::string StrmConvert( int Number )
{
std::stringstream Temp;
Temp << Number;
std::string Output;
Temp >> Output;
return Output;
}


int main ()
{
const int Entries = 100;

DataList ListData;
DataMap MapData;
DataSet SetData;

clock_t Start;
clock_t End;

// Insert to list
std::cout << "*** List ***\n";
for ( int Group = 0; Group < 10; ++Group )
{
std::string GroupPrefix = StrmConvert( Group );
Start = clock();
for ( int i = 0; i < Entries; ++i )
{
InsertToList( GroupPrefix + StrmConvert( i ), GroupPrefix +
StrmConvert( i ), i, ListData );
}
End = clock();
std::cout << "Inserting group " << Group << " " << Entries << "
unique entries to list: " << End - Start << "ns" << "\n";
}
Start = clock();
FindInList( "Bogus", "Bogus", ListData );
End = clock();
std::cout << "Search for non existant entry in List: " << End - Start <<
"ns\n";

std::cout << "List size:" << ListData.size() << "\n";

// Insert to map
std::cout << "\n*** Map ***\n";
for ( int Group = 0; Group < 10; ++Group )
{
std::string GroupPrefix = StrmConvert( Group );
Start = clock();
for ( int i = 0; i < Entries; ++i )
{
InsertToMap( GroupPrefix + StrmConvert( i ), GroupPrefix +
StrmConvert( i ), i, MapData );
}
End = clock();
std::cout << "Inserting group " << Group << " " << Entries << "
unique entries to map: " << End - Start << "ns" << "\n";
}
Start = clock();
// Map find is so fast one search reports 0ns
for ( int i = 0; i < 1000; ++i )
FindInMap( "Bogus", "Bogus", MapData );
End = clock();
std::cout << "Search for non existant entry in Map: " <<
static_cast<double>( End - Start ) / 1000.0 << "ns\n";
std::cout << "Map size:" << MapData.size() << "\n";

// Insert to set
std::cout << "\n*** Set ***\n";
for ( int Group = 0; Group < 10; ++Group )
{
std::string GroupPrefix = StrmConvert( Group );
Start = clock();
for ( int i = 0; i < Entries; ++i )
{
InsertToSet( GroupPrefix + StrmConvert( i ), GroupPrefix +
StrmConvert( i ), i, SetData );
}
End = clock();
std::cout << "Inserting group " << Group << " " << Entries << "
unique entries to set: " << End - Start << "ns" << "\n";
}
Start = clock();
// Set find is so fast one search reports 0ns
for ( int i = 0; i < 1000; ++i )
FindInSet( "Bogus", "Bogus", SetData );
End = clock();
std::cout << "Search for non existant entry in Set: " <<
static_cast<double>( End - Start ) / 1000.0 << "ns\n";

std::cout << "Set size:" << SetData.size() << "\n";

}
 
H

Henrik Goldman

and then...

std::map<Account, std::string> passwords;

will work fine.

Thanks alot. After this change it seems only to take less than a second now.
So this certainly made a great impact.

-- Henrik
 
M

Mark

Zeppe said:
No, it hasn't. Actually, you could write one for your problem, but at
this point it's better to wrap the information in a small class (or
struct, if you do not really need a class for this). For the map to
work, the operator< is needed. The simplest way is:

struct Account
{
std::string username;
std::string hostname;
bool operator<(const Account& a) const
{
return (username == a.username) ?
(hostname < hostname) :
(username < username);
}
};

and then...

std::map<Account, std::string> passwords;

will work fine.

Regards,

Zeppe

Or you could also use boost::tuples, then you wouldn't need to define any
struct Account to define the operator< The tuples come with their own
comparisons if you include boost/tuple/tuple_comparison.hpp, then the
map will be able to insert with its operator< like this

#include <iostream>
#include <map>
#include "boost/tuple/tuple.hpp"
#include "boost/tuple/tuple_comparison.hpp"

using boost::tuple;
using boost::make_tuple;
using std::map;
using std::make_pair;
using std::string;
using std::cout;
using std::endl;
typedef map< tuple<string, string>, string > PASSMAP;
typedef PASSMAP::iterator PASSMAP_ITER;

int main()
{
map< tuple<string, string>, string > passwords;

passwords.insert(make_pair(make_tuple("host1", "user1"), "user3_password"));
passwords.insert(make_pair(make_tuple("host3", "userz"), "user2_password"));
passwords.insert(make_pair(make_tuple("host3", "usera"), "user3_password"));

for (PASSMAP_ITER pmi = passwords.begin();
pmi != passwords.end(); ++pmi)
{
cout << "host " << boost::tuples::get<0>((*pmi).first) << endl;
cout << "user " << boost::tuples::get<1>((*pmi).first) << endl;
cout << "passwd " << (*pmi).second << endl;
}

/* output is sorted by first tuple, then second tuple
host host1
user user1
passwd user3_password
host host3
user usera
passwd user3_password
host host3
user userz
passwd user2_password
*/
return 0;
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top