STL map or hash map using struct as data and find it

kl · Dec 31, 2007

Hi,

I'm trying to learn some STL using map or hash_map would maybe even
better to use but I don't really know how to find a specific struct
depending on a DWORD value that is in range of two DWORD values (see
below for more).

So what I trying to achieve is creating a look up table of a IP adress
with a start adress (a DWORD value) and a end adress (a DWORD
value) which I would like to look up to the get the Country name
using a specific IP adress (also in DWORD) which is random access to
the table.

It is easy to use vector, linked list but I would like to try to use
map or hash map for more effecient way to look up or at least test it
out to compare some.

The code is basicly looks like this:

//Struct definition
struct IPCOUNTRY
{
DWORD startIP; //IP start adress in DWORD which is always unique
DWORD endIP; //IP end adress in DWORD which is always unique
char COUNTRYNAME[45]; //Country name like England, Germany Spain
etc...
};

typedef map <DWORD, IPCOUNTRY> mIPC;
mIPC mIPcountry;

//Initilize the map when and call this function during start up
void initilizeMap()
{

//read data from file and insert it into a map
IPCOUNTRY g_structIP;
FILE *fp=NULL;
fp=fopen("ipcountry.dat", "rb");

if(fp!=NULL)
{
while( !feof( fp ) )
{
if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
{

mIPcountry[g_structIP.startIP] = g_structIP;
}
}
}
fclose(fp);
}

struct compareIPC
{
bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* icp2) const
{

return strcmp(ipc1, ipc2) < 0;
}
};

//This function will be called with IPaddress set and set the
countryname depending which country it belongs
//to using the map look up table
int lookupCountryName(DWORD IPadress, char *countryname)
{
//ok here I'm lost what I should exactly do

mIPC::iterator mIPCbegin = mIPcountry.begin();
mIPC::iterator mIPCend = mIPcountry.end();

return 0;
}

kl · Dec 31, 2007

Sorry I accidantly clicked on the send button.

Hi,

I'm trying to learn some STL using map or hash_map would maybe even
better to use but I don't really know how to find a specific struct
depending on a DWORD value that is in range of two DWORD values (see
below for more).

So what I trying to achieve is creating a look up table of a IP adress
with a start adress (a DWORD value) and a end adress (a DWORD
value) which I would like to look up to the get the Country name
using a specific IP adress (also in DWORD) which is random access to
the table.

It is easy to use vector, linked list but I would like to try to use
map or hash map for more effecient way to look up or at least test it
out to compare some.

The code is basicly looks like this:

//Struct definition
struct IPCOUNTRY
{
DWORD startIP; //IP start adress in DWORD which is always unique
DWORD endIP; //IP end adress in DWORD which is always unique
char COUNTRYNAME[45]; //Country name like England, Germany Spain
etc...

};

typedef map <DWORD, IPCOUNTRY> mIPC;
mIPC mIPcountry;

//Initilize the map when and call this function during start up
void initilizeMap()
{

//read data from file and insert it into a map
IPCOUNTRY g_structIP;
FILE *fp=NULL;
fp=fopen("ipcountry.dat", "rb");

if(fp!=NULL)
{
while( !feof( fp ) )
{
if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
{

mIPcountry[g_structIP.startIP] = g_structIP;
}
}
}
fclose(fp);

}

Here is a more correct comapare function:

struct compareIPC
{
bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* ipc2) const
{
if((ipc2.startIP>=ipc1.startIP) && (ipc2.startIP <=
ipc1.endIP))
return true;
return false;
}

};

//This function will be called with IPaddress set and set the
// countryname depending which country it belongs
//to using the map look up table
int lookupCountryName(DWORD IPadress, char *countryname)
{
//ok here I'm lost what I should exactly do
IPCOUNTRY tempIPC;

tempIPC.startIP = IPadress;
tempIPC.endIP = IPadress;

mIPC::iterator mIPCbegin = mIPcountry.begin();
mIPC::iterator mIPCend = mIPcountry.end();

//I would do something like this but also include the tempIPC to
find the correct item

mIPC::iterator iterResult = find_if(mIPCbegin,
mIPCend,insideiprange);

//code to get the result and set the country name

return 0;
}

I tried my best to explain what I'm trying to do and
hopefully some one can shed some light of this.

regards
Kl

Jim Langston · Dec 31, 2007

kl said:
Hi,

I'm trying to learn some STL using map or hash_map would maybe even
better to use but I don't really know how to find a specific struct
depending on a DWORD value that is in range of two DWORD values (see
below for more).

So what I trying to achieve is creating a look up table of a IP adress
with a start adress (a DWORD value) and a end adress (a DWORD
value) which I would like to look up to the get the Country name
using a specific IP adress (also in DWORD) which is random access to
the table.

If I'm understanding you correctly, you want to find a range of elemets in a
map between an upper bound and a lower bound value. Luckily, std::map has
lower_bound() and upper_bound() which both return an iteratore.
lower_bound() returning an iterator to the lowest key that is equal or
greater than the value, upper_bound() returning an iterator to the highest
key equal or less than the value.

I do believe than an IP4 address has it's MSB as the left most of the IP
address so this should work fine for a range of IP addresses.

Of course you'll need to compare the iterators for not found (compare to
..end()).

James Kanze · Dec 31, 2007

I'm trying to learn some STL using map or hash_map would maybe
even better to use but I don't really know how to find a
specific struct depending on a DWORD value that is in range of
two DWORD values (see below for more).

A hash map generally cannot be made to work with a range. In
the case of std::map, it will take a somewhat special ordering
function, but I think it can be made to work. I'd still tend to
favor a sorted std::vector, using lower_bound for the look-up.

So what I trying to achieve is creating a look up table of a
IP adress with a start adress (a DWORD value) and a end
adress (a DWORD value) which I would like to look up to the
get the Country name using a specific IP adress (also in
DWORD) which is random access to the table.

Just a guess, but I would imagine that lookups outnumber
insertions by a very significant margin. That most likely, in
fact, that table is initialized once from a file, before first
use, and never modified afterwards.

In that case, std::vector and lower_bound are definitely the way
to go. In fact, I'd recommend sorting the initialization file
once and for all, and just asserting on the order while reading
it.

It is easy to use vector, linked list but I would like to try
to use map or hash map for more effecient way to look up or at
least test it out to compare some.

A sorted vector and lower_bound is likely to be faster than
std::map, since it will make roughly the same number of
comparisons, and will have a lot better locality. If speed is
really an issue, of course, the solution in this particular case
is probably to use a trie starting with the high order byte. My
guess is that a single table lookup (using direct indexation)
will suffice 99% of the time, or more.

The code is basicly looks like this:

//Struct definition
struct IPCOUNTRY
{
DWORD startIP; //IP start adress in DWORD which is always unique
DWORD endIP; //IP end adress in DWORD which is always unique
char COUNTRYNAME[45]; //Country name like England, Germany Spain etc...
};

Unless you need static initialization, you should probably use
std::string instead of char[45]. More importantly, you should
generally avoid all caps except for macros, and probably use
more or less standard types, like uint32_t, instead of some
typedef to an unknown type, like DWORD (which a priori I would
expect to be a 64 bit type).

typedef map <DWORD, IPCOUNTRY> mIPC;
mIPC mIPcountry;

That doesn't look right to me. You're not mapping a single IP
address to a country. mIPC.lower_bound will still work, but
only if you assume a dense allocation, with no holes. I'd go
for something like:

typedef std::set< IpCountry, IpCountryCmp >
Map ;
Map myIpCountryMap ;

Defining IpCountryCmp is not going to be that simple, however.

//Initilize the map when and call this function during start up
void initilizeMap()
{
//read data from file and insert it into a map
IPCOUNTRY g_structIP;
FILE *fp=NULL;
fp=fopen("ipcountry.dat", "rb");

if(fp!=NULL)
{
while( !feof( fp ) )
{
if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
{
mIPcountry[g_structIP.startIP] = g_structIP;
}
}
}
fclose(fp);
}

The above code will not work at all, for several reasons.

First, of course, you haven't told us the format of the file
you're reading, so it's difficult to say how you really should
read it, but fread is only a solution if you're reading the data
into a buffer of raw bytes, which you then parse manually.
(Most likely, you're dealing with a file in CSV format, unless
you're reading directly from a data base connection. So some
parsing will probably be necessary anyway.)

Secondly, feof only tells you whether the last read encountered
EOF or not; if there was some other type of error, it will
remain false. Forever.

And fclose on a null pointer is undefined behavior.

The classical solution for this problem would be to write an
operator>> for IPCOUNTRY, and use it on an ifstream. Using the
FILE* functions of C here is a sure loser.

struct compareIPC
{
bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* icp2) const
{
return strcmp(ipc1, ipc2) < 0;
}
};

strcmp on something which isn't a string (a null terminated
sequence of char) is a sure recepe for disaster.

//This function will be called with IPaddress set and set the
countryname depending which country it belongs
//to using the map look up table
int lookupCountryName(DWORD IPadress, char *countryname)
{
//ok here I'm lost what I should exactly do

mIPC::iterator mIPCbegin = mIPcountry.begin();
mIPC::iterator mIPCend = mIPcountry.end();

return 0;
}

If you're using std::set, you'll need to construct a "key", i.e.
an IPCOUNTRY, such that the comparison function will work
correctly. If you do this, I'd probably use something like:

class IpCountryEntry
{
public:
//! \post
//! country() == ""
//! start() == 0
//! end() == 0
IpCountryEntry() ;
//! \post
//! country() == country
//! start() == start
//! end() == end + 1
//! for half open range...
IpCountryEntry( uint32_t start,
uint32_t end,
std::string const& country ) ;
//! \post
//! country() == ""
//! start() == ipAddress
//! end() == ipAddress
IpCountryEntry( uint32_t ipAddress ) ;

std::string country() const ;
uint32_t start() const ;
uint32_t end() const ;

bool operator<(
IpCountryEntry const& other ) const ;

// ...
} ;

The trick would be to ensure that operator< always returned
false if one of the ranges was completely included in the other
(and probably raised an exception if the ranges overlapped).
You could then use the find() function of std::set.

But as I said, a sorted std::vector and std::lower_bound would
probably be more appropriate. Note that the comparison function
of lower_bound is a template argument, and the value argument's
type is also separate from the value_type of the iterator, so
you can invoke something like:

std::lower_bound( v.begin(), v.end(),
ipAddress,
CmpIpEntry() ) ;

where CmpIpEntry is something like:

struct CmpIpEntry
{
bool operator( IpCountryEntry const& lhs,
IpCountryEntry const& rhs ) const ;
bool operator( IpCountryEntry const& lhs,
uint32_t rhs ) const ;
bool operator( uint32_t lhs,
IpCountryEntry const& rhs ) const ;
} ;

(Formally, at least according to the latest draft, the third
function above should never be called. But it's trivial to
implement, just in case.)

kl · Dec 31, 2007

A hash map generally cannot be made to work with a range. In
the case of std::map, it will take a somewhat special ordering
function, but I think it can be made to work. I'd still tend to
favor a sorted std::vector, using lower_bound for the look-up.

Intresting, maybe this approach I should go with.

I need to read on a bit on the lower_bound before I can comment if it
would work.

Just a guess, but I would imagine that lookups outnumber
insertions by a very significant margin. That most likely, in
fact, that table is initialized once from a file, before first
use, and never modified afterwards.

Correct.

In that case, std::vector and lower_bound are definitely the way
to go. In fact, I'd recommend sorting the initialization file
once and for all, and just asserting on the order while reading
it.

ok, seems like the way to go with vectors.

A sorted vector and lower_bound is likely to be faster than
std::map, since it will make roughly the same number of
comparisons, and will have a lot better locality. If speed is
really an issue, of course, the solution in this particular case
is probably to use a trie starting with the high order byte. My
guess is that a single table lookup (using direct indexation)
will suffice 99% of the time, or more.

Why I would like to have some speed up is because the file has around
80'000 lines

The code is basicly looks like this:
//Struct definition
struct IPCOUNTRY
{
DWORD startIP; //IP start adress in DWORD which is always unique
DWORD endIP; //IP end adress in DWORD which is always unique
char COUNTRYNAME[45]; //Country name like England, Germany Spain etc...
};

Click to expand...

Unless you need static initialization, you should probably use
std::string instead of char[45]. More importantly, you should
generally avoid all caps except for macros, and probably use
more or less standard types, like uint32_t, instead of some
typedef to an unknown type, like DWORD (which a priori I would
expect to be a 64 bit type).

I'm aware of this defines and working on cleaning my app to use string
instead of array of char's same goes with DWORD. If I just did it from
the beginning it would save me alot of time but me decision was to
keep a small memory footprint of the application but today I will
never think of doing this.
Stability and portability goes before small usage of memory & gaining
speed especially on todays computers somthing I have learned from the
current code.

typedef map <DWORD, IPCOUNTRY> mIPC;
mIPC mIPcountry;

Click to expand...

That doesn't look right to me. You're not mapping a single IP
address to a country. mIPC.lower_bound will still work, but
only if you assume a dense allocation, with no holes. I'd go
for something like:

typedef std::set< IpCountry, IpCountryCmp >
Map ;
Map myIpCountryMap ;

Defining IpCountryCmp is not going to be that simple, however.

//Initilize the map when and call this function during start up
void initilizeMap()
{
//read data from file and insert it into a map
IPCOUNTRY g_structIP;
FILE *fp=NULL;
fp=fopen("ipcountry.dat", "rb");
if(fp!=NULL)
{
while( !feof( fp ) )
{
if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
{
mIPcountry[g_structIP.startIP] = g_structIP;
}
}
}
fclose(fp);
}

Click to expand...

The above code will not work at all, for several reasons.

First, of course, you haven't told us the format of the file
you're reading, so it's difficult to say how you really should
read it, but fread is only a solution if you're reading the data
into a buffer of raw bytes, which you then parse manually.
(Most likely, you're dealing with a file in CSV format, unless
you're reading directly from a data base connection. So some
parsing will probably be necessary anyway.)

Well basicly the format of file is the IPCOUNTRY struct.

I have created a seperate converter function from CSV file format to
STRUCT format (this is never done on the retail application) this is
also have to do with speed when reading from the file.
Maybe not future proof and user friendly but imagine if everytime you
start up an application and it gonna parse 80'000+ lines of column
seperated lines with strtok or similar... maybe not a big deal with
todays computer. I just thought this was a neat and lightweight
solution at that point of time.

Secondly, feof only tells you whether the last read encountered
EOF or not; if there was some other type of error, it will
remain false. Forever.

And fclose on a null pointer is undefined behavior.

Sorry for the typo... it is suppose to be inside the if(fp!=NULL)
{..}.

The classical solution for this problem would be to write an
operator>> for IPCOUNTRY, and use it on an ifstream. Using the
FILE* functions of C here is a sure loser.

strcmp on something which isn't a string (a null terminated
sequence of char) is a sure recepe for disaster.

This piece of code was a mistake and post accidantly.

If you're using std::set, you'll need to construct a "key", i.e.
an IPCOUNTRY, such that the comparison function will work
correctly. If you do this, I'd probably use something like:

class IpCountryEntry
{
public:
//! \post
//! country() == ""
//! start() == 0
//! end() == 0
IpCountryEntry() ;
//! \post
//! country() == country
//! start() == start
//! end() == end + 1
//! for half open range...
IpCountryEntry( uint32_t start,
uint32_t end,
std::string const& country ) ;
//! \post
//! country() == ""
//! start() == ipAddress
//! end() == ipAddress
IpCountryEntry( uint32_t ipAddress ) ;

std::string country() const ;
uint32_t start() const ;
uint32_t end() const ;

bool operator<(
IpCountryEntry const& other ) const ;

// ...
} ;

The trick would be to ensure that operator< always returned
false if one of the ranges was completely included in the other
(and probably raised an exception if the ranges overlapped).
You could then use the find() function of std::set.

But as I said, a sorted std::vector and std::lower_bound would
probably be more appropriate. Note that the comparison function
of lower_bound is a template argument, and the value argument's
type is also separate from the value_type of the iterator, so
you can invoke something like:

std::lower_bound( v.begin(), v.end(),
ipAddress,
CmpIpEntry() ) ;

where CmpIpEntry is something like:

struct CmpIpEntry
{
bool operator( IpCountryEntry const& lhs,
IpCountryEntry const& rhs ) const ;
bool operator( IpCountryEntry const& lhs,
uint32_t rhs ) const ;
bool operator( uint32_t lhs,
IpCountryEntry const& rhs ) const ;
} ;

(Formally, at least according to the latest draft, the third
function above should never be called. But it's trivial to
implement, just in case.)

I will look into this directly.

Great info James and thanks for your feedback much appreciated!

James Kanze · Dec 31, 2007

[...]

Why I would like to have some speed up is because the file has
around 80'000 lines

But how often are you doing a look-up. Still, 80000 lines is
probably enough to want to use something more efficient than a
linear search.

[...]

//Initilize the map when and call this function during start up
void initilizeMap()
{
//read data from file and insert it into a map
IPCOUNTRY g_structIP;
FILE *fp=NULL;
fp=fopen("ipcountry.dat", "rb");
if(fp!=NULL)
{
while( !feof( fp ) )
{
if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
{
mIPcountry[g_structIP.startIP] = g_structIP;
}
}
}
fclose(fp);
}

Click to expand...

The above code will not work at all, for several reasons.
First, of course, you haven't told us the format of the file
you're reading, so it's difficult to say how you really should
read it, but fread is only a solution if you're reading the data
into a buffer of raw bytes, which you then parse manually.
(Most likely, you're dealing with a file in CSV format, unless
you're reading directly from a data base connection. So some
parsing will probably be necessary anyway.)

Click to expand...

Well basicly the format of file is the IPCOUNTRY struct.

Which is? The format of the IPCOUNTRY struct will vary
enormously between compilers, even on the same platform, and can
change from one release of the compiler to the next, or even
with different compiler options. When you say that the format
is basically the same as that of the IPCOUNTRY struct, you're
basically saying that you don't know the format, and that you
don't care, because you're not going to ever reread the data at
a later date, when the program may have been recompiled with a
more recent version of the compiler, or with different options.

I have created a seperate converter function from CSV file
format to STRUCT format (this is never done on the retail
application) this is also have to do with speed when reading
from the file.

As long as you ensure that both this separate program and your
application are compiled with the same version of the compiler,
using the same options. It still sounds error prone to me (but
I can imagine cases where it is justified).

Maybe not future proof and user friendly but imagine if
everytime you start up an application and it gonna parse
80'000+ lines of column seperated lines with strtok or
similar... maybe not a big deal with todays computer. I just
thought this was a neat and lightweight solution at that point
of time.

Do you know of many programs that don't parse a couple of
thousand lines on start up?

I guess it depends on the context. I work on large scale
servers. They're started at most once a day---most of the
projects I've worked on are started once every three or four
years, or more. Whether the start up lasts 1 second, or 5 or
even 30, isn't a big deal.

Still, if you have to parse it, you have to parse it. Doing so
in a separate process is going to take (slightly) more time than
doing so in your own process. What might be useful is if you
only have to update the data (from the CVS file) once a week or
less, but you restart the individual application a lot. In that
case, I might use a memory dump with a header containing the
last update (from the CVS file) date and a checksum based on the
compiler/compiler options. (For the latter, on a Unix system,
I'd simply use the output of /dev/random to initialize an array
of about 16 bytes, regenerating the data each time I relinked.
So anytime I install a newly relinked version of the program, it
will go to the CVS data the first time.) And I'd still put the
parser for the CVS data in the same binary image, with something
like:

if ( binaryDumpPresent() && binaryDumpOKToUse() ) {
initializeFromBinaryDump() ;
} else {
initializeFromCVSFile() ;
createBinaryDumpFromData() ;
}

In this case, of course:

-- The actual data elements should be POD's, with only basic
types (integral types, perhaps floating point types as well,
although there's no need here, but no classes or pointers).

-- If you're using a sorted vector, you can do this with a
single read/write, no need to read or write each element
separately. (In fact, I'd use two reads/writes: one for the
header, and one for the rest of the data.)

Depending on portability constraints, I might even use mmap (or
its equivalent under Windows); I've done this once (with a file
that took something like five minutes to parse on my old Sun
Sparc, resulted in a memory image of something like 10-15
Megabytes, and in which in 90% of the runs, I never accessed
beyond the first 128 entries---out of more than a
million---anyway).

Although formally not guaranteed... If you define the header
something like:

struct IpCountryMapHeader
{
unsigned char magic[ 16 ] ;
time_t saved ;
size_t entryCount ;
} ;

and each element as:

struct IpCountryMapEntry
{
uint32_t start ;
uint32_t end ;
char countryName[ maxCountryNameLength + 1 ] ;
} ;

will doubtlessly work. (In theory, you could run afoul of
alignment considerations if both size_t and time_t are smaller
than uint32_t. In practice, if size_t is less than 32 bits,
your table won't fit into memory, and time_t will never be less
than 32 bits anyway.)

Under Unix, a simple script like:

(
echo 'unsigned char referenceMagic[] = {' ;
od -tx1 -N 16 /dev/random |
tr '[:lower:]' '[:upper:]' |
sed -e '2,$d'
-e 's:^[0-9A-Z]* ::'
-e 's:[0-9A-F][0-9A-F]:0x&,:g' ;
echo '} ;'
) > referenceMagic.cc

in your makefile can be used to regenerate the magic every time
you relink; under Windows, you may have to write a simple C/C++
program to do something similar, using some sort of hash value
of the machines IP address, your current process id (a hex dump
of the HANDLE?), the current time, etc., to seed a quality
random generator, and output from it. (I use scripts like this
all the time in my makefiles, but in this case, if the OS
doesn't support /dev/random, or something similar, the script
won't work, even if you have a full Unix tool kit installed.)

(Also, FWIW: most of the tools which generate your CSV file will
generate closed intervals. I'd convert this to the classical
Unix half open interval when I read the file, and use that from
then on.)

Note too that std::lower_bound will work fine on a C style
array, either loaded with ifstream::read() (on a file opened in
binary mode) or mmapped. Just call it with startAddress,
startAddress + header.entryCount as the first two arguments.

kl · Dec 31, 2007

[...]

Why I would like to have some speed up is because the file has
around 80'000 lines

Click to expand...

But how often are you doing a look-up. Still, 80000 lines is
probably enough to want to use something more efficient than a
linear search.

It can vary a lot, from 200 to almost 20000 times accessing the look
up table within seconds (depending of your network connection/
timeouts) and also
done using multithreading which can be 64-128 threads depending of
settings.
It has to do as my software scans through different servers (actually
game servers) and ask for some info and the lookup table in this case
translatets the game server ip address to a country (for filtering
purpose).

[...]

//Initilize the map when and call this function during start up
void initilizeMap()
{
//read data from file and insert it into a map
IPCOUNTRY g_structIP;
FILE *fp=NULL;
fp=fopen("ipcountry.dat", "rb");
if(fp!=NULL)
{
while( !feof( fp ) )
{
if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
{
mIPcountry[g_structIP.startIP] = g_structIP;
}
}
}
fclose(fp);
}
The above code will not work at all, for several reasons.
First, of course, you haven't told us the format of the file
you're reading, so it's difficult to say how you really should
read it, but fread is only a solution if you're reading the data
into a buffer of raw bytes, which you then parse manually.
(Most likely, you're dealing with a file in CSV format, unless
you're reading directly from a data base connection. So some
parsing will probably be necessary anyway.)

Click to expand...

Well basicly the format of file is the IPCOUNTRY struct.

Click to expand...

Which is? The format of the IPCOUNTRY struct will vary
enormously between compilers, even on the same platform, and can
change from one release of the compiler to the next, or even
with different compiler options. When you say that the format
is basically the same as that of the IPCOUNTRY struct, you're
basically saying that you don't know the format, and that you
don't care, because you're not going to ever reread the data at
a later date, when the program may have been recompiled with a
more recent version of the compiler, or with different options.

I have created a seperate converter function from CSV file
format to STRUCT format (this is never done on the retail
application) this is also have to do with speed when reading
from the file.

Click to expand...

As long as you ensure that both this separate program and your
application are compiled with the same version of the compiler,
using the same options. It still sounds error prone to me (but
I can imagine cases where it is justified).

Maybe not future proof and user friendly but imagine if
everytime you start up an application and it gonna parse
80'000+ lines of column seperated lines with strtok or
similar... maybe not a big deal with todays computer. I just
thought this was a neat and lightweight solution at that point
of time.

Click to expand...

Do you know of many programs that don't parse a couple of
thousand lines on start up?

I guess it depends on the context. I work on large scale
servers. They're started at most once a day---most of the
projects I've worked on are started once every three or four
years, or more. Whether the start up lasts 1 second, or 5 or
even 30, isn't a big deal.

Still, if you have to parse it, you have to parse it. Doing so
in a separate process is going to take (slightly) more time than
doing so in your own process. What might be useful is if you
only have to update the data (from the CVS file) once a week or
less, but you restart the individual application a lot. In that
case, I might use a memory dump with a header containing the
last update (from the CVS file) date and a checksum based on the
compiler/compiler options. (For the latter, on a Unix system,
I'd simply use the output of /dev/random to initialize an array
of about 16 bytes, regenerating the data each time I relinked.
So anytime I install a newly relinked version of the program, it
will go to the CVS data the first time.) And I'd still put the
parser for the CVS data in the same binary image, with something
like:

if ( binaryDumpPresent() && binaryDumpOKToUse() ) {
initializeFromBinaryDump() ;
} else {
initializeFromCVSFile() ;
createBinaryDumpFromData() ;
}

In this case, of course:

-- The actual data elements should be POD's, with only basic
types (integral types, perhaps floating point types as well,
although there's no need here, but no classes or pointers).

-- If you're using a sorted vector, you can do this with a
single read/write, no need to read or write each element
separately. (In fact, I'd use two reads/writes: one for the
header, and one for the rest of the data.)

Depending on portability constraints, I might even use mmap (or
its equivalent under Windows); I've done this once (with a file
that took something like five minutes to parse on my old Sun
Sparc, resulted in a memory image of something like 10-15
Megabytes, and in which in 90% of the runs, I never accessed
beyond the first 128 entries---out of more than a
million---anyway).

Although formally not guaranteed... If you define the header
something like:

struct IpCountryMapHeader
{
unsigned char magic[ 16 ] ;
time_t saved ;
size_t entryCount ;
} ;

and each element as:

struct IpCountryMapEntry
{
uint32_t start ;
uint32_t end ;
char countryName[ maxCountryNameLength + 1 ] ;
} ;

will doubtlessly work. (In theory, you could run afoul of
alignment considerations if both size_t and time_t are smaller
than uint32_t. In practice, if size_t is less than 32 bits,
your table won't fit into memory, and time_t will never be less
than 32 bits anyway.)

Under Unix, a simple script like:

(
echo 'unsigned char referenceMagic[] = {' ;
od -tx1 -N 16 /dev/random |
tr '[:lower:]' '[:upper:]' |
sed -e '2,$d'
-e 's:^[0-9A-Z]* ::'
-e 's:[0-9A-F][0-9A-F]:0x&,:g' ;
echo '} ;'
) > referenceMagic.cc

in your makefile can be used to regenerate the magic every time
you relink; under Windows, you may have to write a simple C/C++
program to do something similar, using some sort of hash value
of the machines IP address, your current process id (a hex dump
of the HANDLE?), the current time, etc., to seed a quality
random generator, and output from it. (I use scripts like this
all the time in my makefiles, but in this case, if the OS
doesn't support /dev/random, or something similar, the script
won't work, even if you have a full Unix tool kit installed.)

(Also, FWIW: most of the tools which generate your CSV file will
generate closed intervals. I'd convert this to the classical
Unix half open interval when I read the file, and use that from
then on.)

Note too that std::lower_bound will work fine on a C style
array, either loaded with ifstream::read() (on a file opened in
binary mode) or mmapped. Just call it with startAddress,
startAddress + header.entryCount as the first two arguments.

--
James Kanze (GABI Software) email:[email protected]
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34- Dölj citerad text -

- Visa citerad text -

I will improve the the structure or even add it into a class (try to
avoid that as it is very simple lookup table with one purpose of
function to translate an ip address to a country).

However the structure is working fine currently and has worked fine
for while but ofcourse improvments can be done and I'm open to support
cross platforms had some plans to port my app to Linux one day and it
is today a Windows app but I don't want use any Microsoft propertiary
classes such as MFC or other stuff hence that's why I'm looking into
STL.

It is developed in C but consider to port it over to C++ as it is a
more modern and the software has evolved it self from very primitive
to more advanced.

James Kanze · Jan 1, 2008

On Dec 31, 11:48 am, kl <[email protected]> wrote:

Click to expand...

[...]

Click to expand...

I will improve the the structure or even add it into a class
(try to avoid that as it is very simple lookup table with one
purpose of function to translate an ip address to a country).
However the structure is working fine currently and has worked fine
for while

There's nothing necessarily wrong with using a struct here; as I
explained, given your performance considerations, and the fact
that the table is not updated once initialized, I'd seriously
consider a using a C style array of POD struct and memory
mapping it. With the precautions that I explained, to avoid
errors.

but of course improvments can be done and I'm open to support
cross platforms had some plans to port my app to Linux one day
and it is today a Windows app but I don't want use any
Microsoft propertiary classes such as MFC or other stuff hence
that's why I'm looking into STL.

Isn't DWORD a Microsoftism? I've never seen it used on a Unix
machine. (In fact, the only place I've seen it used in real
code, other than in the Windows API, is on IBM mainframes and on
the old Interdata 8/32. Where it was an 8 byte entity.)

More generally: unless the struct consists entirely of char and
char[] (possibly signed or unsigned), you'll have to consider
that if you write it as a binary memory image, you'll only be
able to read it as a binary memory image with the same program
which wrote it (or a program compiled with exactly the same
compiler, compiler version and options). From what I
understand, this is probably an acceptable restriction. Whether
it is worth the bother depends on the application; if you're
writing a server that is started once per day, or less
frequently, it's probably not worth the bother---reparsing the
CVS each start-up is not an issue. If its a small service
application, invoked each time it is used, start-up definitely
is an issue, and using the binary memory image is definitly the
way to go. Even if you don't use memory mapping, a single read
for the entire table will be a lot faster than any other
solution (and is very simple as well).

It is developed in C but consider to port it over to C++ as it
is a more modern and...

That explains the use of FILE*

.

(Honestly, for this level of I/O, I'd define a very slim wrapper
around the basic system level functions, and use it. You'd have
to port it to Linux, but neither FILE* nor iostream are really
designed for effective binary I/O.)

Hash Map slower than Map.	7	Oct 8, 2007
Question on initializing STL hash-maps using Visual C++ Express	2	Oct 17, 2010
how to achieve thread safety at element level in STL data structureslike hash map, vectors	7	Dec 4, 2007
stl map, lower_bound and upper_bound	3	Aug 14, 2006
Using STL map	0	Mar 23, 2006
map with pair as key	11	Jun 20, 2006
I Need Help with making a function that draws in a canvas using location data.	1	Dec 17, 2021
STL map - Can I find() on a subset of key	2	May 6, 2005

STL map or hash map using struct as data and find it

kl

kl

Jim Langston

James Kanze

kl

James Kanze

kl

James Kanze

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads