STL map or hash map using struct as data and find it

Discussion in 'C++' started by kl, Dec 31, 2007.

  1. kl

    kl Guest

    Hi,

    I'm trying to learn some STL using map or hash_map would maybe even
    better to use but I don't really know how to find a specific struct
    depending on a DWORD value that is in range of two DWORD values (see
    below for more).

    So what I trying to achieve is creating a look up table of a IP adress
    with a start adress (a DWORD value) and a end adress (a DWORD
    value) which I would like to look up to the get the Country name
    using a specific IP adress (also in DWORD) which is random access to
    the table.

    It is easy to use vector, linked list but I would like to try to use
    map or hash map for more effecient way to look up or at least test it
    out to compare some.


    The code is basicly looks like this:

    //Struct definition
    struct IPCOUNTRY
    {
    DWORD startIP; //IP start adress in DWORD which is always unique
    DWORD endIP; //IP end adress in DWORD which is always unique
    char COUNTRYNAME[45]; //Country name like England, Germany Spain
    etc...
    };


    typedef map <DWORD, IPCOUNTRY> mIPC;
    mIPC mIPcountry;

    //Initilize the map when and call this function during start up
    void initilizeMap()
    {

    //read data from file and insert it into a map
    IPCOUNTRY g_structIP;
    FILE *fp=NULL;
    fp=fopen("ipcountry.dat", "rb");

    if(fp!=NULL)
    {
    while( !feof( fp ) )
    {
    if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
    {

    mIPcountry[g_structIP.startIP] = g_structIP;
    }
    }
    }
    fclose(fp);
    }


    struct compareIPC
    {
    bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* icp2) const
    {

    return strcmp(ipc1, ipc2) < 0;
    }
    };


    //This function will be called with IPaddress set and set the
    countryname depending which country it belongs
    //to using the map look up table
    int lookupCountryName(DWORD IPadress, char *countryname)
    {
    //ok here I'm lost what I should exactly do

    mIPC::iterator mIPCbegin = mIPcountry.begin();
    mIPC::iterator mIPCend = mIPcountry.end();



    return 0;
    }
    kl, Dec 31, 2007
    #1
    1. Advertising

  2. kl

    kl Guest

    On 31 Dec, 11:48, kl <> wrote:

    Sorry I accidantly clicked on the send button.

    > Hi,
    >
    > I'm trying to learn some STL using map or hash_map would maybe even
    > better to use but I don't really know how to find a specific struct
    > depending on a DWORD value that is in range of two DWORD values (see
    > below for more).
    >
    > So what I trying to achieve is creating a look up table of a IP adress
    > with a start adress (a DWORD value)  and a end adress  (a DWORD
    > value)  which I would like to look up to the get the Country name
    > using a specific IP adress (also in DWORD) which is random access to
    > the table.
    >
    > It is easy to use vector, linked list  but I would like to try to use
    > map or hash map for more effecient way to look up or at least test it
    > out to compare some.
    >
    > The code is basicly looks like this:
    >
    > //Struct definition
    > struct IPCOUNTRY
    > {
    >         DWORD startIP;   //IP start adress in DWORD which is always unique
    >         DWORD endIP;    //IP end adress in DWORD which is always unique
    >         char COUNTRYNAME[45]; //Country name like England, Germany Spain
    > etc...
    >
    > };
    >
    > typedef map <DWORD, IPCOUNTRY> mIPC;
    > mIPC mIPcountry;
    >
    > //Initilize the map when and call this function during start up
    > void initilizeMap()
    > {
    >
    >                //read data from file and insert it into a map
    >         IPCOUNTRY g_structIP;
    >         FILE *fp=NULL;
    >         fp=fopen("ipcountry.dat", "rb");
    >
    >         if(fp!=NULL)
    >         {
    >                 while( !feof( fp ) )
    >                 {
    >                         if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
    >                         {
    >
    >                                   mIPcountry[g_structIP.startIP] = g_structIP;
    >                         }
    >                 }
    >         }
    >         fclose(fp);
    >
    > }
    >


    Here is a more correct comapare function:

    struct compareIPC
    {
      bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* ipc2) const
      {
    if((ipc2.startIP>=ipc1.startIP) && (ipc2.startIP <=
    ipc1.endIP))
    return true;
    return false;
      }

    };

    //This function will be called with IPaddress set and set the
    // countryname depending which country it belongs
    //to using the map look up table
    int lookupCountryName(DWORD IPadress, char *countryname)
    {
       //ok here I'm lost what I should exactly do
    IPCOUNTRY tempIPC;

    tempIPC.startIP = IPadress;
    tempIPC.endIP = IPadress;

       mIPC::iterator mIPCbegin = mIPcountry.begin();
       mIPC::iterator mIPCend =  mIPcountry.end();

    //I would do something like this but also include the tempIPC to
    find the correct item

    mIPC::iterator iterResult = find_if(mIPCbegin,
    mIPCend,insideiprange);

    //code to get the result and set the country name

       return 0;
    }

    I tried my best to explain what I'm trying to do and
    hopefully some one can shed some light of this.


    regards
    Kl
    kl, Dec 31, 2007
    #2
    1. Advertising

  3. kl

    Jim Langston Guest

    kl wrote:
    > Hi,
    >
    > I'm trying to learn some STL using map or hash_map would maybe even
    > better to use but I don't really know how to find a specific struct
    > depending on a DWORD value that is in range of two DWORD values (see
    > below for more).
    >
    > So what I trying to achieve is creating a look up table of a IP adress
    > with a start adress (a DWORD value) and a end adress (a DWORD
    > value) which I would like to look up to the get the Country name
    > using a specific IP adress (also in DWORD) which is random access to
    > the table.


    If I'm understanding you correctly, you want to find a range of elemets in a
    map between an upper bound and a lower bound value. Luckily, std::map has
    lower_bound() and upper_bound() which both return an iteratore.
    lower_bound() returning an iterator to the lowest key that is equal or
    greater than the value, upper_bound() returning an iterator to the highest
    key equal or less than the value.

    I do believe than an IP4 address has it's MSB as the left most of the IP
    address so this should work fine for a range of IP addresses.

    Of course you'll need to compare the iterators for not found (compare to
    ..end()).
    Jim Langston, Dec 31, 2007
    #3
  4. kl

    James Kanze Guest

    On Dec 31, 11:48 am, kl <> wrote:
    > I'm trying to learn some STL using map or hash_map would maybe
    > even better to use but I don't really know how to find a
    > specific struct depending on a DWORD value that is in range of
    > two DWORD values (see below for more).


    A hash map generally cannot be made to work with a range. In
    the case of std::map, it will take a somewhat special ordering
    function, but I think it can be made to work. I'd still tend to
    favor a sorted std::vector, using lower_bound for the look-up.

    > So what I trying to achieve is creating a look up table of a
    > IP adress with a start adress (a DWORD value) and a end
    > adress (a DWORD value) which I would like to look up to the
    > get the Country name using a specific IP adress (also in
    > DWORD) which is random access to the table.


    Just a guess, but I would imagine that lookups outnumber
    insertions by a very significant margin. That most likely, in
    fact, that table is initialized once from a file, before first
    use, and never modified afterwards.

    In that case, std::vector and lower_bound are definitely the way
    to go. In fact, I'd recommend sorting the initialization file
    once and for all, and just asserting on the order while reading
    it.

    > It is easy to use vector, linked list but I would like to try
    > to use map or hash map for more effecient way to look up or at
    > least test it out to compare some.


    A sorted vector and lower_bound is likely to be faster than
    std::map, since it will make roughly the same number of
    comparisons, and will have a lot better locality. If speed is
    really an issue, of course, the solution in this particular case
    is probably to use a trie starting with the high order byte. My
    guess is that a single table lookup (using direct indexation)
    will suffice 99% of the time, or more.

    > The code is basicly looks like this:


    > //Struct definition
    > struct IPCOUNTRY
    > {
    > DWORD startIP; //IP start adress in DWORD which is always unique
    > DWORD endIP; //IP end adress in DWORD which is always unique
    > char COUNTRYNAME[45]; //Country name like England, Germany Spain etc...
    > };


    Unless you need static initialization, you should probably use
    std::string instead of char[45]. More importantly, you should
    generally avoid all caps except for macros, and probably use
    more or less standard types, like uint32_t, instead of some
    typedef to an unknown type, like DWORD (which a priori I would
    expect to be a 64 bit type).

    > typedef map <DWORD, IPCOUNTRY> mIPC;
    > mIPC mIPcountry;


    That doesn't look right to me. You're not mapping a single IP
    address to a country. mIPC.lower_bound will still work, but
    only if you assume a dense allocation, with no holes. I'd go
    for something like:

    typedef std::set< IpCountry, IpCountryCmp >
    Map ;
    Map myIpCountryMap ;

    Defining IpCountryCmp is not going to be that simple, however.

    > //Initilize the map when and call this function during start up
    > void initilizeMap()
    > {
    > //read data from file and insert it into a map
    > IPCOUNTRY g_structIP;
    > FILE *fp=NULL;
    > fp=fopen("ipcountry.dat", "rb");


    > if(fp!=NULL)
    > {
    > while( !feof( fp ) )
    > {
    > if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
    > {
    > mIPcountry[g_structIP.startIP] = g_structIP;
    > }
    > }
    > }
    > fclose(fp);
    > }


    The above code will not work at all, for several reasons.

    First, of course, you haven't told us the format of the file
    you're reading, so it's difficult to say how you really should
    read it, but fread is only a solution if you're reading the data
    into a buffer of raw bytes, which you then parse manually.
    (Most likely, you're dealing with a file in CSV format, unless
    you're reading directly from a data base connection. So some
    parsing will probably be necessary anyway.)

    Secondly, feof only tells you whether the last read encountered
    EOF or not; if there was some other type of error, it will
    remain false. Forever.

    And fclose on a null pointer is undefined behavior.

    The classical solution for this problem would be to write an
    operator>> for IPCOUNTRY, and use it on an ifstream. Using the
    FILE* functions of C here is a sure loser.

    > struct compareIPC
    > {
    > bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* icp2) const
    > {
    > return strcmp(ipc1, ipc2) < 0;
    > }
    > };


    strcmp on something which isn't a string (a null terminated
    sequence of char) is a sure recepe for disaster.

    > //This function will be called with IPaddress set and set the
    > countryname depending which country it belongs
    > //to using the map look up table
    > int lookupCountryName(DWORD IPadress, char *countryname)
    > {
    > //ok here I'm lost what I should exactly do


    > mIPC::iterator mIPCbegin = mIPcountry.begin();
    > mIPC::iterator mIPCend = mIPcountry.end();


    > return 0;
    > }


    If you're using std::set, you'll need to construct a "key", i.e.
    an IPCOUNTRY, such that the comparison function will work
    correctly. If you do this, I'd probably use something like:

    class IpCountryEntry
    {
    public:
    //! \post
    //! country() == ""
    //! start() == 0
    //! end() == 0
    IpCountryEntry() ;
    //! \post
    //! country() == country
    //! start() == start
    //! end() == end + 1
    //! for half open range...
    IpCountryEntry( uint32_t start,
    uint32_t end,
    std::string const& country ) ;
    //! \post
    //! country() == ""
    //! start() == ipAddress
    //! end() == ipAddress
    IpCountryEntry( uint32_t ipAddress ) ;

    std::string country() const ;
    uint32_t start() const ;
    uint32_t end() const ;

    bool operator<(
    IpCountryEntry const& other ) const ;

    // ...
    } ;

    The trick would be to ensure that operator< always returned
    false if one of the ranges was completely included in the other
    (and probably raised an exception if the ranges overlapped).
    You could then use the find() function of std::set.

    But as I said, a sorted std::vector and std::lower_bound would
    probably be more appropriate. Note that the comparison function
    of lower_bound is a template argument, and the value argument's
    type is also separate from the value_type of the iterator, so
    you can invoke something like:

    std::lower_bound( v.begin(), v.end(),
    ipAddress,
    CmpIpEntry() ) ;

    where CmpIpEntry is something like:

    struct CmpIpEntry
    {
    bool operator( IpCountryEntry const& lhs,
    IpCountryEntry const& rhs ) const ;
    bool operator( IpCountryEntry const& lhs,
    uint32_t rhs ) const ;
    bool operator( uint32_t lhs,
    IpCountryEntry const& rhs ) const ;
    } ;

    (Formally, at least according to the latest draft, the third
    function above should never be called. But it's trivial to
    implement, just in case.)

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Dec 31, 2007
    #4
  5. kl

    kl Guest

    On 31 Dec, 13:21, James Kanze <> wrote:
    > On Dec 31, 11:48 am, kl <> wrote:
    >
    > > I'm trying to learn some STL using map or hash_map would maybe
    > > even better to use but I don't really know how to find a
    > > specific struct depending on a DWORD value that is in range of
    > > two DWORD values (see below for more).

    >
    > A hash map generally cannot be made to work with a range.  In
    > the case of std::map, it will take a somewhat special ordering
    > function, but I think it can be made to work.  I'd still tend to
    > favor a sorted std::vector, using lower_bound for the look-up.


    Intresting, maybe this approach I should go with.

    I need to read on a bit on the lower_bound before I can comment if it
    would work.


    > > So what I trying to achieve is creating a look up table of a
    > > IP adress with a start adress (a DWORD value)  and a end
    > > adress  (a DWORD value)  which I would like to look up to the
    > > get the Country name using a specific IP adress (also in
    > > DWORD) which is random access to the table.

    >
    > Just a guess, but I would imagine that lookups outnumber
    > insertions by a very significant margin.  That most likely, in
    > fact, that table is initialized once from a file, before first
    > use, and never modified afterwards.



    Correct.


    > In that case, std::vector and lower_bound are definitely the way
    > to go.  In fact, I'd recommend sorting the initialization file
    > once and for all, and just asserting on the order while reading
    > it.


    ok, seems like the way to go with vectors.

    > > It is easy to use vector, linked list  but I would like to try
    > > to use map or hash map for more effecient way to look up or at
    > > least test it out to compare some.

    >
    > A sorted vector and lower_bound is likely to be faster than
    > std::map, since it will make roughly the same number of
    > comparisons, and will have a lot better locality.  If speed is
    > really an issue, of course, the solution in this particular case
    > is probably to use a trie starting with the high order byte.  My
    > guess is that a single table lookup (using direct indexation)
    > will suffice 99% of the time, or more.
    >


    Why I would like to have some speed up is because the file has around
    80'000 lines

    > > The code is basicly looks like this:
    > > //Struct definition
    > > struct IPCOUNTRY
    > > {
    > >         DWORD startIP;   //IP start adress in DWORD which is always unique
    > >         DWORD endIP;    //IP end adress in DWORD which is always unique
    > >         char COUNTRYNAME[45]; //Country name like England, Germany Spain etc...
    > > };

    >
    > Unless you need static initialization, you should probably use
    > std::string instead of char[45].  More importantly, you should
    > generally avoid all caps except for macros, and probably use
    > more or less standard types, like uint32_t, instead of some
    > typedef to an unknown type, like DWORD (which a priori I would
    > expect to be a 64 bit type).



    I'm aware of this defines and working on cleaning my app to use string
    instead of array of char's same goes with DWORD. If I just did it from
    the beginning it would save me alot of time but me decision was to
    keep a small memory footprint of the application but today I will
    never think of doing this.
    Stability and portability goes before small usage of memory & gaining
    speed especially on todays computers somthing I have learned from the
    current code. :)



    > > typedef map <DWORD, IPCOUNTRY> mIPC;
    > > mIPC mIPcountry;

    >
    > That doesn't look right to me.  You're not mapping a single IP
    > address to a country.  mIPC.lower_bound will still work, but
    > only if you assume a dense allocation, with no holes.  I'd go
    > for something like:
    >
    >     typedef std::set< IpCountry, IpCountryCmp >
    >                         Map ;
    >     Map                 myIpCountryMap ;
    >
    > Defining IpCountryCmp is not going to be that simple, however.
    >
    >
    >
    >
    >
    > > //Initilize the map when and call this function during start up
    > > void initilizeMap()
    > > {
    > >                //read data from file and insert it into a map
    > >         IPCOUNTRY g_structIP;
    > >         FILE *fp=NULL;
    > >         fp=fopen("ipcountry.dat", "rb");
    > >         if(fp!=NULL)
    > >         {
    > >                 while( !feof( fp ) )
    > >                 {
    > >                         if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
    > >                         {
    > >                                   mIPcountry[g_structIP.startIP] = g_structIP;
    > >                         }
    > >                 }
    > >         }
    > >         fclose(fp);
    > > }

    >
    > The above code will not work at all, for several reasons.
    >
    > First, of course, you haven't told us the format of the file
    > you're reading, so it's difficult to say how you really should
    > read it, but fread is only a solution if you're reading the data
    > into a buffer of raw bytes, which you then parse manually.
    > (Most likely, you're dealing with a file in CSV format, unless
    > you're reading directly from a data base connection.  So some
    > parsing will probably be necessary anyway.)



    Well basicly the format of file is the IPCOUNTRY struct.

    I have created a seperate converter function from CSV file format to
    STRUCT format (this is never done on the retail application) this is
    also have to do with speed when reading from the file.
    Maybe not future proof and user friendly but imagine if everytime you
    start up an application and it gonna parse 80'000+ lines of column
    seperated lines with strtok or similar... maybe not a big deal with
    todays computer. I just thought this was a neat and lightweight
    solution at that point of time.



    > Secondly, feof only tells you whether the last read encountered
    > EOF or not; if there was some other type of error, it will
    > remain false.  Forever.
    >
    > And fclose on a null pointer is undefined behavior.


    Sorry for the typo... it is suppose to be inside the if(fp!=NULL)
    {..}.


    > The classical solution for this problem would be to write an
    > operator>> for IPCOUNTRY, and use it on an ifstream.  Using the
    > FILE* functions of C here is a sure loser.
    >
    > > struct compareIPC
    > > {
    > >   bool operator()(const IPCOUNTRY* ipc1, const IPCOUNTRY* icp2) const
    > >   {
    > >     return strcmp(ipc1, ipc2) < 0;
    > >   }
    > > };

    >
    > strcmp on something which isn't a string (a null terminated
    > sequence of char) is a sure recepe for disaster.


    This piece of code was a mistake and post accidantly.

    > > //This function will be called with IPaddress set and set the
    > > countryname depending which country it belongs
    > > //to using the map look up table
    > > int lookupCountryName(DWORD IPadress, char *countryname)
    > > {
    > >    //ok here I'm lost what I should exactly do
    > >    mIPC::iterator mIPCbegin = mIPcountry.begin();
    > >    mIPC::iterator mIPCend =  mIPcountry.end();
    > >    return 0;
    > > }

    >
    > If you're using std::set, you'll need to construct a "key", i.e.
    > an IPCOUNTRY, such that the comparison function will work
    > correctly.  If you do this, I'd probably use something like:
    >
    >     class IpCountryEntry
    >     {
    >     public:
    >         //! \post
    >         //!     country() == ""
    >         //!     start() == 0
    >         //!     end()   == 0
    >         IpCountryEntry() ;
    >         //! \post
    >         //!     country() == country
    >         //!     start() == start
    >         //!     end()   == end + 1
    >         //!                    for half open range...
    >         IpCountryEntry( uint32_t start,
    >                         uint32_t end,
    >                         std::string const& country ) ;
    >         //! \post
    >         //!     country() == ""
    >         //!     start() == ipAddress
    >         //!     end()   == ipAddress
    >         IpCountryEntry( uint32_t ipAddress ) ;
    >
    >         std::string         country() const ;
    >         uint32_t            start()   const ;
    >         uint32_t            end()     const ;
    >
    >         bool                operator<(
    >             IpCountryEntry const& other ) const ;
    >
    >         //  ...
    >     } ;
    >
    > The trick would be to ensure that operator< always returned
    > false if one of the ranges was completely included in the other
    > (and probably raised an exception if the ranges overlapped).
    > You could then use the find() function of std::set.
    >
    > But as I said, a sorted std::vector and std::lower_bound would
    > probably be more appropriate.  Note that the comparison function
    > of lower_bound is a template argument, and the value argument's
    > type is also separate from the value_type of the iterator, so
    > you can invoke something like:
    >
    >     std::lower_bound( v.begin(), v.end(),
    >                       ipAddress,
    >                       CmpIpEntry() ) ;
    >
    > where CmpIpEntry is something like:
    >
    >     struct CmpIpEntry
    >     {
    >         bool            operator( IpCountryEntry const& lhs,
    >                                   IpCountryEntry const& rhs ) const ;
    >         bool            operator( IpCountryEntry const& lhs,
    >                                   uint32_t              rhs ) const ;
    >         bool            operator( uint32_t              lhs,
    >                                   IpCountryEntry const& rhs ) const ;
    >     } ;
    >
    > (Formally, at least according to the latest draft, the third
    > function above should never be called.  But it's trivial to
    > implement, just in case.)


    I will look into this directly.

    Great info James and thanks for your feedback much appreciated!


    > --
    > James Kanze (GABI Software)             email:
    > Conseils en informatique orientée objet/
    >                    Beratung in objektorientierter Datenverarbeitung
    > 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34- Dölj citerad text -
    >
    > - Visa citerad text -
    kl, Dec 31, 2007
    #5
  6. kl

    James Kanze Guest

    On Dec 31, 1:51 pm, kl <> wrote:
    > On 31 Dec, 13:21, James Kanze <> wrote:
    > > On Dec 31, 11:48 am, kl <> wrote:


    [...]
    > > > It is easy to use vector, linked list but I would like to try
    > > > to use map or hash map for more effecient way to look up or at
    > > > least test it out to compare some.


    > > A sorted vector and lower_bound is likely to be faster than
    > > std::map, since it will make roughly the same number of
    > > comparisons, and will have a lot better locality. If speed
    > > is really an issue, of course, the solution in this
    > > particular case is probably to use a trie starting with the
    > > high order byte. My guess is that a single table lookup
    > > (using direct indexation) will suffice 99% of the time, or
    > > more.


    > Why I would like to have some speed up is because the file has
    > around 80'000 lines


    But how often are you doing a look-up. Still, 80000 lines is
    probably enough to want to use something more efficient than a
    linear search.

    [...]
    > > > //Initilize the map when and call this function during start up
    > > > void initilizeMap()
    > > > {
    > > > //read data from file and insert it into a map
    > > > IPCOUNTRY g_structIP;
    > > > FILE *fp=NULL;
    > > > fp=fopen("ipcountry.dat", "rb");
    > > > if(fp!=NULL)
    > > > {
    > > > while( !feof( fp ) )
    > > > {
    > > > if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
    > > > {
    > > > mIPcountry[g_structIP.startIP] = g_structIP;
    > > > }
    > > > }
    > > > }
    > > > fclose(fp);
    > > > }


    > > The above code will not work at all, for several reasons.


    > > First, of course, you haven't told us the format of the file
    > > you're reading, so it's difficult to say how you really should
    > > read it, but fread is only a solution if you're reading the data
    > > into a buffer of raw bytes, which you then parse manually.
    > > (Most likely, you're dealing with a file in CSV format, unless
    > > you're reading directly from a data base connection. So some
    > > parsing will probably be necessary anyway.)


    > Well basicly the format of file is the IPCOUNTRY struct.


    Which is? The format of the IPCOUNTRY struct will vary
    enormously between compilers, even on the same platform, and can
    change from one release of the compiler to the next, or even
    with different compiler options. When you say that the format
    is basically the same as that of the IPCOUNTRY struct, you're
    basically saying that you don't know the format, and that you
    don't care, because you're not going to ever reread the data at
    a later date, when the program may have been recompiled with a
    more recent version of the compiler, or with different options.

    > I have created a seperate converter function from CSV file
    > format to STRUCT format (this is never done on the retail
    > application) this is also have to do with speed when reading
    > from the file.


    As long as you ensure that both this separate program and your
    application are compiled with the same version of the compiler,
    using the same options. It still sounds error prone to me (but
    I can imagine cases where it is justified).

    > Maybe not future proof and user friendly but imagine if
    > everytime you start up an application and it gonna parse
    > 80'000+ lines of column seperated lines with strtok or
    > similar... maybe not a big deal with todays computer. I just
    > thought this was a neat and lightweight solution at that point
    > of time.


    Do you know of many programs that don't parse a couple of
    thousand lines on start up?

    I guess it depends on the context. I work on large scale
    servers. They're started at most once a day---most of the
    projects I've worked on are started once every three or four
    years, or more. Whether the start up lasts 1 second, or 5 or
    even 30, isn't a big deal.

    Still, if you have to parse it, you have to parse it. Doing so
    in a separate process is going to take (slightly) more time than
    doing so in your own process. What might be useful is if you
    only have to update the data (from the CVS file) once a week or
    less, but you restart the individual application a lot. In that
    case, I might use a memory dump with a header containing the
    last update (from the CVS file) date and a checksum based on the
    compiler/compiler options. (For the latter, on a Unix system,
    I'd simply use the output of /dev/random to initialize an array
    of about 16 bytes, regenerating the data each time I relinked.
    So anytime I install a newly relinked version of the program, it
    will go to the CVS data the first time.) And I'd still put the
    parser for the CVS data in the same binary image, with something
    like:

    if ( binaryDumpPresent() && binaryDumpOKToUse() ) {
    initializeFromBinaryDump() ;
    } else {
    initializeFromCVSFile() ;
    createBinaryDumpFromData() ;
    }

    In this case, of course:

    -- The actual data elements should be POD's, with only basic
    types (integral types, perhaps floating point types as well,
    although there's no need here, but no classes or pointers).

    -- If you're using a sorted vector, you can do this with a
    single read/write, no need to read or write each element
    separately. (In fact, I'd use two reads/writes: one for the
    header, and one for the rest of the data.)

    Depending on portability constraints, I might even use mmap (or
    its equivalent under Windows); I've done this once (with a file
    that took something like five minutes to parse on my old Sun
    Sparc, resulted in a memory image of something like 10-15
    Megabytes, and in which in 90% of the runs, I never accessed
    beyond the first 128 entries---out of more than a
    million---anyway).

    Although formally not guaranteed... If you define the header
    something like:

    struct IpCountryMapHeader
    {
    unsigned char magic[ 16 ] ;
    time_t saved ;
    size_t entryCount ;
    } ;

    and each element as:

    struct IpCountryMapEntry
    {
    uint32_t start ;
    uint32_t end ;
    char countryName[ maxCountryNameLength + 1 ] ;
    } ;

    will doubtlessly work. (In theory, you could run afoul of
    alignment considerations if both size_t and time_t are smaller
    than uint32_t. In practice, if size_t is less than 32 bits,
    your table won't fit into memory, and time_t will never be less
    than 32 bits anyway.)

    Under Unix, a simple script like:

    (
    echo 'unsigned char referenceMagic[] = {' ;
    od -tx1 -N 16 /dev/random |
    tr '[:lower:]' '[:upper:]' |
    sed -e '2,$d'
    -e 's:^[0-9A-Z]* ::'
    -e 's:[0-9A-F][0-9A-F]:0x&,:g' ;
    echo '} ;'
    ) > referenceMagic.cc

    in your makefile can be used to regenerate the magic every time
    you relink; under Windows, you may have to write a simple C/C++
    program to do something similar, using some sort of hash value
    of the machines IP address, your current process id (a hex dump
    of the HANDLE?), the current time, etc., to seed a quality
    random generator, and output from it. (I use scripts like this
    all the time in my makefiles, but in this case, if the OS
    doesn't support /dev/random, or something similar, the script
    won't work, even if you have a full Unix tool kit installed.)

    (Also, FWIW: most of the tools which generate your CSV file will
    generate closed intervals. I'd convert this to the classical
    Unix half open interval when I read the file, and use that from
    then on.)

    Note too that std::lower_bound will work fine on a C style
    array, either loaded with ifstream::read() (on a file opened in
    binary mode) or mmapped. Just call it with startAddress,
    startAddress + header.entryCount as the first two arguments.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Dec 31, 2007
    #6
  7. kl

    kl Guest

    On 31 Dec, 15:52, James Kanze <> wrote:
    > On Dec 31, 1:51 pm, kl <> wrote:
    >
    > > On 31 Dec, 13:21, James Kanze <> wrote:
    > > > On Dec 31, 11:48 am, kl <> wrote:

    >
    >     [...]
    >
    > > > > It is easy to use vector, linked list  but I would like to try
    > > > > to use map or hash map for more effecient way to look up or at
    > > > > least test it out to compare some.
    > > > A sorted vector and lower_bound is likely to be faster than
    > > > std::map, since it will make roughly the same number of
    > > > comparisons, and will have a lot better locality.  If speed
    > > > is really an issue, of course, the solution in this
    > > > particular case is probably to use a trie starting with the
    > > > high order byte.  My guess is that a single table lookup
    > > > (using direct indexation) will suffice 99% of the time, or
    > > > more.

    > > Why I would like to have some speed up is because the file has
    > > around 80'000 lines


    > But how often are you doing a look-up.  Still, 80000 lines is
    > probably enough to want to use something more efficient than a
    > linear search.


    It can vary a lot, from 200 to almost 20000 times accessing the look
    up table within seconds (depending of your network connection/
    timeouts) and also
    done using multithreading which can be 64-128 threads depending of
    settings.
    It has to do as my software scans through different servers (actually
    game servers) and ask for some info and the lookup table in this case
    translatets the game server ip address to a country (for filtering
    purpose).

    >     [...]
    >
    >
    >
    >
    >
    > > > > //Initilize the map when and call this function during start up
    > > > > void initilizeMap()
    > > > > {
    > > > >                //read data from file and insert it into a map
    > > > >         IPCOUNTRY g_structIP;
    > > > >         FILE *fp=NULL;
    > > > >         fp=fopen("ipcountry.dat", "rb");
    > > > >         if(fp!=NULL)
    > > > >         {
    > > > >                 while( !feof( fp ) )
    > > > >                 {
    > > > >                         if(fread(&g_structIP, sizeof(g_structIP), 1, fp)!=0)
    > > > >                         {
    > > > >                                   mIPcountry[g_structIP.startIP] = g_structIP;
    > > > >                         }
    > > > >                 }
    > > > >         }
    > > > >         fclose(fp);
    > > > > }
    > > > The above code will not work at all, for several reasons.
    > > > First, of course, you haven't told us the format of the file
    > > > you're reading, so it's difficult to say how you really should
    > > > read it, but fread is only a solution if you're reading the data
    > > > into a buffer of raw bytes, which you then parse manually.
    > > > (Most likely, you're dealing with a file in CSV format, unless
    > > > you're reading directly from a data base connection.  So some
    > > > parsing will probably be necessary anyway.)

    > > Well basicly the format of file is the IPCOUNTRY struct.

    >
    > Which is?  The format of the IPCOUNTRY struct will vary
    > enormously between compilers, even on the same platform, and can
    > change from one release of the compiler to the next, or even
    > with different compiler options.  When you say that the format
    > is basically the same as that of the IPCOUNTRY struct, you're
    > basically saying that you don't know the format, and that you
    > don't care, because you're not going to ever reread the data at
    > a later date, when the program may have been recompiled with a
    > more recent version of the compiler, or with different options.
    >
    > > I have created a seperate converter function from CSV file
    > > format to STRUCT format (this is never done on the retail
    > > application) this is also have to do with speed when reading
    > > from the file.

    >
    > As long as you ensure that both this separate program and your
    > application are compiled with the same version of the compiler,
    > using the same options.  It still sounds error prone to me (but
    > I can imagine cases where it is justified).
    >
    > > Maybe not future proof and user friendly but imagine if
    > > everytime you start up an application and it gonna parse
    > > 80'000+ lines of column seperated lines with strtok or
    > > similar... maybe not a big deal with todays computer. I just
    > > thought this was a neat and lightweight solution at that point
    > > of time.

    >
    > Do you know of many programs that don't parse a couple of
    > thousand lines on start up?
    >
    > I guess it depends on the context.  I work on large scale
    > servers.  They're started at most once a day---most of the
    > projects I've worked on are started once every three or four
    > years, or more.  Whether the start up lasts 1 second, or 5 or
    > even 30, isn't a big deal.
    >
    > Still, if you have to parse it, you have to parse it.  Doing so
    > in a separate process is going to take (slightly) more time than
    > doing so in your own process.  What might be useful is if you
    > only have to update the data (from the CVS file) once a week or
    > less, but you restart the individual application a lot.  In that
    > case, I might use a memory dump with a header containing the
    > last update (from the CVS file) date and a checksum based on the
    > compiler/compiler options.  (For the latter, on a Unix system,
    > I'd simply use the output of /dev/random to initialize an array
    > of about 16 bytes, regenerating the data each time I relinked.
    > So anytime I install a newly relinked version of the program, it
    > will go to the CVS data the first time.)  And I'd still put the
    > parser for the CVS data in the same binary image, with something
    > like:
    >
    >     if ( binaryDumpPresent() && binaryDumpOKToUse() ) {
    >         initializeFromBinaryDump() ;
    >     } else {
    >         initializeFromCVSFile() ;
    >         createBinaryDumpFromData() ;
    >     }
    >
    > In this case, of course:
    >
    >  -- The actual data elements should be POD's, with only basic
    >     types (integral types, perhaps floating point types as well,
    >     although there's no need here, but no classes or pointers).
    >
    >  -- If you're using a sorted vector, you can do this with a
    >     single read/write, no need to read or write each element
    >     separately.  (In fact, I'd use two reads/writes: one for the
    >     header, and one for the rest of the data.)
    >
    > Depending on portability constraints, I might even use mmap (or
    > its equivalent under Windows); I've done this once (with a file
    > that took something like five minutes to parse on my old Sun
    > Sparc, resulted in a memory image of something like 10-15
    > Megabytes, and in which in 90% of the runs, I never accessed
    > beyond the first 128 entries---out of more than a
    > million---anyway).
    >
    > Although formally not guaranteed...  If you define the header
    > something like:
    >
    >     struct IpCountryMapHeader
    >     {
    >         unsigned char       magic[ 16 ] ;
    >         time_t              saved ;
    >         size_t              entryCount ;
    >     } ;
    >
    > and each element as:
    >
    >     struct IpCountryMapEntry
    >     {
    >         uint32_t            start ;
    >         uint32_t            end ;
    >         char                countryName[ maxCountryNameLength + 1 ] ;
    >     } ;
    >
    > will doubtlessly work.  (In theory, you could run afoul of
    > alignment considerations if both size_t and time_t are smaller
    > than uint32_t.  In practice, if size_t is less than 32 bits,
    > your table won't fit into memory, and time_t will never be less
    > than 32 bits anyway.)
    >
    > Under Unix, a simple script like:
    >
    >     (
    >         echo 'unsigned char referenceMagic[] = {' ;
    >         od -tx1 -N 16 /dev/random      |
    >             tr '[:lower:]' '[:upper:]' |
    >             sed -e '2,$d'
    >                 -e 's:^[0-9A-Z]* ::'
    >                 -e 's:[0-9A-F][0-9A-F]:0x&,:g' ;
    >         echo '} ;'
    >     )  > referenceMagic.cc
    >
    > in your makefile can be used to regenerate the magic every time
    > you relink; under Windows, you may have to write a simple C/C++
    > program to do something similar, using some sort of hash value
    > of the machines IP address, your current process id (a hex dump
    > of the HANDLE?), the current time, etc., to seed a quality
    > random generator, and output from it.  (I use scripts like this
    > all the time in my makefiles, but in this case, if the OS
    > doesn't support /dev/random, or something similar, the script
    > won't work, even if you have a full Unix tool kit installed.)
    >
    > (Also, FWIW: most of the tools which generate your CSV file will
    > generate closed intervals.  I'd convert this to the classical
    > Unix half open interval when I read the file, and use that from
    > then on.)
    >
    > Note too that std::lower_bound will work fine on a C style
    > array, either loaded with ifstream::read() (on a file opened in
    > binary mode) or mmapped.  Just call it with startAddress,
    > startAddress + header.entryCount as the first two arguments.
    >
    > --
    > James Kanze (GABI Software)             email:
    > Conseils en informatique orientée objet/
    >                    Beratung in objektorientierter Datenverarbeitung
    > 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34- Dölj citerad text -
    >
    > - Visa citerad text -



    I will improve the the structure or even add it into a class (try to
    avoid that as it is very simple lookup table with one purpose of
    function to translate an ip address to a country).

    However the structure is working fine currently and has worked fine
    for while but ofcourse improvments can be done and I'm open to support
    cross platforms had some plans to port my app to Linux one day and it
    is today a Windows app but I don't want use any Microsoft propertiary
    classes such as MFC or other stuff hence that's why I'm looking into
    STL.

    It is developed in C but consider to port it over to C++ as it is a
    more modern and the software has evolved it self from very primitive
    to more advanced.
    kl, Dec 31, 2007
    #7
  8. kl

    James Kanze Guest

    On Dec 31 2007, 4:20 pm, kl <> wrote:
    > On 31 Dec, 15:52, James Kanze <> wrote:
    > > On Dec 31, 1:51 pm, kl <> wrote:
    > > > On 31 Dec, 13:21, James Kanze <> wrote:
    > > > > On Dec 31, 11:48 am, kl <> wrote:


    > > [...]

    > I will improve the the structure or even add it into a class
    > (try to avoid that as it is very simple lookup table with one
    > purpose of function to translate an ip address to a country).
    > However the structure is working fine currently and has worked fine
    > for while


    There's nothing necessarily wrong with using a struct here; as I
    explained, given your performance considerations, and the fact
    that the table is not updated once initialized, I'd seriously
    consider a using a C style array of POD struct and memory
    mapping it. With the precautions that I explained, to avoid
    errors.

    > but of course improvments can be done and I'm open to support
    > cross platforms had some plans to port my app to Linux one day
    > and it is today a Windows app but I don't want use any
    > Microsoft propertiary classes such as MFC or other stuff hence
    > that's why I'm looking into STL.


    Isn't DWORD a Microsoftism? I've never seen it used on a Unix
    machine. (In fact, the only place I've seen it used in real
    code, other than in the Windows API, is on IBM mainframes and on
    the old Interdata 8/32. Where it was an 8 byte entity.)

    More generally: unless the struct consists entirely of char and
    char[] (possibly signed or unsigned), you'll have to consider
    that if you write it as a binary memory image, you'll only be
    able to read it as a binary memory image with the same program
    which wrote it (or a program compiled with exactly the same
    compiler, compiler version and options). From what I
    understand, this is probably an acceptable restriction. Whether
    it is worth the bother depends on the application; if you're
    writing a server that is started once per day, or less
    frequently, it's probably not worth the bother---reparsing the
    CVS each start-up is not an issue. If its a small service
    application, invoked each time it is used, start-up definitely
    is an issue, and using the binary memory image is definitly the
    way to go. Even if you don't use memory mapping, a single read
    for the entire table will be a lot faster than any other
    solution (and is very simple as well).

    > It is developed in C but consider to port it over to C++ as it
    > is a more modern and...


    That explains the use of FILE*:).

    (Honestly, for this level of I/O, I'd define a very slim wrapper
    around the basic system level functions, and use it. You'd have
    to port it to Linux, but neither FILE* nor iostream are really
    designed for effective binary I/O.)

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Jan 1, 2008
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Chris Fogelklou
    Replies:
    36
    Views:
    1,347
    Chris Fogelklou
    Apr 20, 2004
  2. Replies:
    7
    Views:
    637
    Gernot Frisch
    Mar 29, 2006
  3. grbgooglefan
    Replies:
    7
    Views:
    320
    James Kanze
    Dec 5, 2007
  4. navS
    Replies:
    3
    Views:
    491
    Ismo Salonen
    May 9, 2008
  5. rp
    Replies:
    1
    Views:
    491
    red floyd
    Nov 10, 2011
Loading...

Share This Page