historical data from many sources: design questions

Discussion in 'C++' started by Hicham Mouline, Sep 4, 2010.

  1. Hello,

    I have measurements taken daily (work days) for the past 20 years or so, on
    the order of 5000 entries.
    I currently have them in a text file (I've hand-written the parser, but I'll
    move to boost::spirit eventually).

    The application is growing:
    . I may move to 20 years' worth of measurements taken every few seconds,
    arriving at around 50 000 000 entries. Each entry is probably 64 bytes.
    . I may use a database
    . I may receive the data over a network socket

    I have a class 'historical_data' that currently holds the 5000 in memory. A
    standalone function (in the same namespace as that class) parses from text
    file into that class.
    My application typically iterates from earliest to latest on this data.

    I am wondering what incremental changes to introduce to the code I have in
    order to:
    1. make some factory function to create a full 'historical_data' from a text
    file/database/network socket
    2. allow for 50 000 000 instead of 5000 entries, possibly keeping just a
    part in memory and the rest in the text file/database/network, and access
    this transparently

    rds,
    Hicham Mouline, Sep 4, 2010
    #1

  2. Goran (Guest)

    On Sep 4, 6:30 am, "Hicham Mouline" <> wrote:
    > Hello,
    >
    > I have measurements done daily (work days) for the past 20 years or so, in
    > the order then of 5000 or so entries.
    > I currently have them in a text file (I've hand written the parser but I'll
    > move to boost::spirit eventually)
    >
    > The application is growing:
    > . I may move to 20 years worth of measures every few seconds and arrive at a
    > range of 50 000 000 entries. Each entry is probably 64 bytes.
    > . I may use a database
    > . I may receive the data over a network socket
    >
    > I have a class 'historical_data' that currently holds the 5000 in memory. A
    > standalone function (in the same namespace as that class) parses from text
    > file into that class.
    > My application typically iterates from earliest to latest on this data.
    >
    > I am wondering what incremental changes to introduce to the code I have to :
    > 1. Make some factory function to create a full 'historical_data' from text
    > file/database/network socket
    > 2. allow for 50 000 000 instead of 5000 entries, and possibly keep just a
    > part in memory and the rest on text file/database/network, and access this
    > transparently


    Database management systems are for munching such quantities of data.
    A home-grown solution, even is very constrained and simplified through
    assumptions of data structure characteristics, is likely
    * to cost a lot and still be suboptimal
    * to be much harder to grow in functionality and scale
    * to make you learn about on-disk indexing and caching of data
    (knowing is a good thing, but this is a big subject, and not your
    actual goal ;-) )

    You should consider a database. BerkeleyDB comes to mind, but
    here's one search for alternatives: http://stackoverflow.com/questions/260804/alternative-to-berkeleydb.
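
    For illustration, a minimal sketch of storing and iterating entries with
    Berkeley DB's C++ API (db_cxx.h). The 'measurement' struct, the file name,
    and the key format (a fixed-width YYYYMMDD date string, so the default
    B-tree byte ordering is also chronological order) are assumptions for the
    example, not something from the original post:

    #include <db_cxx.h>
    #include <iostream>
    #include <string>

    // Hypothetical fixed-size entry; the real layout depends on the measurements.
    struct measurement {
        double value;
        double extra[7];   // pad towards the ~64 bytes mentioned in the post
    };

    int main()
    {
        Db db(NULL, 0);
        // B-tree keyed by a fixed-width date string.
        db.open(NULL, "history.db", NULL, DB_BTREE, DB_CREATE, 0);

        std::string date = "20100904";            // YYYYMMDD
        measurement m = { 42.0, { 0 } };
        Dbt key(const_cast<char*>(date.data()), date.size());
        Dbt data(&m, sizeof m);
        db.put(NULL, &key, &data, 0);

        // Iterate earliest to latest without holding everything in memory.
        Dbc* cur = NULL;
        db.cursor(NULL, &cur, 0);
        Dbt k, v;
        while (cur->get(&k, &v, DB_NEXT) == 0) {
            const measurement* p = static_cast<const measurement*>(v.get_data());
            std::cout.write(static_cast<const char*>(k.get_data()), k.get_size());
            std::cout << " -> " << p->value << '\n';
        }
        cur->close();
        db.close(0);
    }

    With date-ordered keys in a B-tree, the "iterate from earliest to latest"
    access pattern is just a cursor scan, and the library keeps only its cache
    in memory rather than all 50 000 000 entries.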

    Goran.
    Goran, Sep 4, 2010
    #2

  3. Pavel (Guest)

    Hicham Mouline wrote:
    > Hello,
    >
    > I have measurements done daily (work days) for the past 20 years or so, in
    > the order then of 5000 or so entries.
    > I currently have them in a text file (I've hand written the parser but I'll
    > move to boost::spirit eventually)
    >
    > The application is growing:
    > . I may move to 20 years worth of measures every few seconds and arrive at a
    > range of 50 000 000 entries. Each entry is probably 64 bytes.
    > . I may use a database
    > . I may receive the data over a network socket
    >
    > I have a class 'historical_data' that currently holds the 5000 in memory. A
    > standalone function (in the same namespace as that class) parses from text
    > file into that class.
    > My application typically iterates from earliest to latest on this data.
    >
    > I am wondering what incremental changes to introduce to the code I have to :
    > 1. Make some factory function to create a full 'historical_data' from text
    > file/database/network socket
    > 2. allow for 50 000 000 instead of 5000 entries, and possibly keep just a
    > part in memory and the rest on text file/database/network, and access this
    > transparently
    >
    > rds,
    >
    >

    Not enough information, so just assuming everything:

    If your standalone function has a string parameter for the file name,
    come up with a prefix to denote other data sources (e.g.
    "@tcp<ip-address>", "@database<data-source-description>", or just
    "<filename>" to support the current usage, given your filenames don't
    start with '@').

    Otherwise, introduce this parameter with the empty string default and
    load your old file when the argument is an empty string.
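
    A minimal sketch of that prefix dispatch, with stub loaders standing in
    for the existing parser and the future database/network back ends; the
    names load_from_file, load_from_tcp and load_from_database (and the stub
    historical_data) are made up for the example:

    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Stand-in for the OP's class; the real one holds the parsed entries.
    struct historical_data {
        std::vector<double> entries;
    };

    // Hypothetical loaders: load_from_file would wrap the existing
    // hand-written parser, the other two are the future back ends.
    historical_data load_from_file(const std::string& filename)
    { std::cout << "file: " << filename << '\n'; return historical_data(); }
    historical_data load_from_tcp(const std::string& address)
    { std::cout << "tcp: " << address << '\n'; return historical_data(); }
    historical_data load_from_database(const std::string& source)
    { std::cout << "db: " << source << '\n'; return historical_data(); }

    // Factory: choose the back end from the prefix; a plain name means the
    // current text-file behaviour, so existing callers keep working.
    historical_data make_historical_data(const std::string& source)
    {
        const std::string tcp = "@tcp";
        const std::string db  = "@database";
        if (source.compare(0, tcp.size(), tcp) == 0)
            return load_from_tcp(source.substr(tcp.size()));
        if (source.compare(0, db.size(), db) == 0)
            return load_from_database(source.substr(db.size()));
        if (!source.empty() && source[0] == '@')
            throw std::runtime_error("unknown data source: " + source);
        return load_from_file(source);
    }

    int main()
    {
        make_historical_data("measurements.txt");
        make_historical_data("@tcp192.168.0.1:4000");
        make_historical_data("@databasedsn=history");
    }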

    From your description, the API for the historical_data class itself
    does not need to change unless you provide access to the data "in
    place"; in that case you would have to change it so that client
    applications provide a buffer for the portion of data they want to
    process as they iterate, and then recompile/retest them all.

    Then, incrementally implement new functionality under the hood (keeping
    data in memory in parts, database and network access etc).
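
    Purely as an illustration of "keeping data in memory in parts", a minimal
    sketch of a historical_data that fetches one block of entries at a time
    from an abstract source; the entry layout, the block size and the
    entry_source interface are all assumptions, not something from the thread:

    #include <cstddef>
    #include <vector>

    struct entry { char bytes[64]; };          // assumed ~64-byte entry
    const std::size_t block_size = 4096;       // entries kept in memory at once

    // Abstract source: file, database or socket implementations go behind this.
    struct entry_source {
        virtual ~entry_source() {}
        virtual std::size_t count() const = 0;
        // Fill 'out' with entries [first, first + n), n <= block_size.
        virtual void read(std::size_t first, std::size_t n,
                          std::vector<entry>& out) = 0;
    };

    // Keeps only one block in memory; callers still see simple indexed
    // access from earliest to latest.
    class historical_data {
    public:
        explicit historical_data(entry_source& src)
            : src_(src), first_(0) { load(0); }

        std::size_t size() const { return src_.count(); }

        const entry& operator[](std::size_t i) {
            if (i < first_ || i >= first_ + block_.size())
                load(i - i % block_size);      // fetch the block holding i
            return block_[i - first_];
        }

    private:
        void load(std::size_t first) {
            std::size_t n = src_.count() - first;
            if (n > block_size) n = block_size;
            src_.read(first, n, block_);
            first_ = first;
        }

        entry_source& src_;
        std::vector<entry> block_;
        std::size_t first_;
    };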

    Hope this helps,
    Pavel
    Pavel, Sep 4, 2010
    #3
  4. Goran (Guest)

    On Sep 4, 7:37 am, Goran <> wrote:
    > A home-grown solution, even is very constrained and simplified through
    > assumptions of data structure characteristics, is likely...


    Whoops! Should have read "even __if__ very constrained and simplified..."
    Goran, Sep 4, 2010
    #4
  5. Jorgen Grahn (Guest)

    On Sat, 2010-09-04, Hicham Mouline wrote:
    > Hello,
    >
    > I have measurements done daily (work days) for the past 20 years or so, in
    > the order then of 5000 or so entries.
    > I currently have them in a text file (I've hand written the parser but I'll
    > move to boost::spirit eventually)


    If your parser is broken or you want to learn boost::spirit, that's a
    good idea. Otherwise not.

    > The application is growing:
    > . I may move to 20 years worth of measures every few seconds and arrive at a
    > range of 50 000 000 entries. Each entry is probably 64 bytes.


    That's an odd change -- things that happen once a day usually don't
    suddenly increase in frequency by a factor of 10000. Especially not if
    the frequency has been fixed for 20 years.

    > . I may use a database
    > . I may receive the data over a network socket
    >
    > I have a class 'historical_data' that currently holds the 5000 in memory. A
    > standalone function (in the same namespace as that class) parses from text
    > file into that class.
    > My application typically iterates from earliest to latest on this data.


    So it's typically a waste to keep all the data in class historical_data.

    > I am wondering what incremental changes to introduce to the code I have to :
    > 1. Make some factory function to create a full 'historical_data' from text
    > file/database/network socket


    Don't you have one already? Just add two more. I think it's overkill
    to do fancy Design Pattern stuff here. However:

    > 2. allow for 50 000 000 instead of 5000 entries, and possibly keep just a
    > part in memory and the rest on text file/database/network, and access this
    > transparently


    Here it becomes obvious that class historical_data is an inefficient
    design. You cannot do any processing until the I/O is done, and you
    must fit it all into memory.

    I think you should switch focus to the individual samples (let's say
    class Sample) and ways to operate on sequences of Samples. For some
    uses you may need to feed your samples into a std::vector<Sample> or
    similar and then process it; for other uses you can just let them stream
    by. It's the Unix pipe/stream idea.

    Reading from text file or TCP socket ... one design which I find
    fast and flexible is this one:

    - read a chunk of data from somewhere into a buffer
    - try to parse and use as many Samples from it as possible.
    This may be zero or many Samples, and it may or may not
    consume the whole buffer
    - remove the consumed part of the buffer
    - read another chunk of data, appending to the buffer
    - try to parse, etc

    The part which needs to know about your class Sample can look like

    pair<vector<Sample>, const char*>
    parse(const char* begin, const char* end);

    or if it's responsible for /using/ the Samples too:

    const char* parse(const char* begin, const char* end);
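
    To make that concrete, a minimal sketch built around the first signature,
    reading chunks from std::cin; the Sample layout (one timestamp and one
    value per text line) and the chunk size are assumptions for illustration:

    #include <cstring>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Assumed sample type: one timestamped value per text line.
    struct Sample {
        std::string timestamp;
        double value;
    };

    // Parse as many complete lines as possible from [begin, end).
    // Returns the samples found and a pointer just past the consumed input.
    std::pair<std::vector<Sample>, const char*>
    parse(const char* begin, const char* end)
    {
        std::vector<Sample> out;
        const char* pos = begin;
        for (;;) {
            const char* nl = static_cast<const char*>(
                std::memchr(pos, '\n', end - pos));
            if (!nl)
                break;                 // incomplete line: leave it in the buffer
            std::string text(pos, nl);
            std::istringstream line(text);
            Sample s;
            if (line >> s.timestamp >> s.value)
                out.push_back(s);
            pos = nl + 1;
        }
        return std::make_pair(out, pos);
    }

    int main()
    {
        std::vector<char> buffer;
        char chunk[4096];
        while (std::cin.read(chunk, sizeof chunk), std::cin.gcount() > 0) {
            // Append the new chunk to whatever was left over last time.
            buffer.insert(buffer.end(), chunk, chunk + std::cin.gcount());

            std::pair<std::vector<Sample>, const char*> r =
                parse(&buffer[0], &buffer[0] + buffer.size());

            // Use the Samples as they stream by.
            for (std::size_t i = 0; i < r.first.size(); ++i)
                std::cout << r.first[i].timestamp << ' '
                          << r.first[i].value << '\n';

            // Remove the consumed part of the buffer, keep the remainder.
            buffer.erase(buffer.begin(),
                         buffer.begin() + (r.second - &buffer[0]));
        }
    }

    The same loop works unchanged whether the chunks come from a file or a
    TCP socket, which is the point of separating the reading from the parsing.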

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Sep 7, 2010
    #5
