Including large amounts of data in C++ binary

bcomeara

I am writing a program which needs to include a large amount of data.
Basically, the data are p values for different possible outcomes from
trials with different number of observations (the p values are
necessarily based on slow simulations rather than on a standard
function, so I estimated them once and want the program to include
this information). Currently, I have this stored as a vector of
vectors of varying sizes (first vector is indexed by number of
observations for the trial; for each number of observations, there is
a vector containing a p value for different numbers of successes, with
these vectors getting longer as the number of observations (and
therefore possible successes) increases). I created a class containing
this vector of vectors; my program, on starting, creates an object of
this class. However, the file containing just this class is ~50,000
lines long and 10 MB in size, and takes a great deal of time to
compile, especially with optimization turned on. Is there a better way
of building large amounts of data into C++ programs? I could just
include a separate datafile, and have the program call it upon
starting, but then that would require having the program know where
the file is, even when I distribute it. In case this helps, I am
already using the GNU Scientific Library in the program, so using any
functions there is an easy option. My apologies if this question has
an obvious, standard solution I should already know about.

Excerpt from class file (CDFvectorholder) containing vector of
vectors:

vector<vector<double> > CDFvectorholder::Initialize() {
    vector<vector<double> > CDFvectorcontents;
    vector<double> contentsofrow;
    contentsofrow.push_back(0.33298);
    contentsofrow.push_back(1);
    CDFvectorcontents.push_back(contentsofrow); //comparison where ntax=3
    contentsofrow.clear();
    contentsofrow.push_back(0.07352);
    contentsofrow.push_back(0.14733);
    contentsofrow.push_back(0.33393);
    contentsofrow.push_back(0.78019);
    contentsofrow.push_back(1);
    CDFvectorcontents.push_back(contentsofrow); //comparison where ntax=4
    contentsofrow.clear();
    contentsofrow.push_back(0.01209);
    contentsofrow.push_back(0.03292);
    contentsofrow.push_back(0.04202);
    contentsofrow.push_back(0.0767);
    contentsofrow.push_back(0.13314);
    contentsofrow.push_back(0.23417);
    contentsofrow.push_back(0.40921);
    contentsofrow.push_back(0.58934);
    contentsofrow.push_back(0.82239);
    contentsofrow.push_back(0.98537);
    contentsofrow.push_back(1);
    CDFvectorcontents.push_back(contentsofrow); //comparison where ntax=5
    //ETC
    return CDFvectorcontents;
}

and the main program file, initializing the vector of vectors:

vector<vector<double> > CDFvector;
CDFvectorholder bob;
CDFvector=bob.Initialize();

and using it:

double cdfundermodel=CDFvector[integerB][integerA];

Thank you,
Brian O'Meara
 
Jim Langston

Data does not belong in code. The data should go in a separate file.
Normally this data file would be in the same directory as the executable.

If you really think the user will lose the data file, you can do the trick
of appending it to the end of the executable (if your OS allows it).
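
A minimal sketch of the separate-file approach, assuming a plain-text format that the thread does not specify: one row of the table per line, values separated by whitespace. The function name read_cdf_table and the file name in the note below are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one table row per text line, with
// whitespace-separated values on each line. Lines that contain no
// numbers are skipped.
std::vector<std::vector<double> > read_cdf_table(std::istream& in) {
    std::vector<std::vector<double> > table;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<double> row;
        double value;
        while (fields >> value) {
            row.push_back(value);
        }
        if (!row.empty()) {
            table.push_back(row);
        }
    }
    return table;
}
```

At start-up the program would open the file (for instance a std::ifstream on "cdf_table.txt", an assumed name) and pass the stream to read_cdf_table. Locating the file relative to the executable is platform-specific and is not covered by this sketch.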
 
Victor Bazarov

> I am writing a program which needs to include a large amount of data.
> Basically, the data are p values for different possible outcomes from
> trials with different number of observations (the p values are
> necessarily based on slow simulations rather than on a standard
> function, so I estimated them once and want the program to include
> this information).

I sincerely hope that the data reside in a separate, include-able
source file, which is generated by some other program somehow, instead
of being typed in by a human reading some other print-out or protocol
of some experiment...

> Currently, I have this stored as a vector of
> vectors of varying sizes (first vector is indexed by number of
> observations for the trial; for each number of observations, there is
> a vector containing a p value for different numbers of successes, with
> these vectors getting longer as the number of observations (and
> therefore possible successes) increases). I created a class containing
> this vector of vectors; my program, on starting, creates an object of
> this class. However, the file containing just this class is ~50,000
> lines long and 10 MB in size, and takes a great deal of time to
> compile, especially with optimization turned on. Is there a better way
> of building large amounts of data into C++ programs?

Something like

------------------- experiments.cpp (generated)
#include <vector>

namespace DATA {
double data_000[5] = { 0.0, 1., 2.2, 3.33, 4.444 };
double data_001[7] = { 0.0, 1.1, 2.222, 3.3333, 4.44444, 5.55, 6.66 };
// ....
double data_042[3] = { 1.1, 2.22, 3.333 };

// one std::vector per row, each built from its static array
std::vector<double> data[] = {
    std::vector<double>(data_000,
        data_000 + sizeof(data_000) / sizeof(double)),
    std::vector<double>(data_001,
        data_001 + sizeof(data_001) / sizeof(double)),
    // ...
    std::vector<double>(data_042,
        data_042 + sizeof(data_042) / sizeof(double)),
};
} // namespace DATA

------------------- my_vectors.cpp
#include "experiments.cpp"

// DATA::data is a plain array, so it has no begin()/end() members;
// use pointer arithmetic to form the iterator range instead
std::vector<std::vector<double> >
CDFvectorcontents(DATA::data,
    DATA::data + sizeof(DATA::data) / sizeof(DATA::data[0]));

-----------------------------------

?
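
The generated file above has to come from somewhere. Here is a rough sketch of what such a generator might look like, assuming the simulation results are already held in memory as a vector of vectors. The function name generate_rows is hypothetical, rows are assumed to be non-empty, and the output follows the data_NNN naming used above.

```cpp
#include <cstddef>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical generator: returns C++ source text declaring one static
// array per table row, named data_000, data_001, and so on.
// Assumes every row holds at least one value.
std::string generate_rows(const std::vector<std::vector<double> >& table) {
    std::ostringstream out;
    // Enough significant digits for each double to round-trip exactly.
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    out << "namespace DATA {\n";
    for (std::size_t r = 0; r < table.size(); ++r) {
        out << "const double data_" << std::setfill('0') << std::setw(3) << r
            << "[" << table[r].size() << "] = { ";
        for (std::size_t c = 0; c < table[r].size(); ++c) {
            if (c != 0) {
                out << ", ";
            }
            out << table[r][c];
        }
        out << " };\n";
    }
    out << "} // namespace DATA\n";
    return out.str();
}
```

The returned text would be written to the generated source file as a build step, so the 10 MB file is never edited by hand.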

V
 
Gianni Mariani

If it is truly a "large" amount of data (say, more than 4 MB compiled), then you
can think of using a container that can be statically initialized.

e.g.

// in header
struct datatype
{
    double coeffs1[5];
    double coeffs2[100];
    double coeffs3[20];
};

extern datatype data;

// in data file

datatype data = {
    { 0.2, 0.4, 0.6 },
    { 1.1, 1.2 },
    { 0.1, 0.2 }
};


You could write a wrapper class that "looks" like a const std::vector
that wraps either a std::vector or a regular array so that you don't
need to make copies of the data you have.
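
One possible shape for such a wrapper, as a sketch only: the class name ConstDoubleView and the sample array row_ntax4 are hypothetical, and only the read-only members needed for lookup and iteration are provided. The view does not own the data, so the wrapped array or vector must outlive it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical read-only view over a contiguous run of doubles. It
// mimics the lookup and iteration parts of a const std::vector<double>
// without owning or copying the underlying data.
class ConstDoubleView {
public:
    // Wrap a plain array given its first element and element count.
    ConstDoubleView(const double* first, std::size_t count)
        : first_(first), count_(count) {}

    // Wrap an existing vector; the vector must outlive the view.
    explicit ConstDoubleView(const std::vector<double>& v)
        : first_(v.empty() ? 0 : &v[0]), count_(v.size()) {}

    std::size_t size() const { return count_; }
    bool empty() const { return count_ == 0; }
    const double& operator[](std::size_t i) const { return first_[i]; }
    const double* begin() const { return first_; }
    const double* end() const { return first_ + count_; }

private:
    const double* first_;
    std::size_t count_;
};

// Sample statically initialized row (the ntax=4 values from the question).
static const double row_ntax4[] = { 0.07352, 0.14733, 0.33393, 0.78019, 1 };
```

A view over the sample row would be built as ConstDoubleView(row_ntax4, sizeof(row_ntax4) / sizeof(row_ntax4[0])). With a C++20 compiler, std::span<const double> covers the same ground.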
 
James Kanze

> I am writing a program which needs to include a large amount of data.
> Basically, the data are p values for different possible outcomes from
> trials with different number of observations (the p values are
> necessarily based on slow simulations rather than on a standard
> function, so I estimated them once and want the program to include
> this information). Currently, I have this stored as a vector of
> vectors of varying sizes (first vector is indexed by number of
> observations for the trial; for each number of observations, there is
> a vector containing a p value for different numbers of successes, with
> these vectors getting longer as the number of observations (and
> therefore possible successes) increases). I created a class containing
> this vector of vectors; my program, on starting, creates an object of
> this class. However, the file containing just this class is ~50,000
> lines long and 10 MB in size, and takes a great deal of time to
> compile, especially with optimization turned on.

If it's just data, optimization should make no difference.

> Is there a better way
> of building large amounts of data into C++ programs? [...]
> Excerpt from class file (CDFvectorholder) containing vector of
> vectors:
> vector<vector<double> > CDFvectorholder::Initialize() {
> [...]
> return CDFvectorcontents;
> }

And this is called at program start-up? Start-up isn't going to
be very fast.

> and the main program file, initializing the vector of vectors:
> vector<vector<double> > CDFvector;
> CDFvectorholder bob;
> CDFvector=bob.Initialize();
> and using it:
> double cdfundermodel=CDFvector[integerB][integerA];

I'd say that this is one case where I'd use C-style arrays and static
initialization. It will still take some time to compile, but
nowhere near as much as if you call a function on a templated
class for each element. And start-up time will be effectively
zero.

If you do need some of the additional features of std::vector,
then you can still use the static, C-style array to initialize
it, e.g.
std::vector<double>( startAddress, endAddress ) ;
(Whatever code generates the C-style array can also be used to
generate the startAddress and endAddress variables.)
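
For the ragged table in the question, one way to apply this is a single flat array plus a table of row offsets. This is a sketch under assumptions: the namespace cdf and the names values, row_start, lookup and row_as_vector are hypothetical, and only the three rows shown in the question are included. Row 0 is assumed to correspond to ntax=3.

```cpp
#include <cstddef>
#include <vector>

namespace cdf {

// All rows stored back to back. Only the three rows from the question
// are shown; a generator would emit the full table.
const double values[] = {
    0.33298, 1,                                           // ntax=3
    0.07352, 0.14733, 0.33393, 0.78019, 1,                // ntax=4
    0.01209, 0.03292, 0.04202, 0.0767, 0.13314, 0.23417,
    0.40921, 0.58934, 0.82239, 0.98537, 1                 // ntax=5
};

// row_start[i] is the index in values[] where row i begins; the last
// entry is one past the end of the final row.
const std::size_t row_start[] = { 0, 2, 7, 18 };

const std::size_t row_count = sizeof(row_start) / sizeof(row_start[0]) - 1;

inline std::size_t row_size(std::size_t row) {
    return row_start[row + 1] - row_start[row];
}

// Equivalent of CDFvector[row][col], with no bounds checking.
inline double lookup(std::size_t row, std::size_t col) {
    return values[row_start[row] + col];
}

// Copies one row into a std::vector, for when vector features are needed.
inline std::vector<double> row_as_vector(std::size_t row) {
    return std::vector<double>(values + row_start[row],
                               values + row_start[row + 1]);
}

} // namespace cdf
```

Both arrays are constant data initialized from literals, so nothing has to run at start-up to build them, and the compiler sees two array initializers rather than thousands of push_back calls.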