To dynamically bind, or not?

Dom Bannon

I do a fair amount of C++ but, I have to confess, very little
hard-core OO stuff. I've got a problem with a design decision,
deciding whether to use dynamic binding or not, and I'd appreciate
some input.

The problem is quite straightforward. I had to knock up a quick
database analysis program a few months ago. This reads a large CSV
file, creates an internal database, and searches the records of the
database.

I then had to repeat this, for a slightly different CSV file, and a
slightly different internal database, with slightly different
searches. So, I've now got several thousand lines of code in 2
slightly different programs, which I need to clean up, to stop them
diverging.

The architecture is very straightforward. I've defined a class which
is the basic record, and I make a std::set of these records (a few
million of them) when I read in the CSV file. This set is the central
data structure, and I pass a reference to this set to my database read
and write functions, and to the search functions. The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).

My basic plan is/was:

1 - create a base class which is common to both recordA and recordB;
this contains about half of the old fields

2 - derive recordA and recordB from this base class, filling in the
remaining fields

3 - change the central dataset so that it's now a set<baseClass*>,
instead of set<recordA> or set<recordB>

4 - pass around the new dataset to the file read and write functions,
and the search functions. These are now polymorphic.
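
In code, the plan amounts to roughly this (names are just placeholders; note the set of pointers needs a comparator so it orders by record contents rather than by pointer value):

#include <set>

struct baseRecord
{
    // roughly half of the old fields, common to recordA and recordB
    virtual ~baseRecord() {}
    virtual bool qualifies() const = 0;   // whatever the search functions test
};

struct recordA : public baseRecord
{
    // the A-specific fields
    virtual bool qualifies() const;       // also looks at the A-specific fields
};
// recordB is analogous

struct derefLess
{
    // compares *a and *b, so the set is ordered by record contents, not addresses
    bool operator()(const baseRecord* a, const baseRecord* b) const;
};

typedef std::set<baseRecord*, derefLess> dataset;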

However, I'm now wondering if this is worth the effort. First off, I
have to keep all the records on the heap myself, and maintain a set of
pointers to these records, instead of just letting the set hold the
actual record.

Second, all the functions which actually do any work (searching, for
example) now get passed a pointer to a base class. Every time a member
function gets called (millions of times, for millions of records) the
program has to execute the same bit of compiler code to determine if
it's actually calling the code for recordA or recordB. Note that the
answer is always the same for an entire program run - we will always
call the recordA code in run A, and the recordB code in run B.

I'm now starting to think that the record inheritance is worthwhile,
but I might just as well keep the old set code, and just #ifdef the
two separate requirements. The required object type is always known
during a program run, so is dynamic binding actually of any use here?

Thoughts?

Thanks -

Dom
 
Alf P. Steinbach /Usenet

* Dom Bannon, on 20.01.2011 15:14:
[snip]
The architecture is very straightforward. I've defined a class which
is the basic record, and I make a std::set of these records (a few
million of them) when I read in the CSV file. This set is the central
data structure, and I pass a reference to this set to my database read
and write functions, and to the search functions. The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).


Thoughts?

Template.


Cheers & hth.,

- Alf
 
Dom Bannon

Template.

A template function? I don't think this would help - recordA and
recordB contain some different fields, and the search code checks the
value of some of these fields to decide whether the record qualifies.

Or a class templated on recordA/recordB that contains the set itself?

Am I getting warm?

:)

-Dom
 
Alf P. Steinbach /Usenet

* Dom Bannon, on 20.01.2011 16:36:
A template function? I don't think this would help - recordA and
recordB contain some different fields, and the search code checks the
value of some of these fields to decide whether the record qualifies.

Or a class templated on recordA/recordB that contains the set itself?

Am I getting warm?

:)

You can specialize the comparison/check function.
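
For example, something like this (only a sketch; "qualifies" is just a name I'm inventing here, and recordA/recordB are your record classes):

#include <set>

// Check function: primary template, specialized per record type.
template< class Record >
bool qualifies( Record const& );

template<>
bool qualifies<recordA>( recordA const& r )
{
    return true;   // ...test the A-specific fields here...
}

template<>
bool qualifies<recordB>( recordB const& r )
{
    return true;   // ...test the B-specific fields here...
}

// The search itself is then written only once.
template< class Record >
int countMatches( std::set<Record> const& db )
{
    int n = 0;
    for( typename std::set<Record>::const_iterator it = db.begin();
         it != db.end(); ++it )
    {
        if( qualifies( *it ) ) { ++n; }
    }
    return n;
}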


Cheers & hth.,

- Alf
 
James Kanze

* Dom Bannon, on 20.01.2011 15:14:
[snip]
The architecture is very straightforward. I've defined a class which
is the basic record, and I make a std::set of these records (a few
million of them) when I read in the CSV file. This set is the central
data structure, and I pass a reference to this set to my database read
and write functions, and to the search functions. The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).
Thoughts?
Template.

How? Dependent on what?

In my experience, this sort of problem is generally best handled
by an external code generator. Depending on the context, you
either define a very simple format, which you parse to generate
the C++ structs/classes and the SQL data base format, or you
extract the model from the data base, and parse it. The
generated code will contain not only the data members, but also
the SQL requests (or at least parts of them) needed to read and
write the data.
 
Öö Tiib

I do a fair amount of C++ but, I have to confess, very little
hard-core OO stuff. I've got a problem with a design decision,
deciding whether to use dynamic binding or not, and I'd appreciate
some input.

The problem is quite straightforward. I had to knock up a quick
database analysis program a few months ago. This reads a large CSV
file, creates an internal database, and searches the records of the
database.

Define "internal database". Usually "OO stuff" means encapsulating the
data in objects and that may be involves on your case moving some
logic and checks from read, write and search functions to member
functions of these "records".
I then had to repeat this, for a slightly different CSV file, and a
slightly different internal database, with slightly different
searches. So, I've now got several thousand lines of code in 2
slightly different programs, which I need to clean up, to stop them
diverging.

Several thousand SLOC is usually a small program. Things usually stay
maintainable if you keep the (largely self-contained) modules under
20,000 SLOC. If you feel that it needs cleaning up then perhaps you
have some huge functions? Try to keep functions under 100 lines long
and put them close to the data that they manipulate.

The architecture is very straightforward. I've defined a class which
is the basic record, and I make a std::set of these records (a few
million of them) when I read in the CSV file. This set is the central
data structure, and I pass a reference to this set to my database read
and write functions, and to the search functions. The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).

So it is actually a special-purpose database engine that you created,
optimized for a particular task, with a single-table database?

My basic plan is/was:

1 - create a base class which is common to both recordA and recordB;
this contains about half of the old fields

2 - derive recordA and recordB from this base class, filling in the
remaining fields

3 - change the central dataset so that it's now a set<baseClass*>,
instead of set<recordA> or set<recordB>

4 - pass around the new dataset to the file read and write functions,
and the search functions. These are now polymorphic.

I think it is hard to say what to do without seeing the code. It
seems that you don't need dynamic polymorphism, since you don't use
both record types in the same set. Things you can consider (some are
alternatives):

1. Give recordA and recordB a common interface (member functions
with the same signatures) that hides the inner differences.

2. Make the read, write and search functions function templates that
assume such a "record interface" from the record sets passed to them.

3. Use the common subset of recordA and recordB as a data member of
both, to reuse its code.

4. If the common subset needs feedback from recordA and recordB for
something, then use it as a base class and use virtual functions for
the feedback.

5. Where you need virtual functions you may use CRTP instead. It
makes things a bit faster, but virtual functions are fast enough, and
CRTP may be confusing for someone not familiar with it.

However, I'm now wondering if this is worth the effort. First off, I
have to keep all the records on the heap myself, and maintain a set of
pointers to these records, instead of just letting the set hold the
actual record.

Yes; especially since your set always contains records of the same
type, you do not need dynamic polymorphism.

Second, all the functions which actually do any work (searching, for
example) now get passed a pointer to a base class. Every time a member
function gets called (millions of times, for millions of records) the
program has to execute the same bit of compiler code to determine if
it's actually calling the code for recordA or recordB. Note that the
answer is always the same for an entire program run - we will always
call the recordA code in run A, and the recordB code in run B.

I'm now starting to think that the record inheritance is worthwhile,
but I might just as well keep the old set code, and just #ifdef the
two separate requirements. The required object type is always known
during a program run, so is dynamic binding actually of any use here?

Thoughts?

Macros and preprocessor slicing hurt badly. I have seen only
failures: people become afraid of maintaining their own code, so if
maintenance is needed it has to be rewritten anyway.

Templates are a bit better. People fear templates because a lot of
strange wizardry is done with them, but it feels like your case does
not need much template wizardry.

Your problem description sounds like you could use the curiously
recurring template pattern (it is not really wizardry). If you
dislike it you can use the usual virtual functions.
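
Roughly, CRTP looks like this (only a sketch; the field placeholders and function names are made up):

// Common part of both records; Derived is the concrete record type.
template <class Derived>
struct recordBase
{
    // ... the common fields ...

    bool qualifies() const
    {
        // "feedback" from the concrete record, resolved at compile time,
        // so there is no per-call virtual dispatch
        return static_cast<const Derived&>(*this).checkOwnFields();
    }
};

struct recordA : recordBase<recordA>
{
    // ... A-specific fields ...
    bool checkOwnFields() const;   // tests the A-specific fields
};

struct recordB : recordBase<recordB>
{
    // ... B-specific fields ...
    bool checkOwnFields() const;
};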
 
Goran

I do a fair amount of C++ but, I have to confess, very little
hard-core OO stuff. I've got a problem with a design decision,
deciding whether to use dynamic binding or not, and I'd appreciate
some input.

The problem is quite straightforward. I had to knock up a quick
database analysis program a few months ago. This reads a large CSV
file, creates an internal database, and searches the records of the
database.

I then had to repeat this, for a slightly different CSV file, and a
slightly different internal database, with slightly different
searches. So, I've now got several thousand lines of code in 2
slightly different programs, which I need to clean up, to stop them
diverging.

The architecture is very straightforward. I've defined a class which
is the basic record, and I make a std::set of these records (a few
million of them) when I read in the CSV file. This set is the central
data structure, and I pass a reference to this set to my database read
and write functions, and to the search functions. The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).

My basic plan is/was:

1 - create a base class which is common to both recordA and recordB;
this contains about half of the old fields

2 - derive recordA and recordB from this base class, filling in the
remaining fields

3 - change the central dataset so that it's now a set<baseClass*>,
instead of set<recordA> or set<recordB>

4 - pass around the new dataset to the file read and write functions,
and the search functions. These are now polymorphic.

However, I'm now wondering if this is worth the effort. First off, I
have to keep all the records on the heap myself, and maintain a set of
pointers to these records, instead of just letting the set hold the
actual record.

Second, all the functions which actually do any work (searching, for
example) now get passed a pointer to a base class. Every time a member
function gets called (millions of times, for millions of records) the
program has to execute the same bit of compiler code to determine if
it's actually calling the code for recordA or recordB. Note that the
answer is always the same for an entire program run - we will always
call the recordA code in run A, and the recordB code in run B.

I'm now starting to think that the record inheritance is worthwhile,
but I might just as well keep the old set code, and just #ifdef the
two separate requirements. The required object type is always known
during a program run, so is dynamic binding actually of any use here?

Thoughts?

The other guy told you: template :).

struct /*class, whatever*/ recordBase
{
    // type, field.
    t1 f1;
    t2 f2;
    t3 f3;
    // ...
};

void LoadFromDSVLine(const char* line, recordBase& rec)
{
    knockYourselfOutLoading(line, rec);
}

Up here, I presume that you have a simple delimiter-separated line of
text with your record fields in it. You might create a corresponding
constructor in recordBase/A/B, but that means adding one more
responsibility to your record class, and SOLID teaches us differently,
so...


program A:
#include <set>

struct recordA : public recordBase
{
    // A-specific fields
    ta1 fa1;
    ta2 fa2;
    // ...
    bool operator<(const recordA& rhs) const
    { return knockYourselfOutComparing(*this, rhs); }
};

void LoadFromDSVLine(const char* line, recordA& rec)
{
    LoadFromDSVLine(line, static_cast<recordBase&>(rec)); // common fields first
    // ... then load the A-specific fields from the rest of the line
}

typedef std::set<recordA> setA; // ordered by recordA::operator<

Then you do the same for program B.

Goran.

P.S.1 Instead of parsing stuff myself, I would try looking for a DSV
parser (there are some out there). I hope you did at least try that ;-)

P.S.2 I am intentionally using term D(elimiter)SV, not CSV, because I
am a bit of an i18n prick, and quite a big part of the world isn't
using a comma for these things.
 
Michael Doubez

* Dom Bannon, on 20.01.2011 15:14:
[snip]
The architecture is very straightforward. I've defined a class which
is the basic record, and I make a std::set of these records (a few
million of them) when I read in the CSV file. This set is the central
data structure, and I pass a reference to this set to my database read
and write functions, and to the search functions. The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).
Thoughts?
Template.

How?  Dependent on what?

In my experience, this sort of problem is generally best handled
by an external code generator.  Depending on the context, you
either define a very simple format, which you parse to generate
the C++ structs/classes and the SQL data base format, or you
extract the model from the data base, and parse it.  The
generated code will contain not only the data members, but also
the SQL requests (or at least parts of them) needed to read and
write the data.

And in case you don't want to use an external tool, you can also
use macros. It is quite ugly but effective, and you can then reuse the
files from project to project.

In a file, you define the structure of your rows:

RowFields.h:
/* TYPE  | Name | CSV Header ... | SQL field ... */
SINT4 ( Id , "Id Number" )
UINT2 ( Foo , "Foo Value" )
STRING( Bar , "Bar name" )
.....

And when you want to process the fields somewhere:

struct Row {
#define SINT4( _name, _header ) int32_t _name;
#define UINT2( _name, _header ) uint16_t _name;
....
#include "RowFields.h"
#undef SINT4
#undef UINT2
....
};

You can automate the common generation with another header file that
also performs the cleanup:

RowMacro.h:
#ifdef ROW_STRUCT_DEFINE
#define SINT4( _name, _header ) int32_t _name;
#define UINT2( _name, _header ) uint16_t _name;
...
#include ROW_FIELDS_FILE
#endif

And then:
struct Row {
#define ROW_STRUCT_DEFINE
#define ROW_FIELDS_FILE "RowFields.h"
#include "RowMacro.h"
};

The rule of thumb is to generate as little code as possible in the
macro; otherwise it is hell to debug. I tend to generate only the
minimal structures, with templated information and templated
visitation functions.

If your environment includes a recent enough compiler, you could
achieve the same with structs defining the fields and tuples for the
rows (except for switch-case, for which I have not found any technique
to emulate it).
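
For instance, something along these lines (C++11, a rough sketch; the field names are the same illustrative ones as above):

#include <cstddef>
#include <cstdint>
#include <string>
#include <tuple>
#include <type_traits>

// One small struct per field: the value plus its CSV header as metadata.
struct Id  { static const char* header() { return "Id Number"; } int32_t     value; };
struct Foo { static const char* header() { return "Foo Value"; } uint16_t    value; };
struct Bar { static const char* header() { return "Bar name"; }  std::string value; };

// A row is just a tuple of its fields; the two record types would differ
// only in which field structs appear in the tuple.
typedef std::tuple<Id, Foo, Bar> RowA;

// Compile-time recursion visiting every field of a row.
template <std::size_t I = 0, typename Visitor, typename... Fields>
typename std::enable_if<(I == sizeof...(Fields))>::type
for_each_field(std::tuple<Fields...>&, Visitor) {}

template <std::size_t I = 0, typename Visitor, typename... Fields>
typename std::enable_if<(I < sizeof...(Fields))>::type
for_each_field(std::tuple<Fields...>& row, Visitor visit)
{
    visit(std::get<I>(row));           // the visitor sees header() and value
    for_each_field<I + 1>(row, visit);
}

A visitor is then just a functor with a templated operator(), e.g. one that prints F::header() and f.value for each field.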
 
Jorgen Grahn

I do a fair amount of C++ but, I have to confess, very little
hard-core OO stuff. I've got a problem with a design decision,
deciding whether to use dynamic binding or not, and I'd appreciate
some input.

I guess your "dynamic binding" is the same as "run-time polymorphism".
That term is more often used here.

....
The records are
different in the 2 programs, but about half the fields are common; one
of them is 60 bytes when coded on disk (recordA), and the other is 72
bytes (recordB).

My basic plan is/was:

1 - create a base class which is common to both recordA and recordB;
this contains about half of the old fields

2 - derive recordA and recordB from this base class, filling in the
remaining fields

3 - change the central dataset so that it's now a set<baseClass*>,
instead of set<recordA> or set<recordB>

4 - pass around the new dataset to the file read and write functions,
and the search functions. These are now polymorphic.

However, I'm now wondering if this is worth the effort. First off, I
have to keep all the records on the heap myself, and maintain a set of
pointers to these records, instead of just letting the set hold the
actual record.

Second, all the functions which actually do any work (searching, for
example) now get passed a pointer to a base class. Every time a member
function gets called (millions of times, for millions of records) the
program has to execute the same bit of compiler code to determine if
it's actually calling the code for recordA or recordB. Note that the
answer is always the same for an entire program run - we will always
call the recordA code in run A, and the recordB code in run B.

I'm now starting to think that the record inheritance is worthwhile,
but I might just as well keep the old set code, and just #ifdef the
two separate requirements. The required object type is always known
during a program run, so is dynamic binding actually of any use here?

I can only speak for myself. I find myself avoiding run-time
polymorphism in cases where it really /isn't/ a run-time decision.
Mostly because I find inheritance hard to understand; it often looks
unnatural to me. I also worry (perhaps needlessly) about performance.

The technique I'd try first is splitting out the details into helper
functions. If two similar functions are 200 lines each, it looks bad.
If they are 20 lines each, I don't mind so much. After that rewrite
it's also easier to see if you can merge it into a template.
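
For example (a sketch with made-up names, assuming your recordA/recordB classes and that the two search functions differ only in the record-specific parts):

#include <set>

// The record-specific details go into small overloaded helpers ...
bool extraFieldsMatch(const recordA& r);   // ~20 lines each; easy to compare
bool extraFieldsMatch(const recordB& r);   // side by side and keep in sync
void reportMatch(const recordA& r);        // ditto for the output side
void reportMatch(const recordB& r);

// ... and what is left is identical in both programs, so it is easy to
// see that it can be merged into a single template.
template <class Record>
void runSearch(const std::set<Record>& db)
{
    for (typename std::set<Record>::const_iterator it = db.begin();
         it != db.end(); ++it)
    {
        if (extraFieldsMatch(*it))
            reportMatch(*it);
    }
}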

/Jorgen
 
