A program that writes code: should it use 'string'?

  • Thread starter Ramon F Herrera

Ramon F Herrera

I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?

The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?

TIA,

-RFH


-------------

void SynthesizeTextField(CompleteField fullTextField)
{
    string code;
    string baseFieldname = "text";
    stringstream ss;
    static int subindex = 1;

    code = "Field ";
    code += baseFieldname;
    ss << subindex;
    code += ss.str();
    code += " ";
    code += "doc.FieldCreate(\"";
    code += baseFieldname;
    code += ss.str();
    code += "\", Field::e_text, \"\", \"\");";

    subindex++;
}
 

Kai-Uwe Bux

Ramon said:
I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?


Recommended is to measure before you optimize. Write the program so that it
is easy to understand. When (and only when) you have a performance problem,
don't guess what the cause might be; instead, use a profiler to identify
the bottleneck and then do something about it.

The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?
[snip]

If profiling shows that appending to the string is too costly, reserving a
certain capacity would be the first thing to try. It's the least intrusive
measure.
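As a minimal sketch of what reserving capacity could look like: the helper name BuildFields and its expectedBytes parameter are made up for illustration, and std::to_string assumes C++11 or later. The generated text mirrors the snippet in the original post.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds `count` field declarations in one string.
// The capacity is reserved once up front, so the appends below do not
// reallocate as long as the total stays within expectedBytes.
std::string BuildFields(int count, std::size_t expectedBytes)
{
    std::string code;
    code.reserve(expectedBytes);  // only a hint; the string still grows if exceeded

    for (int i = 1; i <= count; ++i) {
        const std::string name = "text" + std::to_string(i);
        code += "Field ";
        code += name;
        code += " doc.FieldCreate(\"";
        code += name;
        code += "\", Field::e_text, \"\", \"\");\n";
    }
    return code;
}
```

A call such as BuildFields(1000, 100 * 1024) keeps the rest of the code unchanged, which is why reserving is the least intrusive thing to try once a profiler points at the appends.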


Best

Kai-Uwe Bux
 

James Kanze

I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?
The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?

void SynthesizeTextField(CompleteField fullTextField)
{
    string code;
    string baseFieldname = "text";
    stringstream ss;
    static int subindex = 1;

    code = "Field ";
    code += baseFieldname;
    ss << subindex;
    code += ss.str();
    code += " ";
    code += "doc.FieldCreate(\"";
    code += baseFieldname;
    code += ss.str();
    code += "\", Field::e_text, \"\", \"\");";

    subindex++;
}

For starters, I'd generate (or support generation) directly into
the output stream. Something like:

std::ostream&
SynthesizeTextField(
    std::ostream& dest,
    ... )
{
    // ...
    return dest ;
}

You're formatting here (some of the data is numeric,
apparently), so you might as well treat the entire thing as a
stream. And you'll certainly be outputting it in the end;
there's not much you can do with C++ source code within the
program, so you might as well generate directly into the output
stream, and never build the string at all.
 

Juha Nieminen

Ramon said:
I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?

I think it's ok. You could also say something like
"code.reserve(100*1024);" which allocates 100kB (or any other
amount you feel is about correct) of memory for it so that it
never has to resize (unless you exceed that limit, of course),
which might make it slightly more efficient.
 

Pascal J. Bourguignon

Ramon F Herrera said:
I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?

The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?
[...]
code += "doc.FieldCreate(\"";
[...]

No, you should not use strings to generate code. Code is a syntax
tree. You should have a tree of objects:

Lhs* lhs=new Variable("pi_squared");
Rhs* rhs=new Variable("pi");
Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
cout<<code->generate();

would produce:

pi_squared=pi*pi;
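For readers who want to compile the idea, here is a hedged, self-contained sketch. The class names follow the post, but the interfaces are guesses, the Lhs/Rhs types are folded into a single Expr base for brevity, and std::shared_ptr (C++11) stands in for the raw pointers so that the same "pi" node can sit under both operands.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical base class for anything that can print itself as source text.
struct Expr {
    virtual ~Expr() = default;
    virtual std::string generate() const = 0;
};

struct Variable : Expr {
    explicit Variable(std::string n) : name(std::move(n)) {}
    std::string generate() const override { return name; }
    std::string name;
};

struct Multiply : Expr {
    Multiply(std::shared_ptr<Expr> l, std::shared_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    std::string generate() const override {
        return lhs->generate() + "*" + rhs->generate();
    }
    std::shared_ptr<Expr> lhs, rhs;
};

struct Assignment {
    Assignment(std::shared_ptr<Variable> t, std::shared_ptr<Expr> v)
        : target(std::move(t)), value(std::move(v)) {}
    std::string generate() const {
        return target->generate() + "=" + value->generate() + ";";
    }
    std::shared_ptr<Variable> target;
    std::shared_ptr<Expr> value;
};

// Builds the statement from the post and returns its generated text.
std::string GeneratePiSquared()
{
    auto lhs = std::make_shared<Variable>("pi_squared");
    auto rhs = std::make_shared<Variable>("pi");
    Assignment code(lhs, std::make_shared<Multiply>(rhs, rhs));
    return code.generate();
}
```

Note that the tree still ends in text: generate() concatenates strings, so the question of how to accumulate the output remains.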
 

James Kanze

I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?
The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?
[...]
code += "doc.FieldCreate(\"";
[...]
No, you should not use strings to generate code. Code is a
syntax tree.

That depends a lot on the code. The compiler may treat it as a
syntax tree, but most of the time I'm generating code, it's
fairly flat (tables and that sort of stuff). And of course, in
the end, you need text, to feed to the compiler.
You should have a tree of objects:

Lhs* lhs=new Variable("pi_squared");
Rhs* rhs=new Variable("pi");
Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
cout<<code->generate();
would produce:
pi_squared=pi*pi;

I think you've missed the question. The original poster may
actually be already doing that, for all we know. The question
concerned the generation of the code, not the source from which
it was generated. And the code itself must be text (at least as
the question was posed).

Of course, I agree that you don't have to generate that text
entirely in one std::string object. Regardless of the source,
you should (usually) output it directly to an ostream (which
could be an ostringstream *if* you need the text in the process,
but usually, it will be an ofstream, I think).
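A small sketch of that point, with made-up function names (EmitFields, FieldsAsString, FieldsToFile): the generator is written once against std::ostream, and the caller decides whether the text lands in memory or in a file. Passing a std::string path to std::ofstream assumes C++11.

```cpp
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical generator: writes `count` declarations to any output stream.
void EmitFields(std::ostream& dest, int count)
{
    for (int i = 1; i <= count; ++i) {
        dest << "Field text" << i << " doc.FieldCreate(\"text" << i
             << "\", Field::e_text, \"\", \"\");\n";
    }
}

// In-process case: capture the text in a string via ostringstream.
std::string FieldsAsString(int count)
{
    std::ostringstream out;
    EmitFields(out, count);
    return out.str();
}

// Usual case: write straight to a file. Returns false on I/O failure.
bool FieldsToFile(const std::string& path, int count)
{
    std::ofstream out(path);
    if (!out) return false;
    EmitFields(out, count);
    out.flush();
    return static_cast<bool>(out);
}
```

Only FieldsAsString ever holds the whole output in memory; FieldsToFile lets the stream buffer handle it, so the size of the generated code stops being a concern for the string.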
 

James Kanze

This is C++, not Java, lose the "new" abuse:

He's building a tree. That pretty much required dynamic
allocation.
// class Variable;
// class Statement;
// class Assignment: public Statement;
Variable lhs("pi_squared");
Variable rhs("pi");
Assignment code(lhs, Multiply(rhs, rhs));

Unless you've got dynamic allocation of the nodes somewhere
hidden in the constructors, this is not going to work. And of
course, it doesn't work if the expression is the result of
parsing some external data either.
cout << code.generate();
// or
cout << Assignment(Variable("pi_squared"),
    Multiply(Variable("pi"), Variable("pi"))).generate();
Your tree of object approach is probably superior as
complexity increases. For simple problems direct construction
in a string/ostream is likely to be sufficient but if you have
a lot of complex code generation to do, the cost of creating
the code object hierarchy is likely to be worthwhile.

Tree or not, you'll have to either build a string or generate
text directly into an ostream sooner or later. If I understand
the original poster correctly, his question concerned the
efficiency of using a string when the size of the code became
large; he's already solved the problem of the source of the code
(tree or otherwise).

I'll admit that I generate a lot of code automatically, and I've
never used a syntax tree to do so. But most of the code is just
tables, or a function with a single switch statement (which is
also a table of sorts). Or the code is generated from a
template (general sense of the word, not a C++ template).
 

Juha Nieminen

Pascal said:
No, you should not use strings to generate code. Code is a syntax
tree. You should have a tree of objects:

Why make things more complicated than necessary? You converted his
easy-to-read code into a mess of pointers and dynamically allocated
objects. What for?
 

Puppet_Sock

I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?

The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?

Here's your code snippet.

void SynthesizeTextField(CompleteField fullTextField)
{
    string code;
    string baseFieldname = "text";
    stringstream ss;
    static int subindex = 1;

    code = "Field ";
    code += baseFieldname;
    ss << subindex;
    code += ss.str();
    code += " ";
    code += "doc.FieldCreate(\"";
    code += baseFieldname;
    code += ss.str();
    code += "\", Field::e_text, \"\", \"\");";

    subindex++;
}

I don't see that fullTextField is used.

I don't see that code is used after it is filled.
Seems to be no way for it to get out of the function.

Not really possible to answer your question without a
lot of detailed consideration of your problem specs.

For example: The snippet shows a lot of appending,
and not much else. Not much help there deciding on
what to do about growing data set size.

You need to think about things like:
- Will the growing be only at the end or the middle or front?
- Will you need to stick data into the middle of the
target data? For example, will you need to insert
words into the middle of the data you are building?
- Will you want to be doing edit-in-place type actions?
For example, sorting on keywords, user defined edits, etc.
- Will you need to do searching in the data? Sorting on
keywords, analysis on trends, or anything like that.
- Will you want to do any syntax analysis? Things like
search for well formed lines of code, and so on.
- Any other complications of increased scope you can
pry out of the folks setting the project.

If you can figure out which, if any, of these is likely,
then you can pick a data structure that will accommodate
them easier. That way you can get ahead of your client
asking for new features.

On the other hand, if you are confident that none of that
sort of thing is ever going to happen, then pick the most
simple way of doing things that you can. That will be
the easiest to update if it does start to degrade.
Socks
 

Pascal J. Bourguignon

Juha Nieminen said:
Why make things more complicated than necessary? You converted his
easy-to-read code into a mess of pointers and dynamically allocated
objects. What for?

As a first step toward implementing Greenspun's Tenth Law, of course...
 

James Kanze

Looking at the proposed syntax above, I don't think that was
the reason for the "new" overflow syntax so I maintain my
opinion.

I'm not sure what you mean by "overflow" syntax, but Pascal
explicitly said that you should have a tree, so I think we have
to assume that he was building a tree.
This could be true for:
Lhs* lhs=new Variable("pi_squared");
Rhs* rhs=new Variable("pi");
Rhs* rhs2=new Variable("pi");
Assignment code(lhs, new Multiply(rhs,rhs2));

OK, so his code builds a directed acyclic graph, instead of a
tree. What does that change?
But in the code as presented:
1- Multiply can't get double ownership of rhs unless its
constructor is convoluted. If it gets basic ownership of the
dynamically allocated object it is given, Multiply(rhs, rhs)
is probably a bug.

First, I suspect that the posted code was just a hint, and not
meant to be polished, finished, fully working code. Second, I
don't quite follow your points about "ownership". If you're
building a directed acyclic graph, then ownership is not really
a relevant issue; if there is ownership, it is shared by all
parents, but typically, you'll implement some sort of garbage
collection, and not worry about it. If you're not using the
Boehm collector, you'll allocate all of the nodes from a pool,
with a pool for each expression, and you'll drop the entire pool
when you're done with the expression. Or, since the graph is
acyclic, you can even use boost::shared_ptr if performance isn't
an issue (and the impact of boost::shared_ptr is
probably small enough to make it not an issue).
2-
Statement* code=new Assignment(/*...*/);
std::cout << code->generate();
is very hard to justify. To me that's clear dynamic
allocation abuse. Of course, "code" could later be added to a
statement collection but that was not in the presented code so
dynamic allocation there was unjustified.

Except that in a larger context, it's likely that you can't
allocate Statement (or any syntax element) on the stack.
(Unless you have full garbage collection, of course.)
3- The code as presented will leak if either of the 2nd, 3rd
or 4th "new" throws.

Without seeing the actual classes involved, I can't say that.
Probably, he's using the Boehm collector; this is typically the
sort of thing where garbage collection shines. Or he's defined
an operator new/operator delete in the base class
which allocates from a pool, and he just tells the pool to drop
everything when he's through with the expression, at a higher
level. (That's the way I usually handle syntax trees when I
can't use the Boehm collector.) Or maybe he's made the
constructors nothrow, and replaced the new_handler to abort, so
that the entire code is guaranteed no throw.
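To make the pool idea concrete, here is a simplified sketch. It is not the class-level operator new/operator delete described above: a hypothetical NodePool object owns every node and frees them together when it goes out of scope. All names are assumptions, and std::make_unique requires C++14.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    virtual ~Node() = default;
    virtual std::string generate() const = 0;
};

// Hypothetical pool: owns every node created through make() and destroys
// them all at once when the pool itself is destroyed.
class NodePool {
public:
    template <typename T, typename... Args>
    T* make(Args&&... args) {
        auto node = std::make_unique<T>(std::forward<Args>(args)...);
        T* raw = node.get();
        nodes_.push_back(std::move(node));  // pool takes ownership
        return raw;                         // callers only hold non-owning pointers
    }
    std::size_t size() const { return nodes_.size(); }
private:
    std::vector<std::unique_ptr<Node>> nodes_;
};

struct Variable : Node {
    explicit Variable(std::string n) : name(std::move(n)) {}
    std::string generate() const override { return name; }
    std::string name;
};

struct Multiply : Node {
    Multiply(const Node* l, const Node* r) : lhs(l), rhs(r) {}
    std::string generate() const override {
        return lhs->generate() + "*" + rhs->generate();
    }
    const Node* lhs;
    const Node* rhs;
};

struct Assignment : Node {
    Assignment(const Node* t, const Node* v) : target(t), value(v) {}
    std::string generate() const override {
        return target->generate() + "=" + value->generate() + ";";
    }
    const Node* target;
    const Node* value;
};

// One pool per expression: build, generate, then drop everything together.
std::string GenerateWithPool()
{
    NodePool pool;
    Node* pi = pool.make<Variable>("pi");  // shared by both operands
    Node* stmt = pool.make<Assignment>(pool.make<Variable>("pi_squared"),
                                       pool.make<Multiply>(pi, pi));
    return stmt->generate();
}
```

Because nodes hold plain pointers and the pool owns everything, sharing a node between two parents needs no ownership bookkeeping, and an exception during construction does not leak nodes that were already added to the pool.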
So maybe the following would be acceptable:
shared_ptr<Lhs> lhs(new Variable("pi_squared"));
shared_ptr<Rhs> rhs(new Variable("pi"));
Assignment code(lhs, new Multiply(rhs,rhs));

Maybe, but there are better solutions.

[...]
Copy constructors would do the job fine. It seems to work
for the STL.

In case you hadn't noticed, the STL does dynamic allocation in
its containers. Here, he's building a tree outside of any
container, so that doesn't work; he'd have to hide it in the
individual elements.
The Assignment implementation would also not be forced to
have a particular internal structure but could be implemented
in whatever way is best.
Not sure I get your point here.

If you don't know what variables you're going to need up front,
the only way to get the objects you need is by dynamic
allocation.
 
