Musings, alternatives to multiple return, named breaks?


Eric Sosman

[...]
In many cases, though, the only reason for a return in the middle
is because something went very wrong. At that point, it might not
matter what else happens.

Another reason, frequently, is that something went very right.
"Right" as in "I was looking for something, and found it." There
you are, looping from node to node in a search tree or something,
and suddenly you come upon the match you've been looking for. What
could be more natural than to say "Hooray!" and spell it "return"?

const Node *search(const Node *root, const Key *key) {
    while (root != NULL) {
        int order = compareKeys(key, root->key);
        if (order < 0) {
            root = root->left;
        } else if (order > 0) {
            root = root->right;
        } else {
            return root;    // "Hooray!"
        }
    }
    return NULL;            // "Drat."
}

Yes, you *could* collapse the two returns into one by introducing
an extra variable and a `break':

const Node *answer = NULL;
while (root != NULL) {
    ...
    } else {
        answer = root;   // "Hooray!"
        break;           // socially acceptable spelling of "goto"
    }
}
return answer;           // "Hooray!" or "Drat," dunno which

... and further obfuscations are possible. But to my mind that's
all they are: obfuscations, unnecessary increases in the amount
of information the reader must decipher and understand to see
what's going on. In short, ideology overcoming clarity. Pfui!
 

Ian Collins

glen said:
Somehow this implies that modern code is getting simpler.

No, it implies compilers and development processes are getting better.
There are still plenty of complicated algorithms that lead to large
complex functions. Yes, one should work toward more and smaller
functions where possible, but I know that there are still some in
computational physics, as one example, that are necessarily large.

I doubt there are many real world situations where a large function can't
be decomposed into a set of smaller functions. There's also the problem
of testing a large function: a large complex function inevitably
requires large, complex, fragile tests, which are a maintenance nightmare.
I wouldn't expect to find large functions in code written with testing
in mind or written test first.
In many cases, though, the only reason for a return in the middle
is because something went very wrong. At that point, it might not
matter what else happens.

Or very right as Eric pointed out elsewhere!
 

Stefan Ram

Ian Collins said:
I doubt there are many real world situations where a large function can't
be decomposed into a set of smaller functions.

It is done by the refactoring »extract method«. And many jump
statements out of a block or into a block are exactly what
makes it more difficult to apply this refactoring to that block.
Therefore, many jumps inside a large method are usually bad,
but I am afraid that this has been said before.
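
For illustration, a minimal sketch of the point, reusing the Node, Key,
and compareKeys declarations from Eric's example upthread: once the loop
body is extracted, the mid-loop return has to become a status code that
the caller acts on.

static int step(const Node **root, const Key *key) {
    int order = compareKeys(key, (*root)->key);
    if (order < 0)
        *root = (*root)->left;
    else if (order > 0)
        *root = (*root)->right;
    else
        return 1;    /* found: tell the caller to stop */
    return 0;        /* keep walking */
}

const Node *search(const Node *root, const Key *key) {
    while (root != NULL) {
        if (step(&root, key))
            return root;    /* the jump resurfaces in the caller */
    }
    return NULL;
}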
 

glen herrmannsfeldt

(snip, I wrote)
No, it implies compilers and development processes are getting better.
I doubt there are many real world situations where a large function can't
be decomposed into a set of smaller functions. There's also the problem
of testing a large function: a large complex function inevitably
requires large, complex, fragile tests, which are a maintenance nightmare.
I wouldn't expect to find large functions in code written with testing
in mind or written test first.

Seems that there are a number of large simulation programs in the
range of 100K to 1M lines of code. Note that making functions smaller,
after a while, just moves the problem around. Now you have more
functions to keep track of!

In balancing the number of functions against the size of functions,
it seems to me that, to a first approximation, you want the function
size to be about the square root of the total number of lines.
For 100K to 1M line programs, that is from about 300 to 1000 functions
of 300 to 1000 lines each.

I might download some available large programs and see how the
function sizes are distributed, and then look at some large ones
and see if I think they could have been split.
Or very right as Eric pointed out elsewhere!

-- glen
 

Ian Collins

glen said:
(snip, I wrote)
Seems that there are a number of large simulation programs in the
range of 100K to 1M lines of code. Note that making functions smaller,
after a while, just moves the problem around. Now you have more
functions to keep track of!

The trade-offs are potential reuse of functionality, reducing overall
code size and ease of testing. I would have thought correctness, hence
ease of testing, would be an important requirement for simulations.
In balancing the number of functions against the size of functions,
it seems to me that, to a first approximation, you want the function
size to be about the square root of the total number of lines.
For 100K to 1M line programs, that is from about 300 to 1000 functions
of 300 to 1000 lines each.

That rule might explain why some large applications become maintenance
nightmares!
 

David Brown

The trade-offs are potential reuse of functionality, reducing overall
code size and ease of testing. I would have thought correctness, hence
ease of testing, would be an important requirement for simulations.

Sometimes these big functions are generated by some other program,
rather than written by hand. Then you get a very different picture of
what should be tested, re-used, etc.
That rule might explain why some large applications become maintenance
nightmares!

It would be very interesting to know if there is a pattern like this in
real code.
 

glen herrmannsfeldt

David Brown said:
On 18/03/14 01:28, Ian Collins wrote:

(snip, I wrote)
It would be very interesting to know if there is a pattern like this in
real code.

I have downloaded GAMESS, which also happens to be a SPEC program.

Seems that it is free, but not open source. That is, you have to
agree to the license terms, including not to redistribute the source,
but you still get it free.

I will try some counts on it and see what I can learn.

It is about half C and half Fortran...

-- glen
 

BartC

Seems that there are a number of large simulation programs in the
range of 100K to 1M lines of code. Note that making functions smaller,
after a while, just moves the problem around. Now you have more
functions to keep track of!

In balancing the number of functions against the size of functions,
it seems to me that, to a first approximation, you want the function
size to be about the square root of the total number of lines.
For 100K to 1M line programs, that is from about 300 to 1000 functions
of 300 to 1000 lines each.

That doesn't sound right. A larger program will just contain more functions
of probably the same average size (where projects are created by merging
different libraries or modules created by different people).

I've looked at one 207Kloc project (a CPython interpreter); the average
function size is 41 lines; your formula suggests it should be 450 lines!
(And the 41 lines includes a space between functions, a 2-line declaration
(this style likes to split it across two lines with the return type on its
own line), and opening and closing braces on separate lines. Sometimes the
odd 3-4 line comment. So more like 35 lines of actual code. Also the
line-count excludes header files.)

(BTW it's tricky counting functions in C because you need to implement half
a C compiler to detect the beginning of a function! It worked in this case
because the style used indentation for all code, with the opening { always
at the start of a line.)
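
For what it's worth, a minimal sketch of that counting heuristic,
assuming the style above (the opening { of every function definition,
and nothing else, at the start of a line):

#include <stdio.h>

/* Count function definitions on stdin, assuming each function's
   opening brace starts a line.  A heuristic only: a brace at the
   start of a line inside a comment or string will fool it. */
int main(void) {
    int functions = 0, c, at_line_start = 1;
    while ((c = getchar()) != EOF) {
        if (at_line_start && c == '{')
            functions++;
        at_line_start = (c == '\n');
    }
    printf("%d function(s)\n", functions);
    return 0;
}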

I looked also at a project of mine, 15Kloc, your formula suggests 122 lines
per function, actual value 39 lines (again with 4 or 5 lines of overhead
each). Another, 18Kloc, size 25 lines, your prediction: 134 lines.

What sort of programs have you been writing...?

(Although these projects are all to do with language implementation, which
need to deal with large numbers of tiny details.)
 

glen herrmannsfeldt

(snip, I wrote)
That doesn't sound right. A larger program will just contain more
functions of probably the same average size (where projects are
created by merging different libraries or modules created
by different people).

I hadn't thought of projects and libraries when I wrote that one.
Consider libraries and/or projects (a large group effort might
be arranged into smaller groups, or projects). In that case, the
first approximation would be to minimize the sum of projects,
functions per project, and lines per function, such that each is
about the cube root of the total lines. The weighting might not be
quite right but, for 100K to 1M lines, that indicates about 46 to 100
groups, functions/group, and lines/function.
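
(For concreteness, a throwaway program that prints the numbers the two
rules of thumb give; illustration only.)

#include <math.h>
#include <stdio.h>

int main(void) {
    double sizes[] = { 100e3, 1000e3 };   /* 100K and 1M lines */
    for (int i = 0; i < 2; i++) {
        double n = sizes[i];
        printf("%7.0f lines: square-root rule %4.0f, cube-root rule %3.0f\n",
               n, sqrt(n), cbrt(n));
    }
    return 0;
}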

Seems to me that the needed interactions increase such that
groups shouldn't increase quite that fast. This is meant to be,
in the end, a single large program. If instead it is a suite of
independent programs, the result will be different.
I've looked at one 207Kloc project (a CPython interpreter); the average
function size is 41 lines; your formula suggests it should be 450 lines!
(And the 41 lines includes a space between functions, a 2-line declaration
(this style likes to split it across two lines with the return type on its
own line), and opening and closing braces on separate lines. Sometimes the
odd 3-4 line comment. So more like 35 lines of actual code. Also the
line-count excludes header files.)

OK, new formula says 59. For 41 lines, that means about 5000 functions,
which have to be documented somehow.
(BTW it's tricky counting functions in C because you need to implement half
a C compiler to detect the beginning of a function! It worked in this case
because the style used indentation for all code, with the opening { always
at the start of a line.)

I haven't started counting yet. Fortran is a little easier, though in
both cases you have to watch for comments. Braces inside comments
complicate the counting.
I looked also at a project of mine, 15Kloc, your formula
suggests 122 lines per function, actual value 39 lines (again
with 4 or 5 lines of overhead each). Another, 18Kloc, size 25
lines, your prediction: 134 lines.

So this one is all you? No dividing up into groups?
What sort of programs have you been writing...?
(Although these projects are all to do with language
implementation, which need to deal with large numbers
of tiny details.)

I was thinking more about computational physics or chemistry,
where the complications of nature come into the calculation.
For language implementations, the language, in the end, is
limited by people. For physics and chemistry the final limit
is nature. The complication is where to approximate, and by how
much.


-- glen
 

James Kuyper

Seems that there are a number of large simulation programs in the
range of 100K to 1M lines of code. Note that making functions smaller,
after a while, just moves the problem around. Now you have more
functions to keep track of!

In balancing the number of functions against the size of functions,
it seems to me that, to a first approximation, you want the function
size to be about the square root of the total number of lines.
For 100K to 1M line programs, that is from about 300 to 1000 functions
of 300 to 1000 lines each.

I think that's the wrong approach. A function's size should be
determined primarily by the need for the programmer to be able to
understand it clearly. If your program does a lot of work, it should
have a lot of functions. If that number gets too large, organize them
into groups, don't just make the functions larger. In a really large
program, you shouldn't have to understand all of the functions
simultaneously, just the functions in each group of functions.

I've inherited responsibility for a lot of code that violates the
following guideline, but when I've been free to set the function size,
I've found it best to keep it down to one, or at most two, printed
pages. I've seldom seen any function with more than a few hundred lines
that I considered well-structured or at all easy to understand.

Most of the cases I've seen where a large function was easy to
understand, involved reading or writing badly structured data, so that
the function contained nothing more than a long series of assignment
statements. For example, a large struct containing fields named like
radiance_band1, radiance_band2, ... radiance_band39, which should have
been implemented as an array. If it had been, 39 separate assignment
statements could have been replaced by a single assignment statement in
a loop. There were multiple fields organized like this, so it might have
been a good idea to declare an array of structures, rather than several
separate arrays that happened to have the same size, but either approach
would be better than a large disorganized set of individual fields.
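
For illustration, a hypothetical sketch of that restructuring (the type
and field names here are invented):

/* Before: one named field per band, one assignment per field. */
struct sample_bad {
    double radiance_band1;
    double radiance_band2;
    /* ... through radiance_band39 ... */
};

/* After: an array indexed by band, filled in a single loop. */
#define NBANDS 39
struct sample {
    double radiance[NBANDS];
};

void load_radiances(struct sample *s, const double raw[NBANDS]) {
    for (int band = 0; band < NBANDS; band++)
        s->radiance[band] = raw[band];
}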
 

glen herrmannsfeldt

(snip, I wrote)
I think that's the wrong approach. A function's size should be
determined primarily by the need for the programmer to be able to
understand it clearly. If your program does a lot of work, it should
have a lot of functions. If that number gets too large, organize them
into groups, don't just make the functions larger. In a really large
program, you shouldn't have to understand all of the functions
simultaneously, just the functions in each group of functions.

Yes, as someone else indicated, you should organize into groups.
That adds one more level of abstraction, but is it enough?

If you have X functions with mean Y lines of code each, for a
total of X*Y, should Y not increase at all as X increases?

For an example, say 1M loc, and you have a choice between
(mean) 50k functions of 20 lines each and 20k functions of 50
lines each, which will be more readable?

Now, going from 2D to 3D doesn't make the function 50% harder
to read, as there will be some similarity between those lines.

For a completely unrelated subject, consider how the mean size
of cities changes as population grows. It might be better to have
more cities of the same, smaller, size, but that isn't usually what
happens.
I've inherited responsibility for a lot of code that violates the
following guideline, but when I've been free to set the function size,
I've found it best to keep it down to one, or at most two, printed
pages. I've seldom seen any function with more than a few hundred lines
that I considered well-structured or at all easy to understand.

With my second approximation, where you have groups writing separately,
it is cube root for groups, functions/group and lines/function, so
the mean can be down to 100, and maybe the peak at 300.
Most of the cases I've seen where a large function was easy to
understand, involved reading or writing badly structured data, so that
the function contained nothing more than a long series of assignment
statements.

I am considering mean, so one large function can increase the mean,
while the majority are still small. Statistically, median might
be better, but mean has the convenience that I can multiply
mean lines/function by functions and get total lines.
For example, a large struct containing fields named like
radiance_band1, radiance_band2, ... radiance_band39, which should have
been implemented as an array. If it had been, 39 separate assignment
statements could have been replaced by a single assignment statement in
a loop.

A little off topic, but note that PL/I allows for structure expressions.

Fortran has array expressions, which sometimes help and sometimes don't,
but still not structure expressions. (Might be inherited from COBOL.
I still haven't written even one COBOL program, ever.)
There were multiple fields organized like this, so it might have
been a good idea to declare an array of structures, rather than several
separate arrays that happened to have the same size, but either approach
would be better than a large disorganized set of individual fields.

Yes, as I indicated above, this can increase lines per function, with
a slower increase in complexity (decrease in readability).
And it tends to happen more as programs get larger.

-- glen
 

Ian Collins

glen said:
(snip, I wrote)
Yes, as someone else indicated, you should organize into groups.
That adds one more level of abstraction, but is it enough?

If you have X functions with mean Y lines of code each, for a
total of X*Y, should Y not increase at all as X increases?

For an example, say 1M loc, and you have a choice between
(mean) 50k functions of 20 lines each and 20k functions of 50
lines each, which will be more readable?

I can only comment based on code I have worked on (which is a fair bit
over 30 years!) but in my experience smaller functions lead to less code
overall. Why? Because a code base with large functions tends to have
common functionality buried in multiple functions. When tasks get
broken down into smaller units, that functionality only exists (and has
to be tested and maintained) once, and gets reused.
 

mathog

glen said:
I have done timing tests, where I need some statements at the
beginning and end, which have the same problem.

It has crossed my mind a few times that if the language supported the
concept of (optional) "pre", (required but implicit) "body", (optional)
"post" sections for functions it would be easier to do this sort of
thing. Here is an example of what I mean.

This is "file.c":

/* other includes */
#if BUILDFLAG
#include "testing_file.h"
#endif

/* The "body" of a function; no "body" keyword is required.
   This is just the usual sort of C function that we already have. */
int some_action(void){
    /* do some stuff */
}

This is testing_file.h:

pre some_action(){
    printf("do something in pre\n");
}

post some_action(){
    printf("do something else in post\n");
}

If we build without BUILDFLAG the compiler finds neither the "pre" nor
"post" sections, and builds things as they are now.

If we build with BUILDFLAG set then the compiler slots the pre actions
in before the body executes, and the post in afterwards - on (all) the
return(s). Our debugging/testing code goes in where we want, and we
didn't have to edit the "body" part at all. This eliminates the need to
hunt for all of the returns when debugging/testing like this. pre
should be able to access the call arguments and any variables declared
within pre, but nothing in the body. post would be able to access
whatever pre can, and in addition the function's return value, perhaps
using a predefined macro like:

printf("return value is %d\n", POST_RETURN);

Debuggers could set break points on any line within "pre" or "post".

Notice that neither the type of the function nor its arguments would be
specified in pre and post statements, these would be taken from the
corresponding "body" part. For C it would be a bit silly to require
that that information be entered 3 times. Extending this idea to C++
would probably require entering that information multiple times though, since the
same method name can do many different things, and the extra information
may be needed to distinguish which pre/post goes with which of those
methods.

It would be very easy to generate a testing_file.h file
automatically from file.c using any number of scripting languages.
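
For comparison, here is not the proposed feature but a rough
approximation possible in today's C: keep the real body in a static
helper and let a thin wrapper supply the pre/post parts. All names
here are hypothetical.

#include <stdio.h>

static int some_action_body(void) {
    /* do some stuff */
    return 0;
}

int some_action(void) {
#if BUILDFLAG
    printf("do something in pre\n");        /* the "pre" part */
#endif
    int ret = some_action_body();
#if BUILDFLAG
    printf("return value is %d\n", ret);    /* the POST_RETURN idea */
#endif
    return ret;
}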

Regards,

David Mathog
 

BartC

OK, new formula says 59. For 41 lines, that means about 5000 functions,
which have to be documented somehow.

Why does there need to be a formula? That's like saying a long book ought to
have longer paragraphs than a short one.
I haven't started counting yet. Fortran is a little easier, though in
both cases you have to watch for comments. Braces inside comments
complicate the counting.


So this one is all you? No dividing up into groups?

Yes, all my work. One older larger project, totalling some 200Kloc, was
about 52 lines/function.

This project (a graphics app) would grow by thinking of a new command, and
adding a new module, consisting of functions written in the same coding
style as the rest. The existing functions are not going to get bigger, and
the new ones are not going to individually be bigger than the ones in any
other module either!

I suppose if someone has a long-winded coding style (writing 50-line
functions in 100-lines) then overall the Loc might be double.
I was thinking more about computational physics or chemistry,
where the complications of nature come into the calculation.
For language implementations, the language, in the end, is
limited by people. For physics and chemistry the final limit
is nature. The complication is where to approximate, and by how
much.

Perhaps you're thinking of Fortran too much. I remember my boss at a
placement I was on (in the late 70s) being bemused by my use of the new-fangled
'structured programming', and by my use of subroutines in particular!
 

Ian Collins

mathog said:
It has crossed my mind a few times that if the language supported the
concept of (optional) "pre", (required but implicit) "body", (optional)
"post" sections for functions it would be easier to do this sort of
thing. Here is an example of what I mean.

This functionality is better implemented in the host environment than
in code. For example, dtrace does exactly what you want (without any
compiler intervention) in this case, and I've used embedded platforms
that can add the appropriate hooks.
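
For example, a rough dtrace sketch of the pre/post hooks mathog
described, assuming the traced program has a function named some_action:

pid$target::some_action:entry
{
    printf("do something in pre\n");
}

pid$target::some_action:return
{
    /* in the pid provider, arg1 is the return value */
    printf("return value is %d\n", (int)arg1);
}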
 

James Kuyper

(snip, I wrote)
Yes, as someone else indicated, you should organize into groups.
That adds one more level of abstraction, but is it enough?

The grouping can be hierarchical, so that is as many additional levels
of abstraction as you want. I've recently been assigned to work on a new
project, and the program I'm currently responsible for in that project
has functions grouped into libraries, and the libraries can be divided
into several different groups, depending upon who's responsible for each
group of libraries: the C standard, the C++ standard, the POSIX
standard, three different third party vendors, our project, and another
project that we have borrowed a lot of code from. There's an average of
a couple of dozen libraries in each group, and an average of a dozen or
so identifiers with external linkage (most of them, function names) in
each library. That's three different levels to the hierarchy, which is
almost enough to make everything comprehensible (or at least, it might
be, if our code and the other project's code were better organized and
documented ;-( ). I imagine that larger projects need more levels; but
this one currently marks the upper limits of my experience.
If you have X functions with mean Y lines of code each, for a
total of X*Y, should Y not increase at all as X increases?

I don't think so. Y should be determined by the average developer's
ability to understand the code; I don't see any reason why it should
change with the size of the project.
For an example, say 1M loc, and you have a choice between
(mean) 50k functions of 20 lines each and 20k functions of 50
lines each, which will be more readable?

Unless this is C++ code, 20 lines seems a bit short for a function, but
any code that does something which can be implemented in 20 lines is
certainly going to be more readable than one that requires 50 lines. The
key point is that the average programmer won't have to (and won't have
time to) read all of the code - whether it has 50K of functions or only
20K. Any given programmer will read only the functions he or she is
actually working on, and short descriptions of the calling and called
functions. I assume you're not describing a one person project!
Now, going from 2D to 3D doesn't make the function 50% harder
to read, as there will be some similarity between those lines.

That depends upon what the third dimension means. If you have homogenous
data, and a third dimension just means larger chunks of data, then it
won't matter much. However, usually an additional dimension implies
corresponding additional features that need to be implemented, which
could easily make the code 50% harder to understand. In my own code, two
dimensions that occur almost everywhere are called "sample" and
"detector", and together they lay out a two-dimensional image. A
possible third dimension (though the code is not actually organized that
way) would be by scan; each scan of the rotating mirror creates a new
image, and that's just more of the same thing, which would be an example
of what you're talking about. However, the third dimension that actually
occurs in many contexts in my code is usually the band number; each
frequency band needs different processing, something which goes way
beyond being simply an extension of the two-dimensional image processing
part of the code.
For a completely unrelated subject, consider how the mean size
of cities changes as population grows. It might be better to have
more cities of the same, smaller, size but that isn't usually what
happens.

Larger cities require larger administrations to understand them well
enough to administer them properly. Correspondingly, large software
projects need to have more people working on them, but in my experience,
the best way to subdivide responsibilities for a program is to assign
entire functions or even entire libraries to one programmer, not by
trying to have one programmer understand one part of a function, and
have another programmer understand a different part of it. Therefore,
the maximum size of a function is determined by a single programmer's
ability to understand it, not by the size of the program that it's a
part of.
 

Eric Sosman

[...]
If you have X functions with mean Y lines of code each, for a
total of X*Y, should Y not increase at all as X increases?

If I have X functions with mean Y lines of code each, and
I write Z new functions to add a new feature, what should I do
to adjust the Y value? Go back and add a few lines to each of
the X functions, or deliberately make the Z new ones obese?
 

glen herrmannsfeldt

(snip, I wrote)
Why does there need to be a formula?

In considering how function size scales with program size, it
is possible that the exponent is zero. (That is, they don't scale.)

One could do statistics on a number of large programs, and some small
ones, and find out.
That's like saying a long book ought to have longer
paragraphs than a short one.

There are people who do statistics on books. That is why it is
almost impossible to be an anonymous author now. There is easily
enough data to compare writing styles.

I might wonder if paragraphs get longer with size for technical
books, and not so much for fiction books.

The difficulty in understanding a function increases faster (worse)
than linearly with size, as you need to try to understand the
possible interactions between the different parts. (The exponent
is greater than one.) Similarly, the readability goes down with
size (exponent less than zero).

If the difficulty in understanding doesn't increase more than
linearly with the number of functions, then you should keep them
all small. If there are interactions between functions, though,
that will complicate understanding them, such that it is worse
than linear.

Note that this is all mean size, and that it might be that most
don't change but that a small number become very large.

The idea behind "The Mythical Man-Month" was that the time required
for a large software project, unlike what had previously been believed,
does not increase linearly with size. The interactions between
different parts increase, and that causes the time to increase.

(snip)
I suppose if someone has a long-winded coding style (writing 50-line
functions in 100-lines) then overall the Loc might be double.

Well, I am supposing that one can find the optimal size for a
given project, which also may not be true.

(snip, I wrote)
Perhaps you're thinking of Fortran too much. I remember my boss at
a placement I was on (in late 70s) being bemused by my use of
the new-fangled 'structured programming', and by my use of
subroutines in particular!

Well, much of computational physics and chemistry is done in Fortran,
though GAMESS seems to be about equal Fortran and C. But note that
Fortran has array expressions, and C doesn't. (Though personally, I
know many cases where the array expression is harder to understand
than one done with loops.)

I think about the same, independent of the language. I do know that
my Java programs are more C-like than those of non-C Java programmers.

-- glen
 

glen herrmannsfeldt

(snip, I wrote)
I can only comment based on code I have worked on (which is a fair bit
over 30 years!) but in my experience smaller functions lead to less code
overall. Why? Because a code base with large functions tends to have
common functionality buried in multiple functions. When tasks get
broken down into smaller units, that functionality only exists (and has
to be tested and maintained) once, and gets reused.

And faster function calls help move in that direction.

For programs with run-times of hours, days, or weeks, instead of
seconds or minutes, there is a tendency to code for speed.

Reminds me of a program I worked on (inherited from someone else)
many years ago (Fortran 66 days). It did Gaussian quadrature on many
different (mathematical) functions, all coded in different subroutines.
Because of the way it was split up, it had to recompute many expressions
in each subroutine. (And overall, it wasn't all that big.) I don't
remember much of the detail by now, but by combining what was in
different subroutines into one (not so) large subroutine, it greatly
reduced the redundancy of calculation. I believe it was 2D or 3D,
and so nested loops, with a subroutine call at each nesting level.

But yes, it is often the case that code is duplicated for no good reason,
or no reason at all, and that it is smaller and faster (especially
in debugging and testing) to combine the duplicates into one.

-- glen
 

glen herrmannsfeldt

(snip, I wrote)
If I have X functions with mean Y lines of code each, and
I write Z new functions to add a new feature, what should I do
to adjust the Y value? Go back and add a few lines to each of
the X functions, or deliberately make the Z new ones obese?

In the not so unusual case, the Z functions call the X functions,
though in a slightly different way than before. In OOP, you subclass
the X, adding the new feature to the subclass, and, hopefully,
calling the old ones where possible.

In the old days, you add a new argument to each X, test the value
of that argument and process accordingly. Either one adds some lines
to each X, either indirectly (the subclass) or directly (in the actual
X functions).

I presume you don't want to copy the code from each X into the
corresponding Z, which makes things even bigger.

-- glen
 
