hmm: code bloat?...


BGB / cr88192

hmm, an actual question of sorts...

I recently noticed in my project, which is primarily C based, a few bits of
trivia:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);
610.5 kloc of this go into a VM project (dynamic C compiler + garbage
collector + ...).

so, ~ 1.094 Mloc of C.

about 483.5 kloc then is stuff related to 3D and misc (328.5 kloc of which
are my own code, 155 kloc coming from Quake2, errm, being essentially the
entire Q2 engine, where the code was combined mostly for my own
fiddling...).


mind that the vast majority of this is code I wrote myself, this being
primarily a single person project...


so, does something like this seem like there is a good deal of code bloat
going on?...
is it better to keep on going like this, or maybe look for parts to shave
off (even if one doesn't want to remove much?...).

or do others just keep on going as always before?...

more so, are there other implications from a codebase having a continual
tendency to inflate?... (say, if one's code tends to inflate at an average
rate of, say, 100-150 kloc/yr or so...).



another observation:
the ratio between C and headers is notably different than in Quake2, where
Q2 has an H/C ratio of 0.14, but in my code it is closer to 0.4 or so...

any thoughts about the ratio of headers and C code, or any "interesting"
implications from this property?...
(well, beyond my usage of some automatic header-writing tools, preference
for small-ish functions, ...).


....

just wondering is all...
 

bartc

BGB / cr88192 said:
hmm, an actual question of sorts...

I recently noticed in my project, which is primarily C based, a few bits
of trivia:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);
610.5 kloc of this go into a VM project (dynamic C compiler + garbage
collector + ...).

so, ~ 1.094 Mloc of C.

about 483.5 kloc then is stuff related to 3D and misc (328.5 kloc of which
are my own code, 155 kloc coming from Quake2, errm, being essentially the
entire Q2 engine, where the code was combined mostly for my own
fiddling...).
mind that the vast majority of this is code I wrote myself, this being
primarily a single person project...


so, does something like this seem like there is a good deal of code bloat
going on?...
is it better to keep on going like this, or maybe look for parts to shave
off (even if one doesn't want to remove much?...).

For a language project, 600kloc sounds like a lot of code to me, especially
for a one-man project.

But then, from bits of code you've posted elsewhere, you seem to like
complicated ways of doing things...

(I kept my own projects in line by doing reviews every so often and perhaps
reorganising/rewriting some subsystem or other, when it seemed about to get
out of hand, or consolidating it with another.

Revising to keep the source-code small was also quite fun to do, second only
to optimising for performance... My largest project, quite an elaborate
application, was some 150kloc; included inside that (about 20kloc) was a
bytecode compiler and interpreter...)
or do others just keep on going as always before?...

I would say: redesign and rewrite. Otherwise I don't think you're going to
make much impact on that 1000kloc.

(However, if you're writing a commercial application, customers like value
for money. I had to ship my product on CD (this was some years ago) even
though it fitted easily on one floppy disk, to make it appear more
substantial.)
 

Tom St Denis

For a language project, 600kloc sounds like a lot of code to me, especially
for a one-man project.

Depends on whether you're maintaining it. I mean I write apps against
glibc, do I get to include that 100K [or whatever it is] loc in my
project tally?

If he's including the Q2 engine and only modifying a few lines here or
there to suit his particular platform he's hardly maintaining it. To
him it's just another library he links in.

His build process should really have two "clean" targets, one that
just cleans his code (removes objects corresponding to files he
maintains) and another "cleanest" [or whatever you want to call it]
that removes all objects. That way, for most of his builds, he's not
rebuilding static code over and over...

Unless long build times are impressing his boss...
(However, if you're writing a commercial application, customers like value
for money. I had to ship my product on CD (this was some years ago) even
though it fitted easily on one floppy disk, to make it appear more
substantial.)

Customers for whatever reason like to think their applications are
doing a lot of thinking. That's why you'll see installers that churn
with progress bars while not really doing much of anything (I've
caught a few where the idle time was near 100% and syswait 0% and it
just sat there). Always good to put technical sounding words in there
like "optimizing data" ... [damn you Adobe...]

If you need to fill space, nothing better than 100s of 10MB files full
of random crap stuffed inside a zip archive renamed .PAK to look
important :)

Tom
 

Nick Keighley

hmm, an actual question of sorts...

I recently noticed in my project, which is primarily C based, a few bits of
trivia:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);
610.5 kloc of this go into a VM project (dynamic C compiler + garbage
collector + ...).

so, ~ 1.094 Mloc of C.

about 483.5 kloc then is stuff related to 3D and misc (328.5 kloc of which
are my own code, 155 kloc coming from Quake2, errm, being essentially the
entire Q2 engine, where the code was combined mostly for my own
fiddling...).

mind that the vast majority of this is code I wrote myself, this being
primarily a single person project...

you've written an MLOC of code...

so, does something like this seem like there is a good deal of code bloat
going on?...

how could we tell? Maybe you've got a MLOC of functionality in there.
I suspect with such a large code base there's *some* fat but without
looking it's just a guess. What do you think? Have you tried
refactoring it?

is it better to keep on going like this,

like what? What's wrong with what you are doing? Is it hard to modify
or debug? Is there a lot of duplicate code?
or maybe look for parts to shave
off (even if one doesn't want to remove much?...).

are you talking about removing duplicate code or dead code [good] or
removing little used functionality [ok] or actually reducing
functionality [sounds bad]. Do you have users? Do they want
functionality removed (in my experience- hardly ever!). Do they think
it's too big?

or do others just keep on going as always before?...

I'm a firm believer (in theory if not in practice!) that there comes a
point where things should be re-written if they get too creaky.

You should take a look at refactoring...

more so, are there other implications from a codebase having a continual
tendency to inflate?... (say, if one's code tends to inflate at an average
rate of, say, 100-150 kloc/yr or so...).

are you adding functionality? Do you refactor?

another observation:
the ratio between C and headers is notably different than in Quake2, where
Q2 has an H/C ratio of 0.14, but in my code it is closer to 0.4 or so...

never measured this one. My first thought was I'd expect this to be about
1.0.
A quick look at some code and... yes around 1.0. Why do you have so
few header files! And what the hell does Q2 put in its C files! Do
they use a one file per function coding standard?
any thoughts about the ratio of headers and C code, or any "interesting"
implications from this property?...
(well, beyond my usage of some automatic header-writing tools, preference
for small-ish functions, ...).

just wondering is all...

I tend to use one header and one C file per "module" (or class if it's
C++). Why would you have multiple C files to one header file? I suppose
a module could require multiple C files to implement it that all
communicate through one header file. Sounds like such a module needs
breaking up.
 

Michael Foukarakis

hmm, an actual question of sorts...

I'll try and give you an answer, sort of... :)
I recently noticed in my project, which is primarily C based, a few bits of
trivia:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);
610.5 kloc of this go into a VM project (dynamic C compiler + garbage
collector + ...).

600 kLOC for a C VM is too much. I've actually implemented a (large)
subset of a C compiler + an interpreter + garbage collector in under
4k LOC, so I know the magnitude of the problem. Perhaps you need a lot
of code refactoring, which is expected in one man projects.

Then again, it occurs to me you might not be using the correct tools
for the job. I'd be interested to hear more details regarding your
project, actually.
about 483.5 kloc then is stuff related to 3D and misc (328.5 kloc of which
are my own code, 155 kloc coming from Quake2, errm, being essentially the
entire Q2 engine, where the code was combined mostly for my own
fiddling...).

This part sounds fairly typical.
so, does something like this seem like there is a good deal of code bloat
going on?...

Definitely sounds bloated (at least the first part).
is it better to keep on going like this, or maybe look for parts to shave
off (even if one doesn't want to remove much?...).

Definitely try to refactor. Or even redesign.
or do others just keep on going as always before?...

Nope. I always review and reconsider my design, trying to identify
problems, limitations, etc. Perhaps not on a regular chronological
basis, but surely before the shit hits the fan - that would be when I
find out $(wc -l *.c *.h) -eq OVER NINE THOUSAND. ;-)
more so, are there other implications from a codebase having a continual
tendency to inflate?... (say, if one's code tends to inflate at an average
rate of, say, 100-150 kloc/yr or so...).

That all depends on the project really. There's no objectivity in pure
LOC.
another observation:
the ratio between C and headers is notably different than in Quake2, where
Q2 has an H/C ratio of 0.14, but in my code it is closer to 0.4 or so...
any thoughts about the ratio of headers and C code, or any "interesting"
implications from this property?

Well, such an increased ratio might mean that there's lots of macros
or functions etc. in those .h files that can be reused. Which is
somewhat contradictory to the LOC count for your project (which we
don't know what it's about, btw. Care to give a hint?).
 

Nick

BGB / cr88192 said:
hmm, an actual question of sorts...

I recently noticed in my project, which is primarily C based, a few bits of
trivia:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);
610.5 kloc of this go into a VM project (dynamic C compiler + garbage
collector + ...).

so, ~ 1.094 Mloc of C.

That does seem a heck of a lot. The website in my sig is driven by
36 kloc of my own, and a total of 270 kloc (but that includes a wiki
renderer, an HTML templating system, the SQLite database engine and a
pile of other bits).

That implements a scripting language to do the boring bits, and the
data structures to allow that scripting language to do clever things
(like compute the best path through the graph depending on current
preferences).

Four times as much seems a lot for /anything/.

I know I've a fair chunk of dead code in there as well - it still
supports legacy data formats and other databases.
another observation:
the ratio between C and headers is notably different than in Quake2, where
Q2 has an H/C ratio of 0.14, but in my code it is closer to 0.4 or so...

Is that lines in *.c to lines in *.h, or number of .c files and number
of .h files?
 

BGB / cr88192

Francis Glassborow said:
I agree, it is very easy for code to grow and eventually become
unmaintainable. I would certainly review the whole of the code base for
starters. Though that will take some time given its size I am sure that
you will benefit in the long run.

yeah.

I noted some while documenting stuff that there are things floating around
in the code that I have almost never used, for example:
the 'channels' feature I heard about from a video from Rob Pike, and then
implemented (I guess I am more used to async communication mechanisms, and
have a harder time seeing why I would have multiple threads but want them to
operate lock-step...);
some stuff for genetic-programming and neural nets, for which thus far I
have found little use (apart from specialized tools, which is where I
originally wrote this code);
....

in a few cases, I have implemented APIs which have turned out absurdly
bulky:
my class/instance system has around 500 API calls;
....

as well I guess as a general design strategy which leads to bulk:
writing an actual textual assembler, code generators which produce textual
ASM;
....

components which, in retrospect, have debatable use:
such as an x86 interpreter (this thing by itself being ~ 50 kloc);
....

I would not worry over much about the .h/.c ratio. Well designed code
where each function does one thing tends to have rather higher proportion
of lines in header files but modern optimising compilers often manage to
do link time optimisations so that your large number of functions do not
actually result in an equivalent number of function calls in the
executable.

yeah.

I guess there is lots going on to inflate the amount of stuff present in
headers.

it is also notable that, given my use of tools, pretty much all prototypes
end up in headers, rather than just the ones manually put there. Q2 seems to
be a little more lax, and only puts "many" of the prototypes in headers.

I guess there are a lot of structs, ... as well, since I generally don't put
structs in C files, ...
 

BGB / cr88192

Nick Keighley said:
you've written an MLOC of code...

apparently...



how could we tell? Maybe you've got a MLOC of functionality in there.
I suspect with such a large code base there's *some* fat but without
looking it's just a guess. What do you think? Have you tried
refactoring it?

I used to do that some, usually trying to trim down the project
occasionally.
however, I have not done that in a while, and I suspect there is a bit of a
pile-up.

the present state thus being a little over 1 Mloc of code, and around 5GB of
files (the vast majority being PNG's).


the last time I did a major clean up was back when my project was in the
~ 300 kloc range, which was, I think, a few years ago.

like what? What's wrong with what you are doing? Is it hard to modify
or debug? Is there a lot of duplicate code?

I meant, just keep going on coding stuff, as the codebase gets ever larger.

or maybe look for parts to shave
off (even if one doesn't want to remove much?...).

are you talking about removing duplicate code or dead code [good] or
removing little used functionality [ok] or actually reducing
functionality [sounds bad]. Do you have users? Do they want
functionality removed (in my experience- hardly ever!). Do they think
it's too big?

I could do all of these...

I suspect I may get around 50-100 kloc from dead code removal (or, more
like, dead component removal, nevermind smaller code).

removing some little-used stuff is also possible.
removing actually-used components is possible, but I guess I will probably
not do this.

I'm a firm believer (in theory if not in practice!) that there comes a
point where things should be re-written if they get too creaky.

You should take a look at refactoring...

yeah, I may need to consider this.

are you adding functionality? Do you refactor?

adding functionality is done a lot;
refactoring, not so much.

I had often used the "cellular" approach of splitting components if they got
too large, and interfacing components with defined APIs and allowed
dependencies. this keeps things maintainable, but does little to constrain
size (actually, abstract APIs probably make this issue worse, FWIW).

never measured this one. My first thought was I'd expect this to be about
1.0.
A quick look at some code and... yes around 1.0. Why do you have so
few header files! And what the hell does Q2 put in its C files! Do
they use a one file per function coding standard?

well, I have header files for what I have them for, mostly structs and
prototypes.
I tend to use tools to mine prototypes, mostly because IMO having to go copy
prototypes to the headers is a hassle.


Q2 tends to have lots of big hairy functions (where in many cases 50-200
lines will be used by single functions), and seems to be a little lax about
which prototypes actually make it to the headers.

granted, there are a few of these in my codebase, but these tend to be a
strong minority.


well, that, and the whole engine has lots of bit twiddling and fairly nasty
use of pointer arithmetic.

or, at least by my standards where something like:
"if(*(float *)((char *)(&array_of_structs[index])+offset) == foo) ..."
is, IMO, nasty...

I tend to use one header and one C file per "module" (or class if it's
C++). Why would you have multiple C files to one header file? I suppose
a module could require multiple C files to implement it that all
communicate through one header file. Sounds like such a module needs
breaking up.

ok, both my project and Q2 tend to use more centralized headers with all
declarations from a "component" (where, in my terminology: component =
'library' = 'mass of code compiled into a single DLL').

basically, both my project and Q2 have the practice of creating a single
header which is included by all source files in a given library (and is
often the only header included).


but, actually, it was a measure of C loc vs H loc, FWIW...
(I have not been counting files here).

modifying line-counter to also count files, codebase totals:
1826 C files; 1795 H files.

interesting...

calculating: average C file loc is 558.4, average H file loc is 297.6.

Q2 only:
189 C files, 78 H files.
average C file loc is 821 loc, average H file loc is 278.


my codebase tends to, thus, have a much bigger portion of content in its
headers than in Q2.

IOW: relative to the amount of C code, my headers hold about 2.86x as much
material as Q2's do (0.4 / 0.14).

then again, as noted elsewhere:
tools mine prototypes, and also pretty much all structs, constant
declarations, ... go here.

but, there are lots of possible considerations here...
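
for what it is worth, the counting itself is simple enough to sketch. what
follows is only a minimal illustration of a line/file counter that yields an
H/C ratio by lines; it is not the actual tool used for the numbers above, and
the names (tally, count_lines, has_ext, tally_file, hc_ratio) are made up for
the example. it classifies files purely by their .c / .h suffix and counts raw
lines, with no attempt to skip blank lines or comments:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Running totals for one class of file (.c or .h). Hypothetical type. */
struct tally { unsigned long files, lines; };

/* Count newline-terminated lines, plus a final unterminated line if any. */
unsigned long count_lines(FILE *fp)
{
    unsigned long n = 0;
    int c, last = '\n';

    assert(fp != NULL);
    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            n++;
        last = c;
    }
    if (last != '\n')
        n++;
    return n;
}

/* Nonzero if path ends with the given suffix, e.g. ".c". */
int has_ext(const char *path, const char *ext)
{
    size_t lp = strlen(path), le = strlen(ext);
    return lp >= le && strcmp(path + lp - le, ext) == 0;
}

/* Add one file to the matching tally; files with other suffixes are ignored.
 * Returns 0 on success (or ignored file), -1 if the file cannot be opened. */
int tally_file(const char *path, struct tally *c, struct tally *h)
{
    struct tally *t = has_ext(path, ".c") ? c : has_ext(path, ".h") ? h : NULL;
    FILE *fp;

    if (t == NULL)
        return 0;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    t->files++;
    t->lines += count_lines(fp);
    fclose(fp);
    return 0;
}

/* H/C ratio by lines; 0.0 when there is no C code to compare against. */
double hc_ratio(const struct tally *c, const struct tally *h)
{
    return c->lines ? (double)h->lines / (double)c->lines : 0.0;
}
```

calling tally_file() once per path and then hc_ratio() on the two totals gives
the figure being discussed. plugging in the Q2 totals implied by the averages
above (189 x 821 lines of C, 78 x 278 lines of headers) gives roughly 0.14.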
 

BGB / cr88192

hmm, an actual question of sorts...

I'll try and give you an answer, sort of... :)
I recently noticed in my project, which is primarily C based, a few bits
of
trivia:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);
610.5 kloc of this go into a VM project (dynamic C compiler + garbage
collector + ...).

<--
600 kLOC for a C VM is too much. I've actually implemented a (large)
subset of a C compiler + an interpreter + garbage collector in under
4k LOC, so I know the magnitude of the problem. Perhaps you need a lot
of code refactoring, which is expected in one man projects.
-->

I am not sure how that is possible, assuming writing things in C...


<--
Then again, it occurs to me you might not be using the correct tools
for the job. I'd be interested to hear more details regarding your
project, actually.
-->

well, most stuff is hand-written C, but a few minor things are tool-written
(as far as C goes, this is not likely to contribute much to overall LOC).

most tasks like parsing, ... use hand-written recursive-descent parsing.

the VM project includes a number of components, and tends to compile the C
into native code which is run at runtime.

so, major components:
a garbage collector;
an assembler+linker (x86, x86-64);
a big library giving a dynamic typesystem, P-OO and C/I OO stuff, ...;
a codegen (RPNIL -> ASM, x86 / x86-64, supports: cdecl, stdcall, SysV/AMD64,
and Win64 calling conventions);
a C compiler frontend (C -> RPNIL);
a Java-ByteCode interpreter;
an x86 interpreter (simulates userspace-only, POSIX-derived process-model
and core APIs);
....

about 483.5 kloc then is stuff related to 3D and misc (328.5 kloc of which
are my own code, 155 kloc coming from Quake2, errm, being essentially the
entire Q2 engine, where the code was combined mostly for my own
fiddling...).

<--
This part sounds fairly typical.
-->

yeah.
my part also contains a 3D modeler and skeletal animation tool...

so, does something like this seem like there is a good deal of code bloat
going on?...

<--
Definitely sounds bloated (at least the first part).
-->

is it better to keep on going like this, or maybe look for parts to shave
off (even if one doesn't want to remove much?...).

<--
Definitely try to refactor. Or even redesign.
-->

ok.

or do others just keep on going as always before?...

<--
Nope. I always review and reconsider my design, trying to identify
problems, limitations, etc. Perhaps not on a regular chronological
basis, but surely before the shit hits the fan - that would be when I
find out $(wc -l *.c *.h) -eq OVER NINE THOUSAND. ;-)
-->

yeah.

hmm... there was a standard tool for line-counting, and I was off having
written my own version of this as well...

more so, are there other implications from a codebase having a continual
tendency to inflate?... (say, if one's code tends to inflate at an average
rate of, say, 100-150 kloc/yr or so...).

<--
That all depends on the project really. There's no objectivity in pure
LOC.
-->

yes, ok.

another observation:
the ratio between C and headers is notably different than in Quake2, where
Q2 has an H/C ratio of 0.14, but in my code it is closer to 0.4 or so...
any thoughts about the ratio of headers and C code, or any "interesting"
implications from this property?

<--
Well, such an increased ratio might mean that there's lots of macros
or functions etc. in those .h files that can be reused. Which is
somewhat contradictory to the LOC count for your project (which we
don't know what it's about, btw. Care to give a hint?).
-->

macros, not that many...
I don't like excessive macro use.

not so many functions either, though I do make some use of "static inlines"
in a few places.

also possible is that I have some amount of functions which are
"one-liners":

void libTypeFoo(libType *obj)
{ if(obj->iface->foo) obj->iface->foo(obj); }

mostly as I really don't like code directly messing around with struct
internals if they don't actually own the struct.
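
to make that a bit more concrete, here is a self-contained sketch of the
pattern. only libType, libTypeFoo and the iface->foo slot come from the
one-liner above; the libTypeIface layout, libTypeInit, the foo_calls field and
the two example interfaces are assumptions invented for the illustration, not
the real API:

```c
#include <assert.h>
#include <stddef.h>

typedef struct libType libType;

/* Hypothetical method table; a slot may be NULL if unsupported. */
typedef struct {
    void (*foo)(libType *obj);
} libTypeIface;

struct libType {
    const libTypeIface *iface;
    int foo_calls;               /* demo state only, not from the post */
};

/* Hypothetical initializer so callers never fill in the struct by hand. */
void libTypeInit(libType *obj, const libTypeIface *iface)
{
    assert(obj != NULL && iface != NULL);
    obj->iface = iface;
    obj->foo_calls = 0;
}

/* The wrapper from the post: callers never touch obj->iface themselves. */
void libTypeFoo(libType *obj)
{ if(obj->iface->foo) obj->iface->foo(obj); }

/* One example implementation of the 'foo' slot. */
static void counterFoo(libType *obj) { obj->foo_calls++; }

const libTypeIface counterIface = { counterFoo };  /* implements foo */
const libTypeIface emptyIface   = { NULL };        /* leaves foo out */
```

each such wrapper costs a prototype line in a header plus a line or two of C,
which is one plausible way the header share creeps up.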

....


actually, the main project is mostly 3D stuff, and random misc stuff...

it was partly merged with Q2 as part of a test, where I had noted that Q2
does lots of things my project has not mastered, and so the partial reason
for merging the codebases was to "see what happens".


otherwise, the project doesn't have a whole lot of a particular purpose,
mostly just myself and idle coding I guess...
 

bartc

BGB / cr88192 said:
I'll try and give you an answer, sort of... :)


<--
600 kLOC for a C VM is too much. I've actually implemented a (large)
subset of a C compiler + an interpreter + garbage collector in under
4k LOC, so I know the magnitude of the problem. Perhaps you need a lot
of code refactoring, which is expected in one man projects.
-->

I am not sure how that is possible, assuming writing things in C...

That does sound tight. I would have reckoned on several 10Kloc of
C/C-equivalent code for such a project. Perhaps he missed out a zero.
the VM project includes a number of components, and tends to compile the C
into native code which is run at runtime.

so, major components:
a garbage collector;
an assembler+linker (x86, x86-64);
a big library giving a dynamic typesystem, P-OO and C/I OO stuff, ...;
a codegen (RPNIL -> ASM, x86 / x86-64, supports: cdecl, stdcall,
SysV/AMD64, and Win64 calling conventions);
a C compiler frontend (C -> RPNIL);
a Java-ByteCode interpreter;
an x86 interpreter (simulates userspace-only, POSIX-derived process-model
and core APIs);
...

This doesn't sound like a single executable. So perhaps you shouldn't just
combine all the Loc for each.
 

BGB / cr88192

bartc said:
That does sound tight. I would have reckoned on several 10Kloc of
C/C-equivalent code for such a project. Perhaps he missed out a zero.

yep.



This doesn't sound like a single executable. So perhaps you shouldn't just
combine all the Loc for each.

actually, it is all a bit "fuzzy"...

there is no clear line of demarcation which components will go into which
exact EXE's.

(think of it more like DirectX, with piles of semi-independent DLL's, some
loosely coupled via "interfaces", others fully independent, and others fully
dependent, ...).


all of these are compiled into DLL's, some or all of which may be used by a
"front-end", but may be used by any number of frontends...


there are actually around 45 EXE's, many of which are component-specific
tests (such as to verify that OO stuff still works, ...) or tools; a few
are apps (3D engine, mesh-modeler, skeletal tool, Q2 frontend).

counting here, there are 17 DLL's (2 of which belong to Q2's renderer). 8
belong to my VM, 7 to my 3D engine.


Q2 is in total a small minority of the code. it currently uses a few of the
other components, but at present nothing else really uses it (since much
of Q2's code is hardly well-organized or generic). the "merge" was also
fairly recent and not really intended to be a "permanent" solution, partly
because Q2 is GPL and there is little escape from that; my VM is being kept
"clean", mostly as I am migrating the thing to Public Domain (before, it
was LGPL).

....
 

Stefan Ram

BGB / cr88192 said:
it is in excess of 1Mloc (1 Mloc = 1,000,000 lines of code);

The number of lines of code (globally) should not be relevant,
when code is well organized. When code is well organized, size
disappears.

For example, when I write

#include <stdio.h>
int main( void ){ printf( "hello, world\n" ); }

, how many lines are this?

Two?

But the I/O library used might also have been written in C.

Why do you not take the number of lines of the I/O library
into account?

Because they are well hidden behind an interface.

So, when a project is well organized nearly everything is
well hidden behind an interface, and you never see 1Mloc,
you see only about what fits on your screen or less.

You organize your project so that you never see 1 Mloc.

A line count, such as »1 Mloc« is nearly meaningless, because
it is arbitrary what you take into account for this and what
not. You might name a part of your project »the graph library«
(with 189,124 locs) and consider it to be a »separate project«,
and suddenly, there are 189,124 locs less in your project. So
such counts bear little information.
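
As an illustration only, here is how such a »graph library« might
hide itself behind an interface in C, using an opaque type. All
names (Graph, graph_new, graph_add_edge, graph_edge_count,
graph_free) are invented for this sketch. The first part is what
would go into a header; everything after it could grow to any
number of lines without a caller ever seeing it.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- would live in graph.h: all that callers ever see ---- */
typedef struct Graph Graph;               /* opaque: layout is hidden */
Graph *graph_new(void);
int    graph_add_edge(Graph *g, int from, int to);  /* 0 on success */
size_t graph_edge_count(const Graph *g);
void   graph_free(Graph *g);

/* ---- would live in graph.c: free to change or grow ---- */
struct Graph {
    int (*edges)[2];                      /* growable array of (from, to) */
    size_t count, cap;
};

Graph *graph_new(void)
{
    Graph *g = malloc(sizeof *g);
    if (g != NULL) {
        g->edges = NULL;
        g->count = g->cap = 0;
    }
    return g;
}

int graph_add_edge(Graph *g, int from, int to)
{
    assert(g != NULL);
    if (g->count == g->cap) {             /* grow geometrically when full */
        size_t cap = g->cap ? g->cap * 2 : 8;
        int (*e)[2] = realloc(g->edges, cap * sizeof *e);
        if (e == NULL)
            return -1;
        g->edges = e;
        g->cap = cap;
    }
    g->edges[g->count][0] = from;
    g->edges[g->count][1] = to;
    g->count++;
    return 0;
}

size_t graph_edge_count(const Graph *g)
{
    assert(g != NULL);
    return g->count;
}

void graph_free(Graph *g)
{
    if (g != NULL) {
        free(g->edges);
        free(g);
    }
}
```

A caller that includes only the five declarations sees five lines,
whatever the size of the implementation behind them.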

A master once came to an interview and was asked »Do you have
any experience with large software projects?«. He immediately
answered »No.« and he was not given the job. What they didn't
know is: He organized every project in such a way, that to him
it appeared small. Thus, whatever he did, he never did see a
»large« project.
 

Flash Gordon

BGB said:
Nick Keighley said:
On 5 Jan, 09:10, "BGB / cr88192" wrote:
or maybe look for parts to shave
off (even if one doesn't want to remove much?...).
are you talking about removing duplicate code or dead code [good] or
removing little used functionality [ok] or actually reducing
functionality [sounds bad]. Do you have users? Do they want
functionality removed (in my experience- hardly ever!). Do they think
it's too big?

I could do all of these...

I suspect I may get around 50-100 kloc from dead code removal (or, more
like, dead component removal, nevermind smaller code).

I would start off by doing this. It will mean you have that much less
code to look at for the next stage...

The next stage for me would be to identify if there are redundant
features which were either implemented but never used or implemented and
used, but have since been superseded and are no longer used (or with
minimal work could be no longer used). That should get rid of a load
more code in my opinion.
removing some little-used stuff is also possible.

<snip>

Then you could look at this, with the other eye on whether making *more*
use of it elsewhere would be an even better saving!

That should shrink the code base and make it easier to refactor and
improve it further.

At each stage, getting rid of code will make the next stage easier since
you have less to look at. Then repeat the exercise from the beginning,
since through all these stages you may turn more code into dead code!
Also you might make more features obsolete allowing them to be deleted!
Keep repeating until it does not give you enough benefit to be worth the
effort of going further.
 

Nick Keighley

I used to do that some, usually trying to trim down the project
occasionally.
however, I have not done that in a while, and I suspect there is a bit of a
pile-up.

then there almost certainly is some fat (unless you're a really clever
coder!)
the present state thus being a little over 1 Mloc of code, and around 5GB of
files (the vast majority being PNG's).

last time I did a major clean up was back when my project was in the ~ 300
kloc range, was I think a few years ago.



I meant, just keep going on coding stuff, as the codebase gets ever larger.

I don't really understand the question. Is your code base causing you
a problem? You are presumably adding code for a reason (unless you
just like coding) so, er, what was the question again?
are you talking about removing duplicate code or dead code [good] or
removing little used functionality [ok] or actually reducing
functionality [sounds bad]. Do you have users? Do they want
functionality removed (in my experience- hardly ever!). Do they think
it's too big?

I could do all of these...

I suspect I may get around 50-100 kloc from dead code removal (or, more
like, dead component removal, nevermind smaller code).

removing real dead code seems a good idea. If it turns out to be
if-it-might-come-in-useful-one-day code (vanishingly unlikely really) then
you can get it out of your repository.
removing some little-used stuff is also possible.

well that's your call. Little used might be very useful when you need
it! The US's SDI system was only supposed to be used once (or less)...

Error handling often doesn't get used very often.
removing actually-used components is possible, but I guess I will probably
not do this.




yeah, I may need to consider this.



adding functionality is done a lot;
refactoring, not so much.

I don't see the code increasing in size as inherently wrong. I'm more
interested in *why* it is growing.

I had often used the "cellular" approach of splitting components if they got
too large, and interfacing components with defined APIs and allowed
dependencies.

all sounds good (though the "cellular" term seems a little odd)

this keeps things maintainable, but does little to constrain
size

really? I thought it would help
(actually, abstract APIs probably make this issue worse, FWIW).

oh? I'd have thought good abstractions would keep the code size down.

I misunderstood you here. You were talking header kloc and c file
kloc. I'd expect there to be a big difference but I've never done any
measurements (never really cared that much to be honest!)

Q2 tends to have lots of big hairy functions (where in many cases 50-200
lines will be used by single functions), and seems to be a little lax about
which prototypes actually make it to the headers.

granted, there are a few of these in my codebase, but these tend to be a
strong minority.

well, that, and the whole engine has lots of bit twiddling and fairly nasty
use of pointer arithmetic.

or, at least by my standards where something like:
"if(*(float *)((char *)(&array_of_structs[index])+offset) == foo) ..."
is, IMO, nasty...

fairly nasty. But does it really matter? If you need to do stuff like
that isolate it in a function/module/macro and document it. The rest
of the code doesn't care how nasty it is.
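
a minimal sketch of what I mean, assuming the expression is reading a
float field at a byte offset inside a struct. The struct entity type and
the function names are invented for the example (nothing here is from
Q2), and memcpy is swapped in for the raw cast so that alignment and
aliasing stop being the caller's problem:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type standing in for the engine's structs. */
struct entity {
    int   id;
    float health;
    float speed;
};

/* Read the float stored 'offset' bytes into the record at 'rec'.
 * The caller must pass an offset that really names a float member,
 * typically obtained with offsetof(). */
float field_float(const void *rec, size_t offset)
{
    float v;
    memcpy(&v, (const char *)rec + offset, sizeof v);
    return v;
}

/* Nonzero if the float field at 'offset' in ents[index] equals 'value'.
 * This is the one place that knows about the byte-offset trick. */
int entity_field_equals(const struct entity *ents, size_t index,
                        size_t offset, float value)
{
    assert(offset + sizeof(float) <= sizeof(struct entity));
    return field_float(&ents[index], offset) == value;
}
```

a call then reads something like
entity_field_equals(ents, i, offsetof(struct entity, health), foo),
and the cast soup lives in exactly one documented function.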

<snip>
 

BGB / cr88192

Nick Keighley said:
then there almost certainly is some fat (unless you're a really clever
coder!)

yeah.
well, I think there is as well, so no issue there...

I don't really understand the question. Is your code base causing you
a problem? You are presumably adding code for a reason (unless you
just like coding) so, er, what was the question again?

well, usually my reason for adding code is not to deal with problems...

more often, it is adding code to add "features"...

sometimes though, the features turn out to be almost entirely useless.

I once added a partial "precise" mode to my (otherwise conservative) GC, but
ended up not using it (because precise GC makes using it a lot more
difficult for not much gain). later on, this is no longer part of the public
API. internally, some of the code lingers on, and a cleanup of this
component would likely eliminate this.

the GC also does ref-counting, which is another one of those "rarely used"
features, mostly because ref-counting is one of those "all or nothing"
features, meaning that any code which uses the feature has to be entirely
ref-counting safe.


then, elsewhere, there is another precise GC, which was written because I
realized that it was kind of pointless to have a precise GC on the same heap
as a conservative one if they end up essentially "splitting the world in
half" anyways.

in this case, I had intended this other GC mostly for a particular use, but
ended up using a different MM strategy for that code instead: allocating
lots of memory in a temporary heap, and then destroying this heap when done.
this was done as an attempt to improve both stability and performance.

though not very generic, it works fairly well for code which produces 10s of
MB of stuff in a fraction of a second but never needs to refer to it again
after the task completes. it works much better than using a generic GC for
this task, since this usage pattern is essentially "abusive" to the GC
(tending to cause misbehavior and poor performance).
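A minimal sketch of the "temporary heap" idea described above, assuming a simple bump-pointer design: one big block is allocated up front, objects are carved out of it, and everything is released at once when the task completes. The names (TmpHeap, TmpHeap_Init, TmpHeap_Alloc, TmpHeap_Destroy) are hypothetical and are not taken from the actual codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical temporary heap: one backing block, bump-pointer allocation. */
typedef struct {
    unsigned char *base;  /* start of the backing block */
    size_t size;          /* total bytes available */
    size_t used;          /* bytes handed out so far */
} TmpHeap;

/* Create a heap of 'size' bytes; returns 0 on success, -1 on failure. */
int TmpHeap_Init(TmpHeap *h, size_t size)
{
    assert(h != NULL);
    h->base = malloc(size);
    h->size = h->base ? size : 0;
    h->used = 0;
    return h->base ? 0 : -1;
}

/* Hand out 'n' bytes, rounded up to a multiple of 16; NULL when full. */
void *TmpHeap_Alloc(TmpHeap *h, size_t n)
{
    size_t need;
    void *p;

    assert(h != NULL);
    if (n > (size_t)-1 - 15)
        return NULL;                      /* rounding would overflow */
    need = (n + 15) & ~(size_t)15;
    if (need > h->size - h->used)
        return NULL;                      /* not enough room left */
    p = h->base + h->used;
    h->used += need;
    return p;
}

/* Release everything at once; there is no per-object free. */
void TmpHeap_Destroy(TmpHeap *h)
{
    assert(h != NULL);
    free(h->base);
    h->base = NULL;
    h->size = 0;
    h->used = 0;
}
```

The point of the design is that allocation is a pointer bump and cleanup is a single free(), so a task producing 10s of MB of short-lived data never touches the general-purpose GC.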

and, it is all this fuss over a single set of features.

removing real dead code seems a good idea. If it might come in useful one
day (vanishingly unlikely really) then you can get it out of your
repository.

well, this has happened sometimes, but it often takes a while for features
to "become" useful.

but, yeah, a good start is removing dead components and subsystems, a few of
which I know to exist.
(particularly related to my codegen).

although, I had made a new experimental codegen with a nifty register
allocator which was never really migrated back into the main codegen (mostly
as the new codegen worked a bit different, and my old codegen is a big
tangled mess that has been hacked on a lot).

it is just that the old codegen has proven a bit difficult to replace.

well that's your call. Little used might be very useful when you need
it! The US's SDI system was only supposed to be used once (or less)...

Error handling doesn't get used very often.

ok.



I don't see the code increasing in size as inherently wrong. I'm more
interested in *why* it is growing.

partly as a result of adding features;
partly as a result of "generalizing" things;
....

all sounds good (though the "cellular" term seems a little odd)

well, because originally, one has a lot of code which is in a single
component.
so, all the code shares the same directory, makefile, and naming prefixes,
....

but, then it gets too large, and is split.

then, often, the library name prefix gets changed, as well as the sub-parts
being moved into new directories, ...

say:
FOO_SubSysA_...
FOO_SubSysB_...
FOO_SubSysC_...
FOO_SubSysD_...
FOO_SubSysE_...
FOO_SubSysF_...

splits into:
FOO_SubSysA_...
FOO_SubSysC_...
FOO_SubSysE_...
FOO_SubSysF_...

BAR_SubSysB_...
BAR_SubSysD_...

then, maybe some internal patch-up is done to compensate for the change in
naming, ...
(this is often done via 'sed' or search/replace).


so, in a way, it is sort of like mitosis or similar...

really? I thought it would help

component splitting very often causes there to be sub-components which are
larger than the original singular component.

this is usually the result of "abstracting" one component from another,
which often adds new code in the form of abstract API wrappers, ...

oh? I'd have thought good abstractions would keep the code size down.

on the larger scale, probably.
but at the small scale, it adds a bunch of function calls which often do
little more than redirect to other functions.

in a few components, this can end up being a significant part of the overall
code size (in particular, in one of the larger components, which consists
almost entirely of exported APIs and relatively little internal logic code).

most other components contain more of a balance, with most of the size being
due to logic-code, and a smaller amount of wrapper code usually serving as
an interface to the outside world.
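To illustrate the kind of wrapper overhead meant here, a small sketch follows. All names are hypothetical (a BAR component split off from FOO, as in the naming example earlier); the only real logic is in the static function, and the two exported functions do little more than redirect.

```c
#include <assert.h>
#include <stddef.h>

/* Internal logic, private to the (hypothetical) BAR component. */
static size_t bar_count_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);        /* internal callers must pass a valid string */
    while (s[n] != '\0')
        n++;
    return n;
}

/* Exported wrapper: checks its argument, then just redirects. */
size_t BAR_StringLength(const char *s)
{
    if (s == NULL)
        return 0;
    return bar_count_bytes(s);
}

/* Old FOO_ name kept alive after the split, adding one more hop. */
size_t FOO_StringLength(const char *s)
{
    return BAR_StringLength(s);
}
```

Each exported entry point costs several lines (plus a prototype in a header) without adding any logic, which is how a component made mostly of exported APIs ends up large.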

I misunderstood you here. You were talking header kloc and c file
kloc. I'd expect there to be a big difference but I've never done any
measurements (never really cared that much to be honest!)

yep.

well, I measured a lot of stuff, some of which I didn't bother to mention.

or, at least by my standards where something like:
"if(*(float *)((char *)(&array_of_structs[index])+offset) == foo) ..."
is, IMO, nasty...

fairly nasty. But does it really matter? If you need to do stuff like
that isolate it in a function/module/macro and document it. The rest
of the code doesn't care how nasty it is.

yep, I usually avoid doing stuff like this personally, or if it is done, it
is wrapped up somewhat...
Q2 does things like this often as a matter of common practice.
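As a sketch of the "isolate it in a function" suggestion, the byte-offset access can be confined to one small accessor. The struct and function names below (Foo_Entity, Foo_GetFloatField) are made up for illustration; memcpy is used here instead of the pointer cast so that the read does not depend on the char pointer being suitably aligned for a float.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type standing in for array_of_structs[]. */
typedef struct {
    int   id;
    float health;
    float speed;
} Foo_Entity;

/*
 * Read the float stored 'offset' bytes into *ent. The byte-level
 * access lives only here; callers never see the pointer arithmetic.
 */
float Foo_GetFloatField(const Foo_Entity *ent, size_t offset)
{
    float value;

    assert(ent != NULL);
    assert(offset <= sizeof(Foo_Entity) - sizeof(float));
    memcpy(&value, (const unsigned char *)ent + offset, sizeof value);
    return value;
}
```

A call site then reads something like "if(Foo_GetFloatField(&array_of_structs[index], offsetof(Foo_Entity, speed)) == foo) ...", with offsetof supplying the offset.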

as well as the good old trick of reading raw data from a file, casting it to
a struct pointer, and just using this structure directly (although often
with either endianness-swap functions, or a pre-pass to go over the read-in
file and pre-swap all the values if needed).


I often use more explicit read/write value operations, such as reading a
datum at a time.

/* read one vertex, a float at a time (no struct cast of raw file data) */
Foo_Vertex *Foo_ReadVertex(VFILE *fd)
{
    Foo_Vertex *tmp;
    tmp=Foo_AllocVertex();
    tmp->x=Foo_ReadFloat(fd);
    tmp->y=Foo_ReadFloat(fd);
    tmp->z=Foo_ReadFloat(fd);
    return(tmp);
}

/* read one triangle as three vertices, in file order */
Foo_Triangle *Foo_ReadTriangle(VFILE *fd)
{
    Foo_Triangle *tmp;
    tmp=Foo_AllocTriangle();
    tmp->v0=Foo_ReadVertex(fd);
    tmp->v1=Foo_ReadVertex(fd);
    tmp->v2=Foo_ReadVertex(fd);
    return(tmp);
}

....
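For comparison with the cast-a-buffer-to-a-struct trick, here is a rough sketch of what a datum-at-a-time float reader might look like. It is not the actual Foo_ReadFloat: it reads from a hypothetical in-memory ByteReader rather than a VFILE, and it assumes a little-endian file layout and IEEE-754 single-precision floats. Because the value is assembled byte by byte, no endianness-swap pass is needed on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a file stream. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t pos;
} ByteReader;

/* Return the next byte, or 0 once the end of the buffer is reached. */
static unsigned ByteReader_Next(ByteReader *r)
{
    assert(r != NULL);
    return (r->pos < r->size) ? r->data[r->pos++] : 0u;
}

/* Assemble a little-endian 32-bit value, independent of host byte order. */
uint32_t Sketch_ReadU32LE(ByteReader *r)
{
    uint32_t v = 0;
    int i;

    for (i = 0; i < 4; i++)
        v |= (uint32_t)ByteReader_Next(r) << (8 * i);
    return v;
}

/* Reinterpret those 32 bits as a float (assumes IEEE-754 single precision). */
float Sketch_ReadFloatLE(ByteReader *r)
{
    uint32_t bits = Sketch_ReadU32LE(r);
    float f;

    memcpy(&f, &bits, sizeof f);
    return f;
}
```

This costs more code than a single cast, but the file format no longer depends on the compiler's struct layout or the host's byte order.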
 
BGB / cr88192

Nick said:
That does seem a heck of a lot. The website in my sig is driven by
36 kloc of my own, and a total of 270 kloc (but that includes a wiki
renderer, an HTML templating system, the SQLite database engine and a
pile of other bits).

That implements a scripting language to do the boring bits, and the
data structures to allow that scripting language to do clever things
(like compute the best path through the graph depending on current
preferences).

Four times as much seems a lot for /anything/.

I know I've a fair chunk of dead code in there as well - it still
supports legacy data formats and other databases.

yep...

well, the codebase tends to be fairly large and, per scale, not do quite as
much as some other smaller codebases I have seen...

Is that lines in *.c to lines in *.h, or number of .c files and number
of .h files?

C vs H loc...

but, then I can also note that I do a few "unusual" things with headers
which tend to bloat the total loc in this case (many headers exist which are
tool-written, but not nearly so much C is tool-written, ...).

 
BGB / cr88192

Stefan Ram said:
The number of lines of code (globally) should not be relevant,
when code is well organized. When code is well organized, size
disappears.

For example, when I write

#include <stdio.h>
int main( void ){ printf( "hello, world\n" ); }

, how many lines are this?

Two?

for this file, yes.
if this is the only C file in the project, then this is the case.

But the I/O library used might also have been written in C.

Why do you not take the number of lines of the I/O library
into account?

well, if you mean the IO library in the C runtime (such as MSVCRT), no, this
is not included in my case.
granted, I have a VFS system, which is included (since it is part of my
overall codebase).

my x86 interpreter also includes a C library, which is included in the total
count, but is built specifically for the virtualized environment within the
interpreter (this is about 13 kloc for the C-RTL, and 5 kloc for syscall
wrappers, ...).

Because they are well hidden behind an interface.

So, when a project is well organized nearly everything is
well hidden behind an interface, and you never see 1Mloc,
you see only about what fits on your screen or less.

granted.

well, interfaces keep things manageable, but all the code does still exist
(even if most of it largely just "does something" and is not messed with
often much beyond this...).

part of what made me more aware of the mass amounts of code, was going and
trying to start documenting some of my stuff...

You organize your project so that you never see 1 Mlocs.

A line count, such as »1 Mloc« is nearly meaningless, because
it is arbitrary what you take into account for this and what
not. You might name a part of your project »the graph library«
(with 189,124 locs) and consider it to be a »separate project«,
and suddenly, there are 189,124 locs less in your project. So
such counts bear little information.

ok.

this is a thought...
I tend to include pretty much everything that is not system libraries, and
tend to minimize use of 3rd-party libs and code (mostly due to the
possibility of awkward dependency issues, ...).

A master once came to an interview and was asked »Do you have
any experience with large software projects?«. He immediately
answered »No.« and he was not given the job. What they didn't
know is: He organized every project in such a way, that to him
it appeared small. Thus, whatever he did, he never did see a
»large« project.

interesting...


well, I guess it can be noted then that, code seems "bigger" when it is less
cleanly written.

after all, the Q2 code would "seem" a lot bigger when one tries to work on
it, mostly because the code is so tangled and nasty (doing one thing in one
place often leads to hunting for bugs appearing somewhere else, often as a
result of subtle bit-twiddling, where, FWIW, Q2 has explicit bit-twiddling
that is done on a project-wide scale).


my codebase is much larger, but working on most things is a good deal less
effort, I guess because I do tend to modularize and abstract things a bit
more...

I also guess I tend to make much more heavy use of ASCII-based data
serialization as well (most serialized complex data tends to be ASCII-based,
....).
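As a rough illustration of ASCII-based serialization, the sketch below writes a vertex as one line of text and parses it back. The "vertex x y z" line format and the Sketch_ names are assumptions made up for this example, not the format actually used in the codebase.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical vertex type, for illustration only. */
typedef struct { float x, y, z; } Sketch_Vertex;

/* Write the vertex as one text line; returns its length, or -1 if 'buf' is too small. */
int Sketch_PrintVertex(char *buf, size_t size, const Sketch_Vertex *v)
{
    int n;

    assert(buf != NULL && v != NULL);
    /* %.9g keeps enough digits for a 32-bit float to round-trip */
    n = snprintf(buf, size, "vertex %.9g %.9g %.9g\n", v->x, v->y, v->z);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Parse a line written by Sketch_PrintVertex; 0 on success, -1 on malformed input. */
int Sketch_ParseVertex(const char *line, Sketch_Vertex *v)
{
    assert(line != NULL && v != NULL);
    return (sscanf(line, "vertex %f %f %f", &v->x, &v->y, &v->z) == 3) ? 0 : -1;
}
```

The text form is larger and slower than a raw struct dump, but it does not depend on struct layout, padding, or byte order, and it can be inspected or edited by hand.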

and, personally, I tend to have an aversion to passing structs between
components, as I have run into a lot of troubles here in the past, whereas
Q2 passes structs all over the place (and likes to assume that structs in
one place are the same as in another, and as in their file formats, really
likes arrays of structs, ...).

oddly, I almost always use arrays of struct pointers, and almost never
arrays of raw structs, whereas Q2 uses lots of arrays of raw structs, ...
 
