How do you read source for big programs?

kj

I consider myself quite proficient in C and a few other programming
languages, but I have never succeeded in understanding a largish
program (such as zsh or ncurses) at the source level. Basically,
I quickly become disoriented, losing sight of the forest for the
trees.

What's your approach for understanding a large program at the source
level? By "understanding a program" I mean more than just figuring
out where to zero in to make a small change (e.g. change the value
of a global variable), but rather to digest as much of the source
as necessary to know the program's structure in detail, know where
in the source to go for any customization you'd want to make, know
what you'd need to do to port the program to a different OS from
the one it was written for, know what you'd need to do to abstract
some of the program's functionality into a smaller subprogram that
you could embed in a program of your own, etc. Bottom line: the
goal is to know the program's source inside out.

I realize that this is a task that could take days, if not weeks.
I'm willing to put in the effort, but I'm really at a loss as to
how to proceed. (My current interest is reading the source for
zsh, but some day I'd like to read the source codes for Perl, Emacs,
Firefox, Apache, you name it.)

Thanks!

kj
 
Eric Sosman

kj said:
I consider myself quite proficient in C and a few other programming
languages, but I have never succeeded in understanding a largish
program (such as zsh or ncurses) at the source level. Basically,
I quickly become disoriented, losing sight of the forest for the
trees.

What's your approach for understanding a large program at the source
level? By "understanding a program" I mean more than just figuring
out where to zero in to make a small change (e.g. change the value
of a global variable), but rather to digest as much of the source
as necessary to know the program's structure in detail, know where
in the source to go for any customization you'd want to make, know
what you'd need to do to port the program to a different OS from
the one it was written for, know what you'd need to do to abstract
some of the program's functionality into a smaller subprogram that
you could embed in a program of your own, etc. Bottom line: the
goal is to know the program's source inside out.

I realize that this is a task that could take days, if not weeks.
I'm willing to put in the effort, but I'm really at a loss as to
how to proceed. (My current interest is reading the source for
zsh, but some day I'd like to read the source codes for Perl, Emacs,
Firefox, Apache, you name it.)

Even a relatively small program of a few hundred
thousand lines is too complex to grasp "in detail," and
understanding the entirety of medium and large programs
requires shortcuts. The largest single program I ever
personally worked on had grown to about three million
lines by the end of my eleven years on it, and although
I was expert on certain parts of it and had a rough idea
what the other parts were about, there is no way that I
could ever pretend to understand the entire thing "in
detail." And three million lines really isn't that large;
the program I mention had its origins in the early 1980s,
and bloat-- er, I mean, "scale" -- has grown in the last
quarter century.

Here's another matter: If the program is "interesting,"
somebody out there is making changes to it. If it's both
interesting and large, there'll be several such somebodys.
By the time you finish your study of the code (assuming
you can do so), those somebodys will have added two brand-
new subsystems while ripping out and re-implementing three
existing subsystems from scratch. Your knowledge will be
obsolete before you can finish acquiring it.

What to do? I think one must recognize the futility of
the pursuit of perfect knowledge; very few programmers are
actually God (despite what they think of themselves). One
must instead seek to accomplish a goal -- fix a bug, add
a feature, whatever -- *without* needing to gather complete
knowledge as a prerequisite. When you are dropped in the
middle of the Pacific, you don't need a detailed cartography
of the Earth's entire land mass, but you do need to find
yourself an island within swimming distance. That's the
skill a programmer should cultivate: When dropped into the
middle of a huge sea of mysterious code, to find a few
islands and start building bridges between them. Aim for
a network of knowledge, not for a blanket.

As a practical means to start developing your network,
your archipelago of islets in a sea of confusion, I can
recommend that you port the program to a new environment.
Port Perl to the Palm Pilot, get Emacs running on your Tivo
box, re-target gcc to the Analytical Engine, whatever you
like. The exercise will teach you a tremendous amount, not
the least of which will be a sense for what kinds of practices
help or hinder the porting, make the program more or less
robust in the face of other incompletely-aware programmers'
changes, and so on.

IMHO, the education of programmers concentrates entirely
too much on the design and generation of programs, and not
enough on the analysis and understanding of programs already
written. If you're going to be an effective programmer, you
must learn these skills for yourself.
 
E. Robert Tisdale

kj said:
I consider myself quite proficient
in C and a few other programming languages,
but I have never succeeded in understanding
a largish program (such as zsh or ncurses) at the source level.
Basically, I quickly become disoriented,
losing sight of the forest for the trees.
What's your approach for understanding a large program
at the source level? By "understanding a program"
I mean more than just figuring out where to zero in
to make a small change (e.g. change the value of a global variable),
but rather to digest as much of the source as necessary
to know the program's structure in detail,
know where in the source to go
for any customization you'd want to make,
know what you'd need to do to port the program
to a different OS from the one it was written for,
know what you'd need to do
to abstract some of the program's functionality
into a smaller subprogram
that you could embed in a program of your own, etc.
Bottom line: the goal is to know the program's source inside out.

I realize that this is a task that could take days, if not weeks.
I'm willing to put in the effort,
but I'm really at a loss as to how to proceed.
(My current interest is reading the source for zsh,
but some day I'd like to read the source codes
for Perl, Emacs, Firefox, Apache, you name it.)

That's not a good idea.
If you make changes to programs and/or libraries
you will be obliged to support those changes
in every subsequent release of the program/library.
The only practical approach is to request
the authors/maintainers to add the feature that you want
so that it will be included in each new release.

Trying to understand a program by reading source code
is like trying to understand aviation by inspecting a 747.
You need to read some other documentation.
Look for something called "Design Documentation".
If you can't find any formal description of the design,
it is very unlikely that you will be able to
"reverse-engineer" the design by reading source code.

The typical programmer can write and maintain programs
up to about 100,000 lines of code. Larger programs usually involve
several other people cooperating to write, test and
maintain the program. No one person understands everything
and it is unlikely that you will either.

In a "well designed" program/library,
the platform [operating system] dependent code
is sequestered in a few routines that are typically stored
in a separate directory. If they aren't,
you should reorganize the code so that they are.
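
One rough way to check how well the platform-dependent code is sequestered is to count the preprocessor conditionals that mention OS macros, grouped by directory. The following is only a sketch under stated assumptions: the macro list is a small illustrative sample of common predefined macros, the names `count_platform_conditionals` and `platform_hotspots` are made up for this example, and many projects test their own configure-generated macros instead, which this would miss.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a small, illustrative set of OS-identifying macros.
PLATFORM_MACROS = ("_WIN32", "__linux__", "__APPLE__", "__unix__",
                   "__FreeBSD__", "__CYGWIN__")

# Matches "#if", "#ifdef", "#ifndef" and "#elif" lines and captures the rest.
COND_RE = re.compile(r"^\s*#\s*(?:if|ifdef|ifndef|elif)\b(.*)$", re.MULTILINE)
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def count_platform_conditionals(text, macros=PLATFORM_MACROS):
    """Count conditional directives in `text` that mention any of `macros`."""
    wanted = set(macros)
    count = 0
    for match in COND_RE.finditer(text):
        if wanted & set(IDENT_RE.findall(match.group(1))):
            count += 1
    return count


def platform_hotspots(root, macros=PLATFORM_MACROS):
    """Return a Counter mapping each directory under `root` to the number of
    platform conditionals found in its .c and .h files."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".c", ".h"}:
            text = path.read_text(errors="replace")
            n = count_platform_conditionals(text, macros)
            if n:
                counts[str(path.parent)] += n
    return counts


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for directory, n in platform_hotspots(root).most_common():
        print(f"{n:6d}  {directory}")
```

If nearly all of the hits land in one or two directories, the code is probably organized the way described above; if they are spread across the whole tree, a port will touch a lot of files.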

If the code is "well designed", it will be composed of
several independent *modules* with "obvious" functionality
so you can immediately focus on the modules
that are of interest to you and ignore the rest.

If the code is not well designed,
it may be cheaper [faster] to re-write it
instead of trying to read and understand it.
It takes about two man months to write 100,000 lines of code
if you know what the code is supposed to do
(if you don't need to experiment before you decide upon a design).
It could take a lot longer than that to read and understand
100,000 lines of poorly designed code.
 
Joe Wright

Eric said:
Even a relatively small program of a few hundred
thousand lines is too complex to grasp "in detail," and
understanding the entirety of medium and large programs
requires shortcuts. The largest single program I ever
personally worked on had grown to about three million
lines by the end of my eleven years on it, and although
I was expert on certain parts of it and had a rough idea
what the other parts were about, there is no way that I
could ever pretend to understand the entire thing "in
detail." And three million lines really isn't that large;
the program I mention had its origins in the early 1980s,
and bloat-- er, I mean, "scale" -- has grown in the last
quarter century.

Here's another matter: If the program is "interesting,"
somebody out there is making changes to it. If it's both
interesting and large, there'll be several such somebodys.
By the time you finish your study of the code (assuming
you can do so), those somebodys will have added two brand-
new subsystems while ripping out and re-implementing three
existing subsystems from scratch. Your knowledge will be
obsolete before you can finish acquiring it.

What to do? I think one must recognize the futility of
the pursuit of perfect knowledge; very few programmers are
actually God (despite what they think of themselves). One
must instead seek to accomplish a goal -- fix a bug, add
a feature, whatever -- *without* needing to gather complete
knowledge as a prerequisite. When you are dropped in the
middle of the Pacific, you don't need a detailed cartography
of the Earth's entire land mass, but you do need to find
yourself an island within swimming distance. That's the
skill a programmer should cultivate: When dropped into the
middle of a huge sea of mysterious code, to find a few
islands and start building bridges between them. Aim for
a network of knowledge, not for a blanket.

As a practical means to start developing your network,
your archipelago of islets in a sea of confusion, I can
recommend that you port the program to a new environment.
Port Perl to the Palm Pilot, get Emacs running on your Tivo
box, re-target gcc to the Analytical Engine, whatever you
like. The exercise will teach you a tremendous amount, not
the least of which will be a sense for what kinds of practices
help or hinder the porting, make the program more or less
robust in the face of other incompletely-aware programmers'
changes, and so on.

IMHO, the education of programmers concentrates entirely
too much on the design and generation of programs, and not
enough on the analysis and understanding of programs already
written. If you're going to be an effective programmer, you
must learn these skills for yourself.

If we could give awards here for quality responses, Eric is my
nominee hands down, especially for this one. I couldn't have said it
better and I won't try.
 
Chris Croughton

kj said:
I consider myself quite proficient in C and a few other programming
languages, but I have never succeeded in understanding a largish
program (such as zsh or ncurses) at the source level. Basically,
I quickly become disoriented, losing sight of the forest for the
trees.

This is usual. You should try a really large program like GCC if you
really want to be confused.

kj said:
What's your approach for understanding a large program at the source
level? By "understanding a program" I mean more than just figuring
out where to zero in to make a small change (e.g. change the value
of a global variable), but rather to digest as much of the source
as necessary to know the program's structure in detail, know where
in the source to go for any customization you'd want to make, know
what you'd need to do to port the program to a different OS from
the one it was written for, know what you'd need to do to abstract
some of the program's functionality into a smaller subprogram that
you could embed in a program of your own, etc. Bottom line: the
goal is to know the program's source inside out.

Read the documentation, especially the design documentation. If there
is any, of course. Otherwise, you need to write the design
documentation from studying the code (not necessarily to production
standard, but effectively you need to do what the authors should have
done originally).

In particular you need to draw call trees (or equivalent tables), module
connectivity and data flow diagrams, and the like. Possibly also state
diagrams or tables.
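
As a starting point for such a call table, here is a minimal sketch that guesses at function definitions and the calls inside them using regular expressions. It is a heuristic, not a C parser: it assumes ANSI-style definitions, knows nothing about macros, function pointers or K&R-style parameter lists, and the names `build_call_table` and `SAMPLE` are invented for this illustration. A real cross-referencing tool will do much better; this only shows the shape of the table you are after.

```python
import re
from collections import defaultdict

# Keywords that look like calls ("if (...)") but are not.
C_KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof", "do", "else"}

# Comments, string literals and character literals.
LITERAL_RE = re.compile(
    r"""/\*.*?\*/|//[^\n]*|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'""",
    re.DOTALL,
)
# Heuristic: a name, a flat parameter list, then an opening brace.
DEF_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(([^;{}()]*)\)\s*\{")
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def body_end(src, open_idx):
    """Return the index of the brace that closes the one at `open_idx`."""
    depth = 0
    for i in range(open_idx, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    return len(src)


def build_call_table(src):
    """Map each function defined in `src` to a sorted list of names it calls."""
    # Blank out comments and literals so their contents are not scanned.
    src = LITERAL_RE.sub(" ", src)
    table = defaultdict(set)
    pos = 0
    while True:
        m = DEF_RE.search(src, pos)
        if m is None:
            break
        name = m.group(1)
        if name in C_KEYWORDS:
            pos = m.end()
            continue
        end = body_end(src, m.end() - 1)
        calls = table[name]  # creates an entry even if nothing is called
        for call in CALL_RE.finditer(src[m.end():end]):
            if call.group(1) not in C_KEYWORDS:
                calls.add(call.group(1))
        pos = end
    return {name: sorted(calls) for name, calls in table.items()}


SAMPLE = r'''
static int helper(int x) { return x + 1; }

int parse(const char *s) {
    /* ignored(1) lives in a comment */
    if (s) { return helper(strlen(s)); }
    return 0;
}

int main(void) {
    printf("parse(%d)\n", parse("abc"));
    return 0;
}
'''

if __name__ == "__main__":
    for name, callees in build_call_table(SAMPLE).items():
        print(f"{name}: {', '.join(callees) or '(no calls found)'}")
```

Running this over one source file at a time and noting which callees are defined in other files also gives a first, crude picture of module connectivity.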

With ncurses there is a lot of documentation, and it's a library so a
lot of the functions are pretty standalone or call more basic functions,
so it's not too bad. From what I remember, zsh has little internal
documentation and its parts are more interrelated, so it's worse.

kj said:
I realize that this is a task that could take days, if not weeks.

Months. Years.

kj said:
I'm willing to put in the effort, but I'm really at a loss as to
how to proceed. (My current interest is reading the source for
zsh, but some day I'd like to read the source codes for Perl, Emacs,
Firefox, Apache, you name it.)

Good luck (you'll need it)!

There are tools which can help. Check out doxygen, for instance,
available for Windows as well as *ix, it can draw up call and data trees
(although it's better at C++ than at pure C) and can (with a little help
sometimes) extract code comments as documentation. There are other
cross-referencing programs which can sometimes help.

Chris C
 