jacob navia
OK, after the stack and the debuggers, let's look a little bit
more in depth into this almost ignored piece of the language,
the linker.
Obviously, the C standard doesn't mention it [1], and we almost
never discuss it here.
Like many other things, this is an error because the linker
is an *essential* piece of the language. Without it, nothing
would ever work.
Separate compilation
--------------------
C supports the separate compilation of modules. Each module
is compiled into an independent object code file (.o in Unix,
or .obj under Microsoft) and those separate object files are
assembled into the executable by the "link editor" or linker
for short.
There are several standards for object file formats:
o The "ELF" format used in most Unix systems
o The COFF format used under 32 bit Windows
o The OMF format used by 16 bit DOS/Windows systems
and many others I do not know...
What matters in the context of this discussion is what an
object file contains, seen from the language viewpoint.
An object file contains:
o A symbol table of exported and imported symbols
o Several "sections" of data.
o Relocation information
Sections
--------
The "sections" are logical parts of the program that should be
assembled into the final executable. Basically we have 3 kinds
of sections:
1) The code section, i.e. we have here the binary opcodes for the
processor
2) The data section, i.e. the initialized tables, strings, or
numbers that are contained in the module
3) The non-initialized data section, which is basically just
a size: XXX bytes should be reserved for non-initialized
variables
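For example:
    #include <string.h>   /* declares strcmp */

    int function(char *a)
    {
        static int bss;   /* no initializer: goes to the non-initialized section */
        if (strcmp(a,"foobar"))
            return 42;
        else
            return 366554 + bss++;
    }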
In the code section we would have:
o The prologue code
o The call, the if, and the return with its epilogue code
In the data section we would have the "foobar" array
of characters followed by a zero, plus the number 42 and
the number 366554 if the processor doesn't support
inlined integer constants. If the processor DOES support
inlined constant values (the x86 for example), the two
integer values would go in the code section.
The non-initialized section would contain sizeof(int)
bytes to hold the integer called "bss".
Relocations
-----------
The symbol "strcmp" is not defined in the module, and its
address is not known at compile time. The object module
contains just a record to indicate to the linker:
From: compiler
To: linker
Dear Linker:
Please fill in, at offset 4877 in the code section, sizeof(void *)
bytes with the address of the symbol "strcmp".
Thanks in advance
Your compiler
Relocations can be much more complicated than that, but basically,
all of them boil down to this kind of request.
The symbol table
----------------
The object module defines some symbols, and imports some symbols
from other modules. All those symbols are specified in the object
module's symbol table. In some object code formats we also find
debug information records in the symbol table. In others,
the debug information is written into a separate section.
Libraries
---------
Static libraries are just a bunch of object code modules that
are stored in a single file for convenience. The linker
treats them like a collection of object files.
----------------------------------------------------------------------
With all this information, the linker goes through all object files
noting which symbols are defined in which module, which symbols are
required from one module and defined in another, until there are no
more object files or libraries. It then checks that all symbols are
defined (if not, it will complain) and builds the executable.
Linkers can be very complex beasts, like, for instance, the GNU "ld"
linker. This is a linker that features:
o A "link editor language" (linker scripts) that allows you to change
the workings of the linker and describe your own executable
format...
o An apparent "machine independence" (what this means in
a linker is not obvious to me) that allows it to link
object modules from different formats...
o "BFD", the Binary File Descriptor library, a GNU layer that gives
the linker a uniform view of many object file formats.
Other linkers, like lcc-win's for instance, are completely stupid beasts
that can only link the format generated by lcc-win and nothing else.
Obviously, the only thing *you* care about in a linker is how fast it is,
so in this sense, lcc-win is a better choice: it is quite fast. But
you pay the price: it can only link lcc-win's code...
In the next installment we will look in detail at the dark corners of
linkers, specifically the problems with symbol collisions.
-------------
[1] The only mention of the linker in the standard is in the
discussion of extended characters in identifiers, where it says:
<quote>
On systems in which linkers cannot accept extended characters, an
encoding of the universal character name may be used in forming valid
external identifiers. For example, some otherwise unused
character or sequence of characters may be used to encode the \u in a
universal character name. Extended characters may produce a long
external identifier.
<end quote>
Nowhere is the "linker" defined.