Getting source file from the object file

Chris Dollin · Mar 9, 2009

user923005 said:
The actual proof is in here:
P.T.Breuer and J.P.Bowen, "Decompilation: The enumeration of types and
grammars", Tech. Rep. PRG-TR-11-92, Oxford University Computing
Laboratory, 11 Keble Road, Oxford OX1 3QD, 1992.
Home!

Found here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.8799

Where:
"Proposition 1 There is no computable algorithm which will take an
arbitrary attribute grammar description and always produce a piece of
computer code which enumerates all the valid phases of the grammar,
and then terminates (if the grammar is finite)."
is proposed and then proven.

If we're decompiling C programs, the grammar isn't arbitrary.

(I'm not sure I understand the importance of the result as stated, since
very many useful grammars have arbitrarily many valid phrases and so
any attempt to enumerate all of them will fail to terminate.)

Chris Dollin · Mar 9, 2009

Richard said:
This is how one learns to make backups.

And use change-control systems.

Michael Angelo Ravera · Mar 9, 2009

Dear Friend

Can we get the source file from the object ?
example :

file a.c
cc a.c
Now I have a.out
But I not have the a.c
how could I get the source file using a.out

thanks

The truthful answer is that it depends upon how much symbol table
information is kept in the object file and how much is discarded. But,
the answer is that, for arbitrary code, in general, it is not
possible.

If only the function locations are kept in the symbol table, you can,
with a good decompiler, recreate a realy ugly set of source code that
will produce a functionally equivalent object. The big help here is
that the decomplier can know entry point information and can therefore
determine some of the boundaries of functions. If the decompiler is
REALLY good, it MAY be able to guess that some inaccessible locations
between functions are code space data (for literals, initializations,
etc).

If more symbol information (necessarily including data locations and,
with good fortune, structure definitions) is available in the object
file, the decompiler might be able to produce a reasonable facsimile
of the original source code. It won't be able to tell, for instance,
whether the original code used a for, while, or do loop (or, in some
cases, an if and goto, but the label locations will usually be in the
retained symbol information, if anything is).

However, if you have a large set of source code one of which is known
to have produced the object file and would like to know which among
the source files produced the object file, you could simply compile
all of the source code files and find the object file that has the
smallest amount of difference to the object file that you wish to
create. The object files won't always be identical, because often
modification and compilation date information is retained in the
object file, even when no symbol information is. You can determine
this by compiling the same source to two different object files and
examining their differences.

Good Luck.

Richard Bos · Mar 11, 2009

I think that this approach assumes that the compiler vendor is writing
the decompiler.

No; merely that the writer of the decompiler has access to the compiler
that was used, and knows how it was invoked.

For instance, a compiler vendor can substitute inline assembly for a
function call and do other operations that are not actually attainable
from the C language itself.

Yes, but it'll do so again when called by the decompiler.

Compilers have bugs. I found a bug generated in the assembly of the
Intel compiler, for instance. Now, what is the decompiler going to do
with this incorrect assembly language when trying to turn it back into C?

The decompiler as described will not turn any object code back into C;
instead, it enumerates all possible source codes, compiles these using
the same implementation as was used to get the object code, and compares
the results.
This approach is of dubious practical use, yes; but it is only
equivalent to the traveling salesman's problem, not to circumventing
Goedel's theorem. IOW, it's "merely" very long-winded, not impossible.

Of course, it does require that you know, and have, the compiler which
was originally used, and all necessary configuration information; but in
the OP's case, he does know and have all that.

Richard

user923005 · Mar 11, 2009

No; merely that the writer of the decompiler has access to the compiler
that was used, and knows how it was invoked.

Yes, but it'll do so again when called by the decompiler.

The decompiler as described will not turn any object code back into C;
instead, it enumerates all possible source codes, compiles these using
the same implementation as was used to get the object code, and compares
the results.
This approach is of dubious practical use, yes; but it is only
equivalent to the traveling salesman's problem, not to circumventing
Goedel's theorem. IOW, it's "merely" very long-winded, not impossible.

Of course, it does require that you know, and have, the compiler which
was originally used, and all necessary configuration information; but in
the OP's case, he does know and have all that.

There are an infinite number of programs that produce a finite output.
You will never enumerate them all.

crisgoogle · Mar 11, 2009

There are an infinite number of programs that produce a finite output.
You will never enumerate them all.

You don't *need* to enumerate them all, though. You only need to
enumerate
until you find one that _does_ give you the output. Assuming that the
input source file was of finite length in the first place (yes, I know
what they say about "assume", but I'm willing to go out on a limb
here),
then this will take finite time.

No-one's arguing the feasibility or reasonableness of doing this;
they're
just arguing against your assertion that the problem reduces to the
halting problem.

Richard Bos · Mar 12, 2009

user923005 said:
There are an infinite number of programs that produce a finite output.
You will never enumerate them all.

True, but completely irrelevant, since all that is required is to find
_a_ source file that will compile, not _all_ possible ones.

Richard

JosephKK · Mar 13, 2009

Aha. Here is the problem: "which will take an arbitrary attribite grammar
description...". Deompilation to a particular language does *not* use an
arbitrary grammar description. There can be algorithms that use a
*particular* grammar description.

Dik, in this case the mathematical use of "arbitrary" is similar to
your use of "particular", but specifically more general (an in: any
particular).

James Kuyper · Mar 13, 2009

JosephKK said:
Dik, in this case the mathematical use of "arbitrary" is similar to
your use of "particular", but specifically more general (an in: any
particular).

That difference is (greater generality) leads to conclusions about what
is possible that don't necessarily apply to the less general case. And
what we're dealing with in this particular thread is a substantially
less general case.

JosephKK · Mar 14, 2009

There's some confusion here. If we know that the given executable file
is the output of a particular compiler whose output is a deterministic
function of its input, then certainly a brute-force search will find, in
finite time, some source file which when compiled by that compiler
yields the desired output.

I am not so assured about that. How reliably can you disambiguate
between pointer and array style accesses? How can you reconstruct
structs and unions reliably (especially those defined in header
files)? Furthermore, what if they are typedef'ed in header files?
(just to push the points) How about incomplete types?

Or maybe this is a lot easier than i know and never found the
literature that shown how easy it can be.

Keith Thompson · Mar 14, 2009

JosephKK said:
[...]

There's some confusion here. If we know that the given executable file
is the output of a particular compiler whose output is a deterministic
function of its input, then certainly a brute-force search will find, in
finite time, some source file which when compiled by that compiler
yields the desired output.

Click to expand...

I am not so assured about that. How reliably can you disambiguate
between pointer and array style accesses? How can you reconstruct
structs and unions reliably (especially those defined in header
files)? Furthermore, what if they are typedef'ed in header files?
(just to push the points) How about incomplete types?

You don't need to. It's not possible to reconstruct *the* C source
file that produced a given executable (unless the executable is
annotated with a lot of extra information), but it should be possible
to construct *a* C source file that, when compiled, will produce an
executable with the same behavior as the one you started with. If the
original C source used arrays and the re-generated one uses pointers,
that shouldn't matter.

If nothing else, you can emit the code for a C emulator for the
machine code, and include the machine code as a static array.

Whether it's possible to produce the exact same executable is
another question.

Of course, none of this is of any use for the original poster, who (as
I recall) was looking for a way to recover a C source file that he had
accidentally deleted. The answer to that is (a) he's almost certainly
out of luck, unless his system provides some way to recover deleted
files, and (b) for the future, use backups and a source control
system.

James Kuyper · Mar 14, 2009

JosephKK said:
I am not so assured about that. How reliably can you disambiguate
between pointer and array style accesses? How can you reconstruct
structs and unions reliably (especially those defined in header
files)? Furthermore, what if they are typedef'ed in header files?
(just to push the points) How about incomplete types?

Click to expand...

I don't see how that creates any insuperable problems. If the original
source code used a structure, the decompiled object file might use a
structure containing a different set of field names. If the original
source code used a typedef, the decompiled object file might use the
corresponding type directly (or vice versa). So? All of this is covered
by the specified objective: "some source file that when compiled by that
compiler yields the desired output".

Or maybe this is a lot easier than i know and never found the
literature that shown how easy it can be.

Click to expand...

No one said it was easy - he said "brute-force search", which isn't easy
at all. That phrase basically implies something like generating all
possible source code files of length N, in order of increasing N, until
you find one that compiles to an object file that matches the one you
want to decompile. For an object file known to have been produced from a
source code file of finite size (even if that size is not known), this
brute-force algorithm requires generation of only a finite number of
different source code files; it will certainly halt no later than the
iteration where N is equal to the actual length of the original source
code file, and for typical source code and typical real world languages,
it will probably terminate a lot earlier than that. Unless the compiler
gets stuck in an infinite loop on one of the source code files,
compilation of a finite number of source code files can be completed in
a finite amount of time.

This is NOT a practical algorithm, merely a proof of principle that
shows that the process is guaranteed to be performable in a finite
amount of time (unlike the halting problem).

Phil Carmody · Mar 14, 2009

JosephKK said:
Dik, in this case the mathematical use of "arbitrary" is similar to
your use of "particular", but specifically more general (an in: any
particular).

You seem to be repeatedly pretending to be the one on this
newsgroup with a unique mathematical insight. Unfortunately
your mathematical knowledge apparently isn't broad enough
to know who Dik is. I knew Dik for his mathematical exploits
long before I knew he programmed C. He knows the meanings of
the terms he's using, and is using them correctly.

Phil

JosephKK · Mar 15, 2009

You seem to be repeatedly pretending to be the one on this
newsgroup with a unique mathematical insight. Unfortunately
your mathematical knowledge apparently isn't broad enough
to know who Dik is. I knew Dik for his mathematical exploits
long before I knew he programmed C. He knows the meanings of
the terms he's using, and is using them correctly.

Phil

While i am quite recent, i calls 'em like i ses 'em. I am neither
particularly knowledgeable nor particularly ignorant. This is USENET.
.

JosephKK · Mar 15, 2009

JosephKK said:
JosephKK said:

[...]

There's some confusion here. If we know that the given executable file
is the output of a particular compiler whose output is a deterministic
function of its input, then certainly a brute-force search will find, in
finite time, some source file which when compiled by that compiler
yields the desired output.

Click to expand...

I am not so assured about that. How reliably can you disambiguate
between pointer and array style accesses? How can you reconstruct
structs and unions reliably (especially those defined in header
files)? Furthermore, what if they are typedef'ed in header files?
(just to push the points) How about incomplete types?

Click to expand...

You don't need to. It's not possible to reconstruct *the* C source
file that produced a given executable (unless the executable is
annotated with a lot of extra information), but it should be possible
to construct *a* C source file that, when compiled, will produce an
executable with the same behavior as the one you started with. If the
original C source used arrays and the re-generated one uses pointers,
that shouldn't matter.

If nothing else, you can emit the code for a C emulator for the
machine code, and include the machine code as a static array.

Whether it's possible to produce the exact same executable is
another question.

Of course, none of this is of any use for the original poster, who (as
I recall) was looking for a way to recover a C source file that he had
accidentally deleted. The answer to that is (a) he's almost certainly
out of luck, unless his system provides some way to recover deleted
files, and (b) for the future, use backups and a source control
system.

Agreed.

JosephKK · Mar 15, 2009

I don't see how that creates any insuperable problems. If the original
source code used a structure, the decompiled object file might use a
structure containing a different set of field names. If the original
source code used a typedef, the decompiled object file might use the
corresponding type directly (or vice versa). So? All of this is covered
by the specified objective: "some source file that when compiled by that
compiler yields the desired output".

No one said it was easy - he said "brute-force search", which isn't easy
at all. That phrase basically implies something like generating all
possible source code files of length N, in order of increasing N, until
you find one that compiles to an object file that matches the one you
want to decompile. For an object file known to have been produced from a
source code file of finite size (even if that size is not known), this
brute-force algorithm requires generation of only a finite number of
different source code files; it will certainly halt no later than the
iteration where N is equal to the actual length of the original source
code file, and for typical source code and typical real world languages,
it will probably terminate a lot earlier than that. Unless the compiler
gets stuck in an infinite loop on one of the source code files,
compilation of a finite number of source code files can be completed in
a finite amount of time.

This is NOT a practical algorithm, merely a proof of principle that
shows that the process is guaranteed to be performable in a finite
amount of time (unlike the halting problem).

Producing a C program that compiles to teh same executable seems to be
possible. Indeed, if viewed as a different form of compilation or
code language transformation it must be possible. Recovering the
variable names that are not exported in any way in any existing file
clearly is not a reasonable expectation. Indeed the translator may
not recognise all of the variables in the machine code.
.

James Kuyper · Mar 15, 2009

JosephKK wrote:
....

... Recovering the
variable names that are not exported in any way in any existing file
clearly is not a reasonable expectation. ...

No one has suggested otherwise; no one has defined the problem to be
solved in a way that requires recovering those names.

A process take input from /proc/<pid>/fd/0, but won't process it	0	Oct 29, 2023
Command Line Arguments	0	Mar 7, 2023
Makefile doesn't detect header file change	5	Mar 20, 2012
How can I view / open / render / display a pdf file with c code?	0	Sep 23, 2023
How can I upload a tar.bz2 file to OpenStack swift object storage container using the Python swift client?	1	Mar 22, 2024
PHP failed to create file	13	Dec 12, 2023
Copying files from sub folders under source directories into subfolders with same names as source di	17	May 19, 2014
Compilation of old source code.	0	Mar 3, 2022

Getting source file from the object file

Chris Dollin

Chris Dollin

Michael Angelo Ravera

Richard Bos

user923005

crisgoogle

Richard Bos

JosephKK

James Kuyper

JosephKK

Keith Thompson

James Kuyper

Phil Carmody

JosephKK

JosephKK

JosephKK

James Kuyper

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads