C source cruncher wanted

David Given

I have a project where I need to distribute 300kB of C source code as part
of a shell script, and would like to compress it as much as possible.

Does anyone know where I can get a (open source) tool for crunching C
programs? That is, removal of whitespace, comments, extraneous characters,
renaming identifiers to make them as short as possible, etc. Actual
outright obfuscation is not my goal; I just want to reduce the source size.

(Yes, I know I could use a tool such as gzip, but for various reasons I'd
like to make the actual source code as small as possible as well.)

It seems to be rather hard to find crunchers these days --- I know there
certainly used to be some, and you can still get them for Javascript, but I
can't find anything that works on C...

I'm using a Unix environment.

--
+- David Given --McQ-+ "I must have spent at least ten minutes out of my
|                    | life talking to this joker like he was a sane
|                    | person. I want a refund." --- Louann Miller, on
+- www.cowlark.com --+ rasfw

 
Skarmander

David said:
I have a project where I need to distribute 300kB of C source code as part
of a shell script, and would like to compress it as much as possible.

Does anyone know where I can get a (open source) tool for crunching C
programs? That is, removal of whitespace, comments, extraneous characters,
renaming identifiers to make them as short as possible, etc. Actual
outright obfuscation is not my goal; I just want to reduce the source size.

(Yes, I know I could use a tool such as gzip, but for various reasons I'd
like to make the actual source code as small as possible as well.)

It seems to be rather hard to find crunchers these days --- I know there
certainly used to be some, and you can still get them for Javascript, but I
can't find anything that works on C...

Likely because it makes people go "now what good is that", like I'm
going right now. Now what good is that? :)

It's easy enough to write something like that, though. Just grab any
random C parser + pretty printer from the net and modify it so it prints
small instead of pretty.

Renaming identifiers is slightly trickier because you have to take care
to do it only for non-external symbols (if your code is self-contained,
this doesn't matter). Additional complications arise depending on
whether you want to collapse units into one or not, and whether you're
willing to use #defines or not (I wouldn't bother; too much opportunity
for error). Then there's the ISO C limit on line length (I forget this
one; 509 characters?) that you'll have to respect if you want code to
remain portable.
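
For what it's worth, the 509 figure matches the C90 minimum translation limit for a logical source line; C99 raised it to 4095. A minimal sketch of a check a cruncher could run on its own output follows. The function name and the macro are made up for illustration, and it measures physical lines only, so it assumes the crunched source contains no backslash-newline splices.

```c
#include <stddef.h>

/* C90 minimum guaranteed line length; use 4095 if targeting C99. */
#define C90_LINE_LIMIT 509

/* Length of the longest line in src, not counting the newline. */
size_t longest_line(const char *src)
{
    size_t best = 0, cur = 0;

    for (; *src != '\0'; src++) {
        if (*src == '\n') {
            if (cur > best)
                best = cur;
            cur = 0;
        } else {
            cur++;
        }
    }
    /* The last line may not end in a newline. */
    return cur > best ? cur : best;
}
```

If longest_line(output) comes back greater than C90_LINE_LIMIT, the cruncher has to emit a newline between two tokens somewhere on the offending line.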

But I'll still go on record as saying it's not worth it. Any platform
that can compile C has gzip (or is capable of decompressing the format,
at least). The source produced this way is nearly useless to
maintainers, especially if identifiers are renamed, whether obfuscation
is your goal or not.

You also don't save on compilation time: either the code is stable, in
which case a one-time saving of this magnitude is probably irrelevant,
or it's not stable, in which case you have to "compress" it every time
you change the original, which takes more time than just feeding it to
the compiler, unless your compiler really sucks. About the only thing I
can imagine this is good for is reduced transmission times over a
network, but again, gzip is your friend. In fact, HTTP has built-in support.

S.
 
Dale

David Given said:

I have a project where I need to distribute 300kB of C source code as
part of a shell script, and would like to compress it as much as
possible.

Does anyone know where I can get a (open source) tool for crunching C
programs? That is, removal of whitespace, comments, extraneous
characters, renaming identifiers to make them as short as possible,
etc. Actual outright obfuscation is not my goal; I just want to reduce
the source size.

(Yes, I know I could use a tool such as gzip, but for various reasons
I'd like to make the actual source code as small as possible as well.)

It seems to be rather hard to find crunchers these days --- I know
there certainly used to be some, and you can still get them for
Javascript, but I can't find anything that works on C...

I'm using a Unix environment.

Well, there's always sed. You could use that to remove all the spaces,
tabs, newline characters and whatever else you want to be rid of. Just
replace the unwanted characters with nothing (e.g., s/\ //g to get rid of
spaces).
 
Eric Sosman

David Given wrote on 10/12/05 08:07:

I have a project where I need to distribute 300kB of C source code as part
of a shell script, and would like to compress it as much as possible.

Does anyone know where I can get a (open source) tool for crunching C
programs? That is, removal of whitespace, comments, extraneous characters,
renaming identifiers to make them as short as possible, etc. Actual
outright obfuscation is not my goal; I just want to reduce the source size.
[...]

CB Falconer (anybody know why he's been so silent of late?)
has made mention of an identifier-renaming program he wrote;
you might be able to modify it to squeeze out excess white space
at the same time. I don't have a link to his code repository,
but if you Google your way through some of his postings to this
group you'll probably find it.

(Still, I've got to echo Skarmander's question: "Now, what
good is that?")
 
Kevin Handy

Dale said:
Well, there's always sed. You could use that to remove all the spaces,
tabs, newline characters and whatever else you want to be rid of. Just
replace the unwanted characters with nothing (e.g., s/\ //g to get rid of
spaces).

Idon'tthinkyouwanttoremoveallspaces,especiallyinquotedstrings.
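
A minimal sketch of a safer alternative to a blanket substitution, not taken from any existing tool (the function name is made up): collapse each run of spaces and tabs to a single space, drop leading and trailing blanks on each line, and copy string and character literals through untouched. It keeps newlines, since preprocessor directives depend on them, and it assumes comments have already been removed, because an apostrophe inside a comment would otherwise be mistaken for the start of a character literal.

```c
#include <stddef.h>

/* Squeeze blanks outside of literals. out must have room for
 * strlen(in) + 1 bytes. Returns the length of the result. */
size_t squeeze_blanks(const char *in, char *out)
{
    size_t n = 0;
    char quote = 0;   /* 0, '"' or '\'' while inside a literal */

    for (size_t i = 0; in[i] != '\0'; i++) {
        char c = in[i];

        if (quote) {
            out[n++] = c;
            if (c == '\\' && in[i + 1] != '\0')
                out[n++] = in[++i];      /* keep escaped char, e.g. \" */
            else if (c == quote)
                quote = 0;               /* literal ends here */
        } else if (c == ' ' || c == '\t') {
            /* Emit one space per run; none at the start of a line. */
            if (n > 0 && out[n - 1] != ' ' && out[n - 1] != '\n')
                out[n++] = ' ';
        } else {
            /* Drop the blank left at the end of a line. */
            if (c == '\n' && n > 0 && out[n - 1] == ' ')
                n--;
            if (c == '"' || c == '\'')
                quote = c;
            out[n++] = c;
        }
    }
    out[n] = '\0';
    return n;
}
```

Removing the remaining single spaces safely (for example between "int" and "x", or in "a + +b") needs real tokenising, which this sketch does not attempt.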

I really don't understand the need for this "crunching", unless
it is for obfuscation, and then there are probably better ways
of doing that. Running it through a "pretty-printer" like indent
would "decrunch" the source.

Now, if the original poster would specify why he wants to do
this, we can comment on it intelligently.
 
David Given

Kevin Handy wrote:
[...]
I really don't understand the need for this "crunching", unless
it is for obfuscation, and then there are probably better ways
of doing that. Running it through a "pretty-printer" like indent
would "decrunch" the source.

Now, if the original poster would specify why he wants to do
this, we can comment on it intelligently.

Surely that's irrelevant? I do know what I'm looking for, and I do have
specific reasons for wanting it.

FWIW, what I've got is a build utility consisting of a shell script
containing a script and a chunk of source code which is the interpreter for
the script. When the utility is run for the first time, it will unpack the
interpreter, compile it, stash the binary somewhere, and then use it to
invoke the script.

The interpreter is currently pretty chunky. I want people to be able to
deploy the utility by just dropping it into a source distribution, which
means I want to make it as small as possible. Being able to read the code
isn't an issue, because if you're developing, you use the full,
uncompressed source.

I'm currently building several versions of the shell script package, using
different encodings for the interpreter source. The uncompressed version is
about 400kB. The non 7-bit clean version, which is diff unfriendly and uses
a gzip compressed data chunk, is 100kB. The 7-bit clean version, which uses
gzip and then uuencode, is 150kB. If I can reduce the size of the
interpreter source then I can reduce the size of the package, even if it is
using gzip. It's worth noting that using 'cobfusc -dem' I can reduce the
source code size by 40%, which reduces the gzip compressed version by 25%,
so using a code cruncher *is* useful; but cobfusc was not intended for code
compression, so I can't achieve any further savings.

None of this is particularly on-topic, which is why I didn't mention it to
begin with...
 
Michael Wojcik

FWIW, what I've got is a build utility consisting of a shell script
containing a script and a chunk of source code which is the interpreter for
the script. When the utility is run for the first time, it will unpack the
interpreter, compile it, stash the binary somewhere, and then use it to
invoke the script.

Ah, it's "Revenge of the Shell Archive".

The interpreter is currently pretty chunky. I want people to be able to
deploy the utility by just dropping it into a source distribution, which
means I want to make it as small as possible. Being able to read the code
isn't an issue, because if you're developing, you use the full,
uncompressed source.

I'm currently building several versions of the shell script package, using
different encodings for the interpreter source. The uncompressed version is
about 400kB. The non 7-bit clean version, which is diff unfriendly and uses
a gzip compressed data chunk, is 100kB. The 7-bit clean version, which uses
gzip and then uuencode, is 150kB.

uuencode is a lousy encoding (its expansion ratio is 5:3). Base64
would be significantly better (4:3). You should drop about 20KB with
Base64.
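
As a rough check on the Base64 side of that estimate, here is a small sketch of the arithmetic (the function names are made up, and 1kB is assumed to mean 1000 bytes). Every 3 input bytes become 4 output characters, with the last group padded, and a line-wrapped encoding adds one newline per output line.

```c
#include <stddef.h>

/* Characters needed to Base64-encode n bytes: each group of 3 bytes
 * (rounded up, because of '=' padding) becomes 4 characters. */
size_t base64_len(size_t n)
{
    return 4 * ((n + 2) / 3);
}

/* Same, plus one newline per output line of width characters.
 * A final partial line is assumed to end in a newline too. */
size_t base64_len_wrapped(size_t n, size_t width)
{
    size_t chars = base64_len(n);
    size_t lines = (chars + width - 1) / width;

    return chars + lines;
}
```

For the 100kB gzip chunk, base64_len(100000) is 133336 and base64_len_wrapped(100000, 76) is 135091, so the encoded payload lands at roughly 133kB to 135kB before the rest of the shell script is added.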

(Are you using gzip with maximum compression?)

If I can reduce the size of the
interpreter source then I can reduce the size of the package, even if it is
using gzip.

Someone already suggested modifying a source reformatter (like
indent), since you basically need a C parser and a backend that
writes C source in something close to its minimal representation.
Personally, I wouldn't bother with renaming identifiers - I think
that's past the point of diminishing returns.

It shouldn't be hard to remove comments (assuming no pathological
cases; see the thread starting at [1]) and leading/trailing
whitespace from your source, if you want to do it yourself. That
alone should get you some savings.
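
A minimal sketch of the comment-removal half, under the "no pathological cases" assumption: it does not handle backslash-newline splices or trigraphs, and the function name is made up. Block comments are replaced by a single space, which is what the compiler does with them, so tokens on either side never fuse. String and character literals are copied verbatim, so comment markers inside them survive.

```c
#include <stddef.h>

/* Copy in to out without comments. out must have room for
 * strlen(in) + 1 bytes. Returns the length of the result. */
size_t strip_comments(const char *in, char *out)
{
    size_t n = 0;
    size_t i = 0;

    while (in[i] != '\0') {
        if (in[i] == '"' || in[i] == '\'') {
            /* Literal: copy up to the matching unescaped quote. */
            char quote = in[i];
            out[n++] = in[i++];
            while (in[i] != '\0' && in[i] != quote) {
                if (in[i] == '\\' && in[i + 1] != '\0')
                    out[n++] = in[i++];   /* copy the backslash */
                out[n++] = in[i++];
            }
            if (in[i] == quote)
                out[n++] = in[i++];
        } else if (in[i] == '/' && in[i + 1] == '*') {
            /* Block comment: skip to the terminator, emit one space. */
            i += 2;
            while (in[i] != '\0' && !(in[i] == '*' && in[i + 1] == '/'))
                i++;
            if (in[i] != '\0')
                i += 2;
            out[n++] = ' ';
        } else if (in[i] == '/' && in[i + 1] == '/') {
            /* Line comment: skip to, but keep, the newline. */
            while (in[i] != '\0' && in[i] != '\n')
                i++;
        } else {
            out[n++] = in[i++];
        }
    }
    out[n] = '\0';
    return n;
}
```

Stripping leading and trailing whitespace from each line afterwards is a separate, simpler pass.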


1. http://groups.google.com/group/comp.lang.c/msg/41a4486b8ae7dcc1
 
the Swampster

David Given said:
I have a project where I need to distribute 300kB of C source code as part
of a shell script, and would like to compress it as much as possible.
Does anyone know where I can get a (open source) tool for crunching C
programs? That is, removal of whitespace, comments, extraneous characters,
renaming identifiers to make them as short as possible, etc. Actual
outright obfuscation is not my goal; I just want to reduce the source size.

IIRC there's something like that in "A Book on C" by Kelley & Pohl.
 
