RLE code review

nullptr · Nov 8, 2003

Hi,

As an exercise, I've written a program implementing run-length encoding.
I would be more that happy to hear your comments, suggestions, etc

Thanks,
---
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
int cur = 0, i = 0, prev = 0;
FILE *src, *dest;

if(argc == 4)
{
src = fopen(argv[argc-2], "rb");
dest = fopen(argv[argc-1], "wb");

if(src == NULL)
{
fprintf(stderr, "error: %s can't openned\n", argv[argc-2]);
return -1;
}

if(dest == NULL)
{
fprintf(stderr, "error: %s can't openned \n", argv[argc-1]);
return -1;
}

if((strcmp(argv[1], "--dump")) == 0)
{
while((cur = fgetc(src)) != EOF)
fprintf(dest, "%x ", cur);

fclose(src);
fclose(dest);
}

else if((strcmp(argv[1], "--comp")) == 0)
{

for(;fscanf(src, "%x", &cur) == 1; prev = cur, i++)
{
if(prev != cur)
{
if(prev != 0)
fprintf(dest, "%02d %x ", i, prev);
i = 0;
}
}
/* dump the unprocessed data */
if(i)
fprintf(dest, "%02d %x ", i, prev);
fclose(src);
fclose(dest);

}

else if((strcmp(argv[1], "--uncomp")) == 0)
{
while(fscanf(src, "%d%x", &i, &cur) == 2)
{
while(i-- > 0)
fprintf(dest, "%x ", cur);
}
fclose(src);
fclose(dest);
}

else
fprintf(stderr, "error: unknown option %s\n", argv[1]);
}

else
printf("usage: hexdump OPTIONS <src-file> <dest-file>\n"
"\nOPTIONS:\n"
"\t--dump \t dump src-file in hex;\n"
"\t--comp \t perform rle compression on src-file;\n"
"\t--uncomp\t decompress the data\n");

return 0;
}

Richard Heathfield · Nov 8, 2003

nullptr said:
Hi,

As an exercise, I've written a program implementing run-length encoding.
I would be more that happy to hear your comments, suggestions, etc

Thanks,
---
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
int cur = 0, i = 0, prev = 0;
FILE *src, *dest;

Personally, I'd set these to NULL, to make it easier to find out if you'd
used them accidentally.

if(argc == 4)
{
src = fopen(argv[argc-2], "rb");
dest = fopen(argv[argc-1], "wb");

If argc is 4, then argc - 2 and argc - 1 are basically 2 and 3, so why not
say so? Oh, what's argv[1] for? Why not document this stuff?

if(src == NULL)
{
fprintf(stderr, "error: %s can't openned\n", argv[argc-2]);
return -1;

Make this: return EXIT_FAILURE; if you wish the code to be portable. You

}

if(dest == NULL)
{
fprintf(stderr, "error: %s can't openned \n", argv[argc-1]);
return -1;
}

if((strcmp(argv[1], "--dump")) == 0)
{
while((cur = fgetc(src)) != EOF)
fprintf(dest, "%x ", cur);

Up to you, of course, but I prefer to use fwrite for binary files, and
fprintf for text files. You seem to be trying to write information to a
binary file in a text format. In my experience, that's a recipe for
head-scratching at some point.

fclose(src);
fclose(dest);
}

else if((strcmp(argv[1], "--comp")) == 0)
{

for(;fscanf(src, "%x", &cur) == 1; prev = cur, i++)
{
if(prev != cur)
{
if(prev != 0)
fprintf(dest, "%02d %x ", i, prev);

Same issue here, of course. Now, what happens when you have a block of more
than INT_MAX characters in a row?

i = 0;
}
}
/* dump the unprocessed data */
if(i)
fprintf(dest, "%02d %x ", i, prev);

Well remembered.

fclose(src);
fclose(dest);

}

else if((strcmp(argv[1], "--uncomp")) == 0)
{
while(fscanf(src, "%d%x", &i, &cur) == 2)
{
while(i-- > 0)
fprintf(dest, "%x ", cur);

I don't think that's going to restore your original file.

}
fclose(src);
fclose(dest);
}

else
fprintf(stderr, "error: unknown option %s\n", argv[1]);
}

else
printf("usage: hexdump OPTIONS <src-file> <dest-file>\n"
"\nOPTIONS:\n"
"\t--dump \t dump src-file in hex;\n"
"\t--comp \t perform rle compression on src-file;\n"
"\t--uncomp\t decompress the data\n");

return 0;
}

Oh, just one thing - the program doesn't work. A cursory inspection doesn't
reveal why, but I ran it on itself and got a 0 byte compressed file. I
mean, great compression, *yes*, but it makes it more difficult to
reconstruct the original than might be considered strictly necessary.

Arthur J. O'Dwyer · Nov 8, 2003

As an exercise, I've written a program implementing run-length encoding.
I would be more that happy to hear your comments, suggestions, etc

Sure thing.

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
int cur = 0, i = 0, prev = 0;
FILE *src, *dest;

Richard suggests initializing all your variables
to *some* mnemonic value at declaration time; e.g.,

FILE *src = NULL;
FILE *dest = NULL;

My opinion differs on this matter; in my opinion,
you shouldn't even be initializing 'cur' and 'prev'
until you have figured out whether the program is
going to be processing a file or not, at least.

At any rate, I think we all agree that variable
declarations should be one per line unless there's
a readability issue at stake:

FILE *src = NULL;
FILE *dest = NULL;
int cur;
int prev;
int i = 0;

Finally, I see that you seem to be using the value
of 'i' to represent several different things in this
program (?). Usually, single-character names like
'i' (and 'j' and 'k') are "reserved" for use as loop
counters. When you break with this convention, it
makes your writing more confusing than it really needs
to be. I suggest you remove the declaration of 'i'
and insert declarations of better-named variables
inside the appropriate 'if' blocks [see below].

if(argc == 4)
{
src = fopen(argv[argc-2], "rb");
dest = fopen(argv[argc-1], "wb");

As Richard already said, [argc-2] is just a fancy
way of writing [2] at this point.
More importantly, you're sticking an awful lot of
code inside this if-block. You need to figure out
what the essential "building blocks" of your program
are going to be, and split them out into functions.
In a program like this, I would make one function

int process_rle(FILE *input, FILE *output);

and have 'main' call that function when needed. But
then, I see that this program really has three distinct
functionalities, so let's give it three distinct
processing functions:

int process_comp(FILE *input, FILE *output);
int process_uncomp(FILE *input, FILE *output);
int process_dump(FILE *input, FILE *output);

Try it out!

if(src == NULL)
{
fprintf(stderr, "error: %s can't openned\n", argv[argc-2]);
return -1;

Grammatical and spelling errors in your error messages
are never a good sign.

}

if(dest == NULL)
{
fprintf(stderr, "error: %s can't openned \n", argv[argc-1]);
return -1;
}

if((strcmp(argv[1], "--dump")) == 0)

See my note further down about user-interface issues.

{
while((cur = fgetc(src)) != EOF)

No reason to use 'fgetc' here when 'getc' would work
just as well. The only difference between 'fgetc'
and 'getc' is that 'fgetc' is guaranteed to evaluate
its argument ('src') only once. And that doesn't
matter in this case. So use 'getc'.

fprintf(dest, "%x ", cur);

You're reading single bytes from 'src', but writing
hexadecimal numbers to 'dest'? What exactly is this
program supposed to be doing? Do you have another
program that translates the hex output back into
single bytes, or do you really want this sort of
output?

fclose(src);
fclose(dest);
}

else if((strcmp(argv[1], "--comp")) == 0)
{
for(;fscanf(src, "%x", &cur) == 1; prev = cur, i++)

*This* is where you should have initialized 'i' to zero --
in the initialization section of the 'for' loop that uses
its value! [Also, 'i' should be named 'count' or similarly.]

{
if(prev != cur)
{
if(prev != 0)

What if the file contains '\0' bytes? You never
print those, do you?

fprintf(dest, "%02d %x ", i, prev);

Ye gods, that's confusing. You're going to alternate
between printing decimal and hexadecimal numbers to
the output file? How's the reader supposed to tell which
is which?

i = 0;
}
}
/* dump the unprocessed data */
if(i)
fprintf(dest, "%02d %x ", i, prev);
fclose(src);
fclose(dest);
}

else if((strcmp(argv[1], "--uncomp")) == 0)
{
while(fscanf(src, "%d%x", &i, &cur) == 2)
{
while(i-- > 0)

Here's the second bizarre use of 'i' to mean 'count'.
Better to use a loop syntax that's obviously correct,
like
int i;
for (i=0; i < count; ++i)

fprintf(dest, "%x ", cur);
}
fclose(src);
fclose(dest);
}

else
fprintf(stderr, "error: unknown option %s\n", argv[1]);

....and close the files, too, or are you giving up on
that? Be consistent!
[You don't *have* to fclose(src) and dest if you don't
want to -- the program will take care of that for you
when it exits. But since you did it consistently in every
other case, why not do it here too? Of course, once you've
pulled out the 'process_rle' function(s), you'll have only
one place that needs 'fclose's. Hint hint.]

}

else
printf("usage: hexdump OPTIONS <src-file> <dest-file>\n"
"\nOPTIONS:\n"
"\t--dump \t dump src-file in hex;\n"
"\t--comp \t perform rle compression on src-file;\n"
"\t--uncomp\t decompress the data\n");

Okay. User-interface time.
First off, don't use '\t' characters in output that's supposed
to be human-readable. Use spaces instead; makes the output much
more consistent between different user configurations. Or if
you're really hardcore, just use the 'printf' format specifiers.

Next, note that when you write "OPTIONS", most people are going
to assume that those "OPTIONS" are... well, OPTIONAL. So it may
come as a surprise to them to find out that the program just
shows this same help message over and over if they don't enter
exactly *one* of these OPTIONS as the first parameter.
Next, note that this program doesn't actually compress very
many files. It'll probably expand anything you give it,
*especially* text or images. That's because it reads bytes
and for each byte, writes up to five bytes of text. That's
kind of silly, unless you're just doing that kind of output
for debugging purposes.
Finally, once you've split out those 'process_*' functions,
you will find that your 'main' function is really empty! Put
that empty space to good use by adding a decent parser for your
command line -- never force the user to enter arguments in a
certain order if there's another logical way they could do it.
See http://www.contrib.andrew.cmu.edu/~ajo/free-software/detab.c
for a simple example of a decent 'main' function.

return 0;

Good job.

}

All in all, a better program than the average first effort.
But you need to work on splitting up functions; on variable
naming and encapsulation (declare 'i' close to where it's
used, for example); and on your program concept in general--
where are you trying to take this program? Did you want to
use putc() instead of fprintf(), for example? How about
adding some error-checking on the value of 'i', making sure
it won't overflow? Things like that.

Hope this helped,
-Arthur

nullptr · Nov 8, 2003

Thanks for suggestions. I'll work on this, and post the modified version
at the later time.

BTW, this program is a exercise from K.N. King "C Prgramming: A modern
Approach", Chp. 22 (ex.17).

nullptr · Nov 8, 2003

All right, here's the updated version. Obviously, it only works well with
the long sequences of identcal bytes.

else if((strcmp(argv[1], "--uncomp")) == 0)
{
while(fscanf(src, "%d%x", &i, &cur) == 2)
{
while(i-- > 0)

Click to expand...

Here's the second bizarre use of 'i' to mean 'count'.
Better to use a loop syntax that's obviously correct,
like
int i;
for (i=0; i < count; ++i)

The advantage of the while loop like that is that it doesn't use an
additional variable.

--
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void rle_comp(FILE *src, FILE *dest)
{
char prev = 0;
char cur = 0;
int count; /* number of identical bytes */

for(count = 0; fread(&cur, sizeof(char), 1, src) == 1;
prev = cur, count++)
{
if(prev != cur)
{
if(prev != 0)
{
fwrite(&count, sizeof(int), 2, dest);
fwrite(&prev, sizeof(char), 1, dest);
}
count = 0;
}
}

/* dump the unprocessed data */
if(count)
{
fwrite(&count, sizeof(int), 2, dest);
fwrite(&prev, sizeof(char), 1, dest);
}
}

void rle_uncomp(FILE *src, FILE *dest)
{
int count = 0;
char cur = 0;

while(fread(&count, sizeof(int), 2, src) == 2 &&
fread(&cur, sizeof(char), 1, src) == 1)
{
while(count-- > 0)
fwrite(&cur, sizeof(char), 1, dest);
}
}

int main(int argc, char *argv[])
{
FILE *src, *dest;

if(argc == 4)
{
src = fopen(argv[2], "rb");
dest = fopen(argv[3], "wb");

if(src == NULL)
{
fprintf(stderr, "error: %s can't be openned\n", argv[2]);
exit(EXIT_FAILURE);
}

if(dest == NULL)
{
fprintf(stderr, "error: %s can't be openned\n", argv[3]);
exit(EXIT_FAILURE);
}

else if((strcmp(argv[1], "--comp")) == 0)
rle_comp(src, dest);

else if((strcmp(argv[1], "--uncomp")) == 0)
rle_uncomp(src, dest);

else
fprintf(stderr, "error: unknown option %s\n", argv[1]);

fclose(src);
fclose(dest);
}

else
printf("usage: hexdump OPTION <src-file> <dest-file>\n"
"\nwhere OPTION is mandatory:\n"
" --comp perform rle compression on src-file\n"
" --uncomp decompress src-file into dest-file\n");

return 0;
}

Simon Biber · Nov 8, 2003

Richard Heathfield said:
Oh, just one thing - the program doesn't work. A cursory inspection
doesn't reveal why, but I ran it on itself and got a 0 byte compressed
file. I mean, great compression, *yes*, but it makes it more difficult
to reconstruct the original than might be considered strictly necessary.

The compression routine expects to read each byte as a hexadecimal
integer using fscanf:
for(;fscanf(src, "%x", &cur) == 1; prev = cur, i++)

This means it should work on files that have already been passed
through the 'dump' option.

Unfortunately, there is no 'undump' option that takes the resulting
text file and converts it back to binary.

Sheldon Simms · Nov 8, 2003

All right, here's the updated version. Obviously, it only works well with
the long sequences of identcal bytes.

Before I say anything else, I have to ask, have you tested this program?
It doesn't work when I compile it. I can "compress" a file, but after
"compressing" this source code I got the following result:

[sheldon@wsxyz mcc]$ gcc test.c
[sheldon@wsxyz mcc]$ ./a.out --comp test.c testout
[sheldon@wsxyz mcc]$ ls -l test*
-rw-rw-r-- 1 sheldon sheldon 1801 Nov 8 12:19 test.c
-rw-rw-r-- 1 sheldon sheldon 13860 Nov 8 12:20 testout

Hmmm, not exactly compressed is it? The uncompress looks like this:

[sheldon@wsxyz mcc]$ ./a.out --uncomp testout test.test.c
Segmentation fault

Oh well. Let's look at the code.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void rle_comp(FILE *src, FILE *dest)
{
char prev = 0;
char cur = 0;
int count; /* number of identical bytes */

for(count = 0; fread(&cur, sizeof(char), 1, src) == 1;

^^^^^^^^^^^^

sizeof(char) is by definition equal to 1, so this isn't really
useful. You could just pass a constant 1 as the argument, but
maybe you'll want to change the code later to run-length-encode
ints. So do it like this:

fread(&cur, sizeof cur, 1, src) == 1

Now it will work whether cur is a char, short, int or pretty much
anything else.

prev = cur, count++)
{
if(prev != cur)
{
if(prev != 0)

What if the file contains bytes with the value zero? Might cause
a problem.

{
fwrite(&count, sizeof(int), 2, dest);

This is quite interesting. The third argument to fwrite is the number
of elements, each of which has size equal to the second argument, that
you want to write to the file.

Here you are trying to write two ints, each of which has size sizeof(int)
to the file. However, you are passing the address of one (1) int as the
first argument. This will not work.

You want to write 1 int to the file, like this:

fwrite(&count, sizeof count, 1, dest);

fwrite(&prev, sizeof(char), 1, dest);

As above, use the sizeof object expression when you can:

fwrite(&prev, sizeof char, 1, dest);

}
count = 0;
}
}

/* dump the unprocessed data */
if(count)
{
fwrite(&count, sizeof(int), 2, dest);
fwrite(&prev, sizeof(char), 1, dest);

Same problems as above.

}
}

void rle_uncomp(FILE *src, FILE *dest)
{
int count = 0;
char cur = 0;

while(fread(&count, sizeof(int), 2, src) == 2 &&

This looks like the source of the segmentation fault. Trying
to read two (2) ints into a single int. Change as above, and
correct the expected return value from fread() to match.

fread(&cur, sizeof(char), 1, src) == 1)
{
while(count-- > 0)
fwrite(&cur, sizeof(char), 1, dest);
}
}

After making the above changes, the program works... kind of. It
successfully "compresses" and "uncompresses" files. However, what
the program calls "compress", I call "expand". Here are my results.

[sheldon@wsxyz mcc]$ gcc test.c
[sheldon@wsxyz mcc]$ ./a.out --comp test.c testout
[sheldon@wsxyz mcc]$ ls -l test*
-rwxrwxr-x 1 sheldon sheldon 11944 Nov 4 11:45 test
-rw-rw-r-- 1 sheldon sheldon 1801 Nov 8 12:37 test.c
-rw-rw-r-- 1 sheldon sheldon 7700 Nov 8 12:38 testout
[sheldon@wsxyz mcc]$ ./a.out --uncomp testout test.test.c
[sheldon@wsxyz mcc]$ ls -l test*
-rwxrwxr-x 1 sheldon sheldon 11944 Nov 4 11:45 test
-rw-rw-r-- 1 sheldon sheldon 1801 Nov 8 12:37 test.c
-rw-rw-r-- 1 sheldon sheldon 7700 Nov 8 12:38 testout
-rw-rw-r-- 1 sheldon sheldon 1801 Nov 8 12:39 test.test.c
[sheldon@wsxyz mcc]$ diff test.c test.test.c
[sheldon@wsxyz mcc]$

Think about why this is happening and how you might be able to change
the program so that "compress" really compresses. In particular,
consider how many bytes are being written to the output file in
comparison to the number of bytes being read. What if my input file
is a text file that looks like this:

xxooxxooxxooxxooxxooxxooxxooxxooxxooxxooxxooEOF

What would the compressed form of this file be?

-Sheldon

nullptr · Nov 8, 2003

Before I say anything else, I have to ask, have you tested this program?
It doesn't work when I compile it. I can "compress" a file, but after
"compressing" this source code I got the following result:

Yes, I've tested it on my FreeBSD system, and the results are the same
after modifying fread/fwrite calls.

Think about why this is happening and

how you might be able to change

the program so that "compress" really compresses. In particular,
consider how many bytes are being written to the output file in
comparison to the number of bytes being read. What if my input file is a
text file that looks like this:

xxooxxooxxooxxooxxooxxooxxooxxooxxooxxooxxooEOF

What would the compressed form of this file be?

-Sheldon

The compressed output would be something like 2x2o2x2o...

I know, that this isn't compressing anything unless there's a long number
of the same bytes, but isn't is the nature of rle? Anyway, do you know of
any solution for this?

Richard Heathfield · Nov 8, 2003

Simon said:
The compression routine expects to read each byte as a hexadecimal
integer using fscanf:
for(;fscanf(src, "%x", &cur) == 1; prev = cur, i++)

Ah, thank you for pointing out the obvious.

Simon Biber · Nov 8, 2003

nullptr said:
The compressed output would be something like 2x2o2x2o...

No, it looks more like
02 00 00 00 x 02 00 00 00 o 02 00 00 00 x 02 00 00 00 o ...
That is, each integer is probably taking up four bytes.

I know, that this isn't compressing anything unless there's a long
number of the same bytes, but isn't is the nature of rle? Anyway,
do you know of any solution for this?

The problem is reduced somewhat if you reduce the count to an
unsigned char (one byte). Note the compare against UCHAR_MAX
to cater for repetitions longer than the maximum value you
can store in a byte, (which is typically 255). You should have
made it check for overflow of the int count before, anyway.

I also rewrote the while loop in the uncompressor, as I think
an incrementing for loop from 0 up to the count is clearer,
and more obviously free of any problems to do with unsigned
numbers never going below zero.

Lastly, I moved around the code in main to make it only open
the files if the option specified was correct.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>

void rle_comp(FILE * src, FILE * dest)
{
char prev = 0;
char cur = 0;
unsigned char count; /* number of identical bytes */

for (count = 0; fread(&cur, sizeof(char), 1, src) == 1;
prev = cur, count++) {
if (prev != cur || count == UCHAR_MAX) {
if (prev != 0) {
fwrite(&count, sizeof count, 1, dest);
fwrite(&prev, sizeof prev, 1, dest);
}
count = 0;
}
}

/* dump the unprocessed data */
if (count) {
fwrite(&count, sizeof count, 1, dest);
fwrite(&prev, sizeof prev, 1, dest);
}
}

void rle_uncomp(FILE * src, FILE * dest)
{
unsigned char count = 0;
char cur = 0;
int i;

while (fread(&count, sizeof count, 1, src) == 1 &&
fread(&cur, sizeof cur, 1, src) == 1) {
for (i = 0; i < count; i++)
fwrite(&cur, sizeof cur, 1, dest);
}
}

int main(int argc, char *argv[])
{
FILE *src, *dest;

if (argc == 4) {
void (*option) (FILE *, FILE *);

if ((strcmp(argv[1], "c")) == 0)
option = rle_comp;
else if ((strcmp(argv[1], "x")) == 0)
option = rle_uncomp;
else {
fprintf(stderr, "Error: unknown option %s\n", argv[1]);
exit(EXIT_FAILURE);
}

src = fopen(argv[2], "rb");
if (src == NULL) {
fprintf(stderr, "Error: %s can't be opened\n", argv[2]);
exit(EXIT_FAILURE);
}

dest = fopen(argv[3], "wb");
if (dest == NULL) {
fprintf(stderr, "Error: %s can't be opened\n", argv[3]);
exit(EXIT_FAILURE);
}

option(src, dest);

fclose(src);
fclose(dest);
}

else
printf("Usage: rle COMMAND <src-file> <dest-file>\n"
"\nwhere COMMAND is mandatory:\n"
" c perform rle compression on src-file\n"
" x decompress src-file into dest-file\n");

return 0;
}

nullptr · Nov 8, 2003

On Sun, 09 Nov 2003 05:42:40 +1100, Simon Biber wrote:
[snip]

Thanks, your code was a lot of help.

Chris Torek · Nov 8, 2003

[straying into comp.programming territory, really, but what the heck]

The [run-length encoded] output would be something like 2x2o2x2o...

I know, that this isn't compressing anything unless there's a long number
of the same bytes, but isn't is the nature of rle?

Indeed it is.

Anyway, do you know of any solution for this?

There are a number of approaches. My favorite is a two-pronged
attack:

- First, for handling single-character values ("1 occurrence of
a, 1 of b, 1 of c"), avoid the prefix count entirely. This
requires picking out some sort of "unlikely" character and
using that to indicate "n occurrences of some character will
follow". Just for example, let me use "z".

Now if the input is:

aaaabcdefg

the output will read:

z4abcdefg

Here we have made the output one character shorter than the input.
Because it takes at least three characters to encode the length
-- "z", the "compression marker", followed by the count (4),
followed by the character being compressed -- we profit only if
a character repeats at least three times.

- The second prong: variable-length count-encoding. (I learned
this trick from Knuth, who used it in nybble-encoded run lengths
in TeX/Metafont ".gf" font files.)

Note that in the example above I used a single digit-character
('4') to express "four occurrences of the letter a". What if
we need 14 occurrences -- how can we write that?

One method is to use binary files and simply restrict the count
to "somewhere between 0 and 255" (255 is portable because we
know that CHAR_BIT is at least 8). But we can do a bit better
by using a trick. Just going back to the textual representation,
what sense is there in saying "0 occurrences of a"? (None of
course -- it just expands the "compressed" file, uselessly.)
But what if we were to write "14" as "014", and 140 as "00140"?
Now the number of "0" digits tells us how many *more* digits to
expect.

This even works for binary files: if you believe you will have
occurrences of many-thousands-bytes-of-'p', a long string of "255
p, 255 p, 255 p, 255 p, ..." can be replaced with something like:
"0, i.e., pick up at least TWO bytes; 0; i.e., pick up at least
THREE bytes; 5; 67; 17 => 5 * 65536 + 67 * 256 + 17 => 344839; p"
which means "output 344839 p's".

This still leaves a problem. I picked "z" above as an "unlikely"
character, but what if we need to output a literal "z"? The
straightforward answer is to encode *any* string of "z"s, including
just one, so that one "z" comes out as "z1z".

This does mean that some input files will expand. As it turns out,
NO MATTER WHAT compression algorithm we choose, there will be SOME
input file(s) that do not compress, and even expand. (Actually this
is not strictly true -- it only holds for any lossless compression
scheme.) A proof of this is beyond the scope of this newsgroup

but it is easy to see in a handwavey fashion if you think of it
this way:

Any lossless compression scheme works by taking advantage of
redundancy in the input data, and eliminating (some of) the
redundancy. Thus, all you have to do to defeat it is come up
with an input that has no redundancy.

It turns out that almost every "reasonable" input has suitable
redundancy, so that a "good" compression algorithm (such as LZW
or bzip2) will compress most inputs. It also turns out that
the output of a good compression algorithm is the worst kind of
input to that compression algorithm, though -- so attempting to
re-compress a compressed file is (or at least should be) futile.

Irrwahn Grausewitz · Nov 8, 2003

nullptr said:
All right, here's the updated version. Obviously, it only works well with
the long sequences of identcal bytes.

Hm, I compiled and tried it, but it didn't work, let's see...

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void rle_comp(FILE *src, FILE *dest)
{
char prev = 0;
char cur = 0;
int count; /* number of identical bytes */

You probably want a shorter type for count, as on most implementations
an int occupies more than one byte, inflating your output.

for(count = 0; fread(&cur, sizeof(char), 1, src) == 1;

sizeof (char) is guaranteed to equal one. BTW, why not use getc()?

prev = cur, count++)

^^^^^^^^^^
Stylistic point: I'd put prev = cur; at the end of the loop.

{
if(prev != cur)
{
if(prev != 0)

This prevents you from "compressing" null bytes!

{
fwrite(&count, sizeof(int), 2, dest);

Err, you write 2 integers to the output stream. That's certainly not
what you want here. Additionally, if the type of count changes in the
future you have to change the second parameter. Better:

fwrite( &count, sizeof count, 1, dest );

fwrite(&prev, sizeof(char), 1, dest);

sizeof (char) is guaranteed to equal one. BTW, why not use putc()?

}
count = 0;
}
}

/* dump the unprocessed data */
if(count)
{
fwrite(&count, sizeof(int), 2, dest); See above.
fwrite(&prev, sizeof(char), 1, dest); See above.
}

If you change the loop construct you don't have to handle the
unprocessed data here.

}

void rle_uncomp(FILE *src, FILE *dest)
{
int count = 0;
char cur = 0;

while(fread(&count, sizeof(int), 2, src) == 2 && See above.
fread(&cur, sizeof(char), 1, src) == 1) See above.
{
while(count-- > 0)
fwrite(&cur, sizeof(char), 1, dest); See above.
}
}

int main(int argc, char *argv[])
{
FILE *src, *dest;

if(argc == 4)
{
src = fopen(argv[2], "rb");
dest = fopen(argv[3], "wb");

if(src == NULL)
{
fprintf(stderr, "error: %s can't be openned\n", argv[2]);
exit(EXIT_FAILURE);
}

if(dest == NULL)
{
fprintf(stderr, "error: %s can't be openned\n", argv[3]);
exit(EXIT_FAILURE);
}

else if((strcmp(argv[1], "--comp")) == 0)

^^^^
This 'else' is obsolete for obvious reasons.

rle_comp(src, dest);

else if((strcmp(argv[1], "--uncomp")) == 0)
rle_uncomp(src, dest);

else
fprintf(stderr, "error: unknown option %s\n", argv[1]);

fclose(src);
fclose(dest);
}

else
printf("usage: hexdump OPTION <src-file> <dest-file>\n"
"\nwhere OPTION is mandatory:\n"
" --comp perform rle compression on src-file\n"
" --uncomp decompress src-file into dest-file\n");

return 0;
}

Just for fun I partially rewrote your program. This had the benefit
that I could test a working version which provides a bit more error
checking, and (not surprisingly) found that it is in fact a run-length
inflation tool for (most kinds of) files ... but you already knew
that, didn't you? ;-)

Note that I left your main function almost unchanged, Arthur already
told you about ways to improve it.

If it contains major flaws somebody will point them out, and I admit
in advance that it's not a stylistical masterpiece. Here we go:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>

int rle_comp( FILE *src, FILE *dest )
{
int prev = 0;
int cur = 0;
int run = 1;
unsigned char count;

for ( count = 0; run; ++count )
{
if ( ( cur = getc( src ) ) == EOF )
run = 0;
if ( ( ( cur != prev || !run ) && count )
|| ( count == UCHAR_MAX ) )
{
if ( fwrite( &count, sizeof count, 1, dest ) != 1
|| putc( prev, dest ) == EOF )
run = 0;
count = 0;
}
prev = cur;
}
return ferror( src ) || ferror( dest );
}

int rle_uncomp( FILE *src, FILE *dest )
{
int cur = 0;
unsigned char count = 0;

while( fread( &count, sizeof count, 1, src ) == 1
&& ( cur = getc( src ) ) != EOF )
while( count-- != 0 )
putc( cur, dest );

return ferror( src ) || ferror( dest );
}

int main( int argc, char *argv[] )
{
FILE *src, *dest;

if( argc == 4 )
{
src = fopen( argv[2], "rb" );
if( src == NULL )
{
fprintf(stderr, "error: %s can't be opened\n", argv[2]);
exit(EXIT_FAILURE);
}

dest = fopen( argv[3], "wb" );
if( dest == NULL )
{
fclose( src );
fprintf(stderr, "error: %s can't be opened\n", argv[3]);
exit(EXIT_FAILURE);
}

if( strcmp( argv[1], "--comp") == 0 )
{
if ( rle_comp( src, dest ) != 0 )
fprintf( stderr,
"error: I/O failure during compression\n" );
}
else if ( strcmp( argv[1], "--uncomp" ) == 0 )
{
if ( rle_uncomp( src, dest ) != 0 )
fprintf( stderr,
"error: I/O failure during decompression\n" );
}
else
fprintf(stderr, "error: unknown option %s\n", argv[1]);

fclose( src );
fclose( dest );
}
else
printf( "usage: hexdump OPTION <src-file> <dest-file>\n"
"\nwhere OPTION is mandatory:\n"
" --comp perform rle compression on src-file\n"
" --uncomp decompress src-file into dest-file\n");

return 0;
}

HTH
Regards

Irrwahn Grausewitz · Nov 9, 2003

[About improving an simple RLE algorithm]

[straying into comp.programming territory, really, but what the heck]

There are a number of approaches. My favorite is a two-pronged
attack:

- First, for handling single-character values ("1 occurrence of
a, 1 of b, 1 of c"), avoid the prefix count entirely. This
requires picking out some sort of "unlikely" character and
using that to indicate "n occurrences of some character will
follow". Just for example, let me use "z".

Now if the input is:

aaaabcdefg

the output will read:

z4abcdefg

This still leaves a problem. I picked "z" above as an "unlikely"
character, but what if we need to output a literal "z"? The
straightforward answer is to encode *any* string of "z"s, including
just one, so that one "z" comes out as "z1z".

<snip>

Come to think about it: how about using an "illegal" (zero) count
prefix for marking uncompressed sequences, instead of using an
"unlikely" character to prefix compressed sequences? So that, for
example,

aaaabcdefgzzzzzzz

will become:

4a06bcdefg7z

with following meaning:

4a : expand to four occurrences of 'a'
06 : an uncompressed sequence of length six follows
bcdefg : the uncompressed sequence itself
7z : expand to seven occurrences of 'z'

Regards

Jonathan2s6 · Nov 12, 2003

Get your Server or Homepage now

Choose which one you prefer.
Redhat, Debian, FreeBSD, Mandrake, Windows, SUSE

http://www.comserver.net

Request for source code review of simple Ising model	88	Apr 10, 2014
the code of two-way linked list failed to be compiled.	15	Sep 21, 2008
Please help with C programming to save GPS reception data in Raspberry Pi.	0	Dec 8, 2022
Command Line Arguments	0	Mar 7, 2023
How can I view / open / render / display a pdf file with c code?	0	Sep 23, 2023
visual studio compilling issues with solaris c code	1	Oct 14, 2009
[C language] Issue in the Lotka-Volterra model.	0	Jun 28, 2023
epoll networking code review	4	Aug 30, 2012

RLE code review

nullptr

Richard Heathfield

Arthur J. O'Dwyer

nullptr

nullptr

Simon Biber

Sheldon Simms

nullptr

Richard Heathfield

Simon Biber

nullptr

Chris Torek

Irrwahn Grausewitz

Irrwahn Grausewitz

Jonathan2s6

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads