compressing charatcers

David Brown · Apr 3, 2014

OK, fair enough. Here's the result of my attempt to compress all 12
million books in the British Library:

1

Let's see the source code!

But somehow, I doubt that's what the OP was looking for.

True - but it shows how important it is to get your specifications
right. The exact problem called for compressing the single string
FOOFIGHTERS - I think a 0 bit result would be fine:

$ cat compress.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char* argv[]) {
if (argc != 2) {
printf("Usage: ./compress FOOFIGHTERS\n");
return EXIT_FAILURE;
}
if (strcmp(argv[1], "FOOFIGHTERS")) {
printf("Usage: ./compress FOOFIGHTERS\n");
return EXIT_FAILURE;
}
printf("Compressed string = \"\"\n");
return 0;
}

$ cat decompress.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
printf("FOOFIGHTERS\n");
return 0;
}

James Kuyper · Apr 3, 2014

I tried very hard but the tools I have here are not under my control. I use very, very and very old tools and it is extremely difficult to type/indent/copy/delete/yank the text in the tool I use which somehow works like a text editor. I am really sorry, I can't do better than that. I do get your point. On my personal computer I use emacs with indent-tab-mode set to nil. I know how much indentation and spacing is important. unfortunately, my access is limited when I don't have my personal computer.

I've used some very poor editing tools in my life; the poorest was
punched cards; but even with punched cards I was able to indent
consistently without much trouble. I think you're over-estimating the
difficulty, or underestimating the importance, of consistent indentation.

The one thing I could not do easily was complete one entire punched card
without errors. I started deliberately writing programs in ways that
allowed me to re-use cards from previous programs: I used very generic
variable names, and widely spaced statement numbers (I was coding in
Fortran I) so I could easily insert new numbered statements between the
existing ones. If I misspelled a variable name the first time I typed
it, that misspelling would become the new officially correct spelling. I
knew at the time that I was following bad coding practices, and I
stopped using them as soon as I had access to a system which allowed me
to use a proper text editor.

As per standard (n1570), section 7.19, size_t is unsigned int type.

No, that section specifies that size_t is an unsigned integer type.
Unsigned integers include both the standard unsigned integers and the
extended unsigned integers. "unsigned int" is only one of the standard
unsigned integer types. The others are _Bool, unsigned char, unsigned
short, unsigned long, and unsigned long long (6.2.5p6). size_t could be
any of those except _Bool, or it could even be an extended integer type.

2nd, I can not use %zu in C90 mode. I use C90 standard because that
is what most of posters use and advise here and 2nd the tools I have
here are 14 years old at minimum:

In C90, you should cast a size_t value to "unsigned long", and print it
with "%lu".

snpritnf does not solve the issue of overwriting. ...

If you pass the correct buffer length to snprintf(), it will not
overwrite the end of that buffer. If you pass a buffer length of 0, it
will return the actual length needed, so you can dynamically allocate a
buffer that is big enough, and call it again with the correct buffer
length. That's seems like a solution to me. It's not perfect, but it's a
lot safer than printf() when there would otherwise be a danger of
overrunning a buffer.

... 2nd, no snprintf in C90 standard

That, on the other hand, is an entirely legitimate objection.

glen herrmannsfeldt · Apr 3, 2014

(snip)

I've used some very poor editing tools in my life; the poorest was
punched cards; but even with punched cards I was able to indent
consistently without much trouble. I think you're over-estimating the
difficulty, or underestimating the importance, of consistent indentation.

With either 026 or 029, editing isn't so hard. You duplicate up to
where you want to change, then put in new characters. With 029,
if you want to insert, put your thumb on the card on the left,
(the backing is intentionally high friction) new characters will
be inserted, then continue duplicating the rest.

Since free-form Fortran (still) ignores blanks, you don't have
to delete, just put blanks where you don't want characters.
(Well, you have to get them right in Hollerith constants.)

The one thing I could not do easily was complete one entire punched card
without errors. I started deliberately writing programs in ways that
allowed me to re-use cards from previous programs: I used very generic
variable names, and widely spaced statement numbers (I was coding in
Fortran I) so I could easily insert new numbered statements between the
existing ones. If I misspelled a variable name the first time I typed
it, that misspelling would become the new officially correct spelling. I
knew at the time that I was following bad coding practices, and I
stopped using them as soon as I had access to a system which allowed me
to use a proper text editor.

If I had to pay for the cards, I might have tried something like
that, but fortunately I didn't.

Still, I tried not to waste them.

As for indenting, it was usual to use the program drum to allow
a convenient skip to column 7. I didn't use nest level indenting
until I was writing PL/I programs. I then had drum cards for
three or four column indenting.

-- glen

Joe Pfeiffer · Apr 3, 2014

glen herrmannsfeldt said:
As for indenting, it was usual to use the program drum to allow
a convenient skip to column 7. I didn't use nest level indenting
until I was writing PL/I programs. I then had drum cards for
three or four column indenting.

One of the best programming experiences of my undergrad life (in 1976 or
1977) was taking FORTRAN from Walt Dunn at the University of
Washington. He insisted on nest level indenting, and what's more
listings had to be annotated to show control flow in terms of Dijkstra's
canonical programming constructs. It imposed a discipline on my code
that has stood me in good stead for nearly 40 years.

jay · Apr 4, 2014

One way would be to change the terminating condition in the for
statement from "idx<sz" to "str[idx]" which would stop the loop at the
first \0 encountered.

Well, it does not seem to work:

[e046926@mip0stl1 c]$ ./a.out aaaaaaabbbbbbbqrttttt
You entered: [aaaaaaabbbbbbbqrttttt]
Repitition begins at [0] = a, ends = 7
String to add = 7
Array = a7bbbbbbbqrttttt
New Index = 2

Repitition begins at [2] = b, ends = 9
String to add = 7
Array = a7b7qrttttt
New Index = 4

Result = [a7b7qrttttt]
a7b7qrttttttttttttt

is this the only one change required ? I think my program is too difficult to change and it is so because of extremely complicated design. Can anyone guide me to simplify it so that it is no longer dreadful to change ?

Malcolm McLean · Apr 4, 2014

is this the only one change required ? I think my program is too difficult
to change and it is so because of extremely complicated design. Can anyone
guide me to simplify it so that it is no longer dreadful to change ?

The high-level function you want is

char *compress(char *str);

We'll return a pointer to an allocated string with the compressed result.

So we need to write a function that calculates the length of the compressed result, then we call malloc(), then we write another function which compresses
the string to passed in buffer, which we know is large enough.

The system is based on the idea that runs are compressed.
So write a function

int runlength(char *str)

This returns 1 if the first character is not followed by an identical
character. It returns 2 if the first character is followed by one identical
character, and so on. As a convenience, return 0 if passed the empty string.

Now we can write

int getcompreesedsize(char *str)

Keep a travelling pointer to str, and increment by the run length, until
run length returns 0. Then inside you loop, add the logic. You count 1
for a run length of 1, 2 for a run-length of 2-9, and you've got to decide
how you will handle run lengths of > 9.
Don't forget to allocate space for the terminating nul.

Then write the

void compressstring(char *dest, char *str)

it's very similar to the previous function, except you are actually writing
the data instead of just counting it. Keep two travelling pointers, one to
the source, one to the destination. As a check, when debugging, print out
destptr - dest and check that the result is equal to the size calcualated
in the previous function.

BartC · Apr 4, 2014

explanation than taking one from web. Its name is compress but it does not
compress anything.

so, string may or may not shorten in length. This question was in Google
or Microsoft or amazon interview. I got it >from careercup.com. I think
the objective of the interviewer was to check how fast/better a person
thinks when it >comes to solving a problem. It took me around 12 hours to
code this. I guess they must be wanting the coded >solution in 20 minutes
or something. I have never been able to solve these kinds of problems and
my C is still not >that good. I wanted to become better at problem solving
using C and that is the reason I did this.

I found the problem more difficult than it looks.

I applied my usual technique which is to use a scripting language (or any
language you are very familiar with and which is quick to work with) to work
on the algorithm. That took nearly ten minutes. The logic looked a bit
untidy, but it was less than 20 lines and it seemed to work.

Another ten minutes to do more tests, including a routine to generate random
strings, and another routine to expand the compressed strings and see if
they matched the original.

(BTW I managed a compression ratio of 4:1 on the strings I was testing with.
Ie. the compressed strings were 75% shorter than the original.)

The next step was to convert to C, which is now just routine. Another 15 or
20 minutes for the basic compress function and some test code. 20 minutes to
do it all directly in C (and in an interview situation) would have been
tough I think.

As for program structure, my original code had these functions:

compress()
getstring() Only needed as a source of random strings
expand() Only needed for easier testing

The C version just has compress(), which is less than 50 lines.

So I think your version might be just too long. And some things it has such
as checking the result of sprintf() I don't think are worthwhile.

/* Take input string s. Return compressed version in locally allocated
string. */

static char* compress(char* s) {
int runcount;
char* t;
char* t0;
int slen,nlen;
int i,j;
char c;
char numstr[10];

slen = strlen(s);
t = malloc(slen+1);
t0 = t;
if (t==NULL) {
return NULL;
}
runcount = 0;
for (i=0; i<slen; ++i) {
c = s;
if (t==t0) {
*t = c;
++t;
runcount = 1;
} else {
if (*(t-1)==c) {
++runcount;
} else {
if (runcount>1) {
nlen = sprintf(numstr,"%d",runcount);
memcpy(t,numstr,nlen);
t+=nlen;
}
*t = c;
++t;
runcount = 1;
}
}
}
if (runcount>1) {
nlen = sprintf(numstr,"%d",runcount);
memcpy(t,numstr,nlen);
t+=nlen;
}
*t = 0;
return realloc(t0,t-t0+1);
}

BartC · Apr 4, 2014

Ike Naar said:
The C version just has compress(), which is less than 50 lines.

Click to expand...

Okay, I'll bite.
The compress() below is 20 lines:

static char *compress(char const *s)
{
size_t const slen = strlen(s);
char * const t0 = malloc(slen + 1);
if (t0 != NULL)
{
char *t = t0;
size_t i = 0;
while (i != slen)
{
size_t runcount = 1;
while (s == s[i+runcount]) ++runcount;
*t++ = s;
if (runcount > 1) t += sprintf(t, "%zu", runcount);
i += runcount;
}
*t = '\0';
}
return t0;
}

Well, I didn't intend it to be a contest, but that seems pretty good. I
applied your algorithm to my original scripted code and got it down to a
dozen lines or so, and it seems much tidier.

No doubt there will also be a solution in some functional language in half a
line or so, but probably not in a form that anyone else can understand.

(BTW gcc doesn't like your %zu formats. Not with standard options anyway.)

James Kuyper · Apr 4, 2014

On 04/04/2014 12:49 PM, BartC wrote:
....

(BTW gcc doesn't like your %zu formats. Not with standard options anyway.)

-std=c99 is a standard option for me. You can reasonably assume that the
same is true for anyone else who uses "%zu", unless they've gone even
farther, and are relying upon C2011.

Stephen Sprunk · Apr 4, 2014

(BTW gcc doesn't like your %zu formats. Not with standard options anyway.)

That depends on what you consider the "standard options" to be:

% gcc -c foo.c -O3 -W -Wall -pedantic -std=c89
foo.c: In function â€˜compressâ€™:
foo.c:18: warning: ISO C90 does not support the â€˜zâ€™ printf length modifier
% gcc -c foo.c -O3 -W -Wall -pedantic -std=c99
%

It seems pretty clear what the problem is and how to solve it: quit
using a version of the language that was superceded 15 years ago.

S

Martin Shobe · Apr 4, 2014

On 04/04/2014 12:49 PM, BartC wrote:
...

-std=c99 is a standard option for me. You can reasonably assume that the
same is true for anyone else who uses "%zu", unless they've gone even
farther, and are relying upon C2011.

I believe this is a known mingw problem. I also have problems with it
using version 4.8.2 with -std=c11.

Martin Shobe

Keith Thompson · Apr 4, 2014

BartC said:
(BTW gcc doesn't like your %zu formats. Not with standard options anyway.)

gcc warns about incorrect printf format strings (only when a string
literal is used). It will warn about "%zu" if you tell it to enforce
the 1989/1990 version of the C standard *and* use the "-pedantic"
option.

The actual behavior of printf with a "%zu" option is controlled by the
runtime library, which is not part of gcc. It's likely to be a real
problem for something like MinGW, which uses the gcc compiler and
Microsoft's runtime library. (I expect that Microsoft will support
"%zu" in some not-to-distant release, but I don't think they do so yet).

If you're not using a Microsoft implementation, just specify
command-line options that cause gcc to accept it.

As already discussed, there are numerous workarounds.

Werner Wenzel · Apr 4, 2014

Am 04.04.2014 18:49, schrieb BartC:

(BTW gcc doesn't like your %zu formats. Not with standard options anyway.)

If you are on Windows, you might want to have a look at

http://sourceforge.net/apps/trac/mingw-w64/wiki/gnu printf

I use

#define __USE_MINGW_ANSI_STDIO 1

together with

-std=c11

and have not encountered any problems so far.

Werner

Stefan Ram · Apr 4, 2014

Stephen Sprunk said:
% gcc -c foo.c -O3 -W -Wall -pedantic -std=c89
foo.c: In function "compress":
foo.c:18: warning: ISO C90 does not support the "z" printf length modifier
% gcc -c foo.c -O3 -W -Wall -pedantic -std=c99

Some Windows implementations of C (based on a recent gcc)
believe that it is the right thing to implement printf not
as the ISO/IEC specifies, but according to some Windows
system library (possibly for compatibility with Windows
programs written for a Microsoft compiler?). Such implementations
will not known »%zu« even with -std=c11 IIRC.

It seems pretty clear what the problem is and how to solve it: quit
using a version of the language that was superceded 15 years ago.

The spelling »supercede« is still regarded as incorrect,
even though it was already recorded in the 16th century.
»supersedes« was derived from a Latin verb, "supersedere".
The standard spelling is not »supercede«, but »supersede«.

Barry Schwarz · Apr 4, 2014

45:36 -0700 (PDT), jay <[email protected]>

Click to expand...

One way would be to change the terminating condition in the for
statement from "idx<sz" to "str[idx]" which would stop the loop at the
first \0 encountered.

Click to expand...

Well, it does not seem to work:

[e046926@mip0stl1 c]$ ./a.out aaaaaaabbbbbbbqrttttt
You entered: [aaaaaaabbbbbbbqrttttt]
Repitition begins at [0] = a, ends = 7
String to add = 7
Array = a7bbbbbbbqrttttt
New Index = 2

Repitition begins at [2] = b, ends = 9
String to add = 7
Array = a7b7qrttttt
New Index = 4

Result = [a7b7qrttttt]
a7b7qrttttttttttttt

is this the only one change required ? I think my program is too difficult to change and it is so because of extremely
complicated design. Can anyone guide me to simplify it so that it is no longer dreadful to change ?

Show your code.

And please set your newsreader to limit your line lengths to something
less than 80.

Stephen Sprunk · Apr 4, 2014

The spelling Â»supercedeÂ« is still regarded as incorrect,
even though it was already recorded in the 16th century.
Â»supersedesÂ« was derived from a Latin verb, "supersedere".
The standard spelling is not Â»supercedeÂ«, but Â»supersedeÂ«.

My usual dictionary lists "supercede" as a variant spelling of
"supersede"; out of four others I checked, only one noted it as "widely
regarded as an error", which is not the same as saying it _is_ an error,
and also says it is "common in current published writing".

S

RSA implementation issues in public key pem loader function	0	May 21, 2025
Universal BMP Steganography Tool (AES-128-CTR + SP800-90A CSPRNG) Full Encoder/Decoder with 3LSB Payload, PasswordDerived Key & External Key File	4	Mar 26, 2026
Fibonacci	0	May 13, 2023
Database Manager: A C++ Console Application	14	May 12, 2025
Implementing a Q-Learning Algorithm with Logistic Regression Normalization in C++	0	Jun 4, 2025
AES-128 Clipboard Protector: Auto-Encrypt Ctrl+C, Smart-Decrypt Ctrl+V (C++ Windows Hook)	7	Mar 24, 2026
Need Helping adding Square root code to an existing calculator. (Absolute begginer?)	0	Jan 12, 2025
Building a Professional Neural Network Framework: Full C++ Implementation with Windows GUI and Real-Time Training Visualization	11	Jun 14, 2026

compressing charatcers

David Brown

James Kuyper

glen herrmannsfeldt

Joe Pfeiffer

jay

Malcolm McLean

BartC

BartC

James Kuyper

Stephen Sprunk

Martin Shobe

Keith Thompson

Werner Wenzel

Stefan Ram

Barry Schwarz

Stephen Sprunk

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads