Algorithm that makes maximum compression of completely diffused data.


Chris Angelico

There is a way to apparently get around these limits: store data
externally, perhaps inside the compression application itself. Then, if
you just look at the compressed file (the "data.zip" equivalent, although
I stress that zip compression is *not* like this), you might think it has
shrunk quite a lot. But when you include the hidden data, the compression
is not quite so impressive...

Storing externally is, of course, a very useful thing - it's just not
compression. For instance, a git repository (and quite likely a
Mercurial one too) keeps track of files as blobs of data referenced by
their cryptographic hashes. The current state of a directory can be
described by listing file names with their object hashes, which is a
very compact notation; but it doesn't have _all_ the information, and
the "decompression" process involves fetching file contents from the
object store. It's a tool in the arsenal, but it's not compression in the
same way that PK-ZIP is.
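
As a concrete aside (an editor's sketch, not code from the thread): git's object id for a file is just a SHA-1 over a short header plus the contents, so the id is a compact handle, but "decompressing" it back to the file requires the object store to already hold those contents:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # git names a file's contents by hashing "blob <size>\0<content>"
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Prints the same id as `echo 'hello world' | git hash-object --stdin`
print(git_blob_hash(b"hello world\n"))
```

Forty hex digits describe any file of any size, which looks like miraculous compression right up until you need the blob back.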

With real compression algorithms, there's usually an "out" clause that
detects that the algorithm's doing a bad job. The file format for
PK-ZIP and, I expect, every other archiving+compression file format,
has allowances for some of the files to be stored uncompressed. That
way, there's no need for any file to be enlarged by more than some
signature - say, a one-byte "compression type" marker, or even a
one-bit mark somewhere. But enlarged they must be, for the reasons
Steven explained.
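
As a sketch of that escape hatch (an editor's illustration using Python's zlib, not PK-ZIP's actual format): a one-byte marker selects between stored and compressed payloads, so no input grows by more than that single byte:

```python
import os
import zlib

def pack(data: bytes) -> bytes:
    # Toy container: one marker byte, then the payload.
    # 0x01 = zlib-compressed, 0x00 = stored verbatim.
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return b"\x01" + compressed
    return b"\x00" + data        # the "out" clause

def unpack(blob: bytes) -> bytes:
    payload = blob[1:]
    return zlib.decompress(payload) if blob[:1] == b"\x01" else payload

redundant = b"spam " * 200       # compresses well
diffused = os.urandom(1024)      # random data: zlib output is larger
assert unpack(pack(redundant)) == redundant
assert unpack(pack(diffused)) == diffused
assert len(pack(redundant)) < len(redundant)
assert len(pack(diffused)) == len(diffused) + 1   # worst case: +1 byte
```

The random input always takes the stored path, because zlib's output for incompressible data carries its own framing overhead and comes out longer than the input.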

ChrisA
 

Ethan Furman

Congratulations Jonas. My kill file for this list used to have only one
name, but now has 2.

You have more patience than I! Jonas just made mine seven. :)
 

Mark Janssen

Congratulations Jonas. My kill file for this list used to have only one
name, but now has 2.

You have more patience than I! Jonas just made mine seven. :)

Gosh, don't kill the guy. It's not an obvious thing to anyone
but computer scientists. It's an easy mistake to make.
 

Gene Heskett

You have more patience than I! Jonas just made mine seven. :)

Yeah, well there are a couple of others in the mugwump category here yet. I
lurk here to try and learn, and baseless arguments are just noise. To be
filtered. And it's working!

But it may be that this old dog has learned his last "new" trick in the
language arena too; too many "irons in the fire", and fine-tuning machinery
to run the GCode I write to carve metal or wood is the primary interest
ATM. At 79yo, the short-term memory needs help. I'm smart enough to
understand that, but it doesn't mean I like it. It's a right PITA, TBE.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

All the evidence concerning the universe has not yet been collected,
so there's still hope.
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.
 

Michael Torrie

Gosh, don't kill the guy. It's not an obvious thing to anyone
but computer scientists. It's an easy mistake to make.

I don't think he's being plonked for not understanding computational
theory. He's being plonked for resorting to name calling on his second
post! If he was a smart computer scientist type, then engaging in a
discussion about the theoretical aspects of his algorithm would have
been welcomed by him, because that's what science is all about. But he
failed that early on.

Thanks to everyone in this part of the thread for turning this
ridiculous farce into a really educational discussion on the theory of
information compression. Too bad the OP tuned out long ago.
 

Joshua Landau

Well, then, why don't you share it?

Let me try to get you to understand WHY what you say is impossible.
[snip reasons]

But your second reason, better known as the pigeonhole principle,
demonstrates that for any lossless compression method, there must be data
sets that actually expand the data. It doesn't matter how cleverly you
compress the data, you can't fit 20kg of potatoes in a 10kg bag, so to
speak. Suppose your compression algorithm compresses a single byte into a
nybble (four bits). There are 256 different input data sets (0x00,
0x01, ... 0xFF) and only 16 different outputs (0x0, 0x1, ... 0xF). There
is no way for 256 pigeons to fit in 16 pigeon holes unless you put two or
more pigeons in at least one hole. Ergo, if the compression algorithm is
lossless, *some* data must be expanded rather than compressed.

You have to be careful to make this totally rigorous, too.

Note that I *can* make a "compression" algorithm that takes any
length-n sequence and compresses all but one sequence by at least one
bit, and does not ever expand the data.

"00" -> ""
"01" -> "0"
"10" -> "1"
"11" -> "00"

This, obviously, is just 'cause the length is an extra piece of data,
but sometimes you have to store that anyway ;). So if I have a list of
N length-Y lists containing only 1s or 0s, I can genuinely compress
the whole structure by N log2 Y items.
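
Joshua's two-bit table generalizes to any fixed length n. As a sketch (an editor's illustration in Python, not code from the thread): rank all binary strings in the order "", "0", "1", "00", "01", ..., and map each length-n string to the string whose rank equals its value. Nothing ever expands, but decompression needs n supplied out of band:

```python
def unrank(k: int) -> str:
    # k-th binary string in the enumeration "", "0", "1", "00", "01", ...
    n = (k + 1).bit_length() - 1          # length of the k-th string
    return format(k - ((1 << n) - 1), "0{}b".format(n)) if n else ""

def rank(s: str) -> int:
    # Inverse of unrank.
    return (1 << len(s)) - 1 + (int(s, 2) if s else 0)

def compress(s: str) -> str:
    # For n=2 this gives "00"->"", "01"->"0", "10"->"1", "11"->"00"
    return unrank(int(s, 2) if s else 0)

def decompress(out: str, n: int) -> str:
    # n (the original length) must be known from somewhere else
    return format(rank(out), "0{}b".format(n))

# Exhaustive check: lossless, and no output is longer than its input
for n in (2, 8):
    for v in range(1 << n):
        s = format(v, "0{}b".format(n))
        assert len(compress(s)) <= n
        assert decompress(compress(s), n) == s
```

The saving is real only when the length would have been stored anyway, which is exactly the objection raised in the replies.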
 

Mark Janssen

Note that I *can* make a "compression" algorithm that takes any
length-n sequence and compresses all but one sequence by at least one
bit, and does not ever expand the data.

"00" -> ""
"01" -> "0"
"10" -> "1"
"11" -> "00"

This, obviously, is just 'cause the length is an extra piece of data,
but sometimes you have to store that anyway ;).

And how many bits will you use to store the length?

So if I have a list of
N length-Y lists containing only 1s or 0s, I can genuinely compress
the whole structure by N log2 Y items.

But you cheated by using a piece of information from "outside the
system": length. A generic compression algorithm doesn't have this
information beforehand.
 

Tim Chase

But you cheated by using a piece of information from "outside the
system": length. A generic compression algorithm doesn't have this
information beforehand.

By cheating with outside information, you can perfectly compress any
one data-set down to 1 bit. Just store the data in the program, then
store 1 bit of "is this file the data we have stored in the
program?". Granted, in modern OSes, you waste 7 extra bits since
they require you to write an entire byte at a time. :)

And with that, you could even have an empty file and test for a file
extension. ;-)

-tkc
 

jonas.thornvall

On Saturday, 2 November 2013 at 22:31:09 UTC+1, Tim Roberts wrote:

Well, then, why don't you share it?

Let me try to get you to understand WHY what you say is impossible. Let's
say you do have a function f(x) that can produce a compressed output y for
any given x, such that y is always smaller than x. If that were true, then
I could call f() recursively:

f(f(...f(f(f(f(f(x)))))...))

and eventually the result gets down to a single bit. I hope it is clear
that there's no way to restore a single bit back into different source
texts.

Here's another way to look at it. If f(x) is smaller than x for every x,
that means there MUST be multiple values of x that produce the same f(x).
Do you see? If x is three bits and f(x) is two bits, that means there are
8 possible values for x but only 4 values for f(x). So, given an f(x), you
cannot tell which value of x it came from. You have lost information.

--
Tim Roberts, (e-mail address removed)
Providenza & Boekelheide, Inc.

Well, let me try to explain why it is working, and I have implemented one.
I only need to refresh my memory; it was almost 15 years ago.
This is not the solution, but this is why it is working.

65536 = 256^2 = 16^4 = ***4^8*** = 2^16

Yes, I am aware that 256 is a single byte, 8 bits, but the approach is valid anyway.
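
For what it's worth, the chain of identities is easy to verify, and equally easy to see why it cannot help: rewriting a number in a larger base trades more digits for fewer, at exactly the same total bit count. A quick check (an editor's addition, not from the thread):

```python
import math

# All four expressions name the same number:
assert 65536 == 256**2 == 16**4 == 4**8 == 2**16

# But d digits in base b carry d * log2(b) bits -- constant here:
for base, digits in [(2, 16), (4, 8), (16, 4), (256, 2)]:
    assert digits * math.log2(base) == 16.0

print("16 bits of information either way")
```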
 

jonas.thornvall

On Monday, 4 November 2013 at 14:53:28 UTC+1, (e-mail address removed) wrote:

On Saturday, 2 November 2013 at 22:31:09 UTC+1, Tim Roberts wrote:
[snip]

Well, let me try to explain why it is working, and I have implemented one.
I only need to refresh my memory; it was almost 15 years ago.
This is not the solution, but this is why it is working.

65536 = 256^2 = 16^4 = ***4^8*** = 2^16

Yes, I am aware that 256 is a single byte, 8 bits, but the approach is valid anyway.

You see, if the program behaves intelligently, some small operations can perform
wonders on a number.
 

Dave Angel

Re: Algorithm that makes maximum compression of completely diffused data.

On Saturday, 2 November 2013 at 22:31:09 UTC+1, Tim Roberts wrote:
[snip Tim Roberts' pigeonhole argument]

Well, let me try to explain why it is working, and I have implemented one.
I only need to refresh my memory; it was almost 15 years ago.
This is not the solution, but this is why it is working.

65536 = 256^2 = 16^4 = ***4^8*** = 2^16

Yes, I am aware that 256 is a single byte, 8 bits, but the approach is valid anyway.

And e ^ (I * pi) == -1
So what?

Better file that patent, before the patent office realizes the
analogy to the perpetual motion machine.
 

rusi

And e ^ (I * pi) == -1
So what?
Better file that patent, before the patent office realizes the
analogy to the perpetual motion machine.

Now I am too much of a dim-wit to comprehend all this compressified
profundification but I think I have a (dim) clue how such a patent
will be obtained:

Just make a submission in such an unreadable double (then triple,
quadruple...) spaced format that the patent office will grant it in
exasperation.
 

jonas.thornvall

On Monday, 4 November 2013 at 15:27:19 UTC+1, Dave Angel wrote:
[snip]

And e ^ (I * pi) == -1
So what?

e is an approximation... and your idea is not general for any n.
 

Dave Angel

Re: Algorithm that makes maximum compression of completely diffused data.


e is an approximation... and your idea is not general for any n.

e is certainly not an approximation, and I never mentioned n.
 

Steven D'Aprano

On Monday, 4 November 2013 at 15:27:19 UTC+1, Dave Angel wrote:
[...]
This is not the solution, but this is why it is working.
65536 = 256^2 = 16^4 = ***4^8*** = 2^16

"this" being Jonas' alleged lossless compression method capable of
compressing random data.


I must say, I cannot see the connection between the fact that 256**2 ==
2**16 and compression of random data. I might as well state that I have
squared the circle, and offer as proof that 3+4 == 5+2.


e is an approximation... and your idea is not general for any n.

e is not an approximation, it is a symbolic name for an exact
transcendental number which cannot be specified exactly by any finite
number of decimal places.

Your comment about "n" is irrelevant, since Euler's Identity e**(i*pi)=-1
has nothing to do with "n". But in case you actually meant "i", again you
are mistaken. i is a symbolic name for an exact number.
 

Steven D'Aprano

On Monday, 4 November 2013 at 15:27:19 UTC+1, Dave Angel wrote:
[...]
This is not the solution, but this is why it is working.

65536 = 256^2 = 16^4 = ***4^8*** = 2^16

"this" being Jonas' alleged lossless compression method capable of
compressing random data.

/facepalm

I meant "it", not "this", as in "why it (the compression method) is
working". Sorry for any confusion.
 

Tim Roberts

Well, let me try to explain why it is working, and I have implemented one.
I only need to refresh my memory; it was almost 15 years ago.
This is not the solution, but this is why it is working.
65536 = 256^2 = 16^4 = ***4^8*** = 2^16

All of those values are indeed the same, and yet that is completely
unrelated to compression. Did you honestly believe this was actually
explaining anything?

Yes, I am aware that 256 is a single byte, 8 bits, but the approach is valid anyway.

This whole post is a most amazing collection of non sequiturs. If you
would like to describe your compression scheme, there really are people
here who would be interested in reading it (although that number gets less
and less as this thread goes on).
 

Mark Janssen

Well, let me try to explain why it is working, and I have implemented one.

All of those values are indeed the same, and yet that is completely
unrelated to compression. Did you honestly believe this was actually
explaining anything?

I think the idea is that you could take any arbitrary input sequence,
view it as a large number, and then find what exponential equation can
produce that result. The equation becomes the "compression".

MarkJ
Tacoma, Washington
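
A quick counting sketch (an editor's addition) shows why that idea runs into the same pigeonhole wall: the "codebook" of short exponential expressions is tiny compared with the space of possible inputs, so almost no input has such a description:

```python
import os

# Every value expressible as a**b with small a and b:
codebook = {a**b for a in range(2, 65) for b in range(2, 65)}
print(len(codebook))   # only a few thousand distinct values

# A random 128-bit input is essentially never one of them:
n = int.from_bytes(os.urandom(16), "big")
print(n in codebook)
```

Allowing longer expressions (sums, offsets, nested powers) does not escape the counting argument either: an expression short enough to be a "compression" is just another short bit string, and there are never enough of those to go around.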
 
