Chad Perrin
Anyhow, interesting stuff, but not about Ruby... continue offline
if you wish...
Bummer. I was having fun spectating.
I don't think either existed in 1983. But I expect that
the answer is "no". In any case, neither would suit, as the
compression existed to serve the purpose of full-text search
that had a separate compressed index for every word, and to
use the index, it was necessary to be able to open the file
efficiently at any word. Nowadays I'd open the file at any
disk block boundary, or bigger, since the CPU cost is
negligible next to the I/O cost. You'd probably use some
form of arithmetic coding these days anyhow.
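(For the curious, a rough Ruby sketch of the block-boundary idea - using
plain zlib rather than arithmetic coding, and with a made-up block size
and content - deflate each block independently and keep an offset table
so any block can be inflated on its own:)

  require 'zlib'

  BLOCK_SIZE = 4096   # pretend disk block; purely illustrative

  # Deflate each block independently and remember where it starts,
  # so a reader can jump to any block without touching the rest.
  def compress_blocks(text)
    offsets = []
    data    = "".b
    text.scan(/.{1,#{BLOCK_SIZE}}/m) do |block|
      offsets << data.bytesize
      data << Zlib::Deflate.deflate(block)
    end
    [offsets, data]
  end

  def read_block(offsets, data, i)
    from = offsets[i]
    to   = offsets[i + 1] || data.bytesize
    Zlib::Inflate.inflate(data.byteslice(from, to - from))
  end

  offsets, data = compress_blocks("some indexed text " * 5_000)
  puts read_block(offsets, data, 3)[0, 40]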
OK, I'm sorry, I thought you meant literal repetition. I didn't
mean to latch onto that unfairly; I just needed to clarify that
Shannon's theorem doesn't imply such repetition.
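(A contrived Ruby illustration of the distinction, if it helps: a heavily
biased coin is predictable in Shannon's sense without containing any fixed
repeating pattern, yet it still compresses far better than a fair one:)

  require 'zlib'

  fair   = Array.new(100_000) { rand(2) }.join                 # ~1 bit/symbol
  biased = Array.new(100_000) { rand < 0.9 ? "0" : "1" }.join  # ~0.47 bits/symbol

  puts "fair coin:   #{Zlib::Deflate.deflate(fair).bytesize} bytes"
  puts "biased coin: #{Zlib::Deflate.deflate(biased).bytesize} bytes"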
Hmm. I think perhaps we still disagree then... at least about the
definition of predictability...
Information is relative. Our language is the prime example; if my
words don't evoke symbols in your thoughts that map to a similar
symbolic morphology in you, then we can't communicate. So English
isn't just a set of words, it's a symbolic net whose parts connect
our shared experiences.
In other words, you, as the "decompressor" of my sequence of words,
must have related context, experience, that forms a morphologically
similar mesh, the nodes of which are associated with words that we
share.
Taking that as my context, I maintain that your "general" data
stream only has theoretical existence. All meaningful data streams
have structure. There are standard forms of mathematical pattern
search like finding repetition, using Fourier analysis, even
fractal analysis, but these cannot hope to find all possible
patterns - they can only find the kinds of structure they look
for. At some point, those structure types that are coded into the
de/compressors are encodings of human interpretations, not intrinsic
realities. The encoding of the human interpretation is "fair", even
the extreme one in your URL example.
I agree, though I hadn't thought of the physics linkup. I think
that the very structure and nature of knowledge itself is hidden
here somewhere. So's most of our non-conscious learning too, like
learning to walk... though the compression of patterns of movement
in our cerebellum requires an approach to using the *time* domain
in a way I haven't seen a machine implement.
Since a transcendental number is just an infinite stream of digits,
there exist an infinite number of them that can encode any given
infinite stream of information. I don't know how deep that insight
really is though.
Good one! I'm sure there's some deep theoretical studies behind
that.
Now that you've explained yourself more clearly, I don't
think you're off-base at all.
I don't know. I wish I knew where I read that the human brain can
"learn" about the contents of 2 CDs - 1200MB, so even Einstein's
knowledge could be encoded into that, if we knew how. The sum of
all human knowledge would doubtless be bigger than that, but I
found it an interesting thought. Some recent neurological research
I read indicated that, contrary to previous thought, our brains
actually do have a limited capacity. The previous assumption of an
almost unlimited capacity was based on reasoning about the likely failure
modes when "full" - and that reasoning rested on incorrect theories
about how memory works. I could perhaps dig up those papers
somewhere.
I don't agree at all. There surely is a limit to the achievable
compression, but we're nowhere near it, and the search will tell
us much about the nature of knowledge.
Anyhow, interesting stuff, but not about Ruby... continue offline
if you wish...
Ah, I see: you were surprised at deflate being so bad... I was
surprised at your surprise.
I'm sorry. Would you care to put that in layman's terms? I did not
have the good fortune of an academic education in the field (oh, if
only!), so I cannot readily address your statement (assuming of course
you are using academic terms in good faith, and not just being
intentionally obscure).
Sorry, I assumed you were familiar with Kolmogorov complexity and tried to
remove some redundancy [1] from my message based on that premise ;-) (you said
you had been thinking deeply about this, so it wasn't too unreasonable
[although this is all redundant ultimately and could be replaced by a pointer
into e.g. Cover & Thomas' Elements of Information Theory; BTW IIRC there's a
new edition in the works?])
What you hinted at in your prev. message is known as the
Kolmogorov(-Chaitin)/descriptive complexity, corresponding to the length of
the minimal description of a string using a program in a given Turing
machine/programming language. It is therefore relative to the universal
machine/language you choose, but if you pick two different ones the
corresponding complexities for a given string can differ at most by a constant
amount, namely the length of the program that emulates one machine on the
other.
In theory, you often ignore such constant factors (just choose a string long
enough), but in the example you gave, the machine included a full copy of
the KJV bible, allowing it to replicate it with a single instruction... and as
you said the description of the machine itself made all the difference and it
only worked for that particular string, making it of little interest. There's
a sort of trade-off between the complexity of the machine and that of the
programs you feed into it and both parts matter in practice: a machine able to
compress any English text to 1/1000th of its original size wouldn't help
residential users if it included/required a (mildly compressed?) copy of
Google's archive.
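To make the trade-off concrete, here is a small Ruby sketch using zlib's
preset-dictionary feature (the short verse below merely stands in for the
built-in KJV copy): the message gets tiny, but only because the
decompressor must already carry the dictionary.

  require 'zlib'

  # Shared knowledge baked into the "machine" (stand-in for the KJV copy).
  dictionary = "In the beginning God created the heaven and the earth."
  message    = "In the beginning God created the heaven and the earth."

  plain = Zlib::Deflate.deflate(message)

  d = Zlib::Deflate.new
  d.set_dictionary(dictionary)        # must be set before any data goes in
  with_dict = d.deflate(message, Zlib::FINISH)
  d.close

  puts "without dictionary: #{plain.bytesize} bytes"
  puts "with dictionary:    #{with_dict.bytesize} bytes"
  puts "carried by the machine: #{dictionary.bytesize} bytes"

(The inflating side would need the same dictionary, via
Zlib::Inflate#set_dictionary, which is exactly the point.)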
I probably wasn't 100% intellectually honest in my prev. message, I apologize
for that. I wasn't as much giving new information as trying to drag this
thread into a more information-theoretic OT, hoping some terminology would
trigger interesting replies.
[1] I'm sensitive to redundancy (as an EE, I've also learned a bit about
it...). This is why I rarely post to ruby-talk anymore; there's very little
surprise in the questions/issues being discussed, and the responses to
new-but-actually-old questions I could give would hardly convey new
information. If everybody killed his messages when they don't pass the
"entropy criterion" the way I do, this ML would have a small fraction of its
current traffic; I cannot ask people to only post things of interest to me
(why would anybody consent to that, and how would they know anyway), but I can
try to only post messages that would have interested me. For instance, I was
going to reply "inverse BWT!" to the OP when nobody had responded yet, but I
anticipated somebody eventually would, so I killed my msg. I don't respond
to questions which are likely to be answered by somebody else (you can call it
proactive global redundancy control by local self-censoring).
A nice consequence of this is that it also filters out most unreasonable
content (would I be interested in some ad hominem argumentation? No, so I
should not post any either. The temptation to do so is strong, as some people,
including well-known/knowledgeable posters, use emotionally loaded language
and sometimes defend their positions vehemently.)
sorry again
Trans said: Sorry about the delayed response. I just have too much on my mind...
Truly surprised. Makes me wonder why other formats like 7z aren't more
widely used. Is decompression speed really that much more important
than size?
Phillip said: Taking into account today's cheap processing time and even
cheaper mass storage, I'd say, all else being equal, size isn't that
important to the end user.
In other environments (transferring large files across a thin pipe),
size is more important than speed.
Chad said: 2. Certain compression programs are very well known, and "everyone" has
them (for some definition of "everyone", depending on OS platform, et
cetera). Thus, "everyone" uses them. Short of producing a hugely
popular program that handles both old and new compression algorithms (or
both old and new file formats, in other examples of this phenomenon in
action), adoption of something new is going to be very slow and prone to
failure despite any technical advantages to the new algorithm/format.
This is illustrated by the commercial end-user market failure of
Betamax -- VHS won that little skirmish simply because it was more
widely available and quickly became a household word, and market
inertia then prevented migration to Betamax.
Or just witness the huge popularity out there for the RAR format - a
format that is popular only because it once got popular - and the only
reason for that was that it made it easy to split archives into many
small parts. This was important (especially for the pirate market, of
course) in the days when a huge download could be corrupted. RAR
allowed people to redownload only the one small part when that happened.
Today, with torrents and hashes in all major protocols, this is
essentially a moot issue, and its slightly better compression than
some other formats is also negligible, in the sense that no user would
actually notice any difference except the byte count in the file manager...
Today it's just one big bad hassle, because people use it a LOT and
this means installing an extra program on every computer you use if
you want to participate - a non-free program at that, for those of us
who care.
Using RAR amounts to Cargo Cult Compression in my opinion, as it's
only invoking the old ritual of times past because "we have always
done it this way", with no understanding of why and no actual benefits.
Not that it's really the best solution for many reasons, but if you
want to be really practical, people should use zip for all their
compression needs - because that's the only format all modern OSes
can open with no extra software, at least that I know of.
In general, I agree with your assessment of the situation, but I'm not
entirely sure about one thing you said.
Is zip format really available with all OSes without extra software?
I'm pretty sure MS Windows requires one to install additional software
still, and I don't seem to recall zip being standard on all major Linux
distributions. I haven't bothered to check whether it's available on
FreeBSD, mostly just because I haven't needed an unzipper in quite a
while.
--
CCD CopyWrite Chad Perrin [http://ccd.apotheon.org]
Leon Festinger: "A man with a conviction is a hard man to change. Tell
him you disagree and he turns away. Show him facts and figures and he
questions your sources. Appeal to logic and he fails to see your point."
It works on XP, or at least it did a few years ago when I used it the
last time, and as far as I've seen it works OOTB on any Linux/BSD/*nix
distro that ships GNOME or KDE. To be honest, I don't know about OS X,
but I just assumed they handle this too. It also depends on what a
"modern OS" is. I'd personally draw the line at Windows, Mac, and the 3-6
biggest Linuxes, but I know a lot of people would include a lot more,
just as a lot of people would only include Windows.
I do think that I'm mostly right, though.
On 3/23/07 said: When did I say otherwise? I believe my original point was simply one
of surprise that BWT improved deflate, which I believe says more about
deflate's deficiencies than BWT's use as a means of compression. I've
only been trying to explain my reasoning for that thought ever since.
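For anyone following along, here is a naive Ruby sketch of the transform
(quadratic, sentinel-based; real compressors use suffix arrays): BWT
clusters characters with similar contexts, so a later run-length/entropy
stage - or even plain deflate - sees much longer matches.

  # Naive Burrows-Wheeler transform: sort all rotations, keep last column.
  def bwt(s)
    s += "\0"                                  # unique sentinel keeps it invertible
    rows = (0...s.size).map { |i| s[i..] + s[0...i] }
    rows.sort.map { |row| row[-1] }.join
  end

  # Naive inverse: repeatedly prepend the last column and re-sort.
  def inverse_bwt(last_column)
    table = Array.new(last_column.size) { "" }
    last_column.size.times do
      table = last_column.chars.zip(table).map(&:join).sort
    end
    table.find { |row| row.end_with?("\0") }.chomp("\0")
  end

  text = "banana bandana banana"
  p bwt(text)                       # note the clustered letters
  p inverse_bwt(bwt(text)) == text  # => true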
The point is, by defaulting to zip, you may hold back the revolution a
bit, but it also ensures that your mom, your boss, your customers, and
just about everyone except Perrin ;-) can open it. This is convenient
and pragmatic, but it's also important: there are a lot of busy people
out there who will not even bother to reply with a complaint if they
can't open the files directly; they will just junk the message and
move on. A *lot* of people out there aren't interested in installing
anything, no matter how good it is. A lot of these people may pay your
salary or sponsor your next project.
I have about 40 students I work with, computer literate and aged
18-30, and I am writing this because this has proven to be a real
problem - at least before I got them all to only send stuff in formats
as standard as possible, which includes not sending MS files unless
requested. Students have, in a very real sense, lost internship
opportunities, and maybe even job offers (who knows?) because of this,
and only long afterwards did it turn out that it was simply because it
was a RAR archive or, funnily, an Open Document, or something like
that.
Reminds me a lot of my idea of just "indexing" all data by their
offset in Pi.
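Just as a toy (and not a serious proposal), the lookup side is nearly a
one-liner in Ruby with BigMath; the catch, of course, is that the offset
is typically longer than the data it "encodes":

  require 'bigdecimal/math'

  # Find the first occurrence of a digit string in the first `precision`
  # digits of pi (purely illustrative; offsets grow faster than the data).
  def pi_index(digits, precision = 1_000)
    BigMath.PI(precision).to_s("F").delete(".").index(digits)
  end

  p pi_index("14159")    # => 1
  p pi_index("999999")   # around 762 (the "Feynman point")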