A text lossy compression scheme

syntotic

How come I did not think of it before! For large quantities of text, if you lose a few letters it does not matter at all... it is a typo, period. It happens to anyone who writes without proofreading. The Human processor will be able, in MOST CASES, to recover the original word. In a worse case a whole sentence loses its meaning, but without turning into a different meaning (for instance, the opposite one). Languages like English have repeated letters as a matter of orthography, so "letter" is the same as "leter" when it comes to meaning...

So what if we implement a lossy compression scheme where, before using any standard text compression algorithm, we lose some letters here and there? It can be calculated very formally and implemented with a dictionary approach. To begin with, single-syllable words would remain the same, but they can be simplified if there is no way, within a single language, that they can be confused with another word. For instance, all OR can be substituted by R, all NOT by O, all YES by Y and so on. The words to be simplified can be selected after compiling a language dictionary and calculating, for each word, the probability of confusion when random letters are dropped, that is, a measure of word uniqueness. Then we take the droppings that minimize confusion across the whole dictionary while maximizing the number of letters dropped. Some heuristics, like dropping repeated letters a priori, can also be applied as further preprocessing. To simplify the process, the idea can be taken down to the level of syllables: instead of taking the full language, only known syllabic combinations are considered and substituted by a syllabic dropping...
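The dictionary approach above can be sketched roughly as follows. This is only a minimal illustration under assumptions, not a worked-out method: the function names (collapse_repeats, candidates, build_table, encode, decode), the greedy word order and the max_drops limit are all made up for the example, and "confusion" is reduced to the crudest possible test, namely that a shortened form must not equal any dictionary word or any shortened form already handed out.

```python
from itertools import combinations

def collapse_repeats(word):
    """Heuristic from the post: drop doubled letters (letter -> leter)."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def candidates(word, max_drops):
    """Yield shortened forms of word, most letters dropped first."""
    n = len(word)
    for drops in range(min(max_drops, n - 1), 0, -1):
        for idx in combinations(range(n), drops):
            skip = set(idx)
            yield "".join(c for i, c in enumerate(word) if i not in skip)

def build_table(words, max_drops=2):
    """Greedy pass (an assumption): earlier words get first pick of short forms."""
    reserved = set(words)   # a short form may never equal a real word
    table = {}
    for word in words:
        base = collapse_repeats(word)
        choice = word       # fall back to the unchanged word
        for cand in [*candidates(base, max_drops), base]:
            if cand not in reserved:
                choice = cand
                break
        reserved.add(choice)
        table[word] = choice
    return table

def encode(text, table):
    return " ".join(table.get(w, w) for w in text.split())

def decode(text, table):
    # A machine with the table can invert it; a human reader has to guess.
    inverse = {short: word for word, short in table.items()}
    return " ".join(inverse.get(w, w) for w in text.split())

if __name__ == "__main__":
    table = build_table(["or", "not", "yes", "letter", "later"])
    message = "yes or not letter later"
    print(table)
    print(encode(message, table))
```

A real version would rank words by frequency and replace the all-or-nothing collision test with an actual probability of confusion, as described above. Words outside the dictionary pass through unchanged here, so they could still clash with a short form.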

(This is more or less what is achieved with Huffman, but heuristics would add a better compression ratio a priori, since the text would not be the SAME text but a different text altogether to begin with, albeit the SAME text for a native-speaker Human processor...)
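Whether the preprocessing really buys a better ratio is easy to check by experiment. The sketch below is only an assumption about how one might measure it: it applies the repeated-letter heuristic and then compares sizes under zlib (DEFLATE, which combines LZ77 with Huffman coding) from the Python standard library. The names collapse_repeats, compressed_size and compare, and the sample sentence, are invented for the example, and no particular result is claimed.

```python
import re
import zlib

def collapse_repeats(text):
    """Drop repeated letters only (letter -> leter); punctuation is untouched."""
    return re.sub(r"([A-Za-z])\1+", r"\1", text)

def compressed_size(text):
    """Size in bytes after a standard lossless compressor (zlib, level 9)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def compare(text):
    """Return (size without preprocessing, size with lossy preprocessing)."""
    return compressed_size(text), compressed_size(collapse_repeats(text))

if __name__ == "__main__":
    sample = "All the letters in a little well written message still tell the story. "
    plain, lossy = compare(sample * 20)
    print("plain:", plain, "bytes; preprocessed:", lossy, "bytes")
```

On short or highly repetitive inputs the difference may be small or even go the other way, so a fair test needs a large and varied body of text.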

(WHAT THE HELL HAPPENED TO CROSS-POSTING? THOUSANDS OF POSTS WITH ONLY THREE OR FOUR NON UNDERSTANDING REPLIES. IT IS FOR SAFETY. OTHERWISE POSTS DISAPPEAR BECAUSE YOU DO NOT WANT TO ADMIT THE THEFT OF COMPUTERS. IMBECILE MUSICIANS, IT IS ISLAMIC TERRORISM WHAT THEY DO IN RADIO BUT CANNOT UNDERSTAND IT. GIVE BACK THE MUSIC FILES. GOOGLE LOST TRACK AND THEY CANNOT BELIEVE IT...)

Danilo J Bonsignore
 
syntotic

Sorry, again FIGHTING to reach a Wi-Fi outlet before... inspiration down. Well, anyway.

....
The plaintext can be thought of as being at the normal B temperature; the cypher-plaintext in this method is then equivalent to cooling down the B temperature of the text.
....
Substitutions for common words can be optimized to give bit-based compression minimums; e.g., select either n or d or... for AND (but obviously not a).
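One way to pick those single-letter substitutions, sketched under assumptions: count the words, let the most frequent ones choose first, and give each the first of its own letters that is neither taken nor already a one-letter word. The function name assign_letters and the default list of one-letter words ("a", "i") are made up for the illustration.

```python
from collections import Counter

def assign_letters(text, one_letter_words=("a", "i")):
    """Map frequent words to one of their own letters, most frequent first."""
    counts = Counter(text.lower().split())
    used = set(one_letter_words)     # these letters are real words already
    codes = {}
    for word, _ in counts.most_common():
        if len(word) < 2:
            continue                 # nothing to gain on one-letter words
        for ch in word:
            if ch.isalpha() and ch not in used:
                codes[word] = ch
                used.add(ch)
                break                # words with no free letter stay as they are
    return codes

if __name__ == "__main__":
    print(assign_letters("a cat and a dog and a bird and i"))
```

With this rule AND ends up as n, because a is reserved for the article. There are only 26 letters, so only the most common words get one.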
....
Hey! I am thinking of Hebrew! Maybe it is the process they went through...?!* And forgot to decipher the cypher-plaintext!
....
A similar idea can be applied to pictorial text by taking pairs of letters and saving them as the superimposed symbol. Then the width of the picture is that of the line that accepted the least compression. Counterexample: co* cannot be compressed this way, obviously; example: de* would give the equivalent of a strikethrough d and would be easy to decode by sight and by machine. A standard OCR algorithm can be trained to decode the new symbols, an easy problem if the plaintext is a typographic picture rather than manuscript script.
....


Danilo J Bonsignore
 
