A text lossy compression scheme

Discussion in 'C++' started by syntotic@gmail.com, Sep 1, 2012.

  1. Guest

    How come I did not think of it before! For large quantities of text, losing a few letters does not matter at all... it is a typo, period. It happens to anyone who writes without proofreading. The human processor will be able, in MOST CASES, to recover the original word. In the worst case a whole sentence loses its meaning, but without turning into a different sentence (for instance, its inversion). Languages like English have repeated letters as a matter of orthography, so "letter" is the same as "leter" when it comes to meaning...
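
    As a rough illustration of the repeated-letter point, here is a minimal C++ sketch. The function name collapse_repeated_letters is made up for this example, and it assumes plain single-byte (ASCII) text and case-sensitive comparison.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from any library): collapse each run of identical
// adjacent letters into one letter. Non-letter characters such as spaces,
// digits and punctuation are copied unchanged, so "..." stays "...".
// Assumes single-byte text; comparison is case-sensitive.
std::string collapse_repeated_letters(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        const bool is_letter = std::isalpha(static_cast<unsigned char>(c)) != 0;
        if (is_letter && !out.empty() && out.back() == c) {
            continue;  // same letter as the previous one: drop it
        }
        out.push_back(c);
    }
    return out;
}
```

    Calling collapse_repeated_letters("letter") gives "leter". Note that it also turns "too" into "to", which is exactly the kind of confusion between words that would have to be measured before applying the rule blindly.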

    So what if we implement a lossy compression scheme where, before using any standard text compression algorithm, we lose some letters here and there? It can be calculated very formally and implemented with a dictionary approach. To begin with, single-syllable words would remain the same, but they can be simplified if there is no way, within a single language, that they can be confused with another word. For instance, all OR can be substituted by R, all NOT by O, all YES by Y and so on. The words to be simplified can be selected after compiling a language dictionary and calculating, for each word, the probability of confusion when random letters are dropped, that is, a measure of word uniqueness. Then we take the droppings that minimize confusion across the whole dictionary while maximizing the number of letters dropped. Some heuristics, like dropping repeated letters a priori, can also be applied as a further preprocessing step. To simplify the process, the idea can be taken down to the level of syllables: instead of taking the full language, only known syllabic combinations are considered and substituted by a syllabic dropping...
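
    A minimal C++ sketch of the dictionary step, under loud assumptions: the names build_droppings and invert_droppings are made up for this example, the dictionary is a plain word list, and "confusion" is reduced to a yes/no test (a shortened form is rejected if it is itself a dictionary word or was already given to another word). It is a greedy pass that deletes one letter at a time, not the probability-based optimum described above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical greedy sketch: shorten every dictionary word by deleting one
// letter at a time, as long as the result is neither a dictionary word nor a
// short form already assigned. Earlier words get first pick of short forms.
std::map<std::string, std::string>
build_droppings(const std::vector<std::string>& dictionary) {
    const std::set<std::string> words(dictionary.begin(), dictionary.end());
    std::set<std::string> used;                // short forms already handed out
    std::map<std::string, std::string> table;  // word -> short form

    for (const std::string& word : dictionary) {
        if (table.count(word) != 0) continue;  // skip duplicate entries
        std::string current = word;
        bool shortened = true;
        while (shortened && current.size() > 1) {
            shortened = false;
            for (std::size_t i = 0; i < current.size(); ++i) {
                std::string candidate = current;
                candidate.erase(i, 1);         // drop the letter at position i
                if (words.count(candidate) == 0 && used.count(candidate) == 0) {
                    current = candidate;       // unambiguous: keep this dropping
                    shortened = true;
                    break;
                }
            }
        }
        used.insert(current);
        table[word] = current;
    }
    return table;
}

// Reverse table (short form -> word) for the decoding side.
std::map<std::string, std::string>
invert_droppings(const std::map<std::string, std::string>& table) {
    std::map<std::string, std::string> decode;
    for (const auto& entry : table) decode[entry.second] = entry.first;
    return decode;
}
```

    With the word list {"a", "an", "and", "or"} this maps "an" to "n", "and" to "d" and "or" to "r", while "a" stays as it is. Because the pass is greedy, the result depends on the order of the word list; putting the most frequent words first would give them the shortest forms.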

    (This is loosely related to what Huffman coding achieves by giving frequent symbols shorter codes, but the heuristics would add a better compression ratio a priori, since the text being compressed would not be the SAME text but a different, shorter text to begin with, albeit the SAME text to a native-speaker human processor...)

    (WHAT THE HELL HAPPENED TO CROSS-POSTING? THOUSANDS OF POSTS WITH ONLY THREE OR FOUR NON-UNDERSTANDING REPLIES. IT IS FOR SAFETY. OTHERWISE POSTS DISAPPEAR BECAUSE YOU DO NOT WANT TO ADMIT THE THEFT OF COMPUTERS. IMBECILE MUSICIANS, IT IS ISLAMIC TERRORISM WHAT THEY DO IN RADIO BUT CANNOT UNDERSTAND IT. GIVE BACK THE MUSIC FILES. GOOGLE LOST TRACK AND THEY CANNOT BELIEVE IT.)

    Danilo J Bonsignore
    , Sep 1, 2012
    #1

  2. Guest

    Sorry, again FIGHTING to reach a Wi-Fi outlet before... inspiration down. Well, anyway.

    ....
    The plaintext can be thought of as being at the normal B temperature; the cypher-plaintext produced by this method is then equivalent to cooling down the B temperature of the text.
    ....
    Substitutions for common words can be optimized to give bit-based compression minimums; e.g., select either n or d or... for AND (but obviously not a).
    ....
    Hey! I am thinking of Hebrew! Maybe it is the process they went through...?!* And forgot to decipher the cypher-plaintext!
    ....
    A similar idea can be applied to pictorial text by taking pairs of letters and saving each pair as a single superimposed symbol. The width of the picture is then that of the line that admitted the least compression. Counterexample: co* cannot be compressed this way, obviously; example: de* would give the equivalent of a struck-through d and would be easy to decode by sight and by machine. A standard OCR algorithm can be trained to decode the new symbols, an easy problem if the plaintext is a typeset picture rather than handwriting.
    ....


    Danilo J Bonsignore
    , Sep 2, 2012
    #2
