News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

100 to 1 Compressor

Started by Sergiu FUNIERU, February 23, 2010, 11:15:06 PM

Previous topic - Next topic

Rockoon

Quote from: Mirage on August 31, 2010, 03:59:59 PM
Eh couldn't a ratio like that be achieved by the compressor itself having a list or something of chunks of byte sequences that it would map to a smaller byte while still keeping unique? Mind you it would have to take pretty big chunks and the list itself would be pretty large but doing that would seem feasible to me.  :lol

There are more larger sequences than there are smaller sequences and this reality is very dramatic when talking about 100:1.

Think about it for a moment.

How many 100 byte sequences can you compress into 1 byte?  .. only 255 of them, right? (the 256th byte value would indicate that the 100 byte sequence is not in the dictionary)

Now, how many English sentences are 100 bytes or less? Almost all of them, right? Can you even come close to reducing all of English sentences to only 255 sentences?

http://en.wikipedia.org/wiki/Pigeonhole_principle
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Mirage

Not exactly what I was shooting at but okay :P

Rockoon

Quote from: Mirage on September 01, 2010, 07:58:03 PM
Not exactly what I was shooting at but okay :P

It is what you were shooting at. You just dont realize it yet  :bg

The confusion is in the thought that if instead of (100 bytes : 1 byte), that if its (400 bytes : 4 bytes), that it would be different. Its not different at all.

The ratio of the number of permutations of 400 bytes to 4 bytes, is the same ratio as the number of permutations of 100 bytes to 1 byte.

There are more 400 byte sequences (a 241 digit number) than there are atoms in the visible universe (an 80 digit number.) A dictionary of 4294967295 pre-stored 400 byte sequences would take 1.7 terabytes of disk space, and still would not be able to attain 100:1 compression on average.

A simple experiment you can perform is to take a largish mp3 file and processes it 2 byte at a time, simply keeping track of whether or not each 16-bit possibility occurs. You will find that all 65536 possibilities occurs in the file. Pigeon Hole Principle in action. Each of 65536 items occur, and cannot therefore be encoded with less than 65536 codes. You will also find that the frequency of each 16-bit value is about the same, so you can't even get away with using shorter codes for more frequent ones.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

xandaz

    Well.... i just saw this thread. It kinda got me laughing. Thanks Sergiu.