June 15, 2019 05:38 pm GMT

A Primer on Data Compression

Likely States

Let's talk about statistical mechanics for a second. Suppose we have a system of 4 particles, labeled a through d, each of which can be in one of two states. We can call them the 0 state and the 1 state. A particle has one "unit" of energy when it's in the 1 state, and no energy when it's in the 0 state.

1-energy:   x x
0-energy: x     x
particle: a b c d

If we only measure the overall energy of the system, we could find anywhere from 0 units of energy (when all particles are in the 0 state) to 4 units of energy (when all particles are in the 1 state).

1-energy:              1-energy: x x x x
0-energy: x x x x  =>  0-energy:
particle: a b c d      particle: a b c d
--> total energy: 0    --> total energy: 4

Suppose we measure the energy of the system and we find 2 units of energy. This tells us that exactly two particles are in the 1 state, but it doesn't tell us which two particles are in that state. The overall energy of the system is known as a macrostate, while the energy of each individual particle (a=0, b=1, c=0...) is known as the microstate. Individual microstates can map to the same macrostate, for instance...

abcd = 0101
1-energy:   x   x
0-energy: x   x
particle: a b c d
--> total energy: 2

and

abcd = 1100
1-energy: x x
0-energy:     x x
particle: a b c d
--> total energy: 2

...both have energy 2 (the same macrostate), but which particle is in the 0 or 1 state differs between the two microstates. In a system like this, with only two energy states, particle a can be in the 0 state or the 1 state (2 possible states), and (independently) particle b can be in the 0 state or the 1 state (2 possible states), and so on. These 2s all multiply together and give us 2^N possible microstates for a 2-state, N-particle system, where the particles are distinguishable (as they are in our case -- they're labeled).

If each particle has a 50/50 chance of being in the 0 state or being in the 1 state, then a 2-energy system is the most probable macrostate. With a 50/50 0/1 probability, each microstate is equally likely and the most-probable, 2-energy macrostate simply has the most microstates which map to it:

abcd -> total energy
0000 -> 0
0001 -> 1
0010 -> 1          x
0011 -> 2          x
0100 -> 1        x x x
0101 -> 2        x x x
0110 -> 2        x x x
0111 -> 3      x x x x x
1000 -> 1  =>  0 1 2 3 4
1001 -> 2
1010 -> 2       energy
1011 -> 3      histogram
1100 -> 2
1101 -> 3    2-energy state
1110 -> 3   occurs 6/16 times
1111 -> 4
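We can check this microstate count with a few lines of code. A quick sketch in Python (the variable names are my own):

```python
from itertools import product

# Enumerate all 2^4 = 16 microstates of a 4-particle, 2-state system
# and tally how many microstates map to each macrostate (total energy).
counts = {}
for microstate in product((0, 1), repeat=4):
    energy = sum(microstate)  # the macrostate: total units of energy
    counts[energy] = counts.get(energy, 0) + 1

print(counts)  # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
```

As in the histogram, the 2-energy macrostate is the most probable, occurring in 6 of the 16 equally likely microstates.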

While the 3-energy macrostate implies that a given particle has a 75% chance of being in the 1 state, the 2-energy macrostate implies that, on average, any particle in the system is equally likely to be in the 0 state or the 1 state. There's a catch, though.

Bits of Entropy, Bits of Information

Let's move to bigger collections of particles. Let's say we have the entire alphabet except (and you'll see why later) y and z -- 24 particles labeled a through x -- and each of them can have 0 energy or 1 energy, again with a 50/50 probability to be in either state. Intuitively, which state seems less likely to occur by chance...

case (A):
100101011010100101011010
abcdefghijklmnopqrstuvwx

or

case (B):
111111111111000000000000
abcdefghijklmnopqrstuvwx

...? If you think that case (B) looks really unlikely, you're not alone. Most people intuit that the second case is somehow "less likely" than the first case -- even though we said each particle has a 50/50 chance of being in either state! From a purely probabilistic point of view, each of these states is equally likely. (It's also why you shouldn't avoid the lottery numbers 1-2-3-4-5 because they seem unlikely to win. They're just as probable as any other combination of five numbers.)
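This is easy to verify numerically: under independent 50/50 draws, the probability of any specific 24-particle configuration is the same. A quick sketch:

```python
# Each particle is an independent 50/50 draw, so the probability of any
# *specific* 24-bit configuration is (1/2)^24 -- patterned or not.
p_case_a = 0.5 ** 24  # exactly 100101011010100101011010
p_case_b = 0.5 ** 24  # exactly 111111111111000000000000

print(p_case_a == p_case_b)  # True
print(p_case_a)              # about 6e-08, i.e. 1 in 16,777,216
```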

This mis-intuition occurs because we've ordered the particles. If I'd given them to you as...

case (C):
101010101010100110101011
itjukvlwmxanbopcdqerfsgh

...does case (C) look more or less likely than case (B)? (They're actually exactly equivalent.)

But if we insist on distinguishable particles, with some ordering, then we can "process" the particles by that ordering. And then there is a serious difference between case (A) and case (B). It has to do with how "surprised" we are when we see a value different from ones we've seen in the past.

Let's process case (A) from left-to-right (that is, alphabetically). We can begin by assuming a 50/50 probability for 0 or 1 states. Then, as we get new information, we'll adjust our expectations. The formula for Shannon Entropy for a two-state system is:

H = -(p * log2(p)) - (q * log2(q))

...where p is the probability of the 0 state and q is the probability of the 1 state (or vice versa). If we begin by assuming that each state is equally likely, then the first bit of information (particle a) gives us:

H = -(0.5 * log2(0.5)) - (0.5 * log2(0.5))
  = -(0.5 * -1) - (0.5 * -1)
  = -(-0.5) - (-0.5)
  = 1

Exactly 1 bit of information! That's not too surprising, is it? The next bit is a bit trickier, though. We now have one data point (particle a), and we're trying to extrapolate from that. Simple frequentist probability now tells us that since 100% of the (one) bits we've seen so far have been 1 (a=1), we expect all future particles to be in the 1 state:

H = -(p * log2(p)) - (q * log2(q))
  = -(0 * (-infinity)) - 0
  = problem here

Really what we should do is use a Bayesian estimate of the probability. Very few things in life are 100% guaranteed to happen, so the formula above falls apart for all-or-nothing probabilities. Instead of 0% / 100% probabilities, let's let those probabilities get arbitrarily close to 0 and 1, but not actually reach them. Then what happens?

limit of H as e -> 0
H = -(e * log2(e)) - ((1-e) * log2(1-e))
  = 0 - (1 * log2(1))
  = 0 - 0
  = 0

If all the particles / bits we've seen in the past have had the same value, then we expect the next one will also have that same value. We don't need to look at it to "know" its value. The next particle gives us 0 bits of new information.
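All three situations -- the 50/50 start, the all-or-nothing breakdown, and the limiting fix -- can be captured in one small function. A sketch in Python; the function name is my own, and skipping zero-probability terms implements the convention that 0 * log2(0) = 0 (the limit above):

```python
from math import log2

def entropy(p: float) -> float:
    """Shannon entropy (in bits) of a two-state system where one state
    has probability p and the other has probability 1 - p."""
    # Skip zero-probability terms: the limit of x * log2(x) as x -> 0 is 0.
    return 0.0 - sum(x * log2(x) for x in (p, 1.0 - p) if x > 0)

print(entropy(0.5))  # 1.0 -- a 50/50 particle carries one full bit
print(entropy(1/3))  # about 0.918 bits
print(entropy(1.0))  # 0.0 -- a certain outcome carries no information
```

Note that entropy(p) peaks at exactly 1 bit when p = 0.5 and falls to 0 as p approaches 0 or 1, matching the limit above.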

This changes, though, when we see a new value. After we've seen a and b of case (A), we now (by the frequentist approach) think that there's a 50/50 probability of either state, and c gives us 1 full bit of new information. Particle d is where things really get interesting, though. At this point, we've seen a=1, b=0, and c=0. So we assume that there's a 1/3 probability of seeing a 1 in d, and a 2/3 probability of a 0:

H = -(1/3 * log2(1/3)) - (2/3 * log2(2/3))
  = -(1/3 * -1.585) - (2/3 * -0.585)
  = -(-0.528) - (-0.390)
  = 0.918

Particle d gives us slightly less than 1 bit of information! This is because the "surprise" factor isn't so all-or-nothing, as it was when we assumed a 100% probability of seeing a 1 after a, but it also won't be a total surprise, as it was for c, when we had a 50/50 shot of getting either a 0 or a 1. We've seen both 0s and 1s before, but we've seen slightly more 0s. So we expect a 0, but there's a possibility that it could be a 1 as well.

We can work our way down each bit of case (A) and calculate the amount (in bits) of "entropic" information contained within each particle / at each position:

case (A):
a:1 -> 1.000 bits (1/2 assumed probability of "1" energy state)
b:0 -> 0.000 bits (1/1 prior probability)
c:0 -> 1.000 bits (1/2 prior)
d:1 -> 0.918 bits (1/3)
e:0 -> 1.000 bits (2/4)
f:1 -> 0.970 bits (2/5)
g:0 -> 1.000 bits (3/6)
h:1 -> 0.985 bits (3/7)
i:1 -> 1.000 bits (4/8)
j:0 -> 0.991 bits (5/9)
k:1 -> 1.000 bits (5/10)
l:0 -> 0.994 bits (6/11)
m:1 -> 1.000 bits (6/12)
n:0 -> 0.996 bits (7/13)
o:0 -> 1.000 bits (7/14)
p:1 -> 0.997 bits (7/15)
q:0 -> 1.000 bits (8/16)
r:1 -> 0.997 bits (8/17)
s:0 -> 1.000 bits (9/18)
t:1 -> 0.998 bits (9/19)
u:1 -> 1.000 bits (10/20)
v:0 -> 0.998 bits (11/21)
w:1 -> 1.000 bits (11/22)
x:0 -> 0.999 bits (12/23)
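The table above can be reproduced with a short loop -- a sketch in Python, using the same scheme as the article: estimate the probability of a 1 as the fraction of 1s seen so far, starting from an assumed 50/50:

```python
from math import log2

def entropy(p):
    # Two-state Shannon entropy; zero-probability terms contribute 0.
    return 0.0 - sum(x * log2(x) for x in (p, 1 - p) if x > 0)

case_a = "100101011010100101011010"  # particles a through x

ones = seen = 0
total = 0.0
for label, bit in zip("abcdefghijklmnopqrstuvwx", case_a):
    p = ones / seen if seen else 0.5  # no data yet: assume 50/50
    h = entropy(p)
    print(f"{label}:{bit} -> {h:.3f} bits")
    total += h
    ones += bit == "1"
    seen += 1

print(f"average: {total / seen:.3f} bits per particle")
```

The first few lines match the worked examples: particle a contributes a full bit, b contributes nothing (the naive prior is certain), and d contributes 0.918 bits.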

Each particle gives, on average, about 0.95 bits of information. Close to -- but not exactly -- 1. If we wanted to compress this information, there's not much we could do on a bit-by-bit basis. However, you might notice patterns in the data:

case (A):
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

This stream of bits is clearly compressible in that, if we break it up sequentially into 2-bit blocks, we never see the patterns 00 or 11. This means we could encode this stream as:

case (A):
 B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

where A = 01, B = 10

We would, of course, have to know this encoding in order to decode the data, but it would save space in storing the information. Interestingly, this representation actually contains less entropic information on a bit-per-bit basis than the uncompressed version (average 0.89):

case (A):
ab:B -> 1.000 bits (1/2 assumed)
cd:A -> 0.000 bits (1/1 prior)
ef:A -> 1.000 bits (1/2)
gh:A -> 0.918296 bits (1/3)
ij:B -> 0.811278 bits (1/4)
kl:B -> 0.970951 bits (2/5)
mn:B -> 1.000 bits (3/6)
op:A -> 0.985228 bits (4/7)
qr:A -> 1.000 bits (4/8)
st:A -> 0.991076 bits (4/9)
uv:B -> 0.970951 bits (4/10)
wx:B -> 0.994030 bits (5/11)

The reason for this is those long runs of As and Bs in the encoding. This series can actually be compressed even further, as we could next allow the substitutions C=BAA and D=ABB to get:

case (A):
       C        D        C        D
 B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

...or E=CD:

case (A):
                E                 E
       C        D        C        D
 B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

...or F=EE

case (A):
                                  F
                E                 E
       C        D        C        D
 B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

This is, in a nutshell, how information compression works for digital files. You try to find common patterns in the data, and replace those patterns with shorter representations, which can later be decoded to restore the original file.
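The substitution scheme above can be sketched as a toy dictionary coder. This is illustrative only, not a real compression codec; the encode helper and its code tables are my own:

```python
def encode(stream: str, table: dict, size: int) -> str:
    """Split `stream` into fixed-size blocks and replace each block
    with its (shorter) code; raises KeyError if a block has no code."""
    blocks = [stream[i:i + size] for i in range(0, len(stream), size)]
    return "".join(table[block] for block in blocks)

bits = "100101011010100101011010"  # case (A)

level1 = encode(bits,   {"01": "A", "10": "B"}, 2)    # BAAABBBAAABB
level2 = encode(level1, {"BAA": "C", "ABB": "D"}, 3)  # CDCD
level3 = encode(level2, {"CD": "E"}, 2)               # EE
level4 = encode(level3, {"EE": "F"}, 2)               # F
print(level1, level2, level3, level4)
```

Decoding simply applies the inverted tables in the reverse order. Feeding case (D)'s bits through the first level fails immediately with a KeyError on its 11 block -- the toy version of the compression failure discussed next.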

This works well for the stream of bits above because, at each step, there are only two (or fewer) possible "super-patterns", each of which can be encoded as a single symbol. At each step, the data is compressed by 50%. If we had a single 00 or 11 string in the original bit pattern, though, the compression wouldn't work:

case (D):
                               ????
              ???                 E
      ??        D        C        D
 B  A  ?  A  B  B  B  A  A  A  B  B
10 01 11 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

This single 11 pattern means we need a new pattern at the first level of compression, which would require an extra bit, defeating the purpose of compression in the first place. This means we can't compress the data in 2-bit blocks at all. (Though it's possible that some other coding mechanism could do the trick.)

I said earlier that there was an obvious difference between case (A) and case (B), and we can see it clearly if we calculate the amount of entropy contained in each particle for case (B):

case (B):
a:1 -> 1.000 bits (1/2 assumed probability of "1" energy state)
b:1 -> 0.000 bits (1/1 prior probability)
c:1 -> 0.000 bits (2/2 prior)
d:1 -> 0.000 bits (3/3)
e:1 -> 0.000 bits (4/4)
f:1 -> 0.000 bits (5/5)
g:1 -> 0.000 bits (6/6)
h:1 -> 0.000 bits (7/7)
i:1 -> 0.000 bits (8/8)
j:1 -> 0.000 bits (9/9)
k:1 -> 0.000 bits (10/10)
l:1 -> 0.000 bits (11/11)
m:0 -> 0.000 bits (12/12)
n:0 -> 0.391244 bits (12/13)
o:0 -> 0.591673 bits (12/14)
p:0 -> 0.721928 bits (12/15)
q:0 -> 0.811278 bits (12/16)
r:0 -> 0.873981 bits (12/17)
s:0 -> 0.918296 bits (12/18)
t:0 -> 0.949452 bits (12/19)
u:0 -> 0.970951 bits (12/20)
v:0 -> 0.985228 bits (12/21)
w:0 -> 0.994030 bits (12/22)
x:0 -> 0.998636 bits (12/23)

...only 0.43 bits of information, on average, per particle. Long streams of 0s and 1s are also ripe for compression.
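Running the same bit-by-bit calculation over case (B) confirms that average -- a sketch in Python, with the same frequentist estimate and assumed 50/50 start:

```python
from math import log2

def entropy(p):
    # Two-state Shannon entropy; zero-probability terms contribute 0.
    return 0.0 - sum(x * log2(x) for x in (p, 1 - p) if x > 0)

case_b = "111111111111000000000000"

ones = seen = 0
total = 0.0
for bit in case_b:
    p = ones / seen if seen else 0.5  # no data yet: assume 50/50
    total += entropy(p)
    ones += bit == "1"
    seen += 1

print(f"average: {total / seen:.2f} bits per particle")  # average: 0.43 bits per particle
```

Even though cases (A) and (B) contain exactly the same number of 0s and 1s, case (B) carries less than half as much information per particle.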

Summary

So what have we learned? Data compression relies on the fact that files often contain repeated strings of bits or characters, which can be encoded so as to minimize the size of the encoded file. "Information density" (in the physical, entropic sense) is maximized when there is no discernible pattern -- no regular, repeated runs of bits or characters. Although two files may contain the same number of 1s and 0s, and so require the same amount of storage space on disk, the order in which those 1s and 0s appear in the file is crucial. Maximum information density presents itself as a file full of completely random 1s and 0s.


Original Link: https://dev.to/awwsmm/a-primer-on-data-compression-46c2
