## What is Entropy?

In cryptography, there’s a very important notion known as entropy. I should probably say right off the bat, this is only very indirectly related to the physics/chemistry/thermodynamics kind of entropy that people learn about in high school science classes, at least for our purposes.[1] Instead, this definition of entropy comes from information theory. Wikipedia defines entropy as “a measure of the uncertainty in a random variable.” This deceptively simple sentence doesn’t really tell about all the complexity and nuance that goes on behind the scenes though, and in particular, isn’t very meaningful without the background in statistics to know what a random variable is (protip: it’s not like the variables you use in algebra). So let’s try explaining it in simple terms, and in a way that relates mostly to practical applications. I’ll try to save unneeded math to the end, as a sort of bonus information section.

The most important application of entropy in your every day life is making passphrases. When you’re making a passphrase, you want it to have more bits of entropy than your attacker can overcome. This, fundamentally, is the only thing that should determine the security of your password choice.[2] That isn’t to say, however, that it’s the only thing that matters when picking a password, nor is it the only thing that matters for password security – when picking a password, your ability to enter it whenever you need to is kind of important, and picking a good passphrase doesn’t mean it gets handled correctly by what you’re putting it into. But for now, let’s focus on the entropy itself.

Entropy is most coarsely thought of as “randomness,” and is usually measured in bits. For example, a simple flip of a coin has one bit of entropy, a single roll of a six sided die has about 2.58 bits of entropy, and coming up with a phrase like “What is entropy?” probably has somewhere around 20 bits of entropy. The reason a coin flip has one bit of entropy is that there are two possible outcomes, each with equal probability (heads or tails, 50/50). A bit is a 1 or a 0, so you can see how that translates fairly directly. A fair coin has one bit of entropy per flip, so two coin flips have two bits of entropy, three coin flips have three bits of entropy, etc.

You’ll notice I haven’t actually defined what entropy is, and for now, I simply won’t. The actual definition requires some math, which I’ll save for a later section. Right now, I just want you to have some intuition towards it.

So how does entropy relate to breaking your passphrase? To understand that, notice something about the relationship between the bits of entropy in the coin flip, and the number of possible permutations. For one coin flip/bit of entropy, there are two possible permutations: heads, or tails – i.e. 0 or 1. For two coin flips/bits of entropy, there are four permutations: heads heads, heads tails, tails heads, and tails tails – i.e. 00, 01, 10, or 11. For three coin flips/bits of entropy, there are eight permutations: 000, 001, 010, 011, 100, 101, 110, 111. For each n coin flips/bits of entropy, there are 2n possibilities. In other words, each bit of entropy you add, doubles the number of permutations. For those who don’t know, exponential functions like that get really, really big pretty fast. There are estimated to be about 2266 atoms in the universe, for example.

Now how it all ties together: for an attacker to guess your passphrase, they have to go through, on average, half the possible permutations for that many bits of entropy. So if you have, say, 100 bits of entropy in your passphrase, an attacker would be expected to try about 299 passphrases (which is half of 2100) before finding yours, which is 633,825,300,114,114,700,748,351,602,688 guesses.

To understand why, let’s start with a reasonable assumption: the attacker knows how you made your passphrase. I know a lot of people get up in arms when this assumption is made, but it is standard practice in the field of cryptography (it’s an application of what is known as Kerckhoffs’s principle).[3]

The fact of the matter is, if you thought of the method, other people in this planet of billions have also used it, and if there is a group of people who use it, password crackers know that method. And if they know that method and it’s worth trying, they will try it. To understand just how likely it is that your method of making a passphrase is known, think about how many passphrases have been leaked by massive data dumps of passphrases used on websites (for those who don’t know, it’s somewhere around “a lot”). Humans are pretty good at finding patterns, but when there’s that amount of data in play, nothing beats a computer. So crackers will go through, and make programs that find patterns in the passphrases they have. In this day and age, they have gotten pretty damn good at it, and there are multiple articles detailing just how good they are.[4]

So, if the cracker knows that you used two coin flips to pick your passphrase, how many tries does it take for them to find your passphrase? Well, on average, you’re going to get half way through the list of every possibility before you find the one you’re looking for, and half of 4 is 2, so on average, it will take them two tries to find your passphrase. To get a tiny bit more math in the mix briefly, this can be derived in a couple different ways, but an easy one is defining “likely” to be “at least with probability of ½”: each passphrase has equal probability of being the correct one, so ¼ probability is assigned to each of the four possible passphrases. Then, if you pick two, that’s ¼ + ¼ = ½ probability that one of the two is the correct passphrase.

So now to address the root of the issue, how should one make their passphrase? This comes down to a matter of threat models, but ultimately, the issue is having more bits of entropy in your passphrase-generating-method than your attacker can try. The best way to be sure of this is to make your passphrase using a method you can accurately measure the entropy of. How much entropy you need to do that varies, but a good number for something you want to be sure is secure is around 100, since this is roughly equivalent to how hard it is to do some of the best attacks against the well known and vetted cryptographic algorithm AES.[5] If you really want to be secure though, for example if you want to truly secure your AES-256 encrypted hard drive with all its possible glory, then you want around as many bits of entropy as there are in the key (in that case, 256 bits of entropy – anything above that is wasted as it’s faster to bruteforce the keyspace than to try passphrases).

So how much entropy does each passphrase-generating-method have?[6] Here are a few examples:

• random letters (only one case): 4.70 bits per character
• random letters (random capitalization): 5.70 bits per character
• random letters (random capitalization)+numbers: 5.95 bits per character
• random keyboard characters: 6.55 bits per character
• grammatically correct English sentence: 1.10 bits per character [7]
• “l33t”-ified English sentences: ~1.86 bits per character [8]
• random words: 12-17 bits per word (depending on dictionary used)

This means that if you chose an English sentence for your passphrase, and you wanted 100 bits of entropy, you would need to make it 91 characters long. On the other hand, if the words were chosen randomly, you would only need 8 words, or about a third of that length. If you’re randomly selecting characters on the keyboard, you need about 16 characters to get 100 bits of entropy.

A key word to notice in the above examples is “random.” In order for those numbers to be accurate, when it says random, it had better be using a random source of data – something like dice rolls, atmospheric noise, or /dev/random. As it turns out, people making up random numbers, letters, or anything, it isn’t very random. It’s simply a psychological fact that people are terrible randomness generators.[9] This is why when you are using a process that requires randomness to have full entropy (almost by definition, all high entropy processes require this), you have to use some source of good randomness.

So how should you make your passphrases? Well, if you want them to be stronger than an attacker can break, use some random process to make them. There are many ways of doing this, but some of the most popular include using using software that reads from /dev/random, or uses timing between keystrokes – though I don’t recommend using software unless you understand how it works. A very simple way is to use dice, and diceware tables. Diceware.com[10] is a website that has tables for picking words using dice (the default dictionary gives 12.9 bits of entropy per word) or individual characters (at 6.55 bis of entropy per character). For a strong password that’s easy to memorize and generate, I think that diceware is the way to go (and apparently, so does Randall Munroe).

A final point to make about making passphrases is something glossed over earlier: a passphrase is most useful if you can remember it and use it. While having 100 bits of entropy is something nice to shoot for, one has to look at their threat models and decide if it’s worth it. Is having a passphrase 8 words long really the price you pay to keep hackers out of your lolcats.rus account? Maybe yes, maybe no. The truth is, for most web accounts, 100 bits of entropy is overkill. It’s probably more important that you use different passphrases on each account than it is you keep every advanced cracking method at bay, since all it takes is one website who stores your passwords in plaintext to ruin your awesome password (and as mentioned before, any given password has 0 bits of entropy!). For your encrypted hard drive where you enter the passphrase once every boot though, you really have no excuse.[11]

To give some reference on how hard it is to crack a passphrase of different levels of entropy, here’s a table of some important entropy levels:

Well, that about wraps it up. This is one of those subjects where pretty much any aspect can be gone into more detail, with no end in sight. But this post is already much longer than I had originally intended, so I’ll cut it off here. Though I do feel like I’m forgetting something…

Of course! What is entropy? Well, as I mentioned before, the definition is pretty much entirely math, and not necessary to get the concept and usefulness of it. But if you really want to know, well, the answers follow below.

## Bonus Math Material

Still here?

Well, the definition of entropy actually isn’t as complicated as I made it out to be. It just requires some math knowledge that frankly, most of the general public probably have never studied and probably never will. To be honest, I don’t think there’s going to be anything here that can’t be found on Wikipedia, though maybe my slightly different wording will make it easier to understand, what with two reference points and all.

The definition of the Shannon entropy of a discrete random variable X, or, H(X), is:

H(X) = E[-log(P(X))]

Depending on what you learned, if you’ve studied stats or stochastic math before, you’re probably thinking, “That’s it?” The expression is pretty simple, but it requires knowing what the hell all of that stuff is. So, let’s start with the basics.

A random variable. In this case, our random variable is X (a popular name for random variables – they generally have similar names to algebraic variables, only capitalized). A “random variable” can be thought of as every possible outcome of a process that has some degree or randomness, or unpredictability, or probability (same thing). Note that I say every, not any. This is different from most variables you’re probably thinking of. To be more accurate, a random variable is a function that maps all possible outcomes to a number. So if X is representing a coin flip, it might be a function that maps “heads” to 0 and “tails” to 1.

Now we know what a random variable is (or at least have a vague idea), what is P(X)? P(X), in this case, represents what’s called the “probability mass function.” I say “this case” because it’s only called that for discreet random variables, but since that’s what we’re working with, that’s what we’ll call it. The probability mass function represents the probability of each value in the range of the random variable. So for example, if X is that coin flip used before, then P(X) represents the probabilities of getting heads (0) and getting tails (1). So if the coin is fair, then P(0) = P(1) = ½ (for heads or tails), and P(x)=0 for all other values of x.

Let’s jump over to that E now. E[X], in this context, represents the “expected value” of the random variable X. “Expected value” can roughly be thought of as the “weighted average” of the random variable. It’s calculated by multiplying each value in the range of the random variable by it’s associated probability:

E[X] = x1*P(x1)+x2*P(x2)+…xn*P(xn)

where x1, …, xn represents the possible outputs of X. So, using our coin again, we get E[X] = 0*P(0)+1*P(1) = 0*½+1*½ = ½. This isn’t intuitively meaningful though, so let’s use a six sided die instead. We’ll define our random variable X this time to be the side of the die that lands face up, or mapping the number of dots to that number (e.g. if it lands on 6, then the result is 6). If the die is fair (i.e. each side has equal probability of landing on it), then

E[X] = 1*P(1)+2*P(2)+3*P(3)+4*P(4)+5*P(5)+6*P(6) = 1/6(1+2+3+4+5+6) = 1/6(21) = 3½ .

So if you roll a six sided die, the expected value is 3½ .

So, what’s that log doing in there? If you’ve kept up so far, I’m assuming you know what a logarithm is, but on that off chance someone reading this doesn’t, here’s a Khan academy video. You’ll notice, there’s no explicit base on this log. That’s because you can technically use whatever base you want, but what base you use is what ends up defining the units of your result. We’ve been working with bits, so we’ll stick with a base of 2. The expression -log(P(X)) is actually a value known as the “self-information” of X, and is commonly represented as I(X). An intuitive definition might be “the amount of information that events give you.” Observe that the self-information is a function, which is itself another random variable – one based on probabilities of outcomes instead of the vales of those outcomes. For most of the cases we’ve dealt with, this is where the real math behind entropy comes from. Looking at our coin toss example again, suppose someone tells us that the coin landed heads up. Then we have I(X) = -log(P(0)) = -log(1/2) = 1 bit. By telling us what side the coin landed on, we were given 1 bit of information. Notice that this is also the amount of entropy in a coin flip. Those last two sentences are related by this crucial point: for our coin toss, the probability of each possible outcome is equal. They all have the same probability, so when we take the expected value of all of them, we end up with that same number. In other words, what’s the weighted average of the same number repeated a bunch of times? Well, that number, clearly. Where things get more complicated are cases like English sentences, where the probability of each outcome is most certainly not the same.

Now then, putting it all back together:

H(X) = E[-log(P(X))]

The entropy of X is the expected value of the self-information of X. The self-information of X is the function that gives the amount of information an event in X tells us. The expected value is sort of like the weighted average.

If we know every possible event and its probability, we can expand this expression into

H(X) = P(x1)*I(x1)+P(x2)*I(x2)+…+P(xn)I(xn) = P(x1)*-log(P(x1))+P(x2)*-log(P(x2))+…+P(xn)*-log(P(xn)).

If we also know that each of these probabilities are equal, then this can be further simplified to

H(X)=n*1/n*-log(1/n)=log(n).

Hence, the simplified form used in most fully random contexts.

So, that’s pretty much how information entropy works. If you’d like to ask any questions or see some worked out examples or something, go ahead and ask below.

## Footnotes

[1] There are, actually, some very direct correlations between Shannon entropy and thermodynamic entropy. In fact, while Shannon entropy was named after thermodynamic entropy because of their apparently superficial mathematical similarities, it turns out that thermodynamic entropy is a result of information entropy, and can be accurately thought of as an application of information theory. This field of study is known as statistical mechanics, or statistical thermodynamics. This, however, has nothing to do with practical applications of information entropy for making passwords and the like. You can read more about it here if you’re interested though: https://en.wikipedia.org/wiki/Statistical_mechanics

[2] Some people disagree. Those people are usually wrong. That particular link could also be considered borderline deceptive. I might write a whole post on the subject, but for now, suffice to say this: using entropy to measure the strength of your passphrases relies on proven math. Using thier method (i.e. padding) relies on password crackers being bad at what they do.

[3] This is also the reason why you should only trust cryptographic software that uses source code which has been vetted by well reputed professional cryptographers. The easiest way to be sure this is true is to use software that is both open source and popular.

While this is probably the most popular article on the subject, the conclusions it makes aren’t entirely accurate, stemming from a less than perfect understanding of the material. In particular, “XKCD style passwords ” (i.e. diceware passphrases) cannot actually be cracked using this method if done properly. The difference comes from the word “random,” which to many people means “stuff I made up,” but in the comic is actually referring to the more strict mathematical definition, a point gone into with a bit more depth later in the post.

[5] In particular, a related key attack for AES-256 has a computational complexity of 99.5 bits, but this assumes you’re using AES-256, and there is a related key (which there shouldn’t be). The real idea behind 100 bits as a decent standard is that if someone has the ability and funds to efficiently enumerate 100 bits, then chances are you could make your passphrase a million bits of entropy and they would just attack the security of the thing your passphrase is being entered into.

[6] To calculate the entropy of a set of n choices with equal probability, it’s simply log2(n), or if your calculator doesn’t have a log2 button, log(n)/log(2).

[7] This is an estimate (though one I’d consider fairly accurate), and doesn’t use the formula listed in [6] because each character doesn’t have equal probability. Instead, it uses the actual definition of Shannon entropy, and an experiment run where someone (these days, a computer program) guesses each character, one at a time, and the character is revealed upon a correct guess. When the full set of characters is guessed, then the number of guesses it took is divided by the number of characters. A common complaint I’ve seen people use against this method is that when someone is cracking a passphrase, they don’t get to know that a guess is partially correct. This argument misses the point of the experiment though, as it isn’t meant to model the method a cracker would use, but instead to measure the entropy. The fact of the matter is, I would guess the passphrase “Marry had a little lamb” long before I tried the passphrase “Marry had a little LaipUfIkOch.” Any cracker worth its salt is going to have parsed a good chunk of the internet (or more simply, password dumps) to know which combinations of words and punctuation are most likely, and use those rules to progressively generate likely passphrases.

[8] I don’t have an official citation for this, this is just my estimate. Unlike [7], it’s probably pretty bad. It was calculated as such: vowels (a, e, i, o, u) have a relative frequency of 38.1%. If each of these were to be independently replaced with one of 4 characters with equal probability (e.g. the original vowel, a number, and two symbols that look like it), then a sentence would have (1-0.381)*1.1+0.381*(1.1+2)=1.862 bits per character, with the first half representing the characters normally guessed at 1.1 bits, the second half representing the vowels which now have an extra 2 bits of entropy. This estimate uses letter frequency of vowels to estimate their entropy, which is terrible (how many words do you know that start with three consonants in a row? Using this kind of estimate, it would be almost quarter of them), and that there are multiple, equal probability replacements for each vowel (unlikely), and that each vowel is equally likely to be replaced (also unlikely). So this makes me believe that my estimate runs a tad high. On the other hand, it’s not only vowels that are replaced, so it could possibly be higher. But with the exception of “t,” the frequency of other letters is generally less than most vowels, so other letter replacements would have less of an impact. At most, if you replaced every letter and not just vowels with mutually exlusive 4 character sets at random, you would have about 1.1+2=3.1 bits of entropy per character.

[9] There are many, many studies that show this, but here’s a fun one you can do on your own: https://www.khanacademy.org/labs/explorations/frequency-stability

With a random distribution, the bars in each of those charts should be roughly even (pay attention to the units). When you try to make your own “random” 1s and 0s though, you’ll notice some patterns emerge.

default dictionary: http://world.std.com/%7Ereinhold/diceware.wordlist.asc

random character table: http://world.std.com/~reinhold/dicewarefaq.html#tables

[11] Unless the operating system puts arbitrary limits on the number of characters you can use and ties it to your user password. That would be horrible.