1 Introduction
When it comes to math, one of the things that people are phenomenally bad at – aside from the obvious everything – is probabilities. Specifically, calculating probabilities. People are bad enough at estimating chances, but properly calculating then has its own hurdles.
In this particular case, I'd like to take a look at the probability of a Simple random sample: getting a particular distribution of results from a number of samples, like getting 3 sixes in a certain number of rolls. Specifically, I'll look at the case with replacement. In this case, the sample pool remains the same after each individual sample. The case without replacement (like picking colored marbles from a vase without putting them back) is a bit trickier so I won't cover that now.
The binomial distribution
One of the equations that is relevant here is the binomial distribution (see Eq 1), the scourge of math students everywhere. What I'd like to do is explain why it looks the way it does, what the terms mean and how you can generalize it to cover more than two outcomes (that's what the “bi” stands for: two). Basically, I'd like to show you how to “read” the equation and to interpret the terms in it, which will make it easier to deal with the subject.
(1) 
First, a few words about the Eq 1 itself. Say you have a random event that can have two outcomes: coinflip, true/false question, but also a dieroll if you just look at an X vs nonX result. Or, in the parlance of our times, whether you have a win (W) or a fail (F). What Eq 1 gives you is the probability of getting exactly k wins in n trials.
That's all well and good, of course, but unless you actually understand how the equation is built up and what all the terms in it mean, it'll just be yet another magic formula that someone dreamed up to torture you with. In practical terms, it also means that it's harder to apply it correctly. So let's look at what it means, starting with what each individual symbol means.
Symbol  Meaning 

W  Win index, or number of wins. 
F  Fail index, or number of fails. 
n  Number of trials. 
k  Selection value (usually linked to the number of wins). 
p  Singletrial probability (usually linked to wins). 
P(cond)  Total probability, with the condition. 
Here's an example of its use. Say you have 7 dice (7d6) and you want to know the probability of throwing 3 sixes. This means you have n = 7, k = 3 and p = 1/6, and the condition is W = k = 3.
Divide and conquer
The next step is to split the equation into the two parts that are of true relevance, namely:
 The base chance of getting a specific result that satisfies your conditions.
 The number of results that satisfies the conditions (each with the same base probability).
I've always found this division to make the most sense. Note that I've split it into two parts, not three. The terms involving p and (1−p) belong in the same group, as I'll explain later. The nice thing about this division is that the two parts neatly correspond to two concepts of mathematics: probability and combinatorics. If you understand those two, equations like Eq 1 should start to make sense.
2 Probability and its rules
When I say you need to understand probability theory to understand the binomial distribution, what I really mean is that you understand the rules or probability, and how to combine single chance occurrences into larger conditions. See also wiki:probability.
2.1 Probability definition
The probability of a selected case x is the number of times it occurs relative to the total pool of cases. This is the fraction given in Eq 2. All other rules are the result of this one.
(2) 
For example, we roll a d6 and we're interested in getting a six. Since there are 6 sides and only 1 six, the probability is P(1 six) = 1/6.
2.2 Probability range
Since the total number of items is at least as large as the selection, each probability itself is at most 1. A probability can't be lower than zero either, because counts are natural numbers.
This also holds for the sum probability of multiple items. Also, the probability to get any result at all is 1, because obviously something needs to happen. If you sum up probabilities and you get an answer greater than 1, some of the cases you've counted actually overlap. If you add up all probabilities and you get less than 1, you've missed a few cases.
(3) 
2.3 Complementary rule : NOT
If the probability of some event is P(X) and the sum probability of all events, then the probability of it not happening is 1−P(X). The symbol for “not” is ¬.
(4) 
For example, if the probability of a six on a d6 is 1/6, the probability of not getting a six is P(~six) = 5/6.
2.4 Disjoint conditions : XOR
If you have multiple conditions that are mutually exclusive (disjoint), the probabilities add. This corresponds to an “or” case.
Please note the “mutually” here. It must be either X or Y. For the programmers: the English word “or” actually mean an exclusive OR ( ^ in C), not an inclusive OR (  ). This difference is the main reason why calculated probabilities can end up greater than 1. There are several symbols that can be used for a disjoint condition, all of which are problematic. I'll use the boolean XOR symbol, ⊕, because even though it may be a little obscure for most people, at least it's nonambiguous.
(5) 
For example, the chance of getting a 6 or a 5 on a single d6. is P(six or 5) = 1/6 + 1/6 = 1/3. You can also approach it from the definition of Eq 2, where you still have a total of 6, but now you have 2 valid cases, leading directly to 2/6.
The term "or" is one of those areas where human languages, mathematics and logic are at odds. As said, the word “or” actually means an exclusive or (one of the two, but not both), even though in math and logic it's used for the union (the full combination; and/or).
It gets even more vexing when one looks at the symbols used. The C
equivalent for OR (the inclusive OR) is 
or

. However, the 
symbol is also used for
a
conditional probability. So that's out. I can't use the
C XOR symbol, ^
either, as that actually means AND in logic.
There's also the union symbol, ∪
, but I think that
stands for inclusive OR, not exclusive. So in the end it comes
down to the Oplus, ⊕. While relatively unknown, at least it
only has a single meaning.
If anyone has a better idea, I'd love to hear it.
2.5 Joint conditions : AND
When you have two independent conditions that both need to be true, the probabilities multiply. Think of it as a subselection, where you have to get a fraction of a fraction. This corresponds to an “and” case. The logical symbol for it is a cap ( ∩ ), but since it looks too much like a cup, I'll use the actual ANDsymbol ( & ).
(6) 
An example of this rule is the chance to throw two sixes with two dice. First, you need to throw a six with one for a 1/6 chance, and in the event that that happens, there's another 1/6 chance for the other, combining into P(six and six) = 1/6 · 1/6 = 1/36. You can also get this result from the definition. With 2 dice you have 36 possibilities in total, and only one of them will contain (6,6). Note that for a 5 and 6 the rules are different, since both (5,6) and (6,5) are possible here.
2.6 Usage in general and binomials in particular
When considering probabilities, you'll often have multiple trials/samples and a diffent type of answer for each trial. For example, say you have three different options, a, b and c. Each option has a different probability, p_{a}, p_{b}, and p_{c}. Now I'm looking for the probability of getting 2 a, 3 bs and 4 cs. In other words, the number of results for each possibility is k_{a} = 2, k_{b} = 3 and k_{c} = 4, for a total number of trials n = Σ_{i} k_{i} = 10. Now look at one possible result that satisfies the condition, say, "aabbbcccc". This is an AND case, so probabilities multiply. Each trial result gets the probability belonging to that answer, which in this case leads to
The generalized case for this has m classes, labelled X_{j}, and we're the number of outcomes we're looking for for each class is nj. The probability for a single result that satisfies this condition is Eq 7. It may look a little hairy, but all it really says is that you should multiply the probability of each of the result in the condition.
(7) 
Now, look back at the binomial probability. In that case, you have two options, Win and Fail, with probabilities p_{w} and pf, respectively. Let's look at the case for k wins. That is W = k_{w} = k. Now, because there are only two options (that's the “bi” in binomial), the probabilities of the failcase are now also set. The probability for failing is p_{f} = p_{¬w} = 1−p_{w}. As for the number of failed cases, remember that the binomial looks for exactly k wins, meaning the rest of the cases fail, leading to k_{f} = n−k. This little fact is never mentioned explicitly, but it shouldn't be overlooked.
In the end, what you get is Eq 8. The only difference with the actual binomial equation is that the subscript has been removed in the latter. Never underestimate the mathematician's laziness when it comes to writing stuff down.
(8) 
There's still one thing missing though: both Eq 7 and Eq 8 only cover the possibility of one specific valid result. In most cases, there will be others as well. In the a,b,c case, for example, I only looked at "aabbbcccc", but "bbbaacccc", "abcabcbcc", or "cccbbaaa" would have worked as well, as would any other combination. You need take all of these into account, but for that you need a way to quickly find all variations first. And that's the topic of the next section: combinatorics.
3 Combinatorics : counting evolved
Very briefly, combinatorics is the study of counting discrete variations. In particular, how you can quickly count them all in an orderly and correct fashion. In this case, what I'll need is a way to quickly count the different possible ways of distributing items over a number of bins. What also matters is whether the items are unique (that is to say, whether you can tell them apart from other items) or nonunique.
I'll start with unique items and then show how to incorporate nonunique ones. This will involve factorials (n!). Along the way, we'll pick up permutations (nPr) and combinations (nCr), which are merely special cases of the general formula.
3.1 Arrangements with unique items
The specific task here is to find out in how many ways you can distribute n unique objects over n different slots. This is a case of combinatorics without replacement, because once you've placed an object in a particular slot, you can't use it anywhere else.
I'll start by looking at an example with four objects, a, b, c and d. Naturally, there will also be four slots to put them in. To count the possibilities, we'll make a tree (see Fig 2). There are two ways to fill the tree: by a slot or by item. When working without replacement, it's more convenient to fill by item. Right now it won't matter that much, because when you start dealing with groups of nonunique items, it has a nice way of grouping all of those on the same depth.
As you can see in Fig 2, we start with four open slots and four items. First we place a, which has a choice of all four slots. For b, only three slots are left open; c has only 2 and d takes the last remaining slot. The point here is that each next item has one fewer slot. The last column has all the different arrangements, and as you can see there are 24 in total. The really important point, though, is to note that all the branches for a given level has the same number of subbranches. This means that you can simply multiply the number of slots left, leading to 4·3·2·1. In other words, it's simply the factorial: 4!.
Generalizing this, given n unique items, the number of variations is n!.
3.2 Arrangements with nonunique items
Now lets look at what happens when there are also identicallooking items. For example, lets take one a, one b and three cs, for a total of five items (see Fig 3). Technically, for five items you'd get 5! variations, however because the cs are identical some solutions are actually the same: we've overcounted. To resolve this, we need to know how many variations are actually the same. This is simply the number of variations within the cgroup. With 3 cs, that means 3!. So to get the final answer, we divide by 3!.
This is becoming similar to the case in section 2.6. Again, take the case that you have m groups, k_{j} identical items per group, and n = Σ_{j} k_{j} items in total. If all items were unique, you'd get n! variations, but in each group you're counting k_{j}! times too many. The proper number is n! divided by all the overcounts. This term is called the multinomial coefficient, and defined as:
(9) 
As an extra example of this, consider again the case with 2 as, 3 bs and 4 cs. That is, k_{a} = 2, k_{b} = 3 and k_{c} = 4, and n = 9. The number of arrangements is n! / (k_{a}! k_{b}! k_{c}!) = 9!/(2! 3! 4!) = 1260.
There's also a neat little way of describing this graphically, via a table (see Table 2). First, you take the most regular answer, where everything's divided into groups, like aabbbcccc. Write the indices above each term, ultimately resulting in n!. Now break up the string into groups index each group, giving you all the k_{j}! terms. The result is dividing the full indexed part with the groupparts.

3.3 Permutations and combinations
Permutations and combinations are merely special cases of Eq 9. This is not how it's generally taught, but it's still true. In permutation, you have r unique items and n slots. In other words, you have n−r identical, empty slots. You only have an overcount for the empty slots, so you end up with
(10) 
This also has another name: the falling factorial, written as n_{(r)}. It's called that because it's basically the top r terms of a factorial. For example, ^{7}P_{3} = 7_{(3)} = 7·6·5. It's written out in the equation below. Notice how the lower terms cancel out.
For combinations, you have r nonunique items over n slots. In this case you have two groups with overcounts, one of size r and one of size n−r. This becomes
(11) 
The combination has three interesting properties.
 It can be interpreted as a variation from the permutation, namely by switching from unique to nonunique items. That's where the division by r! comes from.
 Two, it's symmetric between r and n−r. That is, ^{n}C_{r} and ^{n}C_{n−r} are identical.
 Three, it uses two groups of identical items, which is exactly what you have in the case of the binomial distribution: one set of wins and one set of fails. This is why it shows up as a multiplier in Eq 1, and why the combination is also called the binomial coefficient.
Instead of looking at unique versus nonunique, you can also interpret permutation and combination by looking at whether the order of the positions matter. If it does (that is, "123" is not the same as "132"), then it's a permutation. If the order doesn't matter, you have a combination.
In the above , I've considered the binomial coefficient as a special case of the multinomial, but you can also do the reverse: string multiple binomials together to form a multinomial. In this case, you first distribute k_{1} elements over the whole n slots. This can be done in ^{n}C_{k1} ways. At this point, you've got n−k_{1} slots remaining, over which you distribute k_{2}. This leaves n−k_{1}−k_{2} for group 3, and so on until you've got no slots and items left. This is illustrated by Eq 12. Note how one part of the denominator of one term cancels out against the numerator of the next.
(12) 
The binomial and multinomial coefficients are usually written as a 2by1 matrix between parentheses (see Eq 13). The construct is pronounced “n choose k”. No, I don't know why either.
(13) 
Back to the binomial
Now that both parts of the division has been covered, lets go back to the binomial distribution and see how they fit together. For convenience, here's what we were talking about again.
(14) 
 The base chance of getting a specific result that satisfies your conditions.
 The number of results that satisfies the conditions (each with the same base probability).
The thing about these two parts is that they actually represent two of the rules of probability, namely those describing how probabilities combine. Part 1, the base chance is a joint condition; an AND case. The second part covers all the mutually exclusive ways that the joint condition can be satisfied.
For example with 2 as and 1 b, the ways the cases combine are aab, aba or baa. If you were to write it out completely, you'd get
Obviously, you're not going to write this out all the time, especially not when the numbers get large, but this is really what you're looking for. The use of the basechance and combinations are ways of shortcircuiting this. As said, the basechance is the probability of the string of ANDs, which will be the same for case. Because of this, the addition of all the (X)ORs simply becomes a multiplication by the number of (X)ORs that you have, which is given by the binomial, or multinomial if you have more than one option.
Another interesting point is that the binomial and multinomial coefficients themselves can be read as probabilities as well. It's just that because we're certain k_{j} items will exist if we're just arranging items, the probability for each group is 1, so the whole probability part is unnecessary.
4 Examples
Because I know all of this can be a little mesmerizing at first, I thought I'd give a few examples as well.
4.1 Beastly dice
First, simple example of the binomial distribution. Say we have five dice (5d6) and we're looking at the different ways of getting exactly three sixes. Note that this also means the remaining cases should be a nonsix!
First, the probability of a six is p_{6} = 1/6, and of a nonsix p_{¬6} = 5/6. As for the groupcounts, we have 2 groups with k = k_{6} = 3 and k_{¬6} = 2, and n = 5. Putting this together results in Eq 15. Note the base chance fo getting any triplesix result, and the number of ways you can vary this over five dice.
(15) 
There's also another way of getting to this result, namely by direct use of the definition of probability, Eq 2. To do this, we need know the total number of possibilities, and how many will result in 3 sixes. With five dice, the total is T = 6^{5}. There are ^{5}C_{3} = 10 ways of getting exactly 3 sixes in five dice and 5 nonsix options for the remaining two dice, leading to a total of valid cases of 10·5^{2} = 250.
4.2 Yahtzee : full house
More dice, this time looking at the yahtzee game. A full house is three of a kind plus a pair, like 66655, 11122 or 22333, and any variation thereof. In this particular case I want to look at getting a full house in one go, because to take all three turns into account is way too hairy.
The case is actually similar to the previous one in that you have 3 of one group and 2 of another, but the probabilities are a little different. For a specific full house, say 11122 the base chance is 1/6^{5}. However, we're not looking for a specific one, but any full house. This means that there will be 6 options for the first group, and 5 for the second. The final probability is
(16) 
An alternative approach is to consider the groupdetermining dice to be “free”, giving the probability 6/6 and 5/6 for the first and second group, respectively. After that, the remaining dice have to follow suit to give (1/6)^{3}.
4.3 Multinomial : spinning wheels
And now for an application of the multinomial. Consider the wheel from Fig 4. It has 2 as, 3 bs and 5 cs: n_{a} = 2, n_{b} = 3, n_{c} = 5. The probabilities for each letter follow from the probability definition, resulting in p_{a} = 2/10 = 20%, p_{b} = 3/10 = 30% and p_{c} = 5/10 = 50%.
The question is: what's the chance of getting 3 as, 2 cs in 9 spins?
Well, first you have to note that 3+2 is less than 9, so there's a hidden condition: the remaining 4 tries should be a b. This transforms the question into 3 as, 4 bs and 2 cs, or k_{a} = 3, k_{b} = 4, k_{c} = 2, Also note that the number of results have nothing to do with how many appear on the wheel.
(17) 
5 Summary
As this post covers quite a bit, I think a summary might be in order.

Rules of probability:

Definition: the probability of event X is the
number of occurances over the total number of events.
Do not forget this rule, as its the basis for all the others. 
Complementary rule (NOT). The probability of not getting
X is the rest of the cases.

Disjoint/Parallel condition ((X)OR) :
intersection of cases; probabilities add.

Joint/Serial condition (AND) :
union of cases; probabilities multiply.

Definition: the probability of event X is the
number of occurances over the total number of events.

Compoundcondition probabilities usually involve both AND and
OR conditions, making up the two parts of a
binomial/multinomial probability.
 The AND conditions results in the probability of one solution, which multiplies all the probabilities of each subsample.
 The OR condition counts how many solutions there are, given by the bi/multinomial coefficient.

Factorial ( n! ).
The factorial represents the number of ways you can arrange
n unique/distinct items.

Permutation ( ^{n}P_{r} ).
Permutation refers to how many ways you can distribute r
unique/distinct items over n slots, or when the order
of the elements matters.

Combination ( ^{n}C_{r} ).
Combination refers to how many ways you can distribute r
nonunique/indistinct items over n slots, or when the order
of the elements does not matter.

Binomial / Multinomial coefficient (
n choose k(_{1},
k_{2}, … k_{m}) ),
The binomial is the number of ways you can arrange 2 groups with
k and n−k nonunique elements over n
bins. This is basically the same thing as the combination.
The multinomial is the extension to m groups, with
k_{j} elements in each group. In both cases
you take the base n! for the unique variations, and divide
by k_{j}! for the overcounts for all the groups.
And lastly,
 Bi/multinomial probability.
When you have m groups with probability p_{j}
for each group and you want to know the chance of drawing exactly
kj items in
n = Σ_{j} kj samples, the
probability is:
(18)
And remember that this holds for just two groups (the binomial case) as well, except that often they'll just pretend that the second group isn't really there. Which is unfortunate, as that's where the (1−p) and (n−k) terms come from.
hum, your conclusion miss the most important things....
don't play loto...
You could have talked about or for non disjoint probabilities: P(A or B) = P(A) + P(B)  P(A&B)
Still, very good. If only I read this before my test last Friday...
@ elwing
I think the phrase is "Gambling is a tax for people that are bad at math."
@ Tyr
Oh, disjoint! So that's what it's called. I'd looked for a different term than "mutually exclusive", but couldn't find it.
About nondisjoint cases, I did cover this in an earlier draft, but then discovered that it gets complicated really fast and decided against including it. For two conditions, it is indeed
.
The minus is to get rid of the intersection that was counted twice by using P(A)+P(B). The problem is that this doesn't extend well into more conditions. For three, you get
Three subtractions because to correct for the overlaps, and then an additional addition to correct for the correction, yiich. I don't even know what it'd be for 4 or more cases. I think in these cases it'll be easier to break up the OR cases into pure disjoint (XOR) and joint (AND) cases. Or even use the complement rule if possible.
Sorry :\. I'd hoped I got it out on time for any upcoming exams, but I guess not.
Fantastic article. I hope I got the details right. Miss one small thing and your probability calculations give answers that are just plain wrong.
I think the conclusion is play lotto, but make sure you are the house. =) [Details relating public law, politics, and economics omitted.]
Definitely share if you have time of what differences you are senieg with your new statistical testing. Are you finding one main factor now that is affecting CTR? btw, we are also on twitter @abtests thanks for responding, and drop by the site soon.