If you’d like to understand why a good knowledge of probability theory matters, consider the following eye-opening example (inspired by something I read at David Siegel’s site).
A rare disease is known to exist in 1% of the population. A test for the disease is known to be 98% accurate, meaning that if you have the disease, the test will return positive 98% of the time, and if you don’t have it, the test will return negative 98% of the time.
Now, you’re curious whether you might have the disease and so you go take the test. It comes back positive. What is the probability you actually have the disease? The results might surprise you.
To solve this problem, we use Bayes’ theorem from Bayesian probability theory, which says:
P(A|B) = P(A)*P(B|A)/P(B)
- A = You have the disease
- B = You test positive
In words, this means that the probability that you have the disease (A) given that you test positive (B) is the probability that you have the disease, P(A), times the probability that you test positive given that you have the disease, P(B|A), divided by the probability that you test positive, P(B).
So to make this calculation we need three numbers:
- P(A) — We know that P(A) (the probability we have the disease) is 1%.
- P(B|A) — We know that P(B|A), the probability that we test positive if we have the disease, is 98%.
- P(B) — We don’t know this one, and have to calculate it.
We can compute P(B), i.e. the probability that a random person taking the test returns positive, using the law of total probability:
P(B) = P(B|A)P(A) + P(B|!A)P(!A)
This means that the probability of testing positive is the sum of two terms: (a) the probability that we test positive given that we have the disease, times the probability that we actually have the disease, plus (b) the probability that we test positive given that we don’t have the disease, times the probability that we don’t have the disease.
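In the above equation, we know all the values except P(B|!A), the probability of testing positive when we don’t have the disease. How to determine this? It does not follow from P(B|A); it is a separate property of the test. Here we take “98% accurate” to also mean that a person without the disease tests negative 98% of the time, i.e. P(!B|!A) = 98%. Since such a person must test either positive or negative, the following must be true:
P(B|!A) + P(!B|!A) = 100%
Therefore, since P(!B|!A) is 98%, we can conclude that P(B|!A) must be 2%.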
P(B|!A) is known as the “false-positive rate”, i.e. the fraction of those who don’t have the disease but test positive anyway. Plugging in the numbers:
P(B) = 0.98*0.01 + 0.02*0.99 = 0.0296
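If you’d rather check this arithmetic in code, here is a minimal Python sketch of the same calculation. The function and parameter names are my own, chosen for illustration, and the 2% false-positive rate is the assumption discussed above, not something measured.

```python
def total_probability(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(B) = P(B|A)P(A) + P(B|!A)P(!A)."""
    p_not_a = 1.0 - p_a  # probability of NOT having the disease
    return p_b_given_a * p_a + p_b_given_not_a * p_not_a


# Numbers from the example: 1% prevalence, 98% true-positive rate,
# and an assumed 2% false-positive rate.
p_b = total_probability(p_b_given_a=0.98, p_a=0.01, p_b_given_not_a=0.02)
print(round(p_b, 4))  # 0.0296
```

Note that most of P(B) comes from the second term: healthy people are so much more numerous that even a 2% error rate among them outweighs the true positives.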
So now we have all the numbers to calculate P(A|B), i.e. the chances that we actually have the disease given that we tested positive for it:
P(A|B) = 0.01*0.98/0.0296 ≈ 0.33
Surprising, no? If we went to the doctor, took this test, and tested positive, there would only be about a 33% chance that we actually have the disease.
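The whole calculation fits in a few lines of Python. This is only a sketch: the function name `posterior` and its parameter names are hypothetical, and it assumes, as above, that the test is wrong 2% of the time for people without the disease.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(A|B) using Bayes' theorem.

    prior               -- P(A), probability of having the disease
    sensitivity         -- P(B|A), positive test given disease
    false_positive_rate -- P(B|!A), positive test given no disease
    """
    # P(B) via the law of total probability
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return prior * sensitivity / evidence


p = posterior(prior=0.01, sensitivity=0.98, false_positive_rate=0.02)
print(f"{p:.4f}")  # 0.3311
```

Try changing `prior` to see how strongly the answer depends on how rare the disease is.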
How can we make sense of this? It’s actually quite logical.
Imagine a random population sample of 1,000,000 people, all of whom take the test. Of those, 10,000 (1%) will have the disease, and 9,800 of them (98%) will correctly test positive. Of the 990,000 (99%) who don’t have the disease, 19,800 (2%) will also test positive; these are the false positives.
So of the 1,000,000 people tested, 29,600 will test positive, but only about a third of those will really have the disease, i.e. 9,800/29,600 or 33%.
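The same head count can be reproduced with plain integer arithmetic. The variable names below are mine, and the percentages are the ones assumed in this example (1% prevalence, 98% true positives, 2% false positives).

```python
population = 1_000_000

sick = population * 1 // 100           # 10,000 people have the disease
healthy = population - sick            # 990,000 people do not

true_positives = sick * 98 // 100      # 9,800 sick people test positive
false_positives = healthy * 2 // 100   # 19,800 healthy people test positive

all_positives = true_positives + false_positives
share_sick = true_positives / all_positives  # fraction of positives who are sick

print(all_positives, f"{share_sick:.1%}")  # 29600 33.1%
```

Counting people instead of multiplying probabilities gives the same answer as Bayes’ formula, and it makes it easier to see that the false positives outnumber the true positives roughly two to one.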