Thursday, September 26, 2013

surprises, part 2

In my last post, I described a classic problem in probability: suppose a certain disease occurs in 0.5% of a given population, and there is a test that detects it with 99% accuracy, but also returns a false positive 1% of the time it is used on a healthy person. What is the conditional probability that an individual has the disease, given that they have a positive result from the test? The answer, somewhat surprisingly, turns out to be less than a third.

When we discussed this in my probability class, one student asked a very sensible question: What if we test the person twice?

This question seemed worth investigating. As I see it, the question can be interpreted two ways. On one hand, what if we tested everyone twice? How would that affect the conditional probability given above? On the other hand, what if we only gave the test a second time to those who had a positive first test? Would we be more likely to filter out those who are actually ill in that case, having restricted to a population in which the disease is more prevalent? Do these two methods produce different results?

To begin with, let’s return to the original question and analyze it more thoroughly by introducing some variables. Let $r$ be the prevalence of the disease in the total population (which can be interpreted as the probability that any particular individual has the disease). Suppose the test we have returns a true positive (a positive result for someone who is ill) with probability $p$ (called the sensitivity of the test), and it returns a false positive (a positive result for someone who is well) with probability $q$ (the value $1 - q$ is called the test’s specificity). Bayes’ formula then says that the probability of having the illness given a positive test result is \[ P(r) = \frac{r \cdot p}{r \cdot p + (1 - r) \cdot q}. \] If we fix $p$ and $q$ and let $r$ vary, we get a graph like the following:

(drawn here with $p = 0.98$ and $q = 0.05$; you can click on the graph to go to an interactive version). Notice the large derivative for small values of $r$: the low conditional probability we got at the beginning was essentially an artifact of the disease itself being fairly uncommon. (As one student slyly put it, “so the way to make a positive test more likely to mean you’re sick is to give more people the disease?”) Raising the value of $p$ doesn’t change the graph much. The real problem lies in the false positives; if the disease is sufficiently rare, then having any chance at all of false positives ($q > 0$) means that the false positives will outnumber the true positives.
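
If you want to play with the numbers yourself, here is a minimal sketch in Python of the formula above (assuming the same $p = 0.98$ and $q = 0.05$ used in the graph):

```python
# Posterior probability of illness given ONE positive test, via Bayes' formula.
# p = sensitivity (true positive rate), q = false positive rate.

def posterior(r, p=0.98, q=0.05):
    return r * p / (r * p + (1 - r) * q)

for r in [0.001, 0.005, 0.01, 0.05, 0.2]:
    print(f"prevalence {r:.3f} -> P(ill | positive) = {posterior(r):.3f}")
```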

If we change the situation so that every time an individual is tested we administer the test twice, then a few things happen. First, the chance of getting two false positives when testing a healthy individual is $q^2$, which is generally much smaller than $q$. Meanwhile, the chance of getting two positives when testing a sick individual is $p^2$, smaller than $p$ but not by much. The result is a much steeper curve for low-prevalence diseases:

(the red curve is the same as before; the purple curve represents the probability of having the illness given two positive tests). Effectively, we have created a new test with a much reduced chance of false positives.

But testing everyone twice seems unnecessary. Just as a low prevalence leads to a reduced probability that a positive result means the disease is actually present, so it also reduces the probability that one is ill given a negative result. Here is the graph of this latter conditional probability (that is, the prevalence of the disease among those who have a negative test):

So we shouldn’t worry too much about those who have a negative test. We can give the test a second time just to those who have a positive first test. In effect, rather than creating a new test as before, we have restricted to a new population, in which the disease is far more prevalent (as given by the original conditional probability $P(r)$). Here is the graph of the original function $P(r)$ (again in red) together with the graph (in orange) of the probability of having the disease given a positive result and being among those who had a first positive test:

Do you notice something about the purple and orange curves in the graphs above? They are the same. I admit, this surprised me at first. I thought that having a second positive result when restricted to those who already had one would make it more likely that one had the disease than if we tested everyone twice indiscriminately. But the algebra bears out this coincidence of graphs. It doesn’t matter whether everyone is tested twice or just those who first have a positive result; the conditional probability of having the disease after two positive tests is the same either way. In the latter case, of course, far fewer total tests are administered.
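
For the record, here is that algebra. Apply Bayes’ formula a second time, with the new prevalence $P(r)$ playing the role of $r$: \[ \frac{P(r)\, p}{P(r)\, p + (1 - P(r))\, q} = \frac{\dfrac{rp}{rp+(1-r)q}\, p}{\dfrac{rp}{rp+(1-r)q}\, p + \dfrac{(1-r)q}{rp+(1-r)q}\, q} = \frac{r p^2}{r p^2 + (1-r) q^2}, \] which is precisely the “test everyone twice” probability.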

Something we haven’t considered yet is what it means to have one positive and one negative test. Here the relative sizes of $p$ and $1-q$ matter. You can check that if $p + q = 1$, then having one positive and one negative test returns one’s likelihood of having the disease back to that of the overall population (because a sick person and a healthy person have the same chance of getting one positive and one negative result). However, if $q$ is greater than $1-p$ (that is, if a healthy person is more likely to have a false positive than a sick person is to have a false negative), then obtaining different results on two tests means one’s chance of having the disease is slightly less than in the overall population. One last graph, in which the red and blue curves from before reappear, together with a green curve representing the probability of having the disease given one positive and one negative test:

Conversely, if $q$ is less than $1 - p$, then the green curve would lie slightly above the diagonal.
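
Here is a quick numerical check of the mixed-result case (a sketch, keeping $p = 0.98$ and trying a few values of $q$ around $1 - p = 0.02$):

```python
# Posterior probability of illness given one positive and one negative test.
# A sick person shows this pattern with probability 2p(1-p), a healthy person
# with probability 2q(1-q); the factor of 2 cancels in Bayes' formula.

def posterior_mixed(r, p, q):
    return r * p * (1 - p) / (r * p * (1 - p) + (1 - r) * q * (1 - q))

r = 0.05
print(posterior_mixed(r, p=0.98, q=0.05))  # q > 1-p: below the prevalence r
print(posterior_mixed(r, p=0.98, q=0.02))  # q = 1-p: exactly r
print(posterior_mixed(r, p=0.98, q=0.01))  # q < 1-p: above r
```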

The ideas we have been exploring are at the heart of Bayesian analysis, in which a certain assumption (called a prior) about how some characteristic is distributed is fed into a conditional probability model, and a new distribution is obtained. The new distribution becomes the new prior, and the process may be repeated. This kind of analysis depends on a Bayesian view of probability, in which the distribution represents a measure of belief (rather than any necessarily objective knowledge), and how that belief changes with the introduction of new knowledge. In our case, our prior was the assumption that the disease had prevalence $r$, and the new knowledge we introduced was the result of a medical test. This is the same kind of analysis—at a much more elementary level—that Nate Silver made famous (or perhaps that made Nate Silver famous) during recent election seasons. I must say, I was pleased that a student’s question led so neatly into this timely topic.

Tuesday, September 17, 2013

my favorite surprise

I’m excited about tomorrow. Tomorrow in my probability class, we’re going to start discussing Bayes’ Formula. This is the main thing I remember about my college probability class. While we have already seen some surprising results in particular cases where the rules of probability are applied, this is, to me, the first truly surprising general result. It changes everything I think about probability.

Here’s my motivating example: suppose we have before us an urn that contains five blue balls and five red balls. We draw two balls, in order, and record their colors. (To be clear, this is sampling without replacement.) What is the probability that the first ball is red? “Well,” you say, “evidently the likelihood of that is 1/2, because half of the balls are red.” “Very well,” I say, “then what is the probability that the first ball is red, assuming that the second ball is blue?” “What does that have to do with it?” you ask. “When the second ball is drawn, the first one has already been chosen, so how could the second ball turning up blue have anything to do with the probability that the first ball is red?”

Let’s throw some formulas in here. Suppose $E$ and $F$ are two events in the sample space $S$ of an experiment. (I discussed the definitions of these terms in my previous post.) The conditional probability of $E$ given $F$, written $P(E\mid F)$, is the quotient $\dfrac{P(E \cap F)}{P(F)}$, meaning, loosely speaking, that we consider all the ways both $E$ and $F$ can occur (weighted by their individual probabilities), and think of this as just a subset of the outcomes where $F$ occurs (rather than all of $S$). “Sensible enough,” (I hope) I hear you say. Now, you will hopefully also agree that we can split $F$ into two parts: the one that intersects $E$ and the one that does not, i.e., $F = (F \cap E) \cup (F \cap E^c)$. “Aren’t you overcomplicating things?” you demur. “Just wait,” I plead. The events $F \cap E$ and $F \cap E^c$ are mutually exclusive (i.e., disjoint), and so we have $P(F) = P(F \cap E) + P(F \cap E^c)$. Interesting, no? So we can write \[ P(E \mid F) = \frac{P(E \cap F)}{P(E \cap F) + P(E^c \cap F)} \] (using the fact that $E \cap F = F \cap E$). And now perhaps it seems like this manipulation isn’t so weird, because in our motivating example, each of the terms in that expression isn’t so hard to compute, and in fact one of them appears twice!

So what happens? Let’s return to our urn and say $E$ is the event “the first ball is red”, while $F$ is the event “the second ball is blue”. Then $P(E \cap F) = \big(\frac{5}{10}\big)\big(\frac{5}{9}\big)$ and $P(E^c \cap F) = \big(\frac{5}{10}\big)\big(\frac{4}{9}\big)$, so \[ P(E \mid F) = \frac{\big(\frac{5}{10}\big)\big(\frac{5}{9}\big)}{\big(\frac{5}{10}\big)\big(\frac{5}{9}\big)+\big(\frac{5}{10}\big)\big(\frac{4}{9}\big)} = \frac{25}{25+20} = \frac{5}{9}. \] Since 5/9 > 1/2, it is more likely that the first ball was red if we know that the second ball is blue! (Surprised? Think about what happens if there are only two balls to begin with, one blue and one red. Once that’s sunk in, try the above again starting with $m$ blue and $n$ red balls in the urn.)
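
If the algebra doesn’t convince you, simulation might. Here is a minimal Monte Carlo sketch in Python (hypothetical code, not from class):

```python
import random

# Estimate P(first ball red | second ball blue) for an urn with five red
# and five blue balls, drawing two in order without replacement.
random.seed(1)
urn = ["red"] * 5 + ["blue"] * 5
both = second_blue = 0

for _ in range(1_000_000):
    first, second = random.sample(urn, 2)  # an ordered draw without replacement
    if second == "blue":
        second_blue += 1
        if first == "red":
            both += 1

print(both / second_blue)  # should be close to 5/9 ≈ 0.5556
```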

So far I’m cool with everything that’s happened. The realization that later events provide information about earlier ones is a bit of a jolt, but not so far-fetched after a little reflection. Bayes, however, endeavors to turn our minds further inside-out. We just need one new idea, just as simple as everything we’ve done to this point: the equation for conditional probability can be rewritten as $P(E \cap F) = P(F) \cdot P(E \mid F)$. And of course, because $E \cap F = F \cap E$, we could just as well write $P(E \cap F) = P(E) \cdot P(F \mid E)$. Now, as before, let’s split $F$ into $E \cap F$ and $E^c \cap F$. Using our most recent observation, we have \[ P(E \mid F) = \frac{P(E) \cdot P(F \mid E)}{P(E) \cdot P(F \mid E) + P(E^c) \cdot P(F \mid E^c)}. \] “Now why on Earth…?” you splutter, to which I reply, “Because sometimes the knowledge you have is more suited to computing the conditional probabilities on the right than finding the one on the left directly from the definition.”

Here’s a classic example. Suppose there is an uncommon illness that occurs in the general population with probability 0.005 (half of a percent). Suppose further that there is a medical test for this affliction that is 99% accurate. That is, 99% of the time the test is used on a sick patient, it returns positive, and 99% of the time it is used on a healthy patient, it returns negative. You are concerned that you might have this illness, and so you take the test. It comes back positive. What is the probability that you have the illness?

Do you see where this is going? You’re interested (well, we both are, really, because I care about your health) in the event $E$ “I have the illness.” The information we have, though, is that the event $F$ “the test came back positive” occurred. And what we know about the test is how its results depend on the patient being sick or well. That is, we know $P(F \mid E)$ and $P(F \mid E^c)$, and fortunately we also know $P(E)$ (ergo we also know $P(E^c) = 1 - P(E)$). We can compute the likelihood of your being ill as \[ P(E \mid F) = \frac{(0.005)(0.99)}{(0.005)(0.99) + (0.995)(0.01)} \approx 0.3322. \] Far from it being a certainty that you have this particular illness, your chances are better than 2 in 3 that you don’t! Even if the illness were twice as common and occurred in 1% of the population, your chances of being sick are only 1 in 2 after the test comes back positive. (Notice that this probability—this conditional probability—depends not only on the efficacy of the test, but also on the prevalence of the illness.)
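
For the skeptical, the computation is easy to reproduce; here is a sketch that also lets you vary the prevalence:

```python
# Bayes' formula for the disease-test example.
prevalence = 0.005     # P(E): base rate of the illness
sensitivity = 0.99     # P(F | E): positive test given illness
false_positive = 0.01  # P(F | E^c): positive test given no illness

numerator = prevalence * sensitivity
posterior = numerator / (numerator + (1 - prevalence) * false_positive)
print(posterior)  # approximately 0.3322
```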

And that’s my favorite surprise in probability.

(If you haven’t read it yet, you should go look at the article Steven Strogatz wrote for the New York Times about Bayes’ Theorem, in which he makes it seem—somewhat—less surprising.)

Wednesday, September 11, 2013

a (biased) game of war

Here is the problem that took up half our class time today (time well spent, I believe): Two players, Alexa and Beatrice, take turns drawing cards from a deck. The first one to draw an ace wins. Alexa draws first. What is the probability Alexa wins?

This problem could be solved with a naïve view of probability, but because I assigned it in order for the class to practice using modern definitions, I’d like to spell those out. First, a sample space is the set $S$ of all possible outcomes of a given experiment (here, the game). A subset of the sample space is called an event. (One might think of an “event” as “all the possible ways the experiment can succeed,” for whatever definition of “success” we might like.) A probability measure on $S$ is a function that takes each event $E$ and assigns to it a number $P(E)$ with the following properties:

  • For every event $E$, $0 \le P(E) \le 1$.
  • $P(S) = 1$ (I think of this as the statement “whenever you perform the experiment, something happens.”)
  • If $E_1$, $E_2$, $E_3$, …, is any sequence of mutually exclusive events (meaning that $E_i \cap E_j = \varnothing$ whenever $i \ne j$), then \[ P \left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i). \] (Note that this still holds if the sequence only has finitely many events; for example, if $E$ and $F$ are mutually exclusive, then $P(E \cup F) = P(E) + P(F)$.)
(This definition works fine if $S$ is finite; if $S$ is uncountably infinite, then we have to be more careful about what kinds of events we can allow. But in our situation $S$ is finite, so no worries.) This is the modern, axiomatic view of probability, essentially as formulated by Kolmogorov. (If you haven’t read Slava Gerovitch’s recent, excellent article on Kolmogorov’s life, you should.) The key realization is that the probability of an event is not given a priori; we must always make assumptions about the likelihood of an event.

The first question that arises in analyzing the game between Alexa and Beatrice is therefore, What is the sample space? That is, what are the possible outcomes? The students in my probability class came up with essentially three possibilities:

  1. The set of outcomes of the form “Alexa wins on her $n$th draw” or “Beatrice wins on her $n$th draw”.
  2. The set of all sequences of cards drawn from a deck, with the last card drawn an ace.
  3. The set of all possible orderings of the 52 cards in the deck.
In class, I added the following analysis of these choices:
  1. This is the simplest description of the set of possible outcomes; one is only concerned with who won, and at which point in the game she won.
  2. This is the most “natural” (i.e., experiential) description of the possible outcomes; one pays attention only to which cards appear in the course of the game.
  3. This is the most “equitable” description of possible outcomes; we generally assume that, if the deck has been well-shuffled, then all orderings are equally likely.
(One could add the set of outcomes, “Alexa wins” or “Beatrice wins”, as a potential choice of sample space, but I had already suggested to the students that it is worthwhile to consider when a particular player wins.) Another way to distinguish the third choice of sample space from the first two is philosophical: is the experiment done when the players have finished drawing cards, or before that, when the deck is placed in front of them? After all, once the deck is shuffled and laid down, if one knows the order of the cards, then one knows who will win. One student delivered a description of how to find the probability that Alexa wins, when the sample space is chosen to be “all possible orderings of the deck”: \[ P(\text{Alexa wins}) = \frac{\#(\text{orderings in which the first ace appears in an odd position})} {\#(\text{possible orderings of the whole deck})}. \] The denominator equals, of course, 52!. Counting the numerator then becomes the challenge. It is equivalent to the solution proposed below.

A useful approach, applicable to many situations, is to think about the sequence of events, “Alexa wins on her first turn”, “Alexa wins on her second turn”, etc. These events are clearly mutually exclusive: if Alexa wins after drawing her seventh card, then the game is over, and she won’t win after drawing ten cards. So if we can find the probability of each of these events, then we can add them up to find the total probability that Alexa wins.

Finding the probability that Alexa wins on the first turn is straightforward enough; she just has to draw an ace from the deck. This will happen with probability 4/52 = 1/13.

Finding the probability that Alexa wins on her second turn is a bit trickier, but contains the germ of the complete solution. First, she must draw anything but an ace; the probability of this is 48/52. Then Beatrice must draw anything but an ace; since Alexa has already drawn one card, the probability of this is 47/51. Then Alexa draws an ace; with only 50 cards remaining, her chances are 4/50. Thus the probability she wins after drawing two cards is (48/52)(47/51)(4/50).

Now we can see how the general case works: in order for Alexa to win on her $n$th turn, she and Beatrice must each have drawn $(n-1)$ cards that are not aces, and Alexa must draw an ace after that. The probability of this happening is $\frac{48}{52}\cdot\frac{47}{51}\cdot\frac{46}{50} \cdots \frac{48-2n+3}{52-2n+3}\cdot\frac{4}{52-2n+2}$ (as long as $n > 1$).

How many turns must we consider? Well, suppose all four aces were at the end of the deck. Then Alexa and Beatrice would each draw 24 cards before reaching an ace, and Alexa would win on her 25th turn. The probability of this happening is \[ \frac{48!4!}{52!} = \frac{1}{270\,725} \approx 0.0000037, \] since the 48 non-aces can appear in any order, followed by the four aces. Thus we need to add together the probability that Alexa wins on each of her 25 (potential) turns to get the total probability that she wins.

Before we compute this total, what do we think the result should be? Does Alexa have a greater chance of winning than Beatrice, or the reverse? Or are their chances equal? The students came up with the following arguments:

  • Alexa has more potential turns than Beatrice—25 as opposed to 24, because Beatrice could never win on her 25th—so Alexa is more likely to win. (Given how small the probability is that Alexa wins on her 25th turn, I find this argument not very convincing.)
  • Each time Beatrice draws a card, she has fewer cards to choose from, but the same number of aces, so her likelihood of drawing an ace at each turn is greater than Alexa’s, and she has a better chance of winning. (This, however, disregards the sequence of events necessary for Beatrice to have the opportunity to draw.)
  • As previously mentioned, we just want to know how likely it is that the first ace appears in an odd position. It seems that there should be no preference for the first ace to appear in an odd versus an even position, so the two players have equal chances of winning. (This does make sense to me, intuitively, but there’s something skewed about the “first ace” that makes it questionable. As we’ll see, the computations do not quite bear this intuition out, but they nearly do.)
Apparently, Alexa’s advantage in drawing first has the potential to be balanced out by Beatrice’s greater chance of drawing an ace due to the reduced number of cards in the deck on her turn.

I don’t know how to resolve this philosophical conundrum without actually doing the computation, so here goes. With a bit of reindexing, the probability that Alexa wins can be computed by the sum \[ \sum_{N=0}^{24} \frac{48!}{(48-2N)!}\cdot\frac{(52-2N)!}{52!}\cdot\frac{4}{52-2N} \] (here $N$ is the number of turns that have elapsed before Alexa draws her $(N+1)$st card). Using Wolfram|Alpha, we find that this sum is about 0.5198. Therefore Alexa has a small but clear advantage. In fact, her chances of winning are better than the house’s chances of winning at French roulette if you, the gambler, bet on all reds, which is 19/37, or about 0.5135. We can check our work (i.e., do we have the right formula?) by applying similar reasoning to compute Beatrice’s chances of winning as \[ \sum_{N=0}^{23} \frac{48!}{(48-2N-1)!}\cdot\frac{(52-2N-1)!}{52!}\cdot\frac{4}{52-2N-1}, \] which is about 0.4802, as we would expect.
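
To double-check the arithmetic exactly (and to recover the fraction mentioned below), here is a sketch in Python using rational arithmetic. It uses the equivalent “position of the first ace” formulation from above rather than the running products:

```python
from fractions import Fraction
from math import comb

# P(the first ace is the j-th card dealt): one ace at position j, the other
# three among the 52 - j later positions, out of C(52, 4) equally likely
# position sets for the four aces.
def p_first_ace_at(j):
    return Fraction(comb(52 - j, 3), comb(52, 4))

alexa = sum(p_first_ace_at(j) for j in range(1, 50, 2))     # odd positions
beatrice = sum(p_first_ace_at(j) for j in range(2, 50, 2))  # even positions
print(alexa, float(alexa))        # 433/833 ≈ 0.5198
print(beatrice, float(beatrice))  # 400/833 ≈ 0.4802
```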

I have two lingering questions about this scenario:

  • Is there a way to determine that the first player has an advantage in the game without doing the full computation? That is, can one provide a “rigorous” (but not overly computational) argument that Alexa will win more than half the time?
  • The probability that the first player wins in this game must necessarily be a rational number, because it is a finite sum of rational numbers. As it turns out, this fraction is exactly 433/833, which has a remarkably small denominator (especially considering that the last term in the sum is 1/270725). Is the number 833 intrinsically meaningful in this situation, or is this a coincidence?
Comments are welcome.

Added 9/12: I have made perhaps a small step towards understanding heuristically why the first player should have a small advantage. Suppose instead of the winner being the first one to draw an ace, it is the first one to draw the ace of spades. Then the players have equal chances: a specific card is equally likely to be in either an even or odd position. Now suppose that the winning condition is to draw a red card; then the first player already has 1/2 probability of winning on her first turn, apart from any advantage she has later in the game. It seems plausible that if the winning condition depends on drawing one of $k$ designated cards before the other player does, then the first player’s probability of winning is an increasing function of $k$, reaching probability 1 when $k$ is 52 (i.e., the winning condition is “draw a card”). I may get around to calculating how the answer varies with $k$.
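
In the meantime, here is a sketch of that calculation (exact, via the “position of the first special card” formulation; the function name and setup are mine, nothing standard):

```python
from fractions import Fraction
from math import comb

# P(the first of k designated cards is the j-th card dealt) is
# C(52 - j, k - 1) / C(52, k); the first player wins when j is odd.
def first_player_wins(k, deck=52):
    total = Fraction(0)
    for j in range(1, deck - k + 2, 2):  # odd positions where j can occur
        total += Fraction(comb(deck - j, k - 1), comb(deck, k))
    return total

for k in [1, 2, 4, 13, 26, 52]:
    p = first_player_wins(k)
    print(k, p, float(p))  # increases with k: 1/2, ..., 433/833, ..., 1
```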

Thursday, September 05, 2013

some Smith student summer projects

We just had our first weekly math lunch, during which several of our math majors explained what they had done over the summer. Here are just a few of the projects they described (I couldn’t remember them all!):
  • investigating the sensitivity of face-recognition algorithms to certain properties, like whether the face is turned towards the camera or not;
  • restoring and classifying polylink models, originally created by Alan Holden in the 1970s;
  • turning research from the spring into a professional-level paper for publication;
  • starting a research project on dynamics, game theory, and biology to describe interaction between snails and crabs;
  • participating in an REU on integrating monomials over the Cantor set;
  • helping a faculty member teach statistics to 14–15-year-old girls in an intensive two-week program;
  • much more!
Needless to say, it’s always an honor and a pleasure to be working with these students.

Tuesday, September 03, 2013

probability skills

Tomorrow I start teaching my first probability class. At 8:30 am. So that’s a thing. (I’ve done 8:30 classes before, and they’re fine. In fact, I suspect upper-class students will be more amenable to them than the first-years I’ve taught in the past.)

Probability is a little different for me. It has proofs, sure—it’s a math class, after all—but it feels more like a collection of techniques than the other, more theoretical, upper-level classes I’ve taught. Those techniques are unified by a general philosophy: We can understand the long-term behavior of random processes. So I’ll be incorporating some philosophical texts along with the standard fare.

Because the theory of probability is so steeped in problem-solving, it seems like a good candidate for inquiry-based learning. I chose a classroom with a conference-room design and lots of blackboards (pictures here), so that it feels more like a collaborative environment.

Although it took me a while to realize it, since I’m not as familiar with the material, the nature of the subject also lends itself to standards-based grading. Because of previous confusion I’ve encountered when using the term “standards”, for this class I’m calling them “core skills”. A list is below. (The arrangement of topics has been heavily influenced by our textbook, A First Course in Probability, by Sheldon Ross.)

I feel like this class is a big step forward for me. I’m really going to try to let go of controlling what goes on in class through lecture, and to let (hopefully deep) exploration happen through my choices of topics and questions.

List of core skills in probability

  1. Communication – Contribute to class discussion. Present neat, original written work.
  2. Combinatorial analysis – Apply counting principles, in particular models involving permutations, combinations, binomial and multinomial coefficients.
  3. Axioms of probability – Use definitions of sample spaces, events, and probability, together with Boolean algebra, to prove propositions. Explain meaning of the axioms of probability.
  4. Conditional probability – Find conditional probability of one event given another. Explain meaning of conditional probability. Use formulas involving conditional probabilities.
  5. Bayes’ formula – Use Bayes’ formula to compute conditional probabilities. Find how odds of an event change with introduction of new data. Justify potentially counterintuitive results.
  6. Independence of events – Determine whether two (or more) events are independent. Explain significance of independence.
  7. Random variables – Find the probability mass (or density) function and cumulative distribution function of a random variable. Explain what random variables are and why they are important.
  8. Expectation – Find and interpret the expected value of a random variable.
  9. Variance – Find and interpret the variance of a random variable.
  10. Discrete random variables – Create models with discrete random variables, including:
    • Binomial distribution
    • Geometric distribution
    • Poisson distribution
    • Hypergeometric distribution
  11. Continuous random variables – Create models with continuous random variables, including:
    • Uniform distribution
    • Exponential distribution
    • Normal distribution
    • Gamma distribution
  12. Joint distribution – Find the joint distribution function or joint probability mass (or density) function of two random variables. Find marginal distributions of the two variables.
  13. Independent random variables – Determine whether two variables are independent. Explain meaning of independent random variables. Use independence to compute expectation.
  14. Sums of random variables – Justify and apply linearity of expectation. Find the distribution of a sum of independent random variables by using convolution.
  15. Conditional distributions – Find the conditional mass (or density) function of one random variable given another. Use conditioning to compute expectations or probabilities.
  16. Covariance and correlation – Find the covariance and correlation of a pair of random variables. Explain the meaning of covariance. Compute variance of a sum of random variables.
  17. Inequalities and Laws of Large Numbers – Use the Markov and Chebyshev inequalities to bound tail probabilities. Explain the Weak Law and Strong Law of Large Numbers.
  18. Central Limit Theorem – Explain the conditions under which the Central Limit Theorem holds. Apply the Central Limit Theorem to approximate sums of random variables.