Understanding Conditional Probability and Bayes Theorem | by Sachin Date | Jul, 2024 – Towards Data Science

Photo by Stephen Cobb on Unsplash A primer on two concepts that form the substrate of regression analysis

Few incidents in history exemplify how thoroughly conditional probability is woven into human thought, as remarkably as the events of September 26, 1983.

Just after midnight on September 26, a Soviet Early Warning Satellite flagged a possible ballistic missile launch from the United States directed toward to the Soviet Union. Lt. Col. Stanislav Petrov, the duty officer on shift at a secret EWS control center outside Moscow, received the warning on his screens. He had only minutes to decide whether to flag the signal as legitimate and alert his superior.

The EWS was specifically designed to detect ballistic missile launches. If Petrov had told his boss that the signal was real, the Soviet leadership would have been well within the bounds of reason to interpret it as the start of a nuclear strike on the USSR. To complicate matters, the Cold War had reached a terrifying crescendo in 1983, boosting the probability in the minds of the Soviets that the signal from the EWS was, in fact, the real thing.

Accounts differ on exactly when Petrov informed his superior about the alarm and what information was exchanged between the two men, but two things are certain: Petrov chose to disbelieve what the EWS alarm was implying namely, that the United States had launched a nuclear missile against the Soviet Union and Petrovs superiors deferred to his judgement.

With thousands of nuclear-tipped missiles aimed at each other, neither superpower would have risked retaliatory annihilation by launching only one or a few missiles at the other. The bizarre calculus of nuclear war meant that if either side had to start it, they must do so by launching a tidy portion of their arsenal all at once. No major nuclear power would be so stupid as to start a nuclear war with just a few missiles. Petrov was aware of this doctrine.

Given that the EWS detected a solitary launch, in Petrovs mind the probability of its being real was vanishingly small despite the acutely heightened tensions of his era.

So Petrov waited. Crucial minutes passed. Soviet ground radars failed to detect any incoming missiles making it almost certain that it was a false alarm. Soon, the EWS fired four more alarms. Adopting the same line of logic, Petrov chose to flag all of them as not genuine. In reality, all alarms turned out to be false.

If Stanislav Petrov had believed the alarms were real, you might not be reading this article today, as I would not have written it.

The 1983 nuclear close call is an unquestionably extreme example of how human beings compute probabilities in the face of uncertainty, even without realizing it. Faced with additional evidence, we update our interpretation our beliefs about what weve observed, and at some point, we act or choose not to act based on those beliefs. This system of conditioning our beliefs on evidence plays out in our brain and in our gut every single day, in every walk of life from a surgeons decision to risk operating on a terminally ill cancer patient, to your decision to risk stepping out without an umbrella on a rainy day.

The complex probabilistic machinery that our biological tissue so expertly runs is based upon a surprisingly compact lattice of mathematical concepts. A key piece of this lattice is conditional probability.

In the rest of this article, Ill cover conditional probability in detail. Specifically, Ill cover the following:

Lets begin with the definition of conditional probability.

Conditional probability is the probability of event A occurring given that events B, C, D, etc., have already occurred. It is denoted as P(A | {B, C, D}) or simply P(A | B, C, D).

The notation P(A|B, C, D) is often pronounced as probability of A given B, C, D. Some authors also represent P(A|B, C, D) as P(A|B; C; D).

We assume that events B, C, D jointly influence the probability of A. In other words, event A does not occur independently of events B, C, and D. If event A is independent of events B, C, D, then P(A|B, C, D) equals the unconditional probability of A, namely P(A).

A subtle point to stress here is that when A is conditioned upon multiple other events, the probability of A is influenced by the joint probability of those events. For example, if event A is conditioned upon events B, C, and D, its the probability of the event (B C D) that A is conditions on.

Thus, P(A | B,C,D) is the same as saying P(A |B C D).

Well delve into the exact relation of joint probability with conditional probability in the section on Bayes theorem. Meanwhile, the thing to remember is that the joint probability P(A B C D) is different from the conditional probability P(A | B C D).

Lets see how to calculate conditional probability.

Every summer, millions of New Yorkers flock to the sun-splashed waters of the citys 40 or so beaches. With the visitors come the germs, which happily mix and multiply in the warm seawater of the summer months. There are at least a dozen different species and subspecies of bacteria that can contaminate seawater and, if ingested, can cause a considerable amount of, to put it delicately, involuntary expectoration on the part of the beachgoer.

Given this risk to public health, from April through October of each year, the New York City Department of Health and Mental Hygiene (DOHMH) closely monitors the concentration of enterococci bacteria a key indicator of seawater contamination in water samples taken from NYCs many beaches. DOHMH publishes the data it gathers on the NYC OpenData portal.

The following are the contents of the data file pulled down from the portal on 1st July 2024.

The data set contains 27425 samples collected from 40 beaches in the NYC area over a period of nearly 20 years from 2004 to 2024.

Each row in the data set contains the following pieces of information:

Before we learn how to calculate conditional probabilities, lets see how to calculate the unconditional (prior) probability of an event. Well do that by asking the following simple question:

What is the probability that the enterococci concentration in a randomly selected sample from this dataset exceeds 100 MPN per 100 ml of seawater?

Lets define the problem using the language of statistics.

Well define a random variable X to represent the enterococci concentration in a randomly selected sample from the dataset.

Next, well define an event A such that A occurs if, in a randomly selected sample, X exceeds 100 MPN/100 ml.

We wish to find the unconditional probability of event A, namely P(A).

Well calculate P(A) as follows:

From the data set of 27425 samples, if you count the samples in which the enterococci concentration exceeds 100, youll find this count to be 2414. Thus, P(A) is simply this count divided by the total number of samples, namely 27425.

Now suppose a crucial piece of information flows in to you: The sample was collected on a Monday.

In light of this additional information, can you revise your estimate of the probability that the enterococci concentration in the sample exceeds 100?

In other words, what is the probability of the enterococci concentration in a randomly selected sample exceeding 100, given that the sample was collected on a Monday?

To answer this question, well define a random variable Y to represent the day of the week on which the random sample was examined. The range of Y is [Monday, Tuesday,,Sunday].

Let B be the event that Y is a Monday.

Recall that A represents the event that X > 100 MPN/100 ml.

Now, we seek the conditional probability P(A | B).

In the dataset, 10670 samples happen to fall on a Monday. Out of these 10670 samples, 700 have an enterococci count exceeding 100. To calculate P(A | B), we divide 700 by 10670. Here, the numerator represents the event A B (A and B), while the denominator represents the event B.

We see that while the unconditional probability of the enterococci concentration in a sample is 0.08802 (8.8%), this probability drops to 6.56% when new evidence is gathered, namely that the samples were all collected on Mondays.

Conditional probability has the nice interpretative quality that the probability of an event can be revised as new pieces of evidence are gathered. This aligns well with our experience of dealing with uncertainty.

Heres a way to visualize unconditional and conditional probability. Each blue dot in the chart below represents a unique water sample. The chart shows the distribution of enterococci concentrations by the day of the week on which the samples were collected.

The green box contains the entire data set. To calculate P(A), we take the ratio of the number of samples in the green box in which the concentration exceeds 100 MPN/100 ml to the total number of samples in the green box.

The orange box contains only those samples that were collected on a Monday.

To calculate P(A | B), we take the ratio of the number of samples in the orange box in which the concentration exceeds 100 MPN/100 ml to the total number of samples in the orange box.

Now, lets make things a bit more interesting. Well introduce a third random variable Z. Let Z represent the month in which a random sample is collected. A distribution of enterococci concentrations by month, looks like this:

Suppose you wish to calculate the probability that the enterococci concentration in a randomly selected sample exceeds 100 MPN/100 ml, conditioned upon two events:

As before, let A be the event that the enterococci concentration in the sample exceeds 100.

Let B be the event that the sample was collected on a Monday.

Let C be the event that the sample was collected in July (Month 7).

You are now seeking the conditional probability: P (A | (B C)), or simply P (A | B, C).

Lets use the following 3-D plot to aid our understanding of this situation.

The above plot shows the distribution of enterococci concentration plotted against the day of the week and month of the year. As before, each blue dot represents a unique water sample.

The light-yellow plane slices through the subset of samples collected on Mondays i.e. on day of week=0. There happen to be 10670 samples lying along this plane.

The light-red plane slices through the subset of the samples collected in the month of July i.e., month = 7. There are 6122 samples lying along this plane.

The red dotted line marks the intersection of the two planes. There are 2503 samples (marked by the yellow oval) lying along this line of intersection. These 2503 samples were collected on July Mondays.

Among this subset of 2503 samples, are 125 samples in which the enterococci concentration exceeds 100 MPN/100 ml. The ratio of 125 to 2503 is the conditional probability P(A | B, C). The numerator represents the event A B C, while the denominator represents the event B C.

We can easily extend the concept of conditional probability to additional events D, E, F, and so on, although visualizing the additional dimensions lies some distance beyond what is humanly possible.

Now heres a salient point: As new events occur, the conditional probability doesnt always systematically decrease (or systematically increase). Instead, as additional evidence is factored in, conditional probability can (and often does) jump up and down in no apparent pattern, also depending on the order in the events are factored into the calculation.

Lets approach the job of calculating P(A | B) from a slightly different angle, specifically, from a set-theoretic angle.

Lets denote the entire dataset of 27425 samples as the set S.

Recall that A is the event that the enterococci concentration in a randomly selected sample from S exceeds 100.

From S, if you pull out all samples in which the enterococci concentration exceeds 100, youll get a set of size 2414. Lets denote this set as S_A. As an aside, note that event A occurs for every single sample in S_A.

Recall that B is the event that the sample falls on a Monday. From S, if you pull out all samples collected on a Monday, youll get a set of size 10670. Lets denote this set as S_B.

The intersection of sets S_A and S_B, denoted as S_A S_B, is a set of 700 samples in which the enterococci concentration exceeds 100 and the sample was collected on a Monday. The following Venn diagram illustrates this situation.

The ratio of the size of S_A S_B to the size of S is the probability that a randomly selected sample has an enterococci concentration exceeding 100 and was collected on a Monday. This ratio is also known as the joint probability of A and B, denoted P(A B). Do not mistake the joint probability of A and B for the conditional probability of A given B.

Using set notation, we can calculate the joint probability P(A B) as follows:

Now consider a different probability: the probability that a sample selected at random from S falls on a Monday. This is the probability of event B. From the overall dataset of 27425 samples, there are 10670 samples that fall on a Monday. We can express P(B) in set notation, as follows:

What I am leading up to with these probabilities is a technique to express P(A | B) using P(B) and the joint probability P(A B). This technique was first demonstrated by an 18th century English Presbyterian minister named Thomas Bayes (17011761) and soon thereafter, in a very different sort of way, by the brilliant French mathematician Pierre-Simon Laplace (17491827).

In their endeavors on probability, Bayes (and Laplace) addressed a problem that had vexed mathematicians for several centuries the problem of inverse probability. Simply put, they sought the solution to the following problem:

Knowing P(A | B), can you calculate P(B | A) as a function of P(A | B)?

While developing a technique for calculating inverse probability, Bayes indirectly proved a theorem that became known as Bayes Theorem or Bayes rule.

Bayes theorem not only allows us to calculate inverse probability, it also enables us to link three fundamental probabilities into a single expression:

When expressed in modern notation it looks like this:

Plugging in the values of P(A B) and P(B), we can calculate P(A | B) as follows:

The value 0.06560 for P(A | B) is of course the same as what we arrived at by another method earlier in the article.

Bayes Theorem itself is stated as follows:

In the above equation, the conditional probability P(B | A) is expressed in terms of:

Its in this form that Bayes theorem achieves phenomenal levels of applicability.

In many situations, P(B | A) cannot be easily estimated but its inverse, P(A | B), can be. The unconditional priors P(A) and P(B) can also be estimated via one of two commonly used techniques:

The point is, Bayes theorem gives you the conditional probability you seek but cannot easily estimate directly, in terms of its inverse probability which you can easily estimate and a couple of priors.

This seemingly simple procedure for calculating conditional probability has turned Bayes theorem into a priceless piece of computational machinery.

Bayes theorem is used in everything from estimating student performance on standardized test scores to hunting for exoplanets, from diagnosing disease to detecting cyberattacks, from assessing risk of bank failures to predicting outcomes of sporting events. In Law Enforcement, Medicine, Finance, Engineering, Computing, Psychology, Environment, Astronomy, Sports, Entertainment, Education there is scarcely any field left in which Bayes method for calculating conditional probabilities hasnt been used.

Lets return to the joint probability of A and B.

We saw how to calculate P(A B) using sets as follows:

Its important to note that whether or not A and B are independent of each, P(A B) is always the ratio of |S_A S_B| to |S|.

The numerator in the above ratio is calculated in one of the following two ways depending on whether A is independent of B:

Thus, when A and B are independent events, |S_A S_B| is calculated as follows:

The principle of conditional probability can be extended to any number of events. In the general case, the probability of an event E_s conditioned upon the occurrence of m other events E_1 through E_m can be written as follows:

We make the following observations about equation (1):

Now heres something interesting: In equation (1), if you rename the event E_s as y, and rename events E_1 through E_m as x_1 through X_m respectively, equation (1) suddenly acquires a whole new interpretation.

And thats the topic of the next section.

There is a triad of concepts upon which every single regression model rests:

Even within this illustrious trinity, conditional probability commands a preeminent position for two reasons:

Ill illustrate this using two very commonly used, albeit very dissimilar, regression models: the Poisson model, and the linear model.

Consider the task of estimating the daily counts of bicyclists on New York Citys Brooklyn Bridge. This data actually exists: for 7 months during 2017, the NYC Department of Transportation counted the number of bicyclists riding on all East River bridges. The data for the Brooklyn bridge looked like this:

Data such as these, which contain strictly whole-numbered values, can often be effectively modeled using a Poisson process and the Poisson probability distribution. Thus, to estimate the daily count of bicyclists, you would:

Consequently, the probability of observing a particular value of y, say y_i, will be given by the following Probability Mass Function of the Poisson probability distribution:

See more here:

Understanding Conditional Probability and Bayes Theorem | by Sachin Date | Jul, 2024 - Towards Data Science

Related Posts

Comments are closed.