Bayes Theorem
Introduction
Bayes Theorem is a fundamental concept in probability theory, used to calculate conditional probabilities. It also underpins many fields that build on probability, including machine learning.
Bayes theorem is used to calculate the probability of an event
occurring given that a related event has already occurred.
This is called conditional probability: given two events $A$ and $B$, the conditional probability of $A$ given that $B$ has occurred is written $P(A \mid B)$.
Imagine that you have three bags that are filled with different coins. You randomly select a coin out of a random bag, and it's a quarter. What is the probability that the coin came from the first bag? If you can determine the probability of choosing a quarter, then you can calculate the probability the coin you picked came from the first bag.
When used properly, Bayes theorem is effectively a machine learning algorithm: it calculates the probability of an event occurring given that another event has already occurred, so prior features in data can be used to build a model that predicts the posterior probability of an event occurring.
Formal Definition of Bayes Theorem
The theorem is stated as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where:

- $P(A \mid B)$ is the conditional probability of event $A$ occurring given that event $B$ has already occurred. $P(A \mid B)$ is also called the posterior probability.
- $P(B \mid A)$ is the conditional probability of event $B$ occurring given that event $A$ has already occurred. $P(B \mid A)$ is also called the likelihood, as in the likelihood of event $B$ occurring given that event $A$ has already occurred.
- $P(A)$ is the prior probability of event $A$ occurring. The prior probability is the probability of event $A$ occurring without any additional information; essentially, it is based on prior knowledge of the event.
- $P(B)$ is the marginal probability of event $B$ occurring.
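To make the definition concrete, here is a minimal Python sketch of the formula. The function name and the example numbers are illustrative only, not taken from the text above.

```python
def bayes_posterior(prior, likelihood, marginal):
    """Compute P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Illustrative numbers only: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5.
print(bayes_posterior(prior=0.3, likelihood=0.8, marginal=0.5))  # 0.48
```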
Proof of Bayes Theorem
Proof Step 1
Begin by defining the probability of two events, $A$ and $B$, both occurring: $P(A \cap B)$.
To demonstrate the proof of this theorem,
consider the following example.
Suppose that you want to compute the probability of finishing an ice cream cone.
This can be calculated by computing the probability that
the store has ice cream in stock multiplied by the probability that
you can eat the entire ice cream cone given that the ice cream is in stock.
In this example, define $A$ as the event that you finish the ice cream cone and $B$ as the event that the ice cream is in stock.
Therefore, the probability that you can finish the ice cream and that the ice cream is in stock can be computed by using the following formula:

$$P(A \cap B) = P(B)\,P(A \mid B)$$
Proof Step 2
Similarly to what you learned in Step 1,
the probability of events $A$ and $B$ both occurring can also be written starting from event $A$:

$$P(A \cap B) = P(A)\,P(B \mid A)$$
Proof Step 3
Because the left-hand sides of Steps 1 and 2 are exactly the same, you can also set the right-hand sides equal to each other:

$$P(B)\,P(A \mid B) = P(A)\,P(B \mid A)$$
Proof Step 4
Dividing both sides by $P(B)$ isolates $P(A \mid B)$:

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$
Proof Conclusion
Therefore, you have proven Bayes Theorem of conditional probability. This means that given the probability of two events occurring, you can calculate the probability of one event occurring given that another event has already occurred.
Extension of Bayes Theorem
Explanation of Bayes Extension
In many implementations,
you may see Bayes Theorem written a bit differently because
there are more than just two events occurring.
Next,
you will adjust your Bayes equation to
account for multiple events by restating the theorem.
You will be seeking the probability of some event $A_k$ given that event $B$ has occurred, $P(A_k \mid B)$.
However,
often you will not know the denominator in this equation,
which is the probability of $B$, $P(B)$.
Let $A_1, A_2, \ldots, A_n$ be a set of mutually exclusive events whose union is the entire sample space $S$, and let $B$ be any event from the same sample space with $P(B) > 0$. The proof below rewrites $P(B)$ in terms of the events $A_i$.
Proof of Bayes Extension
Step 1 of Bayes Extension Proof
Begin with Bayes Theorem applied to one of the events $A_k$:

$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{P(B)}$$

The remaining steps rewrite the denominator $P(B)$.
Step 2 of Bayes Extension Proof
For this example, write $B$ as its intersection with the sample space $S = A_1 \cup A_2 \cup \cdots \cup A_n$:

$$B = B \cap S = B \cap (A_1 \cup A_2 \cup \cdots \cup A_n)$$
Step 3 of Bayes Extension Proof
Then use the distributive property to distribute the intersection over the union, as shown below:

$$B = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n)$$
Step 4 of Bayes Extension Proof
Take the probability of both sides, as shown by the equation below:

$$P(B) = P\big((B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n)\big)$$
Step 5 of Bayes Extension Proof
Because the events $A_i$ are mutually exclusive, the events $B \cap A_i$ are mutually exclusive as well, so the probability of their union is the sum of their probabilities:

$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_n)$$
Step 6 of Bayes Extension Proof
According to the multiplication rule for conditional probability, each term can be rewritten as:

$$P(B \cap A_i) = P(A_i)\,P(B \mid A_i)$$
Step 7 of Bayes Extension Proof
Thus you can use the summation to get this:

$$P(B) = \sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)$$
Step 8 of Bayes Extension Proof
While this appears differently, you can now derive Bayes Theorem for multiple events to be:

$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)}$$
Example of Bayes in Action
You have two bags full of poker chips. Bag one has seven red and two blue chips while bag two has five red and nine blue chips. You draw a chip at random and it turns out to be red. Compute the probability that the chip was drawn from bag one.
First, define your known probabilities.
- Let $B_1$ represent the event that the chip was chosen from bag one.
- Let $B_2$ represent the event that the chip was chosen from bag two.
- Let $R$ represent the event that the chip is red.
- There are seven red chips out of a total of nine in bag one, thus $P(R \mid B_1) = \frac{7}{9}$.
- There are five red chips out of a total of fourteen in bag two, thus $P(R \mid B_2) = \frac{5}{14}$.
- Each bag is equally likely to be chosen, so $P(B_1) = P(B_2) = \frac{1}{2}$.

The goal of this problem is to find the probability that
the chip came from bag one,
given that it is red.
Thus, you must find $P(B_1 \mid R)$.
Example - Part One
Apply Bayes Theorem to the event you want:

$$P(B_1 \mid R) = \frac{P(R \mid B_1)\,P(B_1)}{P(R)}$$

Example - Part Two
Use the extension of Bayes Theorem to expand the denominator:

$$P(R) = P(R \mid B_1)\,P(B_1) + P(R \mid B_2)\,P(B_2) = \frac{7}{9} \cdot \frac{1}{2} + \frac{5}{14} \cdot \frac{1}{2} = \frac{143}{252}$$

Example - Part Three
Substitute the known values:

$$P(B_1 \mid R) = \frac{\frac{7}{9} \cdot \frac{1}{2}}{\frac{143}{252}} = \frac{7/18}{143/252} = \frac{98}{143}$$

Example - Part Four
Convert the result to a percentage:

$$P(B_1 \mid R) = \frac{98}{143} \approx 0.685$$

You have now determined that after drawing a red chip, the probability that it was drawn from the first bag is about 68.5%. This example illustrates how to apply Bayes theorem to a real-world problem.
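As a sanity check, you can reproduce this arithmetic in a few lines of Python. This is a sketch; the variable names are invented for illustration, and exact fractions are used to avoid rounding error.

```python
from fractions import Fraction

# Known probabilities from the problem statement.
p_bag1 = p_bag2 = Fraction(1, 2)   # P(B1), P(B2): each bag equally likely
p_red_bag1 = Fraction(7, 9)        # P(R | B1): 7 red of 9 chips in bag one
p_red_bag2 = Fraction(5, 14)       # P(R | B2): 5 red of 14 chips in bag two

# Extension of Bayes Theorem for the denominator: P(R).
p_red = p_red_bag1 * p_bag1 + p_red_bag2 * p_bag2

# Bayes Theorem: P(B1 | R).
p_bag1_given_red = p_red_bag1 * p_bag1 / p_red
print(p_bag1_given_red, float(p_bag1_given_red))  # 98/143 ≈ 0.685
```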
Naive Bayes Theorem
Based on Bayes Theorem, the Naive Bayes Theorem (NBT) is a classification algorithm that is primarily used on large datasets. NBT applies Bayes Theorem to classify a feature in a given dataset. NBT can often replace more sophisticated classification algorithms because it is easier to implement while still providing a high accuracy rate.
NBT is considered naive because it assumes that the presence of a particular feature in a class is unrelated to the presence of other features. For example, consider classifying a tomato: it is round, it has a diameter of less than 8 cm, and it is red. All three characteristics may be related, but NBT does not assume so when calculating probabilities. Despite these assumptions, NBT is more accurate than you might suppose. NBT may not be quite as accurate as some more complex classification tools, but its easy implementation and relatively low computing cost are advantages that make it a viable tool for many applications.
Formal Definition of Naive Bayes Theorem
The formula for Naive Bayes Theorem is:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

Where:

- $P(c \mid x)$ is the posterior probability of the class ($c$, target) given the predictor ($x$, attribute).
- $P(c)$ is the prior probability of the class.
- $P(x \mid c)$ is the likelihood: the probability of the predictor given the class.
- $P(x)$ is the prior probability of the predictor.
Example of Naive Bayes Theorem
The following example includes two weeks' worth of data for
weather conditions and a football team's schedule.
The goal is to determine the likelihood that
the football team will play given that it is sunny outside.
In the dataset, the Condition column records the weather for each day and the Played column records whether the team played.
Example - Frequency of Play Dataset
Day | Condition | Played |
---|---|---|
Monday | Sunny | No |
Tuesday | Rainy | Yes |
Wednesday | Sunny | Yes |
Thursday | Cloudy | Yes |
Friday | Sunny | No |
Saturday | Rainy | No |
Sunday | Rainy | Yes |
Monday | Sunny | Yes |
Tuesday | Rainy | No |
Wednesday | Cloudy | No |
Thursday | Sunny | Yes |
Friday | Sunny | Yes |
Saturday | Rainy | Yes |
Sunday | Cloudy | Yes |
Solution to Naive Bayes Theorem Example
The first step to solving this problem is to convert this dataset into a frequency table. When implementing NBT in practice, this is usually the most computationally expensive step, because real-world datasets are much larger than this simple example. See the associated Frequency Table below:
Weather | Yes | No | Total Days |
---|---|---|---|
Sunny | 4 | 2 | 6 |
Rainy | 3 | 2 | 5 |
Cloudy | 2 | 1 | 3 |
Total | 9 | 5 | 14 |
Given the data in the frequency table above,
you can now compute the probability that
the football team will play if the weather is sunny,
represented by $P(\text{Yes} \mid \text{Sunny})$. By Bayes Theorem:

$$P(\text{Yes} \mid \text{Sunny}) = \frac{P(\text{Sunny} \mid \text{Yes})\,P(\text{Yes})}{P(\text{Sunny})}$$
Now it just becomes a matter of populating the likelihood probability, the prior probability of the class, and the prior probability of the predictor.
The term $P(\text{Sunny} \mid \text{Yes})$ is the likelihood: of the 9 days the team played, 4 were sunny, so $P(\text{Sunny} \mid \text{Yes}) = \frac{4}{9}$.
The probability $P(\text{Yes})$ is the prior probability of the class: the team played on 9 of the 14 days, so $P(\text{Yes}) = \frac{9}{14}$.
Finally, the probability $P(\text{Sunny})$ is the prior probability of the predictor: 6 of the 14 days were sunny, so $P(\text{Sunny}) = \frac{6}{14}$.
Now we plug the numbers into the equation and solve the problem:

$$P(\text{Yes} \mid \text{Sunny}) = \frac{(4/9)\,(9/14)}{6/14} = \frac{4}{6} \approx 0.67$$
As illustrated in this example, there's about a 67% chance that the football team will play given the weather is sunny. This may seem like a simple solution to a relatively easy question. However, in real-world practice, these initial datasets will be much more complex, often with terabytes of data. By performing classification and considering conditional probability, you can gain insights that are often difficult to achieve with large datasets.
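The same counting can be scripted. Below is a small sketch using pandas; the crosstab call builds the frequency table shown above, and the column names follow the example dataset.

```python
import pandas as pd

days = pd.DataFrame({
    "Condition": ["Sunny", "Rainy", "Sunny", "Cloudy", "Sunny", "Rainy", "Rainy",
                  "Sunny", "Rainy", "Cloudy", "Sunny", "Sunny", "Rainy", "Cloudy"],
    "Played":    ["No", "Yes", "Yes", "Yes", "No", "No", "Yes",
                  "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes"],
})

freq = pd.crosstab(days["Condition"], days["Played"])  # the frequency table

p_sunny_given_yes = freq.loc["Sunny", "Yes"] / freq["Yes"].sum()  # 4/9
p_yes = freq["Yes"].sum() / len(days)                             # 9/14
p_sunny = freq.loc["Sunny"].sum() / len(days)                     # 6/14

print(p_sunny_given_yes * p_yes / p_sunny)  # ≈ 0.667
```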
Naive Bayes Theorem and Multiple Predictors
The example above is a simple example of NBT.
You often want more predictors or features to
make more sophisticated predictions.
One example of this is a spam filter.
A message can be labeled either legitimate or spam, and that label is predicted from the message's features, the words it contains. The driving equation is:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{features} \mid \text{label})\,P(\text{label})}{P(\text{features})}$$

Where:

- $P(\text{label} \mid \text{features})$ is the probability of a label given the features: in this case, whether a message is spam or legitimate, and whether that label is likely given the words in the message.
- $P(\text{features} \mid \text{label})$ is the probability of the features given the label: in this case, the probability of the words in the message given the label. This is where we'd count occurrences of words in messages of that label.
- $P(\text{label})$ is the probability of the label. This is where we'd count the number of spam and legitimate messages.
- $P(\text{features})$ is the probability of the features. This is where we'd consider the probability of these words occurring in any message. This is more nuanced: it must account for words across all messages, regardless of label, and in practice it is often dropped because it is the same for every label being compared.
Multiple Predictors Example - Spam Filter
So let's examine the driving equation for multiple predictors. To classify a message, compare the posterior for each label:

$$P(\text{legitimate} \mid \text{features}) = \frac{P(\text{features} \mid \text{legitimate})\,P(\text{legitimate})}{P(\text{features})}$$

$$P(\text{spam} \mid \text{features}) = \frac{P(\text{features} \mid \text{spam})\,P(\text{spam})}{P(\text{features})}$$

Where:

- $P(\text{legitimate} \mid \text{features})$ is the probability of a message being legitimate given the features.
- $P(\text{spam} \mid \text{features})$ is the probability of a message being spam given the features.
- $P(\text{features} \mid \text{legitimate})$ is the probability of the features given the message is legitimate.
- $P(\text{features} \mid \text{spam})$ is the probability of the features given the message is spam.
- $P(\text{legitimate})$ is the probability of a message being legitimate.
- $P(\text{spam})$ is the probability of a message being spam.
Below are some examples that show real-world applications of NBT.
Suppose that you are working with a total of 50 emails.
Suppose further that 30 of those emails are classified as normal, or legitimate, and the remaining 20 are classified as spam.
In this scenario, the prior probability of a message being normal is:

$$P(\text{normal}) = \frac{30}{50} = 0.6$$
Based on this information, it can be assumed that the prior probability of a message being spam is:

$$P(\text{spam}) = \frac{20}{50} = 0.4$$
The 30 normal messages have been classified as normal due to the presence of some words, such as "dear" and "money", that are common in normal messages. From the data, you can observe that the word "dear" appears 30 times and the word "money" appears 5 times. Given this data, you can compute the probability that an email contains a common "normal" word, such as "dear" or "money", given that the email is normal.
On the other hand, the 20 spam messages have been classified as spam due to the presence of words such as "password" and "money". From the data, you can see that the word "password" appears 10 times and the word "money" appears 20 times. Given this data, you can compute the probability that a message contains a common "spam" word, such as "password" or "money", given that the message is spam.
Now suppose that you receive an additional email with the word "money" in it, and you want to calculate the probability that the new email is normal.
By applying the NBT formula, you get:

$$P(\text{normal} \mid \text{money}) = \frac{P(\text{money} \mid \text{normal})\,P(\text{normal})}{P(\text{money})}$$
Therefore, given your initial data, you can conclude that there is a 30% chance that a message containing the word "money" is normal.
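One way to script this comparison is sketched below. The likelihoods assume each word probability is that word's occurrences divided by the total occurrences of tracked words for its label; that is one common counting scheme and may not be the one behind the figure above, so treat the output as illustrative.

```python
def posterior(p_word_given_label, p_label, p_word_given_other, p_other):
    """P(label | word) with a two-label total-probability denominator."""
    evidence = p_word_given_label * p_label + p_word_given_other * p_other
    return p_word_given_label * p_label / evidence

# Priors from the scenario: 30 normal and 20 spam messages out of 50.
p_normal, p_spam = 30 / 50, 20 / 50
# Assumed counting scheme: occurrences / total tracked-word occurrences.
p_money_given_normal = 5 / 35    # "money" 5 times of 35 normal-word occurrences
p_money_given_spam = 20 / 30     # "money" 20 times of 30 spam-word occurrences

print(posterior(p_money_given_normal, p_normal, p_money_given_spam, p_spam))
```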
Multiple Predictors
So far we've discussed the theory of multiple predictors, but the example above used the same form with only one word. Let's see what happens when we have multiple words, or predictors.
With possibly millions of words being accounted for, the equation becomes:

$$P(\text{label} \mid w_1, w_2, \ldots, w_n) = \frac{P(w_1, w_2, \ldots, w_n \mid \text{label})\,P(\text{label})}{P(w_1, w_2, \ldots, w_n)}$$
Note: Naive Bayes assumes that all the features are independent of each other. This means we can just multiply the feature probabilities together.
Because we assume all the features are independent, we can rewrite the equation as:

$$P(\text{label} \mid w_1, \ldots, w_n) \propto P(\text{label}) \prod_{i=1}^{n} P(w_i \mid \text{label})$$
Because we're multiplying many normalized probabilities (i.e., probabilities between 0 and 1), there can be an underflow problem: the product can become so small that the computer can't represent it, or so small that it's no longer useful. We fix this by taking the log of the probabilities and summing instead:

$$\log P(\text{label}) + \sum_{i=1}^{n} \log P(w_i \mid \text{label})$$
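A minimal log-space scoring sketch is shown below; the function name, the word probabilities, and the priors are all invented for illustration.

```python
import math

def log_score(log_prior, word_probs, words):
    """Log prior plus summed log likelihoods; avoids underflow from products."""
    return log_prior + sum(math.log(word_probs[w]) for w in words)

# Hypothetical per-label word probabilities and priors.
spam_probs = {"money": 0.6, "password": 0.3, "dear": 0.1}
normal_probs = {"money": 0.1, "password": 0.1, "dear": 0.8}

message = ["dear", "money"]
spam_score = log_score(math.log(0.4), spam_probs, message)
normal_score = log_score(math.log(0.6), normal_probs, message)
print("spam" if spam_score > normal_score else "normal")  # normal
```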
Confusion Matrix
Note: For more information, read Confusion Matrix.
What is a Confusion Matrix?
A confusion matrix is a tabular summary of the performance of a classification algorithm. Looking only at the accuracy score of an algorithm can be misleading if there are an unequal number of observations, or more than two classes, in the dataset.
For this reason, using a confusion matrix can give a better idea of how a classification model is performing.
A confusion matrix is visually represented as a table with two rows and two columns that contains all possible combinations of predicted and actual values. Below is a diagram of a confusion matrix.

 | Predicted Positive | Predicted Negative |
---|---|---|
Actual Positive | TP | FN |
Actual Negative | FP | TN |
In the confusion matrix above, TP stands for true positive, TN stands for true negative, FP stands for false positive, and FN stands for false negative.
Using the spam message example from earlier:

- TP is the number of messages that were correctly classified as spam.
- TN, or true negative, is the number of messages correctly classified as normal.
- FP, or false positive, is the number of messages incorrectly classified as spam; as in, it was a legitimate message but was identified as spam.
- FN, or false negative, is the number of messages incorrectly classified as normal; as in, it was a spam message but was identified as legitimate.
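In Python, scikit-learn's confusion_matrix can produce these counts. The labels below are invented; note that the labels= argument fixes the row and column order so the layout matches the diagram above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for six messages.
actual    = ["spam", "normal", "spam", "normal", "spam", "normal"]
predicted = ["spam", "normal", "normal", "normal", "spam", "spam"]

# Treat "spam" as the positive class by listing it first.
cm = confusion_matrix(actual, predicted, labels=["spam", "normal"])
tp, fn = cm[0]  # spam correctly flagged; spam missed
fp, tn = cm[1]  # normal wrongly flagged; normal correctly passed
print(tp, fn, fp, tn)  # 2 1 1 2
```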
Gaussian Naive Bayes
What is a Gaussian Distribution?
The Gaussian distribution is covered in detail in the Gaussian Distribution document; read it for more background. The Gaussian distribution, also known as the normal distribution, gives a way to calculate the probability density of a continuous value:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

In order to apply a normal distribution to Bayes Theorem, you must calculate the Gaussian distribution for each feature in each class. This can be done easily once each class feature's mean $\mu$ and variance $\sigma^2$ are calculated. The height, weight, and foot size dataset below will be used to demonstrate mean and variance calculations.
Gaussian Naive Bayes Example Data
Sex | Height | Weight | Foot Size |
---|---|---|---|
Male | 6 | 180 | 12 |
Male | 5.92 | 190 | 11 |
Male | 5.58 | 170 | 12 |
Male | 5.91 | 165 | 10 |
Female | 5.0 | 100 | 6 |
Female | 5.5 | 150 | 8 |
Female | 5.42 | 130 | 7 |
Female | 5.75 | 150 | 9 |
Gaussian Naive Bayes Example Data Mean and Variance
Gaussian Naive Bayes Example Data Mean
Given a dataset with $n$ values $x_1, x_2, \ldots, x_n$, the mean of a feature is:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The table below shows the means for each feature:
Sex | Height | Weight | Foot Size |
---|---|---|---|
Male | 5.85 | 176.25 | 11.25 |
Female | 5.42 | 132.50 | 7.5 |
Gaussian Naive Bayes Example Data Variance
Given a dataset with $n$ values $x_1, x_2, \ldots, x_n$ and mean $\mu$, the sample variance of a feature is:

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2$$

Note: The denominator is $n-1$ instead of $n$ because the mean is calculated from the dataset and the variance is calculated from the mean. This is a statistical adjustment called Bessel's correction.
To demonstrate, we'll calculate the variance of weight in males:

$$\sigma^2 = \frac{(180-176.25)^2 + (190-176.25)^2 + (170-176.25)^2 + (165-176.25)^2}{4-1} = \frac{368.75}{3} \approx 122.92$$
The table below shows the variances for each feature:
Sex | Height | Weight | Foot Size |
---|---|---|---|
Male | 0.035 | 122.92 | 0.917 |
Female | 0.097 | 558.33 | 1.667 |
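Both tables can be reproduced with pandas, as sketched below. Note that pandas' var uses ddof=1 by default, which is exactly the Bessel's correction described above.

```python
import pandas as pd

people = pd.DataFrame({
    "Sex": ["Male"] * 4 + ["Female"] * 4,
    "Height": [6, 5.92, 5.58, 5.91, 5.0, 5.5, 5.42, 5.75],
    "Weight": [180, 190, 170, 165, 100, 150, 130, 150],
    "FootSize": [12, 11, 12, 10, 6, 8, 7, 9],
})

# Per-class means and Bessel-corrected (ddof=1) variances.
print(people.groupby("Sex").mean())
print(people.groupby("Sex").var())
```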
Creating the Gaussian Normal Distribution
Using the previous section's example and calculations for mean and variance,
we can now create the Gaussian normal distribution for each feature.
This can be done in many ways.
For example, in Python we can use scikit-learn's GaussianNB class.
To better illustrate the underlying mathematics,
for now you will use a web-based interactive normal distribution tool (linked in the references below).
Plugging in the normal distribution values for each characteristic's mean and variance, you will see the expected curves. Next you can plug in some values for each characteristic and see how much of the curve is covered by the values.
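If you would rather let a library do the work, scikit-learn's GaussianNB class fits these per-class Gaussians for you. This is a sketch; scikit-learn estimates variances slightly differently from the hand calculations (no Bessel's correction, plus a small smoothing term), but the predicted class agrees here.

```python
from sklearn.naive_bayes import GaussianNB

# The height, weight, and foot size dataset from above.
X = [[6, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.91, 165, 10],
     [5.0, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]]
y = ["Male"] * 4 + ["Female"] * 4

model = GaussianNB().fit(X, y)
print(model.predict([[5.8, 185, 11]]))  # ['Male']
```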
Case Study of Gaussian Naive Bayes
Let's take a height of 5.8, a weight of 185, and a foot size of 11, and calculate the Gaussian Naive Bayes odds of that person being male.
When you do the math on those normal distributions, by whatever means, the probability density of each feature for a male is approximately:

$$p(\text{height} = 5.8 \mid \text{male}) \approx 2.06$$
$$p(\text{weight} = 185 \mid \text{male}) \approx 0.0264$$
$$p(\text{foot size} = 11 \mid \text{male}) \approx 0.403$$

Note: A probability density can exceed 1; it is not itself a probability.
To find the probability that this particular person is male, use the following equation:

$$P(\text{male})\,p(\text{height} \mid \text{male})\,p(\text{weight} \mid \text{male})\,p(\text{foot size} \mid \text{male}) = 0.5 \times 2.06 \times 0.0264 \times 0.403 \approx 0.011$$
A key step in the above equation is to multiply by 0.50 because there's an initial 50% probability that a person is male.
Then, do the same for all data representing female height, weight, and foot size; the female score comes out to roughly $3.4 \times 10^{-6}$.
Because the male score (about 0.011) is far greater than the female score, you can classify this person as male.
Gaussian Naive Bayes theorem is currently used in countless applications.
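To verify the case study by hand, you can code the Gaussian density directly, as sketched below using the means and variances from the tables above.

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal probability density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# (observed value, class mean, class variance) for each feature.
male = [(5.8, 5.85, 0.035), (185, 176.25, 122.92), (11, 11.25, 0.917)]
female = [(5.8, 5.42, 0.097), (185, 132.50, 558.33), (11, 7.5, 1.667)]

score_male = 0.5 * math.prod(gaussian_pdf(x, m, v) for x, m, v in male)
score_female = 0.5 * math.prod(gaussian_pdf(x, m, v) for x, m, v in female)
print(score_male, score_female)  # ≈ 1.1e-2 vs ≈ 3.4e-6, so classify as male
```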
Further Reading on Naive Bayes Theorem
- Learning how to manage probability distributions in machine learning
Smoothing
Laplace Smoothing
Laplace smoothing addresses a weakness of Naive Bayes: if a word never appears with a given label in the training data, its likelihood is zero, and that single zero wipes out the entire product of probabilities. Laplace (add-one, or more generally add-$\alpha$) smoothing adds a small count $\alpha$ to every word count:

$$P(w_i \mid \text{label}) = \frac{\text{count}(w_i, \text{label}) + \alpha}{\text{count}(\text{label}) + \alpha\,|V|}$$

where $|V|$ is the vocabulary size. With $\alpha = 1$, unseen words receive a small nonzero probability instead of zero.
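A minimal sketch of add-alpha smoothing (the function name is invented; alpha=1 gives the classic add-one smoothing):

```python
def smoothed_likelihood(word_count, label_total, vocab_size, alpha=1.0):
    """P(word | label) with add-alpha (Laplace) smoothing; never zero."""
    return (word_count + alpha) / (label_total + alpha * vocab_size)

# A word never seen with this label still gets a small nonzero probability.
print(smoothed_likelihood(word_count=0, label_total=30, vocab_size=3))  # ≈ 0.03
```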
Further Reading on Laplace Smoothing
Implementation
Python Implementation
One way to quickly and easily implement Naive Bayes in Python is to use the scikit-learn library. It's great for analysis and prototyping, and it might even be sophisticated and generalized enough for some basic production applications.
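For example, here is a minimal spam-filter sketch; the toy messages are invented, while CountVectorizer, MultinomialNB, and make_pipeline are real scikit-learn components.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration.
messages = ["dear friend lunch money", "dear colleague meeting",
            "password money now", "send password money"]
labels = ["normal", "normal", "spam", "spam"]

# CountVectorizer turns text into word counts; MultinomialNB applies
# Naive Bayes with Laplace smoothing (alpha=1.0 by default).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["dear friend money"]))  # ['normal']
```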
References
Web Links
- Wikipedia. 'Bayes Theorem'
- Brilliant.org. 'Bayes Theorem and Conditional Probability'
- MachineLearningMastery.com. 'A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning'
- Wikipedia. 'Bessel's Correction'. Accessed 2023-06-05
- Normal Distribution Interactive Tool
- Bassant Gamal. 'Naive Bayes Algorithm'. Medium, 2023-12-17
- Anupama Kandala. 'From Zero to Hero: Laplace Smoothing for Naive Bayes Classifiers'. Medium, 2023