Naive Bayes Classifier
Dr. Ashish Tendulkar
Machine Learning Techniques
IIT Madras
Introduction
- Simple yet very powerful classifier that is used extensively in applications like document classification and spam filtering.
- Generative counterpart of logistic regression.
- Uses Bayes' theorem to calculate the probability of a sample belonging to a class.
- Makes a strong (naive) conditional independence assumption between the features given a label.
Part 1: Training Setup
Binary classification
- Feature matrix: shape \((n, m)\)
- Label vector: shape \((n,)\)
- For the \(i\)-th example: \(y^{(i)} \in \{0, 1\}\)

Multiclass classification
- Feature matrix: shape \((n, m)\)
- Label matrix: shape \((n, k)\)
- For the \(i\)-th example: \(\mathbf{y}^{(i)} \in \{0, 1\}^k\)
Spot the difference!
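To spot the difference concretely, here is a minimal NumPy sketch (the sizes and label values are made up for illustration): the binary setup uses a label vector of shape \((n,)\), while the multiclass setup uses a one-hot label matrix of shape \((n, k)\).

```python
import numpy as np

n, m, k = 6, 4, 3                        # examples, features, classes (illustrative sizes)

X = np.random.rand(n, m)                 # feature matrix, shape (n, m)

# Binary classification: label vector of shape (n,)
y_binary = np.array([0, 1, 1, 0, 1, 0])
print(X.shape, y_binary.shape)           # (6, 4) (6,)

# Multiclass classification: one-hot label matrix of shape (n, k)
labels = np.array([0, 2, 1, 2, 0, 1])
Y_multiclass = np.eye(k, dtype=int)[labels]
print(Y_multiclass.shape)                # (6, 3)
```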
Part 2: Model
Naive Bayes' assumption
Naive Bayes classifier makes a strong conditional independence assumption:
Features are conditionally independent given the label.
It enables us to express the joint probability of the features given the label as a product of the probabilities of the individual features given the label:
\[ p(x_1, x_2, \ldots, x_m|y) = \prod_{j=1}^{m} p(x_j|y) \]
The NB classifier predicts the probability \(p(y|\mathbf{x})\) of class label \(y\), given a feature vector \(\mathbf{x}\), using Bayes' theorem:
\[ \underbrace{p(y|\mathbf{x})}_{\text{posterior probability}} = \frac{\overbrace{p(\mathbf{x}|y)}^{\text{class conditional density}}\ \overbrace{p(y)}^{\text{class prior}}}{\underbrace{p(\mathbf{x})}_{\text{evidence}}} \]
With the naive Bayes assumption, the posterior probability \(p(y|\mathbf{x})\) becomes:
\[ p(y = y_c|\mathbf{x}) = \frac{p(\mathbf{x}|y_c)\,p(y_c)}{p(\mathbf{x})} \]
Expressing the denominator as a sum over all \(k\) labels and rewriting:
\[ p(y = y_c|\mathbf{x}) = \frac{p(\mathbf{x}|y_c)\,p(y_c)}{\sum_{r=1}^{k} p(\mathbf{x}|y_r)\,p(y_r)} \]
Rewriting the numerator and denominator after applying the chain rule and the conditional independence assumption, and writing the result compactly:
\[ p(y = y_c|\mathbf{x}) = \frac{p(y_c)\prod_{j=1}^{m} p(x_j|y_c)}{\sum_{r=1}^{k} p(y_r)\prod_{j=1}^{m} p(x_j|y_r)} \]
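As a sanity check of the compact form, here is a small NumPy sketch with made-up priors and made-up class conditional probabilities \(p(x_j = 1|y_c)\) for \(k = 2\) classes and \(m = 3\) binary features:

```python
import numpy as np

# Hypothetical parameters for k = 2 classes and m = 3 binary features.
priors = np.array([0.6, 0.4])              # p(y_c)
mu = np.array([[0.8, 0.1, 0.5],            # p(x_j = 1 | y_c), one row per class
               [0.3, 0.7, 0.5]])

x = np.array([1, 0, 1])                    # new example

# Numerator for each class: p(y_c) * prod_j p(x_j | y_c)
likelihood = np.prod(mu**x * (1 - mu)**(1 - x), axis=1)
numerator = priors * likelihood
posterior = numerator / numerator.sum()    # denominator: sum over all classes
print(posterior)                           # posterior probabilities, sum to 1
```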
Parameters of naive Bayes classifier
- \(k\) prior probabilities, one per class.
- \(k \times m\) class conditional densities: \(m\) conditional densities per class, and there are \(k\) such classes.
- The number of parameters for each conditional density varies and depends on its mathematical form.
NB schematic
Credits: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote05.html
Modeling conditional densities: \(p(x_j|y)\)
- Depends on the nature of the feature \(x_j\):
  - Binary feature - e.g. a word is present or not.
  - Categorical feature - generalization of binary; the feature takes one of several discrete values.
  - Multinomial feature - e.g. word counts \(c_j\) as features, with the additional constraint that \(\sum_{j=1}^{m} c_j = l\), the length of the sequence they represent.
  - Continuous feature - features are real numbers, e.g. area of an apartment in sq. feet.
Modeling \(p(x_j|y_c)\)
Probability distribution used for modeling \(p(x_j|y_c)\) depends on the nature of the feature \(x_j\):
- Categorical feature: \(p(x_j|y_c) \sim \text{Cat}(e, \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}) \)
- Binary feature: \(p(x_j|y_c) \sim \text{Bernoulli}(\mu_{jc}) \)
- Multinomial feature: \(p(\mathbf{x}|y_c) \sim \text{Multinomial}(l, \mu_{1c}, \mu_{2c}, \ldots, \mu_{mc}) \)
- Continuous feature: \(p(x_j|y_c) \sim \mathcal{N}(\mu_{jc}, \sigma_{jc}^2) \)
Let \(\mathbf{w}\) be the set of all parameters: priors as well as class conditional densities
Note: we need to estimate parameters of relevant distributions, one for each feature, for each class label.
Bernoulli Distribution
When \(x_j\) is a binary feature, we use Bernoulli distribution to model the class conditional density: \(p(x_j|y_c)\)
- \(p(x_j = 1|y_c) = \mu_{jc} \)
- \(p(x_j = 0|y_c) = 1- \mu_{jc} \)
Parameterized by \(\mu_{jc}\), these two cases combine into a compact form:
\[ p(x_j|y_c) = \mu_{jc}^{x_j}\,(1 - \mu_{jc})^{(1 - x_j)} \]
When \(x_j = 1\), this evaluates to \(\mu_{jc}\), and when \(x_j = 0\), it evaluates to \(1 - \mu_{jc}\); verify that the compact form and the earlier form are equivalent.
For \(s \leq m\) binary features and \(k\) classes, we will have \(k \times s\) parameters, one for each of the \(k \times s\) Bernoulli distributions.
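A one-line check of this equivalence (a minimal sketch; the value of \(\mu_{jc}\) is made up):

```python
mu_jc = 0.8                                   # assumed p(x_j = 1 | y_c)

def bernoulli(x_j, mu):
    # Compact form: mu^x * (1 - mu)^(1 - x)
    return mu**x_j * (1 - mu)**(1 - x_j)

assert bernoulli(1, mu_jc) == mu_jc           # matches p(x_j = 1 | y_c)
assert bernoulli(0, mu_jc) == 1 - mu_jc       # matches p(x_j = 0 | y_c)
```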
Categorical Distribution
When \(x_j\) is a categorical feature, i.e. it takes one of \(e > 2\) discrete values (e.g. \(\{\text{red, green, blue}\}\) or the roll of a die), we use the categorical distribution to model the class conditional density \(p(x_j|y_c)\).
Let \(v = \{v_1, v_2, \ldots, v_e\}\) be the set of \(e\) discrete values. \(p(x_j|y_c)\) is parameterized by \(|v| = e\), the number of events in \(v\), and the probability of each event, \(\mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}\), such that \(\sum_{q=1}^e \mu_{jqc} = 1\).
For \(x_j = v_q\) such that \(v_q \in v\):
\(p(x_j=v_q|y_c; e, \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}) = \mu_{jqc}\)
Total parameters = \(k \times \sum_{j=1}^m |v_j| \)
Let \( \boldsymbol{\mu}_{jc} = [\mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}]\) be the parameter vector for \(p(x_j|y_c)\). Then \(p(x_j|y_c)\) can be written in a compact form:
\[ p(x_j|y_c) = \prod_{q=1}^{e} \mu_{jqc}^{\mathbb{1}(x_j = v_q)} \]
where \( \mathbb{1}(x_j = v_q) = 1 \text{ if } x_j = v_q \text{ else } 0\). Verify that this compact form is equivalent to the earlier form.
Multinomial Distribution
When \(\mathbf{x}\) is a count vector, i.e. each component \(x_j\) is the count of occurrences of feature \(j\) in the object it represents and \(\sum_j x_j = l\), the length of the object, we use the multinomial distribution to model \(p(\mathbf{x}|y_c)\). It is used for modelling documents that are represented by their word counts.
It is parameterized by the length of the object, \(l\), and the probabilities of the features \(\{x_1, \ldots, x_m\}\): \(\mu_{1c}, \ldots, \mu_{mc}\). The probability \(p(\mathbf{x}|y_c)\) such that \(\sum_{j=1}^{m} x_j = l\) is given by:
\[ p(\mathbf{x}|y_c; l, \mu_{1c}, \mu_{2c}, \ldots, \mu_{mc}) = \frac{l!}{x_1! \cdots x_m!}\prod_{j=1}^{m} \mu_{jc}^{x_j} \]
Total parameters = \(k \times m \)
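A quick check of this formula on a made-up count vector, cross-checked against SciPy's multinomial pmf (the document length and probabilities below are assumed for illustration):

```python
from math import factorial
import numpy as np
from scipy.stats import multinomial

# Hypothetical word-count vector for a document of length l = 6 over m = 3 words.
x = np.array([3, 2, 1])
l = int(x.sum())
mu_c = np.array([0.5, 0.3, 0.2])   # assumed p(word j | class c)

# Direct evaluation: l!/(x_1!...x_m!) * prod_j mu_jc^x_j
coef = factorial(l) / np.prod([factorial(v) for v in x])
p_manual = coef * np.prod(mu_c**x)

# Cross-check with SciPy
p_scipy = multinomial.pmf(x, n=l, p=mu_c)
print(p_manual, p_scipy)           # both ≈ 0.135
```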
Gaussian Distribution
When \(x_j\) is a continuous feature i.e. it takes a real value, we use gaussian (or normal) distribution to model the class conditional density \(p(x_j|y_c)\).
It is parameterised by the mean \(\mu_{jc}\) and variance \(\sigma_{jc}^2\).
\[ p(x_j|y_c) = \frac{1}{\sqrt{2\pi \sigma_{jc}^2}} \exp\left( -\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2} \right) \]
where \(x_j\) is the value of the \(j\)-th feature, \(\mu_{jc}\) is the mean of \(x_j\) for class \(y_c\), and \(\sigma_{jc}\) is the standard deviation of \(x_j\) for class \(y_c\).
This is a 1-D Gaussian distribution. It models the class conditional density for a single feature.
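A minimal sketch evaluating this 1-D density for an assumed mean and standard deviation of apartment area, cross-checked with SciPy:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class conditional Gaussian for one continuous feature.
mu_jc, sigma_jc = 900.0, 150.0      # assumed mean and std. dev. for class y_c
x_j = 1000.0                        # observed feature value

# Evaluate the 1-D Gaussian density directly ...
p_manual = np.exp(-(x_j - mu_jc)**2 / (2 * sigma_jc**2)) / np.sqrt(2 * np.pi * sigma_jc**2)
# ... and with SciPy as a cross-check.
p_scipy = norm.pdf(x_j, loc=mu_jc, scale=sigma_jc)
print(p_manual, p_scipy)
```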
Multivariate Gaussian Distribution
Alternately, we can use multivariate gaussian distribution to represent \(p(\mathbf{x}|y)\) with parameters mean vector \(\mathbf{\mu}_{m \times 1}\) and covariance matrix \(\Sigma_{m \times m}\).
Total parameters = \(k \times 2m \)
In the NB setting, since the features are conditionally independent of one another given the label, all entries of \(\Sigma\) except the diagonal are zero.
- The diagonal entries represent variance of that feature i.e. \(\Sigma_{jj} = \sigma_j^2\)
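The following sketch (with made-up means and variances) illustrates that, with a diagonal covariance, the multivariate density is simply the product of the per-feature 1-D densities:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# With conditionally independent features, the covariance matrix is diagonal,
# so the multivariate density factors into a product of 1-D densities.
mu = np.array([2.0, -1.0, 0.5])          # assumed per-feature means for one class
var = np.array([1.5, 0.5, 2.0])          # assumed diagonal entries sigma_j^2
x = np.array([1.0, 0.0, 1.0])

p_joint = multivariate_normal.pdf(x, mean=mu, cov=np.diag(var))
p_product = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
print(p_joint, p_product)                # equal up to floating point error
```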
Once we learn parameters of different conditional densities, we use them to infer class label for new example.
Inference
We assign a class label \(y_c\) to a new example \(\mathbf{x}\) that maximizes the posterior probability.
Let \(\mathbf{w}\) be the set of all parameters.
Using the definition of posterior probability:
\[ \hat{y} = \arg\max_{y_c} p(y_c|\mathbf{x}; \mathbf{w}) = \arg\max_{y_c} \frac{p(\mathbf{x}|y_c; \mathbf{w})\,p(y_c; \mathbf{w})}{p(\mathbf{x}; \mathbf{w})} \]
Since \(p(\mathbf{x}; \mathbf{w})\) is independent of \(y_c\), we can ignore the denominator in this computation:
\[ \hat{y} = \arg\max_{y_c} p(\mathbf{x}|y_c; \mathbf{w})\,p(y_c; \mathbf{w}) \]
For a new example \(\mathbf{x}\), we assign the class label \(y_c\) that yields the maximum value among all \(y \in \{y_1, \ldots, y_k\}\). Expanding \(p(\mathbf{x}|y_c; \mathbf{w})\) with the naive Bayes assumption, we get
\[ \hat{y} = \arg\max_{y_c} p(y_c; \mathbf{w}) \prod_{j=1}^{m} p(x_j|y_c; \mathbf{w}) \]
Since this equation involves multiplication of many small numbers, there is a risk of underflow; we therefore compute it in \(\log\) space:
\[ \hat{y} = \arg\max_{y_c} \left[ \log p(y_c; \mathbf{w}) + \sum_{j=1}^{m} \log p(x_j|y_c; \mathbf{w}) \right] \]
This equation is useful for getting the class label, but it does not return the probability of the example belonging to class \(y_c\). In case we want the probability, we should use the full posterior:
\[ p(y_c|\mathbf{x}; \mathbf{w}) = \frac{p(y_c; \mathbf{w})\prod_{j=1}^{m} p(x_j|y_c; \mathbf{w})}{\sum_{r=1}^{k} p(y_r; \mathbf{w})\prod_{j=1}^{m} p(x_j|y_r; \mathbf{w})} \]
This calculation should also be performed in \(\text{log}\) space.
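A minimal sketch of this inference rule for binary features (the "fitted" parameters are made up): it computes the arg max in log space and, when a probability is needed, normalizes in log space as well.

```python
import numpy as np

# Hypothetical fitted parameters for k = 2 classes and m = 3 binary features.
log_priors = np.log(np.array([0.6, 0.4]))
mu = np.array([[0.8, 0.1, 0.5],          # p(x_j = 1 | y_c)
               [0.3, 0.7, 0.5]])

x = np.array([1, 0, 1])                  # new example

# Log of the unnormalized posterior: log p(y_c) + sum_j log p(x_j | y_c)
log_scores = log_priors + (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1)
y_hat = np.argmax(log_scores)            # class label (no probability needed)

# If the probability is required, normalize in log space (log-sum-exp).
log_posterior = log_scores - np.logaddexp.reduce(log_scores)
print(y_hat, np.exp(log_posterior))
```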
Part 3: Loss function
Likelihood describes the joint probability of the observed data \(D\) given the parameters \(\mathbf{w}\) of the chosen statistical model:
\[ L(\mathbf{w}) = p(D; \mathbf{w}) \]
Since the training examples are i.i.d., we can express this as a product of the probabilities of the individual samples:
\[ L(\mathbf{w}) = \prod_{i=1}^{n} p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \]
For mathematical and computational convenience, we calculate the log likelihood by taking the log on both sides; the product becomes a sum in log space. The log likelihood is defined as
\[ l(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} \log p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \]
Our job is to find the parameter vector \(\mathbf{w}\) such that \(l(\mathbf{w})\) is maximized. Equivalently, we can minimize the negative log likelihood (NLL) to maintain uniformity with other algorithms:
\[ \text{NLL}(\mathbf{w}) = -l(\mathbf{w}) \]
Simplifying with the naive Bayes assumption of conditional independence of the features given the label, applying the log on the product to turn it into a summation, and rearranging:
\[ l(\mathbf{w}) = \sum_{i=1}^{n} \log p(y^{(i)}; \mathbf{w}) + \sum_{i=1}^{n} \sum_{j=1}^{m} \log p(x_j^{(i)}|y^{(i)}; \mathbf{w}) \]
The calculation of \(p(x^{(i)}_j|y^{(i)})\) depends on the probability distribution of the features.
Part 4: Optimization for parameter estimation
The parameter estimation by maximizing the log likelihood function is carried out with the following three steps:
- Calculate the partial derivative of the log likelihood function w.r.t. each parameter.
- Set the partial derivative to 0, which is the condition at maxima.
- Solve the resulting equation to obtain the parameter value.
Since \(p(x_j|y)\) depends on the choice of probability distribution, we will discuss parameter estimation for different distributions separately.
Estimating prior probability: \(p(y)\)
The prior probability for class \(y_c\) is equal to the ratio of the number of examples with label \(y_c\) to the total number of examples \(n\) in the training set:
\[ \hat{p}(y = y_c) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_c) \]
Note that \(\mathbb{1}(y^{(i)}=y_c) = 1 \text{ when } y^{(i)}=y_c \text{ else } 0. \)
The total number of parameters to be estimated is equal to the number of class labels \(k\): one prior per label.
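A minimal NumPy sketch of this counting estimate on a made-up label vector:

```python
import numpy as np

# Hypothetical label vector for n = 8 training examples and k = 3 classes.
y = np.array([0, 2, 1, 0, 0, 2, 1, 0])
k = 3

# Prior for each class: count of examples with that label divided by n.
priors = np.array([(y == c).sum() for c in range(k)]) / len(y)
print(priors)        # [0.5, 0.25, 0.25]
```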
Estimating class conditional densities
Bernoulli distribution
Recall
\[ p(x_j|y; \mathbf{w}) = w_{jy}^{x_j}\,(1 - w_{jy})^{(1 - x_j)} \]
where \(w_{jy}\) is the Bernoulli parameter for feature \(x_j\) and label \(y\). Substituting this in \(l(\mathbf{w})\):
\[ l(\mathbf{w}) = \sum_{i=1}^{n} \left[ \log p(y^{(i)}; \mathbf{w}) + \sum_{j=1}^{m} \log \left( w_{jy^{(i)}}^{x_j^{(i)}}\,(1 - w_{jy^{(i)}})^{(1 - x_j^{(i)})} \right) \right] \]
Distributing the log into the bracket, the multiplication turns into addition:
\[ l(\mathbf{w}) = \sum_{i=1}^{n} \left[ \log p(y^{(i)}; \mathbf{w}) + \sum_{j=1}^{m} \left( x_j^{(i)} \log w_{jy^{(i)}} + (1 - x_j^{(i)}) \log (1 - w_{jy^{(i)}}) \right) \right] \]
(Step 1) Calculate \( \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} \) for the parameters of label \(y_r\). The derivatives of the first term and of all terms where \(y^{(i)} \neq y_r \) are 0. Retaining terms where \(y^{(i)} = y_r \):
\[ \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} = \sum_{i:\, y^{(i)} = y_r} \left( \frac{x_j^{(i)}}{w_{jy_r}} - \frac{1 - x_j^{(i)}}{1 - w_{jy_r}} \right) \]
(Step 2) Setting \( \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} \) to \(0\) and (Step 3) solving it further with algebraic manipulation yields:
\[ w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)\, x_j^{(i)}}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)} = \frac{\#\text{ examples with label } y_r \text{ and } x_j = 1}{\#\text{ examples with label } y_r} \]
What if \(\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)\, x_j^{(i)} = 0\)?
- It leads to \(w_{jy_r} = 0\), which would mean \(p(x_j = 1|y_r) = 0\).
- In turn, \(p(y = y_r|\mathbf{x}) = 0\) for any example \(\mathbf{x}\) with \(x_j = 1\), since the factor \(p(x_j|y_r) = 0\) zeroes out the entire product.
Fixing problem with zero count
Laplace smoothing: we can correct this by adding \(+1\) to the numerator and \(+2\) to the denominator (1 for each value of the feature, \(x_j \in \{0,1\}\)):
\[ w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)\, x_j^{(i)} + 1}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r) + 2} \]
In general, we can add \(+c\) to the numerator and \(+2c\) to the denominator. \(c\) is a hyperparameter that helps control overfitting; however, too high a value of \(c\) leads to underfitting.
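A minimal sketch of this smoothed estimate on a made-up binary dataset (the feature matrix, labels, and \(c\) are assumed for illustration):

```python
import numpy as np

# Hypothetical binary feature matrix (n = 6, m = 3) and labels (k = 2).
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
c = 1.0                                   # smoothing hyperparameter (c = 1: Laplace)

w = np.zeros((2, X.shape[1]))             # w[r, j] = p(x_j = 1 | y_r)
for r in range(2):
    Xr = X[y == r]
    # (# examples with label y_r and x_j = 1 + c) / (# examples with label y_r + 2c)
    w[r] = (Xr.sum(axis=0) + c) / (len(Xr) + 2 * c)
print(w)
```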
Categorical distribution
The maximum likelihood estimate of the parameter \(p(x_j = v_q|y_r)\), for each feature \(x_j\), each value \(v_q \in v_j\), and each class \(y_r\), is
\[ \hat{p}(x_j = v_q|y_r) = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r \wedge x_j^{(i)} = v_q)}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)} \]
In plain English, this is the ratio of the number of examples with label \(y_r\) and \(x_j = v_q\) to the total number of training examples with label \(y_r\).
Incorporating smoothing, we obtain
\[ \hat{p}(x_j = v_q|y_r) = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r \wedge x_j^{(i)} = v_q) + c}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r) + |v_j|\, c} \]
The smoothing factor \(c\) is a hyperparameter and \(c = 1\) leads to Laplace smoothing.
Multinomial distribution
The maximum likelihood estimate of \(\mu_{jr} = p(x_j|y_r)\) is
\[ \hat{\mu}_{jr} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)\, x_j^{(i)}}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r) \sum_{j'=1}^{m} x_{j'}^{(i)}} \]
In plain English, this is the ratio of the total count of feature \(x_j\) in training examples with label \(y_r\) to the sum of all feature counts in training examples with label \(y_r\).
Incorporating smoothing, we obtain
\[ \hat{\mu}_{jr} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)\, x_j^{(i)} + c}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r) \sum_{j'=1}^{m} x_{j'}^{(i)} + m\, c} \]
The smoothing factor \(c\) is a hyperparameter and \(c = 1\) leads to Laplace smoothing.
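A minimal sketch of the smoothed multinomial estimate on a made-up word-count matrix (values and \(c\) are assumed for illustration):

```python
import numpy as np

# Hypothetical word-count matrix (n = 4 documents, m = 3 words) and labels.
X = np.array([[3, 1, 0],
              [2, 2, 1],
              [0, 1, 4],
              [1, 0, 3]])
y = np.array([0, 0, 1, 1])
c = 1.0                                   # smoothing hyperparameter

mu = np.zeros((2, X.shape[1]))            # mu[r, j] = p(word j | class y_r)
for r in range(2):
    counts = X[y == r].sum(axis=0)
    # (total count of feature j for class y_r + c) / (total count of all features + m*c)
    mu[r] = (counts + c) / (counts.sum() + X.shape[1] * c)
print(mu, mu.sum(axis=1))                 # each row sums to 1
```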
Gaussian/Normal distribution
Let \(n_r = \sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)\) be the number of examples of class \(y_r\). The maximum likelihood estimates are
\[ \hat{\mu}_{jr} = \frac{1}{n_r} \sum_{i:\, y^{(i)} = y_r} x_j^{(i)}, \qquad \hat{\sigma}_{jr}^2 = \frac{1}{n_r} \sum_{i:\, y^{(i)} = y_r} \left( x_j^{(i)} - \hat{\mu}_{jr} \right)^2 \]
There are two parameters, \(\{\mu_{jr}, \sigma_{jr}^2\}\), per feature per label.
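A minimal sketch of these per-class estimates on a made-up continuous dataset:

```python
import numpy as np

# Hypothetical continuous feature matrix (n = 6, m = 2) and labels (k = 2).
X = np.array([[850.0, 2.0],
              [900.0, 3.0],
              [950.0, 2.0],
              [1500.0, 4.0],
              [1600.0, 3.0],
              [1400.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

for r in range(2):
    Xr = X[y == r]
    mu_r = Xr.mean(axis=0)                # per-feature mean for class y_r
    var_r = Xr.var(axis=0)                # per-feature MLE (biased) variance for class y_r
    print(r, mu_r, var_r)
```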
Part 5: Evaluation
Classification evaluation measures with cross validation and test set:
- Confusion matrix
- Precision/recall/F1 score
- AUC ROC/PR curve
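A minimal sketch (assuming scikit-learn; not part of these notes) of computing these measures for a Gaussian NB model on a synthetic binary dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()
print(cross_val_score(clf, X_train, y_train, cv=5))     # cross-validation accuracy

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))   # AUC of the ROC curve
```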
Appendix
Estimating class conditional density
Let's calculate the class conditional density with the chain rule:
\[ p(x_1, x_2, \ldots, x_m|y) = p(x_1|y)\,p(x_2|x_1, y)\cdots p(x_m|x_1, \ldots, x_{m-1}, y) \]
There are a large number of parameters in this model.
- For \(m=3\) binary features \(x_1, x_2, x_3\)
- # parameters per label: \(2^m\)
- For binary label, \( y \in \{0,1\}\), total parameters = \(2 \times 2^m \).
x_1 | x_2 | x_3 | y = 0 | y = 1 |
---|---|---|---|---|
0 | 0 | 0 | ||
0 | 0 | 1 | ||
0 | 1 | 0 | ||
0 | 1 | 1 | ||
1 | 0 | 0 | ||
1 | 0 | 1 | ||
1 | 1 | 0 | ||
1 | 1 | 1 |
Rows are the possible values \(x_1, x_2, x_3\) can take.
Values in the column \(y=0\) are \(p(x_1, x_2, x_3|y=0)\), and similarly for \(y=1\).
The column-wise sum of values is 1:
- \(\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3|y=0) = 1\) and
- \(\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3|y=1) = 1\)
For \(m=30\) binary features, total parameters for binary classification problem > 2 billion (1 billion per label)!
- Lots of parameters!
- Need a lot of data to learn them without overfitting.
Simplify the problem with the naive Bayes assumption: features are conditionally independent given the label.
Conditional Independence (CI):
\[ p(x_1, x_2, x_3|y) = p(x_1|y)\,p(x_2|x_1, y)\,p(x_3|x_1, x_2, y) \quad \text{[Chain rule]} \]
\[ = p(x_1|y)\,p(x_2|y)\,p(x_3|y) \quad \text{[CI]} \]
Importance of CI
x_1 | x_2 | x_3 | y = 0 | y = 1 |
---|---|---|---|---|
0 | 0 | 0 | ||
0 | 0 | 1 | ||
0 | 1 | 0 | ||
0 | 1 | 1 | ||
1 | 0 | 0 | ||
1 | 0 | 1 | ||
1 | 1 | 0 | ||
1 | 1 | 1 |
- \(n\) training examples with \(m=3\) features \(x_1, x_2, x_3\) and a label \(y \in \{0, 1\}\)
- Each feature is a binary feature: \(x_i \in \{0, 1\}\)
- In this simple case, the joint distribution \(p(\mathbf{x}, y) \) has \(2^m\) parameters for each class.
- Total parameters = \(2 \times 2^m \).
Importance of CI
x_1 | ... | x_100 | y = 0 | y = 1 |
---|---|---|---|---|
0 | ... | 0 | ||
0 | ... | 1 | ||
0 | ... | 0 | ||
0 | ... | 1 | ||
: | : | : | ||
1 | ... | 1 |
- How many parameters do we need to learn for \(m=100\) binary features \(x_1, x_2, \ldots, x_{100}\) and a label \(y \in \{0, 1\}\)?
- Total parameters = \(2 \times 2^{100} \).
- Lots of parameters!
- Need a lot of data to learn them without overfitting.
Parameter reduction with CI
- Due to the CI assumption, we need to learn \(p(x_j|y)\) for each feature \(x_j\) and for each class \(y \in \{y_1, y_2, \ldots, y_k\}\), as in the per-feature tables below:
- Total parameters reduced from \(k \times 2^m\) to \(k \times 2m\).
x_1 | y = 0 | y = 1 |
---|---|---|
0 | ||
1 |
x_2 | y = 0 | y = 1 |
---|---|---|
0 | ||
1 |
...
x_m | y = 0 | y = 1 |
---|---|---|
0 | |
1 |
Types of Naive Bayes (NB)
- Bernoulli NB
- Categorical NB
- Multinomial NB
- Gaussian NB
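These four variants map directly onto scikit-learn's estimators (a common implementation choice, assumed here for illustration and not part of these notes); the `alpha` argument plays the role of the smoothing hyperparameter \(c\):

```python
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, MultinomialNB, GaussianNB

models = {
    "binary features": BernoulliNB(alpha=1.0),        # Laplace smoothing with c = 1
    "categorical features": CategoricalNB(alpha=1.0),
    "count features": MultinomialNB(alpha=1.0),
    "continuous features": GaussianNB(),               # no smoothing hyperparameter
}
for name, model in models.items():
    print(name, "->", type(model).__name__)
```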
Estimating \(w_{y_c}\)
Here \(w_{y_c} = p(y = y_c)\) is the prior for class \(y_c\). We need to take care of an additional constraint: \(\sum_{c=1}^k w_{y_c} = 1\).
Using a Lagrange multiplier \(\lambda\), we maximize
\[ l(\mathbf{w}) + \lambda \left( \sum_{c=1}^{k} w_{y_c} - 1 \right) \]
Calculate the partial derivative w.r.t. \(w_{y_c}\) and set it to \(0\):
\[ \sum_{i=1}^{n} \frac{\mathbb{1}(y^{(i)} = y_c)}{w_{y_c}} + \lambda = 0 \quad \Rightarrow \quad w_{y_c} = -\frac{1}{\lambda} \sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_c) \]
The given constraint \(\sum_{c=1}^k w_{y_c} = 1\) implies \(\lambda = -n\); substituting it, we get
\[ w_{y_c} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_c) \]