Machine Learning Techniques
Quantity | Binary classification | Multi-class / multi-label classification |
---|---|---|
Feature matrix | \((n, m)\) | \((n, m)\) |
Label | Label vector of shape \((n, )\) | Label matrix of shape \((n, k)\) |
Label of the \(i\)-th example | \(y^{(i)} \in \{0, 1\}\) | \(\mathbf{y}^{(i)} \in \{0, 1\}^k\) |
Spot the difference!
Naive Bayes classifier makes a strong conditional independence assumption:
Features are conditionally independent given the label.
It enables us to express the joint probability of the features given the label as a product of the probabilities of the individual features given the label.
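In equation form, for \(m\) features:

$$p(x_1, x_2, \ldots, x_m|y) = \prod_{j=1}^{m} p(x_j|y)$$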
The NB classifier predicts the probability \(p(y|\mathbf{x})\) of class label \(y\) given a feature vector \(\mathbf{x}\) using Bayes' theorem.
Evidence
Posterior probability
Class conditional density
Class prior
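The four quantities named above appear in Bayes' theorem as:

$$\underbrace{p(y|\mathbf{x})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{x}|y)}^{\text{class conditional density}}\ \overbrace{p(y)}^{\text{class prior}}}{\underbrace{p(\mathbf{x})}_{\text{evidence}}}$$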
With Naive Bayes assumption, posterior probability \(p(y|\mathbf{x})\) becomes
Rewriting
Expressing denominator as a sum over all \(k\) labels.
and
Rewriting the denominator after applying the chain rule
Rewriting numerator and denominator following conditional independence assumption
Rewriting numerator and denominator compactly
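Putting these steps together, the posterior probability of class \(y_c\) takes the form:

$$p(y_c|\mathbf{x}) = \frac{p(y_c)\prod_{j=1}^{m} p(x_j|y_c)}{\sum_{r=1}^{k} p(y_r)\prod_{j=1}^{m} p(x_j|y_r)}$$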
\(k\) prior probabilities
\(k \times m\) class conditional densities
\(m\) conditional densities per class and there are \(k\) such classes.
The number of parameters for each conditional density depends on its mathematical form.
Credits: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote05.html
Probability distribution used for modeling \(p(x_j|y_c)\) depends on the nature of the feature \(x_j\):
Let \(\mathbf{w}\) be the set of all parameters: priors as well as class conditional densities
Note: we need to estimate parameters of relevant distributions, one for each feature, for each class label.
When \(x_j\) is a binary feature, we use Bernoulli distribution to model the class conditional density: \(p(x_j|y_c)\)
Combine these two equations into a compact form as follows:
When \(x_j=1\),
and \(x_j=0\),
Parameterized by \(\mu_{jc}\), \(p(x_j|y_c)\) is calculated as follows:
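With \(\mu_{jc} = p(x_j = 1|y_c)\), the two cases and their compact combination are:

$$p(x_j = 1|y_c) = \mu_{jc}, \qquad p(x_j = 0|y_c) = 1 - \mu_{jc}, \qquad p(x_j|y_c) = \mu_{jc}^{x_j}\,(1 - \mu_{jc})^{1 - x_j}$$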
Verify that the compact form and earlier form are equivalent.
For \(s \leq m\) binary features and \(k\) classes, we will have \(k \times s\) parameters: one for each of the \(s\) Bernoulli distributions per class.
When \(x_j\) is a categorical feature, i.e. it takes one of \(e \gt 2\) discrete values [e.g. \(\{\text{red, green, blue}\}\) or the roll of a die], we use the categorical distribution to model the class conditional density \(p(x_j|y_c)\).
Let \(v = \{v_1, v_2, \ldots, v_e\}\) be the set of \(e\) discrete values.
\(p(x_j|y_c)\) is parameterized by \(|v| = e\), the number of events in \(v\), and the probability of each event, \( \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}\), such that \(\sum_{q=1}^e \mu_{jqc} = 1\).
For \(x_j = v_q\) such that \(v_q \in v\):
\(p(x_j=v_q|y_c; e, \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}) = \mu_{jqc}\)
Total parameters = \(k \times \sum_{j=1}^m |v_j| \)
\(p(x_j|y_c)\) can be written in a compact form as follows:
Let \( \mathbf{\mu_{jc}} = [\mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}]\) be the parameter vector for \(p(x_j|y_c)\)
Verify that the compact form is equivalent to the following:
where \( \mathbb{1}(x_j = v_q) = 1 \text{ if } x_j = v_q \text{ else } 0\)
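With the parameter vector and indicator defined above, the compact form is:

$$p(x_j|y_c; \mathbf{\mu_{jc}}) = \prod_{q=1}^{e} \mu_{jqc}^{\mathbb{1}(x_j = v_q)}$$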
When \(\mathbf{x}\) is a count vector, i.e. each component \(x_j\) is a count of occurrences in the object it represents and \( \sum_j x_j = l \), the length of the object, we use the multinomial distribution to model \(p(\mathbf{x}|y_c)\).
It is parameterized by the length of the object \(l\) and the probabilities of the features \(\{x_1, \ldots, x_m\}\): \( \mu_{1c}, \ldots, \mu_{mc}\).
For \(\mathbf{x}\) with \(\sum_{j=1}^{m} x_j = l\), \(p(\mathbf{x}|y_c)\) is given by:
\(p(\mathbf{x}|y_c; l, \mu_{1c}, \mu_{2c}, \ldots, \mu_{mc}) = \frac{l!}{x_1! \ldots x_m!}\prod_{j=1}^{m} \mu_{jc}^{x_j}\)
This distribution is used for modelling documents represented by word counts.
Total parameters = \(k \times m \)
When \(x_j\) is a continuous feature i.e. it takes a real value, we use gaussian (or normal) distribution to model the class conditional density \(p(x_j|y_c)\).
It is parameterised by the mean \(\mu_{jc}\) and variance \(\sigma_{jc}^2\).
standard deviation of \(x_j\) for class \(y_c\)
mean of \(x_j\) for class \(y_c\)
value of \(j\)-th feature
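Putting these pieces together, the 1-D Gaussian density is:

$$p(x_j|y_c; \mu_{jc}, \sigma_{jc}^2) = \frac{1}{\sqrt{2\pi\sigma_{jc}^2}}\exp\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right)$$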
This is a 1-D Gaussian distribution. It models the class conditional density for a single feature.
Alternatively, we can use a multivariate Gaussian distribution to represent \(p(\mathbf{x}|y)\) with parameters mean vector \(\mathbf{\mu}_{m \times 1}\) and covariance matrix \(\Sigma_{m \times m}\).
Total parameters = \(k \times 2m \)
In the NB setting, since the features are conditionally independent of one another, all off-diagonal entries of \(\Sigma\) are zero.
Once we learn parameters of different conditional densities, we use them to infer class label for new example.
We assign a class label \(y_c\) to a new example \(\mathbf{x}\) that maximizes the posterior probability.
Let \(\mathbf{w}\) be the set of all parameters.
Using the definition of posterior probability
Since \(p(\mathbf{x}; \mathbf{w})\) is independent of \(y_c\), we can ignore the denominator in this computation.
For a new example, \(\mathbf{x}\), we assign a class label \(y_c\) that yields max value among all \(y = \{y_1, \ldots, y_k\}\).
Expanding \(p(\mathbf{x}|y_c; \mathbf{w}) \) with naive Bayes assumption, we get
Since this equation involves the multiplication of many small numbers, there is a risk of underflow in this calculation:
The following equation is useful for getting the class label. It does not return the probability of an example belonging to class \(y_c\).
In case we want the probability, we should use the following equation
This calculation should also be performed in \(\text{log}\) space.
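As an illustrative sketch for the Bernoulli case, assuming the parameters have already been estimated (the array names `log_prior` and `feat_log_prob` are placeholders, not notation from these notes):

```python
import numpy as np
from scipy.special import logsumexp

def predict_proba(X, log_prior, feat_log_prob):
    """Posterior p(y_c | x) computed in log space for Bernoulli NB.

    X             : (n, m) binary feature matrix
    log_prior     : (k,)   log p(y_c)
    feat_log_prob : (k, m) log p(x_j = 1 | y_c)
    """
    # log p(x_j = 0 | y_c) = log(1 - p(x_j = 1 | y_c))
    log_neg = np.log1p(-np.exp(feat_log_prob))
    # joint[i, c] = log p(y_c) + sum_j log p(x_j^{(i)} | y_c)
    joint = log_prior + X @ feat_log_prob.T + (1 - X) @ log_neg.T
    # normalize with log-sum-exp to avoid underflow in the denominator
    return np.exp(joint - logsumexp(joint, axis=1, keepdims=True))

# If only the class label is needed, np.argmax over `joint` suffices,
# since the denominator is the same for every class.
```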
Likelihood describes joint probability of observed data \(D\) given the parameter \(\mathbf{w}\) for the chosen statistical model.
Since training examples are i.i.d., we can express this as a product of the probabilities of the individual samples.
For mathematical and computational convenience, we work with the log likelihood, obtained by taking the log on both sides.
The product becomes a sum in log space.
The log likelihood is denoted \(l(\mathbf{w})\).
Our job is to find the parameter vector \(\mathbf{w}\) such that the \(l(\mathbf{w})\) is maximized.
Equivalently we can minimize the negative log likelihood (NLL) to maintain uniformity with other algorithms:
Simplifying with naive Bayes assumptions of conditional independence of features given label:
Applying the log to the product turns it into a summation.
Rearranging
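Putting the steps above together, the log likelihood under the naive Bayes assumption is:

$$l(\mathbf{w}) = \log \prod_{i=1}^{n} p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) = \sum_{i=1}^{n}\left[\log p(y^{(i)}; \mathbf{w}) + \sum_{j=1}^{m} \log p(x_j^{(i)}|y^{(i)}; \mathbf{w})\right]$$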
The calculation of \(p(x^{(i)}_j|y^{(i)})\) depends on the probability distribution of the features.
The parameter estimation by maximizing the log likelihood function is carried out with the following three steps:
Since \(p(x_j|y)\) depends on the choice of probability distribution, we will discuss parameter estimation for different distributions separately.
Note that \(\mathbb{1}(y^{(i)}=y_c) = 1 \text{\ when }y^{(i)}=y_c \text{\ else\ } 0. \)
The prior probability for class \(y_c\) is equal to the ratio of the number of examples with label \(y_c\) to the total number of examples in the training set \(n\).
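In equation form, using the indicator defined above:

$$p(y = y_c) = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_c)}{n}$$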
The total number of parameters to be estimated is equal to the number of class labels \(k\) - one prior per label.
Recall the compact form of \(p(x_j|y_c)\); substituting this in \(l(\mathbf{w})\):
Parameters for label \(y_r\):
Distributing log into the bracket - multiplication turns into addition
(Step 1) calculate \( \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} \) and set it to \(0\).
Applying derivative to individual terms in the loss equation.
The derivatives of the first term and all terms where \(y^{(i)} \neq y_r \) are 0. Retaining terms where \(y^{(i)} = y_r \).
(Step 2) Setting \( \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} \) to \(0\).
(Step 3) Solving it further with algebraic manipulation:
This yields:
# examples with label \(y_r\)
# examples with label \(y_r\) and \(x_j = 1\)
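The resulting maximum likelihood estimate is the ratio of these two counts:

$$w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)\, x_j^{(i)}}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)}$$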
What if \(\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) x_j^{(i)} = 0\)?
It leads to \(w_{jy_r} = 0\), which would mean \(p(x_j = 1|y_r) = 0\).
In turn, \(p(y = y_r|\mathbf{x}) = 0\) for any example with \(x_j = 1\), since the product contains the factor \(p(x_j = 1|y_r) = 0\).
Laplace smoothing: We can correct it by adding +1 to numerator and +2 to denominator (1 for each value of feature: \(x_j \in\{0,1\}\)).
In general, we can add \(+c\) to numerator and \(+2c\) to denominator. \(c\) is a hyperparameter that helps control overfitting.
However, too high a value of \(c\) leads to underfitting.
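With smoothing constant \(c\), the estimate above becomes:

$$w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)\, x_j^{(i)} + c}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) + 2c}$$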
In plain English, this is the ratio of the number of examples with label \(y_r\) and \(x_j = v_q\) to the total number of training examples with label \(y_r\).
Parameters:
Incorporating smoothing, we obtain
Smoothing factor \(c\) is a hyperparameter and \(c = 1\) leads to Laplace smoothing.
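One way to write the smoothed estimate, with \(e = |v_j|\) possible values for feature \(x_j\) (so that it reduces to the Bernoulli case when \(e = 2\)), is:

$$\mu_{jqy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)\, \mathbb{1}(x_j^{(i)} = v_q) + c}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) + c\, e}$$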
In plain English, this is the ratio of the total count of feature \(x_j\) across training examples with label \(y_r\) to the sum of all feature counts in training examples with label \(y_r\).
Incorporating smoothing, we obtain
Smoothing factor \(c\) is a hyperparameter and \(c = 1\) leads to Laplace smoothing.
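One way to write the smoothed estimate for the multinomial case is:

$$\mu_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)\, x_j^{(i)} + c}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \sum_{j'=1}^{m} x_{j'}^{(i)} + c\, m}$$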
Let \(n_r\) be the number of examples of class \(y_r\)
There are two parameters per feature \(\{\mu_j, \sigma_j^2\}\) per label.
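The maximum likelihood estimates are the per-class sample mean and variance:

$$\mu_{jr} = \frac{1}{n_r}\sum_{i:\, y^{(i)}=y_r} x_j^{(i)}, \qquad \sigma_{jr}^2 = \frac{1}{n_r}\sum_{i:\, y^{(i)}=y_r}\left(x_j^{(i)} - \mu_{jr}\right)^2$$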
Classification evaluation measures with cross validation and test set:
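As an illustrative sketch, assuming scikit-learn and its bundled iris dataset (any feature matrix `X` and label vector `y` would do):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Load a toy dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = GaussianNB()

# 5-fold cross validation on the training set
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Final fit and evaluation (precision, recall, F1) on the held-out test set
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```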
Let's calculate the class conditional density with the chain rule (expanded below).
There are a large number of parameters in this model.
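For three binary features, the chain rule expansion is:

$$p(x_1, x_2, x_3|y) = p(x_1|y)\, p(x_2|x_1, y)\, p(x_3|x_1, x_2, y)$$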
x_1 | x_2 | x_3 | y = 0 | y = 1 |
---|---|---|---|---|
0 | 0 | 0 | ||
0 | 0 | 1 | ||
0 | 1 | 0 | ||
0 | 1 | 1 | ||
1 | 0 | 0 | ||
1 | 0 | 1 | ||
1 | 1 | 0 | ||
1 | 1 | 1 |
Rows are the possible values \(x_1, x_2, x_3\) can take.
Values in the column \(y=0\) are \(p(x_1, x_2, x_3|y=0)\), and similarly for \(y=1\).
The column-wise sum of values is 1: \(\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3|y=c) = 1\) for each class \(c\).
For \(m=30\) binary features, the total number of parameters for a binary classification problem exceeds 2 billion (\(2^{30} - 1 \approx 1\) billion per label)!
Simplify the problem with the naive Bayes assumption: features are conditionally independent given the label.
[Chain rule]
[CI]
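In equation form, the two steps labelled above are:

$$p(x_1, \ldots, x_m|y) \underset{\text{[Chain rule]}}{=} \prod_{j=1}^{m} p(x_j|x_1, \ldots, x_{j-1}, y) \underset{\text{[CI]}}{=} \prod_{j=1}^{m} p(x_j|y)$$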
x_1 | ... | x_100 | y = 0 | y = 1 |
---|---|---|---|---|
0 | ... | 0 | ||
0 | ... | 1 | ||
0 | ... | 0 | ||
0 | ... | 1 | ||
: | : | : | ||
1 | ... | 1 |
x_1 | y = 0 | y = 1 |
---|---|---|
0 | ||
1 |
x_2 | y = 0 | y = 1 |
---|---|---|
0 | ||
1 |
...
x_m | y = 0 | y = 1 |
---|---|---|
0 | |
1 | |
We need to take care of an additional constraint: \(\sum_{c=1}^k w_{y_c} = 1\), where \(w_{y_c} = p(y = y_c)\).
Using a Lagrange multiplier \(\lambda\), calculate \( \frac{\partial}{\partial w_{y_c}} \left[ l(\mathbf{w}) + \lambda \left( \sum_{c=1}^k w_{y_c} - 1 \right) \right] \) and set it to \(0\).
The constraint \(\sum_{c=1}^k w_{y_c} = 1\) implies \(\lambda = -n\); substituting it, we get

$$w_{y_c} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_c)}{n},$$

the fraction of training examples with label \(y_c\), matching the prior estimate stated earlier.