Naive Bayes Classifier

Dr. Ashish Tendulkar

Machine Learning Techniques

IIT Madras

Introduction

  • Simple yet very powerful classifier that is used extensively in applications like document classification and spam filtering.
  • Generative counterpart of logistic regression.
  • Uses Bayes' theorem to calculate the probability of a sample belonging to a class.
  • Makes a strong (naive) conditional independence assumption: the features are independent of one another given the label.

Part 1: Training Setup

Binary classification

D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n}

  • Feature matrix \(\mathbf{X}\) has shape \((n, m)\); label vector \(\mathbf{y}\) has shape \((n, )\).
  • \(i\) is the index of the example and \(y^{(i)} \in \{0, 1\}\).

Multiclass classification

D = \left\{ (\mathbf{X}, \mathbf{Y})\right\} = \left\{ (\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\right\}_{i=1}^{n}

Feature matrix

Label matrix

\((n, m)\)

\((n, k )\)

Index of example

\(\mathbf{y}^{(i)} \in \{0, 1\}^k\)

Spot the difference!
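To make the two setups concrete, here is a minimal sketch with synthetic arrays; the sizes and variable names are illustrative, not part of the lecture.

```python
# Shapes of the binary and multiclass training setups; toy data only.
import numpy as np

n, m, k = 6, 4, 3                   # n examples, m features, k classes
rng = np.random.default_rng(0)

# Binary classification: label vector of shape (n,), y^(i) in {0, 1}.
X = rng.random((n, m))              # feature matrix, shape (n, m)
y = rng.integers(0, 2, size=n)      # label vector,  shape (n,)

# Multiclass classification: label matrix of shape (n, k), y^(i) in {0, 1}^k.
labels = rng.integers(0, k, size=n)
Y = np.eye(k, dtype=int)[labels]    # label matrix,  shape (n, k)

print(X.shape, y.shape, Y.shape)    # (6, 4) (6,) (6, 3)
```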

Part 2: Model

Naive Bayes' assumption

The naive Bayes classifier makes a strong conditional independence assumption:

Features are conditionally independent given the label.

This enables us to express the joint probability of the features given the label as a product of the probabilities of the individual features given the label:

\begin{aligned} p(x_1, x_2, \ldots, x_m|y) = p(x_1|y)\ p(x_2|y) \ldots p(x_m|y) = \prod_{j=1}^{m} p(x_j|y) \\ \end{aligned}

The NB classifier predicts the probability, \(p(y|\mathbf{x})\), of class label \(y\) given a feature vector \(\mathbf{x}\), using Bayes' theorem:

p(y|\mathbf{x}) = \frac{p(\mathbf{x}, y)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|y)\ p(y)}{p(\mathbf{x})}

For a particular class \(y_c\),

\begin{aligned} p(y = y_c|\mathbf{x}) &= \frac{p(\mathbf{x}|y_c)\ p(y_c)} {p(\mathbf{x})} \end{aligned}

where \(p(y_c|\mathbf{x})\) is the posterior probability, \(p(\mathbf{x}|y_c)\) the class conditional density, \(p(y_c)\) the class prior, and \(p(\mathbf{x})\) the evidence.

With the naive Bayes assumption, the posterior probability \(p(y|\mathbf{x})\) takes a simple form.  Writing

\begin{aligned} p(\mathbf{x}|y) = p(x_1, x_2, \ldots, x_m|y) \end{aligned}

and

\begin{aligned} p(\mathbf{x}) = p(x_1, x_2, \ldots, x_m) \end{aligned}

and rewriting the posterior,

\begin{aligned} p(y = y_c|\mathbf{x}) &= \frac{p(x_1, x_2, \ldots, x_m|y_c)\ p(y_c)}{\color{blue}{p(x_1, x_2, \ldots, x_m)}} \end{aligned}

Expressing the denominator as a sum over all \(k\) labels,

\begin{aligned} &= \frac{p(x_1, x_2, \ldots, x_m|y_c)\ p(y_c)}{\color{blue}{\sum_{r=1}^{k} p(x_1, x_2, \ldots, x_m, y_r)}} \\ \end{aligned}

Rewriting the denominator after applying the chain rule, \(\sum_{r=1}^{k} p(x_1, x_2, \ldots, x_m, y_r) = \sum_{r=1}^{k} p(x_1, x_2, \ldots, x_m|y_r)\ p(y_r)\),

\begin{aligned} &= \frac{p(x_1, x_2, \ldots, x_m|y_c)\ p(y_c)}{\color{blue}{\sum_{r=1}^{k} p(x_1, x_2, \ldots, x_m|y_r) p(y_r)} } \end{aligned}

Rewriting numerator and denominator with the conditional independence assumption \(p(x_1, x_2, \ldots, x_m|y) = p(x_1|y)\ p(x_2|y) \ldots p(x_m|y)\),

\begin{aligned} p(y = y_c|\mathbf{x}) &= \frac{\color{green}{p(y_c) p(x_1|y_c) p(x_2|y_c) \ldots p(x_m|y_c)}} {\color{blue}{ \sum_{r=1}^{k} p(y_r) p(x_1|y_r) p(x_2|y_r) \ldots p(x_m|y_r)}} \\ \end{aligned}

Writing numerator and denominator compactly,

\begin{aligned} &= \frac{\color{green}{p(y_c) \prod_{j=1}^{m} p(x_j|y_c)}} {\color{blue}{ \sum_{r=1}^{k} p(y_r) \prod_{j=1}^{m} p(x_j|y_r)}} \\ \end{aligned}

Parameters of naive Bayes classifier

\begin{aligned} p(y = y_c|\mathbf{x}) &= \frac{\color{green}{p(y_c) \prod_{j=1}^{m} p(x_j|y_c)}} {\color{blue}{ \sum_{r=1}^{k} p(y_r) \prod_{j=1}^{m} p(x_j|y_r)}} \\ \end{aligned}

  • \(k\) prior probabilities: \(\{p(y_1), p(y_2), \ldots, p(y_k)\}\)
  • \(k \times m\) class conditional densities: \(\{p(x_1|y_r), p(x_2|y_r), \ldots, p(x_m|y_r)\}\), i.e. \(m\) conditional densities per class for each of the \(k\) classes.

The number of parameters for each conditional density varies and depends on its mathematical form.

NB schematic

Credits: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote05.html

Modeling conditional densities: \(p(x_j|y)\)

  • Depends on the nature of the feature \(x_j\):
    • Binary feature - e.g. a word is present or not.
      • A categorical feature is a generalization of a binary feature.
    • Multinomial features - e.g. word counts \(x_j\) as features, with the additional constraint that \(\sum_{j=1}^{m} x_j = l\), the length of the sequence they represent.
    • Continuous feature - the feature takes a real value, e.g. the area of an apartment in square feet.

Modeling \(p(x_j|y_c)\)

Probability distribution used for modeling \(p(x_j|y_c)\) depends on the nature of the feature \(x_j\):

  • Categorical feature: \(p(x_j|y_c) \sim \text{Cat}(e, \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}) \)
  • Binary feature: \(p(x_j|y_c) \sim \text{Bernoulli}(\mu_{jc}) \)
  • Multinomial feature: \(p(\mathbf{x}|y_c) \sim \text{Multinomial}(l, \mu_{1c}, \mu_{2c}, \ldots, \mu_{mc}) \)
  • Continuous feature: \(p(x_j|y_c) \sim \mathcal{N}(\mu_{jc}, \sigma_{jc}^2) \)

Let \(\mathbf{w}\) be the set of all parameters: the priors as well as the parameters of the class conditional densities.

Note: we need to estimate the parameters of the relevant distribution, one per feature, for each class label.
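As a concrete reference, scikit-learn ships one naive Bayes estimator per choice of class conditional density.  The sketch below is illustrative only; it assumes scikit-learn is installed, and the toy arrays are made up.

```python
# Mapping feature types to scikit-learn NB estimators (toy data, illustrative).
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

X_binary      = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])     # word present / absent
X_categorical = np.array([[0, 2], [1, 0], [2, 1], [0, 1]])     # category indices
X_counts      = np.array([[3, 1], [2, 2], [0, 5], [1, 4]])     # word counts
X_continuous  = np.array([[650.0, 2.0], [820.0, 3.0],
                          [400.0, 1.0], [1100.0, 4.0]])        # e.g. area, rooms

for model, X in [(BernoulliNB(), X_binary), (CategoricalNB(), X_categorical),
                 (MultinomialNB(), X_counts), (GaussianNB(), X_continuous)]:
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:1]))
```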

Bernoulli Distribution

When \(x_j\) is a binary feature, we use the Bernoulli distribution to model the class conditional density \(p(x_j|y_c)\):

  • \(p(x_j = 1|y_c) = \mu_{jc} \)
  • \(p(x_j = 0|y_c) = 1- \mu_{jc} \)

We combine these two equations into a compact form.  Parameterized by \(\mu_{jc}\), \(p(x_j|y_c)\) is calculated as follows:

p(x_j|y_c; \mu_{jc}) = \mu_{jc}^{x_j} (1 - \mu_{jc})^{(1-x_j)}

Verify that the compact form and the earlier form are equivalent.  When \(x_j=1\),

\mu_{jc}^{\color{red}{1}} (1 - \mu_{jc})^{(1-{\color{red}1})} = \mu_{jc} (1 - \mu_{jc})^{0} = \mu_{jc}

and when \(x_j=0\),

\mu_{jc}^{{\color{red}0}} (1 - \mu_{jc})^{(1-{\color{red}0})} = (1 - \mu_{jc})^{1} = 1-\mu_{jc}

For \(s \leq m\) binary features and \(k\) classes, we will have \(k \times s\) parameters: one for each of the \(s\) Bernoulli distributions per class.

Categorical Distribution

When \(x_j\) is a categorical feature, i.e. it takes one of \(e \gt 2\) discrete values (e.g. \(\{\text{red, green, blue}\}\) or the roll of a die), we use the categorical distribution to model the class conditional density \(p(x_j|y_c)\).

Let \(v = \{v_1, v_2, \ldots, v_e\}\) be the set of \(e\) discrete values.  \(p(x_j|y_c)\) is parameterized by \(|v| = e\), the number of events in \(v\), and the probability of each event, \( \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}\), such that \(\sum_{q=1}^e \mu_{jqc} = 1\).

For \(x_j = v_q\) such that \(v_q \in v\):

\(p(x_j=v_q|y_c; e, \mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}) = \mu_{jqc}\)

Let \( \mathbf{\mu_{jc}} = [\mu_{j1c}, \mu_{j2c}, \ldots, \mu_{jec}]\) be the parameter vector for \(p(x_j|y_c)\).  Then \(p(x_j|y_c)\) can be written in a compact form as follows:

p(x_j|y_c; e, \mathbf{\mu_{jc}}) = \mu_{j1c}^{\mathbb{1}(x_j = v_1)} \mu_{j2c}^{\mathbb{1}(x_j = v_2)} \ldots \mu_{jec}^{\mathbb{1}(x_j = v_e)}

where \( \mathbb{1}(x_j = v_q) = 1 \text{ if } x_j = v_q \text{ else } 0\).

Verify that the compact form is equivalent to the following:

\begin{aligned} p(x_j = v_1|y_c; e, \mathbf{\mu_{jc}}) &= \mu_{j1c} \\ p(x_j = v_2|y_c; e, \mathbf{\mu_{jc}}) &= \mu_{j2c} \\ &\ \ \vdots \\ p(x_j = v_e|y_c; e, \mathbf{\mu_{jc}}) &= \mu_{jec} \\ \end{aligned}

Total parameters = \(k \times \sum_{j=1}^m |v_j| \)

Multinomial Distribution

When \(\mathbf{x}\) is a count vector, i.e. each component \(x_j\) is a count of occurrences in the object it represents and \( \sum_{j=1}^{m} x_j = l \), the length of the object, we use the multinomial distribution to model \(p(\mathbf{x}|y_c)\).  It is used, for example, for modelling documents represented by their word counts.

It is parameterized by the length of the object \(l\) and the probabilities of the features \(\{x_1, \ldots, x_m\}\): \( \mu_{1c}, \ldots, \mu_{mc}\).

The probability \(p(\mathbf{x}|y_c)\), such that \(\sum_{j=1}^{m} x_j = l\), is given by:

\(p(\mathbf{x}|y_c; l, \mu_{1c}, \mu_{2c}, \ldots, \mu_{mc}) = \frac{l!}{x_1! \ldots x_m!}\prod_{j=1}^{m} \mu_{jc}^{x_j}\)

Total parameters = \(k \times m \)

Gaussian Distribution

When \(x_j\) is a continuous feature, i.e. it takes a real value, we use the Gaussian (or normal) distribution to model the class conditional density \(p(x_j|y_c)\).

It is parameterized by \(\mu_{jc}\) and \(\sigma_{jc}^2\), the mean and variance of \(x_j\) for class \(y_c\).  For the value \(x_j\) of the \(j\)-th feature,

p(x_j|y_c; \mu_{jc}, \sigma_{jc}^2) = \frac{1}{\sqrt{2\pi} \sigma_{jc}} e^{-\frac{1}{2}(\frac{x_j - \mu_{jc}}{\sigma_{jc}})^2}

This is a 1-D Gaussian distribution.  It models the class conditional density of a single feature.

Multivariate Gaussian Distribution

Alternatively, we can use a multivariate Gaussian distribution to represent \(p(\mathbf{x}|y)\), with parameters mean vector \(\mathbf{\mu}_{m \times 1}\) and covariance matrix \(\Sigma_{m \times m}\):

p(\mathbf{x}|y; \mathbf{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}} \text{exp}\left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu})^T \Sigma^{-1} (\mathbf{x} - \mathbf{\mu})\right)

In the NB setting, since the features are conditionally independent of one another, all off-diagonal entries of \(\Sigma\) are zero.

  • The diagonal entries are the variances of the individual features, i.e. \(\Sigma_{jj} = \sigma_j^2\).

Total parameters = \(k \times 2m \) (a mean and a variance per feature, per class).

Once we learn the parameters of the different conditional densities, we use them to infer the class label for a new example.

Inference

Let \(\mathbf{w}\) be the set of all parameters.  We assign to a new example \(\mathbf{x}\) the class label \(y_c\) that maximizes the posterior probability:

\begin{aligned} y &= \text{argmax}_{y_c} p(y_c|\mathbf{x}; \mathbf{w}) \end{aligned}

Using the definition of the posterior probability,

\begin{aligned} &= \text{argmax}_{y_c} \frac{p(\mathbf{x}|y_c; \mathbf{w})\ p(y_c; \mathbf{w})}{p(\mathbf{x}; \mathbf{w})} \end{aligned}

Since \(p(\mathbf{x}; \mathbf{w})\) is independent of \(y_c\), we can drop the denominator from this computation:

\begin{aligned} &= \text{argmax}_{y_c} p(\mathbf{x}|y_c; \mathbf{w})\ p(y_c; \mathbf{w}) \end{aligned}

Expanding \(p(\mathbf{x}|y_c; \mathbf{w}) \) with the naive Bayes assumption, we get

\begin{aligned} &= \text{argmax}_{y_c} \left( \prod_{j=1}^{m} p(x_j|y_c; \mathbf{w}) \right) p(y_c; \mathbf{w}) \end{aligned}

This expression involves multiplication of small numbers, so there is a risk of underflow.  We therefore perform the computation in log space:

\begin{aligned} y &= \text{argmax}_{y_c} \left(\sum_{j=1}^{m} \text{log}\ p(x_j|y_c; \mathbf{w}) \right) + \text{log}\ p(y_c; \mathbf{w}) \\ \end{aligned}

For a new example \(\mathbf{x}\), we assign the class label \(y_c\) that yields the maximum value among all \(y \in \{y_1, \ldots, y_k\}\).

This equation is useful for getting the class label; it does not return the probability of the example belonging to class \(y_c\).  In case we want the probability, we should use the following equation:

\begin{aligned} p(y_c | \mathbf{x}; \mathbf{w}) &= \frac{p(\mathbf{x}|y_c; \mathbf{w})\ p(y_c; \mathbf{w})}{p(\mathbf{x}; \mathbf{w})} \end{aligned}

This calculation should also be performed in \(\text{log}\) space.
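A minimal sketch of log-space inference for a Bernoulli NB; the priors and per-feature probabilities below are assumed values chosen only for illustration.

```python
# Log-space inference for a Bernoulli NB; all parameter values are made up.
import numpy as np

log_prior = np.log(np.array([0.6, 0.4]))       # log p(y_c), k = 2 classes
mu = np.array([[0.8, 0.1, 0.3],                # p(x_j = 1 | y_c), shape (k, m)
               [0.2, 0.7, 0.6]])

x = np.array([1, 0, 1])                        # new example, m = 3 binary features

# sum_j log p(x_j | y_c) under the naive Bayes assumption
log_lik = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1)
log_joint = log_prior + log_lik                # log p(x, y_c), one entry per class

y_hat = int(np.argmax(log_joint))              # class label: argmax needs no normalization

# Posterior probabilities, normalized in log space to avoid underflow
log_posterior = log_joint - np.logaddexp.reduce(log_joint)
print(y_hat, np.exp(log_posterior))
```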

Part 3: Loss function

Likelihood describes the joint probability of the observed data \(D\) given the parameters \(\mathbf{w}\) of the chosen statistical model:

\begin{aligned} L(\mathbf{w}) &= p(D; \mathbf{w}) = p(\mathbf{X}, \mathbf{y}; \mathbf{w}) \end{aligned}

Since the training examples are i.i.d., we can express this as a product of the probabilities of individual samples:

\begin{aligned} L(\mathbf{w}) &= \prod_{i=1}^{n} p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \end{aligned}

For mathematical and computational convenience, we calculate the log likelihood by taking log on both sides:

\begin{aligned} \text{log } L(\mathbf{w}) &= \text{log} \left( \prod_{i=1}^{n} p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \right) \end{aligned}

The product becomes a sum in log space, so the log likelihood is defined as

\begin{aligned} l(\mathbf{w}) &= \sum_{i=1}^{n} \text{log} \left(p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \right) \end{aligned}

Our job is to find the parameter vector \(\mathbf{w}\) such that \(l(\mathbf{w})\) is maximized.  Equivalently, we can minimize the negative log likelihood (NLL) to maintain uniformity with other algorithms:

\begin{aligned} J(\mathbf{w}) &= -l(\mathbf{w}) \\ &= -\sum_{i=1}^{n} \text{log} \left(p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \right) \end{aligned}

Simplifying with the naive Bayes assumption of conditional independence of the features given the label:

\begin{aligned} l(\mathbf{w}) &= \sum_{i=1}^{n} \text{log} \left(p(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{w}) \right) \\ &= \sum_{i=1}^{n} \text{log} \left( \left( \prod_{j=1}^{m} p(x^{(i)}_j|y^{(i)}; \mathbf{w}) \right) \ p(y^{(i)}; \mathbf{w}) \right) \\ &= \sum_{i=1}^{n} \left( \left( \sum_{j=1}^{m} \text{log}\ p(x^{(i)}_j|y^{(i)}; \mathbf{w}) \right) + \text{log}\ p(y^{(i)}; \mathbf{w}) \right) \end{aligned}

Applying log to the product turns it into a sum.  Rearranging,

\begin{aligned} l(\mathbf{w}) &= \sum_{i=1}^{n} \text{log}\ p(y^{(i)}; \mathbf{w}) + \sum_{i=1}^{n} \sum_{j=1}^{m} \text{log}\ p(x^{(i)}_j|y^{(i)}; \mathbf{w}) \end{aligned}

The calculation of \(p(x^{(i)}_j|y^{(i)})\) depends on the probability distribution of the features.
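The factorized log likelihood is easy to evaluate directly.  Below is a small sketch for a Bernoulli NB; the data is a toy example and the prior and Bernoulli parameters are assumed values.

```python
# Evaluating l(w) = sum_i log p(y_i) + sum_i sum_j log p(x_ij | y_i); toy data,
# and the prior / Bernoulli parameters are assumed values.
import numpy as np

X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0]])              # n = 3 examples, m = 3 binary features
y = np.array([0, 1, 0])

prior = np.array([0.6, 0.4])           # p(y_c)
mu = np.array([[0.7, 0.2, 0.6],        # p(x_j = 1 | y_c), shape (k, m)
               [0.1, 0.5, 0.9]])

prior_term = np.log(prior[y]).sum()                                   # sum_i log p(y_i)
cond_term = (X * np.log(mu[y]) + (1 - X) * np.log(1 - mu[y])).sum()   # sum_i sum_j log p(x_ij | y_i)

log_likelihood = prior_term + cond_term
print(log_likelihood, -log_likelihood)  # l(w) and the NLL J(w)
```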

Part 4: Optimization for parameter estimation

The parameter estimation by maximizing the log likelihood function is carried out with the following three steps:

  1. Calculate the partial derivative of the log likelihood function w.r.t. each parameter.
  2. Set the partial derivative to 0, which is the condition at a maximum.
  3. Solve the resulting equation to obtain the parameter value.

Since \(p(x_j|y)\) depends on the choice of probability distribution, we will discuss parameter estimation for different distributions separately.  

Estimating prior probability: \(p(y)\)

p(y = y_c) = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_c)}{n}

Note that \(\mathbb{1}(y^{(i)}=y_c) = 1 \text{\ when }y^{(i)}=y_c \text{\ else\ } 0. \)

The prior probability for class \(y_c\) is equal to the ratio of the number of examples with label \(y_c\) to the total number of examples in the training set \(n\).

The total number of parameters to be estimated is equal to the number of class labels \(k\): one prior per label.
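A one-line sketch of this estimate; the label vector is a toy example.

```python
# Estimate p(y = y_c) as the fraction of training examples with label y_c.
import numpy as np

y = np.array([0, 1, 0, 2, 1, 0])                        # n = 6 examples, k = 3 classes
k = 3
priors = np.array([(y == c).mean() for c in range(k)])  # count of class c divided by n
print(priors)                                           # [0.5, 0.333..., 0.166...]
```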

Estimating class conditional densities

Bernoulli distribution

Parameters for label \(y_r\): 

\mathbf{w} = w_{1r}, w_{2r}, \ldots, w_{mr}

Recall

p(x_j|y_c; w_{jc}) = w_{jc}^{x_j} (1 - w_{jc})^{(1-x_j)}

Substituting this in \(l(\mathbf{w})\), where \(w_{y^{(i)}}\) denotes the prior parameter \(p(y^{(i)})\),

\begin{aligned} l(\mathbf{w}) &= \sum_{i=1}^{n} \text{log}\ p(y^{(i)}; \mathbf{w}) + \sum_{i=1}^{n} \sum_{j=1}^{m} \text{log}\ p(x^{(i)}_j|y^{(i)}; \mathbf{w}) \\ &= \sum_{i=1}^{n} \text{log}\ w_{y^{(i)}} + \sum_{i=1}^{n} \sum_{j=1}^{m} \text{log} \left( w_{jy^{(i)}}^{x_j^{(i)}} (1 - w_{jy^{(i)}})^{1 - x_j^{(i)}} \right) \\ &= \sum_{i=1}^{n} \text{log}\ w_{y^{(i)}} + \sum_{i=1}^{n} \sum_{j=1}^{m} x_j^{(i)} \text{log}\ w_{jy^{(i)}} + (1 - x_j^{(i)}) \text{log}\ (1 - w_{jy^{(i)}}) \\ \end{aligned}

Distributing the log into the bracket turns the multiplication into addition.

(Step 1) Calculate \( \frac{\partial l(\mathbf{w})}{\partial  w_{jy_r}} \):

\begin{aligned} \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} &= \frac{\partial}{\partial w_{jy_r}} \left(\sum_{i=1}^{n} \text{log}\ w_{y^{(i)}} + \sum_{i=1}^{n} \sum_{j=1}^{m} x_j^{(i)} \text{log}\ w_{jy^{(i)}} + (1 - x_j^{(i)}) \text{log}\ (1 - w_{jy^{(i)}}) \right) \\ \end{aligned}

Applying the derivative to the individual terms,

\begin{aligned} &= \sum_{i=1}^{n} \frac{\partial}{\partial w_{jy_r}} \text{log}\ w_{y^{(i)}} + \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\partial}{\partial w_{jy_r}} \left(x_j^{(i)} \text{log}\ w_{jy^{(i)}} + (1 - x_j^{(i)}) \text{log}\ (1 - w_{jy^{(i)}}) \right) \\ \end{aligned}

The derivatives of the first term and of all terms where \(y^{(i)} \neq y_r \) are 0.  Retaining the terms where \(y^{(i)} = y_r \),

\begin{aligned} &= \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \frac{\partial}{\partial w_{jy_r}} \left(x_j^{(i)} \text{log}\ w_{jy_r} + (1 - x_j^{(i)}) \text{log}\ (1 - w_{jy_r}) \right) \\ &= \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \left( \frac{x_j^{(i)}}{w_{jy_r}} - \frac{1 - x_j^{(i)}}{1 - w_{jy_r}} \right) \\ \end{aligned}

(Step 2) Set \( \frac{\partial l(\mathbf{w})}{\partial  w_{jy_r}} \) to \(0\):

\begin{aligned} \frac{\partial l(\mathbf{w})}{\partial w_{jy_r}} &= \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \left( \frac{x_j^{(i)}}{w_{jy_r}} - \frac{(1 - x_j^{(i)})}{(1 - w_{jy_r})} \right) = 0\\ \end{aligned}

(Step 3) Solve it further with algebraic manipulation:

\begin{aligned} \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \left( \frac{x_j^{(i)}}{w_{jy_r}} - \frac{(1 - x_j^{(i)})}{(1 - w_{jy_r})} \right) &= 0\\ \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \left( x_j^{(i)} (1 - w_{jy_r}) - (1 - x_j^{(i)}) w_{jy_r} \right) &= 0\\ \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) \left( x_j^{(i)} - w_{jy_r} \right) &= 0\\ \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) x_j^{(i)} &= \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) w_{jy_r} \\ \end{aligned}

This yields:

w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) x_j^{(i)}} {\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)} \\

The numerator is the number of examples with label \(y_r\) and \(x_j = 1\); the denominator is the number of examples with label \(y_r\).

What if \(\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) x_j^{(i)} = 0\)?

This leads to \(w_{jy_r} = 0\), which would mean \(p(x_j|y_r) = 0\), and in turn \(p(y = y_r|\mathbf{x}) = 0\) for any example with \(x_j = 1\).

Fixing the problem with zero counts

Laplace smoothing: we can correct it by adding \(+1\) to the numerator and \(+2\) to the denominator (1 for each value of the feature: \(x_j \in\{0,1\}\)):

w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) x_j^{(i)} + 1} {\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)+ 2 } \\

In general, we can add \(+c\) to the numerator and \(+2c\) to the denominator.  \(c\) is a hyperparameter that helps control overfitting:

w_{jy_r} = \frac{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r) x_j^{(i)} + c} {\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_r)+ 2c } \\

However, too high a value of \(c\) leads to underfitting.
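A short sketch of the smoothed Bernoulli estimate \(w_{jy_r}\); the data below is a toy example and \(c\) is the smoothing hyperparameter.

```python
# Smoothed Bernoulli estimate:
# (# examples with label y_r and x_j = 1 + c) / (# examples with label y_r + 2c)
import numpy as np

X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])       # n = 4 examples, m = 3 binary features
y = np.array([0, 0, 1, 1])
k, c = 2, 1.0                   # c = 1 gives Laplace smoothing

w = np.zeros((k, X.shape[1]))
for r in range(k):
    mask = (y == r)
    w[r] = (X[mask].sum(axis=0) + c) / (mask.sum() + 2 * c)
print(w)
```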

Categorical distribution

Parameters:

\mathbf{w} = \{w_{111}, \ldots, w_{1e1}, \ldots, w_{m11}, \ldots, w_{me1}, \ldots, w_{mek}\}

Following the same three steps, the estimate is

w_{jvy_r} = \frac{\color{blue}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r)\ \mathbb{1}(x^{(i)}_j = v)}}{\color{red}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r)}}

In plain English, this is the ratio of the number of examples with label \(y_r\) and \(x_j=v\) to the total number of training examples with label \(y_r\).

Incorporating smoothing, we obtain

w_{jvy_r} = \frac{\color{blue}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r)\ \mathbb{1}(x^{(i)}_j = v)} + c}{\color{red}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r)}+ce}

The smoothing factor \(c\) is a hyperparameter, and \(c = 1\) gives Laplace smoothing.
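A corresponding sketch for a single categorical feature; the data is made up, with \(e\) categories and smoothing factor \(c\).

```python
# Smoothed categorical estimate w_{j v y_r} for one feature; toy data.
import numpy as np

x_j = np.array([0, 2, 1, 0, 2, 2])     # one categorical feature with values {0, 1, 2}
y   = np.array([0, 0, 0, 1, 1, 1])
k, e, c = 2, 3, 1.0

w = np.zeros((k, e))
for r in range(k):
    mask = (y == r)
    counts = np.bincount(x_j[mask], minlength=e)   # # examples with label y_r and x_j = v
    w[r] = (counts + c) / (mask.sum() + c * e)     # each row sums to 1
print(w)
```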

Multinomial distribution

Following the same three steps, the estimate is

w_{jy_r} = \frac{\color{blue}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r)\ x_j^{(i)}}}{\color{red}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r) \sum_{j'=1}^{m} x^{(i)}_{j'}}}

In plain English, this is the ratio of the total count of feature \(x_j\) over training examples with label \(y_r\) to the total count of all features over training examples with label \(y_r\).

Incorporating smoothing, we obtain

w_{jy_r} = \frac{\color{blue}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r)\ x^{(i)}_j} + c}{\color{red}{\sum_{i=1}^n \mathbb{1}(y^{(i)}=y_r) \sum_{j'=1}^{m} x^{(i)}_{j'}} + cm}

The smoothing factor \(c\) is a hyperparameter, and \(c = 1\) gives Laplace smoothing.
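A sketch of the smoothed multinomial estimate on a toy word-count matrix; the counts are made up and \(c\) is the smoothing factor.

```python
# Smoothed multinomial estimate:
# (count of word j in class y_r + c) / (count of all words in class y_r + c*m)
import numpy as np

X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 2],
              [1, 3, 3]])        # n = 4 documents, m = 3 vocabulary words
y = np.array([0, 0, 1, 1])
k, m, c = 2, 3, 1.0

w = np.zeros((k, m))
for r in range(k):
    Xr = X[y == r]
    w[r] = (Xr.sum(axis=0) + c) / (Xr.sum() + c * m)
print(w, w.sum(axis=1))          # each row sums to 1
```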

Gaussian/Normal distribution

Let \(n_r\) be the number of examples of class \(y_r\)

n_r = \sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r)

There are two parameters per feature per label, \(\{\mu_{jr}, \sigma_{jr}^2\}\), estimated as

\begin{aligned} \mu_{jr} &= \frac{1}{n_r} \sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r) x_j^{(i)} \\ \sigma_{jr}^2 &= \frac{1}{n_r} \sum_{i=1}^{n} \mathbb{1}(y^{(i)} = y_r) (x_j^{(i)} - \mu_{jr})^2 \end{aligned}
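A sketch of the per-class mean and variance estimates; the continuous data below is a toy example.

```python
# Gaussian NB parameter estimation: per-class, per-feature mean and variance.
import numpy as np

X = np.array([[650.0, 2.0],
              [820.0, 3.0],
              [400.0, 1.0],
              [1100.0, 4.0]])    # n = 4 examples, m = 2 continuous features
y = np.array([0, 0, 1, 1])
k = 2

mu = np.zeros((k, X.shape[1]))
var = np.zeros((k, X.shape[1]))
for r in range(k):
    Xr = X[y == r]               # the n_r examples of class y_r
    mu[r] = Xr.mean(axis=0)      # mu_{jr}
    var[r] = Xr.var(axis=0)      # sigma_{jr}^2 (divides by n_r, i.e. the MLE)
print(mu)
print(var)
```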

Part 5: Evaluation

Evaluation

We evaluate the classifier with standard classification measures, using cross validation and a held-out test set:

  • Confusion matrix
  • Precision/recall/F1 score
  • AUC ROC/PR curve
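A sketch of this evaluation workflow with scikit-learn; the dataset and estimator choices below are illustrative assumptions, not part of the lecture.

```python
# Cross-validation + held-out evaluation of a Gaussian NB; dataset choice is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = GaussianNB()
print("5-fold CV accuracy:", cross_val_score(model, X_tr, y_tr, cv=5).mean())

model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(confusion_matrix(y_te, y_pred))                 # confusion matrix
print(classification_report(y_te, y_pred))            # precision / recall / F1
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```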

Appendix

Estimating class conditional density

Let's calculate the class conditional density with the chain rule:

\begin{aligned} p(\mathbf{x}|y) &= p(x_1, x_2, \ldots, x_m|y) \\ &= p(x_1|y) p(x_2|x_1, y) \ldots p(x_m|x_1, x_2, \ldots, x_{m-1}, y) \\ &= \prod_{j=1}^{m} p(x_j|x_1, \ldots, x_{j-1}, y) \end{aligned}

There are a large number of parameters in this model.

  • For \(m=3\) binary features \(x_1, x_2, x_3\) 
    • # parameters per label: \(2^m\)
  • For a binary label, \( y \in \{0,1\}\), total parameters = \(2 \times 2^m \).

x_1 x_2 x_3 | y = 0 | y = 1
 0   0   0  |       |
 0   0   1  |       |
 0   1   0  |       |
 0   1   1  |       |
 1   0   0  |       |
 1   0   1  |       |
 1   1   0  |       |
 1   1   1  |       |

Rows are the possible values \(x_1, x_2, x_3\) can take.

Values in the column \(y=0\) are \(p(x_1, x_2, x_3|y=0)\); values in the column \(y=1\) are \(p(x_1, x_2, x_3|y=1)\).

The column-wise sum of values is 1:

  • \(\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3|y=0) = 1\) and
  • \(\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3|y=1) = 1\) 

Each entry is estimated from the training data, e.g.

p(x_1=0, x_2=0, x_3=0|y=0) = \frac{\sum_{i=1}^{n} \mathbb{1}(x_1^{(i)} = 0, x_2^{(i)}=0, x_3^{(i)}=0, y^{(i)}=0)}{\sum_{i=1}^{n} \mathbb{1}(y^{(i)}=0)}

For \(m=30\) binary features, total parameters for binary classification problem > 2 billion (1 billion per label)! 

  • Lots of parameters!
  • Need a lot of data to learn them without overfitting.


Simplify the problem with the naive Bayes assumption: features are conditionally independent given the label.

Conditional Independence (CI)

\begin{aligned} p(\mathbf{x}|y) &= p(x_1, x_2, \ldots, x_m|y) \\ &= p(x_1|y) p(x_2|x_1, y) \ldots p(x_m|x_1, x_2, \ldots, x_{m-1}, y) && \text{[Chain rule]} \\ &= \prod_{j=1}^{m} p(x_j|x_1, \ldots, x_{j-1}, y) \\ &= \prod_{j=1}^{m} p(x_j|y) && \text{[CI]} \\ \end{aligned}


Importance of CI

x_1 x_2 x_3 | y = 0 | y = 1
 0   0   0  |  xx   |
 0   0   1  |       |
 0   1   0  |       |
 0   1   1  |       |
 1   0   0  |       |
 1   0   1  |       |
 1   1   0  |       |
 1   1   1  |       |

  • \(n\) training examples with \(m=3\) features \(x_1, x_2, x_3\) and a label \(y \in \{0, 1\}\)
  • Each feature is a binary feature: \(x_i \in \{0, 1\}\)
  • In this simple case, the joint distribution \(p(\mathbf{x}, y) \) has  \(2^m\) parameters for each class. 
  • Total parameters = \(2 \times 2^m \).

Each entry of the table, such as the one marked xx, is estimated from the training data:

p(x_1=0, x_2=0, x_3=0, y=0) = \frac{\sum_{i=1}^{n} \mathbb{1}(x_1^{(i)} = 0, x_2^{(i)}=0, x_3^{(i)}=0, y^{(i)}=0)}{n}

Importance of CI

x_1 ... x_100 | y = 0 | y = 1
 0  ...   0   |       |
 0  ...   1   |       |
 0  ...   0   |       |
 0  ...   1   |       |
 :  ...   :   |       |
 1  ...   1   |       |

  • How many parameters do we need to learn for \(m=100\) binary features \(x_1, x_2, \ldots, x_{100}\) and a label \(y \in \{0, 1\}\)?
  • Total parameters = \(2 \times 2^{100} \).
  • Lots of parameters!
  • Need a lot of data to learn them without overfitting.

Parameter reduction with CI

  • Due to the CI assumption, we only need to learn the following per-feature tables for each class \(y \in \{y_1, y_2, \ldots, y_k\}\):

x_1 | y = 0 | y = 1
 0  |       |
 1  |       |

x_2 | y = 0 | y = 1
 0  |       |
 1  |       |

...

x_m | y = 0 | y = 1
 0  |       |
 1  |       |

  • Total parameters reduced from \(k \times 2^m\) to \(k \times 2m\) (compare with the single joint table for \(m = 100\) binary features above).


Types of Naive Bayes (NB)

  1. Bernoulli NB
  2. Categorical NB
  3. Multinomial NB
  4. Gaussian NB

Estimating \(w_{y_k}\)

We need to take care of an additional constraint: \(\sum_{c=1}^{k} w_{y_c} = 1\).

Calculate \( \frac{\partial l(\mathbf{w})}{\partial  w_{y_k}} \):

\begin{aligned} \frac{\partial l(\mathbf{w})}{\partial w_{y_k}} &= \frac{\partial}{\partial w_{y_k}} \left(\sum_{i=1}^{n} \text{log}\ w_{y^{(i)}} + \sum_{i=1}^{n} \sum_{j=1}^{m} x_j^{(i)} \text{log}\ w_{jy^{(i)}} + (1 - x_j^{(i)}) \text{log}\ (1 - w_{jy^{(i)}}) \right) \\ &= \sum_{i=1}^{n} \frac{\partial}{\partial w_{y_k}} \text{log}\ w_{y^{(i)}} + 0 \\ &= \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_k) \left( \frac{1}{w_{y_k}} \right) \\ \end{aligned}

The second term does not involve \(w_{y_k}\), so its derivative is \(0\); in the first term, only examples with \(y^{(i)} = y_k\) contribute.

Estimating \(w_{y_k}\)

Maximizing \(l(\mathbf{w})\) subject to the constraint using a Lagrange multiplier \(\lambda\):

\begin{aligned} \frac{\partial l(\mathbf{w})}{\partial w_{y_k}} + \lambda \frac{\partial \sum_{c} w_{y_c}}{\partial w_{y_k}} &= \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_k) \frac{1}{w_{y_k}} + \lambda = 0 \\ \lambda\ w_{y_k} &= - \sum_{i=1}^{n} \mathbb{1}(y^{(i)}=y_k) \\ w_{y_k} &= - \sum_{i=1}^{n} \frac{\mathbb{1}(y^{(i)}=y_k)}{\lambda} \\ \end{aligned}

Summing this expression over all \(k\) classes and using the constraint \(\sum_{c=1}^{k} w_{y_c} = 1\) gives \(1 = -n/\lambda\), i.e. \(\lambda = -n\).  Substituting it, we get

\begin{aligned} w_{y_k} &= - \sum_{i=1}^{n} \frac{\mathbb{1}(y^{(i)}=y_k)}{-n} = \sum_{i=1}^{n} \frac{\mathbb{1}(y^{(i)}=y_k)}{n}\\ \end{aligned}