Logistic Regression

Machine Learning Techniques

Dr. Ashish Tendulkar

IIT Madras

Logistic Regression

  • Logistic regression is a classifier that can be applied in single-label or multi-label classification setups.

    • Logistic regression is a discriminative classifier.

  • It obtains the probability of a sample belonging to a specific class by computing the sigmoid (aka logistic function) of a linear combination of features.
  • The weight vector for the linear combination is learnt via model training.

As usual, we will discuss the five components of logistic regression, just like any other ML model.

The first component is the training data.

Training Data

Binary classification

D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n}

  • Shape of feature matrix \(\mathbf{X}\): \((n, m)\)
  • Shape of label vector \(\mathbf{y}\): \((n, )\)

Multiclass and multilabel classification

D = \left\{ (\mathbf{X}, \mathbf{Y})\right\} = \left\{ (\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\right\}_{i=1}^{n}

  • Shape of feature matrix \(\mathbf{X}\): \((n, m)\)
  • Shape of label matrix \(\mathbf{Y}\): \((n, k)\)
  • Shape of each label vector \(\mathbf{y}^{(i)}\): \((k,)\)
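A minimal sketch of these shapes with NumPy; the array names and the sizes n, m, k below are illustrative assumptions, not values from the slides:

```python
import numpy as np

n, m, k = 5, 3, 4  # assumed sizes: n examples, m features, k classes

# Binary classification: feature matrix (n, m), label vector (n,)
X = np.random.randn(n, m)
y = np.random.randint(0, 2, size=n)
print(X.shape, y.shape)   # (5, 3) (5,)

# Multilabel classification: feature matrix (n, m), label matrix (n, k)
Y = np.random.randint(0, 2, size=(n, k))
print(X.shape, Y.shape)   # (5, 3) (5, 4)
print(Y[0].shape)         # (4,) -- label vector of a single example
```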

The second component is the model.

Note that we will be focusing on the binary setting in this topic.

Multi-class logistic regression will be covered in the topic on the exponential family.

\text{Pr}(y=1|\mathbf{x}) = g(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \text{exp}(-\mathbf{w}^T\mathbf{x})}

Model \(h_\mathbf{w}(\mathbf{x})\)

The feature vector \(\mathbf{x}\) is combined linearly using the weight vector \(\mathbf{w}\), and a non-linear activation function is applied to this linear combination of features to obtain the probability of the label:

\(\mathbf{x} \;\rightarrow\; z = \mathbf{w}^T\mathbf{x} \;\rightarrow\; g(z) \;\rightarrow\; \text{Pr}(y=1|\mathbf{x})\)

# parameters = \(m+1\), where \(m\) is #features.

Logistic regression = linear combination of features + non-linear activation

  • Binary classification: Sigmoid
  • Multi-class classification: Softmax
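A minimal sketch of this forward pass for the binary case, assuming NumPy; the toy values of \(\mathbf{w}\) and \(\mathbf{x}\) are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic/sigmoid activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """Pr(y=1|x): linear combination of features followed by sigmoid."""
    z = w @ x            # z = w^T x
    return sigmoid(z)

# Toy example: m = 2 features plus a bias term folded into w and x.
w = np.array([0.5, -1.2, 0.3])   # (m + 1,) parameters
x = np.array([1.0, 2.0, -0.5])   # first entry 1.0 acts as the bias feature
print(predict_proba(w, x))       # a probability in (0, 1)
```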

Let's look at what the logistic (aka sigmoid) function looks like.

  • x-axis is a linear combination of features: \(z = \mathbf{w}^T \mathbf{x}\)
  • y-axis is \(g(z)\) - the output of the logistic/sigmoid function.
  • As \(z \rightarrow \infty \), \(g(z) \rightarrow 1\).
  • As \(z \rightarrow -\infty \), \(g(z) \rightarrow 0\).
  • For \(z = 0\), \(g(z) = 0.5\).
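A quick numerical check of these properties; the specific z values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))    # ~0.99995 -> approaches 1 as z grows
print(sigmoid(-10.0))   # ~0.00005 -> approaches 0 as z decreases
print(sigmoid(0.0))     # 0.5 exactly at z = 0
```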

Let's look at a more general form of logistic regression - with feature transformation.

\text{Pr}(y=1|\mathbf{x}) = g(\mathbf{w}^T \phi(\mathbf{x})) = \frac{1}{1 + \text{exp}(-\mathbf{w}^T \phi(\mathbf{x}))}

Model \(h_\mathbf{w}(\mathbf{x})\) with feature transformation

The feature vector \(\mathbf{x}\) is first passed through a feature transformation \(\phi\); the linear combination \(\mathbf{w}^T \phi(\mathbf{x})\) with the weight vector \(\mathbf{w}\) is then fed to the non-linear activation function to obtain the probability of the label.

Feature transformation (e.g. polynomial) enables us to fit non-linear decision boundaries, as the sketch below illustrates.
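A minimal sketch of the transformed model, assuming a hand-rolled degree-2 polynomial map \(\phi\) for a 2-feature input; the map and the toy values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    """Degree-2 polynomial feature map for a 2-feature input:
    [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def predict_proba_transformed(w, x):
    """Pr(y=1|x) = g(w^T phi(x))."""
    return sigmoid(w @ phi(x))

w = np.array([-0.5, 1.0, -2.0, 0.3, 0.3, 0.1])  # one weight per transformed feature
x = np.array([1.5, -0.5])
print(predict_proba_transformed(w, x))
```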

The learning problem here is to estimate the weight vector \(\mathbf{w}\) based on the training data by minimizing the loss function through an appropriate optimization procedure.

\text{Pr}(y=1|\mathbf{x}) = g(\mathbf{w}^T \phi(\mathbf{x})) = \frac{1}{1 + \text{exp}(-\mathbf{w}^T \phi(\mathbf{x}))}
\text{Pr}(y=1|\mathbf{x}) = g(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \text{exp}(-\mathbf{w}^T\mathbf{x})}

Let's derive the loss function for logistic regression in the binary classification case.

Let's assume that:

\begin{aligned} \text{Pr}(y=1|\mathbf{x}; \mathbf{w}) &= h_\mathbf{w}(\mathbf{x}) \\ \text{Pr}(y=0|\mathbf{x}; \mathbf{w}) &= \left(1 - h_\mathbf{w}(\mathbf{x}) \right) \end{aligned}

We can rewrite this as 

\text{Pr}(y|\mathbf{x}; \mathbf{w}) = (h_\mathbf{w}(\mathbf{x}))^y (1 - h_\mathbf{w}(\mathbf{x}))^{(1-y)}

For \(y=0\)

\begin{aligned} \text{Pr}(y=0|\mathbf{x}; \mathbf{w}) &= (h_\mathbf{w}(\mathbf{x}))^0 (1 - h_\mathbf{w}(\mathbf{x}))^{(1-0)} \\ &= (1 - h_\mathbf{w}(\mathbf{x})) \end{aligned}

For \(y=1\)

\begin{aligned} \text{Pr}(y=1|\mathbf{x}; \mathbf{w}) &= (h_\mathbf{w}(\mathbf{x}))^1 (1 - h_\mathbf{w}(\mathbf{x}))^{(1-1)} \\ &= h_\mathbf{w}(\mathbf{x}) \end{aligned}

For \(n\) independently generated training examples, we can write the likelihood of parameter vector as 

\begin{aligned} L(\mathbf{w}) &= p(\mathbf{y}|\mathbf{X}; \mathbf{w}) \\ &= \prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)}; \mathbf{w}) \\ &= \prod_{i=1}^{n} \left( h_\mathbf{w}(\mathbf{x}^{(i)})\right) ^{y^{(i)}} \left( 1-h_\mathbf{w}(\mathbf{x}^{(i)})\right)^{1-y^{(i)}} \\ \end{aligned}

Taking log on both sides, as maximizing the log likelihood is easier.

\begin{aligned} \text{log}(L(\mathbf{w})) &= \text{log} \left(\prod_{i=1}^{n} \left( h_\mathbf{w}(\mathbf{x}^{(i)})\right) ^{y^{(i)}} \left( 1-h_\mathbf{w}(\mathbf{x}^{(i)})\right)^{1-y^{(i)}} \right)\\ l(\mathbf{w}) &= \sum_{i=1}^{n} y^{(i)} \text{log} \left( h(\mathbf{x}^{(i)}) \right) + (1 - y^{(i)}) \text{log} \left(1- h(\mathbf{x}^{(i)}) \right) \end{aligned}

Our job is to find the parameter vector \(\mathbf{w}\) such that the \(l(\mathbf{w})\) is maximized.

Equivalently we can minimize the negative log likelihood (NLL) to maintain uniformity with other algorithms:

\begin{aligned} J(\mathbf{w}) &= -l(\mathbf{w}) \\ &= -\sum_{i=1}^{n} y^{(i)}\ \text{log} \left( h(\mathbf{x}^{(i)}) \right) + (1 - y^{(i)})\ \text{log} \left(1- h(\mathbf{x}^{(i)}) \right) \end{aligned}

Loss function is convex.

Binary cross entropy loss

\begin{aligned} J(\mathbf{w}) &= -\sum_{i=1}^{n} y^{(i)}\ \text{log} \left( h(\mathbf{x}^{(i)}) \right) + (1 - y^{(i)})\ \text{log} \left(1- h(\mathbf{x}^{(i)}) \right) \end{aligned}

Binary cross entropy loss with L2 regularization

\begin{aligned} J(\mathbf{w}) &= -\sum_{i=1}^{n} y^{(i)}\ \text{log} \left( h(\mathbf{x}^{(i)}) \right) + (1 - y^{(i)})\ \text{log} \left(1- h(\mathbf{x}^{(i)}) \right) + \frac{\lambda}{2} ||\mathbf{w}||^2 \end{aligned}

Binary cross entropy loss with L1 regularization

\begin{aligned} J(\mathbf{w}) &= -\sum_{i=1}^{n} y^{(i)}\ \text{log} \left( h(\mathbf{x}^{(i)}) \right) + (1 - y^{(i)})\ \text{log} \left(1- h(\mathbf{x}^{(i)}) \right) + \lambda ||\mathbf{w}||_1 \end{aligned}
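A minimal sketch of the binary cross entropy loss with an optional L2 penalty, assuming NumPy arrays for labels and predicted probabilities; the small epsilon clamp is an implementation detail added here for numerical safety, not something from the slides:

```python
import numpy as np

def bce_loss(y, y_hat, w=None, lam=0.0, eps=1e-12):
    """Negative log likelihood (binary cross entropy), optionally with L2 penalty."""
    y_hat = np.clip(y_hat, eps, 1 - eps)       # avoid log(0)
    nll = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if w is not None and lam > 0:
        nll += 0.5 * lam * np.sum(w ** 2)      # (lambda / 2) * ||w||^2
    return nll

# Toy check: confident correct predictions give a small loss.
y     = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.1, 0.8])
print(bce_loss(y, y_hat))
```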

Now that we have derived our loss function, let's focus on optimizing the loss function to obtain the weight vector.

\begin{aligned} \mathbf{w} = \argmin_\mathbf{w} J(\mathbf{w}) \end{aligned}

We can use gradient descent to minimize the loss, i.e., the negative log likelihood.

The weight update rule looks as follows:

\mathbf{w} := \mathbf{w} - \alpha \frac{\partial}{\partial{\mathbf{w}}} \left( J(\mathbf{w}) \right)
We need to derive the partial derivative of the loss function w.r.t. the weight vector.

Let's first derive the partial derivative of the sigmoid function: \(\frac{\partial}{\partial z} g(z)\)

Remember:

g(z) = \frac{1}{1 + \text{exp}(-z)}

\begin{aligned} \frac{\partial}{\partial {z}} \frac{1}{1 + \text{exp}(-z)} &= \frac{1}{(1 + \text{exp}(-z))^2} (\text{exp}(-z)) \\ &= \color{blue}{\frac{1}{(1 + \text{exp}(-z))}} \color{red}{\left( 1 - \frac{1}{(1 + \text{exp}(-z))} \right)} \\ &= \color{blue}{g(z)} \color{red}{ (1-g(z))} \end{aligned}

We will use this in the derivation below.

\begin{aligned} \dfrac{\partial J(\mathbf{w}) } {\partial w_j} &= -\dfrac{\partial }{\partial w_j} \sum_{i=1}^{n} y^{(i)} \text{log}(h_{\mathbf{w}}(\mathbf{x}^{(i)}))+ (1-y^{(i)})\text{log}(1-h_{\mathbf{w}}(\mathbf{x}^{(i)})) \\ &= -\dfrac{\partial }{\partial w_j} \sum_{i=1}^{n} y^{(i)} \text{ log } g(\mathbf{w}^T \mathbf{x}^{(i)}) + (1-y^{(i)}) \text{ log }(1- g(\mathbf{w}^T \mathbf{x}^{(i)})) \\ &= - \sum_{i=1}^{n} \left(y^{(i)} \frac{1}{g(\mathbf{w}^T \mathbf{x}^{(i)})} - (1-y^{(i)}) \frac{1}{1 - g(\mathbf{w}^T \mathbf{x}^{(i)})} \right) \color{blue}{\frac{\partial}{\partial w_j} g(\mathbf{w}^T \mathbf{x}^{(i)}) } \color{black} \\ &= - \sum_{i=1}^{n} \left(y^{(i)} \frac{1}{g(\mathbf{w}^T \mathbf{x}^{(i)})} - (1-y^{(i)}) \frac{1}{1 - g(\mathbf{w}^T \mathbf{x}^{(i)})} \right) \color{red}{g(\mathbf{w}^T \mathbf{x}^{(i)}) (1 - g(\mathbf{w}^T \mathbf{x}^{(i)}))} \color{blue}{\frac{\partial}{\partial w_j} \mathbf{w}^T \mathbf{x}^{(i)}} \color{black} \\ &= -\sum_{i=1}^{n} \left(y^{(i)} (1 - g(\mathbf{w}^T \mathbf{x}^{(i)})) - (1-y^{(i)}) g(\mathbf{w}^T \mathbf{x}^{(i)}) \right) x_j^{(i)} \\ &= -\sum_{i=1}^{n} \left( y^{(i)} - g(\mathbf{w}^T \mathbf{x}^{(i)}) \right) x_j^{(i)} \\ &= -\sum_{i=1}^{n} \left( y^{(i)} - h_\mathbf{w}(\mathbf{x}^{(i)}) \right) x_{j}^{(i)} \\ &= \sum_{i=1}^{n}\left(h_\mathbf{w}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \end{aligned}

The update rule becomes:

w_j := w_j - \alpha \left( \sum_{i=1}^{n} \left( h_\mathbf{w}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \right)

It can be written in vectorized form as follows:

\begin{aligned} \mathbf{w} &:= \mathbf{w}- \alpha \left( \mathbf{X}^T \left( h_\mathbf{w}(\mathbf{X}) - \mathbf{y} \right) \right) \\ &:= \mathbf{w} - \alpha \left( \mathbf{X}^T \left( g(\mathbf{X}\mathbf{w}) - \mathbf{y} \right) \right) \\ \end{aligned}
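A minimal sketch of this vectorized gradient descent loop, assuming NumPy; the learning rate, iteration count, and toy dataset below are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=500):
    """Learn w by gradient descent: w := w - alpha * X^T (g(Xw) - y)."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)
        w -= alpha * grad
    return w

# Toy usage on a tiny separable dataset (bias folded in as a column of ones).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1, 1, 0, 0])
w = fit_logistic_regression(X, y)
print(w, sigmoid(X @ w))
```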

Inference

y = \begin{cases} 1, & \text{if}\ \text{Pr}(y=1|\mathbf{x}) \gt 0.5 \\ 0, & \text{otherwise}. \end{cases}
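A small sketch of this decision rule; the weights and features below are illustrative values only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X, threshold=0.5):
    """Label 1 if Pr(y=1|x) > threshold, else 0."""
    return (sigmoid(X @ w) > threshold).astype(int)

# Toy weights and features (illustrative values only).
w = np.array([0.2, 1.5])
X = np.array([[1.0, 0.5], [1.0, -1.0]])
print(predict(w, X))   # array([1, 0])
```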

Evaluation metrics

  • Confusion matrix
  • Precision, recall, F1 scores, accuracy
  • AUC of ROC and PR curves
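A minimal sketch of computing these metrics, assuming scikit-learn is installed and using hypothetical y_true / y_pred / y_score arrays:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0]             # ground-truth labels (illustrative)
y_pred  = [1, 0, 0, 1, 0]             # thresholded predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.3]   # Pr(y=1|x) from the model

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred), accuracy_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))   # AUC of the ROC curve
```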

Logistic regression: Recap

(1) Data: Features and label (discrete)

(2) Model: Linear combination of features + non-linear activation function

(3) Loss function: Cross entropy loss

(4) Optimization procedure: GD/MBGD/SGD

(5) Evaluation: Precision, recall, F1-score
