Machine Learning Techniques
Dr. Ashish Tendulkar
IIT Madras
Logistic regression is a classifier that can be applied in single-label or multi-label classification setups.
Logistic regression is a discriminative classifier.
As usual, we will discuss the five components of logistic regression, just like for any other ML model.
The first component is the training data.
Binary classification
Multiclass and multilabel classification
Feature matrix
Label matrix
Label vector
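To make these data components concrete, here is a minimal NumPy sketch with made-up numbers: a feature matrix and a label vector for the binary case, and a label matrix for a multilabel setup.

```python
import numpy as np

# Feature matrix X: n examples (rows) x m features (columns); values are made up.
X = np.array([[0.5, 1.2],
              [1.5, 0.3],
              [3.0, 2.2],
              [2.1, 3.3]])          # shape (n=4, m=2)

# Label vector y for binary classification: one {0, 1} label per example.
y = np.array([0, 0, 1, 1])          # shape (n,)

# Label matrix Y for a multilabel setup: one {0, 1} entry per (example, label) pair.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])           # shape (n, 3 labels)
```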
The second component is model.
Note that we will be focusing on the binary setting in this topic.
Multi-class logistic regression will be covered in the topic on the exponential family.
Logistic regression = linear combination + non-linear activation.
The model takes a feature vector \(\mathbf{x}\), forms a linear combination of features \(z = \mathbf{w}^T\mathbf{x}\) with the weight vector \(\mathbf{w}\), and applies a non-linear activation function \(g(z)\) to obtain the label probability \(\text{Pr}(y=1|\mathbf{x})\).
# parameters = \(m+1\), where \(m\) is the number of features.
Let's look at what the logistic (aka sigmoid) function looks like: \(g(z) = \frac{1}{1+e^{-z}}\), which maps any real \(z\) to a value in \((0, 1)\).
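A minimal NumPy sketch of the sigmoid and of the forward pass \(g(\mathbf{w}^T\mathbf{x})\); the weight and feature values are made up, and the leading 1 in \(\mathbf{x}\) plays the role of the bias feature.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weight vector and feature vector; x[0] = 1 acts as the bias feature,
# so there are m + 1 parameters for m features.
w = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, 1.5])

z = w @ x                 # linear combination z = w^T x
p = sigmoid(z)            # Pr(y = 1 | x)
print(z, p)               # e.g. 1.35 and ~0.794
```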
Let's look at a more general form of logistic regression - with feature transformation.
The feature vector \(\mathbf{x}\) is first mapped through a feature transformation \(\phi(\mathbf{x})\); the linear combination of features \(z = \mathbf{w}^T\phi(\mathbf{x})\) with the weight vector \(\mathbf{w}\) is then passed through the non-linear activation function \(g(z)\) to obtain the label probability \(\text{Pr}(y=1|\mathbf{x})\).
The feature transformation (e.g. polynomial) enables us to fit non-linear decision boundaries.
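A small sketch of this idea, assuming a hand-rolled degree-2 polynomial transform; the helper poly2_features and all weight values below are made up for illustration. The model stays linear in the transformed features, but the induced decision boundary in the original feature space is non-linear.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def poly2_features(x1, x2):
    """Degree-2 polynomial transform phi(x) of a 2-D input (including a bias term)."""
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Made-up weights over the transformed features; the boundary w^T phi(x) = 0
# is the circle x1^2 + x2^2 = 1 in the original feature space.
w = np.array([-1.0, 0.0, 0.0, 0.0, 1.0, 1.0])

for x1, x2 in [(0.0, 0.0), (2.0, 0.0), (0.5, 0.5)]:
    p = sigmoid(w @ poly2_features(x1, x2))   # Pr(y = 1 | x)
    print((x1, x2), round(p, 3))
```

With these particular weights the boundary \(\mathbf{w}^T\phi(\mathbf{x}) = 0\) is the circle \(x_1^2 + x_2^2 = 1\), which a plain linear boundary in \((x_1, x_2)\) could not represent.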
The learning problem here is to estimate the weight vector \(\mathbf{w}\) from the training data by minimizing a loss function through an appropriate optimization procedure.
Let's derive the loss function for logistic regression in the binary classification case.
Let's assume that
\(\text{Pr}(y=1|\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})\) and \(\text{Pr}(y=0|\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})\)
We can rewrite this as
\(\text{Pr}(y|\mathbf{x}) = \left(g(\mathbf{w}^T\mathbf{x})\right)^{y}\left(1 - g(\mathbf{w}^T\mathbf{x})\right)^{1-y}\)
For \(y=0\), this reduces to \(1 - g(\mathbf{w}^T\mathbf{x})\).
For \(y=1\), it reduces to \(g(\mathbf{w}^T\mathbf{x})\).
For \(n\) independently generated training examples, we can write the likelihood of the parameter vector as
\(L(\mathbf{w}) = \prod_{i=1}^{n} \left(g(\mathbf{w}^T\mathbf{x}^{(i)})\right)^{y^{(i)}}\left(1 - g(\mathbf{w}^T\mathbf{x}^{(i)})\right)^{1-y^{(i)}}\)
Taking log on both sides, as maximizing the log likelihood is easier:
\(l(\mathbf{w}) = \sum_{i=1}^{n} \left[y^{(i)}\log g(\mathbf{w}^T\mathbf{x}^{(i)}) + \left(1-y^{(i)}\right)\log\left(1 - g(\mathbf{w}^T\mathbf{x}^{(i)})\right)\right]\)
Our job is to find the parameter vector \(\mathbf{w}\) such that \(l(\mathbf{w})\) is maximized.
Equivalently, we can minimize the negative log likelihood (NLL) to maintain uniformity with other algorithms:
\(NLL(\mathbf{w}) = -l(\mathbf{w}) = -\sum_{i=1}^{n} \left[y^{(i)}\log g(\mathbf{w}^T\mathbf{x}^{(i)}) + \left(1-y^{(i)}\right)\log\left(1 - g(\mathbf{w}^T\mathbf{x}^{(i)})\right)\right]\)
The loss function is convex in \(\mathbf{w}\).
Binary cross entropy loss
Binary cross entropy loss with L2 regularization
Binary cross entropy loss with L1 regularization
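A sketch of the binary cross entropy (negative log likelihood) loss with optional L2 or L1 penalties; the helper name bce_loss, the data, and the regularization strength lam are made up, and the exact regularized form shown (penalty added with a single factor \(\lambda\), bias weight included) is one common convention rather than the only one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, X, y, reg=None, lam=0.1, eps=1e-12):
    """Binary cross entropy (NLL); optionally add an L2 or L1 penalty on w."""
    p = sigmoid(X @ w)                                   # Pr(y = 1 | x) per example
    loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    if reg == "l2":
        loss += lam * np.sum(w ** 2)
    elif reg == "l1":
        loss += lam * np.sum(np.abs(w))
    return loss

# Made-up data: X includes a leading column of ones for the bias weight.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 2.1]])
y = np.array([0, 0, 1, 1])
w = np.array([-1.0, 0.5])
print(bce_loss(w, X, y), bce_loss(w, X, y, reg="l2"), bce_loss(w, X, y, reg="l1"))
```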
Now that we have derived our loss function, let's focus on optimizing the loss function to obtain the weight vector.
We can use gradient descent to minimize the loss, that is, the negative log likelihood.
The weight update rule looks as follows:
\(\mathbf{w} := \mathbf{w} - \alpha \frac{\partial}{\partial \mathbf{w}} NLL(\mathbf{w})\), where \(\alpha\) is the learning rate.
We will use this in the next derivation.
Let's derive the partial derivative of the sigmoid function: \(\frac{\partial}{\partial z} g(z)\)
Remember: \(g(z) = \frac{1}{1+e^{-z}}\), so
\(\frac{\partial}{\partial z} g(z) = \frac{e^{-z}}{\left(1+e^{-z}\right)^2} = g(z)\left(1-g(z)\right)\)
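As a quick sanity check (not part of the derivation), the identity \(g'(z) = g(z)(1-g(z))\) can be compared against a finite-difference estimate at a few made-up points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = 1e-6
for z in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # finite-difference estimate
    analytic = sigmoid(z) * (1 - sigmoid(z))                # g(z) * (1 - g(z))
    print(z, round(numeric, 6), round(analytic, 6))
```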
We need to derive the partial derivative of the loss function w.r.t. the weight vector. Using the chain rule together with the sigmoid derivative above,
\(\frac{\partial}{\partial w_j} NLL(\mathbf{w}) = \sum_{i=1}^{n} \left(g(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}\)
The update rule becomes:
\(w_j := w_j - \alpha \sum_{i=1}^{n} \left(g(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}\)
It can be written in vectorized form as follows:
\(\mathbf{w} := \mathbf{w} - \alpha\, \mathbf{X}^T\left(g(\mathbf{X}\mathbf{w}) - \mathbf{y}\right)\)
where \(\mathbf{X}\) is the feature matrix and \(\mathbf{y}\) is the label vector.
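A minimal sketch of batch gradient descent with this vectorized update, assuming NumPy; the helper name fit_logistic_regression, the learning rate, the iteration count, and the toy dataset (whose first column is the bias feature) are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the NLL: w := w - lr * X^T (g(Xw) - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # vectorized gradient of the NLL
        w -= lr * grad
    return w

# Toy dataset (made up): leading column of ones is the bias feature.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic_regression(X, y)
print(w, sigmoid(X @ w))   # learned weights and Pr(y = 1 | x) per example
```

Replacing the full-batch gradient with a gradient computed on mini-batches or single examples would give the MBGD/SGD variants mentioned in the summary below.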
(1) Data: features and label (discrete)
(2) Model: linear combination of features + non-linear activation function
(3) Loss function: cross entropy loss
(4) Optimization procedure: GD/MBGD/SGD
(5) Evaluation: precision, recall, F1-score (see the sketch below)
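A small sketch of the evaluation step: computing precision, recall, and F1-score from made-up ground-truth labels and thresholded predictions, with class 1 as the positive class.

```python
import numpy as np

# Made-up ground-truth labels and thresholded predictions (Pr(y=1|x) >= 0.5 -> 1).
y_true = np.array([0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75, 0.75, 0.75 for this toy example
```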