Logistic Regression
Machine Learning Techniques
Dr. Ashish Tendulkar
IIT Madras
Logistic Regression
- Logistic regression is a classifier that can be applied in single-label or multi-label classification setups.
- Logistic regression is a discriminative classifier.
- It obtains the probability of a sample belonging to a specific class by computing the sigmoid (aka logistic function) of a linear combination of the features.
- The weight vector for the linear combination is learnt via model training.
As usual, we will discuss the five components of logistic regression, just like for any other ML model.
The first component is the training data.
Training Data
- Binary classification
  - Shape of feature matrix: \((n, m)\)
  - Shape of label vector: \((n,)\)
- Multiclass and multilabel classification
  - Shape of feature matrix: \((n, m)\)
  - Shape of label matrix: \((n, k)\), where \(k\) is the number of classes/labels; each example's label vector has shape \((k,)\)
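A minimal sketch of these shapes with toy NumPy arrays (the sizes \(n=4\), \(m=3\), \(k=2\) are made up for illustration):

```python
import numpy as np

n, m, k = 4, 3, 2                        # n examples, m features, k classes

X = np.random.randn(n, m)                # feature matrix, shape (n, m)
y_bin = np.array([0, 1, 1, 0])           # binary label vector, shape (n,)
Y_multi = np.array([[1, 0], [0, 1],
                    [0, 1], [1, 0]])     # one-hot label matrix, shape (n, k)

print(X.shape, y_bin.shape, Y_multi.shape)   # (4, 3) (4,) (4, 2)
```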
The second component is the model.
Note that we will be focusing on the binary setting in this topic.
Multi-class logistic regression will be covered under the exponential family.
Model \(h_\mathbf{w}(\mathbf{x})\)
- Feature vector: \(\mathbf{x}\)
- Linear combination of features (with weight vector \(\mathbf{w}\)): \(z = \mathbf{w}^T\mathbf{x}\)
- Non-linear activation function: \(g(z)\)
- Label: \(\text{Pr}(y=1|\mathbf{x}) = h_\mathbf{w}(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})\)
- # parameters = \(m+1\), where \(m\) is #features (the extra parameter is the bias term).
Logistic regression = Linear combination of features + Non-linear activation
- Binary classification: Sigmoid
- Multi-class classification: Softmax
Let's look at what the logistic (aka sigmoid) function looks like.
- The x-axis is the linear combination of features: \(z = \mathbf{w}^T \mathbf{x}\).
- The y-axis is \(g(z) = \frac{1}{1 + e^{-z}}\), the output of the logistic/sigmoid function.
- As \(z \rightarrow \infty\), \(g(z) \rightarrow 1\).
- As \(z \rightarrow -\infty\), \(g(z) \rightarrow 0\).
- For \(z = 0\), \(g(z) = 0.5\).
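A quick numerical check of these properties (a minimal sketch; the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))
# approx [4.5e-05, 0.269, 0.5, 0.731, 0.99995]:
# approaches 0 and 1 at the extremes, and equals exactly 0.5 at z = 0
```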
Let's look at a more general form of logistic regression, with feature transformation.
Model \(h_\mathbf{w}(\mathbf{x})\) with feature transformation
- Feature vector: \(\mathbf{x}\)
- Feature transformation: \(\phi(\mathbf{x})\)
- Linear combination of transformed features (with weight vector \(\mathbf{w}\)): \(z = \mathbf{w}^T\phi(\mathbf{x})\)
- Non-linear activation function: \(g(z)\)
- Label: \(\text{Pr}(y=1|\mathbf{x}) = g(\mathbf{w}^T\phi(\mathbf{x}))\)

The feature transformation (e.g. polynomial) enables us to fit non-linear decision boundaries, as the sketch below illustrates.
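As an illustrative sketch (using scikit-learn, which is an assumption on tooling; any polynomial expansion would do), polynomial features let logistic regression separate data that is not linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data: points inside a circle are class 1, outside are class 0.
# No straight line separates them, but a degree-2 transform does.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# phi(x) adds terms like x1^2, x1*x2, x2^2 before the linear model.
model = make_pipeline(PolynomialFeatures(degree=2),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```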
The learning problem here is to estimate the weight vector \(\mathbf{w}\) from the training data by minimizing a loss function through an appropriate optimization procedure.
Let's derive the loss function for logistic regression for the binary classification problem.
Let's assume that
\[ \text{Pr}(y=1|\mathbf{x}; \mathbf{w}) = h_\mathbf{w}(\mathbf{x}), \qquad \text{Pr}(y=0|\mathbf{x}; \mathbf{w}) = 1 - h_\mathbf{w}(\mathbf{x}) \]
We can rewrite this compactly as
\[ \text{Pr}(y|\mathbf{x}; \mathbf{w}) = \left(h_\mathbf{w}(\mathbf{x})\right)^{y} \left(1 - h_\mathbf{w}(\mathbf{x})\right)^{1-y} \]
- For \(y=0\), this reduces to \(1 - h_\mathbf{w}(\mathbf{x})\).
- For \(y=1\), this reduces to \(h_\mathbf{w}(\mathbf{x})\).
For \(n\) independently generated training examples, we can write the likelihood of the parameter vector as
\[ L(\mathbf{w}) = \prod_{i=1}^{n} \left(h_\mathbf{w}(\mathbf{x}^{(i)})\right)^{y^{(i)}} \left(1 - h_\mathbf{w}(\mathbf{x}^{(i)})\right)^{1-y^{(i)}} \]
Taking log on both sides, as maximizing the log-likelihood is easier:
\[ l(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} \left[ y^{(i)} \log h_\mathbf{w}(\mathbf{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\mathbf{w}(\mathbf{x}^{(i)})\right) \right] \]
Our job is to find the parameter vector \(\mathbf{w}\) such that \(l(\mathbf{w})\) is maximized.
Equivalently, we can minimize the negative log-likelihood (NLL) to maintain uniformity with other algorithms:
\[ J(\mathbf{w}) = -l(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\mathbf{w}(\mathbf{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\mathbf{w}(\mathbf{x}^{(i)})\right) \right] \]
The loss function is convex.
- Binary cross entropy loss: \(J(\mathbf{w})\) as above.
- Binary cross entropy loss with L2 regularization: \(J(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2\)
- Binary cross entropy loss with L1 regularization: \(J(\mathbf{w}) + \lambda \|\mathbf{w}\|_1\)
where \(\lambda \geq 0\) is the regularization rate.
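A minimal NumPy sketch of the binary cross entropy loss (the `eps` clipping is an implementation guard I'm adding to avoid \(\log(0)\), not part of the derivation):

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Negative log-likelihood / binary cross entropy.

    y_true: labels in {0, 1}, shape (n,)
    y_prob: predicted Pr(y=1|x), shape (n,)
    """
    p = np.clip(y_prob, eps, 1 - eps)    # guard against log(0)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7])
print(bce_loss(y_true, y_prob))   # small loss: predictions match labels well
```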
Now that we have derived our loss function, let's focus on optimizing it to obtain the weight vector.
We can use gradient descent to minimize the loss, that is, the negative log-likelihood.
The weight update rule looks as follows:
\[ \mathbf{w} := \mathbf{w} - \alpha \nabla_\mathbf{w} J(\mathbf{w}) \]
where \(\alpha\) is the learning rate.
We will use this in the next derivation.
Let's derive the partial derivative of the sigmoid function: \(\frac{\partial}{\partial z} g(z)\)
Remember: \(g(z) = \frac{1}{1 + e^{-z}}\). Then
\[ \frac{\partial}{\partial z} g(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right) = g(z)\left(1 - g(z)\right) \]
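A quick finite-difference sanity check of this identity (the test points and step size are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))                # g(z) * (1 - g(z))
print(np.allclose(numeric, analytic, atol=1e-8))        # True
```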
We need to derive the partial derivative of the loss function w.r.t. the weight vector. Using the chain rule and \(g'(z) = g(z)(1 - g(z))\), we get
\[ \frac{\partial J(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{n} \left(h_\mathbf{w}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} \]
The update rule becomes:
\[ w_j := w_j - \alpha \sum_{i=1}^{n} \left(h_\mathbf{w}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} \]
It can be written in vectorized form as follows:
\[ \mathbf{w} := \mathbf{w} - \alpha \, \mathbf{X}^T \left(g(\mathbf{X}\mathbf{w}) - \mathbf{y}\right) \]
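Putting the pieces together, a minimal sketch of batch gradient descent for logistic regression (the learning rate, iteration count, and toy data are arbitrary choices; a bias term is folded in as a constant feature, and the gradient is averaged over \(n\), which just rescales \(\alpha\) relative to the sum form above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood.

    X: feature matrix of shape (n, m); y: labels in {0, 1}, shape (n,).
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        grad = Xb.T @ (sigmoid(Xb @ w) - y)         # X^T (g(Xw) - y)
        w -= alpha * grad / len(y)                  # averaged gradient step
    return w

# Toy 1-D data: class 1 for larger feature values.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
probs = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w)
print((probs >= 0.5).astype(int))   # expected: [0 0 0 1 1 1]
```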
Inference: predict \(\hat{y} = 1\) if \(h_\mathbf{w}(\mathbf{x}) \geq 0.5\) (equivalently, \(\mathbf{w}^T\mathbf{x} \geq 0\)), else \(\hat{y} = 0\).
Evaluation metrics
- Confusion matrix
- Precision, recall, F1 scores, accuracy
- AUC of ROC and PR curves
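These metrics are all available off the shelf; a brief sketch with scikit-learn (an assumption on tooling; the labels and scores below are made-up toy values):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1])   # predicted Pr(y=1|x)
y_pred = (y_prob >= 0.5).astype(int)                # threshold at 0.5

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))  # uses scores, not labels
```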
Logistic regression: Recap
(1) Data: Features and label (discrete)
(2) Model: Linear combination of features + non-linear activation function
(3) Loss function: Cross entropy loss
(4) Optimization procedure: GD/MBGD/SGD
(5) Evaluation: Precision, recall, F1-score