Data matrix \(X \in \mathbb{R}^{d \times n}\): \(d\) features, \(n\) samples (each column is one sample).
Let's write it compactly: the sum of squared errors is \(f(w) = \frac{1}{2}\lVert X^T w - y \rVert^2\).
Set the gradient to zero to find \(w^*\): \(\nabla f(w) = XX^T w - Xy = 0\), giving \(w^* = (XX^T)^{-1} X y\).
Use the pseudo-inverse, \(w^* = (XX^T)^{+} X y\), if the inverse doesn't exist (the Moore-Penrose pseudo-inverse coincides with the ordinary inverse whenever the inverse exists).
Check the Hessian: it is \(XX^T\), which is always PSD, so any stationary point is a minimum; the solution is unique when \(XX^T\) is positive definite, i.e., invertible.
That is, the \(d\) rows of \(X\) (the features) have to be linearly independent, so that \(\mathrm{rank}(X) = d\).
\(X^T (XX^T)^{-1} X\), which maps \(y\) to the fitted values \(\hat{y} = X^T w^*\), is called the projection matrix: it projects \(y\) onto the column space of \(X^T\).
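To make the closed-form solution concrete, here is a minimal NumPy sketch (my own illustration, not from the notes), assuming the \(d \times n\) data-matrix convention above; the toy data and variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))                 # data matrix: d features x n samples
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: w* = (X X^T)^{-1} X y; pinv covers the singular case.
w_star = np.linalg.pinv(X @ X.T) @ X @ y

# Projection matrix: maps y to the fitted values y_hat = X^T w*.
P = X.T @ np.linalg.pinv(X @ X.T) @ X
assert np.allclose(P @ y, X.T @ w_star)
print(w_star)                               # close to w_true
```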
We are not sure whether the observed \(d\) features are sufficient to explain the observed \(y\).
We might have missed some features (correlated or uncorrelated) that could potentially relate to \(y\)
We could model this as a random variable \(\epsilon\) (or noise) that explains the deviation of the predictions from the observed values.
\(\epsilon\) could follow any distribution; it is generally assumed to be zero-mean Gaussian, \(\epsilon \sim \mathcal{N}(0, \sigma^2)\), if not specified otherwise.
The deterministic model \(y = w^T x\) gives a single fixed value of \(y\) for a given \(x\).
However, we can see that the predictions deviate from the actual values (the data may contain more than one value of \(y\) for the same \(x\)).
Now the prediction is the conditional expectation \(E[y \mid x]\), the value of \(y\) on average for the given \(x\).
This gives us \(y = w^T x + \epsilon\), i.e., \(y \mid x \sim \mathcal{N}(w^T x, \sigma^2)\), whose mean is \(w^T x\).
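As a quick sanity check of this model (a sketch of my own, with made-up numbers): repeated draws of \(y\) at a fixed \(x\) scatter around \(w^T x\), and their average approaches \(E[y \mid x] = w^T x\).

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([2.0, -1.0])
x = np.array([0.5, 1.5])
sigma = 0.3

# Many noisy observations of y at the same x: y = w^T x + eps, eps ~ N(0, sigma^2)
y_draws = w @ x + sigma * rng.normal(size=10_000)
print(w @ x)           # E[y | x] = w^T x
print(y_draws.mean())  # sample average, close to w^T x
```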
Likelihood of \(y\) given \(x\): \(p(y \mid X; w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right)\)
Maximize the log-likelihood: \(\log p(y \mid X; w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^T x_i)^2\)
Maximizing the log-likelihood over \(w\) is equivalent to minimizing the sum of squared errors, since the first term and the factor \(\frac{1}{2\sigma^2}\) do not depend on \(w\).
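A quick numerical check of this equivalence (a sketch under the Gaussian-noise assumption; the toy data, \(\sigma\), and the helper name are made up): minimize the negative log-likelihood and compare with the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
d, n = 3, 100
X = rng.normal(size=(d, n))
w_true = rng.normal(size=d)
y = X.T @ w_true + 0.2 * rng.normal(size=n)
sigma = 0.2

def neg_log_likelihood(w):
    # -log p(y | X; w) for y_i ~ N(w^T x_i, sigma^2)
    resid = y - X.T @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ls = np.linalg.pinv(X @ X.T) @ X @ y
print(np.allclose(w_mle, w_ls, atol=1e-4))  # True: the two estimates agree
```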
Complexity of the closed-form solution: \(O(d^3)\) to invert the \(d \times d\) matrix \(XX^T\), and it requires all \(n\) samples at once.
Initialize \(w\) randomly and compute the gradient \(\nabla f(w) = XX^T w - Xy\).
Update \(w\): \(w \leftarrow w - \eta\,(XX^T w - Xy)\) for some step size \(\eta\).
Repeat until a stopping criterion is met (e.g., a maximum number of iterations, or the loss/gradient falling below a threshold).
Go with stochastic gradient descent, which uses one (or a few) samples per update, if we do not have enough compute.
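Here is a minimal full-batch gradient-descent sketch for the least-squares loss \(\frac{1}{2}\lVert X^T w - y\rVert^2\) (my own illustration; the step size and stopping rule are arbitrary choices, and a stochastic variant would instead use a few random samples per step):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 200
X = rng.normal(size=(d, n))
y = X.T @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=n)

w = rng.normal(size=d)                    # random initialization
eta = 1.0 / np.linalg.norm(X @ X.T, 2)    # step size from the largest eigenvalue of X X^T
for _ in range(10_000):
    grad = X @ (X.T @ w - y)              # gradient of 1/2 ||X^T w - y||^2
    if np.linalg.norm(grad) < 1e-8:       # stop when the gradient is (nearly) zero
        break
    w -= eta * grad

print(w)  # close to the closed-form solution (X X^T)^{-1} X y
```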
Fit the data points globally (a single model is fit to, and used for, all samples).
We know that \(w^* = (XX^T)^{-1} X y\).
Computing \(XX^T\) (a \(d \times d\) matrix) is costly for \(d \gg n\).
Manipulating it further, \(w^*\) lies in the span of the data points: \(w^* = X\alpha\), with \(\alpha = K^{-1} y\) where \(K = X^T X\) is the \(n \times n\) Gram matrix.
During testing, we know that the prediction for a new \(x\) is \(w^{*T} x = \sum_{i=1}^{n} \alpha_i\, x_i^T x\): only inner products with the training points are needed, and replacing \(x_i^T x\) with a kernel \(k(x_i, x)\) (an inner product in a transformed feature space) gives kernel regression.
A kernel can map the data points to an infinite-dimensional feature space, which would require an infinite number of parameters in the transformed domain. So, does it always overfit the training data points?
No. It overfits when \(K\) has full rank (so its inverse exists and the training labels are reproduced exactly, since \(K\alpha = KK^{-1}y = y\)). In practice this rarely occurs; it happens only when the number of samples does not exceed the number of features in the transformed space.
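A sketch of the dual (kernel) form described above (my own illustration; the RBF kernel, the gamma value, and the toy data are arbitrary assumptions): fit \(\alpha\) from the \(n \times n\) Gram matrix and predict using only kernel evaluations.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # A: d x m, B: d x p  ->  m x p matrix of k(a_i, b_j) = exp(-gamma ||a_i - b_j||^2)
    sq_dists = np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(4)
d, n = 2, 100
X = rng.normal(size=(d, n))
y = np.sin(X[0]) + 0.1 * rng.normal(size=n)

K = rbf_kernel(X, X)               # n x n Gram matrix
alpha = np.linalg.pinv(K) @ y      # alpha = K^{-1} y; pinv covers rank-deficient K

X_test = rng.normal(size=(d, 5))
y_pred = rbf_kernel(X, X_test).T @ alpha   # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)
```

When \(K\) is full rank, \(K\alpha = y\) exactly, i.e., the training labels are interpolated, which is the overfitting case mentioned above.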