Random Search on Error Surface
Try different functions \( f(x) \), such as \( \exp(x), \sin(x), x^3, \cdots \), and see how closely the Taylor series approximates the function at the point \( x = X \) and in its \( \epsilon \)-neighbourhood.
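Outside the interactive widget, the snippet below is a minimal way to do the same comparison numerically; the expansion point X, the neighbourhood width eps, and the choice of test functions are arbitrary illustrations, not part of the exercise.

import numpy as np

X, eps = 1.0, 0.5                       # expansion point and neighbourhood width (arbitrary choices)
xs = np.linspace(X - eps, X + eps, 5)   # a few points around X

for f, df, d2f, name in [
    (np.exp, np.exp, np.exp, "exp(x)"),
    (np.sin, np.cos, lambda x: -np.sin(x), "sin(x)"),
    (lambda x: x**3, lambda x: 3*x**2, lambda x: 6*x, "x^3"),
]:
    first = f(X) + df(X)*(xs - X)                 # first-order (linear) Taylor approximation
    second = first + 0.5*d2f(X)*(xs - X)**2       # second-order Taylor approximation
    print(name, np.max(np.abs(f(xs) - first)), np.max(np.abs(f(xs) - second)))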
For ease of notation, let \(\Delta \theta = u\); then from the Taylor series we have,
\[ \mathcal{L}(\theta + \eta u) = \mathcal{L}(\theta) + \eta\, u^T \nabla_{\theta}\mathcal{L}(\theta) + \frac{\eta^2}{2!}\, u^T \nabla^2_{\theta}\mathcal{L}(\theta)\, u + \frac{\eta^3}{3!}\cdots \]
\[ \mathcal{L}(\theta + \eta u) \approx \mathcal{L}(\theta) + \eta\, u^T \nabla_{\theta}\mathcal{L}(\theta) \]
[ \( \eta \) is typically small, so \( \eta^2, \eta^3, \cdots \rightarrow 0 \)]
Note that the move \(\eta u\) would be favorable only if
\[ \mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) < 0 \]
[i.e., if the new loss is less than the previous loss]
This implies,
\[ u^T \nabla_{\theta}\mathcal{L}(\theta) < 0 \]
But what is the range of values \( u^T \nabla_{\theta}\mathcal{L}(\theta) \) can take?
Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathcal{L}(\theta)\).
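The rest of the argument is a standard property of the dot product, stated here for completeness:
\[ -1 \le \cos\beta = \frac{u^T \nabla_{\theta}\mathcal{L}(\theta)}{\|u\|\,\|\nabla_{\theta}\mathcal{L}(\theta)\|} \le 1 \]
so \( u^T \nabla_{\theta}\mathcal{L}(\theta) \) is most negative when \( \cos\beta = -1 \), i.e., when \( \beta = 180^{\circ} \). The steepest decrease in the loss is therefore obtained by moving in the direction opposite to the gradient, which is exactly the update rule used below.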
Adjust the slider \(p\) to see the linear approximation of \(f(x)\) at the point \(p\).
Notice the gradient (\(dp\)) value.
Change the value of \(p\) according to the gradient value, that is, take a step \(p \pm dp\) (enter the new value for \(p\) in the input box only).
After a few adjustments, did you reach the local minimum?
If not, repeat the game and adjust your steps until you land in a local minimum. (A non-interactive version of the same game is sketched below.)
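The sketch below is a minimal, non-interactive version of this game; the function \(f(p) = p^2 + 1\), the starting point, and the step size eta are arbitrary choices for illustration, not part of the exercise above.

def f(p):            # an arbitrary function with its minimum at p = 0
    return p**2 + 1

def dp(p):           # its gradient, df/dp
    return 2*p

p, eta = 3.0, 0.1    # arbitrary starting point and step size
for i in range(25):
    p = p - eta*dp(p)     # step against the gradient, as in the slider game
print(p, f(p))            # p ends up close to the minimizer p = 0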
The update rule is therefore
\[ w_{t+1} = w_t - \eta \nabla w_t \]
\[ b_{t+1} = b_t - \eta \nabla b_t \]
where \( \nabla w_t = \frac{\partial \mathcal{L}(w,b)}{\partial w}\) at \(w=w_t, b=b_t \),
and \( \nabla b_t = \frac{\partial \mathcal{L}(w,b)}{\partial b}\) at \(w=w_t, b=b_t \).
So we now have a more principled way of moving in the \((w,b)\) plane than our “guess work” algorithm.
Let us create an algorithm for this rule ...
Algorithm: gradient_descent()
\(t \leftarrow 0;\)
max_iterations \(\leftarrow 1000\);
initialize \(w_0, b_0\);
while \(t <\) max_iterations do
\(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);
\(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);
\( t \leftarrow t+1;\)
end
To see this algorithm in practice, let us first derive \( \nabla w \) and \( \nabla b \) for our toy neural network.
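For the squared-error loss and the sigmoid \( f(x) = \frac{1}{1+e^{-(wx+b)}} \) used in the code below, the chain rule gives the following worked expressions (these are exactly what grad_w and grad_b compute for each training point):
\[ \mathcal{L}(w,b) = \frac{1}{2}\sum_{(x,y)} \big(f(x) - y\big)^2 \]
\[ \nabla w = \sum_{(x,y)} \big(f(x)-y\big)\, f(x)\,\big(1-f(x)\big)\, x \qquad \nabla b = \sum_{(x,y)} \big(f(x)-y\big)\, f(x)\,\big(1-f(x)\big) \]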
import numpy as np

X = [0.5, 2.5]   # inputs
Y = [0.2, 0.9]   # targets

def f(x, w, b):  # sigmoid with input x, parameters w, b
    return 1/(1 + np.exp(-(w*x + b)))

def error(w, b):  # squared-error loss over the whole dataset
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(x, w, b)
        err += (fx - y)**2
    return 0.5*err

def grad_b(x, w, b, y):  # partial derivative of the loss w.r.t. b for one point
    fx = f(x, w, b)
    return (fx - y)*fx*(1 - fx)

def grad_w(x, w, b, y):  # partial derivative of the loss w.r.t. w for one point
    fx = f(x, w, b)
    return (fx - y)*fx*(1 - fx)*x

def do_gradient_descent():
    w, b, eta, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):   # accumulate gradients over all training points
            dw += grad_w(x, w, b, y)
            db += grad_b(x, w, b, y)
        w = w - eta*dw           # gradient descent update
        b = b - eta*db
    return w, b                  # added so the learned parameters can be inspected
..........
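A quick sanity check (a minimal usage sketch; it assumes do_gradient_descent returns the final \(w, b\), as in the listing above):

print(error(-2, -2))                     # loss at the hard-coded starting point w = b = -2
w_learned, b_learned = do_gradient_descent()
print(error(w_learned, b_learned))       # loss after 1000 gradient descent updates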
[Figure: “Tower Maker” blocks]