Figure: a feedforward neural network with inputs \(x_1, x_2, \dots, x_n\), pre-activations \(a_1, a_2, a_3\), hidden layers \(h_1, h_2\), output \(h_L = \hat{y} = \hat{f}(x)\), weights \(W_1, W_2, W_3\), and biases \(b_1, b_2, b_3\). Each layer computes:
\(a_i(x) = b_i +W_ih_{i-1}(x)\)
\(h_i(x) = g(a_i(x))\)
\(f(x) = h_L(x)=O(a_L(x))\)
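A minimal numpy sketch of this forward pass, assuming a sigmoid \(g\) and a linear output \(O\) (both are illustrative choices, as are all names and sizes):

```python
import numpy as np

def g(z):
    # sigmoid activation -- one common choice for g
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """h_L = O(a_L) for a network with parameters W_1..W_L, b_1..b_L."""
    h = x                          # h_0 = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h              # a_i = b_i + W_i h_{i-1}
        h = g(a)                   # h_i = g(a_i)
    a_L = bs[-1] + Ws[-1] @ h      # output pre-activation
    return a_L                     # linear output: O(a_L) = a_L

# toy sizes: n = 4 inputs, two hidden layers of width 3, k = 2 outputs
rng = np.random.default_rng(0)
Ws = [rng.normal(size=s) for s in [(3, 4), (3, 3), (2, 3)]]
bs = [rng.normal(size=s) for s in [(3,), (3,), (2,)]]
print(forward(rng.normal(size=4), Ws, bs))
```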
In compact notation: \(a_i = b_i + W_i h_{i-1}\), \(h_i = g(a_i)\), and \(f(x) = h_L = O(a_L)\), with \(h_0 = x\).
For \(L = 3\) the full prediction is
\(\hat y_i = \hat{f}(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)\)
with parameters \(\theta = [W_1, ..., W_L, b_1, ..., b_L]\) (here \(L = 3\)). Training minimizes the squared error over the data:
\(\min_\theta \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)
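This objective in numpy, using the movie-ratings example that appears below (the predicted values are made up for illustration):

```python
import numpy as np

def mse(y_hat, y):
    # (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

y     = np.array([[7.5, 8.2, 7.7]])   # true ratings (N = 1 example, k = 3)
y_hat = np.array([[7.0, 8.0, 8.0]])   # network predictions (hypothetical)
print(mse(y_hat, y))                  # 0.38
```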
The parameters are learned by gradient descent:
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(w_0, b_0;\)
while \(t\)++ \(< max\_iterations\) do
  \(w_{t+1} \gets w_t - \eta \nabla w_t;\)
  \(b_{t+1} \gets b_t - \eta \nabla b_t;\)
end
Collecting the parameters into a single vector \(\theta\), the same loop reads:
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(\theta_0 = [w_0, b_0];\)
while \(t\)++ \(< max\_iterations\) do
  \(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end
And for the full network:
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
  \(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end
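A minimal numpy sketch of this loop on a toy quadratic loss (the loss, learning rate \(\eta\), and starting point are all assumptions for illustration):

```python
import numpy as np

def grad(theta):
    # gradient of the toy loss L(theta) = ||theta - target||^2
    target = np.array([1.0, -2.0])
    return 2.0 * (theta - target)

eta = 0.1                                # learning rate (assumed)
max_iterations = 1000
theta = np.zeros(2)                      # theta_0
for t in range(max_iterations):
    theta = theta - eta * grad(theta)    # theta_{t+1} <- theta_t - eta * grad(theta_t)
print(theta)                             # -> [1.0, -2.0], the minimizer
```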
Here \(\nabla \theta_t\) collects the partial derivative of the loss with respect to every entry of every weight matrix and every bias vector:
\(\nabla \theta = \left[ \cfrac {\partial \mathscr{L}(\theta)}{\partial W_{111}}, ..., \cfrac {\partial \mathscr{L}(\theta)}{\partial W_{1nn}},\ \cfrac {\partial \mathscr{L}(\theta)}{\partial W_{211}}, ..., \cfrac {\partial \mathscr{L}(\theta)}{\partial W_{2nn}},\ ...,\ \cfrac {\partial \mathscr{L}(\theta)}{\partial W_{L,11}}, ..., \cfrac {\partial \mathscr{L}(\theta)}{\partial W_{L,nk}},\ \cfrac {\partial \mathscr{L}(\theta)}{\partial b_{11}}, ..., \cfrac {\partial \mathscr{L}(\theta)}{\partial b_{Lk}} \right]^T\)
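In code, this amounts to flattening every \(\partial \mathscr{L}/\partial W_k\) and \(\partial \mathscr{L}/\partial b_k\) into one long vector (a sketch; the shapes are illustrative):

```python
import numpy as np

def flatten_grads(dWs, dbs):
    # stack all dL/dW_k entries, then all dL/db_k entries, into one vector
    return np.concatenate([g.ravel() for g in dWs + dbs])

dWs = [np.ones((3, 4)), np.ones((2, 3))]   # dL/dW_1, dL/dW_2 (toy values)
dbs = [np.ones(3), np.ones(2)]             # dL/db_1, dL/db_2
print(flatten_grads(dWs, dbs).shape)       # (23,) = 12 + 6 + 3 + 2
```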
\(\mathscr {L}(\theta) = \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)
Neural network with \(L - 1\) hidden layers
Example: for a movie, the input \(x_i\) encodes features such as isActor Damon and isDirector Nolan, and the target \(y_j = \{7.5,\ 8.2,\ 7.7\}\) holds the imdb Rating, Critics Rating, and RT Rating.
For classification, the same network with \(L - 1\) hidden layers predicts a probability distribution over classes, and the true output is a one-hot vector, e.g. \(y = [1\ 0\ 0\ 0]\).
\(\mathscr {L}(\theta) = - \displaystyle \sum_{c=1}^k y_c \log \hat y_c \)
For a one-hot \(y\) with true class \(\ell\), this reduces to \(-\log \hat y_\ell\), where
\(\hat y_\ell = [O(W_3 g(W_2 g(W_1 x + b_1) + b_2) + b_3)]_\ell\)
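The cross-entropy loss in numpy, with the one-hot label from above and a hypothetical network output:

```python
import numpy as np

def cross_entropy(y, y_hat):
    # L(theta) = -sum_c y_c * log(y_hat_c); for one-hot y this is -log(y_hat_l)
    return -np.sum(y * np.log(y_hat))

y     = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot true label (l = 0)
y_hat = np.array([0.7, 0.1, 0.1, 0.1])   # softmax output (hypothetical)
print(cross_entropy(y, y_hat))            # 0.3567 = -log(0.7)
```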
| Outputs | Real Values | Probabilities |
|---|---|---|
| Output Activation | Linear | Softmax |
| Loss Function | Squared Error | Cross Entropy |
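A sketch of the two output activations (the max-subtraction in softmax is a standard numerical-stability trick, not something the table prescribes):

```python
import numpy as np

def linear_output(a_L):
    # real-valued outputs, paired with squared error
    return a_L

def softmax(a_L):
    # probabilities, paired with cross entropy
    e = np.exp(a_L - np.max(a_L))   # subtract max for numerical stability
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1])
print(linear_output(a_L))           # [2.  1.  0.1]
print(softmax(a_L))                 # [0.659 0.242 0.099], sums to 1
```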
To compute \(\nabla \theta_t\), consider the chain of dependencies from a single first-layer weight \(W_{111}\) to the loss: \(x_1 \to a_{11} \to h_{11} \to a_{21} \to h_{21} \to \dots \to a_{L1} \to \hat y = \hat{f}(x) \to \mathscr{L}(\theta)\), with \(W_{111}, W_{211}, \dots, W_{L11}\) applied along the way. By the chain rule, \(\cfrac {\partial \mathscr{L}(\theta)}{\partial W_{111}}\) is a product of local derivatives along this path.
Figure: the network with the loss \(-\log \hat y_\ell\) attached at the output.
Intuition (chain rule): for a weight in the output layer, talk to the weight directly; for a weight deeper in the network, first talk to the output layer, then talk to the previous hidden layer, then the hidden layer before that, and now talk to the weights.
\( \cfrac {\partial}{\partial \hat y_i}(- \log \hat y_\ell) = \begin{cases} -\cfrac {1}{\hat y_\ell} & \text{if } i = \ell \\ 0 & \text{otherwise} \end{cases} \)
Putting it all together, gradient descent with backpropagation:
\(t \gets 0;\)
\(max\_iterations \gets 1000;\)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
  \(a_1, h_1, a_2, h_2, ..., a_{L-1}, h_{L-1}, a_L, \hat y = forward\_propagation(\theta_t);\)
  \(\nabla \theta_t = backward\_propagation(h_1, h_2, ..., h_{L-1}, a_1, a_2, ..., a_L, y, \hat y);\)
  \(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end

\(forward\_propagation(\theta_t)\), with \(h_0 = x\):
for \(k = 1\) to \(L-1\) do
  \(a_k = b_k + W_k h_{k-1};\)
  \(h_k = g(a_k);\)
end
\(a_L = b_L + W_L h_{L-1};\)
\(\hat y = O(a_L);\)

\(backward\_propagation(h_1,...,h_{L-1},a_1,...,a_L,y,\hat y)\), where \(e(y)\) is the one-hot encoding of \(y\):
\(\nabla _ {a_L} \mathscr {L} (\theta) = - (e(y) - \hat y);\)
for \(k = L\) to \(1\) do
  \(\nabla _ {W_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta)\, h_{k-1}^T;\)
  \(\nabla _ {b_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta);\)
  \(\nabla _ {h_{k-1}} \mathscr {L} (\theta) = W_k^T \nabla _ {a_k} \mathscr {L} (\theta);\)
  \(\nabla _ {a_{k-1}} \mathscr {L} (\theta) = \nabla _ {h_{k-1}} \mathscr {L} (\theta) \odot [...,g' (a_{k-1,j}),...];\)
end
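A minimal numpy sketch of one training step under these equations, assuming sigmoid hidden activations, softmax output \(O\), and cross-entropy loss (all names and sizes are illustrative):

```python
import numpy as np

def g(z):  return 1.0 / (1.0 + np.exp(-z))     # sigmoid
def gp(z): return g(z) * (1.0 - g(z))          # g'(z) = g(z)(1 - g(z))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def train_step(x, y_onehot, Ws, bs, eta=0.1):
    L = len(Ws)
    # forward propagation
    hs, as_ = [x], []                 # hs[0] = h_0 = x
    for k in range(L - 1):
        a = bs[k] + Ws[k] @ hs[-1]    # a_k = b_k + W_k h_{k-1}
        as_.append(a)
        hs.append(g(a))               # h_k = g(a_k)
    a_L = bs[-1] + Ws[-1] @ hs[-1]
    y_hat = softmax(a_L)              # y_hat = O(a_L)
    # backward propagation
    da = y_hat - y_onehot             # grad_{a_L} L = -(e(y) - y_hat)
    for k in range(L - 1, -1, -1):
        dW = np.outer(da, hs[k])      # grad_{W_k} L = grad_{a_k} L * h_{k-1}^T
        db = da                       # grad_{b_k} L = grad_{a_k} L
        if k > 0:
            dh = Ws[k].T @ da         # grad_{h_{k-1}} L = W_k^T grad_{a_k} L
            da = dh * gp(as_[k - 1])  # grad_{a_{k-1}} L = dh (Hadamard) g'(a_{k-1})
        Ws[k] -= eta * dW             # theta_{t+1} <- theta_t - eta * grad
        bs[k] -= eta * db
    return y_hat

# toy run: n = 4 inputs, one hidden layer of width 3, k = 2 classes
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
bs = [np.zeros(3), np.zeros(2)]
x, y = rng.normal(size=4), np.array([1.0, 0.0])
for _ in range(100):
    y_hat = train_step(x, y, Ws, bs)
print(y_hat)   # probability of the true class should approach 1
```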
The backward pass needs the derivative \(g'\) of the activation function. Logistic function:
\(g(z) = \sigma (z) = \cfrac {1}{1+e^{-z}}\)
\(g'(z) = (-1) \cfrac {1}{(1+e^{-z})^2} \cfrac {d}{dz} (1+e^{-z})\)
\(= (-1) \cfrac {1}{(1+e^{-z})^2} (-e^{-z})\)
\(= \cfrac {1}{1+e^{-z}} \cdot \cfrac {1+e^{-z}-1}{1+e^{-z}}\)
\(= g(z)(1-g(z))\)
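A quick numerical check of \(g'(z) = g(z)(1 - g(z))\) against a finite difference (the test point and step are arbitrary):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.5, 1e-6
analytic = g(z) * (1.0 - g(z))                    # g'(z) = g(z)(1 - g(z))
numeric  = (g(z + eps) - g(z - eps)) / (2 * eps)  # central difference
print(analytic, numeric)                          # both ~0.2350
```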
tanh function:
\(g(z) = \tanh (z) = \cfrac {e^z-e^{-z}}{e^z+e^{-z}}\)
\(g'(z) = \cfrac {(e^z+e^{-z}) \frac {d}{dz}(e^z-e^{-z}) - (e^z-e^{-z}) \frac {d}{dz} (e^z+e^{-z})}{(e^z+e^{-z})^2}\)
\(=\cfrac {(e^z+e^{-z})^2-(e^z-e^{-z})^2}{(e^z+e^{-z})^2}\)
\(=1- \cfrac {(e^z-e^{-z})^2}{(e^z+e^{-z})^2}\)
\(=1-(g(z))^2\)
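The same check for \(g'(z) = 1 - (g(z))^2\):

```python
import numpy as np

z, eps = 0.5, 1e-6
analytic = 1.0 - np.tanh(z) ** 2                            # g'(z) = 1 - g(z)^2
numeric  = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print(analytic, numeric)                                    # both ~0.7864
```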