\(\nabla_{a_{k-1}} \mathscr{L}(\theta) = \nabla_{h_{k-1}} \mathscr{L}(\theta) \odot [\ldots, g'(a_{k-1,j}), \ldots];\)
\(\nabla _ {W_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta) h_{k-1}^T ;\)
\(\nabla _ {b_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta) ;\)
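These backprop identities can be verified numerically. A minimal sketch (numpy; the toy layer sizes and the squared-error loss are my own assumptions, not from the slides), comparing the analytic \(\nabla_{W_k} \mathscr{L}\) against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 1))        # h_{k-1}: previous-layer activations
W = rng.standard_normal((3, 4))        # W_k
b = rng.standard_normal((3, 1))        # b_k
t = rng.standard_normal((3, 1))        # target for a toy squared-error loss

def loss(W):
    a = W @ h + b                      # a_k = W_k h_{k-1} + b_k
    return 0.5 * np.sum((a - t) ** 2)

grad_a = (W @ h + b) - t               # nabla_{a_k} L for this particular loss
grad_W = grad_a @ h.T                  # nabla_{W_k} L = nabla_{a_k} L h_{k-1}^T

# finite-difference check of a single weight entry
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
Wm = W.copy(); Wm[1, 2] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
assert abs(numeric - grad_W[1, 2]) < 1e-4
```

The bias gradient follows the same pattern: perturbing \(b_k\) changes \(a_k\) one-for-one, so \(\nabla_{b_k} \mathscr{L} = \nabla_{a_k} \mathscr{L}\).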
ReLU: \(f(x) = \max(0, x)\)
ReLU6: \(f(x) = \max(0, x) - \max(0, x - 6)\)
A ReLU unit \(f(x) = \max(0, x)\) can die: if \(w_1 x_1 + w_2 x_2 + b < 0\) for every input (e.g. when \(b \ll 0\)), its output and gradient are zero, so its weights stop updating.
Leaky ReLU: \(f(x) = \max(0.1x, x)\)
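All three activations are one-liners in numpy; a quick sketch (the function names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu6(x):
    # ReLU capped at 6: max(0, x) - max(0, x - 6)
    return np.maximum(0, x) - np.maximum(0, x - 6)

def leaky_relu(x, alpha=0.1):
    # small negative slope keeps gradients alive for x < 0
    return np.maximum(alpha * x, x)

x = np.array([-2.0, 0.0, 3.0, 8.0])
```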
[Figure: dropout as sampling sub-networks with replacement. In each sampled sub-network, some neurons give a strong response to the input (and hence a larger weight update), some give a weak response (and hence a smaller weight update), and some are dropped out. With MaxOut, units below the max are effectively dropped out. For a different set of inputs, the scenario may switch!]
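Since each training example sees a different randomly sampled sub-network, a forward pass simply zeroes a random subset of activations. A minimal sketch of inverted dropout (the `p_keep` name and the scale-at-train-time convention are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                 # probability that a unit survives

def dropout_forward(h, train=True):
    if not train:
        return h             # inverted dropout: no rescaling needed at test time
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask          # surviving units are scaled up by 1/p_keep

h = np.ones((4, 5))
h_train = dropout_forward(h)             # random units zeroed, survivors scaled to 2.0
h_test = dropout_forward(h, train=False) # unchanged at test time
```

Scaling by `1/p_keep` during training keeps the expected activation the same at train and test time, which is why the test-time pass needs no correction.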
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

num_layers = 5
D = np.random.randn(1000, 500)          # 1000 samples, 500 features
h_cache = []
for i in range(num_layers):
    x = D if i == 0 else h_cache[i-1]
    W = 0.01*np.random.randn(500, 500)  # small random initialization
    a = np.dot(x, W)
    h = sigmoid(a)
    h_cache.append(h)
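Tracking each layer's activation spread in this loop makes the problem visible: with the 0.01 scaling, the standard deviation of the sigmoid outputs collapses after the first layer, so deeper layers see almost no signal variation. A self-contained sketch (the seeded RNG is my addition):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
num_layers = 5
x = rng.standard_normal((1000, 500))
stds = []
for i in range(num_layers):
    W = 0.01 * rng.standard_normal((500, 500))  # small random init
    x = sigmoid(x @ W)
    stds.append(x.std())
print(stds)   # activation spread shrinks sharply after the first layer
```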
W = 0.01*np.random.randn(500,500)
W = np.random.randn(500,500)
W = 0.5*np.eye(500)
W = 1.5*np.eye(500)
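These initializations can be compared directly by how the activation magnitude evolves through a deep stack: too-small weights shrink it toward zero, unit-scale random weights blow it up, and the diagonal cases shrink or grow geometrically. A linear-layer sketch (the depth of 10 and the mean-absolute-value metric are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1000, 500))

def magnitude_after(W, layers=10):
    x = x0
    for _ in range(layers):
        x = x @ W          # linear layers only, to isolate the scaling effect
    return np.abs(x).mean()

small = magnitude_after(0.01 * rng.standard_normal((500, 500)))  # shrinks toward 0
big   = magnitude_after(rng.standard_normal((500, 500)))         # blows up
half  = magnitude_after(0.5 * np.eye(500))                       # halves every layer
grow  = magnitude_after(1.5 * np.eye(500))                       # grows 1.5x every layer
```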
W = np.random.randn(fan_in,fan_out)/np.sqrt(fan_in)
W = np.random.randn(500, 500) / np.sqrt(fan_in)  # here fan_in = 500
[Figure: training progress; ~380 iterations at ~30 iterations/s]
W = np.random.randn(fan_in,fan_out)/np.sqrt(fan_in/2)
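The extra factor of 2 in the denominator compensates for ReLU zeroing out half of its inputs, so the activation scale stays roughly constant through many layers instead of vanishing. A sketch (the depth of 10 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 500
x = rng.standard_normal((1000, fan_in))
for _ in range(10):
    W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2)
    x = np.maximum(0, x @ W)   # ReLU layer
print(x.std())                 # stays of order 1 instead of collapsing
```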
[Figure: network diagrams with BatchNorm (BN) layers inserted between tanh activation layers]
Accumulated activations for \(m\) training samples
Blue: Without BN
Red: With BN