Lecture 3: Sigmoid Neurons, Gradient Descent, Feedforward Neural Networks, Representation Power of Feedforward Neural Networks
IIT Madras
AI4Bharat
CS6910: Fundamentals of Deep Learning
Mitesh M. Khapra, Arun Prakash A
Learning Objectives
At the end of this lecture, students will have a good understanding of the following topics:
- Sigmoid Neurons
- Gradient Descent
- Feedforward Neural Networks
- Representation Power of Feedforward Neural Networks
Acknowledgment
For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)
For Module 3.5, I have borrowed ideas from this excellent book* which is available online
I am sure I would have been influenced and borrowed ideas from other sources and I apologize if I have failed to acknowledge them
Module 3.1: Sigmoid Neuron
The story ahead
Enough about boolean functions!
What about arbitrary functions of the form \(y=f(x)\) where \(x \in R^n\) (instead of \(\{0,1\}^n\)) and \(y \in R\) (instead of \(\{0,1\}\))?
Can we have a network which can (approximately) represent such functions?
Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
Recall
A perceptron will fire if the weighted sum of its inputs is greater than the threshold \((-w_0)\)
The thresholding logic used by a perceptron is very harsh !
For example, let us return to our problem of deciding whether we will like or dislike a movie
Consider that we base our decision only on one input \( x_1 = criticsRating \in (0,1)\)
If the threshold is 0.5 \((-w_0=0.5)\) and \(w_1=1\), then what would be the decision for a movie with criticsRating = 0.51? (like)
What about a movie with criticsRating = 0.49 ? (dislike)
It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
[Figure: perceptron with input \(x_1\), weight \(w_1\), threshold \(-w_0 = 0.5\), and output \(y\)]
This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
It is a characteristic of the perceptron function itself which behaves like a step function
There will always be this sudden change in the decision (from 0 to 1) when \(\sum \limits_{i=1}^{n}w_ix_i \) crosses the threshold \(-w_0\)
[Figure: step function output \(y\) plotted against \(z = \sum \limits_{i=1}^{n}w_ix_i\), jumping from 0 to 1 at \(z = -w_0\)]
For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1
[Figure: a smoother decision function, \(y\) plotted against \(z\)]
Introducing sigmoid neurons where the output function is much smoother than the step function
Here is one form of the sigmoid function called the logistic function
y = \frac{1}{1+exp\big(-(w_0+\sum \limits_{i=1}^{n}w_ix_i)\big)}
We no longer see a sharp transition around the threshold
[Figure: logistic function output \(y\) plotted against \(z = \sum \limits_{i=1}^{n}w_ix_i\), with \(-w_0\) marked on the axis]
Also the output \(y\) is no longer binary but a real value between 0 and 1 which can be interpreted as a probability.
Instead of a like/dislike decision we get the probability of liking the movie.
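The contrast between the two neurons can be checked numerically. The following is a minimal sketch, assuming the single-input movie example with \(w_1 = 1\) and \(w_0 = -0.5\); the function names `perceptron` and `sigmoid_neuron` are illustrative and not part of the lecture.

```python
import math

def perceptron(x1, w1=1.0, w0=-0.5):
    """Step output: 1 if w0 + w1*x1 >= 0, else 0."""
    return 1 if w0 + w1 * x1 >= 0 else 0

def sigmoid_neuron(x1, w1=1.0, w0=-0.5):
    """Logistic output: a real value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x1)))

# Two ratings on either side of the threshold 0.5.
for rating in (0.49, 0.51):
    print(rating, perceptron(rating), round(sigmoid_neuron(rating), 4))
```

The perceptron flips from 0 to 1 between the two ratings, while the sigmoid outputs stay close to each other (both near 0.5).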
Perceptron
y = \begin{cases} 1 & \text{if} \ \sum \limits_{i=0}^{n}w_ix_i \geq 0 \\ 0 & \text{if} \ \sum \limits_{i=0}^{n}w_ix_i < 0 \end{cases}
Not smooth, not differentiable at \(z = -w_0\)
Sigmoid (Logistic) Neuron
y = \frac{1}{1+\exp\big(-(w_0+\sum \limits_{i=1}^{n}w_ix_i)\big)}
Smooth, continuous, differentiable
[Figure: both neurons take inputs \(x_1, x_2, \cdots, x_n\) with weights \(w_1, w_2, \cdots, w_n\), plus \(x_0 = 1\) with weight \(w_0 = -\theta\); each output \(y\) is plotted against \(z = \sum \limits_{i=1}^{n}w_ix_i\) with \(-w_0\) marked on the axis]
Module 3.2: A typical Supervised Machine Learning Setup
Sigmoid (Logistic) Neuron
What next ?
Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
Before we see such an algorithm we will revisit the concept of error
[Figure: sigmoid neuron \(\sigma\) with inputs \(x_1, x_2, \cdots, x_n\), weights \(w_1, w_2, \cdots, w_n\), \(x_0 = 1\) with \(w_0 = -\theta\), and output \(y\)]
Earlier we mentioned that a single perceptron cannot deal with these data because they are not linearly separable
What does “cannot deal with” mean?
What would happen if we use a perceptron model to classify these data ?
We would probably end up with lines like these ...
From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical supervised machine learning setup which has the following components...
Data: \(\{x_i,y_i\}_{i=1}^{n}\)
Model: Our approximation of the relation between \(\mathbf{x}\) and \(y\).
For example, \( \hat{y} = \frac{1}{1+e^{-\mathbf{w^Tx}}}\) or \(\hat{y} = \mathbf{w^Tx}\) or \(\hat{y} = \mathbf{x^TWx} \)
Parameters: In all the above cases, \(\mathbf{w}\) is a parameter which needs to be learned from the data
Objective/Loss/Error function: To guide the learning algorithm; the learning algorithm should aim to minimize the loss function
Learning algorithm: An algorithm for learning the parameters \(\mathbf{w} \) of the model (for example, perceptron learning algorithm, gradient descent, etc.)
As an illustration, consider our movie example
Data: \(\{x_i = \text{movie},\ y_i = \text{like/dislike}\}_{i=1}^{n}\)
Model: Our approximation of the relation between x and y (the probability of liking a movie)
\hat{y} = \frac{1}{1+e^{-\mathbf{w^Tx}}}
Parameters: \( \mathbf{w} \)
Objective/Loss/Error function: One possibility is
\mathscr{L}(\mathbf{w})= \sum \limits_{i=1}^n (\hat{y}_i-y_i)^2
The learning algorithm should aim to find a \( \mathbf{w} \) which minimizes the above function (squared error between \( y \ \text{and} \ \hat{y} \))
Learning algorithm: gradient descent (we will see it soon)
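The components of this setup map directly onto code. The sketch below assumes a tiny made-up dataset (the values in `X` and `y` are hypothetical, chosen only for illustration) and uses the sigmoid model with the squared error loss described above.

```python
import numpy as np

# Data: hypothetical toy examples. The first column is x_0 = 1 so that
# w[0] plays the role of w_0; the second column is a single feature.
X = np.array([[1.0, 0.9], [1.0, 0.2], [1.0, 0.6]])
y = np.array([1.0, 0.0, 1.0])   # 1 = like, 0 = dislike

def model(w, X):
    """Model: y_hat = 1 / (1 + exp(-w^T x)) for every row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def loss(w, X, y):
    """Objective: sum of squared errors between y_hat and y."""
    return np.sum((model(w, X) - y) ** 2)

# Parameters: w, here simply set to zero before any learning.
w = np.zeros(2)
print(loss(w, X, y))  # every prediction is 0.5, so the loss is 3 * 0.25 = 0.75
```

The learning algorithm is the missing piece: its job is to change `w` so that `loss(w, X, y)` goes down.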
Module 3.3: Learning Parameters: (Infeasible) Guess Work
[Figure: sigmoid neuron \(\sigma\) with inputs \(x_1, x_2, \cdots, x_n\), weights \(w_1, w_2, \cdots, w_n\), \(x_0 = 1\) with \(w_0 = -\theta\), and output \(f(x)\)]
f(x) = \frac{1}{1+exp\big(-(\sum \limits_{i=1}^{n}w_ix_i+w_0)\big)}
Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
\(\sigma\) stands for the sigmoid function (logistic function in this case)
For ease of explanation, we will consider a very simplified version of the model having just 1 input
Further, to be consistent with the literature, from now on we will refer to \(w_0\) as \(b\) (bias)
f(x) = \frac{1}{1+e^{-(wx+b)}}
Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating (\(y\)) given imdbRating (\(x\)) (for no particular reason)
[Figure: sigmoid neuron \(\sigma\) with input \(x\) (weight \(w\)), a constant input 1 (weight \(b\)), and output \(\hat{y}=f(x)\)]
Input for Training
\(\{x_i,y_i\}_{i=1}^N \rightarrow N \ \text{pairs of} \ (x,y) \)
Training Objective
Find \( w\) and \(b \) such that
\( \underset{w, b \in R} {\text{minimize}}\) \( \mathscr{L}(w,b)= \frac{1}{N}\sum \limits_{i=1}^N (y_i-f(x_i))^2 \)
What does it mean to train the network?
Suppose we train the network with \( (x, y) = (0.5, 0.2) \) and \((2.5, 0.9)\)
[Figure: the points \((x_1,y_1)\) and \((x_2,y_2)\) in the \(x\)-\(y\) plane, with the curve \(f(x) = \frac{1}{1+e^{-(wx+b)}}\)]
At the end of training we expect to find \(w^*\), \(b^*\) such that:
\( f(0.5) \rightarrow 0.2 \ \text{and} \ f(2.5) \rightarrow 0.9 \)
In other words
We hope to find a sigmoid function such that \((0.5, 0.2)\) and \((2.5, 0.9)\) lie on this sigmoid
Let us see this in more detail ...
Can we find \( w^*,b^* \) manually?
Let us start with a Random Guess: \( w=3,b=-1\)
Clearly, it is not good, but how bad is it ?
Let us revisit \(\mathscr{L} (w, b) \) to see how bad it is ...
\mathscr{L}(w,b)= \frac{1}{N} \sum \limits_{i=1}^N (y_i-f(x_i))^2
= \frac{1}{2} * \big( (y_1-f(x_1))^2 + (y_2-f(x_2))^2 \big)
= \frac{1}{2} * \big( (0.2-f(0.5))^2 + (0.9-f(2.5))^2 \big)
= \frac{1}{2} * \big( (0.2-0.622)^2 + (0.9-0.998)^2 \big)
\approx 0.094
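The same calculation can be done in a few lines of code, which makes it easy to try other guesses. This is a minimal sketch, assuming the two training points \((0.5, 0.2)\) and \((2.5, 0.9)\) and the mean squared error defined above; the helper name `loss` is illustrative.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    """Sigmoid neuron with one input."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    """Mean squared error over the N training points."""
    return sum((y - f(x, w, b)) ** 2 for x, y in zip(X, Y)) / len(X)

# Loss for the guess (w, b) = (3, -1), followed by another guess to compare.
print(round(loss(3, -1), 4))
print(round(loss(0.5, 0.0), 4))
```

A lower value means the sigmoid passes closer to the two points.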
We want \(\mathscr{L} (w, b) \) to be as close to zero as possible
Let us try some other values for \((w,b)\)
w | b | Loss |
---|---|---|
0.50 | 0.00 | 0.0730 |
-0.10 | 0.00 | 0.14 |
0.94 | -0.94 | 0.0214 |
1.42 | -1.73 | 0.0028 |
1.65 | -2.08 | 0.0003 |
1.78 | -2.27 | 0.0000 |
Let us look for something better than our “guess work” algorithm ...
Since we have only 2 points and 2 parameters \((w, b)\) we can easily plot \(\mathscr{L}(w, b)\) for different values of \((w, b)\) and pick the one where \(\mathscr{L}(w, b)\) is minimum
Random Search on Error Surface
But of course this becomes intractable once you have many more data points and many more parameters !!
Further, even here we have plotted the error surface only for a small range of \((w, b)\) [from \(-6\) to \(+6\), and not from \(-\infty\) to \(+\infty\)]
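The brute-force idea can be sketched as a grid search. The code below assumes a grid over \([-6, 6]\) with step 0.1 for both parameters (the step size is an arbitrary choice) and the two training points used so far; it evaluates the loss at every grid point and keeps the smallest.

```python
import numpy as np

X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

def loss(w, b):
    """Squared error of the sigmoid neuron over both points (with the 1/2 factor)."""
    pred = 1.0 / (1.0 + np.exp(-(w * X + b)))
    return 0.5 * np.sum((pred - Y) ** 2)

grid = np.linspace(-6.0, 6.0, 121)   # 121 values, step 0.1
# losses[i, j] is the loss at w = grid[i], b = grid[j].
losses = np.array([[loss(w, b) for b in grid] for w in grid])

i, j = np.unravel_index(np.argmin(losses), losses.shape)
best_w, best_b = grid[i], grid[j]
print(best_w, best_b, losses[i, j])
```

Even this tiny problem needs 121 × 121 loss evaluations, and the count grows exponentially with the number of parameters, which is why this approach does not scale.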
Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface
Module 3.4: Learning Parameters: Gradient Descent
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
Vector of parameters, say, randomly initialized: \( \theta = [w,b]\)
Change in the values of \(w,b\): \( \Delta \theta = [\Delta w, \Delta b]\)
We moved in the direction of \(\Delta \theta\)
Let us be a bit conservative: move only by a small amount \(\eta \)
\(\theta_{new}=\theta+\eta \cdot \Delta \theta \)
Question: What is the right \(\Delta \theta\) to use ?
The answer comes from Taylor series
[Figure: the vector \(\theta\), the step \(\Delta \theta\), the scaled step \(\eta \cdot \Delta \theta\), and the resulting \(\theta_{new}\)]
What is Taylor Series?
Taylor series is a way of approximating a sufficiently differentiable function \(\mathscr{L}(w)\) around a point \(w_0\) using polynomials of degree \(n\). The higher the degree, the better the approximation near \(w_0\)!
\(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+\frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2+\frac{\mathscr{L'''}(w_0)}{3!}(w-w_0)^3+ \cdots\)
[Figure: \(\mathscr{L}(w)\) plotted against \(w\), with the point \(w_0\) marked]
Linear Approximation (\( n=1 \))
\( \mathscr{L}(w)=\mathscr{L}(w_0)+ \frac{\mathscr{L'}(w_0)}{1!}(w-w_0) \)
Quadratic Approximation (\(n=2\))
\(\mathscr{L}(w)=\mathscr{L}(w_0)+\frac{\mathscr{L'}(w_0)}{1!}(w-w_0)+ \frac{\mathscr{L''}(w_0)}{2!}(w-w_0)^2 \)
Taylor Series Approximation
Try with different \( \mathscr{L}(w) \) such as \(\exp(w), \sin(w), w^3, \cdots \) and see how closely the Taylor series approximates the function at the point \(w = w_0\) and in its \(\epsilon\)-neighbourhood
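This experiment can also be run without the interactive plot. The sketch below assumes \(\mathscr{L}(w) = \exp(w)\) and \(w_0 = 1\) (both arbitrary choices) and compares the linear, quadratic and cubic approximations at a point close to \(w_0\) and at a point further away; `taylor_exp` is an illustrative helper name.

```python
import math

def taylor_exp(w, w0, degree):
    """Degree-n Taylor polynomial of exp around w0 (every derivative of exp is exp)."""
    return sum(math.exp(w0) * (w - w0) ** k / math.factorial(k)
               for k in range(degree + 1))

w0 = 1.0
for eps in (0.1, 1.0):
    w = w0 + eps
    # Absolute error of the degree 1, 2 and 3 approximations at w.
    errors = [abs(math.exp(w) - taylor_exp(w, w0, n)) for n in (1, 2, 3)]
    print(eps, errors)
```

For both points the error shrinks as the degree grows, and for a fixed degree the error is far smaller at the point nearer to \(w_0\).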
Approximating 2D function
For ease of notation, let \(\Delta \theta = u\) , then from Taylor series, we have,
\mathcal{L}(\theta+ \eta u) = \mathcal{L}(\theta)+\eta*u^T \nabla_{\theta}\mathcal{L}(\theta)+\frac{\eta^2}{2!}*u^T\nabla_{\theta}^2\mathcal{L}(\theta)u+\frac{\eta^3}{3!}* \cdots +\frac{\eta^4}{4!}* \cdots
\approx \mathcal{L}(\theta)+\eta*u^T \nabla_{\theta}\mathcal{L}(\theta)
[ \( \eta\) is typically small, so \( \eta^2,\eta^3, \cdots\rightarrow 0\)]
Note that the move \(\eta u\) would be favorable only if
\mathcal{L}(\theta+\eta u)-\mathcal{L}(\theta) < 0
[i.e., if the new loss is less than the previous loss]
This implies,
u^T \nabla_{\theta}\mathcal{L}(\theta) < 0
But, what is the range of \( u^T \nabla_{\theta}\mathcal{L}(\theta) \)?
Let \( \beta \) be the angle between \( u \) and \( \nabla_{\theta} \mathcal{L}(\theta)\), then we know that
\(-1 \leq \cos(\beta) = \frac{ u^T \nabla_{\theta} \mathcal{L}(\theta)}{ ||u|| * ||\nabla_{\theta} \mathcal{L}(\theta)||} \leq 1 \)
Multiply throughout by \(k = ||u|| * ||\nabla_{\theta} \mathcal{L}(\theta)|| \)
\(-k \leq k*\cos(\beta) = u^T \nabla_{\theta} \mathcal{L}(\theta) \leq k \)
Thus \( \mathcal{L}(\theta+\eta u) - \mathcal{L}(\theta) \approx \eta * u^T \nabla_{\theta} \mathcal{L}(\theta) = \eta * k*\cos(\beta) \) is most negative when \(\cos(\beta)=-1\) (i.e., \(\beta = 180^\circ\))
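The claim about the angle can be verified by brute force. The sketch below assumes a hypothetical 2D gradient vector \([3, -4]\) (any non-zero vector would do) and sweeps unit vectors \(u\) over every whole-degree angle \(\beta\) relative to it.

```python
import numpy as np

grad = np.array([3.0, -4.0])              # hypothetical gradient at theta, norm 5
angles = np.deg2rad(np.arange(0, 360))    # beta = 0, 1, ..., 359 degrees
base = np.arctan2(grad[1], grad[0])       # direction of the gradient itself

# Unit vectors u that make an angle beta with the gradient.
U = np.stack([np.cos(base + angles), np.sin(base + angles)], axis=1)
dots = U @ grad                           # u^T grad = ||u|| * ||grad|| * cos(beta)

best = np.argmin(dots)
print(np.rad2deg(angles[best]), dots[best])
```

The smallest value of \(u^T \nabla_{\theta}\mathcal{L}(\theta)\) is reached when \(u\) points exactly opposite to the gradient, where it equals \(-||\nabla_{\theta}\mathcal{L}(\theta)||\).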
Land at Local Minima with Gradient as Guidance
Set \(\eta=1\) initially. The function \(f(w)\) is evaluated at the initial value \(w=1.2\); note the value of the gradient (\(dw\)).
Change the value of \(w\) according to the gradient. That is, take a step \(w\pm \eta \, dw\) (for the first iteration just enter \(1.2-0.31\) in the input box, and do the same for the rest)
After a few iterations, did you land at the local minimum?
If not, repeat the game and make the necessary changes to land at the local minimum.
Gradient Descent Rule
The direction \(u\) that we intend to move in should be at 180° w.r.t. the gradient
In other words, move in a direction opposite to the gradient
Parameter Update Rule
\( w_{t+1} = w_t - \eta \nabla w_t\)
\(b_{t+1} = b_t - \eta \nabla b_t \)
where \( \nabla w_t = \frac{\partial \mathcal{L}(w,b)}{\partial w}\) at \(w=w_t,b=b_t \),
and \( \nabla b_t = \frac{\partial \mathcal{L}(w,b)}{\partial b}\) at \(w=w_t,b=b_t \)
So we now have a more principled way of moving in the (\(w,b\)) plane than our “guess work” algorithm
Let us create an algorithm for this rule ...
Algorithm: gradient_descent()
\(t \leftarrow 0;\)
max_iterations \(\leftarrow 1000\);
\(w, b \leftarrow\) initialize randomly;
while \(t <\) max_iterations do
    \(w_{t+1} \leftarrow w_t-\eta \nabla w_t\);
    \(b_{t+1} \leftarrow b_t-\eta \nabla b_t\);
    \( t \leftarrow t+1;\)
end
The same loop as a Python skeleton:
import random

eta = 1.0                   # learning rate (step size)
max_iterations = 1000
w = random.random()         # random initialization
b = random.random()
while max_iterations:
    dw, db = 0.0, 0.0       # placeholders: the gradients of the loss at the current (w, b)
    w = w - eta*dw
    b = b - eta*db
    max_iterations -= 1
To see this algorithm in practice let us first derive \( \nabla w \) and \( \nabla b\) for our toy neural network
f(x) = \frac{1}{1+e^{-(wx+b)}}
Let us assume there is only one point to fit
\mathscr{L}(w,b)=\frac{1}{2}*(f(x)-y)^2
\nabla w = \frac{\partial \mathscr{L}(w,b)}{\partial w} = \frac{\partial}{\partial w}[\frac{1}{2}*(f(x)-y)^2]
= \frac{1}{2} [2*(f(x)-y) * \frac{\partial}{\partial w} (f(x)-y)]
=(f(x)-y) * \frac{\partial}{\partial w} (f(x))
=(f(x)-y) * \frac{\partial}{\partial w} (\frac{1}{1+e^{-(wx+b)}})
=(f(x)-y) * f(x)*(1-f(x))*x
where the last step uses
\frac{\partial}{\partial w} (\frac{1}{1+e^{-(wx+b)}})
= \frac{-1}{(1+e^{-(wx+b)})^2} \frac{\partial}{\partial w} (e^{-(wx+b)})
= \frac{-1}{(1+e^{-(wx+b)})^2}* (e^{-(wx+b)})* \frac{\partial}{\partial w} (-(wx+b))
= \frac{-1}{(1+e^{-(wx+b)})}* \frac{e^{-(wx+b)}}{(1+e^{-(wx+b)})}* (-x)
= \frac{1}{(1+e^{-(wx+b)})}* \frac{e^{-(wx+b)}}{(1+e^{-(wx+b)})}* x
= f(x)*(1-f(x))* x
So if we have only one point (\(x,y\)), we have,
\nabla w=(f(x)-y) * f(x)*(1-f(x))*x
For two points,
\nabla w=\sum \limits_{i=1}^{2}(f(x_i)-y_i) * f(x_i)*(1-f(x_i))*x_i
\nabla b=\sum \limits_{i=1}^{2}(f(x_i)-y_i) * f(x_i)*(1-f(x_i))
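A derived gradient is easy to get wrong, so it is worth checking it against a numerical estimate. The sketch below assumes the two training points used throughout and compares the formulas above with central finite differences at an arbitrary point \((w, b) = (-2, -2)\); the step `h` is an assumed small value.

```python
import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    """Half the sum of squared errors over both points."""
    return 0.5 * sum((f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b):
    """Analytic gradient with respect to w (formula derived above)."""
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x
               for x, y in zip(X, Y))

def grad_b(w, b):
    """Analytic gradient with respect to b."""
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b))
               for x, y in zip(X, Y))

w, b, h = -2.0, -2.0, 1e-6
# Central differences approximate the partial derivatives numerically.
numeric_dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
numeric_db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
print(grad_w(w, b), numeric_dw)
print(grad_b(w, b), numeric_db)
```

If the analytic and numerical values agree to several decimal places, the derivation is very likely correct.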
import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    # sigmoid neuron with a single input
    return 1/(1 + np.exp(-(w*x + b)))

def error(w, b):
    # half the sum of squared errors over all training points
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(x, w, b)
        err += (fx - y)**2
    return 0.5*err

def grad_b(x, w, b, y):
    fx = f(x, w, b)
    return (fx - y)*fx*(1 - fx)

def grad_w(x, w, b, y):
    fx = f(x, w, b)
    return (fx - y)*fx*(1 - fx)*x

def do_gradient_descent():
    w, b, eta, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        # accumulate the gradients over all training points
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(x, w, b, y)
            db += grad_b(x, w, b, y)
        # move against the gradient
        w = w - eta*dw
        b = b - eta*db
    return w, b
Let us do_gradient_descent()
Later on in the course we will look at gradient descent in much more detail and discuss its variants
For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron
So where do we head from here ?
Module 3.5: Representation Power of a Multilayer Network of Sigmoid Neurons
Representation power of a multilayer network of perceptrons
A multilayer network of perceptrons with a single hidden layer can be used to represent any boolean function precisely (no errors)
Representation power of a multilayer network of sigmoid neurons
A multilayer network of neurons with a single hidden layer can be used to approximate any continuous function to any desired precision
In other words, there is a guarantee that for any continuous function \(f(x) : R^n \rightarrow R^m\), we can always find a neural network (with 1 hidden layer containing enough neurons) whose output \(g(x)\) satisfies \(|g(x) - f(x)| < \epsilon\)!
Proof: We will see an illustrative proof of this... [Cybenko, 1989], [Hornik, 1991]
See this link* for an excellent illustration of this proof
The discussion in the next few slides is based on the ideas presented at the above link
We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)
We observe that such an arbitrary function can be approximated by several “tower” functions
The more such “tower” functions we use, the better the approximation
To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions
Sifting Property of Tower function (impulse or delta function)
Any function can be represented by summing shifted and scaled versions of tower functions
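The effect of adding more towers can be seen numerically. The sketch below uses ideal rectangular towers (not yet built from neurons) and a hypothetical target function \(\sin(2\pi x)\) on \([0, 1)\); the helper names `tower` and `approximate` are illustrative.

```python
import numpy as np

def tower(x, left, right, height):
    """Rectangular tower: `height` on [left, right), zero elsewhere."""
    return height * ((x >= left) & (x < right))

def approximate(func, x, n_towers, lo=0.0, hi=1.0):
    """Sum of n_towers adjacent towers, each as tall as func at its midpoint."""
    edges = np.linspace(lo, hi, n_towers + 1)
    total = np.zeros_like(x)
    for left, right in zip(edges[:-1], edges[1:]):
        total += tower(x, left, right, func(0.5 * (left + right)))
    return total

target = lambda x: np.sin(2 * np.pi * x)   # hypothetical function to approximate
x = np.linspace(0.0, 1.0, 1000, endpoint=False)
for n in (5, 20, 80):
    # Worst-case gap between the sum of towers and the target.
    err = np.max(np.abs(approximate(target, x, n) - target(x)))
    print(n, err)
```

The worst-case error shrinks as the number of towers grows, which is the sense in which the approximation gets better.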
Suppose there is a black box which takes the original input \((x)\) and constructs these tower functions
We can then have a simple network which can just add them up to approximate the function
Our job now is to figure out what is inside this blackbox
[Figure: the input \(x\) feeds several “Tower Maker” black boxes whose outputs are added (+) together]
We will figure this out over the next few slides ...
If we take the logistic function and set \(w\) to a very high value we will recover the step function
h_{21}=h_{11}-h_{12}
Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another ?
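[Figure: the input \(x\) feeds two sigmoid neurons \(h_{11}\) (parameters \(w_1,b_1\)) and \(h_{12}\) (parameters \(w_2,b_2\)), whose outputs are combined with weights \(+1\) and \(-1\) into \(h_{21}\)]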
What could be the suitable activation function for the neuron \(h_{21}\)?
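One way to see the construction in action is sketched below. It assumes an identity (linear) output for \(h_{21}\), a steepness of \(w = 1000\), and a tower between \(x = 0.4\) and \(x = 0.6\); all three are illustrative choices, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_1d(x, left, right, w=1000.0):
    """Difference of two steep sigmoids: close to 1 on (left, right), close to 0 outside."""
    h11 = sigmoid(w * (x - left))    # near-step at x = left  (bias b_1 = -w * left)
    h12 = sigmoid(w * (x - right))   # near-step at x = right (bias b_2 = -w * right)
    return h11 - h12                 # h_21 = (+1) * h_11 + (-1) * h_12, no squashing

x = np.array([0.1, 0.5, 0.9])
print(tower_1d(x, left=0.4, right=0.6))
```

The point inside the interval gives an output close to 1 and the two points outside give outputs close to 0, so two sigmoid neurons suffice for a one-dimensional tower.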
What if we have more than one input?
Suppose we are trying to take a decision about whether we will find oil at a particular location on the ocean bed (Yes/No)
Further, suppose we base our decision on two factors: Salinity \((x_1)\) and Pressure \((x_2)\)
We are given some data and it seems that \(y \in \{oil,no \ oil\}\) is a complex function of \(x_1\) and \(x_2\)
We want a neural network to approximate this function
f(x_1,x_2)=\frac{1}{1+\exp(-(w_{1}x_1+w_{2}x_2+b))}
We need to figure out how to get a tower in this case
Let us vary the parameter \(w_1\) keeping \(w_2=0\) and \(b=0\).
Let us vary the parameter \(w_2\) keeping \(w_1=0\) and \(b=0\).
Let us vary the parameter \(b\) keeping \(w_1=k\) and \(w_2=0\).
What if we take two such step functions (with different \(b\) values) and subtract one from the other?
We still don’t get a tower (or we get a tower which is open from two sides)
[Figure: inputs \(x_1\) and \(x_2\) feed two sigmoid neurons \(h_{11}\) and \(h_{12}\) through weights \(w_1, w_2, w_3, w_4\); their outputs are combined with weights \(+1\) and \(-1\) to give \(y=f(x_1,x_2)\)]
Can we come up with a neural network to represent this entire procedure of constructing a 3 dimensional tower ?
Creating 2D Tower Function
[Figure: inputs \(x_1\) and \(x_2\) feed four sigmoid neurons \(h_{11}, h_{12}, h_{13}, h_{14}\), whose outputs are combined with weights \(+1, -1, +1, -1\) and passed on to a further \(\sigma\) unit; the weights and biases of each neuron are adjustable]
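The whole procedure can be sketched in code. The version below is one possible construction under stated assumptions: steep sigmoids with \(w = 1000\), a tower over the square \([0.4, 0.6] \times [0.4, 0.6]\), and an output sigmoid (steepness 100) that thresholds the sum at 1.5. None of these numbers come from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_2d(x1, x2, lo1, hi1, lo2, hi2, w=1000.0, w_out=100.0):
    # Each pair of steep sigmoids gives a ridge that is open along the other axis.
    ridge_x1 = sigmoid(w * (x1 - lo1)) - sigmoid(w * (x1 - hi1))   # h_11 - h_12
    ridge_x2 = sigmoid(w * (x2 - lo2)) - sigmoid(w * (x2 - hi2))   # h_13 - h_14
    # The sum is about 2 where the ridges cross, 1 on the arms, 0 elsewhere.
    total = ridge_x1 + ridge_x2
    # The output sigmoid keeps only the region where the sum exceeds 1.5.
    return sigmoid(w_out * (total - 1.5))

# One point inside the square, two on the arms of the ridges, one outside both.
points = np.array([[0.5, 0.5], [0.5, 0.9], [0.9, 0.5], [0.9, 0.9]])
print(tower_2d(points[:, 0], points[:, 1], 0.4, 0.6, 0.4, 0.6))
```

Only the point inside the square produces an output close to 1, so the result is a tower that is closed on all four sides.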
For 1 dimensional input we needed 2 neurons to construct a tower
For 2 dimensional input we needed 4 neurons to construct a tower
How many neurons will you need to construct a tower in \(n\) dimensions?
Think