Building neural network step-by-step in raw python¶

1. one single neuron for linear regression:¶

The neuron is just two matrices, one of intercepts, the other one of bias. if we have m inputs / features / independent variables $\mathbf{x}=[x_1,\cdots,x_m]^T$, we can write the neuron's function as a linear regression: $y=\mathbf{w}^T \mathbf{x}+b$, where intercepts are $\mathbf{w}=[w_1,\cdots,w_m]^T$.

alt text

2. one single neuron for non-linear regression:¶

The neuron can also be for non-linear regression, through a non-linear function called the activation function. This function maps the $y$ into $y_{\text{hat}}$ within a certain range. Choice of the function is such that 1) the neuron can act like a switch: when y is below the threshold value, the neuron is turned off; otherwise it's turned on; 2) it is better to use a smooth function so that we can use its derivative for learning. The idea of gradient descent for learning is the same as what we discussed in the previous courses.

A typical activation function can be for example sigmoid function: $y =\cfrac{1}{1+exp(-x)}$. Some others are https://en.wikipedia.org/wiki/Activation_function:

alt text

In [1]:
import numpy as np
class Neuron:
    def __init__(self,inputsize=3,weights=None,bias=None,activation='sigmoid'):
        self.weights = weights
        if not weights:
            self.weights = np.random.normal(size=inputsize)
        self.bias = bias
        if not bias:
            self.bias = np.random.rand()
        if activation=='sigmoid':
            self.activation = self.sigmoid
        else:
            self.activation = self.relu
        
    def sigmoid(self,y):
        return 1/(1+np.exp(-y))
    
    def relu(self,y):
        if y>0:
            return y
        return 0
    
    def forward(self,x):
        return self.activation(np.matmul(x,self.weights)+self.bias)

inputs = [1,2,3]
neuron = Neuron(activation='sigmoid')
print(neuron.forward(inputs))
neuron = Neuron(activation='relu')
print(neuron.forward(inputs))
0.8970168231830213
0

3. multiple layers of neurons: the neural network¶

Now suppose we have two neurons in the first layer and one output neuron in the second layer. Each neuron is only connected to neurons in other layers, and each neuron is connected to ALL of the neurons in the adjacent layers.

We'd need clearer notations of the neurons. Let's denote the inputs with footnote $i$, the parameters in the first layer neurons with footnote $j$ (since we have two neurons in the first layer, j is either 1 or 2), and the parameters in the output neuron with footnote $l$ (since there's only one output neuron, $l=3$). The weights in each layer are denoted by both its start and end neurons, e.g., $w_{j=1,l=3}$ is the weight in neuron3 for input from neuron1.

alt text

For the output neuron3, since its input is the same as the output from the previous layer, we denote its input with $o_j,j \in \{1,2\}$.

For neurons 1 and 2 in the first layer: its input is denoted $o_i$, although, since we only have two layers, in our case $o_i=x_i$.

The activation function is denoted $\phi$. The target (ground truth) is $t$. The output from the output neuron3 is the estimate of the ground truth, denoted $\hat{y}$. Since we only have one output neuron, we also have $\hat{y}=o_l$.

For the intermediate result from the linear function, in neurons 1 and 2 for example, we use $\text{net}_j = (\sum_{i \in \{1,2,3\}} w_{ij} \cdot o_i) + b_j$. We'd also have for neuron1: $o_j = \phi(\text{net}_j)$.

For neuron3: $\text{net}_l= (\sum_{j \in \{1,2\}} w_{jl} \cdot o_j) + b_l$, and output $\hat{y}=o_l=\phi(\text{net}_l)$.

In [2]:
neuron1 = Neuron()
neuron2 = Neuron()
neuron3 = Neuron(inputsize=2)

o1 = neuron1.forward(inputs)
o2 = neuron2.forward(inputs)
y_hat = neuron3.forward([o1,o2])

print(o1,o2,y_hat)
0.9878072538614416 0.8883729264715795 0.35728001637405404

4. loss function and learning through backpropagation¶

Suppose the target / ground truth / label is $t$. The loss function is mean squared error (as introduced in previous courses: measurement for regression problems): $e = \frac{1}{2} (t-\hat{y})^2$, where $\hat{y}=o_l$. Our goal is to approximate the target $t$ with our model, such that the output of the model is as close to the target as possible (i.e., minimize the mean squared error).

To do that, we need to be able to update our model parameters based on the error value. In linear regression, the parameters of the model would be the intercept and the bias. In a neural network, the parameters are all the weights and biases of all the neurons. Updating the parameters in a neural network is the same as in linear regression: through gradient descent. The only difference is the calculation of gradient: because neural networks are non-linear, we need to get the derivatives of a non-linear function.

We work it out backwards step-by-step.

  1. derivative of the loss function

    The derivative of the loss function (the mean squared error) is simply $\cfrac{\partial e}{\partial o_l} = o_l - t$.

    To update the last neuron's weights and bias, we'd need the derivative of the error $e$ with respect to each weight and bias.

    To do that, we first get the derivative of the activation function (suppose it's a sigmoid) with respect to $\text{net}_l$, then the derivative of the linear function with respect to the weights and bias.

  2. derivative of the activation function

    Neuron3's sigmoid function's derivative with respect to its input $\text{net}_l$ is:

    $\cfrac{\partial o_l}{\partial \text{net}_l} = \cfrac{\partial \phi(\text{net}_l)}{\partial \text{net}_l} = o_l \cdot (1-o_l)$

    A step-by-step calculation is here: https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e

  3. derivative of the linear function

    The derivative of $\text{net}_l$ with respect to each of neuron3's intercept $w_{jl}$ is:

    $\cfrac{\partial \text{net}_l}{\partial w_{jl}} = \cfrac{\partial [(\sum_{j \in \{1,2\}} w_{jl} o_j) + b_l]}{\partial w_{jl}}=o_j$

  4. putting it all together with chain rule:

    According to the chain rule, the complete derivative from loss to intercept $w_{jl}$ in the last neuron (neuron3) is:

    $\Delta_{w_{jl}}=\cfrac{\partial e}{\partial w_{jl}}=\cfrac{\partial e}{\partial o_l} \cdot \cfrac{\partial o_l}{\partial \text{net}_l} \cdot \cfrac{\partial \text{net}_l}{\partial w_{jl}} = (o_l - t) \cdot (o_l \cdot (1-o_l)) \cdot o_j$.

    For bias, $\Delta_{b_l}=(o_l - t) \cdot (o_l \cdot (1-o_l))$.

  1. the gradient descent and updating model parameters:

    According to the gradient descent algorithm, we'd update the original weight and bias values of the neuron3 with:

    $w_{jl} \gets w_{jl} - \eta \Delta_{w_{jl}}$ and $b_{l} \gets b_{l} - \eta \Delta_{b_{l}}$

    In order to understand the wikipedia page: https://en.wikipedia.org/wiki/Backpropagation, we change our notations slightly:

    We use $\delta_l = (o_l - t) \cdot (o_l \cdot (1-o_l))$ and rewrite the above to:

    $w_{jl} \gets w_{jl} - \eta \delta_l o_j$ and $b_{l} \gets b_{l} - \eta \delta_l$.

    This formula will be universal for all neurons.

    A quick recap of gradient descent: the gradient of the loss function points out the direction of the update: if gradient is negative, the weight needs to increase; if the gradient is positive, the weight needs to decrease. The learning rate $\eta$ controls the size of the update: if the step is too big, we risk oscillating and never converging. If the step is too small, it may take too long to reach the minimum. https://en.wikipedia.org/wiki/Gradient_descent

    alt text

  2. extend it to any neurons (not just the output neuron):

    For any neurons that are not the output neuron, it would need a longer chain of derivatives backwards. For neuron1 for example:

    $o_{j=1}=\phi(\text{net}_j) = \phi(\sum_{i} (w_{ij} \cdot o_i) + b_j)$

    The chain rule from error $e$ to $w_{ij}$ is:

    $\cfrac{\partial e}{\partial w_{ij}}=\cfrac{\partial e}{\partial o_l} \cdot \cfrac{\partial o_l}{\partial \text{net}_l} \cdot \cfrac{\partial \text{net}_l}{\partial o_j} \cdot \cfrac{\partial o_j}{\partial \text{net}_j} \cdot \cfrac{\partial \text{net}_j}{\partial w_{ij}}=(o_l - t) \cdot (o_l \cdot (1-o_l)) \cdot w_{jl} \cdot (o_j \cdot (1-o_j)) \cdot o_i$

    Again we use $\delta_j=\delta_l \cdot w_{jl} \cdot (o_j \cdot (1-o_j))$ and rewrite:

    $w_{ij} \gets w_{ij} - \eta \delta_j o_i$.

    In fact, if there would be multiple output neurons, we need to add the gradients together (as on the wikipedia page):

    $\delta_j=(\sum_l \delta_l \cdot w_{jl}) \cdot (o_j \cdot (1-o_j))$

    But the update rule stays the same for all neurons: $w_{ij} \gets w_{ij} - \eta \delta_j o_i$.

    This backward pass, as opposed to the forward pass before, is called backpropagation.

    Now we rewrite the Neuron class to include the backward pass.

In [3]:
class Neuron:
    def __init__(self,inputsize=3,weights=None,bias=None,activation='sigmoid'):
        self.weights = weights
        if not weights:
            self.weights = np.random.normal(size=inputsize)
        self.bias = bias
        if not bias:
            self.bias = np.random.rand()
        if activation=='sigmoid':
            self.activation = self.sigmoid
        else:
            self.activation = self.relu
        
    def sigmoid(self,y):
        return 1/(1+np.exp(-y))
    
    def relu(self,y):
        if y>0:
            return y
        return 0
    
    def forward(self,x):
        self.o_i = x
        self.o_j = self.activation(np.matmul(x,self.weights)+self.bias)
        return self.o_j
    
    def backward(self,t=None,y_hat=None,delta_l=None,w_l=None,
                 learning_rate=0.1,output_neuron=False):
        ''' if output_neuron is True, the neuron is the output neuron. 
                Otherwise it's a neuron in an internal layer. '''
        if output_neuron:
            delta_j = (y_hat - t) * self.o_j * (1 - self.o_j)
        else:
            delta_j = np.matmul(delta_l,w_l)
        self.delta_j = delta_j
        
        new_weights = []
        for i,w in enumerate(self.weights):
            new_weights.append(w - learning_rate * delta_j * self.o_i[i])
        self.weights = new_weights
        self.bias = self.bias - learning_rate * delta_j

Set up the network:

In [4]:
t = 1.0
inputs = [1,2,3]

neuron3 = Neuron(inputsize=2)
neurons = []
for j in range(2):
    neurons.append(Neuron(activation='sigmoid'))

for j in range(2):
    print('neuron {}: {}.'.format(j,neurons[j].weights))
print('output neuron: {}.'.format(neuron3.weights))
neuron 0: [-0.41451339 -1.77081177 -1.15804613].
neuron 1: [-1.78363994 -1.89654933 -0.46369448].
output neuron: [-0.43875046  0.81212614].

Run forward pass and backward pass once:

In [5]:
o_j = []
for j in range(2):
    o_j.append(neurons[j].forward(inputs))
y_hat = neuron3.forward(o_j)

neuron3.backward(t=t,y_hat=y_hat,output_neuron=True)
delta_l = neuron3.delta_j
for j in range(2):
    w_l = neuron3.weights[j]
    neurons[j].backward(delta_l=[delta_l],w_l=[w_l],output_neuron=False)

for j in range(2):
    print('neuron {}: {}.'.format(j,neurons[j].weights))
print('output neuron: {}.'.format(neuron3.weights))
print('estimate: {}, error: {}.'.format(y_hat,t-y_hat))
neuron 0: [-0.41853992193557854, -1.778864844034011, -1.1701257432807464].
neuron 1: [-1.7761864539890273, -1.8816423677625997, -0.4413340381348142].
output neuron: [-0.4387411310235029, 0.8121493546532328].
estimate: 0.6131035670318141, error: 0.3868964329681859.

5. exercise¶

  1. how do you train the above model such that the parameters are optimal? write training codes below (for simplicity we only have one record: one input and one label):
In [6]:
# your training code below
  1. what if you have two records: two input arrays and two labels? keeping the Neuron class the same, change your training code such that both records can be used for training:
In [7]:
inputs = [[1,2,3],[4,5,6]]
t = [1.0,0.7]

# your training code below
  1. what if you want to train faster and read in a batch of two records at the same time? how do you change the Neuron class and your training code?
In [8]:
# your new class definition and training code below
  1. (optional) currently $o_l$ is a one-dimentional label. what if you want to have multiple labels, such as in multi-class classification? how do you change your Neuron class?
In [9]:
t = [[1.0,0.0,0.0],[0.0,1.0,0.0]]

# your new class definition and training code below
  1. (optional) choose hyperbolic tangent as activation function and rewrite the forward and backward pass in the class definition (you'd have to re-calculate the derivatives).
In [10]:
# your new class definition below