Module 2: Artificial Neural Networks

Gradient Descent and Optimization in Neural Networks

Introduction

In this section, we will discuss the concept of gradient descent, a fundamental algorithm used for optimizing weights and biases in neural networks. Understanding gradient descent is crucial before diving into the mechanics of how neural networks learn through backpropagation.

Understanding the Problem

Suppose we have a dataset where z is twice the value of x. Our goal is to find the optimal weight w that generates a line best fitting this data. To achieve this, we define a cost or loss function, denoted as J.

Cost Function

The cost function measures the difference between the actual values of z and the values predicted by the model, i.e., wx. It is given by:

J(w) = \sum_{i=1}^{n} \left( z_i - w \cdot x_i \right)^2

The objective is to find the value of w that minimizes this cost function, leading to the best-fit line for the data.

Example: Simple Linear Data

For simplicity, consider the case where z = 2x. The optimal value of w that minimizes the cost function is w = 2, as it perfectly fits the line z = 2x.

Introduction to Gradient Descent

Gradient descent is an iterative optimization algorithm used to find the minimum value of a function. It is particularly useful for minimizing the cost function in neural networks.

How Gradient Descent Works

  1. Initialization: Start with a random initial value of w, denoted as w_0.
  2. Compute the Gradient: Calculate the gradient (slope, or derivative) of the cost function at the current value of w. The gradient indicates the direction in which the cost function is increasing.
  3. Update Rule: Adjust the value of w by moving in the direction opposite to the gradient, using the formula:

w_{i+1} = w_i - \alpha \cdot \text{gradient}(J(w_i))

Here, α is the learning rate, which controls the step size.

  4. Iteration: Repeat the process until the algorithm converges to the minimum value of the cost function or a value close to it.
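
The four steps above can be sketched in plain Python for the one-parameter cost J(w) = Σ(z_i − w·x_i)². The dataset, initial value, learning rate, and iteration count below are illustrative assumptions, not values fixed by the text:

```python
# Gradient descent on J(w) = sum((z_i - w * x_i)^2) for illustrative data z = 2x.
x = [1.0, 2.0, 3.0]
z = [2.0, 4.0, 6.0]

def cost(w):
    return sum((zi - w * xi) ** 2 for xi, zi in zip(x, z))

def gradient(w):
    # dJ/dw = -2 * sum(x_i * (z_i - w * x_i))
    return -2 * sum(xi * (zi - w * xi) for xi, zi in zip(x, z))

w = 0.0          # step 1: initialization (here simply 0)
alpha = 0.01     # learning rate
for _ in range(100):             # step 4: iterate
    w = w - alpha * gradient(w)  # steps 2-3: compute gradient, then update

print(round(w, 4))  # converges toward the optimal w = 2
```

Because the data lie exactly on z = 2x, the cost at w = 2 is zero and the iterations approach that value.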

Choosing the Learning Rate

The learning rate must be chosen carefully: if α is too large, the steps may overshoot the minimum and the algorithm can fail to converge; if α is too small, convergence becomes very slow.

Example with Iterations

Assume we start with w_0 = 0.2 and use a learning rate α = 0.4. Each update moves w from 0.2 toward the optimal value w = 2.
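
These iterations can be computed with a short sketch. Since the text fixes only w_0 and α, a single data point (x, z) = (1, 2) is assumed, giving J(w) = (2 − w)² and J'(w) = −2(2 − w):

```python
# Iterations of w_{i+1} = w_i - alpha * J'(w_i) for an assumed single
# data point (x, z) = (1, 2), so J(w) = (2 - w)^2 and J'(w) = -2 * (2 - w).
w, alpha = 0.2, 0.4
for i in range(5):
    grad = -2 * (2 - w)
    w = w - alpha * grad
    print(f"w_{i+1} = {w:.4f}")
# w_1 = 1.6400, w_2 = 1.9280, w_3 = 1.9856, ... approaching 2
```

Each step shrinks the remaining distance to w = 2 by the same factor (here 0.2), so convergence is rapid.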

Application in Neural Networks

In neural networks, gradient descent is used to optimize multiple weights and biases simultaneously. The algorithm updates each parameter in a way that minimizes the overall cost function, which measures how well the network's predictions match the actual data.

Forward Propagation and Gradient Descent

During training, neural networks use forward propagation to calculate the output and then apply gradient descent to adjust the weights and biases, improving the network's performance over time.

Summary

Gradient descent is a powerful optimization algorithm that iteratively adjusts parameters to minimize a cost function. By understanding how to apply gradient descent to a simple linear problem, we are now equipped to explore more complex scenarios, such as optimizing weights in neural networks using backpropagation.


Gradient Descent and Backpropagation in Neural Networks

Training Overview

Neural networks are trained using a supervised learning approach, where each data point has a corresponding label or ground truth. The goal of training is to minimize the difference (error) between the network's prediction and the ground truth. This error is calculated and then propagated back into the network to adjust the weights and biases.

Error Calculation and Cost Function

The network's output is compared with the ground truth T. A common choice of error for a single output a is the squared difference, E = (1/2)(T - a)^2, summed over all data points to form the cost function.

Gradient Descent for Optimization

To minimize the error, gradient descent is used. It iteratively updates the weights and biases in the network:

  1. Starting Point: Begin with random initial weights and biases.
  2. Gradient Calculation: Compute the gradient (slope) of the cost function with respect to each weight and bias using calculus. This shows how much the error will change if we slightly change the weights or biases.
  3. Update Rule: The weights and biases are updated using the formula:

w_{\text{new}} = w_{\text{old}} - \text{learning rate} \times \text{gradient}

The learning rate controls how big a step we take towards the minimum of the cost function.

Backpropagation

Backpropagation is the method used to calculate the gradients of the error with respect to the weights and biases. It applies the chain rule of calculus to compute how the error propagates back through the network.

Example with One Input and Two Neurons

Consider a network with a single input x passing through two neurons connected in sequence: the first neuron produces output a_1, which feeds the second neuron, producing the final prediction a_2. The ground truth is denoted T.

Weight Update Equations

For the second neuron, using the squared error E = (1/2)(T - a_2)^2 and a sigmoid activation (whose derivative is a_2(1 - a_2)), the chain rule gives:

\frac{\partial E}{\partial w_2} = -(T - a_2) \times a_2 \times (1 - a_2) \times a_1
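
This chain-rule expression can be verified numerically. The input, target, weights, and biases below are illustrative assumptions; the analytic gradient is checked against a finite-difference approximation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values for the two-neuron network (assumptions for the example).
x, T = 0.5, 1.0
w1, b1 = 0.4, 0.1
w2, b2 = 0.3, -0.2

a1 = sigmoid(w1 * x + b1)          # first neuron
a2 = sigmoid(w2 * a1 + b2)         # second neuron (prediction)
E = 0.5 * (T - a2) ** 2            # squared error

# Analytic gradient from the chain rule:
dE_dw2 = -(T - a2) * a2 * (1 - a2) * a1

# Numerical check via finite differences:
eps = 1e-6
a2_plus = sigmoid((w2 + eps) * a1 + b2)
E_plus = 0.5 * (T - a2_plus) ** 2
numeric = (E_plus - E) / eps

print(abs(dE_dw2 - numeric) < 1e-5)  # the two gradients agree
```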

Iterative Training Process

Training involves repeatedly performing the following steps until the error is minimized:

  1. Forward Propagation: Calculate the network output.
  2. Error Calculation: Compute the error between the prediction and the ground truth.
  3. Backpropagation: Calculate gradients for each weight and bias using the chain rule.
  4. Update Weights and Biases: Adjust parameters to reduce the error.

This process continues over multiple iterations or epochs until the error is sufficiently small or the maximum number of iterations is reached.
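
The full loop can be sketched for the two-neuron network above. The input, target, initial parameters, learning rate, and iteration count are illustrative assumptions; the point is that the recorded error shrinks over the iterations:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed input, target, learning rate, and initial parameters.
x, T, lr = 0.5, 0.9, 0.5
w1, b1, w2, b2 = 0.1, 0.0, 0.1, 0.0

errors = []
for _ in range(200):
    # 1. forward propagation
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    # 2. error calculation
    errors.append(0.5 * (T - a2) ** 2)
    # 3. backpropagation (chain rule)
    delta2 = -(T - a2) * a2 * (1 - a2)      # dE/dz2
    delta1 = delta2 * w2 * a1 * (1 - a1)    # dE/dz1
    # 4. update weights and biases
    w2 -= lr * delta2 * a1
    b2 -= lr * delta2
    w1 -= lr * delta1 * x
    b1 -= lr * delta1

print(errors[0] > errors[-1])  # the error decreases over the epochs
```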


Vanishing Gradient Problem in Neural Networks

Overview

The vanishing gradient problem is a significant issue associated with using the sigmoid activation function in neural networks. It affects the training efficiency and prediction accuracy of the network.

Problem Description

With sigmoid activations, the gradients that reach the early layers of a deep network become extremely small, so those layers learn very slowly. As a result, training takes longer and the network's prediction accuracy suffers.

Mathematical Insight

When using the sigmoid function, the derivatives of the activation function can be very small. During backpropagation, the gradient of the error with respect to the weights is calculated as a product of these derivatives. Thus, gradients tend to diminish as they propagate backward through the network:

\text{Gradient} = \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial a_n} \cdot \frac{\partial a_n}{\partial z_n} \cdot \frac{\partial z_n}{\partial w_i}

where:

- E is the error at the network's output,
- a_n is the activation (output) of the final neuron,
- z_n is the weighted input to that neuron, and
- w_i is the weight being updated.
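
The effect can be demonstrated directly: the sigmoid derivative never exceeds 0.25, so a product of one such factor per layer shrinks at least as fast as 0.25^depth. The depths below are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)  # maximum value is 0.25, reached at z = 0

# The backpropagated gradient picks up one sigmoid-derivative factor per layer.
# Even in the best case (z = 0 everywhere), the product shrinks as 0.25^depth.
for depth in (2, 5, 10):
    grad_factor = 1.0
    for _ in range(depth):
        grad_factor *= sigmoid_derivative(0.0)
    print(f"depth {depth}: gradient factor {grad_factor:.2e}")
```

At depth 10 the factor is already below one millionth, which is why early layers of deep sigmoid networks barely update.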

Conclusion

Due to the vanishing gradient problem, sigmoid functions and similar activation functions are not ideal for deep networks. This problem has led to the development and use of alternative activation functions that mitigate this issue.

Next Steps

In the following notes, alternative activation functions that address the vanishing gradient problem will be introduced. These functions are commonly used in hidden layers of modern neural networks to improve training efficiency and accuracy.


Activation Functions in Neural Networks

Types of Activation Functions

The seven most common activation functions are:

  1. Sigmoid Function
  2. Hyperbolic Tangent (tanh) Function
  3. Rectified Linear Unit (ReLU) Function
  4. Softmax Function
  5. Binary Step Function
  6. Linear Function
  7. Leaky ReLU

Note: functions 5 to 7 are not widely used in practice.

Introduction

Activation functions are crucial for the learning process of neural networks. They introduce non-linearity into the model, allowing it to learn complex patterns. While the sigmoid function was commonly used in the past, it has notable shortcomings, such as the vanishing gradient problem. This note explores several activation functions and their applications.

1. Sigmoid Function

Formula

\sigma(z) = \frac{1}{1 + e^{-z}}

Range

(0, 1)

Characteristics

- Smooth, S-shaped curve that saturates for large |z|, where the gradient approaches zero (the cause of vanishing gradients).
- Outputs are not centered around zero.

Applications

Previously popular, but avoided in deep networks due to vanishing gradients.

2. Hyperbolic Tangent (tanh) Function

Formula

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Range

(-1, 1)

Characteristics

- A zero-centered, scaled and shifted version of the sigmoid.
- Still saturates for large |z|, so gradients can vanish in deep networks.

Applications

Used in some applications but also limited by vanishing gradients in very deep networks.

3. Rectified Linear Unit (ReLU) Function

Formula

\text{ReLU}(z) = \max(0, z)

Range

[0, \infty)

Characteristics

- Computationally cheap; the gradient is 1 for z > 0, which avoids vanishing gradients in that region.
- Neurons can become inactive ("die") if they only ever receive negative inputs, since both output and gradient are then zero.

Applications

Widely used in hidden layers of deep networks due to its efficiency and effectiveness.

4. Softmax Function

Formula

\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

Range

(0, 1) (the outputs sum to 1 across the output layer)

Characteristics

- Converts a vector of raw scores into a probability distribution over classes.
- The largest input receives the largest probability.

Applications

Commonly used in the output layer of classification networks to handle multi-class problems.
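
As a recap, the four main functions above can be sketched with the standard library alone; the test values passed in are arbitrary:

```python
import math

# Minimal sketches of the four main activation functions discussed above.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    return max(0.0, z)

def softmax(zs):
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(0 < sigmoid(0.7) < 1)        # sigmoid output lies in (0, 1)
print(-1 < tanh(-0.3) < 1)         # tanh output lies in (-1, 1)
print(relu(-2.0), relu(3.0))       # negative inputs are clipped to 0
print(sum(softmax([1.0, 2.0, 3.0])))  # softmax outputs sum to 1
```

(Production code would use a library implementation, e.g. from NumPy or a deep learning framework, with numerically stable variants such as subtracting the maximum before exponentiating in softmax.)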

Conclusion

This concludes the overview of activation functions. For deep learning applications, start with ReLU and consider other functions if necessary based on performance.