Gradient Descent Theory

MYZ 303E Artificial Intelligence in Civil Engineering

Introduction to Gradient Descent

Gradient Descent is a fundamental optimization algorithm used in machine learning to find the minimum of a function. In the context of machine learning, this function is typically a loss or cost function that measures how well a model performs on a given dataset. The goal is to find the weight values (W) that minimize this function, resulting in an optimal model.

In essence, gradient descent is like finding the lowest point in a valley by taking steps in the steepest downhill direction. The algorithm gets its name from the fact that it follows the negative gradient (the direction of steepest descent) of the function to reach a minimum.

The Mathematical Foundation

The Gradient

The gradient of a function is a vector of partial derivatives. For a function f(x₁, x₂, ..., xₙ), the gradient is denoted as ∇f:

\[ \nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right] \]

The gradient points in the direction of the steepest increase of the function. By taking steps in the opposite direction (-∇f), we can move toward the minimum of the function. In machine learning, we calculate the gradient with respect to the weights (W) of our model.
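As a quick check of this definition, here is a minimal NumPy sketch (the function f and all values are illustrative, not from the course material) comparing an analytical gradient with a central finite-difference approximation:

```python
import numpy as np

def f(x):
    # Illustrative function: f(x1, x2) = x1^2 + 3*x2^2
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    # Analytical gradient: [df/dx1, df/dx2] = [2*x1, 6*x2]
    return np.array([2 * x[0], 6 * x[1]])

def numerical_gradient(f, x, h=1e-6):
    # Central finite differences: approximate each partial derivative in turn
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

x = np.array([1.0, 2.0])
print(grad_f(x))                 # [ 2. 12.]
print(numerical_gradient(f, x))  # approximately [ 2. 12.]
```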

The Update Rule

In gradient descent, we iteratively update the weights (W) using the following rule:

\[ W = W - \alpha \nabla J(W) \]

Where:

- W is the vector of model weights
- α (the learning rate) controls the size of each step
- ∇J(W) is the gradient of the cost function J with respect to the weights

Important Note:

The learning rate α is a critical hyperparameter. If it's too small, the algorithm will converge very slowly. If it's too large, the algorithm might overshoot the minimum and fail to converge or even diverge.
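To make the update rule concrete, a minimal sketch in NumPy follows; the quadratic objective, tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, tol=1e-6, max_iter=10_000):
    # Repeatedly apply W = W - alpha * grad(W) until the update is negligible
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # the step has become tiny: treat as converged
            break
    return w

# Minimize J(w) = (w - 3)^2, whose gradient is 2*(w - 3); the minimum is at w = 3.
# For this objective, alpha < 1 converges; alpha > 1 overshoots and diverges.
print(gradient_descent(lambda w: 2 * (w - 3), w0=[0.0], alpha=0.1))  # ~[3.]
```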

Types of Gradient Descent

Batch Gradient Descent

In batch gradient descent, we compute the gradient of the cost function with respect to the weights for the entire training dataset:

\[ W = W - \alpha \nabla J(W) \]

This approach is computationally expensive for large datasets as it requires calculating gradients over the entire training set for each update step.
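A minimal sketch of batch gradient descent for least-squares regression, using synthetic data invented here for illustration:

```python
import numpy as np

# Synthetic linear-regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

W = np.zeros(3)
alpha = 0.1
for _ in range(500):
    # Gradient of the mean squared error over the ENTIRE dataset
    grad = (2 / len(X)) * X.T @ (X @ W - y)
    W = W - alpha * grad

print(W)  # approaches true_w
```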

Stochastic Gradient Descent (SGD)

Stochastic gradient descent updates the weights using only one training example at a time:

\[ W = W - \alpha \nabla J(W; x^{(i)}, y^{(i)}) \]

Where (x⁽ⁱ⁾, y⁽ⁱ⁾) is a single training example. Each SGD update is much cheaper to compute than a full-batch update, but the updates have higher variance, which can cause the objective function to fluctuate heavily.
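A sketch of the same illustrative least-squares problem solved with SGD; the data, learning rate, and epoch count are assumptions made for the example:

```python
import numpy as np

# Synthetic least-squares data, invented for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

W = np.zeros(3)
alpha = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):     # visit examples in random order each epoch
        x_i, y_i = X[i], y[i]
        grad = 2 * (x_i @ W - y_i) * x_i  # gradient from a SINGLE example
        W = W - alpha * grad

print(W)  # noisy, but close to true_w
```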

Mini-batch Gradient Descent

Mini-batch gradient descent is a compromise between batch and stochastic methods. It updates the weights using a small random subset (mini-batch) of the training data:

\[ W = W - \alpha \nabla J(W; x^{(i:i+n)}, y^{(i:i+n)}) \]

This approach reduces the variance of the parameter updates compared to SGD, leading to more stable convergence. It also allows for efficient matrix operations that can be parallelized.
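A mini-batch sketch of the same illustrative problem; the batch size of 32 is a common but arbitrary choice:

```python
import numpy as np

# Same illustrative least-squares setup as above
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

W = np.zeros(3)
alpha, batch_size = 0.05, 32
for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient over one mini-batch; the matrix products vectorize efficiently
        grad = (2 / len(Xb)) * Xb.T @ (Xb @ W - yb)
        W = W - alpha * grad

print(W)  # close to true_w, with less fluctuation than pure SGD
```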

Convergence and Challenges

Convergence Criteria

Gradient descent typically terminates when one of these conditions is met:

- The magnitude of the gradient falls below a chosen threshold
- The change in the cost function (or in the weights) between iterations becomes negligibly small
- A preset maximum number of iterations is reached

Challenges in Gradient Descent

Local Minima

For non-convex functions (like those in deep learning), gradient descent may converge to a local minimum rather than the global minimum. This is because the algorithm only moves downhill and can get trapped in valleys that aren't the deepest point globally.

Saddle Points

Saddle points are points where the gradient is zero but which are neither minima nor maxima: the surface curves upward in some directions and downward in others. In high-dimensional spaces (common in machine learning), saddle points are far more prevalent than local minima and can slow down convergence.

Plateaus

Plateaus are flat regions where the gradient is very small but not zero. Gradient descent can slow down significantly in these regions, taking many iterations to escape.

Advanced Gradient Descent Variants

Momentum

Momentum helps accelerate gradient descent by adding a fraction of the previous update vector to the current update:

\[ v = \gamma v - \alpha \nabla J(W) \]
\[ W = W + v \]

Where γ is the momentum coefficient (typically 0.9). This helps the algorithm navigate ravines (areas where the surface curves much more steeply in one dimension than in another) more effectively.
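A sketch of the momentum update on an ill-conditioned quadratic, a stand-in for the "ravine" described above; the objective and step sizes are illustrative:

```python
import numpy as np

def momentum_descent(grad, w0, alpha=0.01, gamma=0.9, steps=300):
    # v accumulates a decaying sum of past gradients (a "velocity")
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v - alpha * grad(w)  # v = gamma*v - alpha*grad J(W)
        w = w + v                        # W = W + v
    return w

# A "ravine": J(W) = w1^2 + 50*w2^2 curves far more steeply along w2 than w1
grad = lambda w: np.array([2 * w[0], 100 * w[1]])
print(momentum_descent(grad, [5.0, 1.0]))  # approaches the minimum at [0, 0]
```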

RMSprop

RMSprop adapts the learning rate for each weight based on the history of squared gradients:

\[ E[g^2]_t = 0.9E[g^2]_{t-1} + 0.1(\nabla J(W))^2 \]
\[ W = W - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}}\nabla J(W) \]

This helps normalize the gradient, making the algorithm less sensitive to the scale of the features.
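A sketch of the RMSprop update; the decay factor 0.9 mirrors the constants in the formula above, while the objective and other values are illustrative:

```python
import numpy as np

def rmsprop(grad, w0, alpha=0.01, rho=0.9, eps=1e-8, steps=1000):
    # Keep a running average of squared gradients and scale each step by it
    w = np.asarray(w0, dtype=float)
    avg_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1 - rho) * g**2   # E[g^2]_t
        w = w - alpha * g / np.sqrt(avg_sq + eps)  # per-weight adaptive step
    return w

# Badly scaled quadratic: J(W) = w1^2 + 1000*w2^2
grad = lambda w: np.array([2 * w[0], 2000 * w[1]])
print(rmsprop(grad, [5.0, 5.0]))  # both coordinates end up near 0
```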

Adam (Adaptive Moment Estimation)

Adam combines the ideas of momentum and RMSprop, keeping track of both the first moment (mean) and the second moment (uncentered variance) of the gradients:

\[ m = \beta_1 m + (1-\beta_1)\nabla J(W) \]
\[ v = \beta_2 v + (1-\beta_2)(\nabla J(W))^2 \]
\[ \hat{m} = \frac{m}{1-\beta_1^t} \]
\[ \hat{v} = \frac{v}{1-\beta_2^t} \]
\[ W = W - \frac{\alpha}{\sqrt{\hat{v}} + \epsilon}\hat{m} \]

Adam is widely used in practice because it often converges faster and more reliably than basic gradient descent.
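A sketch of the Adam update implementing the five equations above; the defaults β₁ = 0.9 and β₂ = 0.999 are the commonly cited choices, assumed here:

```python
import numpy as np

def adam(grad, w0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)  # first moment (mean of gradients)
    v = np.zeros_like(w)  # second moment (uncentered variance of gradients)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction: m starts at zero
        v_hat = v / (1 - beta2**t)  # bias correction: v starts at zero
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Same badly scaled quadratic as in the RMSprop example
grad = lambda w: np.array([2 * w[0], 2000 * w[1]])
print(adam(grad, [5.0, 5.0]))  # converges to near [0, 0]
```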

Applications in Civil Engineering

Gradient descent is widely used in civil engineering applications of machine learning, for example:

- Predicting concrete compressive strength from mix proportions
- Structural health monitoring and damage detection
- Settlement and bearing-capacity estimation in geotechnical engineering
- Traffic flow and infrastructure demand forecasting

Civil Engineering Perspective:

In civil engineering applications, the choice of the loss function is particularly important. For structural applications, safety-critical concerns often necessitate asymmetric loss functions that penalize under-prediction more severely than over-prediction.
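As one possible illustration of such a loss, here is a sketch of a weighted squared error in which under-prediction (predicting below the true value) is penalized more heavily; the weighting factor and data values are hypothetical:

```python
import numpy as np

def asymmetric_squared_error(y_true, y_pred, under_weight=4.0):
    # Residual > 0 means the model under-predicted the true value
    residual = y_true - y_pred
    weights = np.where(residual > 0, under_weight, 1.0)  # hypothetical 4:1 penalty ratio
    return np.mean(weights * residual**2)

# Hypothetical measured vs. predicted values (e.g. concrete strengths in MPa)
y_true = np.array([30.0, 32.0, 35.0])
y_pred = np.array([28.0, 33.0, 35.0])
print(asymmetric_squared_error(y_true, y_pred))  # under-prediction dominates the loss
```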

Conclusion

Gradient descent is a powerful optimization technique that forms the backbone of most machine learning algorithms. Understanding its mathematical foundations, variants, and limitations is essential for effectively applying machine learning to civil engineering problems.

While this tutorial has focused on the theoretical aspects, practical implementation requires careful consideration of hyperparameters, preprocessing of data, and selection of appropriate model architectures for the specific problem at hand.