Softmax Function

A Fundamental Tool for Multi-Class Classification

1. What is the Softmax Function?

The softmax function is widely used in machine learning and deep learning for multi-class classification problems. Its main purpose is to convert a set of real numbers (typically the output values of a neural network's output layer) into a probability distribution.

Softmax produces a value between 0 and 1 for each element of an input vector, and the sum of these values equals 1. This property allows softmax outputs to be directly interpreted as probabilities.

The softmax function is mathematically expressed as:

\[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

Where:

  • \( \mathbf{z} \): Input vector
  • \( z_i \): The i-th element of the vector
  • \( K \): Dimension of the vector (number of classes)
  • \( \sigma(\mathbf{z})_i \): Probability of the i-th class
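As a quick worked example, take \( \mathbf{z} = (1, 2, 3) \):

\[ \sigma(\mathbf{z}) = \left( \frac{e^{1}}{e^{1} + e^{2} + e^{3}}, \frac{e^{2}}{e^{1} + e^{2} + e^{3}}, \frac{e^{3}}{e^{1} + e^{2} + e^{3}} \right) \approx (0.0900, 0.2447, 0.6652) \]

The largest input receives the largest probability, every class keeps a nonzero share, and the three values sum to 1.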

2. Implementing Softmax

Let's look at a softmax calculation for a 5-class classification problem. We'll use raw neural network outputs (logits) as input and convert them to probabilities.

Softmax Function in Vanilla JavaScript:

function softmax(logits, temperature = 1.0) {
    // Scale the logits by the temperature T (T = 1 gives the standard softmax; see Section 3)
    const scaledLogits = logits.map(x => x / temperature);
    
    // Find the maximum value (for numerical stability)
    const maxLogit = Math.max(...scaledLogits);
    
    // Calculate exp(x_i - max) (overflow prevention)
    const expValues = scaledLogits.map(x => Math.exp(x - maxLogit));
    
    // Calculate the sum
    const expSum = expValues.reduce((a, b) => a + b, 0);
    
    // Calculate probability for each value
    return expValues.map(x => x / expSum);
}

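As a quick check, here is the function applied to a hypothetical logit vector for the five classes (the values are illustrative, not the output of a trained model):

const logits = [2.0, 1.0, 0.1, 3.0, -1.0];
const probs = softmax(logits);

probs.forEach((p, i) => {
    console.log(`Class ${i}: logit = ${logits[i]}, softmax = ${p.toFixed(4)}`);
});
// Class 0: logit = 2, softmax = 0.2333
// Class 1: logit = 1, softmax = 0.0858
// Class 2: logit = 0.1, softmax = 0.0349
// Class 3: logit = 3, softmax = 0.6343
// Class 4: logit = -1, softmax = 0.0116

The five probabilities sum to 1, and the largest logit (class 3) dominates without completely suppressing the other classes.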

3. Softmax and Temperature Parameter

By adding a temperature parameter to the softmax function, we can control the sharpness of the distribution. The temperature parameter is used to scale the logits before the softmax calculation:

\[ \sigma(\mathbf{z}, T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}} \]

Where \(T\) is the temperature parameter.

Lowering the temperature (T < 1) sharpens the distribution, concentrating probability mass on the class with the largest logit, while raising it (T > 1) flattens the distribution toward uniform; T = 1 recovers the standard softmax. The effect is illustrated below:
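A minimal sketch, reusing the softmax function from Section 2 on the same illustrative logits as above:

const logits = [2.0, 1.0, 0.1, 3.0, -1.0];

for (const T of [0.5, 1.0, 2.0]) {
    const probs = softmax(logits, T).map(p => p.toFixed(4));
    console.log(`T = ${T}: [${probs.join(', ')}]`);
}
// T = 0.5: [0.1170, 0.0158, 0.0026, 0.8643, 0.0003]  (sharper)
// T = 1: [0.2333, 0.0858, 0.0349, 0.6343, 0.0116]
// T = 2: [0.2587, 0.1569, 0.1001, 0.4266, 0.0577]  (flatter)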

4. Properties of Softmax

An important property that distinguishes softmax from element-wise activation functions such as ReLU or sigmoid is that each output depends on every input. One consequence is that its derivative is a full \( K \times K \) Jacobian matrix rather than a single scalar per element:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \begin{cases} \sigma(\mathbf{z})_i(1 - \sigma(\mathbf{z})_i) & \text{if } i = j \\ -\sigma(\mathbf{z})_i\sigma(\mathbf{z})_j & \text{if } i \neq j \end{cases} \]
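A minimal sketch of this Jacobian in JavaScript (the helper name softmaxJacobian is ours, not a standard API):

function softmaxJacobian(probs) {
    // J[i][j] = p_i * (1 - p_i) on the diagonal, -p_i * p_j elsewhere
    return probs.map((pi, i) =>
        probs.map((pj, j) => (i === j ? pi * (1 - pi) : -pi * pj))
    );
}

// Example with the probabilities from Section 1, softmax([1, 2, 3]):
const J = softmaxJacobian(softmax([1, 2, 3]));
console.log(J[0][0].toFixed(4)); // 0.0819 = 0.0900 * (1 - 0.0900)
console.log(J[0][1].toFixed(4)); // -0.0220 = -0.0900 * 0.2447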

5. Numerical Stability

One of the biggest issues in softmax calculation is numerical overflow. For large input values, \(e^{z_i}\) can produce very large numbers that cause overflow in calculations.

A common technique to solve this is to subtract the maximum value from all input values:

\[ \sigma(\mathbf{z})_i = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}} \]

This gives the same mathematical result but improves numerical stability: every exponent is now at most zero, so \( e^{z_i - \max(\mathbf{z})} \) lies in \( (0, 1] \) and cannot overflow. We implemented this technique in our JavaScript code above.
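The equivalence holds because a common factor cancels; for any constant \( c \) (here \( c = \max(\mathbf{z}) \)):

\[ \frac{e^{z_i - c}}{\sum_{j=1}^{K} e^{z_j - c}} = \frac{e^{-c} \, e^{z_i}}{e^{-c} \sum_{j=1}^{K} e^{z_j}} = \sigma(\mathbf{z})_i \]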

Stability Test:
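To see the difference, here is a naive implementation without the max subtraction (written here only for contrast; it is not part of the code above) next to the stable softmax from Section 2:

function naiveSoftmax(logits) {
    // No max subtraction: Math.exp overflows to Infinity for large inputs
    const expValues = logits.map(x => Math.exp(x));
    const expSum = expValues.reduce((a, b) => a + b, 0);
    return expValues.map(x => x / expSum); // Infinity / Infinity = NaN
}

const bigLogits = [1000, 1001, 1002];

console.log(naiveSoftmax(bigLogits)); // [NaN, NaN, NaN]
console.log(softmax(bigLogits).map(p => p.toFixed(4)));
// ['0.0900', '0.2447', '0.6652'], the same as softmax([1, 2, 3]),
// since shifting all logits by a constant leaves the result unchanged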

6. Real-World Applications

The softmax function is used in a wide variety of machine learning and deep learning applications:

Image Classification

Deep convolutional neural networks (CNNs) typically use softmax in the final layer to assign an input image to one of many classes, for example MNIST digit recognition (10 classes) or ImageNet classification (1000 classes).

Natural Language Processing

Word predictions, sentiment analysis, or text classification problems use softmax in the output layer to obtain a probability distribution across possible classes.

Example Scenario: Handwritten Digit Recognition

Suppose a neural network produces a logit value for each of 5 candidate digits; softmax converts these scores into probabilities:
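A sketch with illustrative logits (hypothetical values, not the output of a trained model), reusing the softmax function from Section 2:

// Candidate digits and hypothetical logits (illustrative only)
const digits = [1, 2, 3, 4, 5];
const digitLogits = [0.5, 1.2, -0.3, 2.1, 4.0];
const digitProbs = softmax(digitLogits);

digits.forEach((d, i) => {
    console.log(`Digit ${d}: logit = ${digitLogits[i]}, probability = ${digitProbs[i].toFixed(4)}`);
});
// Digit 1: logit = 0.5, probability = 0.0241
// Digit 2: logit = 1.2, probability = 0.0485
// Digit 3: logit = -0.3, probability = 0.0108
// Digit 4: logit = 2.1, probability = 0.1193
// Digit 5: logit = 4, probability = 0.7974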

In this example, the highest probability goes to the digit "5", so the model would classify this input as a "5".