A Fundamental Tool for Multi-Class Classification
The softmax function is widely used in machine learning and deep learning for multi-class classification problems. Its main purpose is to convert a vector of real numbers (typically the raw outputs, or logits, of a neural network's final layer) into a probability distribution.
Softmax produces a value between 0 and 1 for each element of an input vector, and the sum of these values equals 1. This property allows softmax outputs to be directly interpreted as probabilities.
The softmax function is mathematically expressed as:
\[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]
Where \(z_i\) is the \(i\)-th element of the input vector \(\mathbf{z}\), \(K\) is the number of classes, and \(\sigma(\mathbf{z})_i\) is the probability assigned to class \(i\).
Let's look at a softmax calculation for a 5-class classification problem. We'll use raw neural network outputs (logits) as input and convert them to probabilities.
function softmax(logits, temperature = 1.0) {
  // Scale the logits by the temperature parameter T
  const scaledLogits = logits.map(x => x / temperature);
  // Find the maximum value (for numerical stability)
  const maxLogit = Math.max(...scaledLogits);
  // Calculate exp(x_i - max) to prevent overflow
  const expValues = scaledLogits.map(x => Math.exp(x - maxLogit));
  // Calculate the sum of the exponentials
  const expSum = expValues.reduce((a, b) => a + b, 0);
  // Normalize to get a probability for each value
  return expValues.map(x => x / expSum);
}
Applying this function to a sample input vector (logits) produces one softmax value per class, and the values sum to 1.
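As a quick sketch, here the softmax function defined above is applied to an illustrative logit vector for the five classes (the numbers are made up for demonstration, not taken from a real model):

// Hypothetical logits for a 5-class problem (illustrative values only)
const logits = [2.0, 1.0, 0.1, 3.0, -1.0];
const probs = softmax(logits);

probs.forEach((p, i) => console.log(`class ${i}: ${p.toFixed(4)}`));
// → roughly 0.23, 0.09, 0.03, 0.63, 0.01
console.log(probs.reduce((a, b) => a + b, 0)); // sums to 1 (up to floating-point error)

Note how the largest logit (3.0) receives the largest share of probability, while every output stays between 0 and 1.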
By adding a temperature parameter to the softmax function, we can control the sharpness of the distribution. The temperature parameter is used to scale the logits before the softmax calculation:
\[ \sigma(\mathbf{z}, T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}} \]
Where \(T\) is the temperature parameter.
Lower temperatures (\(T < 1\)) sharpen the distribution, concentrating probability on the largest logit, while higher temperatures (\(T > 1\)) flatten it toward a uniform distribution. The sketch below illustrates this effect:
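This is a minimal demonstration using the softmax function defined earlier and the same illustrative logits as before (assumed values, not model output):

// Hypothetical logits (illustrative values only)
const logits = [2.0, 1.0, 0.1, 3.0, -1.0];

for (const T of [0.5, 1.0, 2.0, 5.0]) {
  const probs = softmax(logits, T).map(p => p.toFixed(3));
  console.log(`T = ${T}: [${probs.join(', ')}]`);
}
// As T grows, the probabilities approach the uniform distribution (0.2 each);
// as T shrinks toward 0, almost all of the mass concentrates on the largest logit.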
Several important properties distinguish the softmax function from other activation functions: it operates on the whole input vector rather than element-wise (unlike ReLU or sigmoid), its outputs always form a valid probability distribution, and it is invariant to adding the same constant to every logit, as the sketch below shows.
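A quick check of the shift-invariance property, again using the softmax function defined above with made-up logits:

// Adding the same constant to every logit leaves the softmax output unchanged
const logits = [2.0, 1.0, 0.1, 3.0, -1.0]; // illustrative values
const shifted = logits.map(x => x + 100);  // shift every logit by +100

console.log(softmax(logits));
console.log(softmax(shifted)); // identical probabilities (up to floating-point error)

This invariance is exactly what makes the max-subtraction trick described in the numerical stability discussion below possible.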
The derivative of the softmax function:
\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \begin{cases} \sigma(\mathbf{z})_i(1 - \sigma(\mathbf{z})_i) & \text{if } i = j \\ -\sigma(\mathbf{z})_i\sigma(\mathbf{z})_j & \text{if } i \neq j \end{cases} \]
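As an illustrative sketch, the full \(K \times K\) Jacobian implied by this formula can be built directly from the softmax output; the helper name softmaxJacobian is ours, not a standard API:

// Builds the K x K Jacobian matrix d(softmax_i)/d(z_j) from the softmax output
function softmaxJacobian(probs) {
  return probs.map((pi, i) =>
    probs.map((pj, j) => (i === j ? pi * (1 - pi) : -pi * pj))
  );
}

const probs = softmax([2.0, 1.0, 0.1]); // illustrative 3-class logits
console.log(softmaxJacobian(probs));

Each row of the Jacobian sums to 0, which reflects the constraint that the output probabilities always sum to 1.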
One of the biggest issues in softmax calculation is numerical overflow. For large input values, \(e^{z_i}\) can produce very large numbers that cause overflow in calculations.
A common technique to solve this is to subtract the maximum value from all input values:
\[ \sigma(\mathbf{z})_i = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}} \]
This gives the same mathematical result but improves numerical stability. We implemented this technique in our JavaScript code above.
For example, with very large logit values the naive formula overflows, while the stabilized version still returns valid probabilities:
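Here is a brief sketch comparing a naive implementation (the naiveSoftmax helper is written here only for contrast; it is not part of the code above) against the stabilized softmax, using deliberately large made-up logits:

// Naive implementation without the max-subtraction trick (for comparison only)
function naiveSoftmax(logits) {
  const expValues = logits.map(x => Math.exp(x));
  const expSum = expValues.reduce((a, b) => a + b, 0);
  return expValues.map(x => x / expSum);
}

const bigLogits = [1000, 1001, 1002]; // Math.exp overflows to Infinity above ~709

console.log(naiveSoftmax(bigLogits)); // [NaN, NaN, NaN] (Infinity / Infinity)
console.log(softmax(bigLogits));      // valid probabilities, roughly [0.09, 0.24, 0.67]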
The softmax function is used in a wide variety of machine learning and deep learning applications:
Deep Convolutional Neural Networks (CNNs) typically use softmax in the final layer to assign an image to one of many classes, for example MNIST digit recognition (10 classes) or ImageNet classification (1000 classes).
Natural language processing tasks such as next-word prediction, sentiment analysis, and text classification likewise use softmax in the output layer to obtain a probability distribution over the possible classes.
As a final example, consider the logit values a neural network might produce when classifying an input among 5 different digits, and their conversion to probabilities with softmax:
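The logits and digit labels here are invented for illustration; they are not the output of a real trained network:

// Hypothetical digit classes and the raw scores (logits) a network might produce
const digits = ['0', '1', '3', '5', '8'];
const digitLogits = [0.8, -1.2, 1.5, 4.1, 0.3];

const digitProbs = softmax(digitLogits);
digits.forEach((d, i) =>
  console.log(`digit ${d}: ${(digitProbs[i] * 100).toFixed(1)}%`)
);
// The digit "5" receives by far the largest logit (4.1), so it ends up with
// most of the probability mass after softmax.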
In this example, the model has assigned the highest probability to the digit "5", which means the model will classify this input as "5".