Formula Deep-Dive: Softmax Function
The Formula
\[
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
\]
where:
- \(z_i\) = raw score (logit) for class \(i\)
- \(K\) = total number of classes
- \(e\) = Euler's number (~2.718)
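As a minimal sketch, here is the formula in NumPy (the function name and the naive implementation are just for illustration; see the numerical-stability gotcha below for the version to use in practice):

```python
import numpy as np

def softmax(z):
    """Naive softmax: exponentiate every logit, then normalize by the sum."""
    exp_z = np.exp(z)            # e^{z_i} > 0 for every real z_i
    return exp_z / exp_z.sum()   # divide so the outputs sum to 1

print(softmax(np.array([2.3, 1.5, 0.8])))  # ≈ [0.598, 0.269, 0.134]
```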
Why Do We Need This?
The Problem
Your neural network outputs raw scores (logits) for each class: [2.3, 1.5, 0.8] for "cat, dog, bird". These are unbounded numbers. How do you turn them into probabilities that sum to 1?
❌ Bad Idea: Just Normalize
Scores: [2.3, 1.5, 0.8] → Sum = 4.6
Divide: [2.3/4.6, 1.5/4.6, 0.8/4.6] = [0.50, 0.33, 0.17]
Problem: What if scores are negative? [-1, 2, -3] → Sum = -2 😱 Dividing gives [0.5, -1.0, 1.5]: a negative "probability" and a value above 1.
✅ What Softmax Does
- Converts any real numbers → valid probabilities (0 to 1)
- Output probabilities sum to exactly 1
- Preserves order (highest score → highest probability)
- Is differentiable (for backpropagation)
How Softmax Works
Step 1: Exponentiate Everything
Why \(e^z\)? Because:
- \(e^x > 0\) for ALL \(x\) (negative inputs become small positive numbers)
- \(e^x\) is smooth and differentiable
- \(e^x\) grows fast → amplifies differences
Example: [2.3, 1.5, 0.8]
- \(e^{2.3} = 9.97\)
- \(e^{1.5} = 4.48\)
- \(e^{0.8} = 2.23\)
Step 2: Normalize
Sum: \(9.97 + 4.48 + 2.23 = 16.68\)
Divide each by the sum:
- \(9.97 / 16.68 = 0.598\)
- \(4.48 / 16.68 = 0.269\)
- \(2.23 / 16.68 = 0.134\)
Sum = 1.001 ≈ 1 ✓ (rounding error)
Step 3: Check Properties
- ✓ All values between 0 and 1
- ✓ Sum to 1
- ✓ Highest score (2.3) → highest probability (0.598)
- ✓ Differentiable everywhere
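A short sketch that reproduces these three steps and asserts the properties, assuming NumPy:

```python
import numpy as np

z = np.array([2.3, 1.5, 0.8])

# Step 1: exponentiate
exp_z = np.exp(z)                # ≈ [9.97, 4.48, 2.23]

# Step 2: normalize
probs = exp_z / exp_z.sum()      # ≈ [0.598, 0.269, 0.134]

# Step 3: check the properties
assert np.all((probs > 0) & (probs < 1))                  # all values in (0, 1)
assert np.isclose(probs.sum(), 1.0)                       # sums to 1
assert np.array_equal(np.argsort(z), np.argsort(probs))   # order preserved
print(probs)
```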
The "Soft" Part
Compare to argmax (hard maximum):
Argmax (Hard)
Scores: [2.3, 1.5, 0.8]
Output: [1, 0, 0] (one-hot)
Problem: Not differentiable! Can't backpropagate.
Softmax (Soft)
Scores: [2.3, 1.5, 0.8]
Output: [0.598, 0.269, 0.134]
Benefit: Smooth approximation! Works with gradient descent.
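A tiny illustration of the contrast, assuming NumPy (the one-hot construction stands in for argmax's output):

```python
import numpy as np

z = np.array([2.3, 1.5, 0.8])

# Argmax: pick the winner and emit a one-hot vector. As a function of z this
# is a step function: zero gradient almost everywhere, undefined at ties.
hard = np.zeros_like(z)
hard[np.argmax(z)] = 1.0            # [1., 0., 0.]

# Softmax: a smooth stand-in for the same decision, usable with gradients.
soft = np.exp(z) / np.exp(z).sum()  # ≈ [0.598, 0.269, 0.134]

print(hard)
print(soft)
```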
Temperature: Controlling Confidence
The full formula includes a temperature \(T\):
\[
\text{softmax}(z_i; T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
\]
For the logits [2.3, 1.5, 0.8]:
| Temperature | Effect | Output |
|---|---|---|
| \(T \to 0\) | Approaches argmax (confident) | [0.999, 0.001, 0.000] |
| \(T = 0.5\) | More "peaky" (confident) | [0.799, 0.161, 0.040] |
| \(T = 1\) | Normal softmax | [0.598, 0.269, 0.134] |
| \(T = 2\) | More "flat" (uncertain) | [0.467, 0.313, 0.220] |
| \(T \to \infty\) | Uniform distribution (uncertain) | [0.333, 0.333, 0.333] |
Rule of thumb:
- Low \(T\) → confident predictions (e.g., deployment)
- High \(T\) → exploratory predictions (e.g., creative text generation)
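A sketch of temperature scaling, assuming NumPy and the same logits [2.3, 1.5, 0.8] as in the table:

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Temperature-scaled softmax: divide the logits by T, then softmax."""
    scaled = np.asarray(z, dtype=float) / T
    exp_z = np.exp(scaled - scaled.max())   # max-subtraction for stability
    return exp_z / exp_z.sum()

z = [2.3, 1.5, 0.8]
for T in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(T, np.round(softmax_t(z, T), 3))
# Low T -> peaky, close to one-hot; high T -> flat, close to uniform.
```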
Concrete Example
Scenario: Image Classification (Cat, Dog, Bird)
Raw scores from neural network: [3.2, 1.3, 0.2]
Step 1: Exponentiate
- \(e^{3.2} = 24.53\)
- \(e^{1.3} = 3.67\)
- \(e^{0.2} = 1.22\)
Step 2: Sum: \(24.53 + 3.67 + 1.22 = 29.42\)
Step 3: Normalize
- Cat: \(24.53 / 29.42 = 0.834\) (83.4%)
- Dog: \(3.67 / 29.42 = 0.125\) (12.5%)
- Bird: \(1.22 / 29.42 = 0.041\) (4.1%)
The model is ~83% confident it's a cat!
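The same arithmetic as a few lines of NumPy (class names are just the ones from this example):

```python
import numpy as np

logits = np.array([3.2, 1.3, 0.2])               # cat, dog, bird
probs = np.exp(logits) / np.exp(logits).sum()
for name, p in zip(["cat", "dog", "bird"], probs):
    print(f"{name}: {p:.1%}")                    # cat: 83.4%, dog: 12.5%, bird: 4.1%
```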
Why Exponential? Why Not Just Divide?
Let's see what happens without exponentiation:
Without \(e^x\)
Scores: [5, 4, 1]
Simple normalize: [0.5, 0.4, 0.1]
Difference between top two: 0.1
With \(e^x\)
Scores: [5, 4, 1]
Softmax: [0.721, 0.265, 0.013]
Difference between top two: 0.456
The exponential amplifies the gaps: a 1-point lead in the logits turns into a ~0.46 lead in probability, and the clear loser is pushed toward zero.
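The comparison as a sketch, assuming NumPy:

```python
import numpy as np

z = np.array([5.0, 4.0, 1.0])

plain = z / z.sum()                  # [0.5, 0.4, 0.1]
soft = np.exp(z) / np.exp(z).sum()   # ≈ [0.721, 0.265, 0.013]

print("plain gap:  ", plain[0] - plain[1])   # 0.1
print("softmax gap:", soft[0] - soft[1])     # ≈ 0.456
```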
Derivative (For Backpropagation)
The gradient of softmax is elegant but tricky. Writing \(p_i = \text{softmax}(z)_i\):
\[
\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j)
\]
where \(\delta_{ij}\) is 1 when \(i = j\) and 0 otherwise. The tricky part: every output depends on every input, so the full derivative is a \(K \times K\) Jacobian matrix, not a single number.
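A sketch of that Jacobian, assuming NumPy (`softmax_jacobian` is a name made up for this example):

```python
import numpy as np

def softmax_jacobian(z):
    """J[i, j] = d softmax(z)_i / d z_j = p_i * (delta_ij - p_j)."""
    p = np.exp(z - np.max(z))
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)   # p_i*delta_ij - p_i*p_j

J = softmax_jacobian(np.array([2.3, 1.5, 0.8]))
print(J)                # 3x3 matrix, positive on the diagonal, negative off it
print(J.sum(axis=1))    # ≈ [0, 0, 0]: the outputs always sum to 1, so the
                        # entries along each row cancel out
```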
Common Gotchas
1. Numerical Stability
❌ Large \(z_i\) → \(e^{z_i}\) overflows (infinity)
✅ Subtract the max before exponentiating:
\[
\text{softmax}(z_i) = \frac{e^{z_i - \max_j z_j}}{\sum_{k=1}^{K} e^{z_k - \max_j z_j}}
\]
This is mathematically equivalent (see Exercise 3 below) but numerically stable: the largest exponent is \(e^0 = 1\), so nothing overflows.
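A sketch of the stable version, assuming NumPy; the example logits are deliberately huge to trigger the overflow:

```python
import numpy as np

def stable_softmax(z):
    """Subtract the max logit first, so the largest exponent is e^0 = 1."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([1000.0, 999.0, 998.0])
# np.exp(z) alone overflows to inf, and inf / inf gives NaN;
# the shifted version is fine:
print(stable_softmax(z))   # ≈ [0.665, 0.245, 0.090]
```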
2. Don't Double-Apply!
❌ softmax(softmax(logits)): applying softmax twice flattens the distribution and silently hurts accuracy.
Many frameworks bundle softmax into the loss function (e.g., a cross_entropy_with_logits-style op, or PyTorch's nn.CrossEntropyLoss), so pass them raw logits, not probabilities.
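A quick NumPy-only demonstration of why this is a bug (kept framework-free so it doesn't pin any particular loss API):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

logits = np.array([3.2, 1.3, 0.2])
once = softmax(logits)            # ≈ [0.834, 0.125, 0.041]
twice = softmax(softmax(logits))  # ≈ [0.514, 0.253, 0.233]  <- much flatter!
print(once)
print(twice)
```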
3. Binary Classification?
For 2 classes, use the sigmoid instead: \(\sigma(z) = \frac{1}{1 + e^{-z}}\) applied to a single logit is simpler and equivalent to a 2-class softmax (specifically, \(\sigma(z_1 - z_2) = \text{softmax}([z_1, z_2])_1\)).
Reserve softmax for \(K \geq 3\) classes.
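A sketch of that equivalence, assuming NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z1, z2 = 2.0, -1.0
print(sigmoid(z1 - z2))        # 0.9525...
print(softmax([z1, z2])[0])    # 0.9525...  (the same number)
```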
Try It Yourself
Exercise 1: Calculate manually
Logits: [1.0, 2.0, 3.0] → Softmax probabilities = ?
(Hint: \(e^1 \approx 2.72\), \(e^2 \approx 7.39\), \(e^3 \approx 20.09\))
Exercise 2: Temperature experiment
For logits [2, 1, 0], calculate softmax at \(T=0.1\), \(T=1\), and \(T=10\). What pattern do you see?
Exercise 3: Prove stability
Show that \(\text{softmax}(z) = \text{softmax}(z - c)\) for any constant \(c\)
(This is why subtracting max works!)
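If you want to check your answers numerically, here is a small sketch assuming NumPy (it reuses the stable, temperature-scaled softmax from above):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

print(np.round(softmax([1.0, 2.0, 3.0]), 3))        # Exercise 1
for T in (0.1, 1, 10):                              # Exercise 2
    print(T, np.round(softmax([2, 1, 0], T), 3))
c = 123.4                                           # Exercise 3 (numerical check)
print(np.allclose(softmax([1.0, 2.0, 3.0]),
                  softmax(np.array([1.0, 2.0, 3.0]) - c)))
```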