Formula Deep-Dive: Binary Cross-Entropy Loss
The Formula
\[
L = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\Bigr]
\]
where:
- \(y_i\) = actual label (0 or 1)
- \(\hat{y}_i\) = predicted probability (between 0 and 1)
- \(n\) = number of samples
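Translated directly into code, the formula looks like this (a minimal plain-NumPy sketch; the function and variable names are just illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # y_true: array of 0/1 labels; y_pred: array of predicted probabilities in (0, 1)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.6, 0.3])
print(binary_cross_entropy(y_true, y_pred))   # ~0.51
```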
Why Do We Need This?
The Problem
You're training a binary classifier (cat vs. dog). Your model outputs a probability: 0.8 means "80% sure it's a cat". How do you measure how wrong you are?
❌ Bad Idea: Use Squared Error \((y - \hat{y})^2\)
- Actual label: \(y = 1\) (cat)
- Prediction: \(\hat{y} = 0.1\) (only 10% confident)
- Error: \((1 - 0.1)^2 = 0.81\)
But if \(\hat{y} = 0.9\):
- Error: \((1 - 0.9)^2 = 0.01\)
The numbers move in the right direction, but the penalty for being confidently wrong is capped at 1, so the model gets only a gentle nudge to fix its worst mistakes. We want a loss that punishes confident wrong answers much more severely.
Where Cross-Entropy Comes From
Step 1: Maximum Likelihood
Your model outputs probability \(\hat{y}\). The likelihood of the true label \(y\) is:
- If \(y = 1\): likelihood = \(\hat{y}\)
- If \(y = 0\): likelihood = \(1 - \hat{y}\)
We can write both cases in a single expression:
\[
P(y \mid \hat{y}) = \hat{y}^{\,y}\,(1 - \hat{y})^{\,1 - y}
\]
Check it:
- If \(y=1\): \(\hat{y}^1 \cdot (1-\hat{y})^0 = \hat{y}\) ✓
- If \(y=0\): \(\hat{y}^0 \cdot (1-\hat{y})^1 = 1-\hat{y}\) ✓
Step 2: Maximize Log-Likelihood
For all \(n\) samples (assumed independent), the total likelihood is the product:
\[
\mathcal{L} = \prod_{i=1}^{n} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}
\]
Products of many small probabilities are awkward to work with (and numerically unstable). Take the log, which is monotonic, so maximizing the log-likelihood is the same as maximizing the likelihood:
\[
\log \mathcal{L} = \sum_{i=1}^{n} \Bigl[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\Bigr]
\]
Step 3: Turn Into a Loss
We want something to minimize, not maximize. Flip the sign and average over the samples:
\[
L = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\Bigr]
\]
That's cross-entropy loss!
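To see that the derivation holds numerically, here is a small sanity check (a plain-NumPy sketch with made-up labels and predictions):

```python
import numpy as np

y = np.array([1, 0, 1])
y_hat = np.array([0.8, 0.3, 0.4])

likelihoods = y_hat**y * (1 - y_hat)**(1 - y)       # Step 1: per-sample likelihood
neg_mean_log_lik = -np.mean(np.log(likelihoods))    # Steps 2-3: log, negate, average

bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(neg_mean_log_lik, bce)                        # identical values (~0.499)
```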
Why It Works Better
For one sample where \(y = 1\), the loss reduces to:
\[
L = -\log(\hat{y})
\]
Taking the derivative with respect to \(\hat{y}\) gives \(\frac{\partial L}{\partial \hat{y}} = -\frac{1}{\hat{y}}\); for squared error \((1 - \hat{y})^2\) the corresponding derivative is \(-2(1 - \hat{y})\). Compare the two across predictions:
| Prediction \(\hat{y}\) | MSE gradient | Cross-Entropy gradient |
|---|---|---|
| 0.1 (very wrong) | -1.8 | -10.0 |
| 0.5 (unsure) | -1.0 | -2.0 |
| 0.9 (almost right) | -0.2 | -1.1 |

The worse the prediction, the harder cross-entropy pushes back: as \(\hat{y} \to 0\) its gradient blows up toward \(-\infty\), while the MSE gradient never exceeds 2 in magnitude. Confidently wrong predictions get corrected much faster.
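A quick check of these gradients (a plain-NumPy sketch, with \(y = 1\) throughout):

```python
import numpy as np

y_hat = np.array([0.1, 0.5, 0.9])
mse_grad = -2 * (1 - y_hat)   # d/dy_hat of (1 - y_hat)^2  -> [-1.8, -1.0, -0.2]
ce_grad = -1 / y_hat          # d/dy_hat of -log(y_hat)    -> [-10.0, -2.0, -1.11]
print(mse_grad, ce_grad)
```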
Concrete Example
Scenario: Cat Classifier
- Image 1: Cat (\(y=1\)), model says 0.9 → \(L = -\log(0.9) = 0.105\)
- Image 2: Cat (\(y=1\)), model says 0.1 → \(L = -\log(0.1) = 2.303\)
Image 2 contributes 22× more to the loss! The model will focus on fixing that mistake.
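The same two numbers in code (a plain-NumPy sketch):

```python
import numpy as np

loss_1 = -np.log(0.9)      # ~0.105, confidently right
loss_2 = -np.log(0.1)      # ~2.303, confidently wrong
print(loss_2 / loss_1)     # ~21.9, i.e. roughly 22x
```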
The Intuition
Cross-entropy measures surprise.
- 90% confident it's a cat AND it IS a cat → low surprise, low loss
- 90% confident it's NOT a cat BUT it IS → high surprise, high loss
Borrowed from information theory: "How many bits do I need to encode this outcome given my (wrong) probability distribution?" (Bits if you use log base 2; with the natural log used in most ML libraries, the unit is nats.)
Why "Cross" Entropy?
True distribution: \(p\) (real labels: 0 or 1)
Predicted distribution: \(q\) (model's probabilities)
Cross-entropy between \(p\) and \(q\):
\[
H(p, q) = -\sum_{x} p(x)\log q(x)
\]
In our case:
- \(p(y=1) = y_i\) and \(p(y=0) = 1-y_i\) (one-hot)
- \(q(y=1) = \hat{y}_i\) and \(q(y=0) = 1-\hat{y}_i\)
Plug in:
\[
H(p, q) = -\bigl[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\bigr]
\]
Average over all \(n\) samples and you get our formula!
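You can verify the "plug in" step for one sample numerically (a plain-NumPy sketch; the label and prediction are made up):

```python
import numpy as np

y, y_hat = 1, 0.8
p = np.array([1 - y, y])           # true one-hot distribution over {not-cat, cat}
q = np.array([1 - y_hat, y_hat])   # model's predicted distribution
h_pq = -np.sum(p * np.log(q))      # general cross-entropy H(p, q)
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(h_pq, bce)                   # identical values (~0.223)
```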
Common Gotchas
1. Don't feed raw logits to a loss that expects probabilities!
❌ loss = binary_cross_entropy(logits, labels)
✅ loss = binary_cross_entropy(sigmoid(logits), labels)
Or, better, use binary_cross_entropy_with_logits, which applies the sigmoid internally and is numerically stable.
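For concreteness, here is a minimal sketch assuming PyTorch (where the binary_cross_entropy_with_logits name comes from); the tensor values are made up:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (pre-sigmoid)
labels = torch.tensor([1.0, 0.0, 1.0])    # ground-truth labels as floats

# Wrong: binary_cross_entropy expects probabilities in [0, 1], not raw logits.
# loss_bad = F.binary_cross_entropy(logits, labels)   # raises an error here

# Option A: squash the logits through a sigmoid first.
loss_a = F.binary_cross_entropy(torch.sigmoid(logits), labels)

# Option B (preferred): let the fused op handle the sigmoid internally.
loss_b = F.binary_cross_entropy_with_logits(logits, labels)

print(loss_a.item(), loss_b.item())       # same value, up to float error
```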
2. Log(0) = -∞!
If \(\hat{y} = 0\) or \(\hat{y} = 1\) exactly → undefined loss
Clip predictions: clip(y_pred, 1e-7, 1-1e-7)
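In plain NumPy, that clipping looks like this (a sketch):

```python
import numpy as np

y_pred = np.array([0.0, 0.3, 1.0])
y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)   # 0.0 -> 1e-7, 1.0 -> 1 - 1e-7
```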
3. Multi-class is different
Categorical cross-entropy uses softmax + multiple classes
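For contrast, a minimal sketch of categorical cross-entropy for one sample (plain NumPy; the logits and one-hot target are made up):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])    # 3 classes
target = np.array([0.0, 0.0, 1.0])    # true class is index 2 (one-hot)
probs = softmax(logits)
loss = -np.sum(target * np.log(probs))   # categorical cross-entropy, one sample
print(loss)                              # ~0.41
```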
Try It Yourself
Exercise 1: Calculate manually
- \(y = 1\), \(\hat{y} = 0.7\) → \(L = \) ?
- \(y = 0\), \(\hat{y} = 0.3\) → \(L = \) ?
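If you want to check your answers afterwards (a plain-NumPy sketch using the natural log):

```python
import numpy as np

print(-np.log(0.7))       # y = 1, y_hat = 0.7
print(-np.log(1 - 0.3))   # y = 0, y_hat = 0.3
```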
Exercise 2: Think about it
Why does the formula have two terms if only one "fires" at a time?
Exercise 3: Derive it
Find the gradient \(\frac{\partial L}{\partial \hat{y}}\) for the case when \(y=0\)