Formula Deep-Dive: Binary Cross-Entropy Loss

The Formula

$$L = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
Where:
  • \(y_i\) = actual label (0 or 1)
  • \(\hat{y}_i\) = predicted probability (between 0 and 1)
  • \(n\) = number of samples
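
In code, the formula is a one-liner. Here's a minimal NumPy sketch (the function name and the toy batch are just ours for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Mean of -(y*log(y_hat) + (1-y)*log(1-y_hat)) over the batch
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]))  # ~0.228
```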

Why Do We Need This?

The Problem

You're training a binary classifier (cat vs. dog). Your model outputs a probability: 0.8 means "80% sure it's a cat". How do you measure how wrong you are?

❌ Bad Idea: Use Squared Error \((y - \hat{y})^2\)

  • Actual label: \(y = 1\) (cat)
  • Prediction: \(\hat{y} = 0.1\) (only 10% confident)
  • Error: \((1 - 0.1)^2 = 0.81\)

But if \(\hat{y} = 0.9\):

  • Error: \((1 - 0.9)^2 = 0.01\)

The Problem: The gradient of squared error with respect to \(\hat{y}\) is \(2(\hat{y} - y)\). When you're very wrong (0.1 vs 1.0), its magnitude is 1.8; when you're close (0.9 vs 1.0), it's 0.2. Worse, that magnitude can never exceed 2, no matter how confidently wrong the model is. When you're very wrong, you should learn FAST! But squared error's learning signal saturates instead of growing (see the short check below).
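
Here's a small illustration of that saturation, with made-up predictions for a positive example (\(y = 1\)):

```python
import numpy as np

y_hat = np.array([0.5, 0.1, 0.01, 0.001])   # increasingly wrong (true label is 1)
loss = (1 - y_hat) ** 2                      # squared error
grad = -2 * (1 - y_hat)                      # d(loss)/d(y_hat): magnitude never exceeds 2

for p, l, g in zip(y_hat, loss, grad):
    print(f"y_hat={p:<6} loss={l:.3f} gradient={g:.3f}")
```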

Where Cross-Entropy Comes From

Step 1: Maximum Likelihood

Your model outputs probability \(\hat{y}\). The likelihood of the true label \(y\) is:

  • If \(y = 1\): likelihood = \(\hat{y}\)
  • If \(y = 0\): likelihood = \(1 - \hat{y}\)

We can write both cases as:

$$P(y|\hat{y}) = \hat{y}^y \cdot (1-\hat{y})^{(1-y)}$$

Check it:

  • If \(y=1\): \(\hat{y}^1 \cdot (1-\hat{y})^0 = \hat{y}\) ✓
  • If \(y=0\): \(\hat{y}^0 \cdot (1-\hat{y})^1 = 1-\hat{y}\) ✓
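
The same check in Python, with an arbitrary prediction of 0.75:

```python
def likelihood(y, y_hat):
    # y_hat**y * (1 - y_hat)**(1 - y) picks out the right branch
    return y_hat**y * (1 - y_hat)**(1 - y)

print(likelihood(1, 0.75))   # 0.75 -> equals y_hat when y = 1
print(likelihood(0, 0.75))   # 0.25 -> equals 1 - y_hat when y = 0
```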

Step 2: Maximize Log-Likelihood

For all \(n\) samples (assumed independent), the total likelihood is:

$$\prod_{i=1}^{n} \hat{y}_i^{y_i} \cdot (1-\hat{y}_i)^{(1-y_i)}$$

Products of many probabilities are awkward to work with and tend to underflow. Take the log (it's monotonic, so maximizing the log is the same as maximizing the original):

$$\log L = \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
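
Taking the log isn't just algebraic convenience: the raw product of many likelihoods underflows to zero in floating point, while the sum of logs stays finite. A quick demonstration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)               # random 0/1 labels
y_hat = rng.uniform(0.05, 0.95, size=2000)      # random predicted probabilities

likelihoods = y_hat**y * (1 - y_hat)**(1 - y)   # per-sample P(y | y_hat)
print(np.prod(likelihoods))        # 0.0 -- the product underflows
print(np.log(likelihoods).sum())   # a large negative, but perfectly finite, number
```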

Step 3: Turn Into a Loss

We want to minimize loss (not maximize). Flip the sign and average:

$$L = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

That's cross-entropy loss!
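
As a sanity check, the formula we just derived should agree with a library implementation. Here's a comparison against PyTorch's binary_cross_entropy (assuming PyTorch is the framework in use; toy numbers):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0, 0.0, 1.0, 0.0])
y_hat = torch.tensor([0.9, 0.2, 0.7, 0.4])

manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
library = F.binary_cross_entropy(y_hat, y)   # expects probabilities, averages by default

print(manual.item(), library.item())         # the two values agree (~0.299)
```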

Why It Works Better

For one sample where \(y=1\):

$$L = -\log(\hat{y})$$

Taking the derivative with respect to \(\hat{y}\):

$$\frac{\partial L}{\partial \hat{y}} = -\frac{1}{\hat{y}}$$

| Prediction \(\hat{y}\) | Squared-error gradient (w.r.t. \(\hat{y}\)) | Cross-entropy gradient |
| --- | --- | --- |
| 0.1 (very wrong) | -1.8 | -10.0 |
| 0.5 (unsure) | -1.0 | -2.0 |
| 0.9 (almost right) | -0.2 | -1.1 |

The magic: When you're very wrong (0.1), the cross-entropy gradient is massive (-10), more than five times what squared error gives you, and it grows without bound as \(\hat{y} \to 0\). You learn FAST. When you're close (0.9), the gradient is smaller. You fine-tune.
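
You can reproduce the table with a few lines of NumPy; the finite-difference estimate (step size is an arbitrary choice) confirms the \(-1/\hat{y}\) gradient:

```python
import numpy as np

eps = 1e-6
for y_hat in (0.1, 0.5, 0.9):
    mse_grad = -2 * (1 - y_hat)       # d/dy_hat of (1 - y_hat)^2
    ce_grad = -1 / y_hat              # d/dy_hat of -log(y_hat)
    ce_numeric = (-np.log(y_hat + eps) + np.log(y_hat - eps)) / (2 * eps)
    print(f"y_hat={y_hat}: squared-error grad {mse_grad:+.2f}  "
          f"cross-entropy grad {ce_grad:+.2f} (finite diff {ce_numeric:+.2f})")
```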

Concrete Example

Scenario: Cat Classifier

  • Image 1: Cat (\(y=1\)), model says 0.9 → \(L = -\log(0.9) = 0.105\)
  • Image 2: Cat (\(y=1\)), model says 0.1 → \(L = -\log(0.1) = 2.303\)

Image 2 contributes 22× more to the loss! The model will focus on fixing that mistake.
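
The arithmetic, if you want to run it yourself (natural log throughout):

```python
import math

confident_right = -math.log(0.9)   # Image 1: cat, model says 0.9
confident_wrong = -math.log(0.1)   # Image 2: cat, model says 0.1

print(round(confident_right, 3), round(confident_wrong, 3))
print(round(confident_wrong / confident_right, 1))   # ratio, roughly 22
```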

The Intuition

Cross-entropy measures surprise.

  • 90% confident it's a cat AND it IS a cat → low surprise, low loss
  • 90% confident it's NOT a cat BUT it IS → high surprise, high loss

Borrowed from information theory: "How many bits (or nats, since we use the natural log) do I need to encode this outcome given my possibly-wrong probability distribution?"

Why "Cross" Entropy?

True distribution: \(p\) (real labels: 0 or 1)
Predicted distribution: \(q\) (model's probabilities)

Cross-entropy between \(p\) and \(q\):

$$H(p, q) = -\sum p(x) \log q(x)$$

In our case:

  • \(p(y=1) = y_i\) and \(p(y=0) = 1-y_i\) (one-hot)
  • \(q(y=1) = \hat{y}_i\) and \(q(y=0) = 1-\hat{y}_i\)

Plug in → you get our formula!
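
Here's that specialization spelled out with a toy label and prediction, confirming the general \(H(p, q)\) and the binary formula give the same number:

```python
import numpy as np

y, y_hat = 1, 0.7                    # true label and predicted probability (toy values)
p = np.array([1 - y, y])             # "true" distribution over {0, 1} (one-hot)
q = np.array([1 - y_hat, y_hat])     # model's distribution over {0, 1}

general = -np.sum(p * np.log(q))                              # H(p, q)
binary = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # our formula
print(general, binary)               # identical (~0.357)
```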

Common Gotchas

1. Don't use it with raw logits!

❌ loss = binary_cross_entropy(logits, labels)

✅ loss = binary_cross_entropy(sigmoid(logits), labels)

Better still, use binary_cross_entropy_with_logits, which applies the sigmoid internally and is numerically stable.
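
In PyTorch terms (again assuming that's the framework), the two safe options agree:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (made-up numbers)
labels = torch.tensor([1.0, 0.0, 1.0])

fused = F.binary_cross_entropy_with_logits(logits, labels)   # sigmoid applied internally
manual = F.binary_cross_entropy(torch.sigmoid(logits), labels)

print(fused.item(), manual.item())   # same value; the fused version is preferred
```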

2. Log(0) = -∞!

If \(\hat{y} = 0\) or \(\hat{y} = 1\) exactly, one of the log terms becomes \(\log(0) = -\infty\) and the loss blows up (inf or NaN).

Clip predictions: clip(y_pred, 1e-7, 1-1e-7)
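
A minimal NumPy illustration of why the clip matters; the 1e-7 epsilon matches the snippet above:

```python
import numpy as np

y, y_hat = 1.0, 0.0                  # true cat, model is certain it isn't
print(-np.log(y_hat))                # inf (plus a divide-by-zero warning)

eps = 1e-7
y_hat_clipped = np.clip(y_hat, eps, 1 - eps)
print(-np.log(y_hat_clipped))        # ~16.1: large, but finite and usable
```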

3. Multi-class is different

Categorical cross-entropy pairs a softmax over \(K\) classes with the negative log-probability of the true class; the binary formula above is the \(K = 2\) special case (see the sketch below).
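
A rough sketch of the multi-class version, with made-up logits, just to show its shape: softmax over the classes, then the negative log-probability of the true class:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(true_class, logits):
    # negative log-probability the model assigns to the correct class
    return -np.log(softmax(logits)[true_class])

logits = np.array([1.2, -0.3, 0.5])              # one sample, three classes
print(categorical_cross_entropy(0, logits))      # ~0.54
```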

Try It Yourself

Exercise 1: Calculate manually

  • \(y = 1\), \(\hat{y} = 0.7\) → \(L = \) ?
  • \(y = 0\), \(\hat{y} = 0.3\) → \(L = \) ?
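
A tiny helper to check your answers after working them out by hand (natural log, as everywhere above):

```python
import math

def bce_single(y, y_hat):
    # per-sample binary cross-entropy
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(bce_single(1, 0.7), 3))
print(round(bce_single(0, 0.3), 3))
```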

Exercise 2: Think about it

Why does the formula have two terms if only one "fires" at a time?

Exercise 3: Derive it

Find the gradient \(\frac{\partial L}{\partial \hat{y}}\) for the case when \(y=0\)