Loss Functions Explained: How Machine Learning Models Know They're Wrong


Introduction

When beginners start learning machine learning, things seem easy at first. You follow a tutorial that asks you to load a dataset and train a model, and then you see something like this: loss = "mse" or criterion = nn.CrossEntropyLoss().

And just like that, the tutorial starts talking about equations, gradients, optimization, and Greek letters. If you have ever nodded along without really understanding what a loss function does, you are not alone. Loss functions are often explained backward. Most tutorials start with the formula when they should start with the idea. This article is part of a noob series aimed at making these concepts easier to understand.

What Is a Loss Function?

A loss function is how a machine learning model knows how wrong it is. That is literally the whole concept. The model makes a prediction. The loss function compares that prediction with the correct answer. Then it gives the model a number that says, “This is how bad your mistake was.”

A high loss means the model was very wrong.

A low loss means the model was close.

During training, the model keeps adjusting itself to make the loss smaller. That is how learning happens.

If you have played a dart game, the analogy is straightforward. You throw the dart, and to improve, you need feedback — whether your dart was slightly off, far away, too high, or too far left. Without that feedback, you cannot improve. The bullseye is the correct answer and the dart is the prediction. The loss function measures how far away the dart landed. That distance becomes the model’s feedback signal.

Visualization of dart analogy

The distance from the center matters: a dart that lands just outside the bullseye is not the same as one that misses the board entirely. Similarly, a model needs to know how badly it failed, not just that it was wrong, in order to improve.
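To make the analogy concrete, here is a minimal sketch that scores a throw by its straight-line distance from the bullseye. The function name dart_loss and the coordinates are made up for illustration; real loss functions compare predictions with labels rather than positions on a board.

```python
import math

def dart_loss(dart, bullseye=(0.0, 0.0)):
    # Straight-line distance between where the dart landed and the target.
    return math.hypot(dart[0] - bullseye[0], dart[1] - bullseye[1])

near_miss = dart_loss((3.0, 4.0))     # 5.0
wild_throw = dart_loss((30.0, 40.0))  # 50.0

print(near_miss, wild_throw)
```

Both throws missed, but the second score is ten times larger, which is exactly the kind of feedback a plain "hit or miss" signal would hide.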

Mean Squared Error

The most common loss function for predicting numbers is Mean Squared Error (MSE). It is used when the model is predicting continuous values like house prices, temperatures, or delivery times. The idea is straightforward:

Error: For each prediction, take the gap between the guess and the truth.
Squared: Multiply each gap by itself.
Mean: Average all those squared gaps.

def mean_squared_error(predictions, actuals):
    # Square each gap between prediction and truth, then average the results.
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)

Taking the errors and averaging over the predictions makes intuitive sense, but understanding why we square them requires a bit more explanation. Squaring serves two purposes:

Squaring makes every error positive. An error of +3 and an error of -3 are equally bad, and squaring turns both into 9, so they stop cancelling each other out.
Squaring punishes big mistakes far more harshly than small ones. This is useful in many scenarios. For example, when predicting house prices, being wrong by $200,000 is 200 times worse than being wrong by $1,000 in raw terms, but after squaring it counts 40,000 times as much.
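Both effects of squaring are easy to check with a few lines of arithmetic. This is just an illustrative sketch; the variable names are invented for the example.

```python
# Purpose 1: without squaring, opposite errors cancel out.
errors = [3, -3]
mean_raw = sum(errors) / len(errors)                      # 0.0, looks perfect
mean_squared = sum(e ** 2 for e in errors) / len(errors)  # 9.0, shows the real miss

# Purpose 2: squaring magnifies big mistakes.
small_miss = 1_000
big_miss = 200_000
ratio_of_errors = big_miss / small_miss             # 200.0
ratio_of_squares = big_miss ** 2 / small_miss ** 2  # 40000.0

print(mean_raw, mean_squared, ratio_of_errors, ratio_of_squares)
```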

Mean Absolute Error

Another common loss function is Mean Absolute Error (MAE). MAE also measures the gap between predictions and actual values, but it does not square the error. Instead, it simply takes the absolute value.

def mean_absolute_error(predictions, actuals):
    # Take the size of each gap (ignoring its sign), then average the results.
    absolute_errors = [abs(p - a) for p, a in zip(predictions, actuals)]
    return sum(absolute_errors) / len(absolute_errors)

MAE punishes large errors, but not as harshly as MSE does.

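An error of 10 costs 10 and an error of 20 costs 20, so the penalty grows in step with the error. Under MSE, the same two errors would cost 100 and 400.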
If your data naturally has some outliers and you do not want your model to overreact, MAE is a good choice.
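A quick way to see the difference is to score the same predictions with and without a single outlier. The numbers below are made up for illustration, and the two functions are repeated in compact form so the snippet runs on its own.

```python
def mean_squared_error(predictions, actuals):
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

def mean_absolute_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

# Four ordinary data points, each predicted with an error of 1.
actuals = [10, 12, 11, 13]
predictions = [11, 11, 12, 12]
clean_mae = mean_absolute_error(predictions, actuals)  # 1.0
clean_mse = mean_squared_error(predictions, actuals)   # 1.0

# Add one outlier (true value 50) that the model misses by 38.
actuals_with_outlier = actuals + [50]
predictions_with_outlier = predictions + [12]
outlier_mae = mean_absolute_error(predictions_with_outlier, actuals_with_outlier)
outlier_mse = mean_squared_error(predictions_with_outlier, actuals_with_outlier)

print(clean_mae, outlier_mae)  # MAE rises from 1.0 to about 8.4
print(clean_mse, outlier_mse)  # MSE rises from 1.0 to about 289.6
```

One unusual point pushes MAE up by a factor of about 8, but it pushes MSE up by a factor of almost 290. A model trained on MSE would work much harder to please that single outlier.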

The graph below compares the MSE and MAE curves:

Comparison of MSE and MAE curves

Cross-Entropy Loss

So far, we have talked about predicting numbers. But many machine learning problems are about predicting categories:

Is this email spam or not?
Is this a picture of a cat, dog, or fish?
Is a certain transaction fraudulent or not?

For classification tasks, models usually output probabilities like:

Dog: 70%
Cat: 20%
Fish: 10%

If the image really is a dog, that is a good prediction. But if it is a cat, the model gave the correct answer only a 20% chance, and it needs to be penalized for that.

The intuition is:

Correct and confident — low loss
Correct but unsure — medium loss
Wrong and confident — high loss

Cross-entropy loss curve

This is why cross-entropy is so widely used for classification. It does not just care about whether the model was right — it also cares about how confident the model was.
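For a single example with one correct class, cross-entropy boils down to the negative logarithm of the probability the model gave to the correct answer. Here is a minimal sketch using the dog, cat, and fish probabilities from above. It assumes the natural logarithm, and the function name cross_entropy is just a label for this example, not a library call.

```python
import math

def cross_entropy(predicted_probs, correct_label):
    # The loss is the negative log of the probability given to the correct class.
    return -math.log(predicted_probs[correct_label])

probs = {"dog": 0.70, "cat": 0.20, "fish": 0.10}

loss_if_dog = cross_entropy(probs, "dog")    # about 0.36
loss_if_cat = cross_entropy(probs, "cat")    # about 1.61
loss_if_fish = cross_entropy(probs, "fish")  # about 2.30

print(loss_if_dog, loss_if_cat, loss_if_fish)
```

The less probability the model placed on the true answer, the larger the loss, and the loss grows faster and faster as that probability approaches zero.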

Loss vs. Accuracy

Loss and accuracy are not the same thing.

Accuracy tells you how many predictions were correct.

Loss tells you how bad the model’s mistakes were.

Consider two models — Model A and Model B — that both get 90 out of 100 predictions correct. They will have the same accuracy. But one model may be very confident on the right answers and only slightly wrong on the incorrect ones, while the other may be barely correct on many examples and extremely confident when wrong. In that case, the accuracy would be the same, but the loss would be different.
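The sketch below plays out this scenario with made-up numbers. Each list holds the probability a model assigned to the correct class for 100 examples, a prediction counts as correct when that probability is above 0.5 (an assumption that fits a two-class problem), and the loss is the average negative log probability.

```python
import math

def accuracy(probs_for_correct_class):
    # A prediction counts as correct when the true class got more than 50%.
    hits = sum(1 for p in probs_for_correct_class if p > 0.5)
    return hits / len(probs_for_correct_class)

def mean_cross_entropy(probs_for_correct_class):
    return sum(-math.log(p) for p in probs_for_correct_class) / len(probs_for_correct_class)

# Model A: confident when right, only slightly off when wrong.
model_a = [0.95] * 90 + [0.45] * 10
# Model B: barely right on most examples, confidently wrong on the rest.
model_b = [0.55] * 90 + [0.02] * 10

acc_a, acc_b = accuracy(model_a), accuracy(model_b)                       # both 0.9
loss_a, loss_b = mean_cross_entropy(model_a), mean_cross_entropy(model_b)

print(acc_a, acc_b)
print(loss_a, loss_b)  # roughly 0.13 versus roughly 0.93
```

Accuracy cannot tell these two models apart, while the loss makes it obvious that Model B is in worse shape.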

The Training Loop

Once the model has a loss number, it can improve. The training loop looks like this:

The model makes predictions.
The loss function measures the mistakes.
The optimizer updates the model.
The model tries again.
The loss hopefully gets smaller.
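Here is what those five steps can look like in plain Python. This is a toy sketch, not how a real framework does it: the model is a single number w that predicts w * x, the loss is MSE, and the optimizer is basic gradient descent with a hand-written gradient. The data, the learning rate, and the number of steps are all chosen just for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # the hidden pattern is y = 2 * x

w = 0.0                    # the model starts out knowing nothing
learning_rate = 0.01
loss_history = []

for step in range(100):
    # 1. The model makes predictions.
    predictions = [w * x for x in xs]

    # 2. The loss function (MSE) measures the mistakes.
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)
    loss_history.append(loss)

    # 3. The optimizer updates the model: step against the slope of the loss.
    gradient = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
    w -= learning_rate * gradient
    # 4 and 5. The loop repeats, and the loss should shrink.

print(w)                                  # very close to 2.0
print(loss_history[0], loss_history[-1])  # starts at 30.0, ends near zero
```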

When training a model, we also plot the loss over time. In the beginning, the model makes many mistakes and the loss is high. As training progresses, the loss decreases and the model gets better at making predictions.

A healthy training curve typically follows this pattern:

High loss at the start → sharp drop → gradual flattening

Training loss curve

The flattening is normal — it means the model has learned the easy patterns and is now making smaller improvements. But if the training loss goes down while the validation loss starts going up, that is a warning sign of overfitting, which means the model may be memorizing the training data instead of learning patterns that generalize.
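One simple way to spot that warning sign is to look for the first epoch where validation loss starts rising while training loss keeps falling. The helper below is a hypothetical sketch, and the loss values are invented for illustration. The patience parameter (an assumption of this example) requires the pattern to hold for a couple of epochs in a row so that a single noisy bump is not flagged.

```python
def first_overfitting_epoch(train_losses, val_losses, patience=2):
    # Returns the index of the first epoch in a streak of `patience` epochs
    # where training loss fell but validation loss rose, or None if no such
    # streak exists.
    streak = 0
    for epoch in range(1, len(val_losses)):
        train_falling = train_losses[epoch] < train_losses[epoch - 1]
        val_rising = val_losses[epoch] > val_losses[epoch - 1]
        if train_falling and val_rising:
            streak += 1
            if streak >= patience:
                return epoch - patience + 1
        else:
            streak = 0
    return None

# Made-up loss curves: validation loss bottoms out at epoch 3, then climbs.
train_losses = [1.00, 0.60, 0.40, 0.30, 0.24, 0.20, 0.17]
val_losses = [1.05, 0.70, 0.52, 0.48, 0.50, 0.55, 0.61]

print(first_overfitting_epoch(train_losses, val_losses))  # 4
```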

Final Thoughts

A loss function is the model’s mistake score. It tells the model how wrong its predictions are, and it gives training a clear goal: make that number smaller.

Once you understand loss functions, many other machine learning ideas become easier to grasp — including gradient descent, backpropagation, optimization, overfitting, and evaluation metrics.

You do not need to start with scary equations. Start with the idea:

The model guesses.
The loss function scores the guess.
The model updates itself to reduce the score.

That is the heart of machine learning. Loss is how a model knows it is wrong, and training is how it learns to be less wrong.