**“How do we estimate the performance of a machine learning model?”**

Author: Sebastian Raschka

“First, we feed the training data to our learning algorithm to learn a model. Second, we predict the labels of our test set. Third, we count the number of wrong predictions on the test dataset to compute the model’s error rate.”

Not so fast! Depending on our goal, estimating the performance of a model is unfortunately not that trivial. Maybe we should address the previous question from another angle: “Why do we care about performance estimates at all?” Ideally, the estimated performance of a model tells us how well it performs on unseen data. Making predictions on future data is often the main problem we want to solve in applications of machine learning or in the development of novel algorithms. Typically, though, machine learning involves a lot of experimentation, for example, tuning the internal knobs of a learning algorithm, the so-called hyperparameters. Running a learning algorithm over a training dataset with different hyperparameter settings will result in different models. Since we are typically interested in selecting the best-performing model from this set, we need a way to estimate their respective performances in order to rank them against each other. Going one step beyond mere algorithm fine-tuning, we usually do not experiment with only the single algorithm that we think would be the “best solution” under the given circumstances. More often than not, we want to compare different algorithms to each other, oftentimes in terms of both predictive and computational performance.
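To make the three-step recipe quoted above concrete, and to show how a single hyperparameter already yields several candidate models, here is a minimal sketch in plain Python. The synthetic one-dimensional data, the toy k-nearest-neighbor classifier, and all names (`make_data`, `knn_predict`, `error_rate`) are assumptions made for illustration only; they are not taken from any particular library.

```python
import random


def make_data(n, rng):
    """Synthetic 1-D, two-class data: class 0 centered at 0.0, class 1 at 2.0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label, 1.0)
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in neighbors)
    return 1 if 2 * votes_for_one > k else 0


def error_rate(train, test, k):
    """Fraction of test examples whose predicted label is wrong."""
    wrong = sum(1 for x, y in test if knn_predict(train, x, k) != y)
    return wrong / len(test)


rng = random.Random(0)        # fixed seed so the run is repeatable
train = make_data(200, rng)   # step 1: data the "model" is built from
test = make_data(100, rng)    # held-out data for steps 2 and 3

# Each value of the hyperparameter k gives a different model,
# and each model gets its own test error estimate.
errors = {k: error_rate(train, test, k) for k in (1, 5, 15)}
for k, err in errors.items():
    print(f"k={k:2d}  test error rate: {err:.2f}")
```

Each value of `k` produces its own error estimate, and these are exactly the numbers we would need to compare in order to rank the candidate models.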

Let us summarize the main reasons why we evaluate the predictive performance of a model:

1. We want to estimate the generalization error, which reflects the predictive performance of our model on future (unseen) data.

2. We want to increase the predictive performance by tweaking the learning algorithm and selecting the best-performing model from a given hypothesis space.

3. We want to identify the machine learning algorithm that is best suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best-performing model from the algorithm’s hypothesis space.

Although the three sub-tasks listed above all have in common that we want to estimate the performance of a model, each requires a different approach. We will discuss some of the different methods for tackling these sub-tasks in this article.

Of course, we want to estimate the future performance of a model as accurately as possible. However, if there’s one key take-away message from this article, it is that biased performance estimates are perfectly okay in model selection and algorithm selection if the bias affects all models equally. If we rank different models or algorithms against each other in order to select the best-performing one, we only need to know their “relative” performance. For example, if all our performance estimates are pessimistically biased, and we underestimate each of them by 10 percentage points, the ranking order is not affected. More concretely, if we have three models with prediction accuracy estimates such as

M2: 75% > M1: 70% > M3: 65%,

we would still rank them the same way if we subtract a pessimistic bias of 10 percentage points from each estimate:

M2: 65% > M1: 60% > M3: 55%.

In contrast, if we report the future performance of the best-ranked model (M2) to be 65%, this would be quite inaccurate, since it understates the unbiased estimate of 75% by 10 percentage points. Estimating the absolute performance of a model is probably one of the most challenging tasks in machine learning.
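The ranking argument above can be checked with a few lines of Python. This is only a sketch of the numerical example from the text: the accuracies are given in percentage points, and the helper name `rank` is hypothetical.

```python
# Accuracy estimates from the example above, in percentage points.
estimates = {"M1": 70, "M2": 75, "M3": 65}


def rank(scores):
    """Return model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


# A pessimistic bias that affects every model equally.
bias = 10
biased = {name: acc - bias for name, acc in estimates.items()}

print(rank(estimates))  # ranking based on the original estimates
print(rank(biased))     # ranking based on the biased estimates

# The ranking is unchanged, but the absolute value reported for the
# top-ranked model is off by exactly the bias.
best = rank(biased)[0]
print(best, estimates[best] - biased[best])
```

A constant shift cannot change the order of the scores, which is why the biased estimates remain usable for selection while being misleading as a statement about absolute performance.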