As a data scientist, evaluating machine learning model performance is a crucial aspect of your work. To do so effectively, you have a wide range of statistical metrics at your disposal, each with its own unique strengths and weaknesses. By developing a solid understanding of these metrics, you are not only better equipped to choose the best one for optimizing your model but also to explain your choice and its implications to business stakeholders.
In this post, I focus on metrics used to evaluate regression problems involved in predicting a numeric value—be it the price of a house or a forecast for next month’s company sales. As regression analysis can be considered the foundation of data science, it is essential to understand the nuances.
Residuals are the building blocks of the majority of the metrics. In simple terms, a residual is a difference between the actual value and the predicted value.
residual = actual - prediction
Figure 1 presents the relationship between a target variable (y) and a single feature (x). The blue dots represent observations. The red line is the fit of a machine learning model, in this case, a linear regression. The orange line represents the difference between the observed value and the prediction for that observation. As you can see, the residuals are calculated for each observation in the sample, be it the training or test set.
This section discusses some of the most popular regression evaluation metrics that can help you assess the effectiveness of your model.
The simplest error measure would be the sum of residuals, sometimes referred to as bias. As the residuals can be both positive (prediction is smaller than the actual value) and negative (prediction is larger than the actual value), bias generally tells you whether your predictions were higher or lower than the actuals.
However, as the residuals of opposing signs offset each other, you can obtain a model that generates predictions with a very low bias, while not being accurate at all.
Alternatively, you can calculate the average residual, or mean bias error (MBE).
The next metric is likely the first one you encounter while learning about regression models, especially if that is during statistics or econometrics classes. R-squared (R²), also known as the coefficient of determination, represents the proportion of variance explained by a model. To be more precise, R² corresponds to the degree to which the variance in the dependent variable (the target) can be explained by the independent variables (features).
The following formula is used to calculate R²:
Effectively, you are comparing the fit of a model (represented by the red line in Figure 2) to that of a simple mean model (represented by the green line).
Knowing what the components of R² stand for, you can see that represents the fraction of the total variance in the target that your model was not able to explain.
Here are some additional points to keep in mind when working with R².
R² is a relative metric; that is, it can be used to compare with other models trained on the same dataset. A higher value indicates a better fit.
R² can also be used to get a rough estimate of how the model performs in general. However, be careful when using R² for such evaluations:
A potential drawback of R² is that it assumes that every feature helps in explaining the variation in the target, when that might not always be the case. For this reason, if you continue adding features to a linear model estimated using ordinary least squares (OLS), the value of R² might increase or remain unchanged, but it never decreases.
Why? By design, OLS estimation minimizes the RSS. Suppose a model with an additional feature does not improve the value of R² of the first model. In that case, the OLS estimation technique sets that feature’s coefficients to zero (or some statistically insignificant value). In turn, this effectively brings you back to the initial model. In the worst-case scenario, you can get the score of your starting point.
A solution to the problem mentioned in the previous point is the adjusted R², which additionally penalizes adding features that are not useful for predicting the target. The value of the adjusted R² decreases if the increase in the R² caused by adding new features is not significant enough.
If a linear model is fitted using OLS, the range of R² is 0 to 1. That is because when using the OLS estimation (which minimizes the RSS), the general property is that RSS ≤ TSS. In the worst-case scenario, OLS estimation would result in obtaining the mean model. In that case, RSS would be equal to TSS and result in the minimum value of R² being 0. On the other hand, the best case would be RSS = 0 and R² = 1.
In the case of non-linear models, it is possible that R² is negative. As the model fitting procedure of such models is not based on iteratively minimizing the RSS, the fitted model could have an RSS greater than the TSS. In other words, the model’s predictions fit the data worse than the simple mean model. For more, information, see When is R squared negative?
Bonus: Using R², you can evaluate how much better your model fits the data as compared to the simple mean model. Think of a positive R² value in terms of improving the performance of a baseline model—something along the lines of a skill score. For example, R² of 40% indicates that your model has reduced the mean squared error by 40% compared to the baseline, which is the mean model.
Mean squared error (MSE) is one of the most popular evaluation metrics. As shown in the following formula, MSE is closely related to the residual sum of squares. The difference is that you are now interested in the average error instead of the total error.
Here are some points to take into account when working with MSE:
Root mean squared error (RMSE) is closely related to MSE, as it is simply the square root of the latter. Take the square to bring the metric back to the scale of the target variable, so it is easier to interpret and understand. However, take caution: one fact that is often overlooked is that although RMSE is on the same scale as the target, an RMSE of 10 does not actually mean you are off by 10 units on average.
Other than the scale, RMSE has the same properties as MSE. As a matter of fact, optimizing for RMSE while training a model will result in the same model as that obtained while optimizing for MSE.
The formula to calculate mean absolute error (MAE) is similar to the MSE formula. Replace the square with the absolute value.
Characteristics of MAE include the following:
Mean absolute percentage error (MAPE) is one of the most popular metrics on the business side. That is because it is expressed as a percentage, which makes it much easier to understand and interpret.
To make the metric even easier to read, multiply it by 100% to express the number as a percentage.
Points to consider:
While discussing MAPE, I mentioned that one of its potential drawbacks is its asymmetry (not limiting the predictions that are higher than the actuals). Symmetric mean absolute percentage error (sMAPE) is a related metric that attempts to fix that issue.
Points to consider when using sMAPE:
I have not described all the possible regression evaluation metrics, as there are dozens (if not hundreds). Here are a few more metrics to consider while evaluating models:
As with the majority of data science problems, there is no single best metric for evaluating the performance of a regression model. The metric chosen for a use case will depend on the data used to train the model, the business case you are trying to help, and so on. For this reason, you might often use a single metric for the training of a model (the metric optimized for), but when reporting to stakeholders, data scientists often present a selection of metrics.
While choosing the metrics, consider a few of the following questions:
I believe it is useful to explore the metrics on some toy cases to fully understand their nuances. While most of the metrics are available in the metrics module of scikit-learn , for this particular task, the good old spreadsheets might be a more suitable tool.
The following example contains five observations. Table 1 shows the actual values, predictions, and some metrics used to calculate most of the considered metrics.
y | y_hat | residual | squared error | abs error | abs perc error | sAPE |
100 | 120 | -20 | 400 | 20 | 20.00% | 18.18% |
100 | 80 | 20 | 400 | 20 | 20.00% | 22.22% |
80 | 100 | -20 | 400 | 20 | 25.00% | 22.22% |
200 | 240 | -40 | 1,600 | 40 | 20.00% | 18.18% |
40 | 5 | 35 | 1225 | 35 | 87.50% | 155.56% |
MSE | 805 |
RMSE | 28.37 |
MAE | 27 |
MAPE | 34.50% |
sMAPE | 47.27% |
The first three rows contain scenarios in which the absolute difference between the actual value and the prediction is 20. The first two rows show an overforecast and underforecast of 20, with the same actual. The third row shows an overforecast of 20, but with a smaller actual. In those rows, it is easy to observe the particularities of MAPE and sMAPE.
The fifth row in Table 1 contained a prediction 8x smaller than the actual value. For the sake of experimentation, replace that prediction with one that is 8x higher than the actual. Table 3 contains the revised observations.
y | y_hat | residual | squared error | abs error | abs perc error | sAPE |
100 | 120 | -20 | 400 | 20 | 20.00% | 18.18% |
100 | 80 | 20 | 400 | 20 | 20.00% | 22.22% |
80 | 100 | -20 | 400 | 20 | 25.00% | 22.22% |
200 | 240 | -40 | 1,600 | 40 | 20.00% | 18.18% |
40 | 320 | -280 | 78,400 | 280 | 700.00% | 155.56% |
MSE | 16240 |
RMSE | 127.44 |
MAE | 76 |
MAPE | 157.00% |
sMAPE | 47.27% |
Basically, all metrics exploded in size, which is intuitively consistent. That is not the case for sMAPE, which stayed the same between both cases.
I highly encourage you to play around with such toy examples to more fully understand how different kinds of scenarios impact the evaluation metrics. This experimentation should make you more comfortable making a decision on which metric to optimize for and the consequences of such a choice. These exercises might also help you explain your choices to stakeholders.
In this post, I covered some of the most popular regression evaluation metrics. As explained, each one comes with its own set of advantages and disadvantages. And it is up to the data scientist to understand those and make a choice about which one (or more) is suitable for a particular use case. The metrics mentioned can also be applied to pure regression tasks—such as predicting salary based on a selection of features connected to experience—but also to the domain of time series forecasting.
If you are looking to build robust linear regression models that can handle outliers effectively, see my previous post, Dealing with Outliers Using Three Robust Linear Regression Models.
Jadon, A., Patil, A., & Jadon, S. (2022). A Comprehensive Survey of Regression Based Loss Functions for Time Series Forecasting. arXiv preprint arXiv:2211.02989.
Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting, 4(4), 43-46.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International journal of forecasting, 22(4), 679-688.
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3.