Why R-Squared Misleads and How to Properly Evaluate Regression

The Limitations of R-Squared

R-squared (R²) measures the share of variance in the target that a model explains relative to a baseline that always predicts the mean. Its primary flaw is that, for an ordinary least squares fit scored on its own training data, it never decreases when a feature is added, even if that feature is random noise. This encourages "feature bloat," where a model appears to improve simply by using up more degrees of freedom. To counter this, Adjusted R-squared applies a complexity tax: it rescales the unexplained variance by (n − 1) / (n − p − 1), where n is the number of samples and p the number of features, so it rises only when a new feature improves the fit by more than chance alone would be expected to.
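The complexity tax can be illustrated with a few lines of arithmetic. The sketch below uses the standard Adjusted R² formula; the sample size, feature counts, and R² values are hypothetical numbers chosen for illustration, not results from a real model.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical scenario: 50 samples, and R^2 creeps up from 0.750 to 0.755
# after adding five features that carry no real signal.
before = adjusted_r2(0.750, n_samples=50, n_features=3)
after = adjusted_r2(0.755, n_samples=50, n_features=8)

print(f"adjusted R^2 with 3 features: {before:.4f}")
print(f"adjusted R^2 with 8 features: {after:.4f}")
```

In this made-up case plain R² goes up while Adjusted R² goes down, which is the signal that the extra features are not earning their keep.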

The Hierarchy of Error Metrics

Regression metrics provide different lenses into model performance. Understanding their relationships is critical for debugging:

MAE (Mean Absolute Error): Measures the average magnitude of errors in the target's original units. It is highly interpretable and less sensitive to outliers than the squared-error metrics.
MSE (Mean Squared Error): Penalizes large errors disproportionately by squaring them. It is the standard training loss for least squares regression because it is smooth and differentiable everywhere, which suits gradient-based optimization.
RMSE (Root Mean Squared Error): The square root of MSE, bringing the error back into the target variable's units. It acts as a "dramatic sibling" to MAE: RMSE is never smaller than MAE, and the gap widens as the errors become more uneven. If RMSE is much higher than MAE, a few large errors are dominating, so inspect the worst predictions for outliers.
Standard Error (SE): The residual standard error, a variant of RMSE that divides the squared error by the residual degrees of freedom (n − p − 1) instead of n. It gives a more honest estimate of prediction uncertainty, especially in smaller datasets, where RMSE computed on training data is optimistic.
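The relationships between these four metrics are easiest to see on a tiny example. The following is a minimal, dependency-free sketch; the function name `regression_errors` and the toy data are assumptions made for illustration, and the standard error here is the residual standard error with n − p − 1 degrees of freedom.

```python
import math


def regression_errors(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE and residual standard error as a dict."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)  # sum of squared errors
    dof = n - n_features - 1             # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more samples than n_features + 1")
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "mse": sse / n,
        "rmse": math.sqrt(sse / n),
        "se": math.sqrt(sse / dof),
    }


# Toy data (hypothetical): five predictions miss by 1, one misses by 10.
y_true = [10, 12, 14, 16, 18, 20]
y_pred = [11, 11, 15, 15, 19, 30]

metrics = regression_errors(y_true, y_pred, n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A single large miss is enough to pull RMSE well above MAE here, and dividing by the degrees of freedom rather than by n makes the standard error larger still on such a small sample.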

Diagnostic Workflow

To move from a "good-looking" model to a performant one, use these techniques:

Residual Analysis: Plot residuals against predictions. A healthy model shows a patternless band scattered around zero with roughly constant spread, and a histogram of the residuals should be approximately bell-shaped and centered at zero. Curvature or a funnel shape points to missed non-linearity or non-constant variance.
Feature Engineering: Use polynomial features to capture non-linear relationships, but monitor Adjusted R-squared, and ideally held-out error, to guard against overfitting.
Regularization: Use Ridge or Lasso regression to penalize complexity and improve out-of-sample performance.
Cross-Validation: Avoid relying on a single train-test split. Use k-fold cross-validation (5 or 10 folds are common choices) to check that metrics are stable and not the result of a lucky split.
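The cross-validation step can be sketched without any libraries. The example below is a simplified illustration, not a production recipe: it assumes a single-feature linear model fit by ordinary least squares, synthetic data, and hypothetical helper names (`fit_line`, `k_fold_rmse`). In practice a library routine such as scikit-learn's `cross_val_score` would normally handle the splitting and scoring.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return the held-out RMSE for each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        sq_errors = [(ys[i] - (intercept + slope * xs[i])) ** 2
                     for i in held_out]
        scores.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return scores


# Synthetic data (an assumption for illustration): y = 3 + 2x + noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]

fold_rmse = k_fold_rmse(xs, ys, k=5)
mean_rmse = sum(fold_rmse) / len(fold_rmse)
spread = max(fold_rmse) - min(fold_rmse)
print("RMSE per fold:", [round(s, 3) for s in fold_rmse])
print(f"mean RMSE: {mean_rmse:.3f}, spread across folds: {spread:.3f}")
```

Reporting the spread alongside the mean is the point of the exercise: a large spread suggests that any single split could have been lucky or unlucky.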

Metric Selection Guide

Comparing models with different feature counts: Use Adjusted R².
Communicating with stakeholders: Use MAE.
Training optimization: Use MSE.
Reporting uncertainty: Use Standard Error.
General goodness-of-fit: Use R², ideally measured on held-out data.

