A dogma of bias and variance

If you plan to use machine learning for a business problem on the assumption that its predictions are always correct, there is something you should know. Machine learning algorithms are probabilistic by nature and are never perfectly accurate. So what should we do when no ML model is perfect? We need a way to select the best one among many.

But what is the best model? It is the one that generalizes well to unseen data: it learns the regularities in the training data accurately without also learning its noisy or unrepresentative parts.

The process of selecting one final best machine learning model among a collection of candidate models for a given training dataset is called model selection.

This post explains a property that is central to selecting the best machine learning model for a given dataset, the bias and variance tradeoff, and how to strike a fine balance between the two.

The performance of every estimator is judged by the quality of its fit on new data, which is generally defined in terms of overall expected error. The lower the error, the better the model. The error is broadly split into three components: bias, variance, and noise.
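
For squared-error loss, this split is commonly written as the decomposition below. The notation is an assumption made here for illustration: f is the true function, f' is the fitted model, σ² is the variance of the noise, and the expectations are taken over different training sets (and over the noise on the left-hand side).

```latex
\mathbb{E}\big[(y - f'(X))^2\big]
  = \underbrace{\big(\mathbb{E}[f'(X)] - f(X)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(f'(X) - \mathbb{E}[f'(X)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Only the first two terms depend on the model; the last one is a property of the data.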

Let’s discuss noise first

Noise is also known as irreducible error and, as that name suggests, there is no way this error can be eliminated.

A supervised machine learning model approximates the association between the independent variables and the target variable. The model’s goal is to learn the best estimate of this functional mapping, called the target function.

Suppose that we want to predict the target variable y, given an input vector X = [X1, X2, X3, ..., Xn] and a normally distributed error ε, so that y = f(X) + ε.

But the predicted value can only be close to the ground truth y if all the predictors that drive the behavior of y are present in X. In practice, there will always be some attributes that have small predictive power and are left out of the training data.

Note the presence of noise in the true function: it is an aspect of the data that is difficult to model. The best way to reduce noise has less to do with mathematics and more to do with working with a better sample of data. Hence, no matter how good your model is, there is little you can do to remove this data-intrinsic error.
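
To make the idea of a noise floor concrete, here is a minimal simulation sketch. The target function (a sine curve), the noise level, and the sample size are all assumptions picked for illustration. It scores a "perfect" model, one that predicts with the true function itself, and shows that its mean squared error still sits at roughly the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical target function, chosen only for illustration
    return np.sin(2 * np.pi * x)

sigma = 0.3                                    # assumed noise standard deviation
x = rng.uniform(0, 1, 100_000)
y = true_f(x) + rng.normal(0, sigma, x.size)   # y = f(X) + noise

# Predict with the true function itself: bias and variance are both zero,
# so whatever error remains is the irreducible noise.
mse_of_true_function = float(np.mean((y - true_f(x)) ** 2))

print(f"noise variance sigma^2  : {sigma ** 2:.4f}")
print(f"MSE of the true function: {mse_of_true_function:.4f}")
```

No learned model can be expected to do better than this on fresh data drawn the same way.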

Overview of bias and variance

Let’s understand the errors that we can control, i.e., bias and variance.

Bias

Bias is the systematic deviation of the model’s predictions from the underlying true function. It stems from oversimplified assumptions made by the model, to the extent that the algorithm cannot learn the pertinent relations between the predictors and the target variable.

Because those assumptions are too simplistic for a complex problem, the model misses the structural associations, which increases the error. Such high bias models underfit the training data and need more attributes that better capture the association with the target variable.

Formally speaking, bias is the difference between the expected value of the estimator f’(X) and the quantity that we want to estimate, i.e., f(X):

Bias[f’(X)] = E[f’(X)] - f(X)

As the formulation above suggests, we strive to build a low bias model by learning an estimate for which this difference is as small as possible.

Linear regression models are the most common example of high bias models, as they expect underlying assumptions such as multivariate normality, homoscedasticity, and a linear relationship between the independent variables and the target to hold true. If you are interested in learning about the assumptions of linear regression in detail, refer to this link.

Low bias models make fewer assumptions about the target function, and are preferred when bias is the concern. Examples include Decision Tree, KNN, and SVM.

Variance

Variance arises when a highly complex model becomes too sensitive to small, noisy fluctuations in the training data. Such a model ends up learning patterns that exist only by chance and do not reflect a true property of the data. As these patterns are specific to one dataset and might not exist in the test data, the learned model overfits the training data. An overfitted model is specialized to a particular training set and would change if trained on a different one. Hence it has very low training error but yields poor predictive performance when confronted with unseen test data.

Formally speaking, variance is the expected squared deviation of the estimator f’(X) from its own expected value:

Var[f’(X)] = E[(f’(X) - E[f’(X)])^2]

Confusion with the expectation

The expectation in the above formulas means “averaged over different training sets”, not “averaged over all the records in the training set”. Each training set is sampled from the unknown true distribution and yields its own approximation of the target function.

Source: Sebastian Raschka
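
The "averaged over different training sets" reading can be sketched in code. The setup below is an assumption for illustration (a sine target, Gaussian noise, polynomial models fitted with NumPy). It draws many training sets, fits a fresh model on each, and then estimates squared bias and variance from the spread of those models' predictions at fixed test points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def true_f(x):
    # Hypothetical target function, chosen only for illustration
    return np.sin(2 * np.pi * x)

def sample_training_set(n=30, sigma=0.3):
    x = rng.uniform(0, 1, n)
    return x, true_f(x) + rng.normal(0, sigma, n)

def bias2_and_variance(degree, n_sets=300):
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x, y = sample_training_set()                # a new training set each time
        preds[i] = Polynomial.fit(x, y, degree)(x_test)
    mean_pred = preds.mean(axis=0)                  # estimate of E[f'(X)]
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))           # spread across training sets
    return float(bias2), float(variance)

results = {degree: bias2_and_variance(degree) for degree in (1, 3, 9)}
for degree, (b2, var) in results.items():
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

Note that the averages run over the rows of `preds`, one row per training set, and not over the records inside any single training set.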

Underfitting and overfitting

Underfitting means that the model could not pick up the patterns, which leads to poor performance even on the training data. An overfit model, in contrast, tries to fit every data point in the training set, as if learning it ‘by heart’, and fails to generalize to unseen data. Be it underfitting or overfitting, the model fails to return correct predictions on new data.
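
A quick way to see both failure modes is to compare training and test error as model complexity grows. The sketch below assumes a sine-shaped target with Gaussian noise and uses polynomial degree as a stand-in for complexity; the specific degrees are arbitrary choices for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n, sigma=0.3):
    # Hypothetical data-generating process: sine curve plus Gaussian noise
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(2000)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {errors[degree][0]:.3f}, "
          f"test MSE = {errors[degree][1]:.3f}")
```

Training error can only go down as the degree increases, so it is the gap between training and test error that signals overfitting, while high error on both signals underfitting.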

Bias and variance dilemma

The bias and variance tradeoff is a central property of statistical learning methods, where the aim is to minimize both of these sources of error at the same time.

The desirable trait is to build a low variance and low bias model that can accurately capture the regularities in its training data and at the same time generalize well on the unseen data.

Source: Illustration of bias and variance with model complexity

Building balance between bias and variance

Bias and variance are inversely related to each other, and there is no single best way to build a model that is low in both. An attempt to reduce bias tends to increase variance, and vice versa. Let’s check how we can strive to balance the two:

High Variance

Bring in more records: High variance models fall prey to random noise, mistaking it for signal. More training data gives the model more data points to learn from and generalize. The catch is that high bias models do not benefit from more data, as such models lack the essential attributes for the task at hand.
Dimensionality Reduction: Dropping features that carry noisy signals and obscure the actual signal leads to better generalization and brings down the variance.
Regularization: It reduces the variance by simplifying a complex model, albeit at the cost of increased bias. Careful selection of the regularization parameter ensures that the variance falls far enough for the overall expected error to fall too.
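
The effect of regularization can be sketched with ridge regression, used here as one assumed example of a regularizer. The code fits a degree-9 polynomial (in a Legendre basis) to many simulated training sets and measures squared bias and variance for a few penalty strengths. The data-generating process and the values of `lam` are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

rng = np.random.default_rng(7)
DEGREE = 9

def features(x):
    # Legendre polynomial features; x in [0, 1] is rescaled to [-1, 1]
    return legvander(2 * x - 1, DEGREE)

def ridge_fit(x, y, lam):
    # Ridge via an augmented least-squares problem:
    # minimizing ||Xw - y||^2 + lam * ||w||^2 (intercept penalized too, for brevity)
    X = features(x)
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

def bias2_and_variance(lam, n_sets=300, n=30, sigma=0.3):
    x_test = np.linspace(0.05, 0.95, 50)
    f_test = np.sin(2 * np.pi * x_test)          # hypothetical true function
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(0, 1, n)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)
        preds[i] = features(x_test) @ ridge_fit(x, y, lam)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias2), float(variance)

ridge_results = {lam: bias2_and_variance(lam) for lam in (0.0, 0.1, 10.0)}
for lam, (b2, var) in ridge_results.items():
    print(f"lam = {lam:5.1f}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

A heavy penalty buys lower variance by paying in bias, which is why the regularization strength is usually tuned rather than simply maximized.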

High bias

Bring in more attributes: Additional features decrease the bias but at the expense of an increase in variance.
Complex algorithms: The algorithm might make gross assumptions about the data and problem statement, leading to high bias and an underfit solution. If you encounter high bias, it’s time to move to a more complex algorithm.
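
As a small illustration of the missing-attribute case, the sketch below assumes a target that depends on two features and compares a linear model that sees only the first with one that sees both. The coefficients, noise level, and sample sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # Hypothetical process: the target depends on both attributes
    X = rng.normal(size=(n, 2))
    y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 0.5, n)
    return X, y

def fit_and_score(cols, X_tr, y_tr, X_te, y_te):
    # Ordinary least squares with an intercept, using only the chosen columns
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_te = np.column_stack([np.ones(len(X_te)), X_te[:, cols]])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_te @ w - y_te) ** 2))

X_train, y_train = make_data(500)
X_test, y_test = make_data(5000)

mse_missing = fit_and_score([0], X_train, y_train, X_test, y_test)
mse_full = fit_and_score([0, 1], X_train, y_train, X_test, y_test)

print(f"test MSE with one attribute  : {mse_missing:.3f}")
print(f"test MSE with both attributes: {mse_full:.3f}")
```

No amount of extra records would rescue the first model here, since the information it needs is simply not among its inputs.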

The dogma of the bias and variance trade-off can be handled wisely by aiming for the crossover point where the rate of increase in bias equals the rate of decrease in variance, since that is where the overall error is lowest.

Ensemble models and cross-validation are popular methods for striking the right balance between the two.

Remember that although bias and variance are important for understanding the predictive behavior of models, our primary objective is to build a model with low overall error.
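
Here is a minimal sketch of how cross-validation can guide that balance. It assumes a small simulated dataset and treats polynomial degree as the complexity knob. 5-fold cross-validation scores each candidate degree on held-out folds, and the degree with the lowest average validation error is picked.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)

# Hypothetical dataset: sine curve plus Gaussian noise
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Split the shuffled indices into 5 folds once, so every degree sees the same folds
folds = np.array_split(rng.permutation(x.size), 5)

def cv_mse(degree):
    fold_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        model = Polynomial.fit(x[train], y[train], degree)
        fold_errors.append(np.mean((model(x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(fold_errors))

cv_scores = {degree: cv_mse(degree) for degree in range(1, 11)}
best_degree = min(cv_scores, key=cv_scores.get)

for degree, score in cv_scores.items():
    print(f"degree {degree:2d}: CV MSE = {score:.3f}")
print("selected degree:", best_degree)
```

Because every score comes from data the model did not train on, degrees that are too low (high bias) and too high (high variance) both get penalized.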