Building a successful machine learning model is no mean feat. The arduous model-building phase is followed by the equally rigorous work of maintaining the quality of the model's output. Once trained, a machine learning model cannot keep up with changing data dynamics on its own if it is not monitored and evaluated periodically.
There is a very fine line between training and retraining. To set the context: training involves building a machine learning model on an adequate amount of historical data, with independent features and a target variable for a supervised learning task.
But what is the right size of historical data to start the model training?
It depends not only on the complexity of the problem at hand but also on the complexity of the model, i.e., the number of model parameters that must be estimated to achieve a desirable level of model performance.
For example, the iris dataset contains only 150 records, yet it is objectively simple because the features are highly predictive of the target.
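As a quick illustration (a plain cross-validated logistic regression, chosen here purely for demonstration), iris is easy to model well despite its small size:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 150 records, 4 features, 3 classes - small, but the features separate
# the classes well enough that even a simple model scores highly.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Mean 5-fold accuracy: {scores.mean():.2f}")  # typically around 0.97
```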
Now that we understand data size and model training, let's see in what ways an ML pipeline can be updated and which parts of that fall under retraining:
Theoretically speaking, model retraining is all about re-running the pipeline as is on the new data. Ideally, no code or feature changes are introduced as part of re-running the pipeline. However, if you explore a new algorithm or a feature that was not previously part of training, you should incorporate it when deploying the retrained model.
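A minimal sketch of that idea, using a hypothetical scikit-learn pipeline and synthetic data as stand-ins for the real feature set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline():
    # Same pipeline definition as the original training run;
    # a plain retrain introduces no code or feature changes.
    return Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Stand-in for the refreshed dataset (historical data plus whatever
# has accumulated since deployment).
X_new, y_new = make_classification(n_samples=5000, n_features=10, random_state=42)

# Retraining as-is: refit the unchanged pipeline on the refreshed data.
retrained_model = build_pipeline().fit(X_new, y_new)
```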
How do you know when the right time to retrain is? One of the most widely adopted approaches is out-of-time analysis:
Assume the business has set an SLA of maintaining an F1 score of 80%, and the F1 score at the end of the deployment week is 84%. The model continues to make predictions over the following weeks, and, not surprisingly, the F1 score starts to decline: 83%, then 81%, and finally 78% with each passing week.
At week T+3, when the score drops below the 80% SLA, the pipeline should trigger a retraining signal so the model can learn from the data newly accumulated over the last few weeks.
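A minimal sketch of such a trigger, using the weekly scores from the example above and treating the 80% SLA as the threshold (the variable names and values are illustrative):

```python
# In practice each weekly score would come from sklearn.metrics.f1_score
# computed on the out-of-time labels collected after deployment.
SLA_F1 = 0.80  # minimum F1 agreed with the business

# Weekly out-of-time F1 scores observed after deployment (from the example above).
weekly_f1 = {"T": 0.84, "T+1": 0.83, "T+2": 0.81, "T+3": 0.78}

for week, score in weekly_f1.items():
    if score < SLA_F1:
        print(f"Week {week}: F1 = {score:.2f} breaches the {SLA_F1:.2f} SLA -> trigger retraining")
        break
```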
Now that you have assessed the need for model retraining, it's time to decide what data the retraining should be run on:
The next set of approaches is based on slicing the data in different ways to retrain on a subsample of it.
In a sliding-window approach, the N oldest days of data are removed for every N days of newly added data. For example, if we are retraining after 3 weeks, then the 3 oldest weeks of data should be removed. This handles the shortcoming of the previous approach when dealing with high-volume training data. The drawback is that the old days being removed might carry a good signal, or the new data might be noisy; either will reduce the predictive performance of the retrained model.
Depending on the scale at which data is generated per day, it may be possible to retrain the model only on the latest data, from timestamp 'T' onwards, and remove all older data up to time 'T-1'. This ensures the model picks up the signal from the latest trend but may lose out on long-term trends.
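Both slicing strategies can be sketched in a few lines of pandas; the column names, dates, and window size below are purely illustrative:

```python
import pandas as pd

# Illustrative training data: 120 days of daily records with a timestamp column.
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=120, freq="D"),
    "feature": range(120),
})

# Sliding window: keep a fixed 90-day window, so for every N days of newly
# arrived data the N oldest days drop out of the training set.
window_start = df["timestamp"].max() - pd.Timedelta(days=90)
window_df = df[df["timestamp"] > window_start]

# Latest data only: keep records from timestamp 'T' onwards and discard
# everything older, trading long-term trends for the most recent signal.
T = pd.Timestamp("2023-03-01")
latest_df = df[df["timestamp"] >= T]
```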
The above strategies are based on slicing the data in different ways for retraining the model. Besides these, there are two approaches that can use the entire data accumulated to date: additive learning and ensemble learning. Additive learning takes the last trained model as a base and updates its parameters with the newly available data. Ensemble learning combines the last trained model with a model trained on the newly available data.
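A rough sketch of both ideas with scikit-learn, using synthetic data and a simple probability-averaging ensemble as stand-ins (the article does not prescribe specific estimators):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Stand-ins for the original training data and the newly accumulated batch.
X_old, y_old = make_classification(n_samples=3000, n_features=10, random_state=0)
X_new, y_new = make_classification(n_samples=1000, n_features=10, random_state=0)

# Additive learning: take the last trained model as the base and update its
# parameters incrementally with the new batch instead of retraining from scratch.
base_model = SGDClassifier(loss="log_loss", random_state=0)
base_model.partial_fit(X_old, y_old, classes=np.unique(y_old))
base_model.partial_fit(X_new, y_new)  # weights are updated, not reset

# Ensemble learning: keep the old model, train a second model on the new batch,
# and average their predicted probabilities at scoring time.
old_model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
new_model = LogisticRegression(max_iter=1000).fit(X_new, y_new)

X_score = X_new[:5]  # any batch of records to be scored
ensembled_proba = (old_model.predict_proba(X_score) + new_model.predict_proba(X_score)) / 2
ensembled_pred = ensembled_proba.argmax(axis=1)
```

Note that additive learning is only available for estimators that support incremental updates (for example, those exposing partial_fit in scikit-learn); otherwise the ensemble route is the practical fallback.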
In this article, we discussed why ML models need retraining and when the right time to retrain is. We then covered multiple retraining strategies along with their pros and cons. No single strategy works across all datasets and use cases, so a careful evaluation of retraining strategies goes a long way toward the effective and successful operationalization of a machine learning model.