Building a successful machine learning model is no mean feat. The model-building phase is arduous, and what comes next demands equal rigor: maintaining the quality of the model's output. Once trained, a machine learning model cannot keep up with changing data dynamics on its own; it has to be monitored and evaluated periodically.
What is retraining?
The difference between training and retraining is subtle. To set the context, training involves building a machine learning model on an adequate amount of historical data, with independent features and, for a supervised learning task, the target variable.
But what is the right size of historical data to start the model training?
It depends not only on the complexity of the problem at hand but also on the complexity of the model, i.e., the number of model parameters that must be estimated to achieve a desirable level of model performance.
For example, the iris data set has only 150 records, yet that is enough because the problem is simple: the features are strongly predictive of the target.
Now that we understand the size of data and the model training, let’s see in what ways an ML pipeline can get updated and what part of it comes under retraining:
- New data is not restricted to just the addition of new records. It could include new attributes that are now made available with the passage of time.
- The trained model with a certain set of chosen attributes might not yield the expected model performance when deployed. During model maintenance, the data scientist finds new features through feature engineering or selects new features to improve the model. Does that come under the retraining paradigm?
- Searching the hyperparameter space again with the new data.
- Selecting the best new model by re-running the model selection pipeline.
Strictly speaking, model retraining means re-running the existing pipeline as is on the new data. Ideally, no code or feature changes are introduced as part of re-running the pipeline. However, if you explore a new algorithm or a feature that was not present in the original training, you should incorporate it when deploying the retrained model.
When is the right time to retrain?
How do you know when it is time to retrain? One of the most widely adopted approaches is out-of-time analysis, i.e. evaluating the model on data collected after the training period:
Suppose the business has set an SLA of maintaining an F1 score of at least 80%, and the F1 score at the end of the first week of deployment (week T) was 84%. The model continues to make predictions over the coming weeks and, not surprisingly, the F1 score declines with each passing week: to 83%, then 81%, and finally 78%.
At T+3 weeks, when the score falls below the SLA, the pipeline should trigger a retraining signal so that the model can learn from the data accumulated over the last few weeks.
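The trigger described above can be sketched in a few lines. This is a minimal illustration, not a production monitor: the function name `first_breach_week` is hypothetical, and it assumes the weekly F1 scores have already been computed on labelled production data.

```python
from typing import Optional, Sequence


def first_breach_week(weekly_f1: Sequence[float], sla: float = 0.80) -> Optional[int]:
    """Return the 0-based index of the first week whose F1 is below the SLA, else None."""
    for week, score in enumerate(weekly_f1):
        if score < sla:
            return week
    return None


# Scores from the example: deployment week T, then T+1, T+2, T+3.
weekly_f1 = [0.84, 0.83, 0.81, 0.78]

breach = first_breach_week(weekly_f1)
if breach is not None:
    print(f"Retraining signal at T+{breach} weeks (F1={weekly_f1[breach]:.2f})")
```

With the scores from the example, the signal fires at index 3, i.e. T+3 weeks. In practice a team might prefer to trigger slightly before the SLA is breached, but that threshold is a business choice the example does not specify.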
On what data should the retraining be run?
Now that you have established the need for model retraining, the next decision is which data to retrain on:
1. The more, the merrier
Continue appending the latest data to the existing database and retrain on all data accumulated thus far. This, however, rests on an overgeneralization: that all past data is still useful. The older data might no longer be relevant or representative of production data. Data is dynamic and could have changed drastically since the last time the model was trained. For example, products that were rarely viewed earlier may now count among the highly viewed products due to changes in trends or consumer behavior. Notably, machine learning rests on the critical assumption that train and test data come from the same population. Adding data whose distribution has shifted will only confuse the model as it draws statistical associations and may lead to an under-performing retrained model. Besides, the irrelevant data increases the training data volume and therefore the training time. In short, it is expensive to retrain models from scratch.
Source: Vidhi Chugh
The next set of approaches is based on slicing the data in different ways to retrain on a subsample of the data.
2. First in, first out
The N oldest days of data are removed for every N days of newly added data. For example, if we are retraining after 3 weeks, then the oldest 3 weeks of data are removed. This addresses the previous approach's problem of ever-growing training data volume, because the size of the training window stays constant. The drawback is that the removed days might have carried a good signal, or the new data might be noisy; either will reduce the predictive performance of the retrained model.
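A first-in, first-out window can be sketched as follows. The record layout (a list of `(day, row)` tuples), the function name `fifo_window`, and the dates are all assumptions made for illustration; the idea is only that the window slides forward by the number of newly added days.

```python
from datetime import date, timedelta


def fifo_window(records, window_start, window_end, new_days):
    """Slide the inclusive [window_start, window_end] training window forward by new_days.

    records is an iterable of (day, row) tuples.
    Returns (selected_records, new_start, new_end).
    """
    shift = timedelta(days=new_days)
    new_start, new_end = window_start + shift, window_end + shift
    selected = [(d, row) for d, row in records if new_start <= d <= new_end]
    return selected, new_start, new_end


# Hypothetical data: one record per day, 90 original days plus 21 new days.
first_day = date(2023, 1, 1)
records = [(first_day + timedelta(days=i), {"id": i}) for i in range(111)]

# The original window covered the first 90 days; retrain after 3 weeks (21 days).
selected, start, end = fifo_window(records, first_day, date(2023, 3, 31), new_days=21)
print(len(selected), start, end)
```

The window still holds 90 days of data after the shift; only its position in time has moved.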
3. Only latest data sample
Depending on the volume of data generated per day, it may be possible to retrain the model only on the latest data, from timestamp ‘T’ onwards, and discard all older data up to time ‘T-1’. This ensures the model picks up the signal from the latest trend, but it might miss long-term trends.
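As a small sketch of this strategy, assuming the same hypothetical `(day, row)` record layout and a cutoff date standing in for ‘T’:

```python
from datetime import date, timedelta


def latest_only(records, cutoff):
    """Keep records dated on or after the cutoff 'T'; drop everything up to T-1."""
    return [(d, row) for d, row in records if d >= cutoff]


# Hypothetical data: 60 daily records; keep only the most recent 14 days.
first_day = date(2023, 1, 1)
records = [(first_day + timedelta(days=i), {"id": i}) for i in range(60)]
cutoff = first_day + timedelta(days=46)  # 'T'

recent = latest_only(records, cutoff)
print(len(recent), recent[0][0])
```

How far back ‘T’ should sit is not fixed by the strategy itself; it depends on how much data per day is needed to train the model reliably.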
4. Random sample
We cannot always foresee how the data distribution will continue to change over the next few weeks or months. In that case, it is better not to train only on the latest data slices, but rather to monitor the rate at which the data distribution and patterns evolve. Taking a random sample from across those time slices then helps the retrained model generalize well to unseen data.
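One way to draw such a random sample is to take the same fraction from every time slice, so that no single period dominates the training set. This is a sketch under that assumption; the article does not prescribe how the sample should be weighted, and the helper name `sample_per_slice` and the weekly slicing are illustrative only.

```python
import random
from collections import defaultdict


def sample_per_slice(records, slice_of, fraction, seed=0):
    """Draw the same fraction of records, at random, from every time slice."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_slice = defaultdict(list)
    for rec in records:
        by_slice[slice_of(rec)].append(rec)

    sample = []
    for key in sorted(by_slice):
        rows = by_slice[key]
        k = max(1, round(fraction * len(rows)))  # at least one record per slice
        sample.extend(rng.sample(rows, k))
    return sample


# Hypothetical data: 4 weekly slices with 10 records each, as (week, record_id).
records = [(week, i) for week in range(4) for i in range(10)]
sample = sample_per_slice(records, slice_of=lambda rec: rec[0], fraction=0.3)
print(len(sample))
```

If monitoring shows that recent slices differ sharply from older ones, the per-slice fraction could be varied instead of held constant; that choice is left open here.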
The above strategies are based on slicing the data in different ways for retraining the model. There are also two approaches that can make use of all the data to date: additive learning and ensemble learning. Additive learning takes the last trained model as a base and updates its parameters with the newly available data. Ensemble learning combines the last trained model with a model trained on the newly available data.
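Both ideas can be illustrated with a tiny logistic regression written in plain Python. This is a toy sketch, not the author's implementation: the data is made up, the helper names (`sgd_fit`, `ensemble_proba`) are hypothetical, and simple probability averaging is only one of several ways to build an ensemble.

```python
import math


def predict_proba(weights, x):
    """Logistic-regression probability of the positive class; weights[0] is the bias."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_fit(rows, labels, weights=None, lr=0.1, epochs=50):
    """Fit by stochastic gradient descent.

    Passing existing weights continues training from them (additive learning)
    instead of starting from zeros.
    """
    w = list(weights) if weights is not None else [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = y - predict_proba(w, x)
            w[0] += lr * err
            for i, xi in enumerate(x):
                w[i + 1] += lr * err * xi
    return w


def ensemble_proba(models, x):
    """Average the positive-class probabilities of several weight vectors."""
    return sum(predict_proba(w, x) for w in models) / len(models)


# Made-up data: one feature, with the class boundary drifting upward over time.
old_rows, old_labels = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
new_rows, new_labels = [[2.0], [3.0], [4.0], [5.0]], [0, 0, 1, 1]

base = sgd_fit(old_rows, old_labels)                    # last trained model
additive = sgd_fit(new_rows, new_labels, weights=base)  # base updated with new data
fresh = sgd_fit(new_rows, new_labels)                   # trained on new data only

p_ensemble = ensemble_proba([base, fresh], [2.5])       # old model + new model
print(predict_proba(additive, [2.5]), p_ensemble)
```

In scikit-learn, estimators that expose a `partial_fit` method play the role of `sgd_fit(..., weights=base)` here: they update an already-fitted model with a new batch of data rather than refitting from scratch.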
In this article, we have discussed why ML models need retraining and when the right time to retrain is. We then looked at multiple retraining strategies, along with their pros and cons. No single strategy works across all datasets and use cases. Hence, careful evaluation of retraining strategies goes a long way toward the effective and successful operationalization of a machine learning model.