Top Data Science Breakthroughs in 2021

It’s the time of the year when we look back at some of the data science breakthroughs and accomplishments of the past 12 months. As in previous years, AI and machine learning techniques evolved at a rapid pace, and amazing research was published in 2021.

We asked the DSS community about some of the contributions in machine learning that caught their attention in 2021 and which they found particularly useful. Read about the top 5 advancements in this post!

1. Multimodal Learning with Turing Bletchley

In November 2021, Microsoft introduced a 2.5-billion parameter Universal Image Language Representation model (T-UILR). The model, named Turing Bletchley, can carry out image-language tasks in 94 languages! T-Bletchley was trained on billions of image-caption pairs. Data was encoded using a transformer-based image-text encoder (like the BERT-large model architecture). After separately encoding the images and captions, a contrastive loss was applied for model training.
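To make the training objective more concrete, here is a minimal, framework-free sketch of a contrastive loss over a batch of image and caption embeddings. It assumes a CLIP-style symmetric cross-entropy over cosine similarities; the post does not say which exact variant T-Bletchley uses. The function names and the temperature value are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true match."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image i, caption i).

    The temperature of 0.07 is an illustrative assumption, not a value
    taken from the T-Bletchley announcement.
    """
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)

    # similarity[i][j]: scaled cosine similarity between image i and caption j
    similarity = [
        [sum(a * b for a, b in zip(image, text)) / temperature for text in texts]
        for image in images
    ]

    # Each image should pick out its own caption, and each caption its own image.
    image_to_text = sum(cross_entropy(similarity[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([similarity[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

The loss is small when each image embedding sits closest to its own caption and large when the pairs are mismatched, which is what pushes the two separately encoded modalities into a shared space.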

T-Bletchley outperformed the state-of-the-art models on English image-language (ImageNet, CIFAR, COCO) and universal image language (Multi30k and COCO) data sets.

2. TensorFlow Updates (Latest 2.7)

Google’s TensorFlow received several updates throughout the year, with major updates announced at Google I/O 2021.

TensorFlow Lite and TensorFlow.js

Recently, companies and data scientists have focused on quick model deployment. With the release of two TensorFlow versions, TensorFlow Lite (smaller version) and TensorFlow.js (JavaScript version), aspects of model deployment have been simplified. TensorFlow Lite is designed for on-device machine learning and TensorFlow.js runs the model in the browser.

On-device Machine Learning

TensorFlow Lite supports on-device machine learning – running a machine learning model on your smartphone. Running the model on-device ensures the data does not leave the device and the model can be used offline. Google’s on-device machine learning developer page contains documentation to help you get started.

TensorFlow 2.7

In November 2021, TensorFlow 2.7 was released. The release focuses on TensorFlow’s user experience with an “improved debugging experience”: stack traces and error messages are now simpler, making it easier to identify and fix bugs in your code.

3. PyTorch Updates

There were many new features and versions of PyTorch introduced in 2021. Let’s look at two of the most exciting releases:

PyTorch Tabular

Are you looking for an easy and fast way to develop a deep learning model with pandas dataframes? Introduced in April 2021, PyTorch Tabular aims to fill in the gap for an “easy, ready-to-use” deep learning library that works with pandas dataframes directly. PyTorch and PyTorch Lightning were used to build the library. The library already consists of some state-of-the-art models – NODE and TabNet.

PyTorch ResNet50

A week after releasing TorchVision v0.11, which aimed to modernize the library, the PyTorch team improved the accuracy of a ResNet50 model by 4.5% on ImageNet. The architecture stayed the same; the improvement came from newer training techniques, including TrivialAugment (a “Tuning-free Yet State-of-the-Art Data Augmentation”), Exponential Moving Average, and weight decay.
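Of the techniques listed, Exponential Moving Average is the easiest to show in a few lines. The sketch below is plain Python rather than TorchVision’s actual implementation: it keeps a smoothed copy of the model weights alongside the weights being trained. The function names and the decay values are assumptions made for illustration.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return the averaged weights after one training step.

    A decay close to 1 makes the averaged copy change slowly; the default
    here is a hypothetical value, not the one used in the PyTorch recipe.
    """
    return [
        decay * averaged + (1.0 - decay) * current
        for averaged, current in zip(ema_weights, model_weights)
    ]


def track_ema(weight_history, decay=0.999):
    """Fold a sequence of per-step weight snapshots into one averaged copy."""
    ema = list(weight_history[0])  # start from the first snapshot
    for weights in weight_history[1:]:
        ema = ema_update(ema, weights, decay)
    return ema
```

In a typical setup the averaged copy, not the raw weights from the final step, is the one evaluated, since it smooths out step-to-step noise in training.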

4. Transformer Architecture

The transformer architecture started gaining attention in 2017 with the Attention Is All You Need paper. The deep learning model, which is based on the concept of self-attention, has received increased attention for its performance capabilities in computer vision. Vision Transformers (ViT) have set a new benchmark in computer vision with a top-1 accuracy of 90.45% on ImageNet.

However, Convolution and self-Attention Networks (CoAtNets), which were introduced in June 2021, set a new state-of-the-art standard with a slightly better accuracy of 90.88%. CoAtNets (pronounced as “coat” nets) provide the best of both worlds as they combine convolutions, which have a better inductive bias, with transformers, which have a larger model capacity.
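For readers new to the idea, the self-attention operation at the heart of these models can be sketched in a few lines of plain Python. This is a deliberately simplified version: it uses the token vectors directly as queries, keys, and values, whereas real transformers first apply learned projections and use several attention heads. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections.

    Each output vector is a weighted average of all token vectors, with
    weights based on how similar the tokens are to the current one.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs
```

In a vision transformer the tokens are image patches, so every patch can draw on information from every other patch in a single layer. A convolution, by contrast, only looks at a small local neighborhood.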

5. Self-Supervised Learning – Facebook AI's SEER

Consider how a newborn child learns to sit, walk, and talk. Babies learn how the world operates by observing and experimenting (trial and error). Self-supervised learning is based on the same concept. Facebook hypothesizes that intelligence in humans and animals is related to the common sense developed through this process of observation and experimentation. It refers to common sense as the “dark matter of artificial intelligence”.

In March 2021, Facebook announced SEER, a billion-parameter self-supervised computer vision model. The model was trained on one billion random public Instagram images and achieved a top-1 accuracy of 84.2% on ImageNet. Unlike the data used for most computer vision training today, these images were unlabelled and not curated.
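Since top-1 accuracy is the metric quoted for SEER, ViT, and CoAtNets in this post, here is a small sketch of how it is computed: a prediction counts as correct only if the class with the highest score matches the true label. The function name and the toy scores in the usage note are made up for illustration.

```python
def top1_accuracy(class_scores, labels):
    """Fraction of examples whose highest-scoring class equals the true label."""
    correct = 0
    for scores, label in zip(class_scores, labels):
        predicted = max(range(len(scores)), key=lambda index: scores[index])
        if predicted == label:
            correct += 1
    return correct / len(labels)
```

For example, calling top1_accuracy with the scores [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]] and the labels [1, 0, 0] counts two of the three predictions as correct.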

Bonus Resources

Besides the contributions from large enterprises, smaller companies also made some great contributions that help data scientists and ML engineers develop exciting solutions at a faster pace. Some of them include:

Spark NLP

Need a text processing library that supports Python, Java and Scala? Why not try the most used NLP library in the Enterprise? John Snow Labs’ open-source library, Spark NLP, supports the three languages and provides “production-grade, scalable and trainable versions of the latest research” in NLP.

Gradio – fastest way to demo your machine learning model

Demo your machine learning models with a simple, fast, and effective tool called Gradio. It helps create a quick demo for anyone to use, anywhere!

Which noteworthy ML breakthrough of 2021 is missing here? Discuss it with the community on the DSS Slack Workspace.