How cloud computing benefits data science

Written by Data Science Salon | Mar 8, 2022 1:34:49 PM

Cloud computing is essentially the backbone of data science. But what exactly makes this technology so critical for data scientists and the AI solutions they build?

Businesses are moving to the cloud 

According to the Flexera (formerly RightScale) 2021 State of the Cloud Report, cloud usage is rising, with 55% of enterprise workloads expected to be cloud-based within the next 12 months and almost 60% of companies planning to focus on cloud migration.

One of the key drivers of cloud transformation is cost efficiency and savings, which 79% of companies use to measure their cloud progress. For many workloads, cloud-based computing and storage are more cost-effective than on-prem infrastructure.

This is why cloud computing is so beneficial for data science, and it is one of the key reasons why AI and data science have taken hold in the corporate world.

The importance of cloud computing in data science

With gargantuan datasets to process and heavy models to train, data science can be considered one of the most resource-hungry aspects of modern business. At its core, data science means analyzing and processing large volumes of data, harvesting insights from them, and training AI models to automate workflows.

Cloud storage

As mentioned above, the core of data science is the use of big data, usually gathered by companies and scientific institutions. For example, the ImageNet dataset consists of more than 14 million images, each manually labeled with what it depicts. The dataset is widely used by organizations to train image recognition and object detection algorithms.

Data scientists who use these large-scale datasets need to store and share the data, access it remotely, and update it easily. Such datasets often exceed the capacity of local storage, and cloud storage provides an affordable alternative.
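
To get a feel for the scale, a rough back-of-the-envelope estimate can help. The sketch below takes the 14 million images mentioned above and assumes an average of 110 KB per image. That average is an illustrative assumption, not a figure from the dataset's documentation, and the helper name is hypothetical.

```python
def estimate_dataset_size_gb(num_images: int, avg_image_kb: float) -> float:
    """Return the approximate dataset size in gigabytes (1 GB = 1,000,000 KB)."""
    return num_images * avg_image_kb / 1_000_000


# 14 million images comes from the text above.
# 110 KB per image is an ASSUMED average, used only for illustration.
size_gb = estimate_dataset_size_gb(num_images=14_000_000, avg_image_kb=110)
print(f"Approximate size: {size_gb:,.0f} GB")
```

Under that assumption the images alone take up more than a terabyte, before counting copies, preprocessed versions, or model checkpoints.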

Cloud-based computing power

With the gargantuan datasets stored in the cloud, data scientists can start to train AI models, which is a computationally heavy process. According to researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), the progress of deep learning applications is strongly dependent on the availability of computing power.

Modern machine learning and deep learning models trained to perform sophisticated tasks can require enormous resources. AlphaStar, the DeepMind agent that has beaten human StarCraft II champion players, was built by combining five separate reinforcement-learning agents, trained for 14 days on 16 Tensor Processing Units each.

Access to this amount of processing power was possible due to cloud computing. Thanks to the scalability of the cloud, a company can rent the required machines in a pay-as-you-go model and release them as soon as they are no longer needed. Buying this power as on-prem infrastructure would be highly inefficient, because the machines would run at only a fraction of their available capacity most of the time.
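
The trade-off between renting and buying can be made concrete with a little arithmetic. The sketch below uses the AlphaStar figures quoted above (five agents, 16 TPUs each, 14 days). The rental price, purchase price, and hardware lifetime are hypothetical placeholders, not real vendor prices, and the function names are made up for this example.

```python
def device_days(agents: int, devices_per_agent: int, days: int) -> int:
    """Total accelerator usage, in device-days, for one training run."""
    return agents * devices_per_agent * days


def rental_cost(total_device_days: int, price_per_device_day: float) -> float:
    """Pay-as-you-go cost: you pay only for the device-days actually used."""
    return total_device_days * price_per_device_day


def break_even_utilization(purchase_price: float,
                           price_per_device_day: float,
                           lifetime_days: int) -> float:
    """Fraction of its lifetime a purchased device must be busy
    for buying to cost the same as renting."""
    return purchase_price / (price_per_device_day * lifetime_days)


# Figures from the text: 5 agents, 16 TPUs each, 14 days of training.
usage = device_days(agents=5, devices_per_agent=16, days=14)
print(f"Training run: {usage} device-days")

# HYPOTHETICAL prices and lifetime, chosen only to illustrate the calculation.
PRICE_PER_DEVICE_DAY = 100.0   # assumed rental price in USD
PURCHASE_PRICE = 30_000.0      # assumed purchase price per device in USD
LIFETIME_DAYS = 3 * 365        # assumed useful life of the hardware

print(f"Rental cost: ${rental_cost(usage, PRICE_PER_DEVICE_DAY):,.0f}")
share = break_even_utilization(PURCHASE_PRICE, PRICE_PER_DEVICE_DAY, LIFETIME_DAYS)
print(f"Buying pays off only above {share:.0%} utilization")
```

Whatever the real prices are, the structure of the calculation stays the same: a purchased machine has to stay busy for a large share of its lifetime to beat renting, which is rarely the case for occasional, bursty training runs.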

Shareability

Cloud providers know that, by combining large storage and computing power, they offer a great environment for data scientists. When using cloud-based tools, data science teams can share and collaborate on their projects more easily and efficiently than with on-prem infrastructure.

Apart from raw power and storage, cloud providers offer tools to facilitate and automate data science workflows. For example, Google released Colaboratory, which is basically a Jupyter notebook environment integrated with Google's cloud services. Colaboratory also benefits from the experience gathered during the development of Google Docs, giving coders the ability to collaborate and comment on each other's work.

ML as a service

With the growing need for AI-powered tools among companies all around the world, the largest tech companies decided to deliver their own toolsets that make it easy to launch a solution. Users are provided with ready-to-use building blocks that can enrich existing workflows and power new products.

Google Cloud comes with a set of pre-trained solutions that can be used in a nearly off-the-shelf manner. The client company can take a pre-trained model and implement it as one of the building blocks in its existing, Google Cloud-based business workflow.

Similar solutions are available from Amazon Web Services and from Microsoft, whose offering is known as AI Builder.

Summary

It is no surprise that data science is a computation- and storage-heavy domain, with gargantuan datasets and many hours of full-power computing. By combining nearly unlimited online storage at a reasonable price with computing power that can scale up and back down within seconds, cloud computing is a perfect match for modern data science.