Introducing AutoML

AutoML is a term that appears increasingly in tech industry articles and vendor product claims, and is also a hot topic within AI research in academia. Consider how nearly all of the public cloud vendors promote some form of AutoML service. The tech “unicorns” are developing AutoML services for their data platforms, many of which have been made open source. A flurry of smaller tech startups promise to “democratize” ML and relieve AI-related hiring pains for enterprise customers. Given all the buzz, what does “AutoML” mean?

Outline:

What do we need to automate?
Feature Engineering
Machine Learning Goals
AutoML: History as Search
Neural Architecture Search
Hyperparameter Optimization
Meta-Learning
Resources and Caveats

What do we need to automate?

What are the goals of AutoML? In other words, what is the “auto” in AutoML: what do we need to automate? AutoML refers to a series of processes, methods, and frameworks designed to make machine learning more efficient and, in some cases, to accelerate the progress of research across the data science community. Some AutoML projects focus on making specific tasks easier for data scientists. Others are sold as automating all of data science. We might not believe claims of the latter kind, but what we do know is that machine learning is difficult, and the skills gap in machine learning is well documented and still a great concern. Some of these problems aren’t economical to solve without automation because of those shortages and difficulties.

Feature Engineering

Let's step back and take a closer look at what happens inside a machine learning algorithm. To understand where AutoML fits, what are the data scientist's responsibilities for training a usable model? In the end, AutoML will help with the following:

Automated Feature Engineering
Hyperparameter Optimization
Neural Architecture Search
Meta-Learning

First, let’s consider ML in the case of supervised learning, leading toward ways to automate feature engineering. This represents the majority of ML use cases in industry currently, even for deep learning – although that may be changing. A supervised learning algorithm produces models that are trained to map input vectors to some set of output labels. The labels mean we know precisely what we’re trying to predict. But to get to the training stage, a data scientist has to have completed a laundry list of items for preparing the training data. That list includes cleaning the data, feature engineering within the data, then selecting appropriate ML algorithms to produce models suitable for the data.

Let’s consider an example to illustrate the steps in this process. Suppose we have real estate data organized as vectors:

<zip_code, property_acreage, population_density>

From that training data we want to produce an ML model to classify property listings in Northern California as either “urban” or “rural” properties. Let’s say we start with two input vectors:

<"94103", 0.04, 7.3e+03> – a swank condo in San Francisco labeled "urban"
<"95472", 3.51, 1.1e+01> – a small farm in Sonoma County labeled "rural"

At this point, it’d be easy to take some things for granted when first looking at this data set. We can presume the data scientist has already reviewed the data and cleaned it to fix missing values or likely errors. They’ve probably consulted with a domain expert (such as an urban economist or realtor) and represented the expert’s insights about nuances in the data. The data scientist almost certainly performed some feature engineering – which is the practice of transforming the existing data into alternative data representations. As an example, imagine that the original data provided both a city population and city size (in square miles). When compared with rural properties, some of the nuances about city real estate may become lost in rounding errors. Instead, calculating a ratio of the two values produces a population density estimate, which provides better information for the model to use. In other words, feature engineering is an art where the data scientist derives “features” as input for training the ML model which are more representative of the problem than the original data in its raw form.

These input vectors have three independent variables and it’s not difficult to differentiate between "urban" and "rural" for each variable. The zip code "94103" is in the South of Market district (SOMA) in downtown San Francisco, while "95472" features lots of vineyards, ranches, and orchards. A value of 0.04 acres is a moderately sized condo, versus a farm of more than 3 acres. Population densities of more than 7000 people per square kilometer represent some of the most concentrated urban centers in the US. Of course in practice an ML model may train on millions of input vectors and differentiate among thousands of labels, with lots of uncertain “gray areas” in between. This example has been intentionally over-simplified to illustrate the key point about differentiation within the data.

One area of focus for AutoML has been the potential to automate the feature engineering described above by devising systems that create and test combinations of data from the original data set. In practice, this idea is fairly complicated. Imagine the vast feature space possible when combining, manipulating, and testing each feature of your data together with one another in hopes of discovering powerful latent relationships. Even with a moderately small dataset of around a hundred features, the number of combinations grows dramatically.
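To get a feel for that growth, here is a minimal sketch that counts candidate features under a few assumed combination schemes: unordered pairs, ordered ratios, and unordered triples. The schemes and the function name are illustrative assumptions, not a description of how any particular AutoML tool enumerates features.

```python
from math import comb


def candidate_feature_counts(n_features: int) -> dict:
    """Count candidate derived features for a few simple combination schemes."""
    return {
        # Unordered pairs, e.g. products a * b.
        "pairs": comb(n_features, 2),
        # Ordered pairs, since the ratios a / b and b / a are different features.
        "ordered_ratios": n_features * (n_features - 1),
        # Unordered triples, e.g. products a * b * c.
        "triples": comb(n_features, 3),
    }


counts = candidate_feature_counts(100)
print(counts)
```

With 100 raw features this gives 4,950 pairs, 9,900 ordered ratios, and 161,700 triples, before any of them has been tested against the target.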

Let’s consider another real estate example to posit the potential value of automated feature engineering. We have a classifier above that labels whether homes are urban or rural. What if instead we were attempting to predict the value of those homes, and had their characteristics as input data – for example, pool sq. feet, bathrooms, bathroom sq. feet? A data scientist would spend much of their time comparing relationships between these features on the advice of an expert (say a realtor conveying their intuition that bigger bathrooms are important for price). It’s natural to expect, however, that there may be relationships yet undiscovered that automated feature engineering may bring to light – possibly the ratio of bathroom sq. feet to pool sq. feet is particularly important for predicting price. Helping data scientists find extra value in their data preparation work is a great value proposition when it comes to AutoML.
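As a rough sketch of that idea, the snippet below generates every pairwise ratio of the raw columns and ranks all candidates by absolute Pearson correlation with price. Everything here is an assumption made for illustration: the listings are invented, the prices are synthetic and deliberately constructed to depend on the bathroom-to-pool ratio, and correlation ranking is just one simple scoring choice among many.

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_ratio_features(columns, target):
    """Score raw columns and all pairwise ratios by |correlation| with target."""
    candidates = dict(columns)
    for a, b in permutations(columns, 2):
        candidates[f"{a}/{b}"] = [x / y for x, y in zip(columns[a], columns[b])]
    scores = {name: abs(pearson(vals, target)) for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical listings, invented for illustration only.
columns = {
    "pool_sqft": [400, 300, 500, 250, 600, 350, 450, 200],
    "bathrooms": [2, 3, 1, 2, 4, 3, 2, 1],
    "bathroom_sqft": [80, 150, 50, 200, 120, 70, 180, 60],
}
# Synthetic prices, constructed to depend on bathroom_sqft / pool_sqft.
price = [200_000 + 500_000 * b / p
         for b, p in zip(columns["bathroom_sqft"], columns["pool_sqft"])]

ranking = rank_ratio_features(columns, price)
best_name, best_score = ranking[0]
print(best_name, round(best_score, 3))
```

Because the synthetic prices were built from that ratio, it ranks first here. On real data the interesting cases are the ratios nobody thought to try.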

Machine Learning Goals

To train an ML model, an algorithm must optimize a set of learners, i.e., mathematical functions that learn to differentiate how the input variables map to the output labels. In general these functions get described in terms of a loss function – which measures how far predictions fall from the known labels, so that the model generalizes among similar input vectors – and a regularization term – which prevents “overfitting” by giving priority to simpler functions instead of overly complex ones that are biased toward the training data. Although various ML algorithms emerged from vastly different disciplines (control systems, biology, symbolic systems, electrical engineering, etc.), most supervised learning algorithms are some variation of this. The gist here is about mathematical optimization: input values can be plotted on a multidimensional gradient, then learners get optimized to define boundaries along that gradient.

A common method used to optimize learners – in other words, to minimize the loss function within the constraints of the regularization term – is gradient descent. Frameworks such as TensorFlow and PyTorch are built specifically to manage these optimization workloads, based on differentiable input data.
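To make that concrete, here is a minimal sketch of gradient descent on a one-variable linear model, using mean squared error as the loss and an L2 penalty on the weight as the regularization term. The function name, data, learning rate, and step count are all assumptions chosen for illustration; real frameworks compute these gradients automatically.

```python
def fit_line(xs, ys, learning_rate=0.05, l2_strength=0.0, steps=2000):
    """Fit y ~ w * x + b by gradient descent on MSE + l2_strength * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the loss plus the gradient of the regularization term.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n + 2 * l2_strength * w
        grad_b = 2 * sum(errors) / n
        # Step downhill along the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_plain, b_plain = fit_line(xs, ys)
w_ridge, b_ridge = fit_line(xs, ys, l2_strength=1.0)
print(w_plain, b_plain)
print(w_ridge, b_ridge)
```

Without regularization the fit recovers a slope near 2 and an intercept near 1. With the penalty switched on, the slope is pulled toward zero, which is the "priority to simpler functions" described above.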

Similar kinds of optimization are used in areas of ML other than supervised learning. The unsupervised learning algorithms don’t need labeled data, but they use gradients to differentiate data, e.g., into distinct clusters in the case of K-means clustering. Algorithms for reinforcement learning use simulations to construct gradients that describe their simulated environments, which are optimized as differentiable data. In each case of ML algorithms, the resulting models can be described as sets of parameters while the configurations used as starting points to optimize learners are called hyperparameters. The gradients describe the input data, the hyperparameters define where and how to start optimizing those gradients, and the parameters represent the results.

For example, a K-means clustering model used on N-dimension data may have just one hyperparameter – K for the desired number of resulting clusters – while it has K*N parameters, one for each mean value of each dimension in each cluster. Note that deep learning models can grow quite large: for example, the Megatron-LM model for natural language – one of the “transformer” models – has 8.3 billion parameters.
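The K versus K*N distinction can be sketched with a tiny K-means written from scratch. This is a simplified illustration under stated assumptions: the initial centroids are just the first K points, the iteration count is fixed, and the four 3-dimensional points are made up.

```python
def kmeans(points, k, iterations=10):
    """Tiny K-means: k is the hyperparameter, the centroids are the parameters."""
    # Assumption for simplicity: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - c) ** 2 for a, c in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Made-up 3-dimensional data (N = 3) with two obvious groups.
points = [(0.0, 0.1, 0.0), (5.0, 5.1, 4.9), (0.2, 0.0, 0.1), (5.1, 4.9, 5.0)]
centroids = kmeans(points, k=2)

n_params = sum(len(c) for c in centroids)
print(centroids)
print("parameters:", n_params)
```

One hyperparameter went in (K = 2) and K*N = 6 learned parameters came out, one mean per dimension per cluster.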

AutoML: History as Search

Looking at machine learning from a slightly different perspective, the challenge of building ML models is to find the appropriate hyperparameters for a given use case and set of input data. While that may appear to be the big idea at first glance, it’s not so simple. Realistically, a data science team responsible for building ML models must grapple with several challenging questions:

Which ML algorithms would fit well with the use case?
How much training data is needed, or even available?
Which features in the data are best to use for training models?
What hyperparameter settings produce the best results?
How to evaluate different models and compare their relative trade-offs?
How could the features be transformed prior to training to improve results?
Have any assumptions introduced potential risks for the use case?
Will the resulting models satisfy requirements for the production environment?

We’re talking about automation and artificial intelligence, but when taking those questions into consideration there are many human decisions that must go into the mix. Plus, one could add several more items to the list above, such as decisions and assumptions that went into data collection, or how closely the results of automation fit the needs of the use case.

Imagine if it were possible to look at all the possible variations of input data (all possible gradients) and compare those with all the possible variations of ML models (optimized parameters). Imagine if many perspectives and learnings from prior use cases could be accumulated and leveraged by others. Then could we find the best configurations – the optimal hyperparameters?

Taken from that perspective, some of the most challenging parts of machine learning can be formulated as search problems. In other words, find the optimal parameters for an ML model and you’ve identified a solution. Backing up a step, find the appropriate hyperparameters that converge to optimal parameters after training and you’ve identified the path toward a good solution. Identify the features and data transformations that produce optimal training sets, and you’ve enhanced the steps along that path.

Rolling the clock back a few decades, if you’d been hanging around AI seminars at leading computer science universities, you would have heard much discussion about A* and B* search algorithms: formulate hard problems as graphs, then use best-first search algorithms to find solutions within those graphs, augmented with heuristics to accelerate searching. All you’d need would be lots of knowledge represented in graphs, and lots of fast hardware to run the search algorithms – or so the story was told at the time. It’s perhaps no coincidence that two of the more successful commercial spin-offs from a leading AI school were search engines: Yahoo! and Google.

Of course in practice, that “search problem” for optimizing the hyperparameters in a large ML model may be quite expensive to solve. For example, the Megatron-LM model mentioned above required more than 9 days to optimize, while running on a rather large cluster of 512 high-end GPUs. That implies buying or renting lots of expensive hardware, plus significant amounts of energy required to power and cool it. Those costs are more than simply operating expenses: large energy requirements also imply large carbon footprints and subsequent policy concerns. See the recent paper “Energy and Policy Considerations for Deep Learning in NLP,” which describes how training the new transformer models for natural language can require up to 5 times the carbon footprint of the total lifetime of operating an automobile.
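A quick back-of-the-envelope calculation shows the scale. The sketch below converts the figures quoted above (9 days, 512 GPUs) into GPU-hours, treating 9 days as a lower bound. The hourly rate is purely a hypothetical placeholder for illustration, not a quoted price from any vendor.

```python
def gpu_hours(days: float, gpus: int) -> float:
    """Total GPU-hours for a cluster running continuously."""
    return days * 24 * gpus


hours = gpu_hours(days=9, gpus=512)

# Hypothetical price per GPU-hour, assumed only to show the arithmetic.
HYPOTHETICAL_RATE_USD = 2.50
estimated_cost = hours * HYPOTHETICAL_RATE_USD

print(hours)           # 110592
print(estimated_cost)  # 276480.0
```

That is more than 110,000 GPU-hours for a single training run, and a hyperparameter search multiplies it by the number of configurations tried.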

A brief aside about ML trends

As the rate of AI adoption in enterprise continues to accelerate, as we move more and more ML models into production throughout industry, what are the implications of training large ML models? On the one hand, is the extreme energy use justified by additional increases in predictive accuracy? On the other hand, some of these models require so many resources to train that research papers published about them can only be verified by a limited number of organizations – top tier universities and a handful of companies. Do those papers still represent scientific advances if their claims become increasingly difficult to verify?

Conversely, if the efficiency of training ML models can be improved in clever ways, we may be able to turn those trends around. We could reduce the associated carbon footprint dramatically and place the R&D back within reach of a broader number of schools and businesses.

That point about training ML models in clever ways – in other words, more intelligently – is where we begin to get AI helping to improve AI. That’s the promise of AutoML.

Neural Architecture Search

Deep learning – and its foundations in carefully designed neural networks – has become the center of much hype in the data science world. Even so, these kinds of models are fairly tricky to train. These sticking points of the data science process are natural places to seek means for automating some of the required steps – to help both data scientists and organizations building ML pipelines. Consider Andrej Karpathy’s list of pitfalls in his recipe for neural nets:

This is just a start when it comes to training neural nets. Everything could be correct syntactically, but the whole thing isn’t arranged properly, and it’s really hard to tell. The “possible error surface” is large, logical (as opposed to syntactic), and very tricky to unit test. For example, perhaps you... initialized your weights from a pretrained checkpoint but didn’t use the original mean. Or you just screwed up the settings for regularization strengths, learning rate, its decay rate, model size, etc. Therefore, your misconfigured neural net will throw exceptions only if you’re lucky; Most of the time it will train but silently work a bit worse.

Much of the complexity in the training process comes from the nuances of choosing a neural network architecture. There are many potential options: depth (layers), learning rates, etc., and often the end result of all that effort is merely a set of trained weights that we use in a single application. Granted, the resulting ML application may make useful predictions – say it classifies images of dogs – but the training time required to build that neural network is a sunk cost, and often a very large one, both in terms of human expertise and time spent running a powerful server cluster.

Consider how much useful information now lies locked deep in the design of that image classifying neural network. What if we could use the information that’s locked away to inform the design of the next neural network that we need? This idea – creating AutoML frameworks to guide the design of deep learning projects – is called neural architecture search or NAS. For an example, see the popular AutoKeras library in Python. NAS is the basis of several AutoML offerings from larger vendors, and the results tend to produce unique network architectures. Often the results are architectures that even AI experts wouldn’t have created on their own, but that produce models which succeed in their ultimate goal of accurately generalizing from the training data.

Hyperparameter Optimization

Consider the process of supervised learning described above. If we have training data and we have an algorithm defined by a loss function and regularization term, searching for the best hyperparameters to use is an optimization problem. It’s another good use of gradients. This is called hyperparameter optimization, sometimes abbreviated as HPO.
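One of the simplest ways to frame HPO as search is random search: sample hyperparameter settings, train with each, and keep whichever scores best on held-out data. The sketch below is an illustration under stated assumptions, not the method used by any library mentioned in this article. The tiny linear model, the search ranges, the trial count, and the toy data are all invented for the example.

```python
import random


def train(xs, ys, learning_rate, l2_strength, steps=200):
    """Gradient descent for y ~ w * x + b with an L2 penalty on w."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n + 2 * l2_strength * w
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(params, xs, ys):
    """Mean squared error of the fitted line on a dataset."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def random_search(train_set, valid_set, trials=40, seed=0):
    """Sample hyperparameters at random; keep the best validation score."""
    rng = random.Random(seed)
    best, history = None, []
    for _ in range(trials):
        # Assumed search ranges, sampled on a log scale.
        config = {
            "learning_rate": 10 ** rng.uniform(-3, -1),
            "l2_strength": 10 ** rng.uniform(-4, 0),
        }
        score = mse(train(*train_set, **config), *valid_set)
        history.append(score)
        if best is None or score < best["score"]:
            best = {"config": config, "score": score}
    return best, history


# Toy data, roughly y = 2x + 1 with a little noise in the training labels.
train_set = ([0.0, 1.0, 2.0, 3.0, 4.0], [1.1, 2.9, 5.2, 6.8, 9.1])
valid_set = ([0.5, 1.5, 2.5, 3.5], [2.0, 4.0, 6.0, 8.0])

best, history = random_search(train_set, valid_set)
print(best)
```

Each trial is a full training run, which is why HPO gets expensive quickly. Much of the work in this area is about choosing the next configuration more cleverly than at random, or stopping unpromising trials early.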

Much of the early emphasis in AutoML started with HPO. Open source projects such as hyperopt and spearmint are good examples for supervised learning, and serve as foundations for other AutoML libraries. AutoKeras is a bit more recent, and used for deep learning models. Even more recently, Ray Tune provides a scalable framework for HPO focused on deep learning and reinforcement learning.

Overall, check out this spreadsheet for a curated list of open source projects related to AutoML. One interesting research project listed there is PALEO, which evaluates multiple dimensions of optimization: scalability and performance of deep learning models. Its authors subsequently launched a firm called Determined.AI and have been exploring some of the business cases related to AutoML, such as how to terminate non-optimal models early. Two articles from their team stand out in particular:

“Stop doing iterative model development” by Yoav Zimmerman
“Reproducibility in ML: why it matters and how to achieve it” by Jennifer Villa and Yoav Zimmerman

These emphasize the point that while machine learning is not an entirely deterministic process, there are ways to make ML model workflows more reproducible and use automation to help optimize.

A key term to note here is optimization since most of these approaches attempt to optimize the training of ML models. It’s important to ask: What aspect of the ML models is being optimized? Are we concerned about optimizing the accuracy of the resulting ML models? If so, one can point to at least four different definitions for “accuracy” in machine learning, which sometimes represent conflicting goals:

precision: how many of the predicted positives are actually relevant
recall (also known as sensitivity): how many of the true positives are identified
specificity: how many of the true negatives are identified
perplexity: how well a probabilistic model predicts held-out data
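The measures listed above can be computed directly from predictions. The sketch below uses the standard confusion-matrix definitions for precision, recall, and specificity, plus perplexity as the exponential of the average negative log-probability a model assigned to the observed outcomes. The labels and predictions are made up for illustration.

```python
from math import exp, log


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def perplexity(probabilities):
    """exp of the mean negative log-probability assigned to observed outcomes."""
    return exp(-sum(log(p) for p in probabilities) / len(probabilities))


# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)    # 3 / 5
recall = tp / (tp + fn)       # 3 / 4
specificity = tn / (tn + fp)  # 4 / 6

print(precision, recall, specificity)
print(perplexity([0.5, 0.5]))
```

The same set of predictions scores 0.6 on precision, 0.75 on recall, and about 0.67 on specificity, so "the most accurate model" depends on which of these you are optimizing.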

Are we instead concerned about optimizing the confidence in the predicted results? If so, how well can the uncertainty in the results be estimated?

Are we interested in optimizing costs? Are those costs the amount of money spent training models? Or perhaps the amount of time spent training, before a new model can be deployed to customers? Or is it the cost of running the models in production? Optimizing for one of those dimensions may increase the costs in another dimension.

Optimization may instead be a matter of the resources required to run a model – for example, the memory size or power required for an embedded model or other edge computing use cases. Alternatively, your use case may require optimizing for entirely different aspects: fairness and bias, privacy (i.e., not leaking PII), minimizing attack surface and potential security risks, explainability, and other areas of compliance requirements.

When we use AutoML – when we defer important decisions to automated processes – we need to keep in mind what dimensions are getting optimized. One of the critiques about early efforts in AutoML is that some projects promised to “automate data science” when in reality they would only optimize for one aspect. Real world applications require a more sophisticated view of the world. That’s vital for robust systems and AI trust.

Across the end-to-end ML lifecycle

As mentioned earlier, many human decisions must go into the mix for developing ML models. The article by Yoav Zimmerman makes the case that “every decision you make during model development is a hyperparameter” – which emphasizes the point that AutoML must take into account much more than simply producing “the most accurate” models.

Let’s consider a broader scope: ML Ops is about the end-to-end lifecycle for managing ML in production. Where can automation and machine learning be applied throughout that end-to-end lifecycle?

The first stage of the ML lifecycle typically requires data preparation. It is often said that data science teams spend roughly 80% of their time doing just this. One very interesting open source project for this stage is HoloClean, with the end goal of producing trustworthy datasets:

HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean allows data practitioners and scientists to save the enormous time they spend in building piecemeal cleaning solutions, and instead, effectively communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.

See the ACM SIGMOD article “Data Cleaning is a Machine Learning Problem that needs Data Systems Help!” by Ihab Ilyas. HoloClean leverages weak-supervision principles (see Snorkel) to bring humans into the loop, leveraging and codifying human expertise about how to prepare the data. The humans involved range from data systems administrators to ML experts to domain experts. Many decisions go into data preparation, so this approach begins to represent those decisions and learn from them over time.
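To give a flavor of how human expertise gets codified under weak supervision, here is a simplified sketch that reuses the urban/rural records from earlier. Each "labeling function" encodes one rule of thumb and may abstain, and the votes are combined by simple majority. This is not the Snorkel or HoloClean API: those systems learn how much to trust each signal rather than counting votes, and the thresholds and the zip-prefix rule below are hypothetical.

```python
from collections import Counter

ABSTAIN = None


def lf_density(record):
    """Hypothetical rule: very dense areas are urban, very sparse ones rural."""
    if record["population_density"] > 1000:
        return "urban"
    if record["population_density"] < 100:
        return "rural"
    return ABSTAIN


def lf_acreage(record):
    """Hypothetical rule: large lots suggest rural, tiny lots suggest urban."""
    if record["property_acreage"] >= 1.0:
        return "rural"
    if record["property_acreage"] < 0.1:
        return "urban"
    return ABSTAIN


def lf_zip_prefix(record):
    """Hypothetical rule: treat zip codes starting with 941 as urban."""
    return "urban" if record["zip_code"].startswith("941") else ABSTAIN


def majority_label(record, labeling_functions):
    """Combine non-abstaining votes; abstain on no votes or on a tie."""
    votes = [lf(record) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


records = [
    {"zip_code": "94103", "property_acreage": 0.04, "population_density": 7.3e03},
    {"zip_code": "95472", "property_acreage": 3.51, "population_density": 1.1e01},
]
lfs = [lf_density, lf_acreage, lf_zip_prefix]
labels = [majority_label(r, lfs) for r in records]
print(labels)
```

The point of the design is that domain experts write small, noisy rules instead of labeling every row by hand, and the system works out how to reconcile them.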

The next typical stage in the ML lifecycle is feature engineering, i.e., constructing more useful gradients to use during ML model training. MLBox and automl-toolkit are a couple of examples of open source projects that help automate feature selection. See “Learning Feature Engineering for Classification” by Fatemeh Nargesian, et al.

In related work, see the paper “On the Stability of Feature Selection Algorithms” by Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. The authors present rigorous statistical methods for selecting features that provide more “stability” in ML training workflows: “An algorithm is ‘unstable’ if a small change in data leads to large changes in the chosen feature subset.” This begins to quantify reproducibility in ML workflows, which ultimately will benefit AutoML across the end-to-end lifecycle.
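The intuition behind stability can be sketched with a much simpler measure than the one the paper develops: run feature selection on several resamples of the data, then compute the average pairwise Jaccard similarity of the selected subsets. To be clear, mean pairwise Jaccard is a stand-in chosen here for brevity, not the estimator proposed by the authors, and the feature subsets below are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection over size of the union of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Mean pairwise Jaccard similarity across selected feature subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented outcomes of running a selector on three resamples of the data.
stable_runs = [
    {"population_density", "property_acreage"},
    {"population_density", "property_acreage"},
    {"population_density", "property_acreage"},
]
unstable_runs = [
    {"population_density", "property_acreage"},
    {"zip_code", "property_acreage"},
    {"population_density", "zip_code"},
]

stable_score = selection_stability(stable_runs)
unstable_score = selection_stability(unstable_runs)
print(stable_score, unstable_score)
```

A score of 1.0 means every run chose the same features. Lower scores mean small changes in the data are reshuffling the chosen subset.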

Generally speaking, there are multiple ML algorithms that can be applied for most use cases. HPO approaches are quite useful to optimize how to train ML models for specific algorithms; however, two broader questions involve how to select which of those algorithms fit best with the use case and how to compare models that have been trained by different algorithms. That’s typically a stage in the ML lifecycle that follows model training. Moreover, recall that one of the main takeaways from the Netflix Prize was about the power of using ensembles, i.e., combinations of multiple ML models. Some areas of AutoML focus on this – evaluating models and building ensembles. For example, see auto-sklearn.
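A minimal sketch of why ensembles help: three hypothetical models that each make a different mistake can be combined by majority vote into predictions that beat any one of them. The predictions are invented for illustration, and majority voting is only the simplest ensembling scheme, not necessarily what auto-sklearn or the Netflix Prize entries used.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model prediction lists into one list by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels and predictions; each model is wrong on a different sample.
truth = [1, 0, 1, 0]
model_a = [1, 0, 0, 0]
model_b = [1, 1, 1, 0]
model_c = [0, 0, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])
print([accuracy(truth, m) for m in (model_a, model_b, model_c)])
print(ensemble, accuracy(truth, ensemble))
```

The gain depends on the models making different errors. Three copies of the same model would vote the same way and add nothing.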

At this point in the end-to-end lifecycle, one has a trained ML model in hand, ready to deploy into production. One direction is to deploy models online, for example creating microservices. With that comes the need for auto-scaling to meet customer demands. That has been one of the strengths of Amazon SageMaker, although knative and other open source projects also address the challenges of auto-scaling. Another direction is to deploy into embedded use cases, such as smartphones, IoT devices, and other kinds of mobile or edge applications. For these it may be necessary to perform model compression, and in some cases ML models can be compressed by orders of magnitude for less memory, less power, etc. See “EIE: Efficient Inference Engine on Compressed Deep Neural Network” by Song Han, et al. The tinyML Summit is an excellent conference about this kind of work, now in its second year.
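One common family of compression techniques is magnitude pruning: set the smallest weights to zero so the model can be stored and run sparsely. The sketch below illustrates that idea only and is not a reproduction of the pipeline in the paper cited above. The weights and the 50% sparsity target are invented.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the weights closest to zero.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


# Invented weights for illustration.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
nonzero = sum(1 for w in pruned if w != 0.0)

print(pruned)
print(nonzero, "of", len(weights), "weights kept")
```

In practice pruning is usually followed by retraining to recover accuracy, and the pruning level itself becomes one more setting to search over.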

Meta-Learning

Assembling the AutoML pieces described above, several open source projects and commercial service offerings have begun to address the full end-to-end ML lifecycle. TPOT uses genetic programming (an evolutionary algorithm) to optimize entire ML pipelines. The lale project is another, focused on AutoML for scikit-learn, which adds consistency checks and interoperability. Both of these are examples of ways to semi-automate data science work, i.e., providing a “data science assistant” to augment people. Even so, there aren’t any projects available (to the authors’ knowledge) that integrate all of the AutoML techniques described above. That will come in time.

Meanwhile, the notion of meta-learning (or “learning to learn”) has been emerging to describe a more generalized approach. Definitions vary, but the gist is to develop a knowledge base from the history of AutoML approaches used, and learn from that history. This is quite an active area of research, and some helpful papers include:

“SmartML: A Meta Learning-Based Framework for Automated Selection and Hyperparameter Tuning for Machine Learning Algorithms”
Mohamed Maher, Sherif Sakr
(2019-03-26)
“Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”
Chelsea Finn, Pieter Abbeel, Sergey Levine
(2017-07-18)
“Learning to Optimize”
Ke Li, Jitendra Malik
(2016-06-06)
“Cross-disciplinary perspectives on meta-learning for algorithm selection”
Kate A. Smith-Miles
(2009-01)

Referring back to the earlier discussion about weak supervision, think of the knowledge collected in a typical production use case for ML: the history of workflow configurations used, bugs identified and tracked, code commit history, JIRA tickets, product decisions, user testing, etc. Generally speaking, those software engineering artifacts in a project have lots of metadata: provenance about who did what, when, and with what kinds of outcomes. That’s excellent input to use for weak supervision modeling, such as in Snorkel. Imagine “data mining” your team’s history of production ML projects, then building ML models to help guide future ML development.

Some product claims for AutoML services tend toward “AI will create ML apps for you.” A more realistic outlook is that AutoML will augment your data science team by data mining and modeling your institutional knowledge – and from the experiences of other organizations – such that when you start a new ML project you’re working from an established baseline.

Resources and Caveats

Automation for ML workflows may take other forms. For example, AutoPandas and TabNine provide program synthesis and autocompletion based on deep learning, which can accelerate coding and also help handle more advanced constructs. The latter is trained on ~2 million source code files from GitHub. While these aren’t building ML models directly, they may aid in the overall workflows, and they certainly fit within the meta-learning theme described above.

One caveat about motivations and consequences … Check out the first 5 minutes of this talk by Jeff Dean from Google: https://youtu.be/kSa3UObNS6o?t=1333

There are ways in which AutoML can lead to better ML models than what even the leading ML experts create manually. There are ways in which the economics provide a potential for AutoML to save costs for firms that cannot afford to hire their own ML experts. The trade-off is that this approach may lead to a ~100x increase in the computation required for ML workflows. On the one hand, with hardware evolving rapidly – especially new architectures for hardware customized for machine learning that provide orders of magnitude performance gains – perhaps we should plan for increasing computation needs today, with a net drop in computation costs over the next few years. On the other hand, perhaps computation needs will continue to grow faster than the rate of performance gains … which gets us back to dramatically increased carbon footprints, at least until better renewable energy resources become widespread. In either case, AutoML has a long road ahead, with much immediate potential.

If you’d like to keep up with the latest research in AutoML, check out the highly recommended Awesome-AutoML-Papers by Mark Lin, which tracks updates in the latest AutoML research and publications.

And as always, we'd love to hear your perspectives, questions, comments, suggestions.