In the past few years, you might have noticed the increasing pace at which vendors are rolling out “platforms” that serve the AI ecosystem, namely addressing data science and machine learning (ML) needs. The “Data Science Platform” and “Machine Learning Platform” are at the front lines of the battle for the mind share and wallets of data scientists, ML project managers, and others who manage AI projects and initiatives. If you’re a major technology vendor and you don’t have some sort of big play in the AI space, then you risk rapidly becoming irrelevant. But what exactly are these platforms and why is there such an intense market share grab going on?
The core of the answer is the realization that ML and data science projects are nothing like typical application or hardware development projects. Whereas hardware and software development has traditionally focused on the functionality of systems or applications, data science and ML projects are really about managing data, continuously evolving what is learned from that data, and evolving data models through constant iteration. Typical development processes and platforms simply don’t work from a data-centric perspective.
It should be no surprise then that technology vendors of all sizes are focused on developing platforms that data scientists and ML project managers will depend on to develop, run, operate, and manage their ongoing data models for the enterprise. To these vendors, the ML platform of the future is like the operating system or cloud environment or mobile development platform of the past and present. If you can dominate market share for data science / ML platforms, you will reap rewards for decades to come. As a result, everyone with a dog in this fight is fighting to own a piece of this market.
But what does a Machine Learning platform look like? How is it the same as, or different from, a Data Science platform? What are the core requirements for ML platforms, and how do they differ from those of more general data science platforms? Who are the users of these platforms, and what do they really want? Let’s dive deeper.
What is the Data Science Platform?
Data scientists are tasked with wrangling useful information from a sea of data and translating business and operational informational needs into the language of data and math. Data scientists need to be masters of statistics, probability, mathematics, and algorithms that help to glean useful insights from huge piles of information. A data scientist creates data hypotheses, runs tests and analyses on the data, and then translates the results so that someone else in the organization can easily view and understand them. So it follows that a pure data science platform would help with crafting data models, determining the best fit of information to a hypothesis, testing that hypothesis, facilitating collaboration among teams of data scientists, and managing and evolving the data model as information continues to change.
Furthermore, data scientists don’t focus their work in code-centric Integrated Development Environments (IDEs), but rather in notebooks. First popularized by academically oriented, math-centric platforms like Mathematica and MATLAB, but now prominent in the Python, R, and SAS communities, notebooks are used to document data research and simplify reproducibility of results by allowing the notebook to run on different source data. The best notebooks are shared, collaborative environments where groups of data scientists can work together and iterate models over constantly evolving data sets. While notebooks don’t make great environments for developing code, they make great environments to collaborate, explore, and visualize data. Indeed, the best notebooks are used by data scientists to quickly explore large data sets, assuming sufficient access to clean data.
However, data scientists can’t perform their jobs effectively without access to large volumes of clean data. Extracting, cleaning, and moving data is not really the role of a data scientist, but rather that of a data engineer. Data engineers are tasked with taking data from a wide range of systems, in structured and unstructured formats, that is usually not “clean”: it has missing fields, mismatched data types, and other data-related issues. In this way, a data engineer is an engineer who designs, builds, and arranges data. Good data science platforms also enable data scientists to easily leverage compute power as their needs grow. Instead of copying data sets to a local computer to work on them, platforms allow data scientists to easily access compute power and data sets with minimal hassle. A data science platform therefore needs to provide these data engineering capabilities as well. As such, a practical data science platform will combine data science capabilities with the necessary data engineering functionality.
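To make the “not clean” problem concrete, here is a minimal Python sketch of the kind of cleanup a data engineer might automate. Everything in it is an assumption made for illustration: the record layout, the field names (customer_id, age, spend), and the helper functions are hypothetical, and a real pipeline would typically rely on dedicated data tooling rather than hand-written helpers.

```python
from typing import Optional

# Hypothetical raw records, as they might arrive from different source systems.
RAW_RECORDS = [
    {"customer_id": "101", "age": "34", "spend": "250.50"},
    {"customer_id": "102", "age": "", "spend": "99"},          # empty value
    {"customer_id": "103", "age": "forty", "spend": "12.00"},  # mismatched data type
    {"customer_id": "104", "spend": "75.25"},                  # missing field
]


def to_int(value: Optional[str]) -> Optional[int]:
    """Return the value as an int, or None if it is missing or not a whole number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_float(value: Optional[str]) -> Optional[float]:
    """Return the value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(record: dict) -> dict:
    """Coerce every field to its expected type; unusable values become None."""
    return {
        "customer_id": to_int(record.get("customer_id")),
        "age": to_int(record.get("age")),
        "spend": to_float(record.get("spend")),
    }


# Every record is kept, with bad values marked as None.
cleaned = [clean(r) for r in RAW_RECORDS]

# Only the records with no missing values after cleaning.
complete = [r for r in cleaned if None not in r.values()]
```

In this sketch, bad values are marked as None rather than dropped immediately, so a data scientist can still decide later whether to discard or impute the incomplete records.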
What is the Machine Learning Platform?
We just spent several paragraphs talking about data science platforms and not even once mentioned AI or ML. Of course, the two overlap: data science techniques and machine learning algorithms are applied to large sets of data to develop machine learning models. The tools that data scientists use on a daily basis have significant overlap with the tools used by ML-focused scientists and engineers. However, these tools aren’t the same, because the needs of ML scientists and engineers are not the same as those of more general data scientists and engineers.
Rather than just focusing on notebooks and the ecosystem to manage and work collaboratively with others on those notebooks, those tasked with managing ML projects need access to the range of ML-specific algorithms, libraries, and infrastructure to train those algorithms over large and evolving datasets. An ideal ML platform helps ML engineers, data scientists, and engineers discover which machine learning approaches work best, tune hyperparameters, and deploy compute-intensive ML training across on-premises or cloud-based CPU, GPU, and/or TPU clusters, and it provides an ecosystem for managing and monitoring both unsupervised and supervised modes of training.
Clearly a collaborative, interactive, visual system for developing and managing ML models in a data science platform is necessary, but it’s not sufficient for an ML platform. As hinted above, one of the more challenging parts of making ML systems work is the setting and tuning of hyperparameters. The whole concept of a machine learning model is that it has parameters that are learned from the data. Basically, what machine learning actually learns are those parameters, and new data is then fit to the learned model. Hyperparameters, by contrast, are configurable values that are set prior to training an ML model and can’t be learned from the data. These hyperparameters control factors such as model complexity, speed of learning, and more. Different ML algorithms require different hyperparameters, and some don’t need any at all. ML platforms help with the discovery, setting, and management of hyperparameters, along with other things, such as algorithm selection and comparison, that non-ML-specific data science platforms don’t provide.
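The distinction between parameters and hyperparameters can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not how any particular ML platform works: the data set, the train and mse functions, and the candidate learning rates are all hypothetical. The weights w and b are parameters learned from the data, while learning_rate and epochs are hyperparameters chosen before training. The loop at the end is a very small grid search over one hyperparameter.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse(w: float, b: float) -> float:
    """Mean squared error of the line y = w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)


def train(learning_rate: float, epochs: int) -> tuple:
    """Fit y = w*x + b by plain gradient descent.

    w and b are parameters: they are learned from the data.
    learning_rate and epochs are hyperparameters: they are set before training.
    """
    w, b = 0.0, 0.0
    n = len(XS)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(XS, YS)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(XS, YS)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Tiny grid search: train once per candidate learning rate and record the error.
results = {}
for lr in (0.001, 0.01, 0.05):
    w, b = train(learning_rate=lr, epochs=500)
    results[lr] = mse(w, b)

# The candidate with the lowest error on the toy data.
best_lr = min(results, key=results.get)
```

For brevity, this sketch scores each candidate on the same data it was trained on. A real tuning workflow would score candidates on held-out validation data, and that bookkeeping is part of what an ML platform is meant to manage.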
The different needs of big data, ML engineering, model management, and operationalization
At the end of the day, ML project managers simply want tools to make their jobs more efficient and effective. But not all ML projects are the same. Some are focused on conversational systems, while others are focused on recognition or predictive analytics. Yet others are focused on reinforcement learning or autonomous systems. Furthermore, these models can be deployed (or operationalized) in various ways. Some models might reside in the cloud or on on-premises servers, while others are deployed to edge devices or run in offline batch modes. These differences in ML application, deployment, and needs among data scientists, engineers, and ML developers make the concept of a single ML platform not particularly feasible. It would be a “jack of all trades and master of none.”
As such, we see four different platforms emerging: one focused on the needs of data scientists and model builders, another focused on big data management and data engineering, a third focused on model “scaffolding” and building systems to interact with models, and a fourth focused on managing the model lifecycle (“ML Ops”). The winners will focus on building out capabilities for each of these parts.
The Four Environments of AI (Source: Cognilytica)
The winners in the data science platform race will be the ones that simplify ML model creation, training, and iteration. They will make it quick and easy for companies to move from unintelligent systems to ones that leverage the power of ML to solve problems that previously could not be addressed by machines. Data science platforms that don’t enable ML capabilities will be relegated to non-ML data science tasks. Likewise, those big data platforms that inherently enable data engineering capabilities will be winners. Similarly, application development tools will need to treat machine learning models as first-class participants in their lifecycle just like any other form of technology asset. Finally, the space of ML operations (“ML Ops”) is just now emerging and will no doubt be big news in the next few years.
When a vendor tells you they have an AI or ML platform, the right response is to ask, “Which one?” As you can see, there isn’t just one ML platform, but rather different ones that serve very different needs. Make sure you don’t get caught up in vendors’ marketing hype and confuse what they say they have with what they actually have.