Machine learning powers most of the recent advancements in AI, including computer vision, natural language processing, predictive analytics, autonomous systems, and a wide range of other applications. To move up the data value chain from the information level to the knowledge level, we need machine learning that enables systems to identify patterns in data and apply what they learn from those patterns to new, never-before-seen data. Machine learning is not all of AI, but it is a big part of it.
While building machine learning models is fundamental to today’s narrow applications of AI, there are a variety of different ways to go about realizing the same ends. So-called machine learning platforms facilitate and accelerate the development of machine learning models by providing functionality that combines many necessary activities for model development and deployment. Since the fields of machine learning and data science are not new, there are a large number of tools that help with different aspects of machine learning development.
Five Key Platforms for Building Machine Learning Models
There are five major categories of solutions that provide machine learning development capabilities:
- Machine learning toolkits
- Machine learning platforms
- Analytics solutions
- Data science notebooks
- Cloud-native Machine Learning as a Service (MLaaS) offerings
There are seven primary patterns in the way that AI is implemented for applications. At a high level, those seven patterns are shown in the figure below:
[Figure: The Seven Patterns of AI]
Machine learning systems are core to enabling each of these seven patterns of AI.
It is important to understand the relationships between machine learning algorithms, machine learning models, and training data. A machine learning model is the product of training a machine learning algorithm with training data. In other words, it is the result of a machine learning training process. Models are generated when algorithms are applied to a specific data set. While algorithms are simply general approaches to solving an objective, machine learning models can evaluate future, unknown data and produce predictions or insights. Therefore, one can create many models from the same algorithm, as long as different training data are available. A machine learning model is a mathematical representation of a specific, situational pattern in the data, which can then be applied to real-world situations.
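The distinction between algorithm and model can be illustrated with a minimal sketch. It assumes a deliberately simple algorithm (ordinary least squares on a single input) and two made-up training sets; the names fit_line, model_a, and model_b are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """The 'algorithm': ordinary least squares for a single input feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar

    # The returned function is the 'model': it holds what was learned from
    # this particular training set and can score inputs it has never seen.
    def model(x):
        return slope * x + intercept

    return model


# Two hypothetical training sets fed to the same algorithm.
model_a = fit_line([1, 2, 3, 4], [2, 4, 6, 8])    # data follows y = 2x
model_b = fit_line([1, 2, 3, 4], [5, 8, 11, 14])  # data follows y = 3x + 2

print(model_a(10))  # 20.0
print(model_b(10))  # 32.0
```

The same algorithm yields two different models because the training data differ, which is the relationship described above.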
Machine Learning Platforms
Machine learning platforms facilitate and accelerate the development of machine learning models by providing functionality that combines many necessary activities for model development and deployment including:
- Selection of one or more algorithms to use for a particular learning task
- Processing of training data from numerous sources
- Accelerating machine learning using cluster or GPU computing resources
- Automating repetitive and time-consuming data scientist tasks
- Assisting with model evaluation and hyperparameter selection and tuning
- Integrating other data science or data engineering tooling to value-add machine learning processes
- Visualizing machine learning processes and outcomes
- Enabling machine learning model deployment
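To make the model evaluation and hyperparameter tuning item more concrete, here is a rough sketch of the kind of work a platform automates. It assumes a toy k-nearest-neighbors classifier on one-dimensional data and a simple hold-out validation set; the data and the names knn_predict, accuracy, and select_k are hypothetical and not taken from any particular platform.

```python
def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    # With binary labels and odd k there are no ties.
    return 1 if votes * 2 > k else 0


def accuracy(train, validation, k):
    """Fraction of held-out examples classified correctly for a given k."""
    correct = sum(knn_predict(train, x, k) == label for x, label in validation)
    return correct / len(validation)


def select_k(train, validation, candidates):
    """Grid search: score every candidate k on held-out data and keep the best."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first best candidate wins on ties
    return best_k, scores


# Hypothetical data: label is 0 below x = 5 and 1 above, with one noisy
# training point at x = 2.5 that is labeled 1.
train = [(1.0, 0), (2.0, 0), (2.5, 1), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
validation = [(2.4, 0), (2.7, 0), (3.5, 0), (6.5, 1), (8.5, 1)]

best_k, scores = select_k(train, validation, candidates=[1, 3, 5])
print(best_k, scores)
```

In this toy setup k = 1 is misled by the noisy point while larger values are not. A platform typically runs this kind of search across many more hyperparameters and on real compute resources.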
Machine learning development platforms combine capabilities across different task areas. In particular, they aim to:
- Provide the building blocks to create solutions for data science problems leveraging as wide an array of available machine learning algorithms as possible
- Establish a good environment for data scientists to experiment with different machine learning models and outcomes and evaluate their performance
- Support the needs of processing data for data preparation, data exploration, visualization, and data aggregation
- Enable data scientists to work in both online and offline modes of model development and iteration
- Allow data, models, artifacts, visualizations, evaluations, and other relevant model-related information to be shared among data scientists and other managers
- Accelerate the process of machine learning model development, evaluation, and deployment
- Help improve overall performance, accuracy, and efficiency of machine learning models
Data Science Notebooks
First popularized by academically oriented, math-centric platforms like Wolfram Mathematica and MathWorks MATLAB, but now prominent in the Python, R, and SAS communities, data science notebooks are used to perform data science experiments, document data research, and simplify reproducibility of results by allowing the notebook to run on different source data. Data science notebooks are shared, collaborative environments where groups of data scientists can work together and iterate models over constantly evolving data sets. While notebooks don’t make great environments for developing code, they make great environments to collaborate, explore, and visualize data. Indeed, the best notebooks are used by data scientists to quickly explore large data sets, assuming sufficient access to clean data.
Data science notebooks, including open source offerings such as Jupyter, RStudio, and Apache Zeppelin, offer a combination of data aggregation, data visualization, coding, model training, and model evaluation. The resulting models can be ported to other platforms for further operationalization. For small-scale machine learning model development activities, data science notebooks can provide most of what is needed without having to invest further in larger-scale machine learning platforms.
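As a rough illustration of what porting a model can involve, the sketch below assumes a trivially small linear model whose learned parameters are written out as JSON in one environment and rebuilt in another. The functions export_model and load_model and the JSON layout are hypothetical; real projects would normally rely on the serialization format of their chosen toolkit.

```python
import json


def export_model(slope, intercept):
    """Serialize learned parameters to a JSON string another system can load."""
    return json.dumps({"type": "linear", "slope": slope, "intercept": intercept})


def load_model(payload):
    """Rebuild a prediction function from the exported parameters."""
    params = json.loads(payload)
    if params["type"] != "linear":
        raise ValueError("unsupported model type: " + params["type"])
    return lambda x: params["slope"] * x + params["intercept"]


# In the notebook: export the parameters found during experimentation.
payload = export_model(slope=2.0, intercept=1.0)

# On the target platform: restore the model and use it for predictions.
restored = load_model(payload)
print(restored(3.0))  # 7.0
```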
While Data Science Notebooks provide many of the features of machine learning platforms, they aren’t in themselves fully featured machine learning platforms. Often Data Science Notebooks are used during the experimentation and initial training phases and projects are moved to more fully-functional machine learning platforms once those first iterations are completed.
Machine Learning Toolkits
The fields of machine learning and data science are not new, predating the latest wave of market interest by decades. As such, there are a large number of point-solution tools that help with different aspects of machine learning development. These tools perform some of the functions of the machine learning platforms listed above, but are meant to be used in a modular fashion, in conjunction with other tools or as part of larger machine learning platforms.
These machine learning toolkits are very popular and many are open source. Some focus on specific machine learning algorithms and applications, such as Keras, TensorFlow, and PyTorch, which focus on the development of deep learning models, while others, such as Apache Mahout and scikit-learn, provide a range of machine learning algorithms and tools to be used for various parts of the lifecycle. Many of these toolkits are embedded in larger machine learning platform solutions, but can also be used in a standalone fashion or inside data science notebook environments.
Many of these toolkits have thousands of developers and data scientists using them. However, many focus on a narrow aspect of machine learning model development, such as building deep learning neural networks or specific kinds of supervised learning models. As such, they are not meant to provide a comprehensive set of algorithms across all machine learning approaches or AI patterns. Instead, they are often used as components of larger machine learning platforms or in conjunction with data science notebooks or other such tools.
Furthermore, many of the machine learning toolkits have the backing of large technology companies that have spurred their development. For example, Facebook supports PyTorch and Caffe2, Google supports Keras and TensorFlow, Amazon supports MXNet, and Microsoft supports the Cognitive Toolkit (CNTK), while other toolkits are backed by companies such as IBM, Baidu, Apple, and Netflix.
General Purpose Analytic Suites
In addition to data science notebooks, ML toolkits, and ML platforms, vendors of solutions traditionally aimed at data analytics, statistics, and mathematics applications have realized the power of adding machine learning capabilities to their existing statistical and analytics offerings. Many of these offerings, such as MathWorks MATLAB, SAS, IBM SPSS, and Wolfram Mathematica, have had decades of real-world adoption and leverage their strength in enterprise and research environments. Companies that have already invested in analytics solutions will find that they can retain and expand on their existing analytic tools, which now support machine learning development and deployment.
Many solutions for Data Science Notebooks, machine learning toolkits, machine learning platforms, and analytic suites are available as open source offerings. Indeed, open source offerings dominate the space for machine learning as much of the work for machine learning was done in research and academic environments that have tended to support open source offerings to a greater degree than enterprises. For all categories of machine learning solutions explored in this report there are both open source offerings as well as paid, commercial solutions. Paid commercial solutions generally offer support, greater range of features and add-ons, consultative services, training, access to high powered compute resources, and other benefits as part of the price paid to access the solution.
Cloud-based Machine Learning Environments
Another consideration for those looking to build machine learning models is whether models should be built in an online, cloud-based environment or on local machines. The benefits of working in the cloud are the use of available computing and storage infrastructure as well as enhanced collaboration and access to large data sets. The benefits of working on a local machine or in an on-premises environment are the speed of development, lower cost of storage and bandwidth, secure and reliable access to information, and the possibility of working while offline.
In the early days of machine learning work, most machine learning models were developed on the local machines of data scientists (on laptops, even!), and the models were then moved or ported once the desired objectives had been reached. However, the emergence of strong cloud-based alternatives provides a way to run machine learning projects from start to finish in a cloud-based environment.
The main advantages of cloud-based machine learning model development include:
- Easy to scale up: Companies can test their solution on cloud-based machine learning infrastructure and then scale to larger production workloads as demand increases.
- Access to a wide range of value-add capabilities: Cloud-based services provide native integration with additional tools for data engineering, pre-trained models, model lifecycle functions, a wide and increasing array of algorithms, and automated machine learning capabilities. This reduces the need to cobble together solutions, lowering both the technical threshold and the complexity for companies.
However, disadvantages of cloud-based machine learning platforms include:
- Difficulty in integrating with local data: Requirements that data remain local rather than in the cloud environment can pose technical, compliance, and cost challenges when integrating with cloud-native machine learning platforms.
- Potential for high cost in the experimentation phase: Cloud providers generally charge for storage, computing, bandwidth, functionality, and other considerations. While there are “free tiers” that can provide relief for small implementations, large organizations that do all their experimentation in the cloud can see potentially high costs that can be avoided with local, on-premises experimentation.
Machine Learning as a Service (MLaaS)
Extending this concept of cloud-native machine learning platforms, some vendors have set up entirely cloud-based machine learning offerings that provide not only machine learning development capabilities but also full machine learning lifecycle functionality, robust data management capabilities, pre-trained machine learning models, and other non-machine learning capabilities that build upon existing Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) capabilities. These Machine Learning as a Service (MLaaS) offerings cover a broad range of services priced on a per-consumption basis, such as per-minute computing, per-storage unit, metered traffic, and query-based pricing.
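To see how consumption-based pricing adds up, the following back-of-the-envelope sketch may help. Every rate and usage figure in it is a made-up assumption for illustration only, as are the names Rates and monthly_cost; actual providers publish their own pricing and metering rules.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices; not taken from any real provider."""
    compute_per_minute: float
    storage_per_gb_month: float
    traffic_per_gb: float
    per_thousand_queries: float


def monthly_cost(rates, compute_minutes, storage_gb, traffic_gb, queries):
    """Sum the four metered components named above for one month of usage."""
    return (
        rates.compute_per_minute * compute_minutes
        + rates.storage_per_gb_month * storage_gb
        + rates.traffic_per_gb * traffic_gb
        + rates.per_thousand_queries * queries / 1000
    )


# Assumed rates and usage for a single experimentation-heavy month.
rates = Rates(
    compute_per_minute=0.05,
    storage_per_gb_month=0.02,
    traffic_per_gb=0.10,
    per_thousand_queries=0.40,
)
estimate = monthly_cost(
    rates, compute_minutes=6000, storage_gb=500, traffic_gb=200, queries=250000
)
print(round(estimate, 2))  # 430.0
```

Under these assumed numbers, compute accounts for most of the bill, which is consistent with the point that heavy experimentation in the cloud can become expensive.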
Related to the idea of MLaaS is the concept of Model-as-a-Service, in which cloud-based providers offer metered access to pre-trained models via API on a consumption basis. Some in the industry equate MLaaS with Model-as-a-Service rather than with cloud-based machine learning platforms. For the purposes of this report, we treat Model-as-a-Service as a separate topic, covered in a separate report, as Model-as-a-Service vendors are not focused on enabling customers to build and manage their own models using their own data. There is no agreement on whether MLaaS refers to cloud-native ML platforms or to Model-as-a-Service, and as such, in this report we refer to cloud-native ML platforms as cloud-native ML platforms/MLaaS.
Making Sense of It All
There certainly are a lot of different things to consider in terms of where, when, and how to choose the right machine learning development platform, something I spend quite a bit of my time looking at as an analyst at Cognilytica. At the foundational level, however, simply knowing what the different options are and that there is no one-size-fits-all for machine learning will help you make better decisions and not fall into the trap of vendor hype and spin.