Selecting an Effective & Productive Machine Learning Platform
As you embark on building or enhancing your organization’s ML capabilities, the choice you make for your ML platform is crucial. Setting up a machine learning platform is a complicated exercise, because every company and team differs in size and function.
In my years of experience leading ML teams at different companies, I have had to go through this exercise many times. I believe the lessons I learned will be valuable to anyone who is setting up a platform or evaluating how to improve their current one.
I am fortunate to have seen and shaped ML practices across a variety of industries, team compositions, and technologies. I have worked across industries ranging from retail to digital media. I’ve worked with data science (DS) and machine learning engineering (MLE) teams of various shapes and sizes; in some cases taking a small team and rapidly scaling it up to over 80 data scientists. I’ve worked with ML & AI technologies across the Azure, Google Cloud, and Amazon Web Services landscapes.
Throughout all these experiences, I have gone over a number of considerations when it comes to selecting an ML platform. The major consideration has always been finding a solution that can accompany the company’s ever-changing Data Science and ML initiatives, and allow team members to execute at a rapid pace. I have seen challenges arise from multiple fronts and there are a number of learnings that I would like to share.
I see 3 key factors that shape a solid foundation for an ML platform:
- Cloud & Platform Portability
- Simplicity of Data Science tooling
- Notebook & Collaboration experience
1st Consideration: Cloud Platform Environment vs Platform Portability
Data Scientists and Machine Learning Engineers require convenient and secure access to data. The organization will usually have a primary cloud that holds its data assets, often in a data lake and/or data warehouse. In order to allow the DS & MLE teams to start being productive immediately, the decision is often to utilize that cloud platform’s native ML & AI tooling. However, this tooling usually comes with restrictions (e.g. lack of portability across environments, custom DSLs, etc.) that are not future-proof and can easily fall out of alignment with where the company’s data sits.
One company I worked at is a perfect example of this scenario. The bulk of the data was stored in AWS, hence most workloads were set up to execute on AWS EC2 instances. We spent considerable time setting up lots of infrastructure within the AWS environment. One year down the track, the technology team decided to move all the data assets to Google Cloud. Not only did it disrupt our rhythm, but it also incurred huge time & resource costs in retooling workflows across environments.
This could have been avoided if we had chosen a cloud-agnostic offering, such as Databricks. A cloud-agnostic offering allows us to develop machine learning activities & workflows independently, with the assurance that they will operate uniformly across clouds. This future-proofs our work, and allows our organizations to adopt multi-cloud architectures and/or move data architectures across clouds.
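Whichever platform you pick, one way to soften a migration like the one described above is to keep cloud-specific details behind a small interface. The sketch below is illustrative only, not something we actually built: `ObjectStore`, `LocalStore`, and `build_training_set` are hypothetical names, and a local-directory backend stands in for a real S3 or GCS client.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical interface: the only storage surface workflows may touch."""

    def read_text(self, key: str) -> str: ...

    def write_text(self, key: str, data: str) -> None: ...


class LocalStore:
    """Stand-in backend; an S3 or GCS backend would implement the same methods."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def read_text(self, key: str) -> str:
        return (self.root / key).read_text()

    def write_text(self, key: str, data: str) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(data)


def build_training_set(store: ObjectStore, raw_key: str, out_key: str) -> int:
    """Workflow code depends on the interface, not on any one cloud SDK."""
    rows = [r for r in store.read_text(raw_key).splitlines() if r.strip()]
    store.write_text(out_key, "\n".join(rows))
    return len(rows)


# Usage: drop blank rows from a raw file and write the cleaned training file.
store = LocalStore(Path(tempfile.mkdtemp()))
store.write_text("raw/events.csv", "a,1\n\nb,2\n")
n_rows = build_training_set(store, "raw/events.csv", "train/events.csv")
```

With this shape, moving clouds means writing one new backend class rather than retooling every workflow.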
2nd Consideration: Simplicity of Data Science Tooling
A normal workflow for data scientists usually includes exploratory data analysis (EDA), manipulating datasets into different forms, feature engineering, creating training datasets, applying multiple sampling strategies, building and comparing multiple models, picking the best model to go into production, and finally providing performance reporting for business stakeholders. That’s a lot of tasks, with different tools and languages involved. When working with a team of data scientists, you will also meet people with different tool preferences: some prefer Python or R, while others may be more comfortable with Scala or Java.
You usually end up with a few decisions that you have to make to simplify this technology landscape:
- Build and maintain common tooling functions that can be reused, so that team members are not duplicating tasks
- Build in an environment that is language agnostic, so that common tooling functions can be shared and reused easily
At one of my previous companies, we decided to build our own Automated Machine Learning (AutoML) platform. This was a platform made available to all our data scientists, and they could also improve the underlying algorithms.
We built this on Argo. It was widely adopted, with hundreds of models being trained concurrently and minimal bottlenecks thanks to Kubernetes auto-scaling. However, this setup is quite time- and resource-intensive, especially with the ever-changing metrics and model types of each new development. It required valuable MLE time to maintain the many working parts. Hence, in later stages, we started looking for a platform that would increase productivity, and decided to evaluate Databricks and Vertex AI.
We were already using open-source MLflow to capture model experimentation metrics. Databricks has very strong integration with MLflow, while Vertex AI requires lots of extra custom logging in order to provide the visibility that MLflow and Databricks can provide. That alone will save you lots of effort. I recently helped a startup set up their ML Platform from scratch with MLflow on Databricks, something that took hours rather than months.
Deciding which tool to use for the platform is another no-brainer: getting up to speed quickly is what brings value to the business, so the option that requires the least time on DS tooling setup is the most beneficial.
Model deployment and diagnosis
Our ML Platform should allow us to deploy models to production and warn us when the model’s performance starts to deteriorate. This means that the ease of deployment and diagnosis is very important. This is normally done by setting up your own containers, deploying models behind an API layer (e.g. Flask, FastAPI), and hosting them on cloud compute.
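To make the "warn us when performance deteriorates" requirement concrete, here is a minimal sketch of a rolling-accuracy check. It assumes ground-truth labels eventually arrive for each prediction; `AccuracyMonitor`, the window size, and the threshold are all hypothetical choices for illustration, not a feature of any particular platform.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window: int, threshold: float) -> None:
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True where prediction matched label

    def record(self, predicted: int, actual: int) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        # With no data yet, report perfect accuracy rather than dividing by zero.
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def degraded(self) -> bool:
        # Only alert once the window is full, to avoid noisy early warnings.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


# Usage: two correct and two incorrect predictions in a window of four.
monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
alert = monitor.degraded()
```

In a real deployment the `degraded()` check would feed an alerting channel; a managed platform saves you from building and hosting this plumbing yourself.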
A few things I have experienced in this area:
- Deploying models in containers requires extra setup to host and serve them, plus extra logging facilities to capture results
- In less mature teams, models are deployed as pickle files and executed separately. Results are then captured retrospectively, or only the model outputs are saved, which creates confusion when diagnosing them.
Based on these points, it is very important to simplify the deployment workflow and minimize the variation between different team members’ work. So to make this work systematically, one would require detailed frameworks and guidelines to ensure consistency in deployments and minimize tech debt.
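One guideline that addresses the pickle-file problem is to record the inputs, the output, and a model version at prediction time, instead of reconstructing them later. The sketch below is an assumption-laden illustration: `load_versioned`, `score`, and `predict_and_log` are hypothetical helpers, a plain dict of weights stands in for a real pickled model, and an in-memory list stands in for a real log sink.

```python
import hashlib
import json
import pickle
import time


def load_versioned(blob: bytes):
    """Unpickle a model and derive a version tag from the artifact bytes."""
    return pickle.loads(blob), hashlib.sha256(blob).hexdigest()[:12]


def score(model: dict, features: list) -> float:
    """Toy linear model: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]


def predict_and_log(model: dict, version: str, features: list, log: list) -> float:
    """Record inputs, output, and model version together, at prediction time."""
    prediction = score(model, features)
    log.append(json.dumps({
        "ts": time.time(),
        "model_version": version,
        "features": features,
        "prediction": prediction,
    }))
    return prediction


# Usage: the pickled dict stands in for a trained model artifact.
blob = pickle.dumps({"weights": [0.5, 0.25], "bias": 1.0})
model, version = load_versioned(blob)
audit_log = []
result = predict_and_log(model, version, [2.0, 4.0], audit_log)
```

Because the version tag is derived from the artifact itself, any logged prediction can be traced back to the exact file that produced it.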
Hence, after comparing many of the existing platforms in the field, I found that Databricks stands out again for its ease of deployment. This results in a more standardized deployment workflow, which reduces inconsistency of work between team members. With easy-to-use tagging and testing of staging and production models, Databricks allows you to manage all of this without a lot of heavy lifting.
Also, Databricks shares a common feature with Kubeflow: it lets you attach diagnostics to the model being built during experimentation. These are very nice add-ons that took some extra effort to develop at one of my previous jobs.
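For readers who have not used a model registry, the staging/production idea can be sketched in a few lines. This is a hypothetical in-memory toy (`ModelRegistry` and its stage names are my own illustration), not the Databricks or MLflow API; it only shows the bookkeeping that a managed registry takes off your hands.

```python
STAGES = ("None", "Staging", "Production", "Archived")


class ModelRegistry:
    """Hypothetical in-memory registry: model name -> {version: stage}."""

    def __init__(self) -> None:
        self.versions = {}

    def register(self, name: str) -> int:
        """Add a new version of `name` and return its version number."""
        stages = self.versions.setdefault(name, {})
        version = len(stages) + 1
        stages[version] = "None"
        return version

    def transition(self, name: str, version: int, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        stages = self.versions[name]
        if stage == "Production":
            # Keep at most one production version; archive the previous one.
            for v, s in stages.items():
                if s == "Production":
                    stages[v] = "Archived"
        stages[version] = stage

    def production_version(self, name: str):
        """Return the current production version, or None if there is none."""
        stages = self.versions[name]
        return next((v for v, s in stages.items() if s == "Production"), None)


# Usage: promote version 2 through staging, replacing version 1 in production.
registry = ModelRegistry()
v1 = registry.register("churn")
v2 = registry.register("churn")
registry.transition("churn", v1, "Production")
registry.transition("churn", v2, "Staging")
registry.transition("churn", v2, "Production")
```

Having every team member promote models through the same transitions is what removes the variation between individual deployment habits.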
3rd Consideration: Notebook Development Environment
Notebooks enable DS and MLEs to easily get started and share knowledge. Notebooks are a key productivity enabler for my teams, and something I always evaluate seriously. In companies I have worked at, I see two common patterns:
- Self-hosted Jupyter Notebooks attached to Kubernetes clusters
- Managed Notebooks (e.g. Databricks notebooks / Vertex AI Workbench notebooks / Kubeflow-hosted notebooks with attached containers) where the underlying infrastructure and administrative overheads are taken care of by the platform provider
Method 1 definitely costs the most, in time and effort, to host and customize. I would consider this a good learning exercise, but not the most responsible way to empower teams to work fast. Someone has to administer and solve problems such as resource allocation issues, drive mounts, managing container versions for each user, package management, security enforcement, and the list goes on!
Method 2 is what I always recommend. There are several choices in the market, and I have a preference for Databricks notebooks. My experience with Vertex AI notebooks is that they are well set up; however, when using AutoML-related support, not as much detail is provided as in Databricks. Also, the relationship between cluster resources and notebook allocations is not easily viewable by users; I find that Databricks does a much better job of displaying this information than other platforms. On another platform, I have seen users with limited experience struggle to manage requirements within notebooks (we needed extra steps to set this up for them), whereas this feature comes as default in Databricks notebooks.
Another point is that the environment setup — with Python, SQL, and Spark — is the best with Databricks. Databricks takes care of pre-loading all the latest libraries and pre-configures them to interoperate well with each other. You also have the option to pre-load your custom libraries into a Databricks Workspace and have them immediately available to your team.
Databricks notebooks also allow Data Scientists to easily get started in their language of choice: Python, SQL, R, or Scala. From my experience with other platforms, there is a lot more effort involved in setting up your environment. That could make sense if you have a lot of support staff; however, if you are a slim, data scientist-heavy team, you will benefit from a ready-to-use platform.
Other evaluation criteria
Looking across managed notebooks, I can summarise my key evaluation criteria:
- Transparency: Databricks provides the most transparent AutoML offering. Their glass box approach allows you to download the notebooks associated with your AutoML runs. The notebooks give you all the code (including code for feature engineering, hyperparameters, deployment, and inference) to reproduce any model built using AutoML. This gives data scientists full confidence in getting to a minimal viable model (MVM) rapidly, downloading the associated code, and further customizing it to fit their problem domain. In my experience with other platforms, AutoML produces an end result for you but does not give you access to the chosen hyperparameters or the low-level code to reproduce model artifacts. As far as I’m concerned, this is a blocker for any complicated modeling task or diagnosis.
- Multi-Language Support: I love how a DS or MLE can work with their language of choice (Python, SQL, R, or Scala) in a notebook, and even mix and match languages in the same notebook! You will appreciate the simplicity of this feature, which I have not been able to find on any other platform.
- Feature Store Integration: One of the most important features! (pardon the pun!) At one of the companies I worked at, we spent a lot of effort setting up our own feature store that supported streaming features. I can appreciate how much effort is required to build it up to production-grade standards and to maintain and evolve the API interfaces. It is advantageous to have a readily available, production-ready Feature Store, unless you want to spend 6 months building one and then constantly supporting its maintenance and development.
Conclusion
There are a lot of other smaller challenges I have been through with my teams in the past; however, I think these are the major ones that you should consider when choosing an ML platform for your team!
- Cloud & Platform Portability
- Simplicity of Data Science tooling
- Notebook & Collaboration experience
So hopefully, the next time you have to build or evaluate the setup of your Machine Learning Platform, you can use the above to decide what should be used.