MLOps guide
I help companies deploy machine learning into production. I write about AI applications, tooling, and best practices.
A collection of materials from introductory to advanced. This is roughly the path I’d follow if I were to start my MLOps journey again.
Table of contents
ML + engineering fundamentals
MLOps
…. Overview
…. Intermediate
…. Advanced
Career
Case studies
Bonus
ML + engineering fundamentals
While it’s tempting to want to get straight to ChatGPT, it’s important to have a good grasp of machine learning, deep learning, NLP, and reinforcement learning fundamentals.
- 10 free ML courses: make sure to take those classes in order.
- [Book] Machine Learning: A Probabilistic Perspective (Kevin P. Murphy). A draft PDF link can be found here.
- [Book] Information Theory, Inference, and Learning Algorithms (David MacKay). Free online version here.
- [Book] Deep Learning (Ian Goodfellow, Yoshua Bengio, and Aaron Courville). Free online version.
- [Book] Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze). Essential for anyone interested in Natural Language Processing. Free online version.
- [Book] Reinforcement Learning: An Introduction (Richard S. Sutton and Andrew G. Barto). Essential for reinforcement learning. Free online version.
- [Tutorials] OpenAI’s Spinning up in Deep Reinforcement Learning: A collection of articles that give great intuition for many RL algorithms. Highly recommended for anyone interested in RL.
- [Video] Andrej Karpathy’s Zero to Hero series
- Tools and concepts I’d prioritize learning
- A survivor’s guide to AI courses at Stanford (Updated Feb 2020)
What’s MLOps?
Ops in MLOps comes from DevOps, short for Development and Operations. To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it.
Currently, this section contains a lot of my own writing, admittedly because of my bias and because, when I set out to learn about MLOps, there weren’t many resources about it yet. I’ll add more materials soon!
- [Book] Designing Machine Learning Systems (O’Reilly, 2022)
- [Community] Some of the best discussions I have had are on our MLOps Discord server [15k+ members]. You’re welcome to ask questions and join us in our monthly talks/discussions!
Overview
Overview of ML in production.
- [Video] Machine learning production myths (Stanford’s MLSys Seminars)
- [Lecture note] Introduction to machine learning in production
- Rules of Machine Learning: Best Practices for ML Engineering (Martin Zinkevich, 2019)
- What I learned from looking at 200 machine learning tools [Jun 2020]
- Machine Learning Tools Landscape v2 (+84 new tools) [Dec 2020]
- The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction (Breck et al., 2017)
- Building LLM applications for production
Intermediate
Deep dives into different aspects of ML production.
- [Lecture note] Creating training data: sampling, labeling, handling class imbalance, data augmentation
- [Lecture note] Feature engineering
- [Book excerpt] Data Distribution Shifts and Monitoring
- Instrumentation, Observability & Monitoring of Machine Learning Models (Josh Wills, 2019)
- RLHF: Reinforcement Learning from Human Feedback
Advanced
Build the best MLOps platform for your organization!
- Real-time machine learning: challenges and solutions
- [Lecture note] Data system fundamentals for data scientists
- A friendly introduction to machine learning compilers and optimizers
- Why data scientists shouldn’t need to know Kubernetes
- Self-serve feature platforms: architectures and APIs
Career
- [Free book] Machine Learning Interviews Book
- [Twitter thread] The ML interviews process
- Career advice for recent Computer Science graduates
- Four lessons I learned after my first full-time job after college
- 7 reasons not to join a startup and 1 reason to
- Analysis of compensation, level, and experience details of 19k tech workers
- What Glassdoor interview reviews reveal about tech hiring cultures
- What we look for in a resume
Case studies
To get a sense of the challenges of machine learning production, it’s helpful to learn from companies that are doing it.
- Using Machine Learning to Predict Value of Homes On Airbnb (Robert Chang, Airbnb Engineering & Data Science, 2017)
In this detailed and well-written blog post, Chang describes how Airbnb used machine learning to predict an important business metric: the value of homes on Airbnb. It walks you through the entire workflow: feature engineering, model selection, prototyping, and moving prototypes to production. It also includes lessons learned, tools used, and code snippets.
- Using Machine Learning to Improve Streaming Quality at Netflix (Chaitanya Ekanadham, Netflix Technology Blog, 2018)
As of 2018, Netflix streams to over 117M members worldwide, half of them living outside the US. This blog post describes some of their technical challenges and how they use machine learning to overcome them, including predicting network quality, detecting device anomalies, and allocating resources for predictive caching.
To understand Netflix’s infrastructure for machine learning, check out Ville Tuulos’s talk Human-Centric Machine Learning Infrastructure @Netflix.
- 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com (Bernardi et al., KDD, 2019)
As of 2019, Booking.com has around 150 machine learning models in production. These models solve a wide range of prediction problems (e.g. predicting users’ travel preferences and how many people they travel with) and optimization problems (e.g. optimizing the background images and reviews to show for each user). Adrian Colyer gave a good summary of the six lessons learned here:
- Machine learned models deliver strong business value.
- Model performance is not the same as business performance.
- Be clear about the problem you’re trying to solve.
- Prediction serving latency matters.
- Get early feedback on model quality.
- Test the business impact of your models using randomized controlled trials.
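The last lesson, testing business impact with a randomized controlled trial, can be sketched in a few lines. This is only an illustration and not Booking.com's methodology: it assumes a single binary outcome (conversion) and uses a standard two-proportion z-test. The function name and the traffic numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and treatment (B).

    Returns (absolute lift of B over A, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return p_b - p_a, z, p_value


# Hypothetical experiment: users randomly split between the old model (A)
# and the new model (B), with conversions counted per group.
lift, z, p_value = two_proportion_z_test(
    conv_a=1000, n_a=20000, conv_b=1100, n_b=20000
)
print(f"lift={lift:.4f} z={z:.2f} p={p_value:.4f}")
```

A small p-value suggests the difference in conversion is unlikely to be noise, but it says nothing about whether the offline model metric improved, which is exactly why the second lesson (model performance is not business performance) matters.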
- Machine Learning-Powered Search Ranking of Airbnb Experiences (Mihajlo Grbovic, Airbnb Engineering & Data Science, 2019)
This article walks you step by step through a canonical example of the ranking and recommendation problem. The four main steps are system design, personalization, online scoring, and business aspects. The article explains which features to use, how to collect data and label it, why they chose Gradient Boosted Decision Trees, which testing metrics to use, what heuristics to take into account while ranking results, and how to do A/B testing during deployment. Another wonderful thing about this post is that it also covers personalization to rank results differently for different users.
- From shallow to deep learning in fraud (Hao Yi Ong, Lyft Engineering, 2018)
Fraud detection is one of the earliest use cases of machine learning in the industry. This article explores the evolution of fraud detection algorithms used at Lyft. At first, an algorithm as simple as logistic regression with engineered features was enough to catch most fraud cases. Its simplicity allowed the team to understand the importance of different features. Later, when fraud techniques became too sophisticated, more complex models were required. This article explores the tradeoffs between complexity and interpretability, and between performance and ease of deployment.
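To make the interpretability point concrete, here is a minimal sketch of logistic regression trained with plain gradient descent, where the learned weights can be read directly as feature importance. It is not Lyft's code: the feature names, the synthetic data, and the hyperparameters are all assumptions made up for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_logistic_regression(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = pred - yi  # gradient of the log loss w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical, already-standardized features for each transaction.
FEATURES = ["amount", "account_age", "unrelated_noise"]

rng = random.Random(0)
X, y = [], []
for _ in range(500):
    amount, age, noise = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    # Synthetic ground truth: large amounts and young accounts are riskier;
    # the third feature has no influence on the label.
    fraud_prob = sigmoid(2.0 * amount - 1.5 * age)
    X.append([amount, age, noise])
    y.append(1 if rng.random() < fraud_prob else 0)

w, b = train_logistic_regression(X, y)
for name, weight in zip(FEATURES, w):
    print(f"{name:>16}: {weight:+.2f}")
```

Because the features share a scale, the sign and magnitude of each weight show which signals push a prediction toward fraud. That direct readability is what a team gives up when moving to more complex models.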
- Space, Time and Groceries (Jeremy Stanley, Tech at Instacart, 2017)
Instacart uses machine learning to solve the task of path optimization: how to most efficiently assign tasks to multiple shoppers and find the optimal paths for them. The article explains the entire process of system design, from framing the problem and collecting data to selecting algorithms and metrics, topped off with a tutorial on beautiful visualization.
- Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning (Brad Neuberg, Dropbox Engineering, 2017)
An application as simple as a document scanner has two distinct components: optical character recognition and a word detector. Each requires its own production pipeline, and the end-to-end system requires additional steps for training and tuning. This article also goes into detail about the team’s effort to collect data, which includes building their own data annotation platform.
- Scaling Machine Learning at Uber with Michelangelo (Jeremy Hermann and Mike Del Balso, Uber Engineering, 2019)
Uber uses machine learning extensively in production, and this article gives an impressive overview of their end-to-end workflow, where machine learning is applied at Uber, and how their teams are organized.
- How we grew from 0 to 4 million women on our fashion app, with a vertical machine learning approach (Gabriel Aldamiz, HackerNoon, 2018)
To offer automated outfit advice, Chicisimo tried to qualify people’s fashion taste using machine learning. Due to the ambiguous nature of the task, the biggest challenges are framing the problem and collecting the data for it. The article addresses both. It also covers a problem that every consumer app struggles with: user retention.
Bonus
Some stuff I did that doesn’t quite fit into any section above, but I want to share anyway :P
- [Code] Python-is-cool: Cool Python features that I used to be too afraid to use
- [Code] just-pandas-things: Pandas quirks that used to traumatize me
- [Code] Coding exercises and solutions for coding interviews
- [Video] Switching From a Batch to Streaming Mindset w/ Chip Huyen
- [VentureBeat] 4 AI and ML job hunting tips from Chip Huyen
- [Booklet] Machine learning systems design (2019): My initial notes on ML systems. This 8,000-word booklet provided ideas for the book Designing Machine Learning Systems, published in 2022.