
Airflow + Scikit-Learn: A Hacker's Guide to Deploying Machine Learning Models

Zachary Beaver, Alphabet Inc. / Nest Labs

Audience level: Intermediate
Topic area: Case Study

Description

Data scientists often hit roadblocks when "productionizing" their machine learning models. This talk is about making that "last mile" of analysis easier by leveraging a popular workflow tool called Airflow. We'll walk through how Nest uses it to build and deploy machine learning models for fraud detection and then discuss more generally the unique benefits it provides to Pythonic data scientists.

Abstract

Airflow + Scikit-Learn: A Hacker's Guide to Deploying Machine Learning Models

What's the Problem?

  • The "last mile" of ML = Deployment

  • Data scientists' love/hate relationship with the "last mile"

  • Goal of this talk: Present a workflow that ameliorates this "last mile" problem

Common Solutions (...and why I hate them)

  • Hand your work off to a software engineer

  • Set up your own cron-job and server

  • Use a paid service

  • Refresh it manually, as needed

  • Be a bad-ass, full-stack data scientist (I wish...sigh)

Airflow to the Rescue! (...sometimes)

  • What is Airflow?

  • Airflow does what data scientists don't

-- Monitoring

-- Alerting

-- Task Dependencies

-- Pre-built hooks into data sources

  • It provides the right level of abstraction

-- All jobs need a base level of monitoring

-- Airflow allows you to wrap your code with monitoring goodness

-- Code receives this monitoring via class inheritance (see the sketch below)

  • Signals that Airflow may be a good choice:

-- Many disparate data sources feeding analysis

-- Multiple sub-tasks and dependencies

-- You desire more autonomy in production

  • Benefits demonstrated with the following example...
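
Before diving into the Nest example, here's a minimal sketch of how those pieces fit together. Everything below (operator and task names, the email address, the weekly schedule, the model path) is illustrative, and the import paths follow the Airflow 1.x API; newer releases move some modules and no longer need @apply_defaults.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.models import BaseOperator
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path
    from airflow.utils.decorators import apply_defaults


    class TrainFraudModelOperator(BaseOperator):
        """Inheriting from BaseOperator gives this task Airflow's logging,
        retries, alerting, and UI status tracking 'for free'."""

        @apply_defaults
        def __init__(self, model_output_path, *args, **kwargs):
            super(TrainFraudModelOperator, self).__init__(*args, **kwargs)
            self.model_output_path = model_output_path

        def execute(self, context):
            # Placeholder for the real scikit-learn training code.
            self.log.info("Training model, writing to %s", self.model_output_path)


    def extract_features(**context):
        """Placeholder: pull raw data and build the feature table."""


    default_args = {
        "owner": "data-science",
        "retries": 2,                         # monitoring: automatic retries
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,             # alerting: email the team when a task fails
        "email": ["ds-oncall@example.com"],
    }

    dag = DAG(
        "fraud_model_pipeline",
        default_args=default_args,
        start_date=datetime(2017, 1, 1),
        schedule_interval="@weekly",
    )

    extract = PythonOperator(task_id="extract_features",
                             python_callable=extract_features, dag=dag)
    train = TrainFraudModelOperator(task_id="train_model",
                                    model_output_path="/models/fraud/candidate.pkl",
                                    dag=dag)

    extract >> train  # task dependencies: training waits for feature extraction

The data scientist's code lives in the python_callable and execute bodies; the retries, alerting, and UI status come from the framework.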

Airflow in Practice: Building + Deploying ML Fraud Models

  • The Problem: Hardware Fraud at Nest

-- A brief overview

-- The world before machine learning

-- Problems with pre-ML approaches

  • The Solution: ML model to prioritize cases for investigator review

Building the Model

  • Prototyping locally

  • Feature engineering

  • Measuring lift

  • Prototyping = learning lessons for production
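
A rough sketch of that local prototyping loop, with made-up feature names, a toy CSV extract, and top-decile lift as the (assumed) metric:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Local extract of historical cases, small enough to iterate on quickly.
    df = pd.read_csv("orders_sample.csv")
    features = ["order_amount", "account_age_days", "num_prior_chargebacks"]
    X, y = df[features].values, df["is_fraud"].values

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", GradientBoostingClassifier()),
    ]).fit(X_train, y_train)

    # "Lift": how much more fraud the top-scored decile contains vs. the base rate.
    scores = model.predict_proba(X_valid)[:, 1]
    top_decile = np.argsort(scores)[::-1][: max(1, len(scores) // 10)]
    lift = y_valid[top_decile].mean() / y_valid.mean()
    print("Top-decile lift: {:.1f}x".format(lift))

The same fit-and-score code can later become the body of an Airflow task, which is where the prototyping lessons pay off in production.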

The Deployment Ecosystem (and why it's hard)

  • Deployment Diagram

  • Specific challenges for ML deployment

  • Benefits of letting Airflow manage the workflow
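
One concrete benefit: the scoring step can lean on Airflow's pre-built hooks and centrally managed connections instead of hand-rolled database plumbing. A hypothetical task body (the connection id, table and column names, and model path are made up for illustration, and hook import paths vary by Airflow version):

    import joblib
    from airflow.hooks.postgres_hook import PostgresHook  # Airflow 1.x import path

    FEATURES = ["order_amount", "account_age_days", "num_prior_chargebacks"]

    def score_new_cases(**context):
        """Pull unscored cases from the warehouse, score them, write results back."""
        hook = PostgresHook(postgres_conn_id="warehouse")  # connection configured in Airflow, not in code
        cases = hook.get_pandas_df(
            "SELECT case_id, order_amount, account_age_days, num_prior_chargebacks "
            "FROM fraud_cases WHERE fraud_score IS NULL")

        model = joblib.load("/models/fraud/current.pkl")   # latest published model artifact
        cases["fraud_score"] = model.predict_proba(cases[FEATURES])[:, 1]

        hook.insert_rows(
            table="fraud_scores",
            rows=cases[["case_id", "fraud_score"]].itertuples(index=False, name=None),
            target_fields=["case_id", "fraud_score"],
        )

Wrapped in a PythonOperator, this step inherits the same retries, alerting, and UI visibility as every other task in the DAG.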

Model-Building Cadence

  • Considerations for model refresh timing

  • Accuracy/sanity checks for new models
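
For instance, the refresh DAG can end in a gate that refuses to promote a candidate model that fails a sanity check. The metric, threshold, and paths below are assumptions for illustration:

    import joblib
    from sklearn.metrics import roc_auc_score

    def validate_and_promote(holdout_X, holdout_y,
                             candidate_path="/models/fraud/candidate.pkl",
                             current_path="/models/fraud/current.pkl",
                             min_auc=0.85):
        """Promote the freshly trained model only if it clears a minimum AUC."""
        candidate = joblib.load(candidate_path)
        auc = roc_auc_score(holdout_y, candidate.predict_proba(holdout_X)[:, 1])
        if auc < min_auc:
            # Raising marks the Airflow task as failed, which triggers alerting
            # and leaves the currently deployed model untouched.
            raise ValueError("Candidate AUC {:.3f} below {}; not promoting.".format(auc, min_auc))
        joblib.dump(candidate, current_path)
        return auc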

Feedback and Continuous Improvement

  • Automated checks

  • Soliciting qualitative feedback from those using the ML results

  • Benefits of modular workflows

  • Horse-racing models: Tandem model deployment with Airflow
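
Since each model is just another task, horse-racing a challenger against the incumbent is mostly a matter of wiring up parallel branches. The DAG below is a sketch with placeholder callables; the actual comparison logic would live in compare_and_promote:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

    def train_gbm(**context): ...             # placeholder: train candidate model A
    def train_logreg(**context): ...          # placeholder: train candidate model B
    def compare_and_promote(**context): ...   # placeholder: promote whichever wins on the holdout

    with DAG("fraud_model_horse_race",
             start_date=datetime(2017, 1, 1),
             schedule_interval="@weekly") as dag:

        gbm = PythonOperator(task_id="train_gbm", python_callable=train_gbm)
        logreg = PythonOperator(task_id="train_logreg", python_callable=train_logreg)
        winner = PythonOperator(task_id="compare_and_promote",
                                python_callable=compare_and_promote)

        # The two candidates run in parallel; the comparison waits for both.
        gbm >> winner
        logreg >> winner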

Summary

  • Airflow complements data scientists by:

-- Simplifying complex ML workflows

-- Providing "free" SWE expertise with little implementation cost

-- Modularizing the process: allowing for faster model upgrades

-- Providing a UI for easy debugging in production

  • The "best" option? Not always...but a good candidate for Pythonistas

  • Please reach out with additional questions!