
Visual Pipelines for Text Analysis

Benjamin Bengfort, District Data Labs

Audience level: Intermediate
Topic area: Modeling


Employing machine learning in practice is half search, half expertise, and half blind luck. In this talk, we will explore how to make the luck half less blind by using visual pipelines to steer model selection from raw input to operational prediction. We will look specifically at extending transformer pipelines with visualizers for sentiment analysis and topic modeling of text corpora.



As machine learning techniques have become increasingly important in applications and human decision making, the tools to train models have become increasingly simple to use at ever higher levels of abstraction. The good news is that it now takes relatively few lines of code to employ Scikit-Learn or TensorFlow across a variety of model families and forms; the bad news is that there are so many options that it is difficult to evaluate the relative effectiveness of one model over another. And although model selection has been somewhat automated through standardized APIs, search, or even GUI-based applications, most practitioners will tell you that human intuition, domain expertise, and guidance are necessary to home in on quality models.

So in a world of pipelines, grid search, and representational mysticism, how can you effectively interpret fitted models or achieve better performance? Can we do better than cross-validation and a metric to select our models? In this talk, we propose that with visual pipelines, implemented in a new Python library called Yellowbrick, the model selection process can be steered to reach better-performing and more understandable models more quickly. Just as importantly, pipelines can be used to operationalize models from user input to application-specific predictions and inferences.

Yellowbrick extends the Scikit-Learn API with a new estimator, the Visualizer, an object that learns from data and presents it in a visual form. Yellowbrick visualizers can act as transformers, shedding light on high-dimensional feature space, or as model evaluators, providing visual insight into how specific model families operate to make predictions. The API wraps matplotlib to create publication-ready figures and integrates seamlessly with the machine learning workflow.
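To make the idea of an estimator that "learns from data and presents it in a visual form" concrete, here is a minimal sketch of that contract. It is not Yellowbrick's actual API: the class name `TokenFrequencyVisualizer`, its `top_n` and `ax` parameters, and the `draw()` method are hypothetical names chosen for illustration, and the only assumptions are the Scikit-Learn `fit`/`transform` convention and a matplotlib axes to draw on.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


class TokenFrequencyVisualizer:
    """Hypothetical visualizer (not Yellowbrick's API): learns token counts
    in fit() and renders them in draw()."""

    def __init__(self, top_n=5, ax=None):
        self.top_n = top_n
        self.ax = ax

    def fit(self, X, y=None):
        # "Learning" here is just counting lowercased, whitespace-separated tokens.
        self.counts_ = Counter(tok for doc in X for tok in doc.lower().split())
        return self

    def transform(self, X):
        # Pass the data through untouched so the object could sit in a pipeline.
        return X

    def draw(self):
        # Render the learned state as a horizontal bar chart on a matplotlib axes.
        if self.ax is None:
            _, self.ax = plt.subplots()
        tokens, freqs = zip(*self.counts_.most_common(self.top_n))
        self.ax.barh(tokens, freqs)
        self.ax.invert_yaxis()  # most frequent token at the top
        self.ax.set_xlabel("frequency")
        return self.ax


docs = ["the movie was great", "the plot was thin", "great cast great score"]
viz = TokenFrequencyVisualizer(top_n=3).fit(docs)
ax = viz.draw()
```

Calling `ax.figure.savefig(...)` afterwards would write the figure to disk. Keeping the learning step in `fit()` and the rendering step separate is one possible design; it lets the same object be fitted inside a pipeline and drawn later.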

In this talk, we'll demonstrate how to create visual pipelines for two common machine learning applications on text: sentiment analysis and topic modeling. We will show how transformer pipelines that go from raw input to prediction can be extended with model-specific visualization to improve text classification and clustering analysis and to promote the interpretability of results. Our exploration will also allow us to understand model complexity and the bias/variance trade-off more clearly, and to detect model decay over time. Finally, we will present some text-specific visualizers and explore how to create our own model-specific visualizations.
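As a rough sketch of a transformer pipeline that goes from raw text to a sentiment prediction, the example below chains a Scikit-Learn `CountVectorizer`, a pass-through inspection step, and a `MultinomialNB` classifier. The `TermTotals` step, the step names, and the six-document toy corpus are assumptions made up for illustration; the step only records the values a visualizer might plot and is not a Yellowbrick class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class TermTotals(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step: records per-term totals that a
    visualizer could later plot, then hands the matrix on unchanged."""

    def fit(self, X, y=None):
        # X is the document-term matrix produced by the previous step.
        self.totals_ = np.asarray(X.sum(axis=0)).ravel()
        return self

    def transform(self, X):
        return X


# Toy corpus, invented for this sketch.
train_docs = [
    "I loved this film, great acting",
    "great wonderful movie",
    "wonderful and lovely",
    "terrible boring film",
    "awful, boring plot",
    "I hated this terrible movie",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

model = Pipeline([
    ("vectorize", CountVectorizer()),   # raw text -> token counts
    ("inspect", TermTotals()),          # diagnostic step, data unchanged
    ("classify", MultinomialNB()),      # token counts -> sentiment label
])
model.fit(train_docs, train_labels)

predictions = model.predict(["great wonderful", "terrible boring"])

# The fitted inspection step is reachable by name after training.
vocabulary = model.named_steps["vectorize"].vocabulary_
totals = model.named_steps["inspect"].totals_
```

Because the inspection step returns its input unchanged, it can be added or removed without altering the predictions, which is what makes this kind of step usable both during model selection and in an operational pipeline.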