Agenda

Conference start/end times:

Friday June 23:
12pm-6pm
Saturday June 24:
10am-6pm
Sunday June 25:
10am-3pm

BEEF: Balanced English Explanation of Forecasts abstract


V.S. Subrahmanian, University of Maryland
Audience level: Novice
Topic area: Misc

We describe BEEF, a computational framework that explains, in easily understood plain English, the evidence for and against a forecast made by a binary classifier, irrespective of the underlying classification engine. We will also provide a brief demonstration of BEEF on a set of diverse classification tasks using a set of diverse classifiers.


Scalable Document Classification abstract


Malek Ben Salem
Audience level: Intermediate
Topic area: Modeling

We present a novel approach to predict the confidentiality/sensitivity level of an organization’s documents based on their contents. Identifying sensitive information is critical to reduce information risk. We use Natural Language Processing and Machine Learning and show that we can accurately predict the confidentiality level of a document for 93% of the documents in our first use case.


Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP abstract


Rebecca Bilbro, Bytecubed & District Data Labs
Audience level: Intermediate
Topic area: Modeling

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.

SLIDES: https://speakerdeck.com/dataintelligence/building-a-gigaword-corpus-data-ingestion-management-and-processing-for-nlp


Transforming Legacy Code to Leverage Spark abstract


Rachita Chandra
Audience level: Intermediate
Topic area: Modeling

In this session we cover how we selected and prioritized the components of a solution to leverage Spark and the issues we faced while developing and testing the transformed solution with Spark code. The solution is an end-to-end multi-tenant enterprise application comprising several components: data transformations, quality checks, analytics, ML functions and visualization.


Open Geospatial Machine Learning abstract


Kevin Stofan, DataRobot
Audience level: Intermediate
Topic area: Modeling

We will guide attendees through the entire geospatial machine learning workflow. Attendees will be exposed to a variety of open source tools used to process, model, and visualize geospatial data. This workshop will focus on concepts unique to handling geospatial data such as spatial autocorrelation, lagged spatial features, and spatial partitioning.


Parsimonious Pythonic Pipelines with Provenance abstract


Ben Mabey, Recursion Pharmaceuticals
Audience level: Intermediate
Topic area: Misc

Productionalizing an ML model doesn't need to be an exercise in learning a complex workflow system. Instead, by decorating your functions with the provenance library you can quickly set up a pipeline with serialization and provenance (lineage) tracking. Using the same system you can also share models and features to facilitate team collaboration in a research setting. Learn how in this talk!


Predicting Internet Attention abstract


Andrew Montalenti, Corporate - http://parse.ly
Audience level: Novice
Topic area: Case Study

In aggregate, billions of individuals click, read, watch, and share hundreds of millions of pieces of unique content every day. A widely-used content/audience analytics platform has a unique dataset in this area. In this talk we will ask and answer this question: can studying Internet traffic today help to predict web-wide attention -- and thus real-world events -- tomorrow?

SLIDES: https://speakerdeck.com/amontalenti/insights-from-parse-dot-ly


When a Picture is Worth a Thousand Network Packets and System Logs abstract


Awalin Sopan, FireEye Inc
Audience level: Intermediate
Topic area: Misc

A typical Security Operation Center (SOC) employs security analysts who monitor security logs from heterogeneous devices. By analyzing that data, the analysts identify whether there is a security threat and how to respond to it. Visualizing this large-scale data in a succinct, human-digestible form can reduce their cognitive load and enable them to operate more efficiently.

SLIDES: https://speakerdeck.com/dataintelligence/when-a-picture-is-worth-a-thousand-network-packets-and-system-logs


Pitfalls of Text Mining abstract


Dalila Benachenhou, Femvestor, Inc.
Audience level: Intermediate
Topic area: Case Study

Text, whether in NLP, information extraction, or supervised and unsupervised learning, poses challenges for researchers that are nonexistent with structured data. Here, we present 4 scenarios and unique approaches developed to deal with them.


Visual Pipelines for Text Analysis abstract


Benjamin Bengfort, District Data Labs
Audience level: Intermediate
Topic area: Modeling

Employing machine learning in practice is half search, half expertise, and half blind luck. In this talk we will explore how to make the luck half less blind by using visual pipelines to steer model selection from raw input to operational prediction. We will look specifically at extending transformer pipelines with visualizers for sentiment analysis and topic modeling text corpora.

SLIDES: https://speakerdeck.com/dataintelligence/visual-pipelines-for-text-analysis


Airflow + Scikit-Learn: A Hacker's Guide to Deploying Machine Learning Models abstract


Zachary Beaver, Alphabet Inc. / Nest Labs
Audience level: Intermediate
Topic area: Case Study

Data scientists often hit roadblocks when "productionizing" their machine learning models. This talk is about making that "last mile" of analysis easier by leveraging a popular workflow tool called Airflow. We'll walk through how Nest uses it to build and deploy machine learning models for fraud detection and then discuss more generally the unique benefits it provides to Pythonic data scientists.


Untangling Data Ownership, Provenance, and Privacy abstract


Van Lindberg, Dykema Cox Smith
Audience level: Intermediate
Topic area: Misc

We are firmly in the world of "big data," where more data is almost always considered better. But sometimes our instincts as programmers or data scientists run afoul of laws that decree that too much data, or data from the wrong source, is illegal. This talk is an exploration of three legal situations - ownership, provenance, and privacy - where the law restricts which data we can use.


Seeking Exotics: A Story of Visualization and Model Based Anomaly Detection abstract


Francois Dion
Audience level: Novice
Topic area: Misc

Medicare payments, UPC code descriptions, fertility rate and fires. All of it is data, some of which is erroneous and some of which is anomalous. Seeking Exotics introduces the audience to the world of outliers and anomaly detection through the use of metrics, visualizations and open source machine learning tools.


How an Open Analytics Ecosystem Became a Lifesaver abstract


Doug Liming
Audience level: Novice
Topic area: Case Study

Given the diverse talent and skill sets of today’s Data Scientist, it is time for an analytic platform where you should not have to choose a single approach. To be viable in the open ecosystem of today’s economy and analytics, methods have to be open and integrated. You should not have to choose between analytics languages like Python, R, or SAS. Find out how you can literally have it all.


Find the Farm (Data Science Insights into Real Estate Pricing) abstract


En Zyme, Ad Hoc and Nimble, a consultancy
Audience level: Novice
Topic area: Case Study

Real estate transactions are geographically and temporally sparse. There is often both a listing and a selling agent. Pricing models typically rely on physical parameters; there has been little work done in assessing the contribution of the realtor. A realtor 'farm' may be discoverable by cluster identification, and analyzed for negotiation strength in listing and sales prices.


Machine Learning Melee: AWS vs Azure abstract


Frank La Vigne
Audience level: Intermediate
Topic area: Misc

Cloud service providers are vying for your attention in the Machine Learning space. Both Amazon and Microsoft are feverishly working on creating compelling solutions for developers to build intelligent solutions upon. Is one better than the other? What are each one’s strengths and weaknesses?

SLIDES: http://www.franksworld.com/2017/06/26/machine-learning-melee-aws-vs-azure/


Modeling Behavioral Patterns of Consumers Seeking Legal Services abstract


Michael Terry, Lawfty
Audience level: Intermediate
Topic area: Case Study

We present a case study on how our data team models and optimizes the behavioral patterns of consumers seeking legal services. By linking together the regional offline & online consumer data into a common model, we are able to optimize digital advertising investment, and take advantage of time-of-day and location-based opportunities, in arguably the most competitive Google Adwords vertical.


Privacy Techniques for Data Science in Regulated Environments abstract


Jim Klucar
Audience level: Novice
Topic area: Misc

Demand is increasing for technology companies to safeguard individual data. This presentation examines data privacy regulations currently in place, and teaches data privacy algorithms such as K-anonymization, Randomized Response, and Differential Privacy. Moreover, it will cover Differentially Private machine learning algorithms and the impact data privacy has on Machine Learning performance.
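As a rough illustration of one technique named in this abstract, Randomized Response can be sketched in a few lines. This is not taken from the talk; it assumes the classic fair-coin variant, and the function names are hypothetical.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report a sensitive yes/no answer with plausible deniability.

    Assumed fair-coin variant: with probability 0.5 the respondent
    answers truthfully, otherwise they report a second coin flip.
    """
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(responses) -> float:
    """Recover the population rate from the noisy responses.

    Under the scheme above, P(yes) = 0.5 * p + 0.25, so p = 2 * P(yes) - 0.5.
    """
    observed_yes = sum(responses) / len(responses)
    return 2.0 * observed_yes - 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic population in which 30% truly hold the sensitive attribute.
    population = [rng.random() < 0.3 for _ in range(100_000)]
    noisy = [randomized_response(answer, rng) for answer in population]
    print(f"estimated rate: {estimate_true_rate(noisy):.3f}")
```

No single response reveals the respondent's true answer, yet the aggregate rate can still be estimated; the price is higher variance than asking directly.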


Matching addresses, it's surprisingly difficult. abstract


Evan Richards
Audience level: Intermediate
Topic area: Misc

In this talk, you'll learn the weirdest edge cases in the United States addressing system: the hierarchy between city and state, the sublime beauty behind the zipcode, and the constituent parts of an address.

We'll cover how to compare addresses in a way that gives you an F-score you'll be proud of.
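For readers unfamiliar with the F-score mentioned above, here is a minimal sketch of how it could be computed for an address matcher. It assumes matches are represented as pairs of record IDs; that representation and the function name are illustrative, not the speaker's method.

```python
def f1_score(predicted_matches, true_matches) -> float:
    """F1 = harmonic mean of precision and recall over matched pairs.

    Both arguments are iterables of (record_id_a, record_id_b) pairs;
    this pair representation is an assumption made for the example.
    """
    predicted = set(predicted_matches)
    actual = set(true_matches)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predicted pairs that are right
    recall = true_positives / len(actual)        # share of real pairs that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [(1, 2), (3, 4), (5, 6)]
    actual = [(1, 2), (3, 4), (7, 8), (9, 10)]
    print(f"F1 = {f1_score(predicted, actual):.3f}")
```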


Pomegranate: Fast and Flexible Probabilistic Modeling in Python abstract


Jacob Schreiber, Paul G. Allen School of Computer Science, University of Washington
Audience level: Intermediate
Topic area: Modeling

We will describe the python package pomegranate, which implements flexible probabilistic modeling. We will highlight several supported models including mixtures, hidden Markov models, and Bayesian networks. At each step we will show how the supported flexibility allows for complex models to be easily constructed. We will also demonstrate the parallel and out-of-core APIs.

SLIDES: https://github.com/jmschrei/pomegranate/blob/master/slides/pomegranate%20data%20intelligence%202017.pdf


Is a Number Worth a Thousand Words? abstract


Anne-Marie Currie, Advisory Board
Audience level: Novice
Topic area: Misc

Is a number worth a thousand words?

Inspiration for this talk comes from Net Promoter Survey data and the drive to create a better Net Promoter System for a specific business context. This talk will focus on the Net Promoter Scores and examine a few ways to enhance the insights gained by building a context-rich Net Promoter System.
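As background, the standard Net Promoter Score is the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6) on a 0-10 "would you recommend" question. A minimal sketch, with a hypothetical function name, not drawn from the talk:

```python
def net_promoter_score(ratings) -> float:
    """Compute NPS from 0-10 survey ratings.

    Promoters score 9-10, detractors 0-6, passives (7-8) count only
    toward the total. The result ranges from -100 to 100.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


if __name__ == "__main__":
    print(net_promoter_score([10, 10, 9, 3]))  # 3 promoters, 1 detractor out of 4
```

The single number hides the distribution behind it: very different mixes of ratings can yield the same score, which is one reason added context can help.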


Building Data Capacity TOGETHER!! abstract


Natalie Evans Harris
Audience level: Novice
Topic area: Misc

Data has evolved beyond reporting to underpinning technology that guides policy decisions and transforms service delivery to citizens. While the power of data is no longer questioned, we continue to struggle with the lack of capacity to use data to support missions. A collaborative effort to build open sourced data infrastructure is imperative to support the responsible use of data.


Identifying Language Communities for Security and Performance abstract


David Guy Brizan, University of San Francisco / CUNY Graduate Center
Audience level: Intermediate
Topic area: Misc

.


Hunting with Data Science - Increasing the Signal-to-Noise Ratio abstract


Austin Taylor
Audience level: Intermediate
Topic area: ETL

After anomalous network traffic has been identified, there can still be an abundance of results for an analyst to process. This talk is for Data Scientists and Network Security professionals who want to increase the signal-to-noise ratio through feature extraction and post-processing output.

SLIDES: https://www.slideshare.net/AustinTaylor8/threat-hunting-with-data-science


Exploring the Data Science Process abstract


Vishal Patel
Audience level: Novice
Topic area: Misc

The entire data science process can be organized into multiple steps/phases, and it is helpful to establish a standardized workflow for team members to collaborate effectively and generate valuable results. In this presentation, we will provide a detailed walk-through of seven phases of the data science process.

SLIDES: https://www.slideshare.net/VishalPatel321/exploring-the-data-science-process


Learning to Learn Model Behavior: Explain Predictions While Being Model Agnostic abstract


Pramit Choudhary, DataScience.com
Audience level: Intermediate
Topic area: Modeling

After the model-building process, we often have a black box which can be used for prediction.

The usefulness of the model could still be questionable unless one understands the true behavior of the algorithm. As machine learning models are actively being adopted to solve real-world problems, one needs to look beyond just wins and losses. One needs more detailed information about the model's behavior.


Software Complexity Modeling abstract


Thuc Tran, The George Washington University
Audience level: Intermediate
Topic area: Misc

There is currently no comprehensive software complexity methodology that takes into consideration the different dimensions of software applications, which allows software applications to grow unnecessarily complex as they mature. To assess the complexity of software, we strive to develop a model that considers different types of dimensions as foundational features.


Clustering the Linearly Separable & Inseparable Datasets. abstract


Harish Krishnamurthy
Audience level: Intermediate
Topic area: Modeling

We shall study linearly separable and inseparable datasets. We shall then apply various clustering algorithms to these datasets. This is a hands-on workshop where the attendees will be using our online learning platform, refactored.ai, to execute code on their laptops.


Playing Detective with CNNs (Convolutional Neural Networks) abstract


Sanjana Ramprasad
Audience level: Intermediate
Topic area: Modeling

We aim to differentiate between two handwriting samples by modeling and learning between-writer variation and within-writer variation. To achieve this, we use a dataset of handwriting samples that is representative of the US population. We trained a Convolutional Neural Network (CNN) and tuned it by experimenting with several architectures and techniques.


Systematic approach for machine learning methods design based on potential theory abstract


Nadia Udler
Audience level: Intermediate
Topic area: Modeling

With the increase in computing power, machine learning methods have become the method of choice for solving many real-world problems where analytical approximations would previously have been more appropriate in terms of speed. This creates a great demand for machine learning software that is easy to use. We present an approach for constructing such methods in a systematic way. We demonstrate several tutorials that help in understanding the essential building blocks and parameters of machine learning methods.


Artificial Intelligence: Methods that make human-level intelligence possible abstract


Sargur Srihari, University at Buffalo, The State University of New York
Audience level: Novice
Topic area: Modeling

This talk will discuss the current generation of AI methods, and how they differ from previous generation methods. In particular I will discuss algorithms which are based on discriminative/generative models and computational architectures consisting of a hierarchy of concepts known as deep learning.


Ideas for Interpreting Machine Learning abstract


Patrick Hall, H2O.ai
Audience level: Intermediate
Topic area: Modeling

Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. Practitioners, researchers, and consumers that use these technologies in their work and their day-to-day lives have the right to trust and understand AI. This talk is an overview of techniques for interpreting deep learning and machine learning models and telling stories from their results.


Changing the World with Data - with Combatting Human Trafficking as an Example. abstract


Eric Schles, NYU
Audience level: Intermediate
Topic area: Modeling

My talk is about how to set up and support national and international networks that work on a specific issue. I'll pick specific examples from my work in the anti-trafficking space and how I worked nationally and internationally. I'll choose specific tools I built and the problems they solved within the anti-trafficking space.

SLIDES: https://github.com/EricSchles/data_intelligence_conf


Which Model Came Hot and Fresh Out the Kitchen in our Malware Classifier Bakeoff? abstract


Phil Roth, Endgame
Audience level: Experienced
Topic area: Modeling

There is no single machine learning model that is best for all applications. In the process of building a malware classifier, Endgame used a bakeoff process in order to choose the model best suited for us. We will describe this process, how the results could be improved with further research, and the challenge of using machine learning for malware classification in general.

SLIDES: https://www.slideshare.net/mrphilroth/machine-learning-model-bakeoff


Data Version Control: Tool for Iterative Machine Learning abstract


Dmitry Petrov
Audience level: Intermediate
Topic area: Modeling

Data version control, or DVC, is a new open source tool designed to help data scientists keep track of their ML processes and file dependencies through simple git-like commands. This presentation walks you through an iterative process of building a machine learning model with DVC.

SLIDES: https://speakerdeck.com/dataintelligence/data-version-control-tool-for-iterative-machine-learning


Machine Learning & Election Campaigns abstract


Ria Baldevia, Booz Allen Hamilton
Audience level: Novice
Topic area: Misc

South Korean company Fount AI's introduction of a political info chat bot, Rose, on KakaoTalk messenger indicates machine learning may play a significant role in future elections.

SLIDES: https://speakerdeck.com/dataintelligence/machine-learning-and-election-campaigns


ENCASE abstract


Antonia Gogoglou, Aristotle University of Thessaloniki, SignalGeneriX Ltd Cyprus
Audience level: Intermediate
Topic area: Modeling

The ENCASE project aims to leverage the latest advances in web security and privacy to design and implement a browser-based architecture for the protection of minors from malicious actors in online social networks, by exploiting sentiment and affective analysis along with graph mining.

SLIDES: https://speakerdeck.com/dataintelligence/encase


Artificial Intelligence Enables Precision Medicine abstract


Mohammed Eslami, Netrias, LLC
Audience level: Intermediate
Topic area: Case Study

Current bio-informatics tools do not capitalize on the great advancements made in Machine Learning (ML) that can enable them to generate more, and more rapid, breakthroughs. Big Data technologies can facilitate the complete integration of heterogeneous sets of experimental data to identify key metabolic pathways and drug targets to enable precision medicine.


Real-time Meatspace Data Science abstract


Jason Walsh, Penn Medicine
Audience level: Intermediate
Topic area: Streaming

Penn Signals is an award-winning (https://goo.gl/MHqwVv) microservices software platform for processing real-time clinical data from a variety of systems. This talk demonstrates how the data science team at Penn Medicine has combined open source technologies that allow data scientists and researchers to create and use predictive applications to support improvements in health care.

SLIDES: https://github.com/pennsignals/data-intelligence