Topic area: Misc
Medicare payments, UPC code descriptions, fertility rates and fires: all of it is data, some of it erroneous and some of it anomalous. Seeking Exotics introduces the audience to the world of outliers and anomaly detection through metrics, visualizations and open source machine learning tools.
In 1777, Daniel Bernoulli asked in his paper on the most likely induction (maximum likelihood): "Is it right to hold that the several observations are of the same weight or moment, or equally prone to any and every error?" Ever since, people have struggled with what to do about erroneous and anomalous data. Finding such data looks like a simple clustering problem, and labeling it looks like a simple classification problem, but treating it that way oversimplifies the task.
The "Seeking Exotics" talk will start with a light introduction to the world of outliers and anomaly detection. For more history and background, the audience is invited to listen to episode 2 of the podcast "Something for Your Mind" before the talk.
Using several data sets from different fields, such as Medicare payments, UPC bar code data and fertility rates, the talk will demonstrate several visualization techniques, ranging from static box and stemgraphic plots to interactive mpld3 scatter plots. These will be combined with dimensionality reduction and clustering techniques (beyond PCA) to derive more insight from the data. Finally, one-class classifiers (such as Isolation Forest) from the easy-to-use scikit-learn will do some of the heavy lifting, with results ranging from sobering to surprising.
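As a taste of the last step, here is a minimal sketch of outlier detection with scikit-learn's IsolationForest. It is not taken from the talk's notebooks: the data is synthetic, and the planted outliers, the `contamination` value and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption, not one of the talk's data sets):
# 200 points around the origin plus 3 planted, obviously distant points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([inliers, planted])

# contamination is the expected share of outliers; 0.03 is a guess here.
model = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
labels = model.fit_predict(X)        # +1 = inlier, -1 = outlier
scores = model.decision_function(X)  # lower = more anomalous

flagged = np.flatnonzero(labels == -1)
print("flagged row indices:", flagged)
print("lowest scores:", np.sort(scores)[:5])
```

Because `contamination` fixes the fraction of points that get flagged, some ordinary points near the edge of the cloud will usually be labeled as outliers alongside the planted ones. Inspecting the scores rather than only the labels is one way to keep that in view.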
Each data set will be covered in a separate Jupyter notebook. At the end of each notebook, time will be set aside for questions.