Topic area: Misc
We are firmly in the world of "big data," where more data is almost always considered better. But sometimes our instincts as programmers or data scientists run afoul of laws that decree that too much data, or data from the wrong source, is illegal. This talk is an exploration of three legal situations - ownership, provenance, and privacy - where the law restricts which data we can use.
We live in the world of "big data," where more data is almost always considered better. Data is usually seen as a raw material - the stuff from which models are built and decisions are made. We may not know, see, or even care where the underlying data comes from. We just want to answer our questions and make our recommendations. But sometimes our instincts as programmers or data scientists run afoul of laws that decree that too much data, or data from the wrong source, is illegal. This talk is an exploration of three legal situations - ownership, provenance, and privacy - where the law restricts which data we can use.
We sometimes think of data as just "facts," existing outside of any sort of ownership or legal structure. But the law doesn't always see it that way. Sometimes observations are owned by those who observe and sometimes by those who are observed. There are certain types of data, such as market-moving information, that may be legal or illegal to use depending on how you learned it. What are the laws that govern data ownership in the US and EU, and how does that affect what sorts of agreements and protections we need in place even before we can start asking questions?
Once we have established ownership, there is the difficulty of proving where we learned certain pieces of information, and of tracking that metadata through a processing pipeline. We also need to build controls into certain sorts of data applications, because some types of information are legal to use when kept separate but may be illegal when brought together under certain circumstances.
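To make the provenance problem concrete, here is a minimal, hypothetical sketch (not from the talk itself) of one way a pipeline might carry source metadata alongside each value, so that combining data from different sources leaves an auditable trail. All names here (`Tracked`, `combine`, the example source labels) are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: wrap each value with its provenance so the
# source information survives every step of a processing pipeline.
@dataclass(frozen=True)
class Tracked:
    value: float
    sources: frozenset = field(default_factory=frozenset)

    def map(self, fn):
        # A derived value inherits the sources of its input.
        return Tracked(fn(self.value), self.sources)

def combine(a: Tracked, b: Tracked, fn) -> Tracked:
    # Combining two tracked values merges their provenance sets.
    # This merge point is where a "legal apart, illegal together"
    # policy check could be hooked in before the join is allowed.
    return Tracked(fn(a.value, b.value), a.sources | b.sources)

price = Tracked(100.0, frozenset({"public-filing"}))
multiplier = Tracked(1.05, frozenset({"third-party-feed"}))

estimate = combine(price, multiplier, lambda p, m: p * m)
print(estimate.value)             # 105.0
print(sorted(estimate.sources))   # ['public-filing', 'third-party-feed']
```

The point of the sketch is that provenance must be propagated by the data structures themselves; if it lives only in documentation, it is lost after the first transformation.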
A key driver for a lot of our data law is the privacy of certain types of information about people. This goes by different names, such as "PII" (Personally Identifiable Information) or "PHI" (Protected Health Information), but the concept is the same: some types of data are considered so private or so prone to misuse that they cannot be collected or, if they are collected in the course of business, are protected from disclosure and certain types of use. The problem is that the boundary of what is considered private changes all the time, based both on the law and on our growing ability to de-anonymize datasets. What is considered private in the US and in Europe, and how is that changing? What does protecting privacy mean for a data pipeline?
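As one concrete illustration of what "protecting privacy in a pipeline" can look like, here is a small, hypothetical sketch of pseudonymizing a direct identifier before it enters downstream processing. The field names and key handling are assumptions for illustration, and, as the talk's framing suggests, this is not full anonymization: quasi-identifiers left in the record can still enable re-identification.

```python
import hashlib
import hmac

# Assumed: a secret key managed and rotated outside this snippet.
# A keyed hash (HMAC) resists simple dictionary re-identification,
# unlike a bare unsalted hash of the identifier.
SECRET_KEY = b"example-key-managed-elsewhere"

def pseudonymize(identifier: str) -> str:
    # Deterministic, so the same person maps to the same token,
    # which preserves joins while hiding the raw identifier.
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

record = {"patient_id": "123-45-6789", "zip": "02139", "diagnosis": "flu"}
record["patient_id"] = pseudonymize(record["patient_id"])

# The raw identifier is gone, but "zip" and "diagnosis" remain
# quasi-identifiers that a de-anonymization attack could exploit.
print(record["patient_id"][:12], record["zip"])
```

Deterministic pseudonyms are a deliberate trade-off: they keep datasets joinable, but that same linkability is exactly what shifting legal definitions of "private" keep re-examining.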