
Scalable Document Classification

Malek Ben Salem

Audience level: Intermediate
Topic area: Modeling


We present a novel approach to predicting the confidentiality (sensitivity) level of an organization’s documents from their contents. Identifying sensitive information is critical to reducing information risk. Using Natural Language Processing and Machine Learning, we show that we can correctly predict the confidentiality level of 93% of the documents in our first use case.


Organizations use documents to communicate, perform business transactions, collaborate and innovate. These documents, which include e-mails, project reports, proposals, contracts, and design drafts, may carry confidential information and intellectual property. They have to be protected from unauthorized access, exfiltration or loss, but they need not be protected at the same level given that their contents are not equally sensitive. So, identifying and properly labeling sensitive documents is important.

The current classification process is manual: document creators label each document according to their organization’s classification taxonomy when the document is created or uploaded to a file share. The taxonomy varies by organization, but generally has four levels of confidentiality (Public/Unrestricted, Internal Use, Restricted, and Highly Confidential).

The impact of data disclosure or breach varies by confidentiality level, and so does the level of protection required for that data. Various security controls, such as access controls, encryption, Data Loss Prevention deployments, and Enterprise Data Rights Management, can be deployed to minimize the risk of losing or leaking this information. These controls are not effective unless the sensitive or confidential information is properly identified.

Manual classification, however, is not accurate. Employees seem to lack the training or the discipline to label documents appropriately, which raises an organization’s information risk level. Worse, malicious users may intentionally label sensitive documents as non-sensitive in order to exfiltrate data without being detected. In summary, manual classification is often unreliable and error-prone.

We present a tool (and an approach) that automatically classifies business documents using Natural Language Processing and Machine Learning techniques, avoiding the misclassification errors introduced by manual labeling. We also share a use case in which we applied this tool to a client’s dataset of tens of thousands of documents, along with the classification results achieved. We show that the classification approach is accurate and scalable.
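
As a rough illustration of what content-based classification into the four confidentiality levels can look like, here is a minimal sketch in Python. The abstract does not say which features or model the tool uses, so the bag-of-words tokenizer and the multinomial Naive Bayes classifier below are stand-ins chosen for brevity, not the method presented in the talk. The class name, the function names, and the tiny training set are all hypothetical and invented for this example.

```python
import math
import re
from collections import Counter, defaultdict

# The four levels named in the abstract, from least to most sensitive.
LEVELS = ["Public/Unrestricted", "Internal Use", "Restricted", "Highly Confidential"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (an assumed stand-in model)."""

    def fit(self, documents, labels):
        self.doc_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            if label not in LEVELS:
                raise ValueError(f"unknown confidentiality level: {label}")
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = sum(self.doc_counts.values())
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so they are dropped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples; a real deployment would train on labeled documents.
training_data = [
    ("press release announcing our new public website", "Public/Unrestricted"),
    ("public newsletter and marketing brochure", "Public/Unrestricted"),
    ("internal meeting notes and team schedule", "Internal Use"),
    ("internal memo about office schedule", "Internal Use"),
    ("client contract pricing terms", "Restricted"),
    ("restricted proposal with client pricing", "Restricted"),
    ("unreleased design draft trade secret source", "Highly Confidential"),
    ("merger plan and trade secret formula", "Highly Confidential"),
]
texts, labels = zip(*training_data)
model = NaiveBayesClassifier().fit(texts, labels)

print(model.predict("draft merger plan"))
```

A predicted label like this could also be compared against the label a document creator assigned by hand, which is one way the manual labeling errors described above might be surfaced.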