Using Trainable Classifiers to Assign Office 365 Retention Labels
The Challenge of Retention Processing
Retention labels control how long items remain in an Office 365 workload and what happens once the retention period expires. Labels can be assigned manually, but the success of manual labeling depends on users understanding how to make the best choice from the available retention labels. Sometimes the choice is clear, as in a document which obviously contains information that should be kept, and sometimes it’s not.
Auto-label policies try to solve the problem by looking for documents and messages which match patterns. For example, if a document holds four instances of a credit card number, it should be assigned the Financial Data label. On the other hand, if a document holds personal information like a social security number, it should get the PII Data label.
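To make the pattern-matching idea concrete, here is a minimal sketch of the kind of rule described above. It is an illustration only and not how Microsoft implements sensitive data types: the regular expressions, the `choose_label` function, and the Luhn checksum filter are all assumptions made for the example. The four-instance threshold and the two label names come from the text.

```python
import re
from typing import Optional

# Hypothetical pattern: 16 digits in groups of four, optionally separated by a space or hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")
# Hypothetical pattern: a US social security number written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum used by payment cards."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def choose_label(text: str, min_card_hits: int = 4) -> Optional[str]:
    """Pick a retention label name for the text, or None if no rule matches."""
    cards = [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
    if len(cards) >= min_card_hits:
        return "Financial Data"
    if SSN_PATTERN.search(text):
        return "PII Data"
    return None
```

The checksum filter is there to cut down on false positives from arbitrary 16-digit strings. A rule like this only works when the content has a recognizable shape, which is the limitation the rest of this article deals with.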
Auto-label policies work well when items hold content that is identifiable by matching against the 100-plus sensitive data types defined by Microsoft or a keyword search for a specific phrase (like “project Contoso”). They are especially valuable when organizations have large numbers of existing documents to be labeled. Computers are better at repetitive tasks than humans, and it makes sense to deploy intelligent technology to find and label documents at scale.
That is, if you can be sure that the documents you want to label can be accurately located. Sensitive data types and keyword searches do work, but there’s always likely to be some form of highly-specific information in an organization that searching by data type or keyword doesn’t quite suit. Using a trainable classifier might help in these situations.
Standard Classifiers and Licensing
A trainable classifier is a digital map of a type of document (Office 365 has supported digital fingerprints extracted from template documents for several years). The classifier is trainable because it learns by observing samples of the documents you want to process plus some examples of non-matching items until the predictions made by the classifier are accurate enough for it to be used.
Microsoft has a set of classifiers for use in compliance features, like the Profanity or Threat classifiers used in communication compliance policies. As the names suggest, these classifiers identify items containing profane or threatening text. Microsoft created the classifiers by training them with large numbers of text examples for the classifiers to learn the essential signs of what might constitute profane or threatening language.
A preview allowing tenants to create and use trainable classifiers in Office 365 is available in the data classification section of the Microsoft 365 compliance center. Like all auto-label functionality, when trainable classifiers are generally available, they’ll need Office 365 or Microsoft 365 E5 compliance licenses.
Creating a Trainable Classifier
To create a trainable classifier, you’ll need at least 250 samples of the type of document you eventually want the classifier to locate (more is better). The documents can’t be encrypted, must be in English (for now), and must be stored in SharePoint Online folders that only hold items to be used for training.
To test things out, I created a classifier for Customer Invoices using ten years’ worth of the Excel worksheets I use to generate invoices. The steps I took were:
- Create a folder in a SharePoint Online site and copy customer invoices to the folder. The training model is built from these documents.
- Create the new trainable classifier in the Microsoft 365 compliance center by giving the classifier a name and telling it the folder holding the seed documents.
- Wait for the seed documents to be processed to create the training model (this can take 24 to 48 hours). After indexing the folder, the new classifier will examine the seed documents to understand their characteristics. In my case, what makes an invoice? For instance, the classifier will learn that invoices have a customer name, the name of my company, a date, some lines of billing information, and instructions on how to pay. Although the seed documents contain different information, the essential structure of the documents is the same, and this is what helps the classifier learn how to recognize future documents of the same type.
- Go through a review process (batches of 30 items) to check the predictions made by the classifier. A human review tells the classifier when it is right or wrong (Figure 1). The training model is updated after you complete a batch and applied to the next batch of reviews.
Publish the Classifier
As testing proceeds, the accuracy of the classifier should improve as it processes more seed documents. Eventually the accuracy will get good enough (Figure 2) and you’ll be able to publish the classifier to make it available to auto-label policies. Microsoft says that they have seen successful classifiers at 88% accuracy, and provided that the classifier is stable and predictable at that point, it’s good to go. It’s important that you don’t rush to publish until the classifier is thoroughly trained because you can’t force the classifier to go through extra training after publication.
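If you keep your own notes during review, the "accurate and stable" test can be expressed as a small calculation. The sketch below is an assumption-laden illustration, not something the compliance center exposes: the `Review` tuple, both function names, and the idea of requiring three consecutive batches above the threshold are mine. Only the 88% figure and the batch size of 30 come from the text.

```python
from typing import List, Tuple

# (classifier predicted a match, human reviewer says it is a match)
Review = Tuple[bool, bool]


def batch_accuracy(batch: List[Review]) -> float:
    """Fraction of items in one review batch where the classifier agreed with the reviewer."""
    if not batch:
        raise ValueError("batch is empty")
    correct = sum(1 for predicted, actual in batch if predicted == actual)
    return correct / len(batch)


def ready_to_publish(
    batches: List[List[Review]],
    threshold: float = 0.88,   # figure quoted by Microsoft in the text
    stable_batches: int = 3,   # assumption: how many recent batches must hold the line
) -> bool:
    """Return True when the most recent batches all meet the accuracy threshold."""
    if len(batches) < stable_batches:
        return False
    recent = batches[-stable_batches:]
    return all(batch_accuracy(batch) >= threshold for batch in recent)
```

For a batch of 30, the classifier needs at least 27 correct predictions to clear 88%, because 26 out of 30 is only about 86.7%. Looking at several recent batches rather than one guards against publishing on the strength of a single lucky batch.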
Two steps remain before you can use the trainable classifier. First, you create a suitable retention label for the classifier. This can be an existing label, but you might want to create a new label for exclusive use with the classifier.
Second, you create an auto-label policy to apply the chosen label when the trainable classifier matches an item. The policy is built from the label, the classifier (Figure 3), and the locations where you want auto-labeling to happen. This can be all SharePoint sites and mailboxes in the tenant or just a selected few. My recommendation is to start with one or two sites and monitor progress until you’re happy to use the classifier everywhere.
Differentiating Between Regular and Group-Connected SharePoint Sites
For some reason, auto-label policies differentiate between “regular” SharePoint sites and those connected to Microsoft 365 groups. Make sure that you select the right category: I spent a week or so wondering why a policy wasn’t working only to discover that it was because I had input the URL of the site belonging to a group (under SharePoint sites) instead of the group name (under Microsoft 365 groups). I don’t understand why Microsoft differentiates regular and group-connected sites.
It’s possible that you might want to apply labels only to documents in a site belonging to a group and not to messages in the group mailbox, but there doesn’t seem to be a good way to do this in the current setup.
Checking Classifier Effectiveness
As noted above, once published you can’t retrain a classifier, but you can check what it’s doing by monitoring items labeled by the auto-label policy. Remember that auto-label policies will not process items that have already been assigned a label.
The simplest test is to examine the retention labels on documents which you expect to be auto-labeled. If those documents carry the expected label, then there’s a reasonable chance that the classifier is working as expected. To confirm this, the activity explorer in the data classification section of the Microsoft 365 compliance center (Figure 4) gives an insight into the application of retention labels and sensitivity labels.
You can also check by looking for audit records in the Office 365 audit log. Records for ComplianceSettingChanged operations are generated when retention labels are applied, but only for SharePoint Online and OneDrive for Business documents.
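Once audit records have been exported, a few lines of code can summarize where labels are landing. The sketch below assumes each record is available as one JSON string (as in the audit data of an exported search result) and that it carries `Operation` and `Workload` properties. The function names are hypothetical, and any further properties you might want to inspect should be checked against real records from your tenant.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def label_events(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield parsed audit records whose operation is ComplianceSettingChanged."""
    for line in lines:
        record = json.loads(line)
        if record.get("Operation") == "ComplianceSettingChanged":
            yield record


def count_by_workload(lines: Iterable[str]) -> Counter:
    """Count retention-label audit events per workload."""
    return Counter(record.get("Workload", "Unknown") for record in label_events(lines))


# Invented sample records, for illustration only.
sample = [
    '{"Operation": "ComplianceSettingChanged", "Workload": "SharePoint"}',
    '{"Operation": "FileAccessed", "Workload": "SharePoint"}',
    '{"Operation": "ComplianceSettingChanged", "Workload": "OneDrive"}',
    '{"Operation": "ComplianceSettingChanged", "Workload": "SharePoint"}',
]
print(count_by_workload(sample))
```

A count that stays at zero for a location you expected to be processed is a hint that the policy is not reaching that location, bearing in mind that these records exist only for SharePoint Online and OneDrive for Business documents.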
Black Box Processing
Checking outputs from a process is a good way of knowing if the process works, but it would be more satisfactory to have greater visibility into aspects of auto-label policies such as:
- When the auto-label policy is processed against the selected locations.
- What documents are classified (and documents that match but are not labeled because they already have a label).
- Any errors which occur.
Ideally, an administrator should be able to view an auto-label policy and see details of recent runs. It would also be good if the administrator could force the policy to run against one or more selected locations, much in the same way that a site owner can force SharePoint Online to reindex a site.
Good Application of Machine Learning
Even though the implementation of trainable classifiers in auto-label policies has some rough edges, I like the general thrust of what Microsoft is trying to do. Being able to build tenant-specific classifiers based on real-life information is goodness. Casting more light into how classifiers work when used in auto-label policies would make these policies so much sweeter.