How Microsoft Uses Machine Learning to Improve Windows 10 Update Experience
Microsoft started using machine learning (ML) to manage the rollout of Windows 10 feature updates with the Windows 10 April 2018 Update (version 1803). In a new blog post by Microsoft’s Archana Ramesh and Michael Stephenson, both data scientists for Microsoft Cloud and AI, the company outlines improvements made since then.
Microsoft has been having a tough time recently with the quality of cumulative updates (CU) and feature updates for Windows 10. While the tech media tends to blow things out of proportion sometimes, I think it’s fair to say that quality has taken a knock since internal testers were dismissed in favor of the Windows Insider Program. Biannual feature updates haven’t been without their issues either. But because of the diversity of the Windows ecosystem, regardless of how much testing is done, there is always the potential for issues when making changes to a complex piece of software like Windows.
But if you are a large or medium-sized organization that manages updates using Windows Server Update Services (WSUS), Microsoft System Center Configuration Manager (SCCM), or another product, you can do your own testing and don’t need to rely on Microsoft’s automated rollout. Smaller organizations can use Windows Update for Business, which is a series of Group Policy and Mobile Device Management (MDM) settings that give somewhat limited control over when CUs and feature updates are installed. Individuals and businesses without IT support rely on Microsoft to determine when feature updates should be installed, although ‘seekers’, i.e. people who go into Windows Update and click Check for updates, might be offered feature updates before they are designated OK for the device. If Microsoft finds any serious problems related to hardware or software, blocks (safeguard holds) are placed on the update for affected devices until the issues have been resolved.
Windows 10 1903 Feature Update Rollout with Machine Learning
Microsoft used ML to evaluate 35 areas of PC health to determine whether devices were ready for the May 2019 Update. It says that PCs designated to be updated by ML have a much better update experience, with half as many system-initiated rollbacks and kernel mode crashes, and five times fewer driver issues. Microsoft designed an ML model that is dynamically trained on the most recent set of PCs that have been updated, and that can tell the difference between PCs having a good and a bad update experience. ML can identify issues that let Microsoft create safeguard holds to prevent updates being installed on PCs that might have a poor update experience. ML also predicts and nominates devices that should have a smooth update experience. As more PCs receive a feature update, the ML model learns from the most recent dataset. And as Microsoft resolves identified issues, PCs that might have had a poor update experience are offered the update.
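To make the flow concrete, here is a minimal Python sketch of the per-device decision described above. It is an illustration only, not Microsoft's implementation: the `Decision` values, the `decide` function, the feature strings, and the 0.9 threshold are all assumptions, and `predict_good` stands in for whatever model produces the prediction.

```python
from enum import Enum


class Decision(Enum):
    HOLD = "safeguard hold"
    OFFER = "offer update"
    WAIT = "wait for more data"


def decide(device_features, active_holds, predict_good, threshold=0.9):
    """Decide what to do with one device (hypothetical logic).

    device_features: set of strings describing hardware, apps, and drivers.
    active_holds: set of features currently blocked by a safeguard hold.
    predict_good: callable returning the estimated probability of a
        good update experience for a feature set (stand-in for the model).
    """
    # A known issue always wins over the model's prediction.
    if device_features & active_holds:
        return Decision.HOLD
    # Nominate the device only when the model is confident enough.
    if predict_good(device_features) >= threshold:
        return Decision.OFFER
    return Decision.WAIT


# Toy stand-in for a trained model (assumption for illustration).
def toy_model(features):
    return 0.95 if "disk:ssd" in features else 0.5


holds = {"driver:acme-audio-1.2"}
blocked = decide({"driver:acme-audio-1.2", "disk:ssd"}, holds, toy_model)
offered = decide({"disk:ssd"}, holds, toy_model)
waiting = decide({"disk:hdd"}, holds, toy_model)
# Once the issue is resolved and the hold is lifted, the same PC is offered the update.
after_fix = decide({"driver:acme-audio-1.2", "disk:ssd"}, set(), toy_model)
```

The last call mirrors the point made above: when a hold is removed, a PC that was previously blocked becomes eligible again without any change to the device itself.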
Training data for the model is collected from the latest set of PCs on the feature update, along with their configuration (hardware, apps, drivers, etc.) at the time the update was installed. A binary label is also assigned to indicate whether the device had a good update experience, based on indicators like whether there was a system-initiated uninstall and the reliability score after the update. Azure Databricks is used to build the model and ensure only high-quality data is used to train it. Anomaly detection is used to identify when a feature, or a combination of two or more features, causes a higher failure rate than is present in the rest of the data. Along with lab tests, feedback, and support calls, this data is used to establish safeguard holds.
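The labeling and anomaly detection steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method Microsoft uses: the `UpdateRecord` fields, the 0.8 reliability threshold, and the "at least twice the failure rate of the rest" rule are all hypothetical, and only single features are checked (pairs of features could be handled the same way by counting combinations).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    """One updated PC (all field names are hypothetical)."""
    features: frozenset   # e.g. {"driver:acme-audio-1.2", "os:base"}
    rolled_back: bool     # a system-initiated uninstall occurred
    reliability: float    # post-update reliability score in [0, 1]


def good_experience(rec, min_reliability=0.8):
    """Binary label: True for a good update experience (threshold assumed)."""
    return not rec.rolled_back and rec.reliability >= min_reliability


def anomalous_features(records, min_devices=3, ratio=2.0):
    """Flag features whose failure rate is at least `ratio` times the
    failure rate of all devices that do not have the feature."""
    total = len(records)
    total_fail = sum(not good_experience(r) for r in records)
    seen = defaultdict(int)
    failed = defaultdict(int)
    for r in records:
        bad = not good_experience(r)
        for f in r.features:
            seen[f] += 1
            failed[f] += bad
    flagged = {}
    for f, n in seen.items():
        rest = total - n
        # Skip features seen too rarely, or with nothing to compare against.
        if n < min_devices or rest == 0:
            continue
        rate = failed[f] / n
        rest_rate = (total_fail - failed[f]) / rest
        if rate > 0 and rate >= ratio * rest_rate:
            flagged[f] = (rate, rest_rate)
    return flagged


# Made-up sample: three PCs with one driver all rolled back, seven others were fine.
records = (
    [UpdateRecord(frozenset({"driver:acme-audio-1.2", "os:base"}), True, 0.5)] * 3
    + [UpdateRecord(frozenset({"os:base"}), False, 0.95)] * 7
)
flagged = anomalous_features(records)
```

In this toy sample the driver is flagged because every device that has it failed while no other device did. A flagged feature like this would be a candidate for a safeguard hold, subject to the other signals mentioned above.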
ML Requires Guinea Pigs
Because the Windows ecosystem is so diverse and the ML model is refreshed daily, Microsoft needs to make sure it has data that represents similar devices to those that still must be updated. A technique called saturation is used to determine how many of the diverse systems are currently represented in the dataset. This helps Microsoft understand how a feature update has penetrated the Windows ecosystem. ML is only used to offer updates to systems if Microsoft has seen a fair number of similar configurations in its data. At the end of the day, Insiders and ‘seekers’ are important so that enough relevant data can be gathered to inform the ML model on the widest variety of device configurations in the ecosystem.
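One way to picture saturation is as a coverage measure over device configurations. The Python sketch below is an assumption about how such a measure could work, not Microsoft's definition: the `config_key` fields, the `min_seen` cutoff, and both function names are hypothetical.

```python
from collections import Counter


def config_key(device):
    """Reduce a device to a coarse configuration key (fields are hypothetical)."""
    return (device["oem_model"], device["gpu_driver"])


def saturation(updated, pending, min_seen=2):
    """Fraction of pending devices whose configuration appears at least
    `min_seen` times among the devices that have already been updated."""
    if not pending:
        return 1.0
    seen = Counter(config_key(d) for d in updated)
    covered = sum(seen[config_key(d)] >= min_seen for d in pending)
    return covered / len(pending)


def eligible_for_ml(device, updated, min_seen=2):
    """Let ML decide for a device only if similar configurations were seen."""
    seen = Counter(config_key(d) for d in updated)
    return seen[config_key(device)] >= min_seen


a = {"oem_model": "A", "gpu_driver": "1"}
b = {"oem_model": "B", "gpu_driver": "1"}
c = {"oem_model": "C", "gpu_driver": "1"}

updated = [a, a, b]          # configurations already on the feature update
pending = [a, b, c, a]       # devices still waiting for the update
coverage = saturation(updated, pending)
```

In this sketch, configurations like `b` and `c` stay outside ML-driven offers until enough similar devices have been updated, which is where Insiders and ‘seekers’ supply the missing data.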
Machine Learning Can’t Replace Real-World Testing
It’s still early days, but ML does seem to be improving the rate of good update experiences. Whether we like Microsoft’s Software-as-a-Service approach to Windows 10 and its lack of internal quality testers is another question. Organizations with the right resources should always pilot new Windows 10 feature updates on a variety of device configurations to establish whether they update smoothly and whether the upgrade causes any functional issues that could affect line-of-business operations. Machine learning can’t replace real-world testing. At least not yet.
For a more technical deep dive into how ML is used for Windows 10 feature updates, see the Microsoft blog post by Archana Ramesh and Michael Stephenson mentioned above.