Microsoft Partners with Nvidia to Build Azure-Powered AI Supercomputer

Datacenter networking servers

Microsoft has announced a new “multi-year” deal with Nvidia to build an AI supercomputer hosted in Microsoft Azure and powered by tens of thousands of Nvidia GPUs. The new partnership should enable organizations to train, deploy and scale AI applications and services.

“AI is fueling the next wave of automation across enterprises and industrial computing, enabling organizations to do more with less as they navigate economic uncertainties,” said Scott Guthrie, executive VP of the Cloud + AI Group at Microsoft. “Our collaboration with NVIDIA unlocks the world’s most scalable supercomputer platform, which delivers state-of-the-art AI capabilities for every enterprise on Microsoft Azure.”

According to Microsoft, the Azure instances on this supercomputer already come with Nvidia’s Quantum 200Gb/s InfiniBand networking and A100 GPUs. In the future, Microsoft plans to boost performance with Nvidia’s Quantum-2 400Gb/s InfiniBand networking and H100 GPUs.

Nvidia claims that the H100 chip features a dedicated “Transformer Engine” for machine learning workloads. Compared to the A100, Nvidia says it cuts power consumption by up to 1.5 times and boosts performance by up to 6 times.

Microsoft to work with Nvidia to optimize its DeepSpeed library

Microsoft also plans to collaborate with Nvidia to optimize its DeepSpeed library, which should reduce memory usage and computing power when training large language models. Moreover, Microsoft is bringing its software development suite to Azure enterprise customers, though there is no ETA yet.
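DeepSpeed lowers memory pressure during training chiefly through its ZeRO optimizer-state partitioning and offloading features. As a rough illustration only (the values below are hypothetical placeholders, not settings announced by Microsoft or Nvidia), a minimal DeepSpeed configuration file might look like this:

```json
{
  "train_batch_size": 32,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}
```

Here, ZeRO stage 2 partitions optimizer states and gradients across GPUs, and `offload_optimizer` moves optimizer memory to host RAM, both of which trade some communication overhead for a smaller per-GPU memory footprint.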

“Our collaboration with Microsoft will provide researchers and companies with state-of-the-art AI infrastructure and software to capitalize on the transformative power of AI,” said Manuvir Das, VP of enterprise computing at NVIDIA.

Microsoft’s new partnership builds on a previous agreement with OpenAI to create one of the world’s fastest supercomputers on top of Azure’s infrastructure. Microsoft emphasizes that the amount of computation required to train AI models has increased exponentially over the past ten years, and that this new project will help organizations meet the growing demand for natural language processing and other AI workloads.