An Azure Infrastructure Year in Review – 2018
January – Spectre
In early December we received notification from Microsoft to expect full Azure host reboots on January 9th, instead of the usual 30-second (approx.) warm hypervisor reboots (in-place migration). The rumour was that a big security fix was being deployed. We returned from the Christmas break to find that Google had started talking to the world on January 3rd about Spectre and Meltdown, two CPU vulnerabilities, a week ahead of schedule. This forced Azure, and other vendors/clouds, to accelerate their plans and push out updates over the following 48 hours.
That night, Azure customers had virtual machines go down for up to 15 minutes, instead of the usual unnoticeable in-place migration. Microsoft’s fix for Spectre was done at the software (hypervisor) layer, providing a level of mitigation that Intel was still struggling to deliver via firmware updates for its chipsets.
Downtime is always considered bad, but the advantage that every Azure customer had was “instant” mitigation of the vulnerability – something that I bet many hypervisor and hosting customers still don’t have.
February – Storage Service Endpoints
Service endpoints are a mechanism for virtually connecting platform features of Azure to a virtual network. This simplifies the “routing” of packets from subnets to these services and allows customers to have greater firewall control (network security groups).
A service endpoint for storage accounts allowed virtual machines to connect to those storage accounts “over the virtual network”. Firewall support was added to the storage account to control which source IP addresses/ranges could access the storage.
The list of services supporting service endpoints has exploded since February. More services are adding support for their own firewalls, or even the ability to join a virtual network for isolation. This is great for customers that require high levels of network security: PCI-DSS, healthcare, GDPR, and so on.
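As a sketch of how the pieces fit together with the Azure CLI (all resource, network, and account names below are hypothetical): first the service endpoint is enabled on the subnet, then the storage account firewall is locked down to that subnet.

```shell
# Enable the Microsoft.Storage service endpoint on an existing subnet.
az network vnet subnet update \
  --resource-group MyRG --vnet-name MyVNet --name AppSubnet \
  --service-endpoints Microsoft.Storage

# Lock the storage account down: deny all traffic by default...
az storage account update --resource-group MyRG --name mystorageacct \
  --default-action Deny

# ...then allow only the subnet with the service endpoint.
az storage account network-rule add --resource-group MyRG \
  --account-name mystorageacct --vnet-name MyVNet --subnet AppSubnet
```

Note the order: turning on the default-deny rule before adding the subnet rule briefly blocks all access, so in production you would add the network rule first.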
March – Open Source
If you didn’t think that Microsoft was all-in on open source before March, then the news about Service Fabric should have convinced you.
Few know what Service Fabric is – a microservices platform that enables a new style of cloud-scale, highly available applications that can be updated rapidly. Service Fabric powers many of Microsoft’s cloud services, such as Skype and Azure SQL. It quite literally is the foundation of Microsoft’s clouds. And they went and released it as open source.
That wasn’t the first and it won’t be the last big piece of open source news from Microsoft. If you haven’t realized that Microsoft has changed, then you need to crawl out from under that penguin-shaped rock that fell on your head.
April – Application Security Groups
There was a little bit of news in April that I paid very little attention to: Microsoft made Application Security Groups (ASGs) generally available. At the time, ASGs were a PowerShell/CLI/ARM-only way to create custom groupings of virtual machine NICs that could be used as sources or destinations in Network Security Group rules. I think either Microsoft underplayed this or I did a bad job of reading things – maybe the lack of a full GUI experience played a role too.
It was only later, at the Ignite conference in September, that I realized how much ASGs would impact secure network design for me. ASGs allow you to almost flatten a virtual network design, from one subnet per application tier down to maybe 2-3 subnets – a hard boundary between outside and inside, with the inside subnet broken up into dynamic security zones by ASGs.
I am still more likely to go with the classic subnet-per-tier model for smaller architectures, but I would lean towards ASGs for more complex designs.
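A minimal sketch of that flattened model with the Azure CLI (all names are hypothetical, and the NSG is assumed to already exist): two ASGs stand in for application tiers, and a single NSG rule controls traffic between them regardless of which subnet the NICs sit in.

```shell
# Create an ASG per application tier.
az network asg create --resource-group MyRG --name WebServers --location westeurope
az network asg create --resource-group MyRG --name DbServers --location westeurope

# Allow only the web tier to reach the database tier on SQL port 1433.
# Assumes an existing NSG named InsideNsg on the inside subnet.
az network nsg rule create --resource-group MyRG --nsg-name InsideNsg \
  --name AllowWebToDb --priority 200 --direction Inbound --access Allow \
  --protocol Tcp --destination-port-ranges 1433 \
  --source-asgs WebServers --destination-asgs DbServers
```

The rule follows the NICs, not the subnets: assign a VM’s NIC to the WebServers ASG and it is immediately covered, with no IP ranges to maintain.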
May – Confidential Computing
Mark Russinovich started to add a bit more detail on Azure’s Confidential Computing preview in May. A private preview for what would become the DC-Series of virtual machines started. The hosts offered Intel SGX features with Microsoft software-based intellectual property to create data compartmentalization within the virtual machines. The idea is that code and sensitive data can be isolated from the guest operating system of the virtual machine and be tamper/inspection proof from operators, rogue admins, and intruders – it’s the sort of tech that might have prevented the recent attack on British Airways.
June – Share Value
After over a decade of being valued at $20-$25 per share, Microsoft hit over $100 per share on June 1st. This was a massive milestone, marking the success of Microsoft’s re-invention as a cloud services provider. Azure has been growing at huge rates quarter after quarter, with headline customers being frequently highlighted. Office 365 has all but wiped the (bad) memories of Google Apps from our minds. The Windows company has successfully moved on.
July – Azure File Sync
Part of Azure’s success is the hybrid-first approach of the Microsoft cloud. Another hybrid service arrived in July: Azure File Sync (AFS).
While we all might want to get rid of the on-premises file server, for many that is not possible today or in the foreseeable future. AFS, through an agent, synchronizes files to the cloud, enables cloud tiering to reduce on-premises storage needs, and provides a better way to do backup, restore, and disaster recovery. The killer feature is that AFS is easy: it works by adding an agent to an existing file server.
August – Maturing Governance
A big concern about a cloud that offers delegation and self-service is governance:
- Who is doing it?
- What are they doing?
- Are they doing enough of it or too much?
- Is money being wasted?
- Is it secure?
- Can we track costs?
- Is there an audit trail?
- Can we force compliance?
The customers asking these questions don’t have just one subscription or even a few; they might have dozens or hundreds of subscriptions, which means a per-subscription governance model will not suffice. In August, Microsoft made management groups generally available; this feature allows you to nest subscriptions from a single tenant into a structure similar to organizational units in Active Directory Domain Services. The groups allow delegation of administration and policy assignment above the individual subscription, in a more organized and scalable manner.
The release of management groups was an indication that governance in Azure was maturing. Next to come was Blueprints, which scaled out the template deployment concept to include elements of governance too.
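A rough sketch of the nesting with the Azure CLI (group names and the subscription ID are placeholders): a root group holds a child group, and subscriptions are attached to the child, so policy and RBAC assigned at either group flow down.

```shell
# Create a root management group and a nested child group.
az account management-group create --name Contoso --display-name "Contoso Root"
az account management-group create --name Production --parent Contoso

# Move a subscription under the Production group (placeholder ID).
az account management-group subscription add --name Production \
  --subscription "00000000-0000-0000-0000-000000000000"
```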
September – Ignite
Microsoft Ignite is announcement-palooza for Microsoft. The Redmond corporation released a “book of news” that included “all” of the Azure news, but it was far from complete. I made my best effort at summarizing the Azure infrastructure announcements in my monthly Azure Infrastructure summary post.
October – Az PowerShell
I think this is more of a “one to watch” than a “do it now”: in October, Microsoft launched a new set of Azure PowerShell modules called Az. Az, which is based on PowerShell Core, is intended to replace the AzureRM modules, which are based on Windows PowerShell. The reasoning is that PowerShell Core is an open, cross-platform version of PowerShell, whereas Windows PowerShell is restricted to Windows.
The cmdlet names change; for example, Get-AzureRMVM becomes Get-AzVM. This change could break scripts, so Microsoft included a compatibility mode that creates aliases for the old cmdlet names. There is a gotcha – PowerShell Core cannot handle saved credentials (PSCredential) the way Windows PowerShell scripts expect. So if your scripts need to save a credential and pass it into a cmdlet, that won’t work with Az. I suspect this will affect a lot of people and damage adoption. Another gotcha is that Azure Automation doesn’t support Az yet, but I wouldn’t be surprised if that changes – it might allow Microsoft to use Linux-based containers to host automation jobs, which could reduce Microsoft’s costs and rapidly improve the launch time of Automation jobs.
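For scripts that can’t be updated immediately, the compatibility mode is switched on with a single cmdlet – a sketch, assuming the Az module is installed and the resource group name is hypothetical:

```powershell
# Map old AzureRM cmdlet names to their Az equivalents for the current user.
Enable-AzureRmAlias -Scope CurrentUser

# An old-style call now resolves; Get-AzureRmVM is an alias for Get-AzVM.
Get-AzureRmVM -ResourceGroupName MyRG
```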
November – Outages
Microsoft cloud news was dominated by two outages in November. Multi-Factor Authentication (MFA) secures sign-ins to a customer’s tenant (Azure AD) and therefore secures access to the associated subscription services, such as Office 365, Azure, Dynamics 365, and the 3,000+ third-party cloud services that can be linked to a tenant – not to mention any other corporate services that can be linked too.
The first outage lasted up to 14 hours, impacting customers all around the world. Unless you had a “break-glass” account to undo MFA, or used Conditional Access (Azure AD Premium) to deem MFA not required for users in physically secure locations, your business shut down. Microsoft released a root cause analysis a week later – and then MFA broke again within 24 hours.
The theory of cloud is that the people who can do this stuff best do it for us, and we can focus on the things that matter: code, settings, and data, which are the things that the business really cares about. When directors care about infrastructure, then something is broken, and that’s what happened here … twice … in just over a week.
On top of the previous Windows 10 1809 and Windows Server 2019 release withdrawals, and the slew of poor quality releases by Microsoft, I asked if Microsoft needed to reset their business to focus on “finishing the job”.
Microsoft revealed that something old and something new were being combined to make Azure more reliable for customers. Hardware faults happen, and Azure has a lot of hardware. As one might expect, Microsoft has all kinds of monitoring pulling in crazy amounts of data about the health and performance of Azure hosts. Reading this mass of data is impossible for a human, and traditional rule-based monitoring would generate lots of false or missed alerts. Machine learning can apply logic at scale to understand and even predict failures.
Azure did not have live migration (the equivalent of vMotion) – but that secretly changed in early 2018. Now Microsoft can move virtual machines off a host if:
- A regular host maintenance cycle is planned.
- Machine Learning identifies a potential problem with a host.
The latter has reduced customer virtual machine outages by 50%. I wonder if Microsoft will allow us to feed this kind of data from Azure Stack hosts, and eventually Hyper-V hosts, to Azure for analysis and alerting.
FYI, some elements of this kind of monitoring have been in Hyper-V since Windows Server 2012 R2.
You might have noticed that Azure Infrastructure news slowed down in November and was a tiny trickle in December. This isn’t unusual – a lot of Microsoft is on vacation in December – some folks disappear for the month! With so many people away, it’s a bad idea to launch new features or deploy changes in December, just in case something goes wrong, and the required people are sipping mojitos on a distant beach.
Eagle-eyed watchers might have noticed that work is still being done. Previews have been on-going, and Redmond-ites have been in planning meetings. Expect Q1 to be busy – the first or second week in January might have some nice new releases, even if they aren’t promoted by Marketing or blog posts yet.