Microsoft has published a new post on its Azure blog, in which Azure Chief Technology Officer Mark Russinovich reveals that the company's cloud services have averaged 99.995% uptime over the past year. The number may seem high, but with companies and people relying ever more heavily on cloud infrastructure, any failure, such as the outage at the beginning of May, is one too many for most customers.
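To put that figure in perspective, a quick back-of-the-envelope calculation shows how little downtime 99.995% availability actually allows over a year:

```python
# Rough downtime math: 99.995% uptime leaves 0.005% of the year
# as potential downtime.
minutes_per_year = 365 * 24 * 60          # 525,600 minutes in a non-leap year
downtime_fraction = 1 - 0.99995           # 0.005% expressed as a fraction
downtime_minutes = minutes_per_year * downtime_fraction
print(round(downtime_minutes, 1))         # about 26.3 minutes per year
```

In other words, even at that availability level, services can be unreachable for roughly 26 minutes a year, which is why a single multi-hour incident stands out so sharply.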
Microsoft acknowledges this, and the blog post goes on to announce that the company has created a new Quality Engineering team inside Azure, which will work alongside the existing Site Reliability Engineering team. The new team will develop new ways to make the platform more reliable and prevent incidents like the one mentioned above.
To that end, a few initiatives are already underway. Microsoft is investing in safe deployment practices that roll changes out through longer testing phases to ensure they land safely. The company is also working on zero-impact and low-impact maintenance procedures, which greatly reduce or even eliminate downtime during updates, for example. To make sure services are ready to deal with failures, Microsoft is injecting faults into its services more frequently before they're made public, and it's also planning to give customers the ability to do this on their own.
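The idea behind fault injection can be illustrated with a minimal sketch (all names here are hypothetical, not Azure's actual tooling): wrap a service call so that it randomly fails some fraction of the time, forcing retry and fallback paths to be exercised before a real outage does it for you.

```python
import random

def inject_faults(func, failure_rate=0.2, seed=None):
    """Wrap func so it raises an injected failure some fraction of calls."""
    rng = random.Random(seed)
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # simulated outage
        return func(*args, **kwargs)
    return wrapper

def fetch_record(key):
    # Stand-in for a real downstream service call.
    return {"key": key, "value": 42}

# Make the dependency deliberately flaky: half of all calls fail.
flaky_fetch = inject_faults(fetch_record, failure_rate=0.5, seed=1)

def fetch_with_retry(key, attempts=5):
    """A resilient caller retries on failure instead of crashing."""
    for _ in range(attempts):
        try:
            return flaky_fetch(key)
        except ConnectionError:
            continue
    return None

print(fetch_with_retry("order-123"))
```

A caller that survives the injected faults is far more likely to survive a genuine transient failure in production.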
Microsoft is also working to expand availability zones, which provide even more reliability to its customers. Currently, they're available in Azure's 10 largest markets, and Microsoft hopes to bring them to the next 10 by 2021. It's also offering storage-account failover options so that each organization can choose whether to prioritize data retention or time to restore when things go wrong. Microsoft has traditionally prioritized data retention, which caused some outages to last longer than customers would like.
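The trade-off behind that choice can be sketched in a few lines (a simplified illustration, not Azure's actual failover API): failing over quickly restores availability but may discard writes that had not yet replicated to the secondary, while waiting for the primary preserves every write at the cost of a longer outage.

```python
def decide_failover(priority, unreplicated_writes):
    """Illustrative policy choice between restore time and data retention."""
    if priority == "time-to-restore":
        # Fail over now; writes not yet replicated are lost.
        return {"action": "failover", "data_lost": unreplicated_writes}
    # Wait for the primary to recover; no data is lost, outage lasts longer.
    return {"action": "wait-for-primary", "data_lost": 0}

print(decide_failover("time-to-restore", unreplicated_writes=12))
print(decide_failover("data-retention", unreplicated_writes=12))
```

Letting each organization pick the policy means an analytics store can fail over fast, while a ledger of financial transactions can insist on keeping every write.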
Lastly, there's Project Tardigrade, first announced at this year's Build conference. It will allow Azure to detect hardware failures or memory leaks that could cause a system to crash and freeze the affected virtual machine before that happens, so the service can move the workload to a healthy host.
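The freeze-and-migrate idea can be sketched as follows (a toy model with hypothetical names, not Project Tardigrade's actual implementation): when a host shows signs of imminent failure, pause the VM's state and resume it on a healthy host instead of letting it crash.

```python
class Host:
    """Toy model of a physical host that can run virtual machines."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.vms = []

def migrate_on_failure(vm, current, pool):
    """Freeze the VM and move it to the first healthy host in the pool."""
    if current.healthy:
        return current                      # nothing to do
    vm["state"] = "frozen"                  # pause execution, preserve memory
    target = next(h for h in pool if h.healthy)
    current.vms.remove(vm)
    target.vms.append(vm)
    vm["state"] = "running"                 # resume on the new host
    return target

failing = Host("host-a", healthy=False)
spare = Host("host-b")
vm = {"name": "vm-1", "state": "running"}
failing.vms.append(vm)

new_host = migrate_on_failure(vm, failing, [failing, spare])
print(new_host.name)  # vm-1 now runs on host-b
```

From the workload's point of view, the failure looks like a brief pause rather than a crash and restart.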