Reliability in Azure Well-Architected Framework

Reliability is one of the pillars of the Azure Well-Architected Framework. What is it? Where and how can we use it in Azure? What do we need to know to use it effectively?

Let's go through some points to explore reliability in detail.


Built-in resiliency features in the Azure platform:

Azure Storage, Azure SQL Database, and Azure Cosmos DB all provide built-in data replication across availability zones and regions.

Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware failures.

Virtual machines (VMs) in an availability set are spread across several fault domains. Spreading VMs across fault domains limits the impact of physical hardware failures, network outages, or power interruptions.

Availability Zones are physically separate locations within each Azure region. With availability zones, you can design and operate applications and databases that automatically transition between zones without interruption, which ensures resiliency if one zone is affected.


Building for reliability includes:

Ensuring a highly available architecture

Recovering from failures such as data loss, major downtime, or ransomware incidents


Design Principles

Building a reliable application in the cloud is different from traditional application development.

In the cloud, we acknowledge that failures will happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component.

The following design principles provide:

Context for questions

Why a certain aspect is important

How an aspect is applicable to Reliability

These critical design principles are used as lenses to assess the Reliability of an application deployed on Azure. 

Design for business requirements - A mission-critical application with a 99.999% service level agreement (SLA) requires a higher level of reliability than a less critical application. As a consequence, cost implications are inevitable, and the cost moves toward the higher end. We need to carefully consider the trade-off between cost on one side and reliability and availability on the other.

Design for failure - In the cloud, we do not try to prevent every failure. We know failures will happen, so by anticipating them, from individual components to entire Azure regions, we can build a resilient solution that increases reliability.
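To make the cost of each extra "nine" a bit more concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the non-leap-year assumption are mine, for illustration only; they are not part of any Azure SDK.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget (in minutes) that an availability target leaves."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets side by side.
    for target in (99.9, 99.95, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

With these assumptions, a 99.999% target leaves only about 5.3 minutes of downtime per year, while 99.9% leaves roughly 8.8 hours. That gap is where the extra cost comes from.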

Observe application health - By monitoring the operation of an application relative to a healthy state, we can detect and predict reliability issues.

Drive automation - We are all human, and humans make mistakes. Many failures happen because of human error, through the deployment of insufficiently tested software or through misconfiguration.

This is where automation comes in, especially automated testing with tools like Selenium, Ansible, and Jenkins. With this strategy, you can run multiple rounds of testing while varying the environment, test parameters, data, and so on. By the way, Azure DevOps needs some time to fully support this automation framework, and I'm sure it will be sooner rather than later.

Automation improves:

Reliability

Automated testing

Deployment

Management


Design for self-healing - Self-healing describes the ability of a system to deal with failures automatically. 

For example, you can implement the circuit breaker design pattern. It stops requests from being sent to the failing side and gives that side some time to self-heal.

Handling failures happens through pre-defined remediation protocols. These protocols connect to failure modes within the solution. 
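The circuit breaker idea can be sketched in a few lines of Python. This is a simplified illustration, not a production library: the class name, the thresholds, and the injectable clock are all my own assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to let the dependency heal
        self._clock = clock                         # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of sending more traffic to the failing side.
            raise CircuitOpenError("dependency is being given time to recover")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

You would wrap calls to a flaky dependency as breaker.call(fetch_orders), where fetch_orders is whatever function talks to that dependency (a hypothetical name here). While the circuit is open, callers get an immediate CircuitOpenError and can fall back to a cached or default response.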


Design for scale-out - Scale-out is a concept that focuses on the ability of a system to respond to demand through horizontal growth. Through scale units, a system can handle expected and unexpected traffic increases, essential to overall reliability. Scale units further reduce the effects of a single resource failure.



Design for Reliability

Reliable applications should maintain a pre-defined percentage of uptime (availability). They should also balance between high resiliency, low latency, and cost (High Availability). Just as important, applications should be able to recover from failures (resiliency).

Checklist:

How have you designed your applications with reliability in mind?

Define availability and recovery targets to meet business requirements.

Build resiliency and availability into your apps by gathering requirements.

Ensure that application and data platforms meet your reliability requirements.

Configure connection paths to promote availability.

Use Availability Zones where applicable to improve reliability and optimize costs.

Ensure that your application architecture is resilient to failures.

Know what happens if the requirements of Service Level Agreements are not met.

Identify possible failure points in the system to build resiliency.

Ensure that applications can operate in the absence of their dependencies.


Azure Services that we can use here

· Azure Front Door

· Azure Traffic Manager

· Azure Load Balancer

· Azure NAT Gateway

· Azure Service Fabric

· Azure Kubernetes Service (AKS)

· Azure Site Recovery



Let's review one experimental model to design your system




Let's plan for failure - an Azure Web App running in East US




Let's assume something goes wrong and your app is not available. To simulate this, let's stop the app. You receive a 403 error saying the app is stopped. This is because your app is running in a single region (here, East US).



Let's plan something better. We need reliability and availability, so let's make the app available in multiple regions - East US and West US. You can configure this quickly in the Azure portal by setting up a front end and a backend pool.
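Conceptually, a backend pool with two regions behaves like the sketch below: traffic goes to the preferred backend as long as its health probe passes, and falls over to the next one otherwise. This is only a mental model with hypothetical names and priorities that I made up; it is not how the Azure service is implemented internally.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str       # hypothetical label for a regional deployment
    priority: int   # lower number means more preferred
    healthy: bool   # result of the latest health probe


def pick_backend(backends):
    """Return the healthy backend with the best priority, or None if all are down."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.priority)


if __name__ == "__main__":
    pool = [
        Backend("webapp-eastus", priority=1, healthy=True),
        Backend("webapp-westus", priority=2, healthy=True),
    ]
    print(pick_backend(pool).name)  # the East US backend is preferred

    pool[0].healthy = False         # simulate stopping the East US app
    print(pick_backend(pool).name)  # traffic now goes to West US
```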





Now, if you stop the app in the East US region, your app is still served from the West US region. Your app is running!

However, please note that for a few seconds or minutes you may get an app-down error message in your browser until Azure internally switches the app to the available region. I hope you don't mind this :)
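If a client program (rather than a person in a browser) is calling your app, a short retry loop with exponential backoff is one common way to ride out that switch-over window. Here is a minimal sketch; the function name, attempt count, and delays are my assumptions and should be tuned to your own failover times.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation(); on failure wait 1s, 2s, 4s, ... (capped) and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff gives the platform time to switch regions.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

In real code you would usually catch only the transient errors you expect (for example, timeouts or 5xx responses) instead of every exception.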





Okay, now let's configure something in Azure so that you get an alert when something goes down.




Once the alert is created, you can set up your action plan - a specific set of actions or execution steps that can fix the problem.
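The logic behind an alert plus an action plan is simple enough to sketch. The example below fires an action when the last few health probes have all failed. The rule, the threshold, and the function names are hypothetical and only illustrate the idea; in Azure you configure this through alert rules and action groups rather than writing it yourself.

```python
def should_alert(probe_results, consecutive_failures=3):
    """Return True when the most recent `consecutive_failures` probes all failed.

    probe_results is a list of booleans, oldest first (True means healthy).
    """
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    recent = probe_results[-consecutive_failures:]
    return len(recent) == consecutive_failures and not any(recent)


def run_alert_rule(probe_results, action, consecutive_failures=3):
    """Run the remediation action if the alert condition is met."""
    if should_alert(probe_results, consecutive_failures):
        action()
        return True
    return False


if __name__ == "__main__":
    history = [True, True, False, False, False]
    run_alert_rule(history, lambda: print("Alert fired: restarting the app"))
```

Requiring several consecutive failures before acting helps avoid alerts for a single blip.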



So, what we are trying to do here is build a reliable web app following the Azure Well-Architected Framework.


This is not only a software architecture or design pattern; it also includes your infrastructure pattern. This pattern addresses the challenges of refactoring a monolithic ASP.NET application with an MS SQL database as its backend, and it helps us develop a modern, reliable, and scalable ASP.NET Core application.



What more do you need?

If you have any questions, please feel free to post here. Enjoy Azure Cloud!

