In today's 24/7 environments that run mission-critical applications, businesses rely heavily on the availability of their data. Although servers and their software are generally reliable, there is always the risk of a hardware failure or a software bug, either of which could bring a server down. To mitigate these risks, business-critical applications often rely on redundant hardware to provide fault tolerance. If the primary system fails, the application can automatically fail over to the redundant system. This is the underlying principle of high availability (HA).
Even with the implementation of HA technologies, there is always a small risk of an event that causes the application to become unavailable. This could be due to a major incident, such as the loss of a data center through a natural disaster or an act of terrorism. It could also be caused by data corruption or human error, resulting in the application's data becoming lost or damaged beyond repair.
In these situations, some applications may rely on restoring the latest backup to recover as much data as possible. However, more critical applications may require a redundant server to hold a synchronized copy of the data in a secondary location. This is the underpinning concept of disaster recovery (DR). This chapter discusses the concepts behind HA and DR.
Level of Availability
The amount of time that a solution is available to end users is known as the level of availability, or uptime. To provide a true picture of uptime, a company should measure the availability of a solution from a user's desktop. In other words, even if your SQL Server has been running uninterrupted for over a month, users may still experience outages to their solution caused by other factors. These factors can include network outages or an application server failure.
In some instances, however, you have no choice but to measure the level of availability at the SQL Server level. This may be because you lack holistic monitoring tools within the enterprise. Most often, however, the requirement to measure the level of availability at the instance level is political, as opposed to technical. In the IT industry, it has become a trend to outsource the management of data centers to third-party providers. In such cases, the provider responsible for managing the SQL Server instances may not necessarily be the provider responsible for the network or application servers. In this scenario, you need to monitor uptime at the SQL Server level to accurately judge the performance of the service provider.
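Where instance-level measurement is the only option, a simple starting point is to check how long the instance has been running since its last start. The following is a minimal sketch, assuming SQL Server 2008 or later and a login with permission to view server state. The column aliases are illustrative only. Note that it reports only the time since the most recent start, so it is not a record of historical availability.

```sql
--Report when the instance last started and how long it has been running
SELECT
    sqlserver_start_time AS InstanceStartTime,
    DATEDIFF(MINUTE, sqlserver_start_time, GETDATE()) AS MinutesSinceStart
FROM sys.dm_os_sys_info ;
```

A true uptime figure would require outages to be logged over the whole measurement interval, which this query alone does not do.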
The level of availability is measured as a percentage of the time that the application or server is available. Companies often strive to achieve 99 percent, 99.9 percent, 99.99 percent, or 99.999 percent availability. As a result, the level of availability is often referred to in 9s. For example, five 9s of availability means 99.999 percent uptime and three 9s means 99.9 percent uptime.
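The arithmetic behind these percentages can be sketched as follows, where T is the length of the measurement interval in seconds and A is the availability percentage (both symbols are introduced here purely for illustration):

```latex
\[
\text{downtime} = T \times \frac{100 - A}{100}
\]
\[
A = 99.9,\quad T = 7 \times 24 \times 60 \times 60 = 604{,}800\ \text{s}
\;\Rightarrow\;
\text{downtime} = 604{,}800 \times 0.001 = 604.8\ \text{s}
\approx 10\ \text{minutes},\ 4\ \text{seconds}
\]
```

In other words, three 9s of availability measured over a week allows just over 10 minutes of downtime.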
Table 1-1 details the amount of acceptable downtime per week, per month, and per year for each level of availability.
Table 1-1. Levels of Availability
Level of Availability | Downtime Per Week | Downtime Per Month | Downtime Per Year |
---|---|---|---|
99% | 1 hour, 40 minutes, 48 seconds | 7 hours, 18 minutes, 17 seconds | 3 days, 15 hours, 39 minutes, 29 seconds |
99.9% | 10 minutes, 4 seconds | 43 minutes, 49 seconds | 8 hours, 45 minutes, 56 seconds |
99.99% | 1 minute | 4 minutes, 22 seconds | 52 minutes, 35 seconds |
99.999% | 6 seconds | 26 seconds | 5 minutes, 15 seconds |
All values are rounded down to the nearest second.
To calculate other levels of availability, you can use the script in Listing 1-1. Before running this script, replace the value of @Uptime with the level of uptime that you wish to calculate. You should also replace the value of @UptimeInterval to reflect uptime per week, month, or year.
Listing 1-1. Calculating the Level of Availability
DECLARE @Uptime DECIMAL(5,3) ;
--Specify the uptime level to calculate
SET @Uptime = 99.9 ;
DECLARE @UptimeInterval VARCHAR(5) ;
--Specify WEEK, MONTH, or YEAR
SET @UptimeInterval = 'YEAR' ;
DECLARE @SecondsPerInterval FLOAT ;
--Calculate seconds per interval
SET @SecondsPerInterval =
(
    SELECT CASE
        WHEN @UptimeInterval = 'YEAR'
            THEN 60*60*24*365.243
        WHEN @UptimeInterval = 'MONTH'
            THEN 60*60*24*30.437
        WHEN @UptimeInterval = 'WEEK'
            THEN 60*60*24*7
    END
) ;
DECLARE @DowntimeSeconds DECIMAL(12,4) ;
--Calculate the permitted downtime, in seconds, for the chosen uptime level
SET @DowntimeSeconds = @SecondsPerInterval * (100-@Uptime) / 100 ;
--Format results as days, hours, minutes, and seconds (rounded down)
SELECT
    CONVERT(VARCHAR(12), FLOOR(@DowntimeSeconds /60/60/24)) + ' Day(s), '
    + CONVERT(VARCHAR(12), FLOOR(@DowntimeSeconds /60/60 % 24)) + ' Hour(s), '
    + CONVERT(VARCHAR(12), FLOOR(@DowntimeSeconds /60 % 60)) + ' Minute(s), '
    + CONVERT(VARCHAR(12), FLOOR(@DowntimeSeconds % 60)) + ' Second(s).' ;
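The calculation in Listing 1-1 can also be run in reverse: given the downtime observed during an interval, you can work out the level of availability that was actually achieved. The following is a minimal sketch of that idea. The variable names and the 30-minute outage figure are hypothetical values chosen for illustration, and the month length matches the one assumed in Listing 1-1.

```sql
DECLARE @ObservedDowntimeSeconds INT ;
--Hypothetical figure: 30 minutes of outage during the interval
SET @ObservedDowntimeSeconds = 1800 ;
DECLARE @IntervalSeconds FLOAT ;
--One month, using the same month length as Listing 1-1
SET @IntervalSeconds = 60*60*24*30.437 ;
--Calculate the achieved availability as a percentage
SELECT CAST(100.0 * (1 - @ObservedDowntimeSeconds / @IntervalSeconds) AS DECIMAL(7,4))
    AS AchievedUptimePercent ;
```

With these sample values, the result is roughly 99.93 percent, which satisfies three 9s for the month but falls short of four 9s.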
Service-Level Agreements and Service-Level Objectives
When a third-party provider is responsible for managing servers, the contract usually includes service-level agreements (SLAs). These SLAs define many parameters, including how much downtime is acceptable, the maximum length of time a server can be down in the event of failure, and how much data loss is acceptable if failure occurs. Normally, there are financial penalties for the provider if these SLAs are not met.
When servers are managed in-house, DBAs still have the concept of customers. These are usually the end users of the application, with the primary contact being the business owner. An application's business owner is the stakeholder within the business who commissioned the application and who is responsible for signing off on funding enhancements, among other things.
In an in-house scenario, it is still possible to define SLAs, and in such a case, the IT Infrastructure or Platform departments may be liable for charge-back to the business teams if these SLAs are not met. However, in internal scenarios, it is much more common for IT departments to negotiate service-level objectives (SLOs) with the business teams, as opposed to SLAs. SLOs are very similar in nature to SLAs, but their use implies that the business does not impose financial penalties on the IT department if the objectives are not met.
Proactive Maintenance
It is important to remember that downtime is not only caused by failure, but also by proactive maintenance. For example, if you need to patch the operating system, or SQL Server itself, with the latest service pack, then you must have some downtime during installation.
Depending on the upgrade you are applying, the downtime in such a scenario could be substantial: several hours for a stand-alone server. In this situation, high availability is essential for many business-critical applications, not to protect against unplanned downtime, but to avoid prolonged outages during planned maintenance.
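Before scheduling a maintenance window, it may help to confirm which build and service pack level an instance is currently running. The following is a minimal sketch using the built-in SERVERPROPERTY function. The column aliases are illustrative only.

```sql
--Report the current build number, patch level, and edition of the instance
SELECT
    SERVERPROPERTY('ProductVersion') AS ProductVersion,
    SERVERPROPERTY('ProductLevel') AS ProductLevel,
    SERVERPROPERTY('Edition') AS Edition ;
```

Comparing this output against the latest available release gives a rough indication of how much patching, and therefore how much planned downtime, may be involved.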