Complexities of DR Design for the Federal IT Enterprise
Written by Mano Malayanur & Julia Nathan, April 24, 2014
Design of disaster recovery (DR) solutions is sometimes defined as the art of balancing the cost of disruption against the cost of recovery. This simple and elegant definition often masks the true complexity of DR. Cost of disruption is rarely measured solely in terms of revenue loss and often includes the damage to intangibles such as reputation. Likewise, the true cost of recovery is hard to quantify, and it is spread across the choices made in information technology (IT) resources, processes, and technologies deployed.
These complexities are magnified in the case of the federal IT enterprise. Lack of a profit motive for federal agencies can make it impossible to make a monetary case for DR. In private companies, the cost of a disaster can often be measured in terms of loss of revenue and profit. However, the purpose of a federal IT system is typically not to make a profit or even raise revenue, but to provide a service or support a mission. IT in federal agencies is often very complex, with state-of-the-art technology coexisting with, and dependent on, legacy systems that are decades old. Since any company complying with the Federal Acquisition Regulation can compete to offer services, federal IT contracts are often fragmented: the people who designed a system may be different from those supporting it, and may operate independently from those supporting the dependent systems and the underlying infrastructure. Agencies are subject to many federal mandates that affect IT investment strategy, such as the Federal Data Center Consolidation Initiative, the Federal Cloud Computing Strategy, guidelines from the National Institute of Standards and Technology (NIST), and directives from the Office of Management and Budget (OMB). Effective DR under these conditions requires an enterprise-wide approach so that expectations are clearly set for all involved.
BIAs and the Headline Risk
No agency leader relishes the prospect of making newspaper headlines as a result of a failed IT system. Failure of a key federal IT system carries with it the risk of administration and Congressional inquiries and exposure to the public, with bad consequences to follow. This consideration is often an excellent motivation for implementing an effective DR solution. DR solution design starts with a business impact analysis (BIA) to identify critical systems, gain a realistic understanding of the end user workflow, determine business requirements, and establish recovery needs. In complex federal IT environments, it is crucial to understand the dependencies between IT systems at this stage. This task can be especially challenging given the fragmented nature of federal IT support. It does no good for a critical system to be able to recover in minutes if a system it depends on takes hours to recover.
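The dependency point can be made concrete with a small calculation. The sketch below is illustrative only: the system names, recovery times, and dependency map are hypothetical, and it assumes that all systems recover in parallel and that a system is usable only once it and everything it depends on are back.

```python
# Hypothetical systems and figures, for illustration only.
OWN_RTO_MIN = {"portal": 5, "case_db": 15, "legacy_batch": 240}
DEPENDS_ON = {"portal": ["case_db", "legacy_batch"], "case_db": [], "legacy_batch": []}


def effective_rto(system, own=OWN_RTO_MIN, deps=DEPENDS_ON, _path=()):
    """Minutes until `system` is usable, assuming all systems recover in parallel."""
    if system in _path:
        raise ValueError(f"circular dependency involving {system!r}")
    # A system is only as fast as the slowest thing it depends on.
    upstream = [effective_rto(d, own, deps, _path + (system,)) for d in deps[system]]
    return max([own[system], *upstream])


for name in OWN_RTO_MIN:
    print(name, effective_rto(name))
```

Under these assumptions the portal, which can itself recover in 5 minutes, is effectively unavailable for 240 minutes because of the legacy batch system it depends on.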
Federal Standards and Guidelines
Several Federal Information Processing Standards (FIPS) and NIST guidelines need to be taken into account when designing a federal DR solution so that they will be supported by the end product. The following is a sample:
- FIPS 199 “Standards for Security Categorization of Federal Information and Information Systems”
- FIPS 200 “Minimum Security Requirements for Federal Information and Information Systems”
- NIST SP 800-34 “Contingency Planning Guide for Federal Information Systems”
- NIST SP 800-53 “Recommended Security Controls for Federal Information Systems and Organizations”
In addition, each agency typically has its own policies and guidelines that must also be addressed.
The Right Number of Tiers for DR
Federal enterprises tend to own systems that support varied capabilities, which translate to broad ranges of recovery point objective (RPO) and recovery time objective (RTO). DR tiers need to support the full range as informed by the BIA. Tiers also need to be clearly distinct from each other in terms of cost and technology solution so that they can be explained easily to non-IT customers. This limits the number of tiers to no more than a handful. A notional set of DR tiers is shown in Table 1.
Tier 1 supports the most aggressive recovery requirements. Solutions tend to be complex and expensive, and they are usually not required except in extreme use cases. Agencies may find they have no use for this tier at all. Tier 2 supports slightly less aggressive recovery objectives and should be reserved for the most critical agency systems. Tiers 3 through 5 may be separated by the readiness of the DR infrastructure and the technology used for data recovery. Given the complexity of the federal acquisitions process, pre-existing agreements with infrastructure vendors may be required.

| Tier | RPO | RTO | Cost | Technology Solution Considerations |
|---|---|---|---|---|
| 1 | Seconds or less | Seconds or less | $$$$$ | Active-active solution across all layers of IT (presentation, application, data, and common infrastructure); usually designed into the application itself |
| 2 | Minutes | Minutes | $$$$ | Active-active across presentation and application layers; automated fail-over for the data layer; data replication performed by the application (rather than storage) |
| 3 | Minutes | Hours | $$$ | Fail-over can be manual; hardware and common infrastructure ready to be fired up at the DR site; SAN level or database level data replication would apply |
| 4 | 1 Day | Days | $$ | Fail-over using backups; hardware and common infrastructure may not be ready at the DR site |
| 5 | Days | Weeks | $ | Fail-over using off-site backups; hardware can possibly be acquired after disaster |

Table 1. Notional DR Tiers
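As a sketch of how Table 1 might be applied, the snippet below picks the least expensive tier that still meets a system's RPO and RTO requirements. The numeric bounds are assumptions introduced here to make the notional labels comparable ("seconds or less" is treated as under a minute, "minutes" as under an hour, and so on); they are not part of the table, and `TIERS` and `cheapest_tier` are hypothetical names.

```python
from datetime import timedelta

# (tier, worst-case RPO, worst-case RTO). The bounds are assumed readings
# of the notional labels in Table 1, not values taken from the table.
TIERS = [
    (1, timedelta(minutes=1), timedelta(minutes=1)),
    (2, timedelta(hours=1), timedelta(hours=1)),
    (3, timedelta(hours=1), timedelta(days=1)),
    (4, timedelta(days=1), timedelta(days=7)),
    (5, timedelta(days=7), timedelta(days=28)),
]


def cheapest_tier(required_rpo, required_rto):
    """Return the highest-numbered (cheapest) tier that meets both objectives."""
    for tier, rpo, rto in reversed(TIERS):
        if rpo <= required_rpo and rto <= required_rto:
            return tier
    raise ValueError("requirement is tighter than Tier 1 as modeled here")


print(cheapest_tier(timedelta(hours=1), timedelta(days=2)))
```

Searching from the cheapest tier upward reflects the cost ordering in the table: a system is placed in a more expensive tier only when a cheaper one cannot meet its objectives.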
It’s all About the Data
Federal IT systems often manage critical and sensitive data of national importance. Safekeeping of that data is crucial. That consideration affects the DR technical solutions in two significant ways.
Data loss: Federal data centers tend to be spread across the country, with the DR site often hundreds of miles away from the primary site. While that distance and the resulting network delays may not cause performance issues for the users of IT systems, they affect DR design significantly, because in data replication milliseconds do matter. Replication tends to be extremely sensitive to physical distance, which usually rules out “zero data loss” (synchronous) replication. Replication at the DR site therefore often lags the primary site, and some data loss should be expected in case of a disaster. DR solutions must account for this.
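A rough calculation shows why distance matters. The sketch below assumes light travels through optical fiber at about 200,000 km per second (roughly two-thirds of its speed in a vacuum) and that a zero-data-loss write must wait for at least one round trip to the DR site before it is acknowledged. The function names are hypothetical, and the result is a lower bound: real network paths are longer than the straight-line distance, and equipment adds further delay.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def min_round_trip_ms(distance_km):
    """Lower bound on the network round-trip time between two sites."""
    return 2 * distance_km / FIBER_KM_PER_MS


def max_serial_writes_per_sec(distance_km):
    """Upper bound on back-to-back writes when each waits for a remote acknowledgement."""
    return 1000.0 / min_round_trip_ms(distance_km)


distance_km = 800  # roughly 500 miles
print(min_round_trip_ms(distance_km))          # 8.0
print(max_serial_writes_per_sec(distance_km))  # 125.0
```

At that distance every acknowledged write carries at least 8 ms of added latency, which caps a single serialized stream at 125 writes per second before any other overhead is counted.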
Data inconsistencies: Because of their complexity, federal systems tend to be highly interdependent, with relevant data spread across multiple systems. After a disaster, it is quite possible for the states of the systems to be inconsistent with one another. Replication of one system may lag another significantly, resulting in differing amounts of data loss. The discrepancy between the systems can cause major issues on fail-over. DR design needs to address such cases by ensuring that the appropriate processes and tools are in place.
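One way to reason about such discrepancies is to compare the last point in time to which each system was replicated. The sketch below is a simplified illustration: the system names and timestamps are hypothetical, and it assumes each system can report a single "replicated up to" time. The earliest of those times is the latest point at which all systems agree; anything a system holds beyond it has no counterpart elsewhere and needs reconciliation.

```python
from datetime import datetime, timedelta


def reconciliation_gaps(last_replicated):
    """Return the common consistent point and how far each system is ahead of it."""
    common = min(last_replicated.values())
    return common, {name: ts - common for name, ts in last_replicated.items()}


# Hypothetical replication state at the DR site when the disaster struck.
state = {
    "payments": datetime(2014, 4, 24, 10, 0, 0),
    "ledger": datetime(2014, 4, 24, 9, 58, 30),
    "reporting": datetime(2014, 4, 24, 9, 45, 0),
}
common, gaps = reconciliation_gaps(state)
print(common)
for name, gap in gaps.items():
    print(name, gap)
```

In this example the payments system holds 15 minutes of data that the reporting system never received, which is the kind of gap the fail-over processes and tools need to handle.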
Cloud-Based DR?
At first blush, cloud-based DR looks appealing: instead of having the IT infrastructure sit idle waiting for a disaster to occur, one could leverage the rapid elasticity characteristic of cloud computing and access a pre-developed cloud-based infrastructure after the disaster. This could result in cost savings by avoiding capital expenditure.
On closer scrutiny, this appeal begins to wane. If the cloud is private, the agency still needs to build and maintain the cloud infrastructure for DR. While the resources can be used for other purposes during normal operations, the savings would not be as high as anticipated. If the cloud is public or community, other problems arise: is the agency ready to deploy applications to a public cloud? Cloud deployments in general require many preparatory steps, including pre-existing contracts, trusted connectivity to the cloud provider site, operations and maintenance (O&M) process integration, and security certification per agency policy, including a FedRAMP certification, which is required for all federal agency public cloud deployments. [The Federal Risk and Authorization Management Program (FedRAMP) is a unified, government-wide risk management program focused on large outsourced and multi-agency systems. FedRAMP has been established to provide a standard approach to assessing and authorizing cloud computing services and products.]
The questions only compound at the individual system level. What are the system security requirements? As of now, FedRAMP covers only Federal Information Security Management Act (FISMA) low- and moderate-rated environments. What is the system’s infrastructure platform? Public and community cloud providers often restrict the choices of technology; for instance, the operating systems are typically limited to Linux and Windows. Are the storage systems compatible between the agency data center and the cloud provider? What are the system RPO and RTO requirements? The more aggressive the recovery objective, the more resources need to be pre-deployed in the cloud. Replicated data requires that storage be pre-allocated in the cloud. Automated fail-over requires that servers be pre-configured in the cloud. A high degree of process coordination is required to meet aggressive RTO and RPO requirements. Network connection modifications typically require long lead times, so they need to be in place in advance. The costs of these items whittle away at the basic appeal of cloud-based DR. In general, cloud-based DR may have a better economic appeal for systems that have less aggressive recovery objectives.

| DR Tiers: | 1 | 2 | 3 |
|---|---|---|---|
| System Class X | | | |
| Presentation | Active/Active Web Servers | Passive Web Servers | Must be purchased |
| Application | Application Clusters | VM replication | Must be purchased |
| Data | Database/SAN Replication | SAN Replication | Tape Backup |
| Common Inf. | Live DR site | Live DR site | Must be purchased |
| System Class Y | | | |
| Presentation | n/a | Passive Web Servers | Must be purchased |
| Application | n/a | Passive App Servers | Must be purchased |
| Data | n/a | SAN Replication | Tape Backup |
| Common Inf. | n/a | Live DR site | Must be purchased |

Table 2. DR Solution Example
One-Size DR Technology Doesn’t Fit All
Given the wide range of available DR technologies and the diverse requirements of a federal IT infrastructure, a holistic enterprise-wide approach to DR technologies is warranted.
The first step is to classify systems so that a standard DR solution set can be proposed for each class and recovery tier. A system class can be defined as a feature or features of the system that are significant in a DR context, such as hardware platform, or even support organization. For example, the classes could be “Class 1: SPARC-based” or “Class 2: Mainframe.” Such a classification system can also be used to reinforce agency policies on platform choices for applications. A small number of classes is recommended to keep the solution set manageable.
The next step is to constrain the DR technology space in terms of the defined system classes. A complete DR solution consists of a layered selection of technologies that collectively meet the DR requirements of the class. Technologies can be categorized using the following four-layer model:
- Presentation – Technologies that allow users to access applications. These focus on security, Web access, or content delivery.
- Application – Technologies that implement the IT system business rules and processes.
- Data – Technologies that focus on data recovery and replication: database or application-based, host-based, storage-based.
- Common Enterprise Infrastructure – Technologies or DR considerations related to common services that support multiple applications, such as network services, firewalls, load balancers, Domain Name Service (DNS), and basic data center facilities.
Candidate technologies at each layer can then be organized in terms of the system class that they support. As the technologies are organized, a complete DR solution develops that meets the requirements for each system class. A conceptual example, which is not intended to be a technology recommendation, is shown as Table 2.
The result is a DR solution matrix that provides a standard set of enterprise-level solutions while still meeting the needs of a diverse enterprise architecture.
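A solution matrix of this kind maps naturally onto a lookup keyed by system class and tier. The sketch below encodes the conceptual entries of Table 2; the names `LAYERS`, `MATRIX`, and `dr_solution` are hypothetical, and, like the table itself, the entries are not technology recommendations. Combinations marked n/a in the table are simply left out of the matrix.

```python
LAYERS = ("Presentation", "Application", "Data", "Common Inf.")

# (system class, tier) -> one technology per layer, in LAYERS order.
MATRIX = {
    ("X", 1): ("Active/Active Web Servers", "Application Clusters",
               "Database/SAN Replication", "Live DR site"),
    ("X", 2): ("Passive Web Servers", "VM replication",
               "SAN Replication", "Live DR site"),
    ("X", 3): ("Must be purchased", "Must be purchased",
               "Tape Backup", "Must be purchased"),
    ("Y", 2): ("Passive Web Servers", "Passive App Servers",
               "SAN Replication", "Live DR site"),
    ("Y", 3): ("Must be purchased", "Must be purchased",
               "Tape Backup", "Must be purchased"),
}


def dr_solution(system_class, tier):
    """Return the standard layer-by-layer solution for a class and tier."""
    technologies = MATRIX.get((system_class, tier))
    if technologies is None:
        raise LookupError(f"no standard solution for class {system_class}, tier {tier}")
    return dict(zip(LAYERS, technologies))


print(dr_solution("X", 2)["Data"])  # SAN Replication
```

Keeping the matrix as data rather than logic means a new class or tier is added by extending the table, which matches the goal of a small, standard solution set.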
Making the Solutions Stick
Given the sprawling nature of the federal IT enterprise, every opportunity must be taken to ensure that the DR solutions are adopted by all. Once DR tiers are developed for the enterprise and the DR technology solutions determined, they need to be communicated as widely as possible and integrated tightly into the IT processes.
DR tiers and solutions should be adopted as an agency standard in the agency technology reference guide, if one exists. Compliance with such guides is often required of contractors.
DR solutions should be embedded in the agency Software and System Design Life-Cycles (SDLC). BIAs should be part of the business requirements development, and a standard set of DR requirements should be established for use by the business. DR tier determination should be part of the system design. If the application design needs to play a significant role in the DR solution, software development should include DR features, including approaches to reconcile lost data with other systems. Testing should include making sure that the RPO and RTO requirements of the application can be met.
DR solutions should also be embedded in the agency O&M processes. Changes made to the production configuration should result in corresponding changes in the DR environment, consistent with the DR solutions. Incident and problem management take on a new significance with DR: there need to be detailed processes to fail systems over to the DR site, including data reconciliation with other systems; to operate and maintain the systems in the DR state; and to reconstitute the primary site after the disaster and fail systems safely back. The O&M plan also needs to include regular, rigorous DR testing with realistic situations.
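Keeping the DR environment in step with production lends itself to an automated check. The following is a minimal sketch, assuming each environment's configuration can be exported as a flat set of key-value settings; the setting names are hypothetical, and a real comparison would need to exclude values that are expected to differ between sites, such as site-specific addresses.

```python
def config_drift(production, dr):
    """Return settings whose values differ, as {key: (production value, DR value)}.

    A setting missing from one side appears with None for that side.
    """
    drift = {}
    for key in sorted(production.keys() | dr.keys()):
        prod_value, dr_value = production.get(key), dr.get(key)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


# Hypothetical exported settings for one system.
production = {"app_version": "4.2.1", "db_patch_level": "11", "tls_min": "1.2"}
dr_site = {"app_version": "4.2.0", "db_patch_level": "11"}

for key, (prod_value, dr_value) in config_drift(production, dr_site).items():
    print(f"{key}: production={prod_value} dr={dr_value}")
```

A check like this could run as part of change management, so that a production change without a matching DR change is flagged before the next DR test rather than during a fail-over.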
Bringing it All Together
Successful DR requires that every link in the chain work effectively, starting with the BIA and continuing through DR solution design, implementation of the DR solution through standards and the SDLC, and effective embedding of DR in the O&M processes. Failure of any one piece is sufficient to cause major issues. These steps become ever more important in the federal IT enterprise, given the complexities of federal IT.
Mano Malayanur is a principal systems engineer at the MITRE Corporation, a not-for-profit company that operates multiple federally funded research and development centers. His areas of focus are infrastructure engineering, operations, and management.
Julia Nathan is a lead systems engineer at the MITRE Corporation. Her areas of focus are infrastructure engineering and human factors.
The authors’ affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the authors.