
Disaster recovery principles

Authors: StevenBeard, DanToczala, RalphSchoon
Build basis: None

High availability (HA) and disaster recovery (DR) are both related aspects of Wikipedia: Business continuity planning.

Disaster recovery is the process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster - Wikipedia: Disaster recovery.

This topic focuses on a major failure of the primary data center that requires either failover to a secondary disaster recovery data center or a fundamental rebuild of the primary data center.

The related High availability principles topic focuses on failure scenarios and recovery within the primary data center.

This topic outlines the different principles of DR that you should consider when designing a Rational development environment. It is critical to consider DR from the outset, because the design of the environment itself will constrain the DR solution. DR that is developed as an afterthought may require significant rework of the environment or result in a suboptimal solution. However, the first thing to consider is your organization's real requirements for DR, based on your business and technical needs and on the requirements for the environment itself.

Recovery point objective (RPO)

Recovery point objective (usually measured as a specific span of time) is the measure of the most data an organization can afford to lose in a failure. An RPO of one hour means that the organization can always revert to a restore point that is no more than one hour old, which may imply that backups are executed every hour. In practice, a software factory may have an RPO of one day, meaning that data is backed up every night at a specific time.

Recovery time objective (RTO)

Recovery time objective (measured in time) is the measure of how long it takes an organization to restore service through high availability, disaster recovery, or any combination of the two. An RTO of 15 minutes means that the organization can restore its environment within 15 minutes. In practice, most software factories measure their RTO in hours; surprising as it may seem, some organizations measure RTO in days.
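As a rough check, the relationship between backup frequency, restore time, and these two objectives can be sketched in Python. This is a minimal illustrative model, not part of any Rational tooling; the function name and the simple worst-case assumption (a failure just before the next backup) are my own.

```python
from datetime import timedelta

def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Check a simple periodic-backup scheme against RPO/RTO targets.

    Worst case, a failure happens just before the next backup starts,
    so the data written during one whole interval is lost; recovery
    time is dominated by how long a restore takes.
    """
    worst_case_loss = backup_interval
    return worst_case_loss <= rpo and restore_duration <= rto

# Nightly backups with a 4-hour restore satisfy a 24h RPO / 6h RTO:
print(meets_objectives(timedelta(hours=24), timedelta(hours=4),
                       rpo=timedelta(hours=24), rto=timedelta(hours=6)))  # True
# ...but not a 1-hour RPO:
print(meets_objectives(timedelta(hours=24), timedelta(hours=4),
                       rpo=timedelta(hours=1), rto=timedelta(hours=6)))   # False
```

The model makes the trade-off explicit: tightening the RPO forces more frequent backups (or continuous replication), while tightening the RTO constrains restore mechanics rather than backup frequency.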


Disaster recovery levels for a development environment

Note: This subsection is derived from the IBM Systems Magazine article, Disaster Recovery levels, by Robert Kern and Victor Peltz (November 2003).

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

An administration team that is developing a disaster recovery plan for a development environment must weigh the need to recover quickly and completely against the cost to implement the recovery. The impact to the I/O performance of the environment should be considered, as well as the environment recovery time objective (RTO). In other words, how much time is available to fully recover the environment with all critical development operations up and running again? Another important factor to consider is the environment's recovery point objective (RPO): How much data is lost, or at what actual recovery point-in-time (RPiT) is all data current?

Cross-volume data integrity and consistency groups

Computers must write data to disks with full integrity, even in the event of hardware failures and power failures. To accomplish this, environment designers employ many techniques, such as:

  • Mirrored storage subsystem cache to prevent data loss in the event of a cache hardware failure
  • Battery backup to prevent cache data loss in the event of a power failure
  • Mirrored disk or parity-based RAID schemes for protecting against hard-disk drive failures

Another, more subtle, requirement for preserving the integrity of data being written is making sure that "dependent writes" are executed in the application's intended sequence. Many years ago, application developers devised various dependent-write sequences to preserve data integrity and consistency for data being written to disk across power failures. Consider this typical sequence of writes for a database update transaction:

  1. Execute a write to update the database log, indicating that a database update is about to take place.
  2. Execute a second write to update the database.
  3. Execute a third write to update the database log, indicating that the database update has completed successfully.

These "dependent writes" must be written to remote mirrored disk in the same sequence in which the application issued them. In the previous example, there are no guarantees that the database log and the database are on the same storage subsystem. Failure to execute the write sequence correctly may result in writes (1) and (3) being executed, followed immediately by a system failure. When it is time to recover the database, the database log would incorrectly indicate that the transaction completed successfully. The transaction would be lost, and the integrity of the database would be questionable.
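The three-step dependent-write sequence, and the recovery check it enables, can be sketched as follows. This is an in-memory toy model with hypothetical names, not a real database engine; it only illustrates why the write order matters.

```python
def update_with_dependent_writes(log, database, key, value):
    """Apply one database update as the three dependent writes above:
    log the intent, write the data, then log the completion.
    Reordering these writes is exactly the failure mode described
    in the surrounding text."""
    log.append(("BEGIN", key))    # write 1: update is about to happen
    database[key] = value         # write 2: the update itself
    log.append(("COMMIT", key))   # write 3: update completed

def suspect_transactions(log):
    """On restart, trust an update only if its COMMIT record reached
    the log; a BEGIN without a matching COMMIT marks a transaction
    that must be rolled back or repaired."""
    begun = {key for op, key in log if op == "BEGIN"}
    committed = {key for op, key in log if op == "COMMIT"}
    return begun - committed

log, db = [], {}
update_with_dependent_writes(log, db, "balance", 100)
print(suspect_transactions(log))                      # set(): nothing to repair

# A crash between writes (1) and (3) leaves only the BEGIN record:
print(suspect_transactions([("BEGIN", "balance")]))   # {'balance'}
```

If replication delivered writes (1) and (3) but not (2), this recovery check would wrongly trust the transaction, which is precisely why the mirrored copy must preserve the original write order.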

When considering the RTOs and RPOs in any disaster-recovery solution involving data replication, it is critical to understand the need for cross-volume data integrity and data consistency. Essential elements for creating cross-volume data integrity and data consistency include the ability to:

  • Create RPiT copies of the data as necessary
  • Provide a site "Data Freeze," causing all data at the remote site to be consistent with an RPiT
  • Use consistent timestamps across all write updates to order all writes at the remote site
  • Create data set/file consistency groups

Cross-volume data integrity and data consistency enable database RESTARTs if the second copy of the data is actually used. Solutions that employ cross-volume mirroring and remote-disk mirroring must address the issue of data consistency to support cross-volume and cross-storage subsystem data integrity.
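The timestamp-based consistency point implied by the list above can be sketched as follows. The function and data are hypothetical illustrations; real storage subsystems implement this logic in firmware and replication microcode.

```python
def consistency_point(replicated_writes):
    """Given, per remote volume, the timestamps of writes that have
    arrived so far, the latest consistent recovery point is the
    minimum of the per-volume high-water marks: every volume is
    guaranteed to hold all writes at or before that instant."""
    return min(max(timestamps) for timestamps in replicated_writes.values())

# Hypothetical state: the log volume has writes replicated up to t=7,
# but the data volume only up to t=5, so the site must "freeze" at t=5.
replicated = {"db_log": [1, 3, 7], "db_data": [2, 5]}
print(consistency_point(replicated))   # 5
```

A "Data Freeze" at that point discards the log write at t=7 on the remote copy, keeping the log and data volumes mutually consistent so the database can RESTART rather than require a full RECOVER.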

Most customers, when designing a multi-site solution, must minimize the time it takes to restart applications once the data at the secondary site has been recovered.

Tiers of multi-site service availability

In the late 1980s, the SHARE Technical Steering Committee, working with IBM, developed a white paper that described levels of service for disaster recovery using Tiers 0 through 6. Since then, a number of businesses using IBM zSeries have moved toward an IBM TotalStorage solution called the Geographically Dispersed Parallel Sysplex, which allows an installation to manage end-to-end application and data availability across multiple geographically separate sites. This resulted in an additional seventh tier representing the industry's highest level of availability driven by technology improvements.

  • Tier 0: No disaster recovery - Most customers today understand the need for disaster recovery of their development environments, as well as the need for backup of critical data. However, Tier 0 is still common in practice, because too many organizations do not properly test their disaster recovery, resulting in it failing in the event of a disaster. Also, the design and implementation of disaster recovery is often deferred until later, resulting in a poor solution or one that is never implemented.

  • Tiers 1 and 2: Physical transport - The majority of today's customers use a traditional method of creating tapes nightly and transporting them to a remote site overnight. Tier 1 users send the tapes to a warehouse or "cold" site for storage. Tier 2 users send the tapes to a "hot" site where the tapes can be quickly restored in the event of a disaster.

    Various schemes have been developed to improve the process of offloading data nightly from production sites and production volumes to tapes. Some of these solutions provide full integration with various databases (Oracle, DB2 SQL, etc.). Here are some of the names created to describe these off-line backup solutions:
    • Server-less backup
    • LAN-less backup
    • Split mirroring
    • SNAP/SHOT*

      Hardware vendors have created products to fit into this marketplace. For example, the IBM Enterprise Storage Server (ESS) FlashCopy function provides this capability and, when coupled with ESS disk mirroring solutions, can create a RPiT copy of data within the same ESS logical storage subsystem without impacting applications.

  • Tier 3: Electronic vault transport - This is usually achieved by copying the tape from the primary site directly into a tape storage subsystem located at the remote secondary site. This replaces the need to physically transport tapes, at the cost of the added network bandwidth.

  • Tier 4: Two active sites with application software mirroring - Various database-, file-system-, or application-based replication techniques have also been developed to replicate current data to a second site, but these techniques are limited to the data contained in the particular database or file system for which they were designed. An example in the open systems world is software mirroring at the file-system level. If all of your data resides within the file system, these techniques can be a fast and efficient method of replicating data locally or remotely. Software-based file-system mirroring can also be fully integrated with various host-based server clustering schemes, such as AIX High Availability Geographic Cluster (HAGEO). Host failover causes the software mirroring to fail over as well.

  • Tier 5: Two-site, two-phase commit - This technique is specific to the database and its configuration used in conjunction with the application environment. Various databases provide specific data replication of database logs, coupled with programs to apply the log changes at the secondary site. Typically, one only gets data consistency within the specific database, and transactions across multiple databases are not supported.

  • Tier 6: Disk and tape storage subsystem mirroring - This technique includes two types of mirroring:
    • Disk mirroring - Disk mirroring is popular because it can be implemented in the storage subsystem and, as a result, is independent of the host applications, databases and file systems that utilize the storage subsystem.
    • Tape mirroring - You can mirror tape data via various hardware and software solutions. Typically this data is non-critical but still needs to be recovered in the event of a disaster that prevents moving back to the primary site for an extended period of time.

  • Tier 7: IBM GDPS - Use GDPS to implement the highest level in the multi-site availability hierarchy. GDPS combines software and hardware to manage a complete, automatic switch of all resources from one site to another, providing continuous operations as well as disaster recovery support for both planned and unplanned outages.

Disaster recovery quality-of-service

Most organizations typically consider four different disaster recovery quality-of-service (QoS) levels for their enterprise:

  • Platinum: RPO = seconds, RTO = under two hours. Delivered by a Tier 7 infrastructure like GDPS.
  • Gold: RPO = two hours, RTO = six hours. Delivered by a Tier 4/5 database log mirror/log apply solution.
  • Silver: RPO = 24 hours, RTO = 48 hours. Delivered by an RPiT backup capability using a Tier 1-3 approach.
  • Bronze: RPO = 24 hours, RTO = 48 hours. Delivered by nightly backups for the purpose of disaster recovery using a Tier 1-3 approach.

Most development environments require bronze or silver disaster recovery QoS. If the operational systems that the development environment supports are business critical and may require timely development changes, a gold QoS may be required. Only the very few development environments that support nationally critical operational systems, or massive national or international banking systems, will ever require platinum QoS.
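The QoS table above can be turned into a simple lookup, shown here as an illustrative sketch. Two assumptions are mine: the platinum RPO of "seconds" is approximated as 0.01 hours, and because silver and bronze share the same objectives in the table, the sketch returns bronze first.

```python
LEVELS = [                 # (name, max RPO in hours, max RTO in hours)
    ("bronze",   24,   48),
    ("silver",   24,   48),   # same objectives as bronze in the table
    ("gold",      2,    6),
    ("platinum",  0.01, 2),   # "seconds" approximated as 0.01 hours
]

def cheapest_qos(required_rpo_h, required_rto_h):
    """Return the least demanding QoS level whose objectives satisfy
    the stated requirement (levels are ordered cheapest first)."""
    for name, rpo_h, rto_h in LEVELS:
        if rpo_h <= required_rpo_h and rto_h <= required_rto_h:
            return name
    return None   # no listed level is strict enough

print(cheapest_qos(24, 48))   # bronze
print(cheapest_qos(4, 8))     # gold
print(cheapest_qos(0.5, 2))   # platinum
```

Iterating cheapest-first mirrors the guidance in the text: start from bronze or silver and only move up the tiers when the operational systems genuinely demand it.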

The primary question you should ask is: "How long can my operational systems go without being able to make a critical development change?" Further, does this require the whole development environment or just part?

Secondary questions that might contribute to deciding the required QoS level include:

  • How much in lost productivity does it cost your organization to have the development environment down per hour/day?
  • What are the implications of losing some of your development data, depending upon the RPO of your QoS level (quality and/or compliance)?
  • Would a limited development environment allow you to make critical development changes for operational systems, while you recover the development environment more slowly?
  • What is the balance between the cost of your disaster recovery solution and the resultant cost due to different QoS levels?

Disaster recovery considerations

The following subsections cover the major considerations that a Jazz administrator will need to account for when creating a disaster recovery architecture and procedures.

Minimize unplanned outages: Use SAN or NAS technology

The Jazz solution configuration should use storage area network (SAN) or network-attached storage (NAS) technologies if possible, because these technologies provide a degree of fault tolerance for disk storage resources. The Jazz repositories should reside on these resources. If virtualization is used for the Jazz servers, the virtual machine (VM) images should also be stored on this media.

In addition, NAS solutions offer utilities for doing "offline backups" by breaking the disk mirror and then backing up the "offline" copy of the disk storage. These systems also provide the capability to store these backups for quick restoration, as well as utilities to manipulate them. Well-prepared Jazz installations also make sure that these backups are moved offsite at some interval, to mitigate the risk of data loss should the data from an entire site be destroyed.

Important: The enterprise Jazz infrastructure is expected to support the software development needs of an organization, and the consequences of losing these software assets are quite severe. Use of SAN and NAS technologies can help mitigate this risk and allows for backup of the Jazz-based artifacts with a minimum amount of downtime. While NAS may impact performance compared to a SAN implementation, it provides hot swapping and is more reliable.

Use the best available database backup technologies

Depending on the database vendor chosen, different database backup utilities are available. If you use IBM DB2, on-line backups are available and provide a way to keep quality backups with minimal interruption. See the DB2 documentation for details on enabling on-line backups; some specific settings are required in the area of log file management. Even with a solid backup story, scheduling occasional downtime for a full backup is wise. On Jazz.net, on-line database backups are taken nightly in conjunction with transaction logs, which allows restoration up to the last committed transaction. Off-line backups occur only when migrations or host system configuration changes are completed.

When you use an Oracle database, configure it for ARCHIVELOG mode. You can use the Oracle RMAN utility or Enterprise Manager to take backups and schedule them periodically in conjunction with archive logs. The same tools provide functionality to restore the database in case of failure. Other components, such as instance configuration settings, might also need to be backed up in order to restore the database to exactly its original configuration.

In addition to backing up your Jazz repository databases, back up the data that defines your application server configuration or configurations. The key items to back up from the Jazz application servers include: the jazz.war file, the workitemindex directory, the teamserver.properties file, the profile.ini file, the update-site directory, and the conf directory.

Also back up the following files from your web server: server.xml, web.xml, tomcat-users.xml (Tomcat), or the WebSphere profiles when using WebSphere Application Server. WebSphere Application Server provides a backup utility for doing this.
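A minimal sketch of collecting those configuration items into a single archive might look like this. The item names follow the lists above, but the layout under `root` is hypothetical and must be adjusted to your installation; `tarfile` is used purely to illustrate the step, not as a recommended enterprise backup tool.

```python
import tarfile
from pathlib import Path

# Item names from the lists above; locations are installation-specific.
CONFIG_ITEMS = [
    "jazz.war",
    "workitemindex",
    "teamserver.properties",
    "profile.ini",
    "update-site",
    "conf",
]

def backup_config(root, archive_path, items=CONFIG_ITEMS):
    """Archive whichever of the listed configuration items exist
    under root into a compressed tar, preserving their names so
    they can be restored onto a DR server as-is."""
    archived = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in items:
            path = Path(root) / item
            if path.exists():          # skip items this server lacks
                tar.add(path, arcname=item)
                archived.append(item)
    return archived
```

Run against the server installation root, this produces one artifact that can be shipped offsite alongside the database backups, so a DR server can be rebuilt from the pair.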

Tip: Deploy the Jazz repositories on an enterprise-ready database technology. Use the functionality and features available with that technology to best minimize downtime for backup and restore operations. Use standard enterprise backup procedures and policies with respect to backup media and offsite storage. Test these technologies and procedures, and document the specific approaches and procedures in place in your environment. For details on database backup technologies and concerns, see the Deployment Guide for Jazz Team Server and the Getting Started with Rational Team Concert 3.x: A Deployment Guide.

Utilizing redundancy and virtualization

Some High Availability (HA) configurations that allow automatic server failover are not currently fully supported for Jazz-based solutions. If you are not able or willing to deploy a supported HA configuration, use redundancy in server resources, in separate physical locations, to help mitigate the risk of loss of service due to the loss of communications with a Jazz server.

Plan A: Designated Jazz DR servers

A set of Jazz Application and Jazz Web DR servers can provide a warm backup for a server that has lost connectivity. The Jazz application or Jazz web server can be configured and ready to take over operations with a minimum of change. The server can be preconfigured with the proper installed software, with only the files called out in the previous section needing to be restored before the DR server can be brought online.

The server designated as the Jazz repository DR server will need a different set of backup and restore criteria, because this is the database server that stores the data. Due to the amount of data that needs to be restored, restoring a Jazz repository server will take much longer.

Tip: Using Jazz DR servers requires that some additional hardware be kept ready at all times, and therefore underutilized. Some organizations will be hesitant to spend money on underutilized resources, but this should be seen as a way to mitigate the risks associated with the loss of software development artifacts, and the loss of time while systems are not operational.

Plan B: Designated virtual machines

Using Jazz application and Jazz web servers in virtual machines gives an organization the ability to quickly bring additional capacity online. The use of virtual hardware to host these resources allows an organization to quickly and accurately bring up exact replicas of resources lost to disaster situations.

Using Jazz repository servers hosted on virtual machines allows for some of the same benefits; however, the rapid pace of software development data changing within the repositories means that some sort of restoration of the databases will need to be done. Loss of performance must be measured against the DR considerations when deciding on a Jazz solution architecture.

Tip: The use of virtual hardware for hosting the Jazz application and Jazz web components can benefit the disaster recovery strategy of an organization without a large impact on performance. Recovery from the loss of a Jazz repository server can be provided by a mix of database replication technologies and virtualization technologies. The relative risks, consequences, and performance impacts of an organization's decision need to be carefully weighed.

Disaster recovery scenarios

The following list contains common disaster recovery scenarios. They are included here as a way to help identify the common procedures and processes needed for a proper disaster recovery plan.

Disk is lost

Disk failure can cause loss of data, and users will notice that the Jazz applications and Jazz server seem either unresponsive or unable to store changes (when the information returned to the user comes from the server cache). Users who are developing in Eclipse workspaces will be able to continue their work on software development assets, but no data will be stored to the Jazz repositories.

Risk assessment: The use of SAN and NAS technologies can help reduce the risk of a single disk failure causing an unplanned outage. Expect the risk to be on par with losing a power supply.

Network is lost

The LAN is offline. Users will notice that all applications (including Jazz applications) are unresponsive. Users that are developing in Eclipse workspaces will be able to continue their work on software development assets, but no data will be stored to the Jazz repositories.

Risk assessment: Often, networks may experience significant slowdown, but they are rarely offline. If there is a network outage, it impacts all users and will be handled by IT support as a top priority problem. No special procedures by the Jazz administrators are warranted.

Server is lost

The Jazz application, Jazz web, or Jazz repository (database) server is lost. DR procedures will focus on getting users back to work using the existing infrastructure. Short-term solutions will optimize accessibility over performance.

Risk assessment: A lost server is probably the largest risk, because it is the event most likely to happen. Loss of the Jazz web or Jazz application servers can be responded to more quickly. Loss of the Jazz repository server will take longer, because the databases will need to be restored.

Site is lost / silo is lost

In this scenario, a site is lost or inaccessible. Recovery procedures will follow those of the server lost scenario above, but require the databases to be restored from offsite backup. A lost site will likely entail loss of ALL Jazz server resources, and some data loss.

Risk assessment: In case of a lost site (think fire, flood, tornado), re-establishing the development environment is not a Day 1 priority.

Disaster recovery process

Any disaster recovery process consists of the following phases. This section outlines the high-level steps needed for proper disaster recovery. A separate step-by-step disaster recovery document should be made available so that administrators who are unfamiliar with Jazz can also perform these operations. The last subsection highlights the need to exercise these processes on a regular basis to ensure that they provide adequate mitigation of the risks associated with loss of the Jazz systems and their data.

Some good examples of disaster recovery capabilities and processes can be seen in the Jazz Team Server Backup Details and the RRC Backup and Restore documents on the Jazz wiki site.

Notification/activation phase

During this phase, the Jazz administrator becomes aware of a loss of service. In some scenarios, this is detected electronically, and automated processes are kicked off. In other scenarios, the notification is more manual in nature.

After the Jazz administrator is aware of a potential loss of service, other impacted parties need to be notified immediately. At this point, the Jazz administrator and other stakeholders will need to assess and identify the problem, and the correct DR procedures need to be identified and executed.

  • Notification procedures

  • Damage assessment

  • Plan activation

Recovery phase

During this phase, recovery assets are put into place and the disaster recovery procedures are completed. Recovery refers to the recovery of service for the end users and stakeholders of the Jazz infrastructure.

  • Sequence of recovery activities

  • Recovery procedures

Reconstitution phase

After service has been restored, possibly at reduced performance or capacity, the original issue must be addressed. After the cause of the loss of service has been identified and addressed, plans for moving back to the original production systems must be made and executed. This resumption of normal operations should occur with as little impact as possible to the Jazz user community.

  • Primary site recovery

  • Primary site replacement

Quarterly recovery drills

Several staff members should be trained and practiced in disaster recovery procedures. A regular disaster recovery drill enforces the training and verifies that the infrastructure and procedures are working and up-to-date.

The process to back up the repositories requires that the Jazz server be shut down in order to ensure that database integrity is maintained. Full backups, as opposed to incremental backups, should be performed, in order to ensure database integrity. After the backup is completed, the Jazz server processes can be restarted.
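The shut-down, full-backup, restart discipline described above can be sketched as a context manager that guarantees the restart happens even if the backup fails partway through. The `stop` and `start` callables are hypothetical placeholders for your actual service control scripts.

```python
from contextlib import contextmanager

@contextmanager
def server_stopped(stop, start):
    """Keep the Jazz server down for the duration of the backup and
    guarantee it is restarted afterwards, even if the backup fails."""
    stop()
    try:
        yield
    finally:
        start()   # always restart, whether the backup succeeded or not

# Illustrative run with stand-in callables recording the order of events:
events = []
with server_stopped(lambda: events.append("stop"),
                    lambda: events.append("start")):
    events.append("full-backup")
print(events)   # ['stop', 'full-backup', 'start']
```

Wrapping the backup this way prevents the common drill failure where an aborted backup leaves the server offline until someone notices.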

Use of NAS technologies, or high availability database technologies, can help minimize the downtime associated with periodic backups of the Jazz repositories.


Related topics: Back up the Rational solution for Collaborative Lifecycle Management, High availability principles, Approaches to implementing high availability and disaster recovery for Rational Jazz environments

External links:

Additional contributors: None
