High availability (HA) and disaster recovery (DR) are related aspects of business continuity planning (see Wikipedia: Business continuity planning).
Disaster recovery is the process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster - Wikipedia: Disaster recovery.
This topic focuses on a major failure of the primary data center that requires either failover to a secondary disaster recovery data center or a fundamental rebuild of the primary data center.
The related High availability principles topic focuses on failure scenarios and recovery within the primary data center.
This topic outlines the principles of DR that you should consider when designing a Rational development environment. Consider DR from the outset, because the design of the environment constrains the DR solution. DR that is developed as an afterthought may require significant rework of the environment or result in a suboptimal solution. First, however, establish your organization's real requirements for DR, based on your business and technical needs and on the requirements for the environment itself.
Recovery point objective (RPO, measured in time) is the maximum amount of recent data, expressed as a time window, that an organization can afford to lose in a failure. An RPO of 1 hour means that the organization can always revert to a restore point that is no more than one hour old. This may suggest that the organization executes backups every hour. In practice, a software factory may have an RPO of one day, which may mean that data is backed up every night at a specific time.
Recovery time objective (RTO, measured in time) is the measure of how long it takes an organization to restore services through high availability, disaster recovery, or a combination of the two. An RTO of 15 minutes means that an organization can restore its environment in 15 minutes or less. In practice, most software factories measure their RTO in hours. Surprising as it may seem, some organizations measure RTO in days.
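As a rough illustration of how these two objectives can be checked against an actual backup schedule, the following Python sketch compares a backup interval with an RPO and a measured restore time with an RTO. The function names and the optional `backup_duration` parameter are assumptions made for this example; they are not part of any Jazz or Rational tooling.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Age of the newest usable restore point in the worst case.

    Assumption: a failure can strike just before the next backup completes,
    so the newest usable restore point is one interval (plus one backup run,
    if the duration is supplied) old.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the backup schedule never loses more data than the RPO allows."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


def meets_rto(measured_restore_time: timedelta, rto: timedelta) -> bool:
    """True if a restore measured in a DR drill fits within the RTO."""
    return measured_restore_time <= rto


nightly = timedelta(days=1)
print(meets_rpo(nightly, rpo=timedelta(days=1)))   # True
print(meets_rpo(nightly, rpo=timedelta(hours=1)))  # False
print(meets_rto(timedelta(hours=3), rto=timedelta(hours=4)))  # True
```

Passing a non-zero `backup_duration` makes the check stricter, which can matter when a backup run takes a noticeable fraction of the interval.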
Note: This subsection is derived from the IBM Systems Magazine article, Disaster Recovery levels, by Robert Kern and Victor Peltz (November 2003).
An administration team that is developing a disaster recovery plan for a development environment must weigh the need to recover quickly and completely against the cost to implement the recovery. The impact to the I/O performance of the environment should be considered, as well as the environment recovery time objective (RTO). In other words, how much time is available to fully recover the environment with all critical development operations up and running again? Another important factor to consider is the environment's recovery point objective (RPO): How much data is lost, or at what actual recovery point-in-time (RPiT) is all data current?
Computers must write data to disks with full integrity, even in the event of hardware failures and power failures. To accomplish this, environment designers employ many techniques, such as:
Another, more subtle, requirement for preserving the integrity of data being written is making sure that "dependent writes" are executed in the application's intended sequence. Application developers long ago devised dependent write sequences to preserve data integrity and data consistency for data written to disk across power failures. Consider this typical sequence of writes for a database update transaction:
These "dependent writes" must be written to remote mirrored disk in the same sequence in which the application issued them. In the previous example, there are no guarantees that the database log and the database are on the same storage subsystem. Failure to execute the write sequence correctly may result in writes (1) and (3) being executed, followed immediately by a system failure. When it is time to recover the database, the database log would incorrectly indicate that the transaction completed successfully. The transaction would be lost, and the integrity of the database would be questionable.
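The ordering problem can be made concrete with a small Python model. It assumes a three-write transaction in which write 1 records the intent in the database log, write 2 updates the database, and write 3 records completion in the log. That reading is inferred from the failure case described above, and the record layout and function names are hypothetical.

```python
# Hypothetical write records: (sequence number, target volume, action).
WRITES = [
    (1, "log", "intent"),       # log: an update is about to take place
    (2, "database", "update"),  # the data change itself
    (3, "log", "complete"),     # log: the update completed successfully
]


def mirror_state(applied):
    """Return what the remote copy holds after the given writes were applied."""
    state = {"log": [], "database": []}
    for _, target, action in applied:
        state[target].append(action)
    return state


def is_consistent(state) -> bool:
    """The log may claim completion only if the database holds the update."""
    return "complete" not in state["log"] or "update" in state["database"]


# A failure after any in-order prefix leaves a copy that can be restarted.
for count in range(len(WRITES) + 1):
    print(count, is_consistent(mirror_state(WRITES[:count])))  # always True

# Writes 1 and 3 reach the mirror, then the system fails before write 2.
print(is_consistent(mirror_state([WRITES[0], WRITES[2]])))  # False
```

The model shows why the mirror must preserve the issuing order across volumes: every in-order prefix is recoverable, while the out-of-order case leaves a log that claims a completed transaction the database never received.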
When considering the RTOs and RPOs in any disaster-recovery solution involving data replication, it is critical to understand the need for cross-volume data integrity and data consistency. Essential elements for creating cross-volume data integrity and data consistency include the ability to:
Cross-volume data integrity and data consistency enable database RESTARTs if the second copy of the data is actually used. Solutions that employ cross-volume mirroring and remote-disk mirroring must address the issue of data consistency to support cross-volume and cross-storage subsystem data integrity.
Most customers, when designing a multi-site solution, must minimize the time it takes to restart applications once the data at the secondary site has been recovered.
In the late 1980s, the SHARE Technical Steering Committee, working with IBM, developed a white paper that described levels of service for disaster recovery using Tiers 0 through 6. Since then, a number of businesses using IBM zSeries have moved toward an IBM TotalStorage solution called the Geographically Dispersed Parallel Sysplex, which allows an installation to manage end-to-end application and data availability across multiple geographically separate sites. This resulted in an additional seventh tier representing the industry's highest level of availability driven by technology improvements.
Most organizations typically consider four different disaster recovery quality-of-service (QoS) levels for their enterprise:
Most development environments require bronze or silver disaster recovery QoS. If the operational systems that the development environment supports are business critical and may require timely development changes, a gold QoS may be required. Only a very few development environments, those that support nationally critical operational systems or massive national or international banking systems, will ever require platinum QoS.
The primary question you should ask is: "How long can my operational systems go without being able to make a critical development change?" Further, does this require the whole development environment or just part?
Secondary questions that might contribute to deciding the QoS level required include:
The following subsections cover the major considerations that a Jazz administrator will need to account for when creating a disaster recovery architecture and procedures.
The Jazz solution configuration should use storage area network (SAN) or network-attached storage (NAS) technologies if possible, because these technologies provide a degree of fault tolerance for disk storage resources. The Jazz repositories should reside on these resources. If virtualization is being used for the Jazz servers, then the virtual machine (VM) images should also be stored on this storage.
In addition, NAS solutions offer utilities for doing “offline backups” by breaking the disk mirror and then backing up the “offline” copy of the disk storage. These systems also provide the capability to store these backups for quick restoration, as well as utilities to manipulate them. Well-prepared Jazz installations also make sure that these backups are moved offsite at some interval, to mitigate the risk of data loss should the data from an entire site be destroyed.
Important: The enterprise Jazz infrastructure is expected to support the software development needs of an organization, and the consequences of losing these software assets are quite severe. Use of SAN and NAS technologies can help mitigate this risk and allows Jazz-based artifacts to be backed up with a minimum amount of downtime. While NAS may impact performance when compared to a SAN implementation, it provides hot swapping and is more reliable.
Depending on the database vendor chosen, different database backup utilities are available. If you use IBM DB2, on-line backups are available and provide a way to keep quality backups with minimal interruption. See the DB2 documentation for details on enabling on-line backups. Some specific settings are required in the area of log file management. Even with a solid backup story, scheduling occasional downtime for a full backup is wise. On Jazz.net, on-line database backups are taken nightly; in conjunction with the transaction logs, they allow the database to be restored up to the last committed transaction. Off-line backups occur only when migrations or host system configuration changes are completed.
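The following Python sketch shows how an on-line DB2 backup might be scripted. It assumes that the DB2 command line processor (`db2`) is available in the calling environment and that archive logging has already been enabled for the database. The database name `JTS` and the target directory are placeholders, and the helper function names are hypothetical. Check the DB2 documentation for the options that apply to your version.

```python
import subprocess
from typing import List


def db2_online_backup_command(database: str, target_dir: str) -> List[str]:
    """Build a DB2 BACKUP DATABASE ... ONLINE command that includes the logs."""
    return ["db2", "backup", "database", database,
            "online", "to", target_dir, "include", "logs"]


def run_online_backup(database: str, target_dir: str) -> int:
    """Run the backup and return the exit code (requires a DB2 environment)."""
    command = db2_online_backup_command(database, target_dir)
    return subprocess.run(command, check=False).returncode


# Placeholder names; printing the command lets an administrator review it first.
print(" ".join(db2_online_backup_command("JTS", "/backups/jts")))
```

Separating command construction from execution keeps the command easy to review and to log before a scheduler runs it.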
When you use an Oracle database, configure it for ARCHIVELOG mode. You can use the Oracle RMAN utility or Enterprise Manager to take backups and to schedule them periodically in conjunction with archive logs. The same tools can restore the database in case of failure. Other components, such as instance configuration settings, might also need to be backed up to restore the database exactly to its original configuration.
In addition to backing up your Jazz repository databases, back up the data that defines your application server configuration or configurations. The key items to back up from the Jazz application servers are the jazz.war file, the workitemindex directory, the teamserver.properties file, the profile.ini file, the update-site directory, and the conf directory.
Also back up the following files from your web server: server.xml, web.xml, and tomcat-users.xml (Tomcat), or the WebSphere profiles when using WebSphere Application Server. WebSphere Application Server provides a backup utility for doing this.
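One way to capture these configuration items is to archive them together on a schedule. The sketch below is a minimal example using the Python standard library. The function name is hypothetical, and the relative paths passed to it are assumptions that must be adjusted to the actual layout of your server installation.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def archive_server_config(server_root: Path, archive_path: Path,
                          relative_paths: Iterable[str]) -> List[str]:
    """Write the listed files and directories to a gzip-compressed tar archive.

    Returns the relative paths that were not found, so that a missing item
    is reported instead of being silently skipped.
    """
    missing = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for rel in relative_paths:
            source = Path(server_root) / rel
            if source.exists():
                # Directories are added recursively.
                tar.add(source, arcname=rel)
            else:
                missing.append(rel)
    return missing
```

A call might look like `archive_server_config(Path("/opt/jazz/server"), Path("/backups/jazz-config.tar.gz"), ["conf", "update-site", "workitemindex"])`, where every path shown is a placeholder. Returning the missing items lets a scheduled job fail loudly when the installation layout changes.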
Tip: Deploy the Jazz repositories on an enterprise-ready database technology. Use the functionality and features available with that technology to best minimize downtime for backup and restore operations. Use standard enterprise backup procedures and policies with respect to backup media and offsite storage. Test these technologies and procedures, and document the specific approaches and procedures in place in your environment. For details on database backup technologies and concerns, see the Deployment Guide for Jazz Team Server and the Getting Started with Rational Team Concert 3.x: A Deployment Guide.
Some high availability (HA) configurations that allow automatic server failover are not currently fully supported for Jazz-based solutions. If you are not able or willing to deploy a supported HA configuration, use redundant server resources in separate physical locations to help mitigate the risk of losing service when communications with a Jazz server are lost.
A set of Jazz Application and Jazz Web DR servers can provide a warm backup for a server that has lost connectivity. The Jazz application or Jazz web server can be configured and ready to take over operations with a minimum of change. The server can be preconfigured with the proper installed software, with only the files called out in the previous section needing to be restored before the DR server can be brought online.
The server designated as the Jazz Repository DR server needs a different set of backup and restore criteria, because it is the database server that stores the data. Because of the amount of data involved, restoring a Jazz repository server takes much longer.
Tip: Using Jazz DR servers requires that some additional hardware be kept ready at all times and remain underutilized. Some organizations will be hesitant to spend money on underutilized resources, but this spending should be viewed as a way to mitigate the risks associated with the loss of software development artifacts and the loss of time while systems are not operational.
Using Jazz application and Jazz web servers in virtual machines provides an organization with the capability to quickly bring additional capability online. The use of virtual hardware to host these resources allows an organization to quickly and accurately bring up exact replicas of resources lost to disaster situations.
Using Jazz repository servers hosted on virtual machines allows for some of the same benefits; however, the rapid pace of software development data changing within the repositories means that some sort of restoration of the databases will need to be done. Loss of performance must be measured against the DR considerations when deciding on a Jazz solution architecture.
Tip: The use of virtual hardware for hosting the Jazz application and Jazz web components can benefit the disaster recovery strategy of an organization without a large impact on performance. Recovery from the loss of a Jazz repository server can be provided with a mix of database replication technologies and virtualization technologies. The relative risks, consequences, and performance impacts of an organization's decision need to be carefully weighed.
The following list contains common disaster recovery scenarios. They are included here as a way to help identify the common procedures and processes needed for a proper disaster recovery plan.
Disk failure can cause loss of data, and users will notice that the Jazz applications and Jazz server seem either unresponsive, or unable to store changes (when information returned to user is from the server cache). Users that are developing in Eclipse workspaces will be able to continue their work on software development assets, but no data will be stored to the Jazz repositories.
Risk assessment: The use of SAN and NAS technologies can help reduce the risk of a single disk failure causing an unplanned outage. Expect the risk to be on par with that of losing a power supply.
The LAN is offline. Users will notice that all applications (including Jazz applications) are unresponsive. Users that are developing in Eclipse workspaces will be able to continue their work on software development assets, but no data will be stored to the Jazz repositories.
Risk assessment: Often, networks may experience significant slowdown, but they are rarely offline. If there is a network outage, it impacts all users and will be handled by IT support as a top priority problem. No special procedures by the Jazz administrators are warranted.
The Jazz application, Jazz web, or Jazz repository (database) server is lost. DR procedures will focus on getting users back to work using the existing infrastructure. Short-term solutions will optimize accessibility over performance.
Risk assessment: A lost server is probably the largest risk, because it is the event that is most likely to happen. The loss of a Jazz web or Jazz application server can be responded to more quickly. Recovering from the loss of the Jazz repository server takes longer, because the databases need to be restored.
In this scenario, a site is lost or inaccessible. Recovery procedures will follow those of the server lost scenario above, but require the databases to be restored from offsite backup. A lost site will likely entail loss of ALL Jazz server resources, and some data loss.
Risk assessment: In case of a lost site (think fire, flood, tornado), re-establishing the development environment is not a Day 1 priority.
The following phases apply to any disaster recovery process. This section outlines the high-level steps needed for proper disaster recovery. A separate step-by-step disaster recovery document should be made available so that administrators who are unfamiliar with Jazz can also perform these operations. The last subsection highlights the need to exercise these processes on a regular basis to ensure that the disaster recovery processes and procedures adequately mitigate the risks associated with loss of the Jazz systems and their data.
Some good examples of disaster recovery capabilities and processes can be seen in the Jazz Team Server Backup Details and the RRC Backup and Restore documents on the Jazz wiki site.
During this phase, the Jazz administrator becomes aware of a loss of service. In some scenarios, this is detected electronically, and automated processes are kicked off. In other scenarios, the notification is more manual in nature.
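Automated detection is often built on repeated health probes, with an alert raised only after several consecutive failures so that a single slow response does not trigger DR procedures. The sketch below illustrates that idea only. The function name, the threshold of three, and the way probe results are gathered are all assumptions, not a description of any Jazz monitoring feature.

```python
from typing import Iterable


def detect_outage(probe_results: Iterable[bool], failures_needed: int = 3) -> bool:
    """Return True once `failures_needed` probes in a row have failed.

    Each item in `probe_results` is True for a healthy response and False
    for a failed one; a healthy response resets the failure streak.
    """
    streak = 0
    for healthy in probe_results:
        streak = 0 if healthy else streak + 1
        if streak >= failures_needed:
            return True
    return False


print(detect_outage([True, False, False, False]))  # True
print(detect_outage([False, True, False, False]))  # False
```

In a real deployment the probe results would come from whatever monitoring the organization already runs. The threshold trades detection speed against false alarms.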
After the Jazz administrator is aware of a potential loss of service, other impacted parties need to be notified immediately. At this point, the Jazz administrator and other stakeholders need to assess and identify the problem, and then identify and execute the correct DR procedures.
During this phase, recovery assets are put into place and the disaster recovery procedures are completed. Recovery refers to the recovery of service for the end users and stakeholders of the Jazz infrastructure.
After service has been restored, possibly at reduced performance or capacity, the original issue must be addressed. After the cause of the loss of service has been identified and addressed, plans for moving back to the original production systems must be made and executed. This resumption of normal operations should occur with as little impact as possible on the Jazz user community.
Several staff members should be trained and practiced in disaster recovery procedures. A regular disaster recovery drill reinforces the training and verifies that the infrastructure and procedures are working and up to date.
The process to back up the repositories requires that the Jazz server be shut down in order to ensure that database integrity is maintained. Full backups, as opposed to incremental backups, should be performed, in order to ensure database integrity. After the backup is completed, the Jazz server processes can be restarted.
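The shutdown, full backup, and restart sequence described above can be wrapped so that the server is restarted even when the backup step fails. This is a sketch under stated assumptions: the three callables are hypothetical stand-ins for site-specific scripts, and restarting after a failed backup is a design choice made for this example rather than a documented Jazz requirement.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_offline_backup(stop_server: Callable[[], None],
                       full_backup: Callable[[], T],
                       start_server: Callable[[], None]) -> T:
    """Stop the server, take a full backup, and always restart the server.

    An exception raised by the backup step still propagates to the caller,
    but only after the server has been restarted.
    """
    stop_server()
    try:
        return full_backup()
    finally:
        start_server()


# Example wiring with stand-in steps that only record what happened.
steps = []
result = run_offline_backup(
    stop_server=lambda: steps.append("stop"),
    full_backup=lambda: steps.append("backup") or "backup-ok",
    start_server=lambda: steps.append("start"),
)
print(steps, result)  # ['stop', 'backup', 'start'] backup-ok
```

Using `try`/`finally` keeps the downtime window bounded by the backup itself, while the propagated exception ensures that a failed backup is not mistaken for a successful one.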
Use of NAS technologies, or high availability database technologies, can help minimize the downtime associated with periodic backups of the Jazz repositories.
General backup process