<div id="header-title" style="padding: 10px 15px; border-width:1px; border-style:solid; border-color:#FFD28C; background-image: url(<nop>https://jazz.net/wiki/pub/Deployment/WebPreferences/TLASE.jpg); background-size: cover; font-size:120%">
---+!! Approaches to implementing high availability and disaster recovery for Jazz environments
%DKGRAY% Authors: Main.StevenBeard, Main.GrantCovell, Main.TimFeeney, Main.MikeLyons, Main.MikeDelargy, Main.MichaelAfshar <br>
Build basis: Jazz applications in CLM and SSE %ENDCOLOR%</div></sticky>

<!-- Page contents top of page on right hand side in box -->
<sticky><div style="float:right; border-width:1px; border-style:solid; border-color:#DFDFDF; background-color:#F6F6F6; margin:0 0 15px 15px; padding: 0 15px 0 15px;">
%TOC{title="Page contents"}%
</div></sticky>

<sticky><div style="margin:15px;"></sticky>

__Please note: as announced with CLM v4.0.6, IBM will be simplifying the approach to High Availability (HA) for the Rational CLM/Jazz platform. Customers and IBM are increasingly using other technologies, such as cloud and virtualization strategies, that provide a solution which adapts to business demands for both availability and scale in a cost-effective manner. IBM will be removing support for !WebSphere Application Server based application clustering from the CLM/Jazz platform for version 5.0. Many of the customers who have evaluated !WebSphere Application Clustering decided that it was too expensive and complex for the benefits it provided for Jazz applications, and that it exceeded their required availability.__

This set of topic pages provides guidance on potential ways to implement HA and Disaster Recovery (DR) for Jazz environments using capabilities of the platforms and middleware on which you have chosen to deploy your Rational Jazz environment. Often this enables you to reuse existing procedures, configurations and technologies that you already use for other IT systems.
There are numerous approaches, technologies and implementation options, so this set of topic pages focuses on the most commonly available and used ones. The implementation options are usually constrained by your organization's existing approach to using these technologies for HA and DR, so in many situations we only provide general guidelines and best practices to point your infrastructure administrators in the right direction. Therefore, for many of the implementation options we do not provide detailed installation and configuration guides. You should also rely on the documentation, guidance, best practices, services and support of the middleware vendors to install and configure the respective middleware.

This page aims to provide pragmatic and cost-effective approaches to achieving HA and DR that meet your requirements and needs. There is a trade-off between the cost and complexity of an approach and the level of availability you can achieve; see [[#PragmaticApproach][Pragmatic approach to achieving your required HA and DR]].

Please read the following introductory topics to gain an understanding of key HA and DR principles:
   * [[HighAvailability][High availability principles]]
   * [[DisasterRecovery][Disaster recovery principles]]

---++ Plan for failure

Despite the best guidance, best practices, planning, deployment and management, your Jazz development environment will someday fail. When properly designed, [[DisasterRecovery#DRProcessAndProcedures][implemented]] and [[DisasterRecovery#QuarterlyRecoveryDrills][practiced]], HA and DR approaches will help you manage failure and recovery within acceptable time frames. If HA, DR, and backup procedures and policies are habits, then downtime is reduced.
You must recommend and implement suitable HA and DR to meet the customer's __REAL__ requirements and needs:
   * HA is system design and implementation that achieves system and data availability most or all of the time
   * HA usually focuses on availability within the primary datacenter
   * DR leverages technology and automation to ensure business continuity in the event of an unanticipated event
   * DR usually focuses on failover to a secondary datacenter
   * Backup is the most basic part of DR and should always be a part of a DR solution
   * You are likely to achieve higher availability than you target if you focus on the right aspects

#PragmaticApproach
---++ Pragmatic approach to achieving your required HA and DR

An administration team that is developing an HA and/or DR plan for a development environment must weigh the need to recover quickly and completely against the cost of implementing the recovery. The team should also consider the impact on the I/O performance of the environment, as well as the environment's [[DisasterRecovery#RecoveryTimeObjective][recovery time objective (RTO)]]. In other words, how much time is available to fully recover the environment with all critical development operations up and running again? Another important factor to consider is the environment's [[DisasterRecovery#RecoveryPointObjective][recovery point objective (RPO)]]: How much data is lost, or at what actual recovery point-in-time is all data current?

<div style="text-align:center;"><img src="%ATTACHURLPATH%/ha_expense.jpg" alt="ha_expense.jpg" width="50%" height="50%" /><br><br> __[[http://www.redbooks.ibm.com/redbooks/pdfs/sg247700.pdf][The expense of availability]]__ </div>

The above image shows that there is an important balance between the costs of failure and the costs of HA and DR.
HA and DR requirements for systems that support business or national critical capabilities, such as stock exchanges, government agencies, hospitals, banks or e-commerce websites, are likely to be much higher than those of a typical development environment, unless the development environment needs to have similar availability to the system it supports. Even in this situation, the required availability is usually less than that of the business or national critical system!

As you design your HA and DR strategy, consider the inconveniences (costs) if your SCM system goes down for: 1 day, 4 hours, 1 hour, 15 minutes, 15 seconds, etc. In many cases, the cost to the organization of designing, implementing and practicing HA/DR that ensures very small outage windows is not offset by the cost of the outages it avoids.

Additional business continuity costs can include:
   * Hardware and software
      * Additional datacenters, servers, physical components
      * Additional software, licenses
   * People
      * Product admins, IT admins, off-site contractors, 3rd-party
   * Business continuity procedures, planning, training, implementation, testing
      * And after deployment, remember to practice, practice, practice
   * Determining when to actually implement HA and DR
      * Monitoring: automatic and/or manual
      * Tools, logging, policies, procedures
      * And of course, more practice, practice, practice

Only consider and implement an HA and/or DR solution that you are sure your organization is capable of supporting. If your organization has no experience with a technology, or lacks the general administrative maturity to support it, implementing and supporting it poorly may cause more outages than it is intended to prevent!

Make sure that your organization [[HighAvailability#MeasuringAvailability][measures the actual availability]] it achieves. Document potential and actual failure scenarios within your organization, and use them to incrementally improve your HA and DR solution to further meet your needs.
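
As a rough illustration of measuring achieved availability, the following Python sketch computes an availability percentage and a mean time to recovery from a simple outage log. It is only a sketch under stated assumptions: the =Outage= record, the function names, the supported-hours figure and the outage durations are all hypothetical and are not taken from any Jazz or IBM tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outage:
    """One unplanned outage (hypothetical record layout)."""
    description: str
    minutes: float  # downtime that fell within supported operating hours


def achieved_availability(supported_minutes: float, outages: List[Outage]) -> float:
    """Return availability as a percentage of the supported operating time."""
    if supported_minutes <= 0:
        raise ValueError("supported_minutes must be positive")
    downtime = sum(o.minutes for o in outages)
    return 100.0 * (supported_minutes - downtime) / supported_minutes


def mean_time_to_recovery(outages: List[Outage]) -> float:
    """Return the average outage duration in minutes (0.0 if there were none)."""
    if not outages:
        return 0.0
    return sum(o.minutes for o in outages) / len(outages)


if __name__ == "__main__":
    # Assumed supported hours: 13 weeks x 5 days x 10 hours per day.
    quarter_minutes = 13 * 5 * 10 * 60
    # Made-up outage log, for illustration only.
    log = [
        Outage("Jazz application restart", 20),
        Outage("Application server failover", 90),
        Outage("Database recovery", 240),
    ]
    print(f"Availability: {achieved_availability(quarter_minutes, log):.2f}%")
    print(f"MTTR: {mean_time_to_recovery(log):.0f} minutes")
```

Comparing figures like these against your stated targets each quarter shows whether the HA and DR investment is delivering what the organization actually needs, and which failure scenarios dominate the downtime.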
---++ Typical and realistic requirements for HA and DR for a Rational Jazz environment

Listed below are the key measures of HA and DR, with the realistic and typical requirements that customer and IBM development environments have:
   * *[[HighAvailability#AvailabilityNumberOfNines][Availability]]* - Most customers have an availability requirement for their development environment in the range of 99.5% - 99.7% for supported business operating hours. A few customers have a __REAL__ requirement for 99.9% availability, but be very careful not to set your requirement higher than needed, as the cost of achieving very high levels of availability can be prohibitive! Further, many customers focus their requirement for HA within their primary datacenter and explicitly exclude DR scenarios requiring failover to a secondary datacenter or a major rebuild of the primary datacenter. It is often useful to understand the required availability of similar systems within your organization and the level of availability supported by [[DeploymentPlanning#ReuseCommonManagedITInfrastructureServices][common IT services that you could reuse]], which will help you define an availability requirement that is both needed and realistically achievable.
   * *[[HighAvailability#MeanTimetoRecovery][Mean time to recovery (MTTR)]]* - Again, we often see MTTR requirements focused on HA within the primary datacenter, rather than on DR scenarios. We typically see MTTR requirements in the range of 2-6 hours for a development environment within the primary datacenter.
   * *[[DisasterRecovery#RecoveryPointObjective][Recovery point objective (RPO)]]* - Most organizations base their required RPO on a nightly or once-a-day [[BackupCLM][offline back-up]]. This constrains your achievable RPO to 24 hours, with the recovery point in time being the time of the back-up.
   * *[[DisasterRecovery#RecoveryTimeObjective][Recovery time objective (RTO)]]* - The required RTO we see in customers is usually in the range of 4 hours to 2 days. However, most organizations realize that in the event of a major primary datacenter failure, the development environment will often be one of the last systems to be restored.

#FailureScenarios
---++ Failure scenarios with general approach to resolving failure and possible platform/middleware implementations

| *Failure scenario* | *General approach to resolving failure* | *Possible platform/middleware implementations* | *Possible MTTR for HA and achievable RTO for DR* |
| *Jazz application failure* | Monitoring scripts detect that the Jazz application is unavailable <br> Scripts try to restart the Jazz application a number of times <br> If the Jazz application does not restart successfully, consider applying the approach to resolving a WAS failure <br> *NOTE: depending on experience, cost and complexity of setting up the application monitoring and recovery, you could treat a Jazz application failure using the same approach as a single application server failure* | [[ImplementingApplicationHAUsingPowerHA#JazzApplicationRecovery][Jazz application recovery using PowerHA]] <br> [[ImplementingApplicationHAUsingVMwareHA#JazzApplicationRecovery][Jazz application recovery using VMware HA]] | 5 - 10 minutes |
| *WAS failure* | Monitoring scripts detect that WAS is unavailable <br> Scripts try to restart WAS a number of times and then the Jazz application(s) <br> If WAS does not restart successfully, consider applying the approach to a single application server failure <br> *NOTE: depending on experience, cost and complexity of setting up the WAS monitoring and recovery, you could treat a WAS failure using the same approach as a single application server failure* | [[ImplementingApplicationHAUsingPowerHA#WasRecovery][WAS recovery using PowerHA]] <br> [[ImplementingApplicationHAUsingVMwareHA#WasRecovery][WAS recovery using VMware HA]] <br> [[ImplementingApplicationRecoveryExampleScript][Example script for a Jazz deployment on WebSphere]] <br> [[WarmStandbyWASRecoveryUsingWASND][Warm standby WebSphere Application Server recovery using WebSphere Network Deployment]] | 10 - 15 minutes |
| *Single application server failure* | This could be due to HW, OS or 3rd party SW errors or failures <br> The middleware hypervisor detects the failure and invokes the restart or fail over of the application server | [[ImplementingApplicationHAUsingPowerHA#SingleLPARRecovery][Single LPAR recovery using PowerHA]] <br> [[ImplementingApplicationHAUsingVMwareHA#SingleVirtualServerRecovery][Single virtual server recovery using VMware HA]] | 15 - 20 minutes |
| ^ | ^ | [[ImplementingJazzApplicationServerRecoveryUsingAnIdleStandbyConfiguration][Jazz application server recovery using an idle standby configuration]] | 30 minutes - 2 hours |
| *Database failure* | This could be due to the database servers, storage or network connections failing | [[DatabaseRecovery][Database recovery]] | 15 minutes to 4 hours |
| *Web server tier failure* | A common strategy for HA in the web server tier is the use of a load balancer that feeds a reverse proxy server | | 15 minutes - 2 hours |
| *Jazz indexes corrupted* | This should never happen as a result of a Jazz application, WAS or single application server failure <br> The only time this could happen is as a result of storage corruption | [[JazzIndexesBackupAndRecoveryConsiderations][Jazz indexes, backup and recovery considerations]] | 1 - 24 hours |
| *Primary data center or site failure* | A lost or inaccessible site will likely entail loss of ALL Jazz server resources and some data loss <br> This failure might be due to natural (fire, flood, tornado etc.) or human (hacking, malicious damage, stupidity etc.) causes <br> *Note:* often re-establishing a development environment will be less of a priority than re-establishing other business critical systems | DR using !PowerHA and database back-up or clustering TBD <br> DR using VMware technologies and database back-up or clustering TBD | 4 hours to 2 days |
| *Network failure* | The local area network (LAN) or wide area network (WAN) is offline <br> Users will notice that all applications (including Jazz applications) are unresponsive <br> Users that are developing in Eclipse workspaces will be able to continue their work on software development assets, but no data will be stored to the Jazz repositories <br> Networks often experience significant slowdowns, but they are rarely offline <br> If there is a network outage, it impacts all users and will be handled by IT support as a top priority problem | No special procedures by the Jazz administrators are warranted | 15 minutes to 2 days |
| *Storage failure* | This would be due to server disk, Storage Area Network (SAN) or Network Attached Storage (NAS) failure <br> Storage failure can cause loss of data, and users will notice that the Jazz applications and Jazz server seem either unresponsive or unable to store changes (when the information returned to the user is from the server cache) <br> Users that are developing in Eclipse workspaces will be able to continue their work on software development assets, but no data will be stored to the Jazz repositories <br> The use of SAN and NAS technologies can help reduce the risk of a single disk failure causing an unplanned outage | [[StorageRecovery][Storage recovery]] | 15 minutes to 2 days |

_Please note_: If two or more failures occur at the same time, it will probably take longer to recover than the longest MTTR of all the individual failures.
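
The "detect, retry the restart a number of times, then escalate" pattern described for Jazz application and WAS failures in the table above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the example script linked from the table: the function name, the callables and the default retry values are hypothetical, and the real availability check and restart command depend on your platform and middleware.

```python
import time
from typing import Callable


def recover_with_restarts(
    is_available: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to bring a service back by restarting it a limited number of times.

    Returns True if the service is (or becomes) available. Returns False if
    every attempt fails, meaning the caller should escalate to the next
    approach, for example treating it as a single application server failure.
    """
    if is_available():
        return True  # nothing to recover
    for _ in range(max_attempts):
        restart()            # hypothetical hook: run your restart command here
        sleep(wait_seconds)  # give the service time to start
        if is_available():
            return True
    return False


if __name__ == "__main__":
    # Simulated service that comes back after the second restart.
    state = {"up": False, "restarts": 0}

    def fake_check() -> bool:
        return state["up"]

    def fake_restart() -> None:
        state["restarts"] += 1
        state["up"] = state["restarts"] >= 2

    recovered = recover_with_restarts(
        fake_check, fake_restart, wait_seconds=0, sleep=lambda _s: None
    )
    print("recovered" if recovered else "escalate to next recovery approach")
```

Passing the check and restart steps in as callables keeps the retry logic independent of the platform, so the same loop could wrap a Jazz application restart or a WAS restart and hand over to the next tier of recovery when it returns =False=.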
In a worst-case scenario, you will not be able to restore all the individual parts of your Rational environment to a consistent state, and you will have to recover from your [[BackupCLM][Jazz back-up]]. For this reason, it is essential that you always back up your environment on a regular basis as the last line of defense for all HA and DR scenarios.

---+++++!! Related topics:
   * [[ImplementingApplicationHAUsingPowerHA][Implementing application HA using PowerHA]]
   * [[ImplementingApplicationHAUsingVMwareHA][Implementing application HA using VMware HA]]
   * [[ImplementingJazzApplicationServerRecoveryUsingAnIdleStandbyConfiguration][Implementing Jazz application server recovery using an idle standby configuration]]

---+++++!! External links:
   * None

---+++++!! Additional contributors: None

<sticky></div></sticky>
History: r28 - 2017-01-03 - 12:13:22 - Main.sbeard
Copyright © by IBM and non-IBM contributing authors. All material on this collaboration platform is the property of the contributing authors.