Systematic approach to performance and availability monitoring and management

Authors: StevenBeard
Build basis: CLM and SSE all versions

This page is intended to provide an overall systematic approach to monitoring and managing performance and availability issues. It will:

  • Outline the general approach of DETECTING an issue, DECIDING what to DO and DOING something to rectify the underlying root cause
  • Explain how to initially DETECT availability and performance issues
  • Explain how to systematically work from the symptoms to the root cause of performance or availability issues and identify the failure scenario
  • Outline performance and availability scenarios with symptoms, root cause, the recommended DECISION on what to DO and how to DO it

General approach

  • DETECT that there is a performance issue or outage. It is critical that the Development Environment Team DETECTS, as quickly as possible, that there has been an outage or that a performance issue is building up. This should be fully automated and should automatically notify the whole Development Environment Team straight away. It is critical to confidence in the Development Environment Service that when users raise a failure incident, the administrative team can say, ‘Yes, we know about the failure and we are already dealing with it’, or better still, has already proactively communicated that there has been a failure and that they are working to resolve it. On the other hand, a response of ‘What failure? We did not know there was a problem!’ can be catastrophic.

Recommendation: You should investigate whether your Development Environment Team can be automatically notified of application failures via SMS or email.
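As an illustration of that recommendation, the sketch below sends a short alert email to the team when a probe detects a failure. It is a minimal sketch only; the SMTP host, sender address and distribution list are assumptions that would need replacing with your own infrastructure details. Many corporate SMS gateways accept an email-to-SMS address, so the same routine could also cover SMS notification.

```python
# Minimal sketch: notify the Development Environment Team by email when a
# failure is detected. SMTP_HOST, FROM_ADDR and TEAM_ADDRS are assumptions
# and must be replaced with your own infrastructure details.
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.com"              # assumed internal SMTP relay
FROM_ADDR = "clm-monitor@example.com"       # assumed sender address
TEAM_ADDRS = ["dev-env-team@example.com"]   # assumed team distribution list

def notify_team(subject, body):
    """Send a short alert email to the whole Development Environment Team."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = FROM_ADDR
    msg["To"] = ", ".join(TEAM_ADDRS)
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

# Example: called by a monitoring probe when the application stops responding.
# notify_team("CLM server not responding", "Automated probe failed; we are investigating.")
```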

  • DECIDE what to DO. This is often where the most time is wasted when recovering from a performance issue or outage. It is critical that ALL the perceived failure scenarios, and the standard procedures for recovering from them, are documented in the SLA. It is human nature for Administrators to believe, ‘if you give me just another hour, I can work out what the problem is and fix it’, when the better approach for a first-time failure is to restart the single point of failure or environment and then diagnose the root cause once the users can get on with their development. Generally, the best approach is to simply restart the environment for a first-time failure, but on the second or third failure with the same symptoms take a little more time to try to diagnose the root cause. There should also be a clear point in time at which you fail over to the DR site rather than continuing to resolve HA failure scenarios in the primary data center.

Recommendation: Define a target Maximum Time to Recovery (MTTR) of 1-2 hours for any single or multiple HA failure scenario. At this point after the failure there should always be a review meeting to decide whether to fail over to the DR data center OR, in exceptional situations, exactly how much additional time should be allowed for recovery before the next review meeting. Too often we see customers trying to diagnose and fix their production development environment for too long, when it would have made more sense to fail over to their disaster recovery site earlier and get the development teams working again sooner.
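A minimal sketch of how that MTTR review point could be tracked automatically rather than judged in the heat of an incident; the 2-hour figure mirrors the recommendation above, while the incident record and the notify_team routine (sketched earlier) are assumptions for illustration.

```python
# Minimal sketch: flag when an open incident has reached the target Maximum
# Time to Recovery so the failover review meeting is actually held. The 2-hour
# target mirrors the recommendation above; everything else is an assumption.
from datetime import datetime, timedelta, timezone

MTTR_TARGET = timedelta(hours=2)

def review_due(incident_start, now=None):
    """Return True when the incident has been open longer than the MTTR target."""
    now = now or datetime.now(timezone.utc)
    return (now - incident_start) >= MTTR_TARGET

# Example: an incident opened at 09:00 UTC triggers a DR failover review at 11:00 UTC.
# if review_due(datetime(2016, 3, 24, 9, 0, tzinfo=timezone.utc)):
#     notify_team("MTTR target reached", "Hold the failover review meeting now.")
```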

  • DO something to fix the problem. Where possible this should be fully automated, or at least have detailed, well-tested manual procedures defined.
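Where restarting really is the documented first response, even that step can be scripted. The sketch below is a hedged example only, assuming a Linux host where the application runs as a systemd unit named clm.service and exposes a web URL for verification; both names, and the wait time, are assumptions that will differ in your deployment.

```python
# Minimal sketch: restart the application service and verify that it answers
# again. The unit name, verification URL and wait time are assumptions taken
# for illustration; substitute the values used in your own deployment.
import subprocess
import time
import urllib.request

def restart_and_verify(unit="clm.service",
                       url="https://clm.example.com/ccm/web",
                       wait_seconds=120):
    """Restart a systemd unit, wait for startup, then confirm the web UI responds."""
    subprocess.run(["systemctl", "restart", unit], check=True)
    time.sleep(wait_seconds)  # give the application server time to start
    with urllib.request.urlopen(url, timeout=30) as response:
        return 200 <= response.status < 400
```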

DETECTING availability issues

Before we try and identify the root cause of an outage, we must be able to DETECT that there has been an outage!

HOW TO ADD
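While this section is still to be completed, a minimal availability probe might look like the sketch below; the URL and polling interval are assumptions, and the notify callable is expected to be something like the notify_team routine sketched earlier on this page.

```python
# Minimal sketch: poll the application web UI and alert as soon as it stops
# answering. CHECK_URL and INTERVAL are assumptions for illustration.
import time
import urllib.error
import urllib.request

CHECK_URL = "https://clm.example.com/ccm/web"   # assumed application URL
INTERVAL = 60                                   # seconds between probes

def probe_once(url=CHECK_URL, timeout=30):
    """Return True if the application answered, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False

def monitor(notify):
    """Probe forever, calling notify(subject, body) on every failed check."""
    while True:
        if not probe_once():
            notify("CLM availability alert", CHECK_URL + " did not respond.")
        time.sleep(INTERVAL)
```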

DETECTING performance issues

Before we try and identify the root cause of a performance issue, we should DETECT that there is a performance issue to start with!

Most performance issues can be DETECTED by systematic environment systems monitoring.

HOW TO ADD
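Again, this section is still to be completed, but one hedged illustration of systematic monitoring is to track a moving average of response-time samples (for example, timings of a scripted login or standard query) and raise an alert when it drifts above an agreed threshold. The window size and threshold below are assumptions, not recommendations, and should come from your own baseline measurements.

```python
# Minimal sketch: detect a gradual performance degradation from response-time
# samples. WINDOW and THRESHOLD_SECONDS are assumptions for illustration and
# should be derived from your own baseline measurements.
from collections import deque

WINDOW = 50              # number of recent samples in the moving average
THRESHOLD_SECONDS = 5.0  # assumed acceptable average response time

class TrendDetector:
    def __init__(self, window=WINDOW, threshold=THRESHOLD_SECONDS):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def add_sample(self, seconds):
        """Record one response-time sample; return True if the moving average is too high."""
        self.samples.append(seconds)
        return sum(self.samples) / len(self.samples) > self.threshold

# Example: feed in timings from a scheduled probe.
detector = TrendDetector()
for timing in (2.1, 2.4, 3.0, 6.8, 9.2, 11.5):
    if detector.add_sample(timing):
        print("Average response time above threshold - investigate the trend")
```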

Systematically working from the symptoms to the root cause of performance or availability issues

Performance and availability scenarios with symptoms, root cause, recommended DECISION on what to DO and how to DO it

Scenario | Symptoms | Root cause | DECISIONS on what to DO | How to DO
Incremental growth in usage leading to gradual performance issues | Systems monitoring trends gradually going up and users starting to complain about slower performance | Deployment topology needs to be scaled up OR insufficient systems resources | If existing systems resources are already at a comfortable maximum then scale up the topology, OTHERWISE increase the appropriate systems resources | LINK: Scale-up topology; LINK: Scale-up systems resources
Network latency issues for specific sites | | | |
Failure of a single server or tier of environment | | | |
Failure of common IT service | | | |
Fundamental failure of primary data center | | | |
User operations known to ALWAYS have high demands | | | |
User operations known to SOMETIMES have high demands | | | |
Potential IBM product defect | | | |
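A hedged sketch of how such a scenario catalogue could be kept in machine-readable form next to the monitoring scripts, so that DECIDING what to DO does not start from a blank page during an incident. Only the first scenario is populated, mirroring the table above; the keys and wording are assumptions for illustration.

```python
# Minimal sketch: the scenario catalogue from the table above as a simple
# lookup structure. Only the first scenario is populated; the remaining
# entries are placeholders to be filled in as the scenarios are documented.
SCENARIOS = {
    "incremental-growth-in-usage": {
        "symptoms": "Monitoring trends gradually going up; users complain about slower performance",
        "root_cause": "Deployment topology needs to be scaled up, or insufficient systems resources",
        "decision": ("Scale up the topology if systems resources are already at a comfortable "
                     "maximum, otherwise increase the appropriate systems resources"),
        "how_to": ["Scale-up topology", "Scale-up systems resources"],
    },
    "network-latency-for-specific-sites": {},
    "failure-of-single-server-or-tier": {},
    # ... remaining scenarios from the table above
}

def documented_decision(scenario_key):
    """Return the documented DECISION for a scenario, or flag that it is missing."""
    entry = SCENARIOS.get(scenario_key, {})
    return entry.get("decision", "No documented decision yet - escalate to a review meeting")

print(documented_decision("incremental-growth-in-usage"))
```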

Also consider, for each scenario, whether a tactical resolution or a permanent fix is the more appropriate response.

Develop a pragmatic approach to assessing actual user performance for each of the development tools

Acceptable performance can be very subjective, leading users, teams/projects and organizations to perceive performance problems where none exist, or where they are poorly understood.

Understanding the symptoms, and hopefully getting to the root cause, of performance issues can be very hard, so a critical first step is to DECIDE whether or not there is a performance issue to start with!

Recommendation: Establish a small set of benchmark manual performance tests that can be used to initially assess performance when a user raises a performance service request/incident. A small number of manually timed (user’s wrist watch) performance tests for common operations (login, create WI, run a standard query, open dashboard etc.) are an excellent way of assessing the current performance from a user’s workstation when they raise a performance service request/incident.

It is critical that the manual tests are consistent and are run against the same RTC, RQM and RDNG projects and data so that the results can be compared consistently. For each test you should DECIDE an acceptable range of performance so that the Development Environment Team can DECIDE whether the user has a performance issue and how severe it is.
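A hedged sketch of how those manually timed results could be compared against the agreed acceptable ranges; the test names and ranges shown are assumptions for illustration and should be replaced with the ranges you DECIDE for your own project and data.

```python
# Minimal sketch: compare manually timed benchmark results (user's wrist watch)
# against an agreed acceptable range per test. The test names and ranges are
# assumptions; replace them with the ranges agreed for your own project and data.
ACCEPTABLE_RANGES = {              # (low, high) in seconds for each benchmark test
    "login": (1, 10),
    "create work item": (1, 5),
    "run standard query": (1, 15),
    "open dashboard": (1, 20),
}

def assess(results):
    """Report, per test, whether the timed result is within the acceptable range."""
    report = {}
    for test, seconds in results.items():
        low, high = ACCEPTABLE_RANGES[test]
        if seconds > high:
            report[test] = "SLOW: {}s against an acceptable maximum of {}s".format(seconds, high)
        elif seconds < low:
            report[test] = "Faster than the expected range - re-check the timing"
        else:
            report[test] = "OK"
    return report

# Example: timings gathered at the user's workstation when they raise an incident.
print(assess({"login": 12, "create work item": 3, "run standard query": 9, "open dashboard": 25}))
```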

Related topics: Deployment web home

External links:

Additional contributors: None
