This page was initially written for Rational Team Concert. In Engineering Lifecycle Management 7.x, DOORS Next adopted the same architecture as the other applications (RTC -> Engineering Workflow Management, and RQM -> Engineering Test Management), so this article now applies to all three main applications. It is also generally applicable to the other Java-based applications.
When a user says, "The system is too slow" or "ELM is down", what do they really mean? How do you quantify the user's perception of slow?
Maybe the user was expecting that a plan with 500 Work Items would be displayed within nearly the same amount of time it takes to open a small one with 50 Work Items. Or maybe just part of a network was down, affecting all users from a single region.
The only way to understand what is working, what is not working, and how long it is taking is to collect the data needed to measure how long each kind of access takes under normal usage and under high usage. Then, when the user says, "It is slow", you can compare against those measurements and confirm (or not) how much slower it is.
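As a minimal sketch of that comparison, the snippet below divides an observed time by the median of the times recorded under normal usage. The function name and the sample numbers are hypothetical and only illustrate the idea; the choice of the median as the "normal" value is an assumption, not something prescribed by ELM.

```python
from statistics import median


def slowdown_factor(baseline_seconds, observed_seconds):
    """Return how many times slower the observed time is than the median baseline."""
    if not baseline_seconds:
        raise ValueError("at least one baseline sample is required")
    typical = median(baseline_seconds)
    if typical <= 0:
        raise ValueError("baseline times must be positive")
    return observed_seconds / typical


# Hypothetical numbers: opening a plan normally takes about 4 s; today it took 10 s.
baseline = [3.8, 4.0, 4.4]
factor = slowdown_factor(baseline, 10.0)
print(f"Observed time is {factor:.1f}x the normal time")
```

A factor close to 1 suggests the user's expectation, rather than the system, may be the issue; a factor well above 1 confirms a real degradation.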
Here is a guide for the Engineering Lifecycle Management (ELM) Administrator to the main points to understand about your ELM installation and usage, as well as the tools that can be used to collect data before, during, and after the slowness period.
Performance Troubleshooting Basics
Keys to Understanding Performance Issues
- Performance is relative to the user – We need to know the expected time under normal conditions and the actual time when experiencing slowness
- The business need of the tool – What is used most: SCM, reports, or planning?
- Performance issues can occur because of many factors
- Where is the performance degradation occurring?
- Only one user or all users? (for example, certain users located in a certain building)
- Is the performance degradation across the whole server?
- When is the performance degradation occurring? (all the time, under light load, under heavy load)
- What exactly is slow? SCM, load plans, builds, Reports, Dashboards? Or only certain components? (for example, it is slow loading plans, but work items are fine)
- Are there any hangs, crashes, Out-of-Memory (OOM), high CPU?
Server versus Client side
When answering the question "where", we can usually tell whether the slowness is caused by a server-side issue (all users affected) or a client-side issue (some users affected). For example, all users connecting via VPN are affected while the others are working fine, or a single user cannot access the application from Visual Studio (a configuration problem on a single machine).
Figure 1: Server versus Client performance indicators
Characterizing the Performance Problem
The Duration (How long is it taking?)
Collect information on several data points, taking note of the exact date and time, as well as the amount of time it takes to complete the task or activity. Why? To be able to compare the expected time versus the actual time during the slowness, and understand what is slow.
- Prepare a #Test using tasks related to the activity the user finds slowest (a build, a specific query, saving a Work Item (WI)). Always use the same tasks so that the results are comparable.
- Choose a big Project Area (PA)
- Choose a Plan view with several WIs
- Choose a word for a quick search
- Prepare a Query to run, for instance – Open WIs on a selected PA
- Check-in & deliver of a sample Project Area
- Sample Build
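The record keeping described above (exact date and time plus the duration of each task) could be sketched as follows. This is only an illustration: the two task functions are placeholders that sleep instead of talking to ELM, and the names `time_task`, `run_test`, and `to_csv` are hypothetical helpers, not part of any ELM tooling.

```python
import csv
import io
import time
from datetime import datetime, timezone


def time_task(name, task):
    """Run task() once and return its start timestamp and elapsed seconds."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    task()
    elapsed = time.perf_counter() - t0
    return {
        "task": name,
        "started_utc": started.isoformat(timespec="seconds"),
        "seconds": round(elapsed, 3),
    }


def run_test(tasks):
    """Time every (name, callable) pair, always in the same order."""
    return [time_task(name, task) for name, task in tasks]


def to_csv(rows):
    """Render the measurements as CSV text so runs can be compared later."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["task", "started_utc", "seconds"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder tasks (assumptions): replace with the real steps of your #Test.
def open_plan_view():
    time.sleep(0.05)


def run_saved_query():
    time.sleep(0.02)


rows = run_test([("open plan view", open_plan_view),
                 ("run saved query", run_saved_query)])
print(to_csv(rows))
```

Running the same list of tasks during normal operation and again during the slow period gives two CSV outputs that can be compared line by line.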
Performance issues at the Server and Client side:
Client side
- Metronome output and test performance during the #Test Steps execution
- Collect HTTPWatch output during the #Test steps execution
Understanding the Environment
- Operating system of the user and server (RHEL, SUSE, Windows)
- Database vendor
- Application Server (Tomcat or WebSphere)
- Client type (web or eclipse) and Client version, Web Browser (Firefox, Internet Explorer, Chrome) and Browser version
- System Specifications: RAM, Heap Size, CPU, Virtualized or Physical machine?
- Network Information:
- Where is the DB located relative to the Application Server?
- Where are the users located relative to the Application Server?
- Is there a firewall, load balancer, or proxy?
- How and When are the Maintenance tasks executed? (backup, reindex, update statistics, automatic tasks to clean up temporary folders)
- Has a server rename occurred?
- Is your server using IPv6? Are your clients using IPv6?
- Which applications/products/APIs are accessing this Jazz Team Server (JTS) or DB Repository? Requirements Management (RM), Quality Management (QM), Change and Configuration Management (CCM), Jazz Reporting Service (rb), BuildForge, TaskTop, custom applications?
- What else is running on this environment, sharing the hardware resources?
- Physical drive location (index location) – NFS, local/shared paths
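The network questions above, in particular where the database sits relative to the application server, can be given a rough number by timing TCP connections from the application server to the database port. The sketch below is not a replacement for ping or traceroute (it measures TCP connection setup, not ICMP round trips), and the host name and port in the usage comment are hypothetical.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the milliseconds needed to open a TCP connection, or None on failure."""
    t0 = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - t0) * 1000.0
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return None


def sample_latency(host, port, attempts=5, pause=0.2):
    """Take several connection-time samples, pausing between attempts."""
    results = []
    for _ in range(attempts):
        results.append(tcp_connect_ms(host, port))
        time.sleep(pause)
    return results


def summarize(samples):
    """Summarize samples; failed attempts are recorded as None."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "failures": len(samples) - len(ok),
        "min_ms": min(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


# Example call, run from the application server (hypothetical host and port):
# print(summarize(sample_latency("db.example.com", 50000)))
```

Keeping the minimum and maximum separately helps distinguish a consistently distant database from an intermittently congested link.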
Understanding the Size and Usage
- How many active users? (JTS/Admin – active users)
- Are all the users in the same timezone as the Server?
- How many concurrent users? (PMI settings on WebSphere Application Server (WAS))
- What are the days and times with more usage?
- How many Project Areas?
- What is the size of the biggest Project Area? And the smallest one?
- What does the user use the most? Reporting, Planning, SCM?
- Regarding check-in and delivery, what is the size of the biggest stream to be delivered?
- What is the size of your DB repository?
- How many builds and build engines are there, and how much source code is involved? Where is the build executed (network, version)?
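For the question about the days and times with the most usage, one possible approach is to count request timestamps per weekday and hour. The sketch below assumes the timestamps have already been extracted (for example from application server or proxy access logs, whose format depends on your configuration and is not handled here); the sample values are made up.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def busiest_slots(timestamps, top=3):
    """Count events per (weekday, hour) and return the busiest slots first."""
    counts = Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return counts.most_common(top)


# Hypothetical request timestamps in server local time.
sample = [
    datetime.fromisoformat("2024-03-04T09:15:00"),
    datetime.fromisoformat("2024-03-04T09:40:00"),
    datetime.fromisoformat("2024-03-04T14:05:00"),
    datetime.fromisoformat("2024-03-05T09:30:00"),
]
print(busiest_slots(sample, top=1))  # [(('Mon', 9), 2)]
```

If users are spread over several time zones, convert the timestamps to a single zone before counting so that the peaks line up.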
Data Collection
Collect the Configuration
Client side
- Parameters used to start Eclipse
Server side
Prepare the Data Collection
Client side
- Activate metronome on Eclipse Client
- (Firefox) Download and Install Firebug
- (Firefox) Download and Install NetExport
- (Firefox and Internet Explorer) Download and Install HttpWatch
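Browser capture tools of this kind can typically export their traces as HAR (HTTP Archive) files, which are plain JSON. As a rough sketch, the snippet below lists the slowest requests in such a file. It relies only on the standard HAR fields `log.entries[].time` (total time in milliseconds) and `request.url`; the URLs in the sample are hypothetical, and `load_har` assumes you have exported a file to a path of your choosing.

```python
import json


def load_har(path):
    """Load a HAR file; utf-8-sig tolerates a leading byte order mark."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def slowest_requests(har, top=5):
    """Return (url, milliseconds) pairs for the slowest requests, slowest first."""
    entries = har["log"]["entries"]
    ranked = sorted(entries, key=lambda entry: entry["time"], reverse=True)
    return [(entry["request"]["url"], entry["time"]) for entry in ranked[:top]]


# Minimal in-memory HAR structure with made-up URLs and timings.
sample_har = {
    "log": {
        "entries": [
            {"time": 120.5, "request": {"url": "https://elm.example.com/ccm/query"}},
            {"time": 3400.0, "request": {"url": "https://elm.example.com/ccm/plan"}},
            {"time": 87.0, "request": {"url": "https://elm.example.com/ccm/icon.png"}},
        ]
    }
}

for url, millis in slowest_requests(sample_har, top=2):
    print(f"{millis:8.1f} ms  {url}")
```

Comparing the top entries from a capture taken under normal conditions with one taken during the slow period shows which requests account for the difference.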
Server side
- Prepare a #TestConnection: ping and traceroute between the application server and the database.
- Enable PMI data collection (see Tuning WebSphere servers for Rational Team Concert performance)
- Monitoring and Tuning -> Performance Monitoring Infrastructure (PMI) -> [Server Name] -> Enable Performance Monitoring Infrastructure (PMI)
- Performance Monitoring Infrastructure (PMI) > server1 > Custom monitoring level:
- Threads – Enable ActiveCount and PoolSize
- Servlet Session manager – Enable ActiveCount and LiveCount
- (WAS) Enable verbose garbage collection logging by adding -verbose:gc to your JVM arguments or by following Enabling verbose garbage collection (verboseGC) in WebSphere Application Server
- (Tomcat) Enable garbage collection logging by adding the following JVM options to CATALINA_OPTS:
-verbose:gc
-Xloggc:$CATALINA_HOME/logs/gc.log (Linux) or -Xloggc:%CATALINA_HOME%/logs/gc.log (Windows)
-XX:+PrintHeapAtGC (only if more detail about memory consumption is needed; generates extra output)
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError
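Once a gc.log exists, the pauses it records can be summarized with a small script. The sketch below assumes Java 8 HotSpot-style lines as produced with -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps; the exact layout varies by JVM vendor, version, and collector (IBM JVMs, for instance, write a different format), so treat the regular expression and the two sample lines as illustrative assumptions.

```python
import re

# Matches e.g. "2.345: [GC (...) ..., 0.0123456 secs] [Times: ...]".
# The greedy ".*" makes the capture land on the last ", N secs]" in the line,
# which is the total pause rather than a nested per-generation time.
GC_LINE = re.compile(r"^(\d+\.\d+): \[(Full GC|GC)\b.*, (\d+\.\d+) secs\]")


def parse_gc_pauses(lines):
    """Return (uptime_seconds, kind, pause_seconds) for each recognized line."""
    pauses = []
    for line in lines:
        match = GC_LINE.match(line.strip())
        if match:
            pauses.append((float(match.group(1)), match.group(2), float(match.group(3))))
    return pauses


def summarize_pauses(pauses):
    """Summarize the number of collections and the time spent paused."""
    durations = [pause for _, _, pause in pauses]
    return {
        "collections": len(durations),
        "full_gcs": sum(1 for _, kind, _ in pauses if kind == "Full GC"),
        "total_pause_s": sum(durations),
        "longest_pause_s": max(durations) if durations else 0.0,
    }


# Made-up sample lines in the assumed format.
sample_log = [
    "2.345: [GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] "
    "65536K->15312K(251392K), 0.0123456 secs] "
    "[Times: user=0.03 sys=0.01, real=0.01 secs]",
    "10.500: [Full GC (Ergonomics) [PSYoungGen: 10720K->0K(76288K)] "
    "[ParOldGen: 170000K->120000K(175104K)] 180720K->120000K(251392K), "
    "[Metaspace: 30000K->30000K(1077248K)], 0.4567890 secs] "
    "[Times: user=0.90 sys=0.02, real=0.46 secs]",
]
print(summarize_pauses(parse_gc_pauses(sample_log)))
```

Frequent Full GC entries or long pauses that coincide with the reported slow period are a hint to look at heap sizing before investigating other causes.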
WAIT
- Register at WAIT and read the WAIT manual
- Download the script on each application machine (jts/ccm/qm/rm)
Out-of-Memory (OOM)
Data Collection during Slow Performance
Server side
Hang, Crash or High CPU
- Windows – Capture screenshots of the Task Manager (sorted by CPU). Also, from the Task Manager/Resource Monitor button, take screenshots of all tabs, with the sections expanded.
Server and Client side, if System is Accessible:
- Execute the #Test steps prepared in the "Characterizing the Performance Problem" section
Client side
- Metronome output and test performance during the #Test Steps execution
- Collect HTTPWatch output during the #Test steps execution
Data Collection after Slow Performance
Server side
Out-of-Memory (OOM)
Client side
Additional contributors: DianeEveritt