Performance summary and guidance for the Data Collection Component in Rational Reporting for Development Intelligence
Darren Coffin, IBM
Last updated: July 15, 2014
Build basis: Rational Reporting for Development Intelligence, Data Collection Component 5.0
The Data Collection Component (DCC) is a feature introduced in the Collaborative Lifecycle Management (CLM) 5.0 release to address performance concerns with the current Extract, Transform, Load (ETL) solutions for populating a data warehouse with CLM data.
The existing CLM Java ETL and the existing Insight Data Manager (DM) ETL both suffer from inherently long execution times. Delta ETLs can run for multiple hours, which prevents multiple executions of the ETL within a given day. This is mostly due to the sequential execution of jobs and builds and the sequential processing of repositories. The Data Collection Component collects data from repositories and processes resources in parallel, which greatly reduces the overall execution time for processing data. As a result, multiple collections within a given day are possible.
This article describes some of those performance improvements found during comparison tests between the new DCC ETLs and the existing Java ETLs and DM ETLs.
The information in this document is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites. Any performance data contained in this document was determined in a controlled environment, and therefore, the results that may be obtained in other operating environments may vary significantly. Users of this document should verify the applicable data for their specific environment.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multi-programming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
DCC improves ETL performance by up to three times compared to the Java ETLs, and by up to eight and a half times compared to the DM ETLs.
These results are based on tests where a full load of all of the RRC, RTC, RQM, and Star ETL jobs was run on a single CLM repository containing a large amount of data (about 1.2 million records in the data warehouse).
In some cases, performance improvements can be even greater on a single repository. For example, running ETLs on a single RTC server with approximately 2000 changes takes one hour with DCC, where previously it would take five hours with Java ETLs.
DCC improves ETL performance even more when used with multiple CLM repositories, because DCC processes ETL jobs in parallel, whereas the Java and DM ETLs process them sequentially. Java and DM jobs are processed one at a time, no matter how many jobs and repositories there are. DCC runs multiple jobs at once, greatly reducing the overall time to finish the same number of jobs.
Again, these results are based on tests where a full load of all of the RRC, RTC, RQM, and Star ETL jobs was run. Three CLM repositories were used, containing a large amount of data (about 3.6 million records in the data warehouse).
With three full CLM repositories, DCC improves performance by up to four times compared to the Java ETLs, and by up to twelve times compared to the DM ETLs.
As more repositories are involved, the advantages of concurrent execution of ETL jobs continue to scale. Performance improvements of up to twenty times have been found with tests run on nineteen RTC servers.
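The effect of running jobs concurrently rather than one at a time can be illustrated with a toy scheduling model. The sketch below is not how DCC schedules work internally; it only compares the elapsed time of a strictly sequential run with a run in which a fixed number of jobs may execute at once. The function names, the job durations, and the worker count are all hypothetical and chosen purely for illustration.

```python
import heapq


def sequential_elapsed(durations):
    """Elapsed time when jobs run strictly one at a time."""
    return sum(durations)


def parallel_elapsed(durations, workers):
    """Elapsed time when up to `workers` jobs run at once.

    Jobs are started in the listed order, each on whichever worker
    becomes free first (a simple greedy schedule).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * min(workers, max(len(durations), 1))
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical job durations in minutes: three repositories, four jobs each.
jobs = [30, 20, 10, 5] * 3
seq = sequential_elapsed(jobs)
par = parallel_elapsed(jobs, workers=4)
print(f"sequential: {seq} min, parallel: {par} min")
```

In this model, adding repositories lengthens the sequential run in proportion to the total work, while the parallel run grows more slowly as long as workers are available, which is consistent with the trend described above.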
The one-time load performed when setting up a new data warehouse improves dramatically with DCC. An initial ETL load on a new data warehouse that would take many days with DM ETLs (up to a week for some customers) can be done in a single day with DCC. These results are with a single repository containing about 1.2 million records in the data warehouse from a single instance of RTC, RQM, and RRC.
Refreshing the Data Warehouse Reports
On a day-to-day basis, these performance improvements mean that Data Warehouse reports can now be fresh within thirty minutes in most cases under a common workload.
However, the frequency at which DCC jobs are run needs to be balanced against the impact on the caching of results for out-of-the-box CLM reports. These caches are cleared every time ETLs (whether DCC, Java, or Insight Data Manager) are run.
Case Study: Performance Comparison between Rational Data Collection Component and Rational Insight Data Manager ETL for Rational’s hosted Reporting Solution deployment
The intent of this case study is to present some numbers comparing the Data Collection Component with the Insight Data Manager ETL using the execution results from Rational’s own internal reporting system. The scope is limited to some of the CLM products, namely Rational Team Concert and Rational Quality Manager.
Rational’s reporting warehouse is a mature warehouse, in which data has been accumulating for more than 3 years. Therefore, for the purpose of this article, we look only at execution times for incremental loads rather than full loads; that is, collecting from the point products only the data that has changed since the last time data was collected and stored in the warehouse.
Insight Data Manager ETL Environment
Repositories Configured For ETL
|Product||Number of Repositories|
|Rational Team Concert||3|
|Rational Quality Manager||4|
|Rational Jazz Team Server||4|
Data Collection Component Environment
Repositories Configured For Collection
|Product||Number of Repositories|
|Rational Team Concert||4|
|Rational Quality Manager||1|
|Rational Jazz Team Server||2|
To compare the two methods of collecting data, throughput (rows processed per minute) was measured across the multiple repositories that are configured. The table below expresses the magnitude of the throughput improvement seen with the Data Collection Component.
Multiple Repository Collection
|Product||Throughput improvement with DCC (based on rows/min)|
|Rational Team Concert||2-4 times|
|Rational Quality Manager||2-3 times|
|Rational Jazz Team Server||3-4 times|
As the results show, by processing in parallel the Data Collection Component can reduce the time that is required today to collect data for CLM products with Insight Data Manager.
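The throughput comparison used in this case study comes down to simple arithmetic: rows processed divided by elapsed minutes, and then the ratio of the two throughputs. The sketch below shows that calculation. The function names and the row counts and run times are hypothetical and are not measurements from Rational's deployment.

```python
def rows_per_minute(rows_processed, elapsed_minutes):
    """Throughput in rows processed per minute."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return rows_processed / elapsed_minutes


def improvement_factor(baseline_throughput, new_throughput):
    """How many times higher the new throughput is than the baseline."""
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return new_throughput / baseline_throughput


# Hypothetical run figures, for illustration only (not measured results).
dm_throughput = rows_per_minute(rows_processed=90_000, elapsed_minutes=180)
dcc_throughput = rows_per_minute(rows_processed=90_000, elapsed_minutes=60)
factor = improvement_factor(dm_throughput, dcc_throughput)
print(f"DM: {dm_throughput:.0f} rows/min, DCC: {dcc_throughput:.0f} rows/min, "
      f"improvement: {factor:.1f} times")
```

Comparing throughput rather than raw elapsed time makes the two environments comparable even when they do not process the same number of rows in a given run.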
It is recommended that DCC be installed on a separate server from CLM for better performance, particularly with larger data sets. The minimum requirements for DCC are a 64-bit machine with at least 2 CPU cores and 8 GB of RAM. 4 GB of RAM should be assigned to DCC to accommodate parallel processing (4 GB is the default amount assigned to DCC).
However, the recommended configuration is a 64-bit machine with at least 4 CPU cores and 16 GB of RAM. 8 GB of RAM should be assigned to DCC in this case. For more information on configuring and tuning WebSphere Application Server (WAS), see Configuring and Tuning WAS.
For the operating system, Windows Server 2008 R2 Enterprise Edition or Red Hat Enterprise Linux (RHEL) Server 5.6 or 6 (64-bit) is recommended. System-specific settings for the DCC server, such as the ulimit on Linux, should follow those for a CLM deployment (for recommended ulimit settings, see Top Ten Tuning Tips). These settings should also be applied to the database server.
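The two hardware tiers above can be summarized as a small lookup. The helper below is only a sketch that encodes the figures stated in this article; the function name and its parameters are hypothetical and are not part of DCC or any IBM tool.

```python
def suggested_dcc_memory_gb(cpu_cores, ram_gb, is_64_bit=True):
    """Return the RAM (in GB) to assign to DCC, or None if below the minimum.

    Encodes the guidance in this article:
      - minimum:     64-bit, 2 cores,  8 GB RAM -> assign 4 GB to DCC
      - recommended: 64-bit, 4 cores, 16 GB RAM -> assign 8 GB to DCC
    """
    if not is_64_bit or cpu_cores < 2 or ram_gb < 8:
        return None  # Machine does not meet the stated minimum requirements.
    if cpu_cores >= 4 and ram_gb >= 16:
        return 8  # Recommended configuration.
    return 4  # Meets the minimum but not the recommended configuration.


for cores, ram in [(2, 8), (4, 16), (1, 4)]:
    print(cores, "cores,", ram, "GB RAM ->", suggested_dcc_memory_gb(cores, ram))
```

A machine that falls between the two tiers is treated here as a minimum configuration, which is an assumption of this sketch rather than stated guidance.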
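On Linux, one of the limits reported by ulimit is the maximum number of open files (`ulimit -n`). As a minimal sketch, the current value for a process can be read with Python's standard `resource` module, which is available on Unix-like systems only. The helper names are hypothetical, and the required value is left as a parameter because this article does not state it; take the actual figure from Top Ten Tuning Tips.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_at_least(required):
    """Check whether the soft open-files limit meets a required value.

    `required` is supplied by the caller; no recommended value is assumed here.
    """
    soft, _hard = open_file_limits()
    # RLIM_INFINITY means "no limit", which satisfies any requirement.
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft}, hard={hard}")
```

Because limits are applied per process, the check is only meaningful when run as the same user and in the same environment as the application server that hosts DCC.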
It is important that the data warehouse is properly configured to handle heavy usage. Due to the parallel processing of ETL jobs, the database will be very active at times. See Preparing the data warehouse database for recommended database parameter settings.
CLM 5.0 or later is required. The Java ETLs must be disabled.
Jazz Team Server (JTS) 5.0 or later is required. DCC can be registered either with the same JTS server that is used by CLM or with its own JTS server.
WebSphere Application Server version 8.5 or later (64-bit) is recommended for large-scale deployments. For evaluation purposes, Apache Tomcat is included with the DCC install.
For More Information
- CLM DCC Performance Report 5.0
- Configuring and Tuning WAS
- Top Ten Tuning Tips
- Preparing the data warehouse database
About the author
Darren Coffin has been working as a software verification developer for IBM since 1999. He has worked extensively on testing a broad variety of IBM products, and is the functional verification test lead for DCC. He can be contacted at firstname.lastname@example.org.
Copyright © 2014 – IBM Canada Ltd.