r44 - 2015-05-26 - 18:06:28 - Main.sbagotYou are here: TWiki >  Deployment Web > DeploymentTroubleshooting > PerformanceTroubleshooting > WhyDoMyETLsTakeSoLongToRun

Why do my ETLs take so long to run?

Authors: GeraldMitchell, StephanieBagot
Build basis: CLM 4.x, 5.x
This situation is to help determine both cause and resolution where ETL (Extract, Transform, Load) processes take significantly longer than expected to run.

Introduction to ETLs

In order to understand how running ETLs can cause performance issues across the server, you should first understand what the ETL is doing.

First, the ETL (Extract, Transform, Load) jobs are initiated as a scheduled task, run by the following service as com.ibm.team.datawarehouse.service.internal.SnapshotRunnerTask. The ETLs are run in a certain sequence, detailed in the article Running the ETLs.

After the task is initiated, the ETL will extract the data from the application database, transform it, and load it into the operational data store (ODS) tables within the data warehouse database. While the ETL is active, network latency, database configuration, and JDBC connections can contribute to long running ETLs or performance issues. If the application servers and database servers are located remotely, the loading process is affected.

There are two types of ETLs which run on the server - Full Data Collection Job and an Incremental, or Delta, data collection Job. On the Data Collection Jobs page, each are denoted by two different icons under Action. Hover over the icon for a description. The Full Data Collection Jobs occur during the initial run of the data collection jobs. These jobs take significantly longer because it is gathering all data. Subsequent job runs are done using an incremental or 'delta' run where only changed data from the last job is collected. These jobs are much faster.

Once all the application ETLs have run, the JTS Star ETL is run, which will again extract, transform and load the data, this time from the ODS and stored in Jazz Team Server. Again, network latency, database configuration, and JDBC connections can contribute to performance degradation at this step.

Both sets of ETLs require system resources to run the tasks and extract and load the data on both the server hosting the Collaborative Lifecycle Management (CLM) application and the database server.

Note: The Data Collection Component (DCC) is available as of CLM 5.0 and higher. Deploying DCC on another system will alleviate many of the known performance constraints described below when you are using the traditional Java ETL's. If you are running the out of the box CLM ETL's, it is strongly recommended you use DCC in place of the Java ETL's. If you are using Rational Insight to execute the out of the box CLM ETL's, it is also recommended to use the Insight/DCC integration scripts to invoke the CLM DCC ETL's from your Insight Data Manager ETL. A video is also available on this page

Initial assessment and analysis

For information on how to assess and analyze information related to long running ETLs, review the Initial Assessment and Analysis page.

Known issues

Migrating from the data mart to the data warehouse

Despite improvements to ETL and server performance, the migration between the data mart and data warehouse is still expected to be long running. If you previously used the data mart, the first ETL run into the data warehouse will automatically migrate the data over from the data mart, which is typically a longer run time than if you were to run a full data collection job. This is due to many factors including the time it takes to gather the records, and move them between databases (where network or database latency may come into play).

Older CLM versions

CLM versions prior to 4.0.1 might have run full collection for each build. Improvements have been made to ensure that delta builds are being run (ensuring that only changes between the last build and the current are captured) as well as overall ETL performance. Ensure you are running on the latest version of CLM to take advantage of these enhancements.

Database statistics

After an ETL is run, the database statistics are updated for the tables. In some environments, the database statistics would cause extremely slow running ETLs and is indicated in the ETL log where total time for each build is not the sum of each of the times above (See How do I read the ETL Log Files for more information). If your database administrator (DBA) is running UPDATE STATS by default, you can turn it off from the Jazz Team Server.
Note: We do not recommend turning this off by default because over time the database statistics it will increase performance due to the indexing that is completed.
To turn this feature off, navigate to your Jazz Team Server Data Warehouse Connections page:
 https://<server URL>:<port>/jts/admin#action=com.ibm.team.reportsManagement.configureDataWarehouseConnection
and change "Automatically update the database statistics" to false.

Historical Data

The RICALM.REQUEST_BASELINE table will contain information necessary for reporting. There may be large amounts of historical data, causing the table to contain more information which may no longer be needed and be larger in size. It may be advantageous to remove some of the historical data if it is no longer needed. DO NOT PROCEED with removing the historical data without being under the advisement of support.

Oracle Data Guard

When running Oracle Data Guard, performance may be slow due to the functions of Data Guard. It may be advantageous to investigate turning Oracle Data Guard off. DO NOT PROCEED with turning Oracle Data Guard off without first consulting your Database Administrator.

Datawarehouse Indices

There are certain indices on the RICALM.REQUEST_BASELINE table which may be contributing to performance degradation during updates. It may be possible to remove these indices. DO NOT PROCEED with removing the indices without being under the advisement of support.

Possible causes and solutions

After reviewing the above information on assessment and analysis, you might find an error that correlates to the long running ETL. Navigate to one of the following pages:

Related topics:

External links:

Additional contributors: None

Questions and comments:

Edit | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r44 < r43 < r42 < r41 < r40 | More topic actions
 
This site is powered by the TWiki collaboration platformCopyright © by IBM and non-IBM contributing authors. All material on this collaboration platform is the property of the contributing authors.
Contributions are governed by our Terms of Use. Please read the following disclaimer.