r15 - 2017-06-30 - 13:33:59 - RonnieSeagren

Known Resource-intensive Scenarios

Authors: TimFeeney
Build basis: The Rational solution for Collaborative Lifecycle Management (CLM) and the Rational solution for systems and software engineering (SSE) v6.0.3, v6.0.4.

This page aims to capture user and system scenarios across the ALM portfolio that can potentially drive relatively higher load on a Jazz application. Such scenarios can lead to server debt (such as out-of-memory errors) if run during peak times on systems that don't have sufficient spare resources available. These scenarios are qualified or quantified to make them easier to understand. Where possible, best practices are provided that can minimize or avoid the issue altogether.

This list starts with the assumption that the applications are being run in a topology and on servers that are sized and tuned following our recommendations.

Consider these scenarios:

  1. User scenarios known to ALWAYS have high demands. They tend to be computationally expensive or consume large amounts of data or memory. They can lead to system slowdown and, on resource-constrained servers, have been known to bring down environments. Examples: Large BIRT reports, builds, high volume imports.
  2. User scenarios known to SOMETIMES have high demands. Their resource consumption or computation demands tend to be more reasonable and manageable. With appropriate system resources, system configuration guidance, or usage best practices, their impact could be mitigated or avoided. Examples: Plan loading, populating a dashboard.
  3. System scenarios with potential high impact. Examples: ETL jobs, backup, online migration.

The following table summarizes the known resource-intensive scenarios. Each scenario links to a description of the scenario, a unique ID (name) to be used by the applications when starting or stopping these scenarios and for log correlation, and a link to known best practices. Note that any statements regarding IBM's future direction, intent, or product plans are subject to change or withdrawal without notice.

Table 1: Summary of Known Resource-intensive Scenarios and Related Best Practices
| Product | Scenario | Scenario ID | Best Practice |
| Common scenarios | Data validation | Run_Data_Validation | NA |
|  | Item count metrics | Collect_Item_Count_Metrics | NA |
| Rational DOORS Next Generation | Enabling suspect traceability | DNG_Suspect | link |
|  | Running RPE/RRDG reports with large result set | DNG_Report | link |
|  | Importing a large number of requirements | DNG_Import | link |
|  | Exporting a large number of requirements | DNG_Export | link |
|  | Using view query with large result set | DNG_Query | link |
|  | Running DNG ETL jobs to populate the data warehouse | DNG_ETL | link |
| Rational Team Concert | Comparing a repository workspace to a stream with which it is extremely out-of-date | RTC_Compare_Workspace | link |
|  | Annotating an extremely large text file | RTC_Annotate_File | NA |
|  | Importing a Microsoft Project plan with a large number of tasks | RTC_MSP_Import | link |
|  | Exporting a large number of work items to a Microsoft Project plan | RTC_MSP_Export | NA |
|  | Adding many build result contributions to a build result | RTC_Add_Build_Contribution | link |
|  | Loading a large plan | RTC_Load_Plan | link |
| Rational Quality Manager | Duplicating test plans with a large test hierarchy | RQM_Duplicate_Test_Plan | link |
|  | Bulk archiving or deletion of test results | RQM_Bulk_ArchiveDelete | link |
| Jazz Reporting Services (all reporting technologies) | Running BIRT reports based on live data | JRS_BIRT | link |
|  | Running DCC jobs that require high storage and processing power | JRS_DCC | link |
|  | Performing LQE index maintenance | JRS_LQE_Maintenance | link |
|  | Running high-volume and very complex queries | JRS_LQE_Query | link |
|  | Refreshing a data source in Report Builder | JRS_Refresh_Data_Source | link |
| Global Configuration Management | Creating streams in a global configuration hierarchy | GCM_Create_Stream | NA |
|  | Creating a baseline staging stream from a global stream hierarchy | GCM_Stage_Baseline | NA |
|  | Creating a global baseline (or committing a baseline staging stream) from a global stream hierarchy | GCM_Create_Baseline | NA |
|  | Updating a global stream from a baseline | GCM_Update_Stream | NA |

Table 2: Throttling and Logging by Scenario
| Scenario ID | Throttling[1] | Advanced Logging[2] |
| Run_Data_Validation | NA | NA |
| Collect_Item_Count_Metrics | NA | NA |
| DNG_Suspect | NA | NA |
| DNG_Report | RDMPublishReportsRunners (3) | Report name, Project Area, Component, Global/Local Configuration, Module/Requirement Set |
| DNG_Import | ReqIF.importThreadPoolSize (1) | ReqIF filename/path, Project Area, Component, Local Configuration |
| DNG_Export | ReqIF limited by # cores on DNG server. CSV limited by "CSV Export Page Size" and "Export page sleep duration" | ReqIF Definition, Project Area, Component, Global/Local Configuration |
| DNG_Query | query.client.timeout (30 seconds), SPARQL Query abort timeout (5 min), query load management | Query (RQL or Module/View/Filter), Project Area, Component, Global/Local Configuration |
| DNG_ETL | NA | NA |
| RTC_Compare_Workspace | Maximum limit of SCM workspace compare scenarios (0) | Workspace, Stream, Project Area |
| RTC_Annotate_File | 1GB file | Filename/path, file size, workspace, Project Area |
| RTC_MSP_Import | NA | Filename/path, Project Area, Plan name |
| RTC_MSP_Export | NA | Project Area, Plan |
| RTC_Add_Build_Contribution | NA | Build engine, Build definition, Build result label, contribution added, Project Area |
| RTC_Load_Plan | delayed child loading | Plan name, Plan view, Project Area |
| RQM_Duplicate_Test_Plan | NA | Plan name, 'include links' setting, Source/Target Project area, Component, Global/Local configuration |
| RQM_Bulk_ArchiveDelete | NA | Project Area, Component, Configuration, Query context (i.e. description of artifacts to archive/delete) |
| JRS_BIRT | NA | Report name, Report runtime settings (as applicable), Project area, Component, Configuration |
| JRS_DCC | job schedule | NA |
| JRS_LQE_Maintenance | NA | NA |
| JRS_LQE_Query | LQE query limits | Sparql Query executed (from LQE Admin UI), Report Builder Report name (if contains advanced query) |
| JRS_Refresh_Data_Source | metamodel.autorefresh.time (6am), metamodel.autorefresh.repeat.inminutes (720) | NA |
| GCM_Create_Stream | NA | Project Area, Component, Stream, Baseline [3] |
| GCM_Stage_Baseline | NA | Project Area, Component, Stream [3] |
| GCM_Create_Baseline | NA | Project Area, Component, Stream [3] |
| GCM_Update_Stream | NA | Project Area, Component, Stream, Baseline [3] |

[1] Built-in properties or characteristics that limit the scenario (defaults shown in parentheses); see the scenario for further details.

[2] When advanced logging is enabled for a scenario, this information is included.

[3] These scenarios also include start and stop logging (not advanced logging) by the remote applications when carrying out their part in these distributed scenarios.

Monitoring resource-intensive scenarios

Starting in v6.0.3, starting and stopping of resource-intensive scenarios are captured in their respective application logs. If advanced logging is turned on (managed by the new Serviceability tab for each application), then additional information about the scenario occurrence is logged.
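
As a rough illustration of how start and stop entries could be correlated by scenario ID, the following Python sketch pairs the two events and reports how long each occurrence ran. The log line layout, the "instance" field, and the function name are assumptions made up for this example; they are not the format the applications actually write, so adapt the pattern to the entries found in your own logs.

```python
import re
from datetime import datetime

# Hypothetical log layout, assumed only for this sketch:
#   2017-06-30 13:00:00 SCENARIO START id=RTC_Load_Plan instance=1
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) SCENARIO "
    r"(?P<event>START|STOP) id=(?P<id>\w+) instance=(?P<inst>\w+)$"
)

def scenario_durations(lines):
    """Pair START/STOP entries; return (scenario_id, instance, seconds) tuples."""
    started = {}
    durations = []
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip log lines that are not scenario entries
        key = (match["id"], match["inst"])
        stamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        if match["event"] == "START":
            started[key] = stamp
        elif key in started:
            elapsed = (stamp - started.pop(key)).total_seconds()
            durations.append((key[0], key[1], elapsed))
    return durations

sample_log = [
    "2017-06-30 13:00:00 SCENARIO START id=RTC_Load_Plan instance=1",
    "2017-06-30 13:00:05 SCENARIO START id=DNG_Import instance=2",
    "2017-06-30 13:00:12 SCENARIO STOP id=RTC_Load_Plan instance=1",
    "an unrelated log line",
    "2017-06-30 13:02:05 SCENARIO STOP id=DNG_Import instance=2",
]
durations = scenario_durations(sample_log)
```

Keying on both the scenario ID and an occurrence identifier keeps overlapping runs of the same scenario from being confused with each other.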

Further, new JMX MBeans track the occurrence of the scenarios as well and can be captured by enterprise monitoring tools for further analysis and trending. These scenario MBeans and other CLM application MBeans are documented in Monitoring Jazz Application performance and usage via JMX MBeans.

Common scenarios

Data validation

Data validation is a scheduled background task that runs on each application server and performs data integrity checks against the database to identify inconsistencies. This scenario exercises the online verify functionality in the product. As part of this, the registered verifiers might make many queries against the database, resulting in high I/O, memory, and CPU load on the database server and additional network I/O between the server and the database. Keep the frequency at its default setting. Only one instance of the task runs at a time.

Item count metrics

This operation collects detailed item counts and item size information from the repository, which help you understand data growth and data distribution patterns. This scenario walks through all the repository item states in the Item states system table to produce details about each type of item. As part of the scenario, the operation makes many queries against the database, resulting in high I/O, memory, and CPU load on the database server and additional network I/O between the server and the database. Keep the frequency at its default setting. Only one instance of the operation runs at a time. Note that in v6.0.3 the default frequency for this scenario is every 15 minutes, which is too frequent; when enabling this MBean, set the "Delay Between Invocations" property to 604800 seconds (1 week) for the "CommonMetricsCollectorTask".
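
The 604800 figure is simply one week expressed in seconds. The short Python check below confirms the arithmetic and shows how many collections per week the 15-minute default would trigger; the variable names are illustrative only.

```python
from datetime import timedelta

# One week expressed in seconds, the value suggested for "Delay Between Invocations".
one_week_seconds = int(timedelta(weeks=1).total_seconds())

# The v6.0.3 default of one collection every 15 minutes, for comparison.
default_interval_seconds = 15 * 60
runs_per_week_at_default = one_week_seconds // default_interval_seconds
```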

Rational DOORS Next Generation

Enabling suspect traceability

Link validity and suspect links are capabilities you can use to monitor related information for updates. When dependencies change (such as a customer requirement) you need a way of marking related information as "suspect" so that it can be reviewed and updated as appropriate.

Suspect links and link validity offer very similar functions but under very different circumstances. Suspect links describe related information within a single context, whereas link validity is used with configuration management.

Because of the way suspect links are indexed locally, consider their resource overhead on the server before deploying that functionality. The performance of suspect traceability and link validity depends on the number of artifacts, concurrent users, and the rate of change activity. The feature is often turned on without an understanding of what it does, how it will be used, or what it costs.

Best practices

Limit regular use in production during normal hours to small deployments, to times when the server is lightly loaded, or to cases where it is necessary; otherwise, you risk driving load on the server.

When suspect tracking is enabled, a full reindex occurs up front (see Suspicion Indexing). When you enable suspect traceability (in a project that is not enabled for configuration management), an index of change information for all link types, artifact types, and attributes is automatically built. Don't turn it off and then on again, because this action will cause another full reindex. Instead, pause the indexing (by using the Suspicion Profile Settings for the Requirements Management page).

By default, the index is automatically refreshed with new changes every 60 seconds, but you can change that setting. If you lower the default refresh setting, a greater load is placed on the server. For the refresh setting, use a number that is 30 seconds or higher.

These best practices don't apply to projects with configuration management enabled, in which case link validity, not suspicion profiles, provides the capability to note changes in requirements.

Running RPE/RRDG reports with a large result set

Generation of poorly constructed RPE/RRDG reports that include a result set of over 5K requirements can be slow. For modules that include more than 5K requirements, generate only one PDF report at a time. DNG has the advanced property "RDMPublishReportsRunners", which defaults to 3 and constrains the number of RPE/RRDG reports that can be generated concurrently.

Best practices

Be sure to test reports in advance with near production equivalent data. Where possible, define the report with JRS. Also see the general reporting best practices.

Importing a large number of requirements

Importing can be computationally expensive. At the end of an import, when all indexing occurs, the import can block other user activity. Imports of 10K requirements or less should be fine. DNG currently limits how many ReqIF imports occur at once through the "ReqIF.importThreadPoolSize" advanced property, which defaults to 1. If the ReqIF import thread limit is reached you can still submit additional import requests. You will get the status message "Waiting for the completion of {N} previously submitted ReqIF Import Tasks."; you can close the import dialog box to let this continue in the background.
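
The queuing behavior described above can be pictured with a small Python model: a worker pool of size 1 accepts every submission but runs only one task at a time, so later requests wait for earlier ones. This is only an analogy for the documented "ReqIF.importThreadPoolSize" default and says nothing about how DNG is implemented; "fake_import" and the counters are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

IMPORT_THREAD_POOL_SIZE = 1  # mirrors the documented default; the pool itself is a toy model

running = 0
peak_running = 0
lock = threading.Lock()

def fake_import(name):
    """Stand-in for an import task; it only tracks how many tasks run at once."""
    global running, peak_running
    with lock:
        running += 1
        peak_running = max(peak_running, running)
    time.sleep(0.05)  # pretend to do some work
    with lock:
        running -= 1
    return name

with ThreadPoolExecutor(max_workers=IMPORT_THREAD_POOL_SIZE) as pool:
    # All three requests are accepted immediately; two of them wait in the queue.
    futures = [pool.submit(fake_import, f"import-{i}") for i in range(3)]
    completed = [future.result() for future in futures]
```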

Best practices

For large (10K or greater) imports, we recommend importing during off hours or when the system is lightly loaded.

Exporting a large number of requirements

Export is less of a problem with CSV or ReqIF exports. Note that the number of concurrent ReqIF exports is limited to the number of virtual cores on the DNG server. Similar to the ReqIF import, when the export limit is reached, you get a status message and can let the export continue in the background.

Large Word or PDF exports can be problematic and should be limited to one at a time when exporting more than 10K requirements.

Two properties can govern the behavior of CSV exports. "CSV Export Page Size" limits the number of export entries (rows) that DNG builds at a time. "Export page sleep duration" causes DNG to pause between export page processing so that the export doesn't overload the server when doing a large scale export.
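
To make the interaction of the two properties concrete, here is a minimal Python sketch of a paged export that builds a fixed number of rows at a time and pauses between pages. It is a generic illustration of the page-and-sleep idea, not DNG's export code; the function name, parameters, and sample rows are assumptions.

```python
import csv
import io
import time

def export_csv_in_pages(rows, page_size, sleep_seconds, out):
    """Write rows to out one page at a time, pausing between pages.

    page_size plays the role of "CSV Export Page Size" and sleep_seconds
    the role of "Export page sleep duration". Returns the number of pages.
    """
    writer = csv.writer(out)
    pages = 0
    for start in range(0, len(rows), page_size):
        if pages:
            time.sleep(sleep_seconds)  # give the server room between pages
        writer.writerows(rows[start:start + page_size])
        pages += 1
    return pages

rows = [(i, f"Requirement {i}") for i in range(1, 11)]
buffer = io.StringIO()
pages_written = export_csv_in_pages(rows, page_size=4, sleep_seconds=0.01, out=buffer)
```

A smaller page size or a longer sleep spreads the same work over more time, trading export duration for lower peak load.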

Best practices

We recommend exporting during off hours for large (10K or greater) exports to Word or PDF.

Using view query with large result sets

Browsing artifacts and modules in DNG is done in the context of a view using a query (based on filter settings) to populate the contents of the view. Large view queries resulting in 10K or more requirements can be resource-intensive, especially across multiple folders or if you are filtering based on strings, dates, or links. Traceability queries are even more resource-intensive. DNG uses the "query.client.timeout" advanced property to limit the run time of view queries; it defaults to 30 seconds. The "SPARQL Query abort timeout (in ms)", which defaults to 5 minutes, will limit most queries that are not limited by the "query.client.timeout" value (such as loading the folder structure). Some query scenarios not governed by these timeouts include TRS, suspect indexing (when the suspect data is deleted upon "untracking" or rebuilding the index), building the type system, and recent feeds (e.g. comments, requirements). As of 6.0.3, DNG includes a query load management mechanism to proactively manage and monitor the load resulting from view queries.

Running DNG ETL jobs to populate the data warehouse

This is more of a system-initiated operation, and its cost is a function of repository size and the amount of change. Generally, for repositories larger than 100K-200K artifacts, it is best to run the ETL jobs during off hours. Alternatively, use DCC to populate the data warehouse. This doesn't apply to configuration management enabled projects, which populate the LQE database instead of the data warehouse.

Best practices

For large repository sizes, containing more than 100K-200K artifacts, it is best to run the ETL jobs during off hours.

Rational Team Concert

Comparing a repository workspace to a stream with which it is extremely out-of-date

This could cause server issues if a large number of these types of comparisons happen concurrently. As of 5.0.1, you can set a server property to limit the number of comparisons that can happen at the same time (maximum limit of SCM workspace compare scenarios). This value defaults to 0, which allows unlimited comparisons. When limited, the compare operations are queued until other compares finish. The user will see a slower compare or, in the extreme case where the thread waits too long, the compare will fail after 15 minutes (this limit is not configurable).

Best practices

When a workspace compare is performed, the service IFileSystemService#compareWorkspace is called. Occurrences of these calls are listed in the active services page of the CCM application. Should there be a large number of calls appearing at once, you may want to limit the number of workspace compares by setting the "Maximum limit of SCM workspace compare scenarios" advanced property to something on the order of 30.
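
The limit-then-queue-then-fail behavior can be modeled with a counting semaphore, as in the Python sketch below. It is a toy model under stated assumptions: the class and parameter names are hypothetical, the wait timeout is shortened from the product's fixed 15 minutes so the example finishes quickly, and a limit of 0 is treated as unlimited to match the documented default.

```python
import threading

class CompareLimiter:
    """Toy model of a cap on concurrent compares; 0 means unlimited."""

    def __init__(self, max_concurrent, wait_timeout_seconds):
        self._slots = (
            threading.BoundedSemaphore(max_concurrent) if max_concurrent > 0 else None
        )
        self._timeout = wait_timeout_seconds

    def run(self, compare):
        if self._slots is None:
            return compare()  # unlimited: run immediately
        # Queue for a free slot, but give up if the wait is too long.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("compare waited too long for a free slot")
        try:
            return compare()
        finally:
            self._slots.release()

limiter = CompareLimiter(max_concurrent=1, wait_timeout_seconds=0.05)
first_result = limiter.run(lambda: "compared")

def compare_that_requests_another():
    # The only slot is already held here, so the inner request waits and times out.
    try:
        limiter.run(lambda: "never runs")
    except TimeoutError:
        return "timed out"
    return "ran"

outcome = limiter.run(compare_that_requests_another)
unlimited_result = CompareLimiter(0, 0.05).run(lambda: "compared")
```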

Annotating an extremely large text file

This limitation exists as of version 6.0.2 and is expected to be addressed in a future release. See OutOfMemoryError when trying to annotate a 2G file. The expected default of 1MB might be adjusted as further testing is performed. The issue has been observed when annotating files with sizes in the multi-gigabyte range. Because the annotate operation puts all contents (file, history, and so on) in memory, large files, especially those with a large history, can require significant memory.

Importing a Microsoft Project plan with a large number of tasks

How long an import takes depends on the number of items in the plan and their nested structure. For example, an import from a Microsoft Project file containing 2000 tasks could take up to 30 minutes on the first import and 8-10 minutes on subsequent imports, depending on server configuration and load. Consider also the memory demands of an import that will take approximately 100KB for each task being imported over and above the memory needed for typical RTC operations. In most cases, import of Microsoft Project plans happens infrequently, generally at the start of a project. However, if imports are to be a frequent occurrence, be sure that the server memory allocation has ample spare capacity. Note that the numbers provided are based on testing in a non-production environment.
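
The memory figure above lends itself to a quick back-of-the-envelope estimate. The Python sketch below multiplies the approximate 100KB per task by the task count; the function name is illustrative, 1KB is taken as 1024 bytes, and the result is only as reliable as the non-production numbers it is based on.

```python
KB = 1024
PER_TASK_BYTES = 100 * KB  # approximate extra memory per imported task, from the text above

def import_memory_mb(task_count):
    """Estimate the additional memory an import needs, in MB (1MB = 1024KB)."""
    return task_count * PER_TASK_BYTES / (1024 * KB)

estimate_2000_tasks = import_memory_mb(2000)  # the 2000-task example
estimate_1000_tasks = import_memory_mb(1000)  # the off-hours threshold
```

So the 2000-task example needs roughly 195MB beyond what typical RTC operations use.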

Best practices

If your Microsoft Project file contains more than 1000 tasks, we recommend you import or export during off-hours or when the server is lightly loaded.

Exporting a large number of work items to a Microsoft Project plan

Similar to an import, export time and load depend on the size and complexity of the plan. The impact is primarily to memory on the server.

Adding many build result contributions to a build result

When a large number of contributions (compilation contributions, JUnit tests, downloads, log files, links, work item references, and so on) are included in a build result, the server could spend a lot of time marshalling and unmarshalling the persisted contributions when adding or deleting contributions, because they are all stored in a single data structure. At best this is a slow-running operation. However, should there be a large number of concurrent builds performing similar work (adding many build result contributions), the potential for impact to the server increases.

Best practices

Keep build result contributors to a minimum. If you are using an external build tool, such as Build Forge or Jenkins that is integrated with RTC, keep the overlap between build results in both tools to a minimum, and where there is overlap, consider storing only in the external build tool. Publish large contributions as links not the actual content, that is, publish download files greater than 10MB by using the "artifactLinkPublisher" task instead of "artifactFilePublisher". See Publishing build results and contributions for a full list of tasks, some of which have a "link" instead of a "content" version.

Loading a large plan

The RTC plan editor provides users with great flexibility to display plans in customized and flexible configurations. In order to provide rapid display of custom plan configurations, the RTC planning editor must fetch all the details of each work item when loading plans. Consequently, when the scope of a plan includes a large number of work items, loading of such plans can drive server load. We have greatly improved plan loading performance with each release by deferring the loading of out placed "child" work items or by allowing users to turn on and configure server side plan filtering to avoid loading work items that will never be displayed in plans.

Rational Quality Manager

Duplicating test plans with a large test hierarchy

The impact of duplicating a test hierarchy depends on the number of items and their size, and whether you choose to copy referenced artifacts. A test plan might include multiple child test plans, each with its own test cases and test scripts, resulting in potentially thousands of artifacts, each of which could reference a large amount of content storage. Although you can count the number of objects ahead of time, you cannot determine the overall memory size of the selected hierarchy. Because the duplication occurs in a single transaction, it can require a high amount of memory to complete.

See Copying huge Test Plan brings server down with Out of Memory errors

Best practices

A best practice is not to do a deep copy and, instead, only copy references to test cases, test scripts, etc. Should a deep copy be needed, break down the overall hierarchy duplication into smaller subsets. If that is not possible, perform the operation when the system is lightly loaded, increase the available system memory, or both.

An even better practice is to move away from duplication as a way to support reuse by clone-and-own, and instead transition to using configuration management in the QM application.

Bulk archiving or deletion of test results

If you select more than 10,000 artifacts to archive or delete in bulk, the operation can take a long time and might time out. See Web browser-based artifact deletes greater than around 10,000 items will fail to execute.

Best practices

Either select one page of results at a time or ensure that fewer than 10,000 artifacts are included in the result set, that is, work with smaller sets of assets.
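
If the artifacts to archive or delete are identified by a script or a query result, the same idea can be expressed as splitting the selection into batches that stay well under the 10,000 mark. The Python helper below is a minimal sketch; the batch size of 5,000 is an arbitrary safety margin chosen for this example, and the artifact IDs are placeholders.

```python
def batches(items, max_batch_size=5_000):
    """Yield successive slices of items, each no larger than max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]

artifact_ids = list(range(25_000))  # placeholder IDs for illustration
batch_sizes = [len(batch) for batch in batches(artifact_ids)]
```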

Jazz Reporting Services (all reporting technologies)

Running BIRT reports based on live data

Reports on live data run more slowly than those using the Data Warehouse or LQE, which are optimized for reporting. In addition, custom BIRT reports can be inefficient in their construction or pull large volumes of data, increasing load. Each of the applications has an advanced server property, "Maximum Record Count", that limits the number of rows a report can fetch. Any report passing that limit will fail. The default is -1, which leaves the report unconstrained. The setting should not be used as a fix for badly behaved reports. Rather, it is a way to discover such reports, because they fail to render when they exceed the limit.
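
The fail-on-limit behavior can be sketched in a few lines of Python. This is a simplified model of the idea rather than the applications' actual logic: the function and exception names are hypothetical, and treating every negative value (not just -1) as "unconstrained" is an assumption of the sketch.

```python
class RecordLimitExceeded(Exception):
    """Raised when a report tries to fetch more rows than the limit allows."""

def fetch_rows(source, maximum_record_count=-1):
    """Collect rows from source, failing as soon as the limit is passed.

    A negative limit leaves the fetch unconstrained (assumption: only -1 is documented).
    """
    rows = []
    for row in source:
        if 0 <= maximum_record_count <= len(rows):
            raise RecordLimitExceeded(
                f"report fetched more than {maximum_record_count} rows"
            )
        rows.append(row)
    return rows

unconstrained = fetch_rows(range(5))
within_limit = fetch_rows(range(3), maximum_record_count=3)
try:
    fetch_rows(range(4), maximum_record_count=3)
    report_failed = False
except RecordLimitExceeded:
    report_failed = True
```

Because the report fails outright instead of being truncated, an oversized report is surfaced to its author rather than silently returning partial data.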

Running DCC jobs that require high storage and processing power

Most Data Collection Component (DCC) jobs can run at regular intervals, obtaining a delta of changes from the previous run. However, a few DCC jobs involve a larger amount of data and place higher demands on storage and processing power on the DCC server. Because DCC shares the same data warehouse as the applications, the load from processing the following storage- and processing-intensive jobs could affect the applications:
  • Activity Fact Details (Activity History)
  • Build Fact Details (Build History)
  • File Fact Details (File History)
  • Project Management Fact Details (Project Management History)
  • Quality Management Fact Details (Quality Management History)
  • Requirement Management Fact Details (Requirement Management History)
  • Request Management Fact Details (Request Management History)
  • Task Fact Details (Task History)
  • Jazz Foundation Services (Statistics)

Best practices

Schedule the identified jobs to run during off-hours or when server load is light. (Note: Job names listed are based on v6.0.2; where different, the names for earlier releases are in parentheses.)

Performing LQE index maintenance

Backup, compaction, re-indexing, and addition or removal of data sources can drive load on LQE.

Best practices

Schedule these scenarios during off-hours or when server usage is light.

Running high-volume and very complex queries

High-volume and very complex queries can put a heavy load on the data source. As indicated above, ensure your reports return only what is necessary for the report consumer.

In the Advanced section of the Report Builder users can edit the queries generated by Report Builder, or write custom SQL (Data Warehouse) or SPARQL (LQE) queries for a report. However, after a query has been edited, you can no longer use other Report Builder functions with that report. Inexperienced users can easily write and run an inefficient or incorrect query that could cause the data source to become unresponsive.

For LQE, you can set query service properties, such as the result limit (the default is 3000 results) and query timeout (the default is 60 seconds). LQE limits SPARQL queries based on these settings. Note that these limits do not apply to SPARQL queries against metadata.

Best practices

LQE offers a simple query interface in its Administration UI. You can use this interface to run sample queries to discover information and improve your queries.

You can also copy SPARQL queries from the Advanced section in Report Builder into this interface and make small changes to debug issues. This UI can target different scopes or configurations, or all data. Be aware that queries run from this UI still impact the data source, and are subject to the LQE Query Service limits, as described above. Access to LQE data sources can be restricted.

For more information on improving LQE performance, see Monitoring and managing the performance of Lifecycle Query Engine and Improving Lifecycle Query Engine performance.

Refreshing a data source in Report Builder

When a refresh of a data source is initiated from the Report Builder administrator UI, the data source is queried for the latest metadata. This can increase demand on both LQE and Report Builder servers and impact the performance of other reports that are running. This is especially true for LQE data sources, where most of the metadata must be queried (by contrast, most of the data warehouse metadata is hard-coded in Report Builder, which is not possible for LQE). Many factors affect how long a refresh takes: the number of project areas, the complexity of their data model (for example, a large number of enumeration values), and the amount of change.

If a refresh is in progress when a user accesses the Report Builder UI, a message displays and the user must wait until the refresh completes.

To better understand the data flow in and out of LQE, including refresh of data sources, reading from and populating the indices, see LQE Data Flows. Note that the metadata refreshes are included in the TRS feeds from the applications.

Best practices

Report Builder refreshes the data sources when the server starts. It also runs a background job twice daily to automatically refresh all data sources. Configure the refresh to run at times when your organization has a lighter load, using the following properties in the server/conf/rs/app.properties file:
  • metamodel.autorefresh.time=6\:00AM
  • metamodel.autorefresh.repeat.inminutes=720
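
To see which times of day a given pair of values produces, the Python sketch below parses the two properties and lists the resulting refresh times. It assumes that the repeat interval is counted from the configured start time and wraps around the clock, which matches the "twice daily" description for the defaults but is not confirmed for other values; the function names and the simplified properties parser are illustrative only.

```python
import re

# The colon is escaped in the properties file, so a raw string keeps the backslash.
APP_PROPERTIES = r"""
metamodel.autorefresh.time=6\:00AM
metamodel.autorefresh.repeat.inminutes=720
"""

def parse_properties(text):
    """Minimal key=value parser; handles only the escaped colon used above."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props

def refresh_times(props):
    """Return the daily refresh times as sorted HH:MM strings (24-hour clock)."""
    match = re.fullmatch(
        r"(\d{1,2}):(\d{2})\s*(AM|PM)", props["metamodel.autorefresh.time"], re.IGNORECASE
    )
    hour = int(match[1]) % 12 + (12 if match[3].upper() == "PM" else 0)
    start = hour * 60 + int(match[2])
    step = int(props["metamodel.autorefresh.repeat.inminutes"])
    runs_per_day = -(-1440 // step)  # ceiling division over the minutes in a day
    minutes = sorted({(start + k * step) % 1440 for k in range(runs_per_day)})
    return [f"{m // 60:02d}:{m % 60:02d}" for m in minutes]

schedule = refresh_times(parse_properties(APP_PROPERTIES))
```

With the default values, the refresh lands at 06:00 and 18:00; shifting the start time moves both runs together.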

Administrators can refresh individual data sources on demand; be aware of the potential impact and refresh when it is less likely to affect other report users.

Global Configuration Management

Creating streams in a global configuration hierarchy

When viewing a global configuration stream, you can multi-select any number of baselines contributing to that stream and create new streams for those baselines all at once. These can include local application baselines as well as global baselines. Global streams will automatically be created for any global baselines above the selected baselines as well. The time to generate the streams depends on the depth of the global configuration hierarchy, the number of local application (DNG, RQM, and DM) baselines in the hierarchy, and number of versioned artifacts in each.

Because the local application (DNG, RQM, DM) creates its streams, most of the demand is placed on those servers. If there are a large number of baselines to create in the application servers, or if the GCM hierarchy is very deep, you might want to create the new streams during a period of light usage. You can also create fewer streams at once by selecting fewer initial baselines and creating streams in smaller batches. Or, you can first create local streams in their respective applications, then use the "Replace" action in GCM to replace those local baselines with streams you've already created.

Creating a baseline staging stream from a global stream hierarchy

When you create a baseline staging stream (BSS) for a global configuration hierarchy, you also create a new baseline staging stream for each global stream in the hierarchy. The time to do this and the load on the GCM application depend on the number of global configurations in the hierarchy, how deeply nested they are, and the number of custom properties and links they have.

Creating a global baseline (or committing a baseline staging stream) from a global stream hierarchy

Global baselines can be created directly or through use of a baseline staging stream. First, when creating a new global baseline directly without explicit use of a baseline staging stream, the original stream is untouched, and a new baseline hierarchy is created by requesting baselines for each local configuration from the contributing application (DNG, RQM, or DM) and creating global baselines for each global configuration in the hierarchy. Similar to creating streams from the global baseline, much of the processing is done by the contributing application servers (DNG, RQM, and DM). If there are many local configurations, a given application’s load could be high.

If the expense of producing a global baseline of the entire hierarchy all at once becomes resource-intensive, then you can first create a baseline staging stream for the global hierarchy. A baseline staging stream is just a deep copy of the global streams in a hierarchy. For each global stream, a new global staging stream is created, and each baseline or local stream is added as a contribution to the global staging stream hierarchy. The baseline staging stream acts as a snapshot of the global parts, and you can then baseline those piece by piece to reduce the overall cost. Be aware that creating a baseline in pieces like this means that you will be capturing the state of the streams at different points in time. If the streams are under active development this approach could result in incompatible local baselines. To mitigate this potential issue, be sure to baseline local streams as soon as possible to reduce the problem of mismatching baselines. Because you can commit the staging stream parts of the hierarchy independently, this approach works around the server cost of baselining the entire configuration at once.

The time required to create a global baseline increases with the depth and breadth of the configuration hierarchy being baselined, specifically the number of streams in the hierarchy. Baselines that are already in the hierarchy are not baselined again.

Updating a global stream from a baseline

When you update a global stream to match a baseline, every nested baseline in the hierarchy is examined and compared against the equivalent nested baseline of the target baseline. Several changes can then occur in the stream to add, remove, and replace nested baselines. The time required to update the global stream depends on the size and complexity of the configuration hierarchy, and the number of differences between the source stream and the target baseline.

Most of the demand for this action is placed on the GCM server. If you are updating a very large stream, you might want to update the stream during a period of light usage.
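
The compare step can be pictured as a diff between two sets of contributions. The Python sketch below flattens each hierarchy into a mapping from a contribution to its baseline and derives what would be added, removed, and replaced. This is a simplified model: the real comparison walks a nested hierarchy, and the function name, keys, and baseline labels are hypothetical.

```python
def plan_stream_update(stream, target):
    """Compare two contribution maps; return (add, remove, replace) dictionaries."""
    add = {key: value for key, value in target.items() if key not in stream}
    remove = {key: value for key, value in stream.items() if key not in target}
    replace = {
        key: (stream[key], target[key])
        for key in stream.keys() & target.keys()
        if stream[key] != target[key]
    }
    return add, remove, replace

# Hypothetical contributions, keyed by application and component.
stream = {"DNG/Reqs": "baseline-1", "RQM/Tests": "baseline-4", "DM/Models": "baseline-2"}
target = {"DNG/Reqs": "baseline-3", "RQM/Tests": "baseline-4", "RTC/Code": "baseline-7"}
to_add, to_remove, to_replace = plan_stream_update(stream, target)
```

The more entries that end up in these three results, the more changes the update has to apply, which is why the number of differences drives the cost.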

