Jazz Library Monitoring the performance of Lifecycle Query Engine using MBeans

Monitoring the performance of Lifecycle Query Engine using MBeans

Lifecycle Query Engine (LQE) publishes groups of related metrics using composite data structures, which consist of nested data structures instead of just individual values. This article lists the full data structure for each MBean. All of the MBeans are published automatically; they do not have to be enabled. Attributes that are useful to monitor closely are noted within the data structures, along with a recommendation for how to configure an alert for that attribute.

It is particularly important to monitor the performance statistics related to accessing the default partition of Lifecycle Query Engine’s index. As the total amount of data stored in Lifecycle Query Engine increases, accessing the index slows down somewhat. However, a significant decrease in speed can affect many of Lifecycle Query Engine’s features and slow down query response time. If there is a significant decrease, Lifecycle Query Engine might be experiencing hardware issues, or need additional system resources to handle the data load.

It is also important to monitor the performance statistics related to fetching artifacts from the data sources. If the response time from data sources is too slow, Lifecycle Query Engine has difficulty keeping up with the changes being made to artifacts, and returns stale data when queried. The response time also has a significant impact on TRS feed reindexing performance because many artifacts must be fetched.

Data source activity metrics


Object name: com.ibm.team.integration.lqe:type=IndexingAgentMetrics,url=<TRS_URL>,node=<NODE_NAME> (Note: "node" was added in 6.0.6)


Each of these MBeans provides performance and activity metrics for the processing of the Tracked Resource Set (TRS) feed in its object name. Use these MBeans to monitor the communication between applications, and watch for network issues preventing Lifecycle Query Engine from updating in a timely manner.

Update interval: Instant. As soon as a task finishes, the metrics are available. While a task is running, metrics update periodically, based on the number of resources processed during the task.

Properties:

  • ChangeLogMetrics: Metrics related to the change log processing task that is currently running. If no update is currently running, the values might be zero. See LastChangeLogMetrics for details of the fields.
  • InitialIndexMetrics: Metrics related to loading this TRS feed into the index, when a data source is added for the first time or when it is reindexed.
    Nested attributes (level 2 attributes are indented under their level 1 attribute):

    deletionMetrics: Metrics related to the removal of any existing artifacts from this TRS feed
      deletedCount: Number of removed artifacts
      status: Indicates that the deletion was successful
      errorInfo: Indicates that the deletion failed
      startTime/endTime: Time stamps
    baseMetrics: Metrics related to the processing of the base log of the TRS feed
      committed: Number of artifacts committed to the index
      dataSourceBusy: Whether or not the tool providing this TRS feed was busy recreating TRS documents (added 6.0.6)
      errorCount: Number of errors that occurred during processing
      errorInfo: Error message
      fetchCount: Number of artifacts that were retrieved successfully from the tool providing this TRS feed
      fetchTimeAvg: Average time required to retrieve an artifact (in milliseconds)
      fetchTimeMax: Longest time required to retrieve an artifact (in milliseconds)
      fetchTimeMin: Shortest time required to retrieve an artifact (in milliseconds)
      fetchTimeStdDev: Standard deviation of the times required to fetch artifacts
      ignoredCount: Number of artifacts that the tool providing this TRS feed does not want considered for reporting (added 6.0.6.1)
      skippedCount: Number of skipped resources encountered while indexing this TRS feed
      status: Indicates whether or not the base log processing was successful
      totalResources: Total number of artifacts listed in the TRS base log
      trsFetchCount: Number of TRS base log pages retrieved successfully from the tool providing this TRS feed
      trsFetchTimeAvg: Average time required to retrieve a TRS base page (in milliseconds)
      trsFetchTimeMax: Longest time required to retrieve a TRS base page (in milliseconds)
      trsFetchTimeMin: Shortest time required to retrieve a TRS base page (in milliseconds)
      trsFetchTimeStdDev: Standard deviation of the times required to fetch TRS base pages
      startTime/endTime: Time stamps

  • LastChangeLogMetrics: Metrics related to the last completed change log update task. Contains the following attributes:
    Attribute: Description
    committed: Number of change log entries committed to the index
    deleteCount: Number of artifacts deleted from the index
    errorCount: Number of errors that occurred
      Configure an alert if the error count is greater than zero. This indicates a possible issue with report data integrity. You only need an alert on either this attribute or “status”, not both.
    errorInfo: Error message if any errors occurred
    failedPatchCount: Number of failed patches attempted (formerly “rejectedCount” in 6.0.5)
      Configure an alert if the failed patch count is greater than zero. This indicates a possible issue with report data integrity. You only need an alert on either this attribute or “status”, not both.
    failedPatchTotal: Total number of failed patches for this data source after this update (added 7.0)
    fetchCount: Number of artifacts that were retrieved successfully from the tool providing this TRS feed
    fetchTimeAvg: Average time required to retrieve an artifact (in milliseconds)
      Get a baseline for average fetch times while the applications are under a standard load, and then configure an alert if the average time rises dramatically.
    fetchTimeMax: Longest time required to retrieve an artifact (in milliseconds)
    fetchTimeMin: Shortest time required to retrieve an artifact (in milliseconds)
    fetchTimeStdDev: Standard deviation of the times required to fetch artifacts
    ignoredCount: Number of artifacts found that the tool providing this TRS feed does not want considered for reporting (added 6.0.6.1)
    patchCount: Number of processed TRS patches
    recoveredCount: Number of change events discovered that were previously missing from the TRS feed (added 7.0)
      Configure an alert if the recovered count is greater than zero. LQE recovered the missing data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool.
    skippedCount: Number of skipped resources encountered
      Configure an alert if the skipped count is greater than zero. This indicates a possible issue with report data integrity. You only need an alert on either this attribute or “status”, not both.
    skippedTotal: Total number of skipped resources for this data source after this update (added 6.0.6.1 iFix001)
    status: Indicates whether or not this update was successful
      An alert can be configured if status is not “success” instead of configuring specific alerts for each type of problem.
    totalEntries: Total number of change log entries seen during this update
    trsFetchCount: Number of TRS change log pages retrieved successfully from the tool providing this TRS feed
    trsFetchTimeAvg: Average time required to retrieve a TRS change log page (in milliseconds)
      Get a baseline for average fetch times while the applications are under a standard load, and then configure an alert if the average time rises dramatically.
    trsFetchTimeMax: Longest time required to retrieve a TRS change log page (in milliseconds)
    trsFetchTimeMin: Shortest time required to retrieve a TRS change log page (in milliseconds)
    trsFetchTimeStdDev: Standard deviation of the times required to fetch TRS change log pages

  • ReplaySkippedTaskMetrics: Metrics related to the last attempt to retry skipped resources from this TRS feed. Contains the following attributes:
    Attribute: Description
    committed: Number of artifacts committed to the index
    errorInfo: Error message if any errors occurred
    fetchCount: Number of artifacts that were retrieved successfully from the tool providing this TRS feed
    fetchTimeAvg: Average time required to retrieve an artifact (in milliseconds)
    fetchTimeMax: Longest time required to retrieve an artifact (in milliseconds)
    fetchTimeMin: Shortest time required to retrieve an artifact (in milliseconds)
    fetchTimeStdDev: Standard deviation of the times required to fetch artifacts
    skippedCount: Number of resources skipped again during processing
    status: Indicates whether or not the attempt to retry skipped resources was successful
    totalResources: Total number of skipped resources that were retried during this task
    startTime/endTime: Time stamps

  • ValidationMetrics: Metrics related to the most recent validation on this data source. Contains the following attributes (added 6.0.6 iFix003):
    Nested attributes (level 2 attributes are indented under their level 1 attribute):

    deletionMetrics: Metrics related to the cleanup of extra resources
      deletedCount: Number of removed artifacts
      status: Indicates that the deletion was successful
      errorInfo: Indicates that the deletion failed
      startTime/endTime: Time stamps
    baseMetrics: Metrics related to validating the content from the data source
      committed: Number of artifacts committed to the index
      dataSourceBusy: Whether or not the tool providing this TRS feed was busy recreating TRS documents
      errorCount: Number of errors that occurred during processing
        Configure an alert if the error count is greater than zero. This indicates a possible issue with report data integrity. You only need an alert on either this attribute or “status”, not both.
      errorInfo: Error message
      fetchCount: Number of artifacts that were retrieved successfully from the tool providing this TRS feed
      fetchTimeAvg: Average time required to retrieve an artifact (in milliseconds)
      fetchTimeMax: Longest time required to retrieve an artifact (in milliseconds)
      fetchTimeMin: Shortest time required to retrieve an artifact (in milliseconds)
      fetchTimeStdDev: Standard deviation of the times required to fetch artifacts
      ignoredCount: Number of artifacts that the tool providing this TRS feed does not want considered for reporting
      missingResourcesCount: Number of artifacts added during validation that were previously missing from the index
        Configure an alert if the missing resource count is greater than zero. LQE recovered the missing data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool.
      outdatedResourcesCount: Number of artifacts updated during validation that previously had stale data in the index
        Configure an alert if the outdated resource count is greater than zero. LQE refreshed the stale data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool.
      skippedCount: Number of skipped resources encountered while indexing this TRS feed
        Configure an alert if the skipped count is greater than zero. This indicates a possible issue with report data integrity. You only need an alert on either this attribute or “status”, not both.
      status: Indicates whether or not the validation was successful
        An alert can be configured if status is not “success” instead of configuring alerts for “errorCount” and “skippedCount”.
      totalResources: Total number of artifacts listed in the TRS feed
      trsFetchCount: Number of TRS base log pages retrieved successfully from the tool providing this TRS feed
      trsFetchTimeAvg: Average time required to retrieve a TRS base page (in milliseconds)
      trsFetchTimeMax: Longest time required to retrieve a TRS base page (in milliseconds)
      trsFetchTimeMin: Shortest time required to retrieve a TRS base page (in milliseconds)
      trsFetchTimeStdDev: Standard deviation of the times required to fetch TRS base pages
      startTime/endTime: Time stamps
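
The alert recommendations for LastChangeLogMetrics can be sketched as a small check. This is only a minimal sketch: it assumes that some JMX client has already read the composite data and converted it into a plain Python dictionary keyed by the attribute names listed above. The function name change_log_alerts, the dictionary shape, and the use_status_only switch are hypothetical and are not part of LQE.

```python
# Hypothetical helper: evaluates the alert rules described for LastChangeLogMetrics.
# Assumes `metrics` is a dict built from the MBean's composite data by your JMX client.

# Counters that should stay at zero, with the reason given in the attribute notes.
ZERO_COUNTERS = {
    "errorCount": "possible issue with report data integrity",
    "failedPatchCount": "possible issue with report data integrity",
    "skippedCount": "possible issue with report data integrity",
    "recoveredCount": "problem with the tool publishing the TRS feed",
}


def change_log_alerts(metrics, use_status_only=False):
    """Return a list of alert messages for one LastChangeLogMetrics reading."""
    alerts = []
    if use_status_only:
        # A single alert on "status" can replace the per-counter alerts.
        if metrics.get("status") != "success":
            alerts.append(
                "status is %r: %s" % (metrics.get("status"), metrics.get("errorInfo"))
            )
        return alerts
    for name, reason in ZERO_COUNTERS.items():
        # Attributes that are absent (for example on older releases) count as zero.
        if metrics.get(name, 0) > 0:
            alerts.append("%s=%d: %s" % (name, metrics[name], reason))
    return alerts


# Example reading with two skipped resources (values are made up).
for message in change_log_alerts({"errorCount": 0, "skippedCount": 2}):
    print(message)
```

The same pattern could be applied to the baseMetrics structure of ValidationMetrics by adding “missingResourcesCount” and “outdatedResourcesCount” to the counter table.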


Accessing LQE index partitions

Object name: com.ibm.team.jis.lqe:type=DatasetMetrics,name="default" OR "shapes" OR "history" OR "version" (Note "version" added in 6.0.6 iFix003)

Each of these MBeans provides metrics about accessing the index partition in its object name. There are four in total, one for each of the four automatically created partitions. Use them to monitor for performance problems stemming from disk I/O. The most important partition to monitor is the default partition. Most of the data is contained in this partition, and it is accessed the most frequently.

Update interval: 1 minute
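
Because these metrics refresh every minute, one way to watch for slowdowns is to keep a rolling baseline of a single average attribute (for example, queryTimeAvg from the performance data described below) and flag a reading that rises well above it. The following is a minimal sketch of that idea, not an LQE feature: the class name BaselineMonitor and the window, factor, and min_samples values are illustrative assumptions that you would tune for your own deployment.

```python
from collections import deque
from statistics import mean


class BaselineMonitor:
    """Tracks recent readings of one average attribute and flags sharp rises."""

    def __init__(self, window=60, factor=3.0, min_samples=10):
        # window, factor and min_samples are illustrative values, not LQE defaults.
        self.samples = deque(maxlen=window)
        self.factor = factor
        self.min_samples = min_samples

    def check(self, value):
        """Return True if `value` exceeds factor times the baseline built so far."""
        alert = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            alert = baseline > 0 and value > self.factor * baseline
        if not alert:
            # Only normal readings feed the baseline, so a sustained slowdown
            # keeps alerting instead of becoming the new normal.
            self.samples.append(value)
        return alert
```

Feed the monitor one reading per polling cycle, taken while the applications are under typical load at first so that the baseline reflects normal behavior.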

  • DatasetMetrics – This structure contains metrics about reading and writing data to this partition. There are two distinct types of measurements, separated into the two tables below:

    • Performance data

      LQE tracks and graphs several of these fields over time, so each reading consists of only data from the most recent minute. After the data is collected, all these statistics are reset and collection for the next minute starts again.
      An alert can be configured for any of the average values listed here. Get a baseline reading while the applications are under typical load, and configure an alert if the average time rises dramatically. A dramatic increase in any of the average values indicates a bottleneck in data processing.

      Attribute: Description
      graphProcessingTimeAvg: Average time required to process the data read from a TRS feed before writing it to the index (in milliseconds). Use for alerts.
      graphProcessingTimeStdDev: Standard deviation of the time required to process data read from TRS feeds before writing it to the index
      luceneWriteTimeAvg: Average time required to write to the Lucene text search index associated with this LQE index (in milliseconds). Use for alerts.
      luceneWriteTimeStdDev: Standard deviation of the time required to write to the Lucene text search index associated with this LQE index
      queryTimeAvg: Average time required to return the results of an incoming query (in milliseconds). Use for alerts.
      queryTimeStdDev: Standard deviation of the time required to return the results of an incoming query
      queryWaitTimeAvg: Average time between receiving an incoming query and being able to access the index to process it (in milliseconds). Use for alerts.
      queryWaitTimeStdDev: Standard deviation of the wait time to process queries
      readCount: Number of read transactions processed
      readTimeAvg: Average time between the start and the end of a read transaction (in milliseconds). Use for alerts.
      readTimeStdDev: Standard deviation of the duration of read transactions
      readWaitTimeAvg: Average time between starting a read transaction and being able to access the index to process it (in milliseconds). Use for alerts.
      readWaitTimeStdDev: Standard deviation of the wait time to process read transactions
      tdbSyncTimeAvg: Average time required to commit a write transaction to the index (in milliseconds). Use for alerts.
      tdbSyncTimeStdDev: Standard deviation of the time required to commit write transactions to the index
      writeCount: Number of processed write transactions
      writeTimeAvg: Average time between the start and the end of a write transaction (in milliseconds). Use for alerts.
      writeTimeStdDev: Standard deviation of the duration of write transactions
      writeWaitTimeAvg: Average time between starting a write transaction and being able to access the index to process it (in milliseconds). Use for alerts.
      writeWaitTimeStdDev: Standard deviation of the wait time to process write transactions

    • Non-performance data

      These attributes either track totals since the LQE server was started, or contain the value for the instant they were polled.

      Attribute: Description
      abortedWriters: Number of write transactions aborted since the server started
      activeReaders: Number of read transactions that are currently open
      activeWriters: Number of write transactions that are currently open
      committedWriters: Number of write transactions committed since the server started
      concurrentQueries: Number of currently running queries
      finishedReaders: Number of read transactions that have completed
      fsSize: Size of this LQE partition on disk (in bytes)
      graphCount: Number of different graphs stored in this partition
      heapSuspensions: Number of times since the server started that LQE tried to suspend activity for this partition because of high JVM heap usage
      location: File system location where this partition is stored
      maxConcurrentQueries: Largest number of queries that have been running simultaneously since the server started
      mode: Type of file I/O used by this partition, either “mapped” or “direct”, depending on whether memory mapped I/O is enabled
      queryCount: Number of queries run against this partition since the server started
      queryFailureCount: Number of queries against this partition that failed since the server started
      queuedCommits: Number of committed write transactions currently waiting to be written to this partition on disk
      stackSuspensions: Number of times since the server started that LQE tried to suspend activity for this partition because of a high number of queued commits
      suspensionCompletions: Number of times since the server started that activity for this partition was successfully suspended
      suspensionErrors: Number of times since the server started that an error occurred while trying to suspend activity for this partition
      suspensionPending: Whether or not LQE is currently trying to suspend activity for this partition
      suspensionTimeouts: Number of times since the server started that LQE failed to suspend activity for this partition because existing transactions did not complete in time
      tripleCount: Number of individual pieces of data currently stored in this partition
      version: Version number of the Jena TDB component used to implement this partition

  • LastQueryLoadSummary – This structure contains metrics summarizing all the queries and reports within the last day that read from this partition (added 7.0).

    Attribute: Description
    highPercent: Portion of the day during which the query load was high, as a number between 0.0 and 1.0
    lowPercent: Portion of the day during which the query load was low, as a number between 0.0 and 1.0
    medPercent: Portion of the day during which the query load was medium, as a number between 0.0 and 1.0
    queryAttempted: Total number of queries that were submitted that day
    queryCacheHits: Total number of queries that day that returned cached data
    queryExeAvg: Average response time (in milliseconds) for queries that day
    queryExecuted: Total number of queries that day that were actually executed, including queries that returned cached data
    queryFailed: Total number of queries that day that were submitted but not executed. These are usually caused by syntax errors in the queries.
    queryTimeouts: Total number of queries that day that timed out without completing
    status: Overall query load status for the day. Status is “failed” if the query load was ever high, “warnings” if the query load was ever medium, or “success” if the query load was always low.
      You can configure an alert if the status is not “success”. You do not need this alert if you have one configured for the “queryLoad” property of the QueryLoadMetrics structure instead.
    startTime/endTime: Time stamps for the start and end of the time period being summarized

  • QueryLoadMetrics – This structure contains metrics about queries and reports that read from this partition and finished within the last minute (added 7.0).

    Attribute: Description
    endTime: Time stamp when queries stopped being considered for this minute’s metrics
    queryAttempted: Number of queries that were submitted
    queryCacheHits: Number of queries that returned cached data
    queryExeAvg: Average response time (in milliseconds) for queries
    queryExeMax: Longest response time for a query
    queryExeMin: Shortest response time for a query
    queryExeStdDev: Standard deviation of the query response time
    queryExecuted: Number of queries that were actually executed, including queries that returned cached data
    queryFailed: Number of queries that were submitted but not executed. These are usually caused by syntax errors in the queries.
    queryLoad: Overall query load during this minute. This is determined by comparing the timeout percentage to the timeout thresholds.
      You can configure an alert if the load level is not “LOW”. This provides very specific information about the timing of the problem, but might cause a very large number of alerts. To get fewer alerts, configure an alert for the “status” property of the LastQueryLoadSummary structure instead.
    queryTimeoutPercentage: Percentage of queries that timed out without completing
    queryTimeouts: Number of queries that timed out without completing
    timeoutThresholdHigh: Percentage of queries that must time out for the query load to be considered high
    timeoutThresholdMed: Percentage of queries that must time out for the query load to be considered medium
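
The relationship between the timeout percentage, the two thresholds, and the daily status can be illustrated with a short sketch. It is an approximation under stated assumptions, not LQE's actual implementation: the thresholds are treated as inclusive lower bounds, and the level names “MEDIUM” and “HIGH” are hypothetical (only “LOW” appears in the attribute notes above). The function names are also invented for illustration.

```python
# Sketch of how a query load level could be derived from QueryLoadMetrics fields.
# Assumptions: thresholds are inclusive, and "MEDIUM"/"HIGH" are hypothetical names.

def classify_query_load(timeout_percentage, threshold_med, threshold_high):
    """Compare the timeout percentage to the medium and high thresholds."""
    if timeout_percentage >= threshold_high:
        return "HIGH"
    if timeout_percentage >= threshold_med:
        return "MEDIUM"
    return "LOW"


def summarize_day(load_levels):
    """Derive a LastQueryLoadSummary-style status from per-minute load levels."""
    levels = list(load_levels)
    if "HIGH" in levels:
        return "failed"      # query load was high at least once
    if "MEDIUM" in levels:
        return "warnings"    # query load was medium at least once, never high
    return "success"         # query load was always low
```

Alerting on the per-minute level pinpoints when a problem occurred, while alerting on the daily summary trades that precision for fewer notifications.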

Latest LQE maintenance activity results

Object name: com.ibm.team.jis.lqe:type=MaintenanceActivity

This MBean tracks the most recent results from several maintenance activities run in the background to either administer LQE or track the health of the LQE application. A failure from any of these tasks might indicate a problem with the deployment environment for LQE, and should be investigated.

Update interval: Instant or 15 minutes. Anything that occurs on the node being monitored will be reflected instantly. Records of backups made on other nodes update every 15 minutes.

  • PreviousBackupMetrics – Metrics related to the last backup run:

    Attribute: Description
    details: If the backup failed, the error message associated with the failure
    location: Location on the file system where the backup was stored
    nodeId: ID of the node where the backup was stored
    reason: High-level type of backup failure. If the backup is successful, you see “OTHER”.
    size: Size of the backup, as a display string
    status: Overall status of the backup attempt. Configure an alert if it is not “success”.
    startTime/endTime: Time stamps

  • PreviousCompactionMetrics – Metrics related to the last compaction run:

    Attribute: Description
    details: If the compaction failed, the error message associated with the failure
    newSize: Total size of the indexes after the compaction, as a display string
    nodeId: ID of the node where the compaction occurred
    oldSize: Total size of the indexes before the compaction, as a display string
    reason: High-level type of compaction failure. If the compaction is successful, you see “OTHER”.
    status: Overall status of the compaction attempt. Configure an alert if it is not “success”.
    startTime/endTime: Time stamps

  • PreviousSystemClockMetrics – Metrics related to the last system clock verification

    Attribute: Description
    allowedDifference: Amount of time (in milliseconds) that the system clock can differ from the time returned by the NTP server and still pass the verification
    status: Overall status of the system clock verification. Configure an alert if it is not “success”.
    timeDifference: Amount of time (in milliseconds) that the system clock differs from the time returned by the NTP server
    timestamp: Time stamp of when the verification was run
    type: Result type of the verification attempt
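
The “status” checks recommended for the three maintenance structures can be combined into one routine. The sketch below assumes the MaintenanceActivity attributes have already been fetched by a JMX client into a dictionary of dictionaries; the function names and that dictionary shape are hypothetical. The clock check also assumes that timeDifference may be negative and that a difference exactly equal to allowedDifference still passes, neither of which is stated in the tables above.

```python
# Hypothetical helpers for the MaintenanceActivity MBean, working on plain dicts.

STRUCTURES = (
    "PreviousBackupMetrics",
    "PreviousCompactionMetrics",
    "PreviousSystemClockMetrics",
)


def maintenance_alerts(activity):
    """Return alert messages for backup, compaction and system clock results."""
    alerts = []
    for name in STRUCTURES:
        result = activity.get(name)
        if result is None:
            continue  # the activity has not run yet, or was not collected
        if result.get("status") != "success":
            # Backup and compaction carry "details"; the clock check carries "type".
            detail = result.get("details") or result.get("type") or "no details"
            alerts.append("%s: status=%s (%s)" % (name, result.get("status"), detail))
    return alerts


def clock_within_tolerance(clock):
    """Compare timeDifference with allowedDifference (both in milliseconds)."""
    return abs(clock["timeDifference"]) <= clock["allowedDifference"]
```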


Load level of the system


Object name: com.ibm.team.jis.lqe:type=SystemLoad

This MBean publishes the data LQE collects about the load level the system is experiencing. Use it to monitor whether the server hardware is sufficient for LQE to run quickly and responsively.

Update interval: 1 minute

  • HeapMemoryUsage – Current value and thresholds for the percentage of used JVM heap memory on the JVM running the LQE application. To minimize false alerts, the current value for heap memory is the percentage of heap memory used after the most recent garbage collection before data collection, not the amount used at the moment of data collection.
  • DiskUsage – Current value and thresholds for the percentage of used disk space on the server
Each of these attributes contains the following fields:

Attribute: Description
criticalThreshold: The critical threshold percentage configured for this load measurement. If the current value exceeds this value, query load shedding may occur.
value: The current value for this load measurement. Configure an alert if the value exceeds the warning or critical threshold (your preference). You can configure LQE to send email notifications in this case, if the mailing service is enabled.
warningThreshold: The warning threshold percentage configured for this load measurement.
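
A threshold comparison for either measurement might look like the following minimal sketch. It assumes the HeapMemoryUsage or DiskUsage composite data is already available as a dictionary with the three fields above; the function name load_level and the returned labels (“ok”, “warning”, “critical”) are hypothetical, chosen only for illustration.

```python
def load_level(measurement):
    """Classify one SystemLoad measurement (HeapMemoryUsage or DiskUsage)."""
    value = measurement["value"]
    # "Exceeds" is read as strictly greater than the configured threshold.
    if value > measurement["criticalThreshold"]:
        return "critical"
    if value > measurement["warningThreshold"]:
        return "warning"
    return "ok"


# Example with made-up numbers: 85% used, warning at 80%, critical at 90%.
print(load_level({"value": 85, "warningThreshold": 80, "criticalThreshold": 90}))
```

Whether to alert at the warning or the critical level is a matter of preference, as noted in the description of “value”.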

These MBeans help you monitor your LQE system to avoid performance issues.


About the author

Stephen Giesbrecht is a software developer working on Jazz Reporting Service. He can be reached at Stephen.Giesbrecht@ca.ibm.com.

Tue, 31 Oct 2017