Monitoring the performance of Lifecycle Query Engine using MBeans
Lifecycle Query Engine publishes groups of related metrics using composite data structures, which consist of nested data structures instead of just individual values. This article lists the full data structure for each MBean. All of the MBeans are published automatically; they do not have to be enabled. Attributes that are useful to monitor closely are noted within the data structures, along with a recommendation for how to configure an alert for that attribute.
It is particularly important to monitor the performance statistics related to accessing the default partition of Lifecycle Query Engine’s index. As the total amount of data stored in Lifecycle Query Engine increases, accessing the index slows down somewhat. However, a significant decrease in speed can affect many of Lifecycle Query Engine’s features and slow down query response time. If there is a significant decrease, Lifecycle Query Engine might be experiencing hardware issues, or need additional system resources to handle the data load.
It is also important to monitor the performance statistics related to fetching artifacts from the data sources. If the response time from data sources is too slow, Lifecycle Query Engine has difficulty keeping up with the changes being made to artifacts, and returns stale data when queried. The response time also has a significant impact on TRS feed reindexing performance because many artifacts must be fetched.
Data source activity metrics
Object name: com.ibm.team.integration.lqe:type=IndexingAgentMetrics,url=<TRS_URL>,node=<NODE_NAME> (Note: "node" was added in 6.0.6)
Each of these MBeans provides performance and activity metrics for the processing of the Tracked Resource Set (TRS) feed in its object name. Use these MBeans to monitor the communication between applications, and watch for network issues preventing Lifecycle Query Engine from updating in a timely manner.
Update interval: Instant. As soon as a task finishes, the metrics are available. While a task is running, metrics update periodically, based on the number of resources processed during the task.
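
Because there is one of these MBeans per TRS feed (and per node), a monitoring client usually discovers them with an object name pattern rather than hard-coding each URL. The following is a minimal sketch of that lookup, not part of LQE itself. The domain and type come from the object name above; the class name `IndexingAgentLookup` is hypothetical, and the local platform MBean server is only a stand-in for a real connection to the LQE JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class IndexingAgentLookup {
    // Domain and type are taken from the object name documented above.
    // The trailing ",*" matches any url and node key values.
    static final String PATTERN =
            "com.ibm.team.integration.lqe:type=IndexingAgentMetrics,*";

    /** Returns a pattern that matches every IndexingAgentMetrics MBean. */
    static ObjectName allFeeds() throws MalformedObjectNameException {
        return new ObjectName(PATTERN);
    }

    /** Lists the matching MBean names on any connection, local or remote. */
    static Set<ObjectName> find(MBeanServerConnection conn) throws Exception {
        return conn.queryNames(allFeeds(), null);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: in a real deployment this would be a remote JMX
        // connection to the LQE server. The platform MBean server is used
        // here only so the sketch runs anywhere; it will normally find nothing.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : find(conn)) {
            System.out.println(name.getKeyProperty("url")
                    + " on node " + name.getKeyProperty("node"));
        }
    }
}
```

Matching on a pattern also avoids having to know whether the url key value is quoted in the registered name.
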
Properties:
ChangeLogMetrics
: Metrics related to the change log processing task that is currently running. If no update is currently running, the values might be zero. See LastChangeLogMetrics for details of the fields.
InitialIndexMetrics
: Metrics related to loading this TRS feed into the index, when a data source is added for the first time or when it is reindexed.

| Nested Attribute Level 1 | Nested Attribute Level 2 | Description |
|---|---|---|
| deletionMetrics | | Metrics related to the removal of any existing artifacts from this TRS feed |
| | deletedCount | Number of removed artifacts |
| | status | Indicates that the deletion was successful |
| | errorInfo | Indicates that the deletion failed |
| | startTime/endTime | Time stamps |
| baseMetrics | | Metrics related to the processing of the base log of the TRS feed |
| | committed | Number of artifacts committed to the index |
| | dataSourceBusy | Whether or not the tool providing this TRS feed was busy recreating TRS documents (added 6.0.6) |
| | errorCount | Number of errors that occurred during processing. |
| | errorInfo | Error message |
| | fetchCount | Number of artifacts that were retrieved successfully from the tool providing this TRS feed |
| | fetchTimeAvg | Average time to retrieve an artifact (in milliseconds) |
| | fetchTimeMax | Longest time required to retrieve an artifact (in milliseconds) |
| | fetchTimeMin | Shortest time required to retrieve an artifact (in milliseconds) |
| | fetchTimeStdDev | Standard deviation of the times required to fetch artifacts |
| | ignoredCount | Number of artifacts that the tool providing this TRS feed does not want considered for reporting (added 6.0.6.1) |
| | skippedCount | Number of skipped resources encountered while indexing this TRS feed |
| | status | Indicates whether or not the base log processing was successful. |
| | totalResources | Total number of artifacts listed in the TRS base log |
| | trsFetchCount | Number of TRS base log pages retrieved successfully from the tool providing this TRS feed |
| | trsFetchTimeAvg | Average time required to retrieve a TRS base page (in milliseconds) |
| | trsFetchTimeMax | Longest time required to retrieve a TRS base page (in milliseconds) |
| | trsFetchTimeMin | Shortest time required to retrieve a TRS base page (in milliseconds) |
| | trsFetchTimeStdDev | Standard deviation of the times required to fetch TRS base pages |
| | startTime/endTime | Time stamps |

LastChangeLogMetrics
: Metrics related to the last completed change log update task. Contains the following attributes:
Attribute | Description |
---|---|
committed | Number of change log entries committed to the index |
deleteCount | Number of artifacts deleted from the index |
errorCount | Number of errors that occurred. Configure an alert if the error count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”. |
errorInfo | Error message if any errors occurred |
failedPatchCount | Number of failed patches attempted (formerly “rejectedCount” in 6.0.5). Configure an alert if the failed patch count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”. |
failedPatchTotal | Total number of failed patches for this data source after this update (added 7.0) |
fetchCount | Number of artifacts that were retrieved successfully from the tool providing this TRS feed |
fetchTimeAvg | Average time required to retrieve an artifact (in milliseconds). Get a baseline for average fetch times while the applications are under a standard load, and then configure an alert if the average time rises dramatically. |
fetchTimeMax | Longest time required to retrieve an artifact (in milliseconds) |
fetchTimeMin | Shortest time required to retrieve an artifact (in milliseconds) |
fetchTimeStdDev | Standard deviation of the times required to fetch artifacts |
ignoredCount | Number of artifacts found that the tool providing this TRS feed does not want considered for reporting (added 6.0.6.1) |
patchCount | Number of processed TRS patches |
recoveredCount | Number of change events discovered that were previously missing from the TRS feed (added 7.0). Configure an alert if the recovered count is greater than zero. LQE recovered the missing data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool. |
skippedCount | Number of skipped resources encountered. Configure an alert if the skipped count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”. |
skippedTotal | Total number of skipped resources for this data source after this update (added 6.0.6.1 iFix001) |
status | Indicates whether or not this update was successful. An alert can be configured if status is not “success” instead of configuring specific alerts for each type of problem. |
totalEntries | Total number of change log entries seen during this update |
trsFetchCount | Number of TRS change log pages retrieved successfully from the tool providing this TRS feed |
trsFetchTimeAvg | Average time required to retrieve a TRS change log page (in milliseconds). Get a baseline for average fetch times while the applications are under a standard load, then configure an alert if the average time rises dramatically. |
trsFetchTimeMax | Longest time required to retrieve a TRS change log page (in milliseconds) |
trsFetchTimeMin | Shortest time required to retrieve a TRS change log page (in milliseconds) |
trsFetchTimeStdDev | Standard deviation of the times required to fetch TRS change log pages |
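
The alert recommendations in the table above can be sketched as a small check over the composite data. This is an illustration only: the class name `ChangeLogAlerts` is hypothetical, the item types (long counters, string status) are assumptions rather than documented facts, and the sample structure is built locally so the code runs without an LQE server.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.openmbean.CompositeData;
import javax.management.openmbean.CompositeDataSupport;
import javax.management.openmbean.CompositeType;
import javax.management.openmbean.OpenDataException;
import javax.management.openmbean.OpenType;
import javax.management.openmbean.SimpleType;

final class ChangeLogAlerts {
    // Counters the table recommends alerting on when greater than zero.
    private static final String[] COUNTERS =
            {"errorCount", "failedPatchCount", "skippedCount", "recoveredCount"};

    /** Returns one entry per triggered alert; an empty list means nothing to report. */
    static List<String> evaluate(CompositeData metrics) {
        List<String> alerts = new ArrayList<>();
        for (String key : COUNTERS) {
            // containsKey guards against attributes that a given release does not publish.
            if (metrics.containsKey(key) && metrics.get(key) instanceof Number) {
                long value = ((Number) metrics.get(key)).longValue();
                if (value > 0) {
                    alerts.add(key + "=" + value);
                }
            }
        }
        if (metrics.containsKey("status")) {
            String status = String.valueOf(metrics.get("status"));
            // Case-insensitive comparison is an assumption about the published value.
            if (!"success".equalsIgnoreCase(status)) {
                alerts.add("status=" + status);
            }
        }
        return alerts;
    }

    /** Builds a stand-in sample. The item types used here are assumed. */
    static CompositeData sample(long errors, long failedPatches, long skipped,
                                long recovered, String status) {
        String[] names = {"errorCount", "failedPatchCount", "skippedCount",
                          "recoveredCount", "status"};
        OpenType<?>[] types = {SimpleType.LONG, SimpleType.LONG, SimpleType.LONG,
                               SimpleType.LONG, SimpleType.STRING};
        try {
            CompositeType type = new CompositeType(
                    "LastChangeLogMetricsSample", "Illustrative stand-in", names, names, types);
            return new CompositeDataSupport(type, names,
                    new Object[] {errors, failedPatches, skipped, recovered, status});
        } catch (OpenDataException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(sample(0, 0, 0, 0, "success")));
        System.out.println(evaluate(sample(2, 0, 1, 0, "failed")));
    }
}
```

Against a live server, the structure would presumably be read with `MBeanServerConnection.getAttribute(name, "LastChangeLogMetrics")` and cast to `CompositeData`. As the table notes, alerting on “status” alone is an alternative to the individual counters.
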
ReplaySkippedTaskMetrics
: Metrics related to the last attempt to retry skipped resources from this TRS feed. Contains the following attributes:
Attribute | Description |
---|---|
committed | The number of artifacts committed to the index |
errorInfo | Error message if any errors occurred |
fetchCount | Number of artifacts that were retrieved successfully from the tool providing this TRS feed |
fetchTimeAvg | Average time required to retrieve an artifact (in milliseconds) |
fetchTimeMax | Longest time required to retrieve an artifact (in milliseconds) |
fetchTimeMin | Shortest time required to retrieve an artifact (in milliseconds) |
fetchTimeStdDev | Standard deviation of the times required to fetch artifacts |
skippedCount | Number of resources skipped again during processing |
status | Indicates whether or not the attempt to retry skipped resources was successful |
totalResources | The total number of skipped resources that were retried during this task |
startTime/endTime | Time stamps |
ValidationMetrics
: Metrics related to the most recent validation on this data source. Contains the following attributes (added 6.0.6 iFix003):

| Nested Attribute Level 1 | Nested Attribute Level 2 | Description |
|---|---|---|
| deletionMetrics | | Metrics related to the cleanup of extra resources |
| | deletedCount | Number of removed artifacts |
| | status | Indicates that the deletion was successful |
| | errorInfo | Indicates that the deletion failed |
| | startTime/endTime | Time stamps |
| baseMetrics | | Metrics related to validating the content from the data source |
| | committed | Number of artifacts committed to the index |
| | dataSourceBusy | Whether or not the tool providing this TRS feed was busy recreating TRS documents |
| | errorCount | Number of errors that occurred during processing. Configure an alert if the error count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”. |
| | errorInfo | Error message |
| | fetchCount | Number of artifacts that were retrieved successfully from the tool providing this TRS feed |
| | fetchTimeAvg | Average time to retrieve an artifact (in milliseconds) |
| | fetchTimeMax | Longest time required to retrieve an artifact (in milliseconds) |
| | fetchTimeMin | Shortest time required to retrieve an artifact (in milliseconds) |
| | fetchTimeStdDev | Standard deviation of the times required to fetch artifacts |
| | ignoredCount | Number of artifacts that the tool providing this TRS feed does not want considered for reporting |
| | missingResourcesCount | Number of artifacts added during validation that were previously missing from the index. Configure an alert if the missing resource count is greater than zero. LQE recovered the missing data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool. |
| | outdatedResourcesCount | Number of artifacts updated during validation that previously had stale data in the index. Configure an alert if the outdated resource count is greater than zero. LQE corrected the stale data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool. |
| | skippedCount | Number of skipped resources encountered while indexing this TRS feed. Configure an alert if the skipped count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”. |
| | status | Indicates whether or not the validation was successful. An alert can be configured if status is not “success” instead of configuring alerts for “errorCount” and “skippedCount”. |
| | totalResources | Total number of artifacts listed in the TRS feed |
| | trsFetchCount | Number of TRS base log pages retrieved successfully from the tool providing this TRS feed |
| | trsFetchTimeAvg | Average time required to retrieve a TRS base page (in milliseconds) |
| | trsFetchTimeMax | Longest time required to retrieve a TRS base page (in milliseconds) |
| | trsFetchTimeMin | Shortest time required to retrieve a TRS base page (in milliseconds) |
| | trsFetchTimeStdDev | Standard deviation of the times required to fetch TRS base pages |
| | startTime/endTime | Time stamps |

Accessing LQE index partitions
Object name: com.ibm.team.jis.lqe:type=DatasetMetrics,name="default" OR "shapes" OR "history" OR "version" (Note: "version" was added in 6.0.6 iFix003)
Each of these MBeans provides metrics about accessing the index partition in its object name. There are four in total, one for each of the four automatically created partitions. Use them to monitor for performance problems stemming from disk I/O. The most important partition to monitor is the default partition. Most of the data is contained in this partition, and it is accessed the most frequently.
Update interval: 1 minute
DatasetMetrics
– This structure contains metrics about reading and writing data to this partition. There are two distinct types of measurements, separated into the two tables below:

Performance data

LQE tracks and graphs several of these fields over time, so each reading consists of only data from the most recent minute. After the data is collected, all these statistics are reset and collection for the next minute starts again.

An alert can be configured for any of the average values listed here. Get a baseline reading while the applications are under typical load, and configure an alert if the average time rises dramatically. A dramatic increase in any of the average values indicates a bottleneck in data processing.

Attribute | Description |
---|---|
graphProcessingTimeAvg | Average time required to process the data read from a TRS feed before writing it to the index (in milliseconds). Use for alerts. |
graphProcessingTimeStdDev | Standard deviation of the time required to process data read from TRS feeds before writing it to the index |
luceneWriteTimeAvg | Average time required to write to the Lucene text search index associated with this LQE index (in milliseconds). Use for alerts. |
luceneWriteTimeStdDev | Standard deviation of the time required to write to the Lucene text search index associated with this LQE index |
queryTimeAvg | Average time required to return the results of an incoming query (in milliseconds). Use for alerts. |
queryTimeStdDev | Standard deviation of the time required to return the results of an incoming query |
queryWaitTimeAvg | Average time between receiving an incoming query and being able to access the index to process it (in milliseconds). Use for alerts. |
queryWaitTimeStdDev | Standard deviation of the wait time to process queries |
readCount | Number of read transactions processed |
readTimeAvg | Average time between the start and end of a read transaction (in milliseconds). Use for alerts. |
readTimeStdDev | Standard deviation of the duration of read transactions |
readWaitTimeAvg | Average time between starting a read transaction and being able to access the index to process it (in milliseconds). Use for alerts. |
readWaitTimeStdDev | Standard deviation of the wait time to process read transactions |
tdbSyncTimeAvg | Average time required to commit a write transaction to the index (in milliseconds). Use for alerts. |
tdbSyncTimeStdDev | Standard deviation of the time required to commit write transactions to the index |
writeCount | Number of processed write transactions |
writeTimeAvg | Average time between the start and the end of a write transaction (in milliseconds). Use for alerts. |
writeTimeStdDev | Standard deviation of the duration of write transactions |
writeWaitTimeAvg | Average time between starting a write transaction and being able to access the index to process it (in milliseconds). Use for alerts. |
writeWaitTimeStdDev | Standard deviation of the wait time to process write transactions |

Non-performance data

These attributes either track totals since the LQE server was started, or contain the value for the instant they were polled.

Attribute | Description |
---|---|
abortedWriters | Number of write transactions aborted since the server started |
activeReaders | Number of read transactions that are currently open |
activeWriters | Number of write transactions that are currently open |
committedWriters | Number of write transactions committed since the server started |
concurrentQueries | Number of currently running queries |
finishedReaders | Number of read transactions that have completed |
fsSize | Size of this LQE partition on disk (in bytes) |
graphCount | Number of different graphs stored in this partition |
heapSuspensions | Number of times since the server started that LQE tried to suspend activity for this partition because of high JVM heap usage |
location | File system location where this partition is stored |
maxConcurrentQueries | Largest number of queries that have been running simultaneously since the server started |
mode | Type of file I/O used by this partition, either “mapped” or “direct”, depending on whether memory mapped I/O is enabled |
queryCount | Number of queries run against this partition since the server started |
queryFailureCount | Number of queries against this partition that failed since the server started |
queuedCommits | Number of committed write transactions currently waiting to be written to this partition on disk |
stackSuspensions | Number of times since the server started that LQE tried to suspend activity for this partition because of a high number of queued commits |
suspensionCompletions | Number of times since the server started that activity for this partition was successfully suspended |
suspensionErrors | Number of times since the server started that an error occurred while trying to suspend activity for this partition |
suspensionPending | Whether or not LQE is currently trying to suspend activity for this partition |
suspensionTimeouts | Number of times since the server started that LQE failed to suspend activity for this partition because existing transactions did not complete in time |
tripleCount | Number of individual pieces of data currently stored in this partition |
version | Version number of the Jena TDB component used to implement this partition |

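
The baseline approach recommended for the average values (for example `readTimeAvg` or `queryTimeAvg`) could be sketched as follows. The class name `BaselineAlert` is hypothetical, and so is the multiplier: the article only says to alert when the average rises "dramatically", so the factor of 3 below is an arbitrary placeholder to tune for your deployment.

```java
final class BaselineAlert {
    private final double baseline; // mean of readings taken under typical load
    private final double factor;   // how many times the baseline counts as "dramatic"

    BaselineAlert(double[] baselineSamples, double factor) {
        if (baselineSamples.length == 0) {
            throw new IllegalArgumentException("need at least one baseline sample");
        }
        double sum = 0;
        for (double sample : baselineSamples) {
            sum += sample;
        }
        this.baseline = sum / baselineSamples.length;
        this.factor = factor;
    }

    double baseline() {
        return baseline;
    }

    /** True when the latest one-minute average exceeds factor times the baseline. */
    boolean shouldAlert(double currentAvg) {
        return currentAvg > baseline * factor;
    }

    public static void main(String[] args) {
        // Made-up readTimeAvg readings (milliseconds) collected under typical load.
        BaselineAlert readTime = new BaselineAlert(new double[] {10, 12, 14}, 3.0);
        System.out.println(readTime.shouldAlert(20)); // false
        System.out.println(readTime.shouldAlert(50)); // true
    }
}
```

Because each performance reading covers only the most recent minute, a production check would probably require several consecutive high readings before alerting, to avoid reacting to a single noisy minute.
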
LastQueryLoadSummary
– This structure contains metrics summarizing all the queries and reports within the last day that read from this partition (added 7.0).

Attribute | Description |
---|---|
highPercent | Percentage of the day where the query load was high as a number between 0.0 and 1.0 |
lowPercent | Percentage of the day where the query load was low as a number between 0.0 and 1.0 |
medPercent | Percentage of the day where the query load was medium as a number between 0.0 and 1.0 |
queryAttempted | Total number of queries that were submitted that day |
queryCacheHits | Total number of queries that day that returned cached data |
queryExeAvg | Average response time (in milliseconds) for queries that day |
queryExecuted | Total number of queries that day that were actually executed, including queries that returned cached data |
queryFailed | Total number of queries that day that were submitted but not executed. These are usually caused by syntax errors in the queries. |
queryTimeouts | Total number of queries that day that timed out without completing |
status | Overall query load status for the day. Status is “failed” if query load was ever high, “warnings” if the query load was ever medium, or “success” if the query load was always low. You can configure an alert if the load level is not “success”. You do not need this alert if you have one configured for the “queryLoad” property of the QueryLoadMetrics structure instead. |
startTime / endTime | Timestamps for the start and end of the time period being summarized |

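
To make the roll-up rule for “status” concrete, here is a small sketch that derives a daily status and the load fractions from a list of load readings. It illustrates the rule as described, not LQE's implementation: the class name `DailyLoadSummary` and the enum constants are hypothetical, and treating the day as a list of equally weighted readings is an assumption.

```java
import java.util.List;

final class DailyLoadSummary {
    // Hypothetical names for the three load levels described above.
    enum Load { LOW, MEDIUM, HIGH }

    /** "failed" if load was ever high, "warnings" if ever medium, otherwise "success". */
    static String status(List<Load> readings) {
        if (readings.contains(Load.HIGH)) {
            return "failed";
        }
        if (readings.contains(Load.MEDIUM)) {
            return "warnings";
        }
        return "success";
    }

    /** Fraction of readings at the given level, between 0.0 and 1.0. */
    static double fraction(List<Load> readings, Load level) {
        if (readings.isEmpty()) {
            return 0.0;
        }
        long matching = readings.stream().filter(r -> r == level).count();
        return (double) matching / readings.size();
    }

    public static void main(String[] args) {
        List<Load> day = List.of(Load.LOW, Load.MEDIUM, Load.LOW, Load.LOW);
        System.out.println(status(day));                 // warnings
        System.out.println(fraction(day, Load.MEDIUM));  // 0.25
    }
}
```
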
QueryLoadMetrics
– This structure contains metrics about queries and reports that read from this partition and finished within the last minute (added 7.0).

Attribute | Description |
---|---|
endTime | Timestamp when queries stopped being considered for this minute’s metrics |
queryAttempted | Number of queries that were submitted |
queryCacheHits | Number of queries that returned cached data |
queryExeAvg | Average response time (in milliseconds) for queries |
queryExeMax | Longest response time for a query |
queryExeMin | Shortest response time for a query |
queryExeStdDev | Standard deviation of the query response time |
queryExecuted | Number of queries that were actually executed, including queries that returned cached data |
queryFailed | Number of queries that were submitted but not executed. These are usually caused by syntax errors in the queries. |
queryLoad | Overall query load during this minute. This is determined by comparing the timeout percentage to the timeout thresholds. You can configure an alert if the load level is not “LOW”. This provides very specific information about the timing of the problem, but might cause a very large number of alerts. To get fewer alerts, configure an alert for the “status” property of the LastQueryLoadSummary structure instead. |
queryTimeoutPercentage | Percentage of queries that timed out without completing |
queryTimeouts | Number of queries that timed out without completing |
timeoutThresholdHigh | Percentage of queries that must time out for the query load to be considered high |
timeoutThresholdMed | Percentage of queries that must time out for the query load to be considered medium |

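
The comparison behind “queryLoad” can be sketched as a simple classification of `queryTimeoutPercentage` against `timeoutThresholdMed` and `timeoutThresholdHigh`. This is a hedged illustration: the class name `QueryLoadClassifier` is hypothetical, only the value “LOW” is confirmed by the text above (the other constant names are guesses), and treating a percentage equal to a threshold as reaching that level is an assumption.

```java
final class QueryLoadClassifier {
    // "LOW" appears in the article; MEDIUM and HIGH are assumed names.
    enum Load { LOW, MEDIUM, HIGH }

    /**
     * Classifies one minute of query activity.
     * All three arguments are percentages on the same scale.
     */
    static Load classify(double timeoutPercentage, double thresholdMed, double thresholdHigh) {
        if (timeoutPercentage >= thresholdHigh) {
            return Load.HIGH;
        }
        if (timeoutPercentage >= thresholdMed) {
            return Load.MEDIUM;
        }
        return Load.LOW;
    }

    public static void main(String[] args) {
        // Threshold values below are made up for the example.
        System.out.println(classify(0.0, 5.0, 20.0));  // LOW
        System.out.println(classify(7.5, 5.0, 20.0));  // MEDIUM
        System.out.println(classify(25.0, 5.0, 20.0)); // HIGH
    }
}
```

An external monitor normally does not need to repeat this calculation, since LQE already publishes the result in “queryLoad”; the sketch only shows how the three published numbers relate.
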
Latest LQE maintenance activity results
Object name: com.ibm.team.jis.lqe:type=MaintenanceActivity
This MBean tracks the most recent results from several maintenance activities run in the background to either administer LQE or track the health of the LQE application. A failure from any of these tasks might indicate a problem with the deployment environment for LQE, and should be investigated.
PreviousBackupMetrics
– Metrics related to the last backup run:

Attribute | Description |
---|---|
details | If the backup failed, the error message associated with the failure |
location | Location on the file system where the backup was stored |
nodeId | The ID of the node where the backup was stored |
reason | High-level type of backup failure. If the backup is successful, you see “OTHER” |
size | Size of the backup, as a display string |
status | The overall status of the backup attempt. Configure an alert if it is not “success”. |
startTime / endTime | Time stamps |

PreviousCompactionMetrics
– Metrics related to the last compaction run:

Attribute | Description |
---|---|
details | If the compaction failed, the error message associated with the failure |
newSize | The total size of the indexes after the compaction, as a display string |
nodeId | The ID of the node where the compaction occurred |
oldSize | The total size of the indexes before the compaction, as a display string |
reason | The high-level type of compaction failure. If the compaction is successful, you see “OTHER” |
status | The overall status of the compaction attempt. Configure an alert if it is not “success”. |
startTime / endTime | Time stamps |

PreviousSystemClockMetrics
– Metrics related to the last system clock verification:

Attribute | Description |
---|---|
allowedDifference | Amount of time (in milliseconds) that the system clock can differ from the time returned by the NTP server and still pass the verification |
status | Overall status of the system clock verification. Configure an alert if it is not “success”. |
timeDifference | Amount of time (in milliseconds) the system clock differs from the time returned by the NTP server |
timestamp | Time stamp of when the verification was run |
type | Result type of the verification attempt |

Load level of the system
Object name: com.ibm.team.jis.lqe:type=SystemLoad
This MBean publishes the data LQE collects about the load level the system is experiencing. Use it to monitor whether the server hardware is sufficient for LQE to run quickly and responsively.
Update interval: 1 minute
HeapMemoryUsage
– Current value and thresholds for the percentage of used JVM heap memory on the JVM running the LQE application. To minimize false alerts, the current value for heap memory is the percentage of heap memory used after the most recent garbage collection before data collection, not the amount used at the moment of data collection.
DiskUsage
– Current value and thresholds for the percentage of used disk space on the server

Attribute | Description |
---|---|
criticalThreshold | The critical threshold percentage configured for this load measurement. If the current value exceeds this value, query load shedding may occur. |
value | The current value for this load measurement. Configure an alert if the value exceeds the warning or critical threshold (your preference). You can configure LQE to send email notifications in this case, if the mailing service is enabled. |
warningThreshold | The warning threshold percentage configured for this load measurement. |
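
The threshold comparison for HeapMemoryUsage and DiskUsage could look like the sketch below, which takes the three attributes from the table as plain numbers. The class name `LoadLevel` and the level names are hypothetical, and the sample percentages are made up. "Exceeds" is read here as strictly greater than.

```java
final class LoadLevel {
    enum Level { OK, WARNING, CRITICAL }

    /**
     * Compares the "value" attribute against "warningThreshold" and
     * "criticalThreshold", all expressed as percentages.
     */
    static Level of(double value, double warningThreshold, double criticalThreshold) {
        if (value > criticalThreshold) {
            return Level.CRITICAL;
        }
        if (value > warningThreshold) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    public static void main(String[] args) {
        // Example thresholds only; read the real ones from the MBean.
        System.out.println(of(62.0, 80.0, 90.0)); // OK
        System.out.println(of(85.0, 80.0, 90.0)); // WARNING
        System.out.println(of(93.5, 80.0, 90.0)); // CRITICAL
    }
}
```

Reading the thresholds from the MBean on every poll, rather than copying them into the monitoring tool, keeps the alert aligned with whatever is configured in LQE.
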
These MBeans help you monitor your LQE system to avoid performance issues.
About the author
Stephen Giesbrecht is a software developer working on Jazz Reporting Service. He can be reached at Stephen.Giesbrecht@ca.ibm.com
© Copyright IBM Corporation 2018 – 2019