Tue, 31 Oct 2017

Monitoring the performance of Lifecycle Query Engine using MBeans

Stephen Giesbrecht, Persistent Systems
Last updated: 25 November 2019
Build basis: Jazz Reporting Service 6.0.5, 6.0.6, 6.0.6.1, 7.0

Lifecycle Query Engine (LQE) publishes groups of related metrics as composite data structures, which contain nested data structures rather than only individual values. This article lists the full data structure for each MBean. All of the MBeans are published automatically; you do not have to enable them. Attributes that are worth monitoring closely are noted within the data structures, along with a recommendation for how to configure an alert for that attribute.

It is particularly important to monitor the performance statistics related to accessing the default partition of Lifecycle Query Engine’s index. As the total amount of data stored in Lifecycle Query Engine increases, accessing the index slows down somewhat. However, a significant decrease in speed can affect many of Lifecycle Query Engine’s features and slow down query response time. If there is a significant decrease, Lifecycle Query Engine might be experiencing hardware issues, or need additional system resources to handle the data load.

It is also important to monitor the performance statistics related to fetching artifacts from the data sources. If the response time from data sources is too slow, Lifecycle Query Engine has difficulty keeping up with the changes being made to artifacts, and returns stale data when queried. The response time also has a significant impact on TRS feed reindexing performance because many artifacts must be fetched.

Data source activity metrics

Object name: com.ibm.team.integration.lqe:type=IndexingAgentMetrics,url=<TRS_URL>,node=<NODE_NAME> (Note: "node" was added in 6.0.6)

Each of these MBeans provides performance and activity metrics for the processing of the Tracked Resource Set (TRS) feed in its object name. Use these MBeans to monitor the communication between applications, and watch for network issues preventing Lifecycle Query Engine from updating in a timely manner.
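Because there is one MBean per TRS feed, a monitoring client usually needs to discover them rather than hard-code each name. The following is a minimal sketch, not part of LQE, that uses a standard JMX object name pattern for this purpose. The class and method names are hypothetical, and the sketch assumes that the url value might or might not be quoted in the published name, so it handles both cases.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Sketch: match every per-feed IndexingAgentMetrics MBean with one pattern. */
final class IndexingAgentNames {

    /** Property-list pattern: any url and any node (or no node key, as in 6.0.5). */
    static ObjectName pattern() {
        return name("com.ibm.team.integration.lqe:type=IndexingAgentMetrics,*");
    }

    /** Wraps the checked exception so that callers can stay simple. */
    static ObjectName name(String text) {
        try {
            return new ObjectName(text);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(text, e);
        }
    }

    /** Returns the TRS URL from a matched name, removing quotes if present. */
    static String trsUrl(ObjectName mbean) {
        String raw = mbean.getKeyProperty("url");
        if (raw == null) {
            return null;
        }
        // Assumption: the URL may be published in quoted form because it contains colons.
        return raw.startsWith("\"") ? ObjectName.unquote(raw) : raw;
    }
}
```

With an open MBeanServerConnection, passing the pattern to queryNames(IndexingAgentNames.pattern(), null) returns the set of matching object names, one per feed and node.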

Update interval: Instant. As soon as a task finishes, the metrics are available. While a task is running, metrics update periodically, based on the number of resources processed during the task.

Properties:

ChangeLogMetrics: Metrics related to the change log processing task that is currently running. If no update is currently running, the values might be zero. See LastChangeLogMetrics for details of the fields.
InitialIndexMetrics: Metrics related to loading this TRS feed into the index, when a data source is added for the first time or when it is reindexed.

Nested Attribute Level 1	Nested Attribute Level 2	Description
deletionMetrics		Metrics related to the removal of any existing artifacts from this TRS feed
	deletedCount	Number of removed artifacts
	status	Indicates whether the deletion was successful
	errorInfo	Error message if the deletion failed
	startTime/endTime	Time stamps
baseMetrics		Metrics related to the processing of the base log of the TRS feed
	committed	Number of artifacts committed to the index
	dataSourceBusy	Whether or not the tool providing this TRS feed was busy recreating TRS documents (added 6.0.6)
	errorCount	Number of errors that occurred during processing.
	errorInfo	Error message
	fetchCount	Number of artifacts that were retrieved successfully from the tool providing this TRS feed
	fetchTimeAvg	Average time to retrieve an artifact (in milliseconds)
	fetchTimeMax	Longest time required to retrieve an artifact (in milliseconds)
	fetchTimeMin	Shortest time required to retrieve an artifact (in milliseconds)
	fetchTimeStdDev	Standard deviation of the times required to fetch artifacts
	ignoredCount	Number of artifacts that the tool providing this TRS feed does not want considered for reporting (added 6.0.6.1)
	skippedCount	Number of skipped resources encountered while indexing this TRS feed
	status	Indicates whether or not the base log processing was successful.
	totalResources	Total number of artifacts listed in the TRS base log
	trsFetchCount	Number of TRS base log pages retrieved successfully from the tool providing this TRS feed
	trsFetchTimeAvg	Average time required to retrieve a TRS base page (in milliseconds)
	trsFetchTimeMax	Longest time required to retrieve a TRS base page (in milliseconds)
	trsFetchTimeMin	Shortest time required to retrieve a TRS base page (in milliseconds)
	trsFetchTimeStdDev	Standard deviation of the times required to fetch TRS base pages
	startTime/endTime	Time stamps
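Nested structures such as InitialIndexMetrics arrive over JMX as CompositeData values that contain further CompositeData values. The following sketch shows one way to walk such a structure by a path of keys. The helper class is hypothetical, and the sample data only imitates the shape of the table above; the actual value types that LQE publishes (for example, long or int) are an assumption.

```java
import javax.management.openmbean.CompositeData;
import javax.management.openmbean.CompositeDataSupport;
import javax.management.openmbean.CompositeType;
import javax.management.openmbean.OpenDataException;
import javax.management.openmbean.OpenType;
import javax.management.openmbean.SimpleType;

/** Sketch: walk a nested CompositeData value by a path of keys. */
final class CompositePath {

    /** Returns the value at the given key path, or null if any step is missing. */
    static Object read(CompositeData root, String... path) {
        Object current = root;
        for (String key : path) {
            if (!(current instanceof CompositeData)) {
                return null; // the path goes deeper than the data does
            }
            CompositeData data = (CompositeData) current;
            if (!data.containsKey(key)) {
                return null; // for example, an attribute added in a later version
            }
            current = data.get(key);
        }
        return current;
    }

    /** Builds a two-level sample (made-up values) so the helper runs without a server. */
    static CompositeData sample() {
        try {
            CompositeType inner = new CompositeType("baseMetrics", "sample",
                    new String[] {"committed"}, new String[] {"committed"},
                    new OpenType<?>[] {SimpleType.LONG});
            CompositeData base = new CompositeDataSupport(inner,
                    new String[] {"committed"}, new Object[] {42L});
            CompositeType outer = new CompositeType("InitialIndexMetrics", "sample",
                    new String[] {"baseMetrics"}, new String[] {"baseMetrics"},
                    new OpenType<?>[] {inner});
            return new CompositeDataSupport(outer,
                    new String[] {"baseMetrics"}, new Object[] {base});
        } catch (OpenDataException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(sample(), "baseMetrics", "committed"));
    }
}
```

Returning null for a missing key, instead of letting CompositeData.get throw an exception, keeps the same client usable against versions that do not publish every attribute.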

LastChangeLogMetrics: Metrics related to the last completed change log update task. Contains the following attributes:

Attribute	Description
committed	Number of change log entries committed to the index
deleteCount	Number of artifacts deleted from the index
errorCount	Number of errors that occurred. Configure an alert if the error count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”.
errorInfo	Error message if any errors occurred
failedPatchCount	Number of attempted patches that failed (formerly “rejectedCount” in 6.0.5). Configure an alert if the failed patch count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”.
failedPatchTotal	Total number of failed patches for this data source after this update (added 7.0)
fetchCount	Number of artifacts that were retrieved successfully from the tool providing this TRS feed
fetchTimeAvg	Average time required to retrieve an artifact (in milliseconds). Get a baseline for average fetch times while the applications are under a standard load, and then configure an alert if the average time rises dramatically.
fetchTimeMax	Longest time required to retrieve an artifact (in milliseconds)
fetchTimeMin	Shortest time required to retrieve an artifact (in milliseconds)
fetchTimeStdDev	Standard deviation of the times required to fetch artifacts
ignoredCount	Number of artifacts found that the tool providing this TRS feed does not want considered for reporting (added 6.0.6.1)
patchCount	Number of processed TRS patches
recoveredCount	Number of change events discovered that were previously missing from the TRS feed (added 7.0). Configure an alert if the recovered count is greater than zero. LQE recovered the missing data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool.
skippedCount	Number of skipped resources encountered. Configure an alert if the skipped count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”.
skippedTotal	Total number of skipped resources for this data source after this update (added 6.0.6.1 iFix001)
status	Indicates whether or not this update was successful. You can configure an alert if the status is not “success” instead of configuring specific alerts for each type of problem.
totalEntries	Total number of change log entries seen during this update
trsFetchCount	Number of TRS change log pages retrieved successfully from the tool providing this TRS feed
trsFetchTimeAvg	Average time required to retrieve a TRS change log page (in milliseconds). Get a baseline for average fetch times while the applications are under a standard load, and then configure an alert if the average time rises dramatically.
trsFetchTimeMax	Longest time required to retrieve a TRS change log page (in milliseconds)
trsFetchTimeMin	Shortest time required to retrieve a TRS change log page (in milliseconds)
trsFetchTimeStdDev	Standard deviation of the times required to fetch TRS change log pages
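The alert recommendations in the table above can be sketched as a small check that turns one LastChangeLogMetrics reading into a list of alert messages. This is only an illustration: the class is hypothetical, the counters are assumed to be numeric, and the status comparison is assumed to be case-insensitive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.management.openmbean.CompositeData;

/** Sketch: turn one LastChangeLogMetrics reading into alert messages. */
final class ChangeLogAlerts {

    /** Counters that the table recommends alerting on when greater than zero. */
    private static final String[] COUNTERS =
            {"errorCount", "failedPatchCount", "skippedCount", "recoveredCount"};

    /** Copies the top-level items of a composite value into a plain map. */
    static Map<String, Object> toMap(CompositeData data) {
        Map<String, Object> map = new LinkedHashMap<>();
        for (String key : data.getCompositeType().keySet()) {
            map.put(key, data.get(key));
        }
        return map;
    }

    /** Returns one message per triggered condition; an empty list means healthy. */
    static List<String> evaluate(Map<String, ?> metrics) {
        List<String> alerts = new ArrayList<>();
        Object status = metrics.get("status");
        // Assumption: the status is published as text and compared without regard to case.
        if (status != null && !"success".equalsIgnoreCase(String.valueOf(status))) {
            alerts.add("status=" + status);
        }
        for (String counter : COUNTERS) {
            Object value = metrics.get(counter);
            // Older versions do not publish every counter, so missing ones are skipped.
            if (value instanceof Number && ((Number) value).longValue() > 0) {
                alerts.add(counter + "=" + value);
            }
        }
        return alerts;
    }
}
```

Because the status check overlaps with the error, failed patch, and skipped counters, a deployment that alerts on status alone could trim the counter list to recoveredCount.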

ReplaySkippedTaskMetrics: Metrics related to the last attempt to retry skipped resources from this TRS feed. Contains the following attributes:

Attribute	Description
committed	The number of artifacts committed to the index
errorInfo	Error message if any errors occurred
fetchCount	Number of artifacts that were retrieved successfully from the tool providing this TRS feed
fetchTimeAvg	Average time required to retrieve an artifact (in milliseconds)
fetchTimeMax	Longest time required to retrieve an artifact (in milliseconds)
fetchTimeMin	Shortest time required to retrieve an artifact (in milliseconds)
fetchTimeStdDev	Standard deviation of the times required to fetch artifacts
skippedCount	Number of resources skipped again during processing
status	Indicates whether or not the attempt to retry skipped resources was successful
totalResources	The total number of skipped resources that were retried during this task
startTime/endTime	Time stamps

ValidationMetrics: Metrics related to the most recent validation on this data source. Contains the following attributes (added 6.0.6 iFix003):

Nested Attribute Level 1	Nested Attribute Level 2	Description
deletionMetrics		Metrics related to the cleanup of extra resources
	deletedCount	Number of removed artifacts
	status	Indicates whether the deletion was successful
	errorInfo	Error message if the deletion failed
	startTime/endTime	Time stamps
baseMetrics		Metrics related to validating the content from the data source
	committed	Number of artifacts committed to the index
	dataSourceBusy	Whether or not the tool providing this TRS feed was busy recreating TRS documents
	errorCount	Number of errors that occurred during processing. Configure an alert if the error count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure one of either this attribute or “status”.
	errorInfo	Error message
	fetchCount	Number of artifacts that were retrieved successfully from the tool providing this TRS feed
	fetchTimeAvg	Average time to retrieve an artifact (in milliseconds)
	fetchTimeMax	Longest time required to retrieve an artifact (in milliseconds)
	fetchTimeMin	Shortest time required to retrieve an artifact (in milliseconds)
	fetchTimeStdDev	Standard deviation of the times required to fetch artifacts
	ignoredCount	Number of artifacts that the tool providing this TRS feed does not want considered for reporting
	missingResourcesCount	Number of artifacts added during validation that were previously missing from the index. Configure an alert if the missing resource count is greater than zero. LQE recovered the missing data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool.
	outdatedResourcesCount	Number of artifacts updated during validation that previously had stale data in the index. Configure an alert if the outdated resource count is greater than zero. LQE updated the stale data, but this indicates a problem with the tool publishing the TRS feed. Contact the support team for that tool.
	skippedCount	Number of skipped resources encountered while indexing this TRS feed. Configure an alert if the skipped count is greater than zero. This indicates a possible issue with report data integrity. You only need to configure an alert for either this attribute or “status”.
	status	Indicates whether or not the validation was successful. An alert can be configured if status is not “success” instead of configuring alerts for “errorCount” and “skippedCount”.
	totalResources	Total number of artifacts listed in the TRS feed
	trsFetchCount	Number of TRS base log pages retrieved successfully from the tool providing this TRS feed
	trsFetchTimeAvg	Average time required to retrieve a TRS base page (in milliseconds)
	trsFetchTimeMax	Longest time required to retrieve a TRS base page (in milliseconds)
	trsFetchTimeMin	Shortest time required to retrieve a TRS base page (in milliseconds)
	trsFetchTimeStdDev	Standard deviation of the times required to fetch TRS base pages
	startTime/endTime	Time stamps

Accessing LQE index partitions

Object name: com.ibm.team.jis.lqe:type=DatasetMetrics,name="default" OR "shapes" OR "history" OR "version" (Note "version" added in 6.0.6 iFix003)

Each of these MBeans provides metrics about accessing the index partition in its object name. There are four in total, one for each of the four automatically created partitions. Use them to monitor for performance problems stemming from disk I/O. The most important partition to monitor is the default partition. Most of the data is contained in this partition, and it is accessed the most frequently.

Update interval: 1 minute
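As a rough sketch of how a monitoring client might read one field of a partition's metrics over JMX, the helper below fetches a composite attribute and returns a single item from it. The class is hypothetical. It assumes that a connection to the LQE server's MBean server has already been opened (how to do that depends on the application server and is not shown), and that the structure is exposed as an MBean attribute that returns CompositeData.

```java
import java.io.IOException;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

/** Sketch: read one field of a composite MBean attribute. */
final class CompositeReader {

    /** Returns the named field of a composite attribute, for example readTimeAvg. */
    static Object readField(MBeanServerConnection connection, String objectName,
                            String attribute, String field) {
        try {
            Object value = connection.getAttribute(new ObjectName(objectName), attribute);
            // Assumption: the attribute is published as CompositeData.
            if (!(value instanceof CompositeData)) {
                throw new IllegalStateException(attribute + " is not composite data");
            }
            return ((CompositeData) value).get(field);
        } catch (JMException | IOException e) {
            throw new IllegalStateException("Cannot read " + objectName, e);
        }
    }
}
```

For the default partition, a call might look like readField(connection, "com.ibm.team.jis.lqe:type=DatasetMetrics,name=\"default\"", "DatasetMetrics", "readTimeAvg"). Because the values update once a minute, polling more often than that returns the same reading.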

DatasetMetrics – This structure contains metrics about reading and writing data to this partition. There are two distinct types of measurements, separated into the two tables below:

Performance data

LQE tracks and graphs several of these fields over time, so each reading contains data from only the most recent minute. After the data is collected, all of these statistics are reset and collection for the next minute begins.
You can configure an alert for any of the average values listed here. Get a baseline reading while the applications are under typical load, and configure an alert if the average time rises dramatically. A dramatic increase in any of the average values indicates a bottleneck in data processing.

Attribute	Description
graphProcessingTimeAvg	Average time required to process the data read from a TRS feed before writing it to the index (in milliseconds). Use for alerts.
graphProcessingTimeStdDev	Standard deviation of the time required to process data read from TRS feeds before writing it to the index
luceneWriteTimeAvg	Average time required to write to the Lucene text search index associated with this LQE index (in milliseconds). Use for alerts.
luceneWriteTimeStdDev	Standard deviation of the time required to write to the Lucene text search index associated with this LQE index
queryTimeAvg	Average time required to return the results of an incoming query (in milliseconds). Use for alerts.
queryTimeStdDev	Standard deviation of the time required to return the results of an incoming query
queryWaitTimeAvg	Average time between receiving an incoming query and being able to access the index to process it (in milliseconds). Use for alerts.
queryWaitTimeStdDev	Standard deviation of the wait time to process queries
readCount	Number of read transactions processed
readTimeAvg	Average time in milliseconds between the start and end of a read transaction. Use for alerts.
readTimeStdDev	Standard deviation of the duration of read transactions
readWaitTimeAvg	Average time between starting a read transaction and being able to access the index to process it (in milliseconds). Use for alerts.
readWaitTimeStdDev	Standard deviation of the wait time to process read transactions
tdbSyncTimeAvg	Average time (in milliseconds) required to commit a write transaction to the index. Use for alerts.
tdbSyncTimeStdDev	Standard deviation of the time required to commit write transactions to the index
writeCount	Number of processed write transactions
writeTimeAvg	Average time between the start and the end of a write transaction (in milliseconds). Use for alerts.
writeTimeStdDev	Standard deviation of the duration of write transactions
writeWaitTimeAvg	Average time between starting a write transaction and being able to access the index to process it (in milliseconds). Use for alerts.
writeWaitTimeStdDev	Standard deviation of the wait time to process write transactions
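The article does not define how large a "dramatic" rise is, so any threshold is a local decision. One possible approach, sketched below, keeps a slowly moving baseline of per-minute averages and flags a reading that exceeds the baseline by a chosen factor. The exponential moving average, the factor, and the smoothing weight are all assumptions of this sketch, not LQE behavior.

```java
/** Sketch: flag a reading that rises far above a slowly moving baseline. */
final class BaselineAlert {
    private final double factor; // how many times the baseline counts as "dramatic"
    private final double alpha;  // weight given to each new reading, between 0 and 1
    private double baseline = Double.NaN;

    BaselineAlert(double factor, double alpha) {
        this.factor = factor;
        this.alpha = alpha;
    }

    /** Feeds one per-minute average; returns true if it should raise an alert. */
    boolean update(double reading) {
        if (Double.isNaN(baseline)) {
            baseline = reading; // the first reading seeds the baseline
            return false;
        }
        boolean alert = baseline > 0 && reading > baseline * factor;
        if (!alert) {
            // Only normal readings move the baseline, so a spike cannot hide itself.
            baseline = alpha * reading + (1 - alpha) * baseline;
        }
        return alert;
    }

    /** Current baseline, mainly useful for logging. */
    double baseline() {
        return baseline;
    }
}
```

One instance per attribute and partition, for example one for readTimeAvg on the default partition, keeps the baselines independent. The baseline should be seeded while the applications are under typical load.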

Non-performance data

These attributes either track totals since the LQE server was started, or contain the value for the instant they were polled.

Attribute	Description
abortedWriters	Number of write transactions aborted since the server started
activeReaders	Number of read transactions that are currently open
activeWriters	Number of write transactions that are currently open
committedWriters	Number of write transactions committed since the server started
concurrentQueries	Number of currently running queries
finishedReaders	Number of read transactions that have completed
fsSize	Size of this LQE partition on disk (in bytes)
graphCount	Number of different graphs stored in this partition
heapSuspensions	Number of times since the server started that LQE tried to suspend activity for this partition because of high JVM heap usage
location	File system location where this partition is stored
maxConcurrentQueries	Largest number of queries that have been running simultaneously since the server started
mode	Type of file I/O used by this partition, either “mapped” or “direct”, depending on whether memory mapped I/O is enabled
queryCount	Number of queries run against this partition since the server started
queryFailureCount	Number of queries against this partition that failed since the server started
queuedCommits	Number of committed write transactions currently waiting to be written to this partition on disk
stackSuspensions	Number of times since the server started that LQE tried to suspend activity for this partition because of a high number of queued commits
suspensionCompletions	Number of times since the server started that activity for this partition was successfully suspended
suspensionErrors	Number of times since the server started that an error occurred while trying to suspend activity for this partition
suspensionPending	Whether or not LQE is currently trying to suspend activity for this partition
suspensionTimeouts	Number of times since the server started that LQE failed to suspend activity for this partition because existing transactions did not complete in time
tripleCount	Number of individual pieces of data currently stored in this partition
version	Version number of the Jena TDB component used to implement this partition
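Several of these attributes, such as queryFailureCount and abortedWriters, are totals since the server started, so a monitoring client typically cares about how much they changed between polls. The sketch below, with a hypothetical class name, computes that change and treats a counter that goes backwards as a sign that the server restarted.

```java
/** Sketch: per-interval change of a counter that restarts at zero with the server. */
final class CounterDelta {
    private long previous = -1; // -1 means that no reading has been taken yet

    /** Returns how much the counter grew since the last poll. */
    long update(long current) {
        long delta;
        if (previous < 0) {
            delta = 0;       // first poll: nothing to compare against
        } else if (current < previous) {
            delta = current; // counter went backwards: assume a server restart
        } else {
            delta = current - previous;
        }
        previous = current;
        return delta;
    }
}
```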

LastQueryLoadSummary – This structure contains metrics summarizing all the queries and reports within the last day that read from this partition (added 7.0).

Attribute	Description
highPercent	Fraction of the day during which the query load was high, as a number between 0.0 and 1.0
lowPercent	Fraction of the day during which the query load was low, as a number between 0.0 and 1.0
medPercent	Fraction of the day during which the query load was medium, as a number between 0.0 and 1.0
queryAttempted	Total number of queries that were submitted that day
queryCacheHits	Total number of queries that day that returned cached data
queryExeAvg	Average response time (in milliseconds) for queries that day
queryExecuted	Total number of queries that day that were actually executed, including queries that returned cached data
queryFailed	Total number of queries that day that were submitted but not executed. These are usually caused by syntax errors in the queries
queryTimeouts	Total number of queries that day that timed out without completing
status	Overall query load status for the day. Status is “failed” if the query load was ever high, “warnings” if the query load was ever medium, or “success” if the query load was always low. You can configure an alert if the status is not “success”. You do not need this alert if you have one configured for the “queryLoad” property of the QueryLoadMetrics structure instead.
startTime / endTime	Timestamps for the start and end of the time period being summarized

QueryLoadMetrics – This structure contains metrics about queries and reports that read from this partition and finished within the last minute (added 7.0).

Attribute	Description
endTime	Timestamp when queries stopped being considered for this minute’s metrics
queryAttempted	Number of queries that were submitted
queryCacheHits	Number of queries that returned cached data
queryExeAvg	Average response time (in milliseconds) for queries
queryExeMax	Longest response time for a query
queryExeMin	Shortest response time for a query
queryExeStdDev	Standard deviation of the query response time
queryExecuted	Number of queries that were actually executed, including queries that returned cached data
queryFailed	Number of queries that were submitted but not executed. These are usually caused by syntax errors in the queries
queryLoad	Overall query load during this minute. This is determined by comparing the timeout percentage to the timeout thresholds. You can configure an alert if the load level is not “LOW”. This provides very specific information about the timing of the problem, but might cause a very large number of alerts. To get fewer alerts, configure an alert for the “status” property of the LastQueryLoadSummary structure instead.
queryTimeoutPercentage	Percentage of queries that timed out without completing
queryTimeouts	Number of queries that timed out without completing
timeoutThresholdHigh	Percentage of queries that must time out for the query load to be considered high
timeoutThresholdMed	Percentage of queries that must time out for the query load to be considered medium

Latest LQE maintenance activity results

Object name: com.ibm.team.jis.lqe:type=MaintenanceActivity

This MBean tracks the most recent results from several maintenance activities run in the background to either administer LQE or track the health of the LQE application. A failure from any of these tasks might indicate a problem with the deployment environment for LQE, and should be investigated.

Update interval: Instant or 15 minutes. Anything that occurs on the node being monitored will be reflected instantly. Records of backups made on other nodes update every 15 minutes.

PreviousBackupMetrics – Metrics related to the last backup run:

Attribute	Description
details	If the backup failed, the error message associated with the failure
location	Location on the file system where the backup was stored
nodeId	The ID of the node where the backup was stored
reason	High-level type of backup failure. If the backup is successful, you see “OTHER”
size	Size of the backup, as a display string
status	The overall status of the backup attempt. Configure an alert if it is not “success”.
startTime / endTime	Time stamps

PreviousCompactionMetrics – Metrics related to the last compaction run:

Attribute	Description
details	If the compaction failed, the error message associated with the failure
newSize	The total size of the indexes after the compaction, as a display string
nodeId	The ID of the node where the compaction occurred
oldSize	The total size of the indexes before the compaction, as a display string
reason	The high-level type of compaction failure. If the compaction is successful, you see “OTHER”
status	The overall status of the compaction attempt. Configure an alert if it is not “success”.
startTime / endTime	Time stamps

PreviousSystemClockMetrics – Metrics related to the last system clock verification:

Attribute	Description
allowedDifference	Amount of time (in milliseconds) that the system clock can differ from the time returned by the NTP server and still pass the verification
status	Overall status of the system clock verification. Configure an alert if it is not “success”.
timeDifference	Amount of time (in milliseconds) the system clock differs from the time returned by the NTP server
timestamp	Time stamp of when the verification was run
type	Result type of the verification attempt

Load level of the system

Object name: com.ibm.team.jis.lqe:type=SystemLoad

This MBean publishes the data LQE collects about the load level the system is experiencing. Use it to monitor whether the server hardware is sufficient for LQE to run quickly and responsively.

Update interval: 1 minute

HeapMemoryUsage – Current value and thresholds for the percentage of used JVM heap memory on the JVM running the LQE application. To minimize false alerts, the current value for heap memory is the percentage of heap memory used after the most recent garbage collection before data collection, not the amount used at the moment of data collection.
DiskUsage – Current value and thresholds for the percentage of used disk space on the server

Each of these attributes contains the following fields:

Attribute	Description
criticalThreshold	The critical threshold percentage configured for this load measurement. If the current value exceeds this value, query load shedding may occur.
value	The current value for this load measurement. Configure an alert if the value exceeds the warning or critical threshold (your preference). You can configure LQE to send email notifications in this case, if the mailing service is enabled.
warningThreshold	The warning threshold percentage configured for this load measurement.
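The comparison of value against the two thresholds can be sketched as follows. The class and enum names are hypothetical, and the sketch assumes that all three fields are numeric percentages on the same scale. "Exceeds" is read here as strictly greater than.

```java
import javax.management.openmbean.CompositeData;

/** Sketch: classify a HeapMemoryUsage or DiskUsage reading against its thresholds. */
final class LoadLevel {

    enum Level { OK, WARNING, CRITICAL }

    /** A value must exceed a threshold, not merely equal it, to enter that level. */
    static Level classify(double value, double warningThreshold, double criticalThreshold) {
        if (value > criticalThreshold) {
            return Level.CRITICAL;
        }
        if (value > warningThreshold) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    /** Convenience overload for a composite value read from the SystemLoad MBean. */
    static Level classify(CompositeData usage) {
        return classify(number(usage, "value"),
                number(usage, "warningThreshold"),
                number(usage, "criticalThreshold"));
    }

    /** Assumption: the fields are published as some numeric type. */
    private static double number(CompositeData data, String key) {
        return ((Number) data.get(key)).doubleValue();
    }
}
```

Alerting at WARNING gives earlier notice, while alerting only at CRITICAL matches the point at which query load shedding may occur.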

These MBeans help you monitor your LQE system to avoid performance issues.

About the author

Stephen Giesbrecht is a software developer working on Jazz Reporting Service. He can be reached at Stephen.Giesbrecht@ca.ibm.com
