Monitoring your Engineering Workflow Management cluster
This article describes how to monitor six key aspects of your clustered Engineering Workflow Management (EWM, formerly Rational Team Concert) environment.
Application monitoring
As described in CLM Monitoring, you should monitor these areas for each Jazz-based application:
- Active services summary
- Resource usage (JDBC and RDB mediators)
- Database metrics
- WebSphere Liberty application server and Java virtual machine
- Application diagnostics
- Resource-intensive scenarios
In a clustered environment, you monitor the same areas, but for each node. In addition to those areas, you also monitor these items:
- Cluster member data metrics
- MQTT service metrics
- Distributed Cache Microservice online status
- Distributed cache metrics
- MQTT broker subscription metrics
- MQTT broker resource metrics
Engineering Workflow Management cluster overview
In a clustered environment, three applications work with Engineering Workflow Management to keep the application nodes synchronized.

Load-balancing proxy: The application must be fronted by a load-balancing proxy that supports session affinity, which redirects each client to the specific node that it authenticated with. This article uses HAProxy for this purpose, but IBM HTTP Server is also supported. For any new client that accesses the application, the load-balancing proxy selects an application node based on its load-balancing algorithm, such as round robin or least connection (for details, see the proxy documentation), and then directs all subsequent traffic from the same client to that node.
MQTT broker: The back-end clustered application uses an MQTT client to exchange messages between nodes. These messages keep all nodes synchronized. IBM IoT MessageSight is the MQTT broker that manages the messages between application nodes. This scalable MQTT broker has high levels of message throughput.
Distributed Cache Microservice: The final component in the implementation is an application called the Distributed Cache Microservice (DCM). This component is implemented as a stand-alone program (executable JAR file) that runs as a process and enables application code from multiple nodes to share data. This distributed data enables applications to minimize access to the database, which can be costly in environments with a high user load.
Engineering Workflow Management cluster topology
A clustered application such as Engineering Workflow Management must be installed on multiple servers that are connected by an MQTT broker, which synchronizes the nodes. A load balancer provides the front-end URL that accepts connections and distributes requests to the back-end nodes; the host name of the load balancer is used as the public URL for the application during setup. The Distributed Cache Microservice provides data-sharing capabilities for the application running on multiple cluster nodes. For details, see Setting up a Change and Configuration Management application clustered environment version 6.0.5 and later.
The following image shows an Engineering Workflow Management cluster:
What to monitor
For clustered environments, in addition to checking that the application server is running on each node, you also monitor IBM IoT MessageSight, HAProxy, and the Distributed Cache Microservice. If one of the key applications fails, you might not be able to access other components. It is important to monitor the URLs (effectively a heartbeat) for each application in the cluster.
| Application | URL | Comments |
| --- | --- | --- |
| Engineering Workflow Management (Nodes) | https://specific-node-host:port/ccm/scr | Access the root services document for each node and check for success, for example by using curl (see the sketch after this table). |
| Cluster Node Status using HAProxy | http://haproxy.server:1936/ | Access to the HAProxy console is configured in the "listen statistics" section of the HAProxy configuration file. You can view the statistics in a browser and monitor the UP/DOWN status of each node. Example that checks whether the ccm2 node is up: `curl -k --user username:password http://haproxy.server:1936/ \| grep -o "ccm2.*UP" \| wc -l` Example that checks whether the ccm2 node is down: `curl -k --user username:password http://haproxy.server:1936/ \| grep -o "ccm2.*DOWN" \| wc -l` Both calls return either 0 or 1, depending on whether the requested status is found. |
| HAProxy | http://haproxy.server:1936/ | If the preceding checks succeed, HAProxy is running. If the HAProxy process is not running, curl cannot connect to it and exits with code 7. For details, see the curl documentation. |
| IBM IoT MessageSight | https://messagesight_host.com:9089/ima/v1/service/status/Server | IBM IoT MessageSight provides REST APIs that you can use to query server status. Example: `curl -k -X GET https://messagesight_host:9089/ima/v1/service/status/Server \| grep "Status.*Running"` |
| Distributed Cache Microservice | https://dcm-host.com:10001/dcm/ | Access the Distributed Cache Microservice by using the same URL that is configured in your clustered application. Example: `curl --user username:password -s -I -k -X GET https://dcm-host-name.com:10001/dcm/ \| grep "200 OK"` |
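The checks in this table can be combined into a single heartbeat script that a scheduler such as cron runs at a short interval. The following is a minimal sketch: the host names, ports, and credentials are placeholders for your own topology, and the expected response patterns follow the examples in the table.

```bash
#!/bin/bash
# Minimal cluster heartbeat sketch. Host names, ports, and credentials are
# placeholders; replace them with the values from your own topology.
CREDENTIALS="username:password"

check() {
  local name="$1" cmd="$2"
  if eval "$cmd" > /dev/null 2>&1; then
    echo "OK    $name"
  else
    echo "ALERT $name"    # wire this line into your alerting tool
  fi
}

# EWM nodes: the root services document should be reachable on every node.
for node in https://ccm1.example.com:9443 https://ccm2.example.com:9443; do
  check "EWM node $node" "curl -k -s -f $node/ccm/scr"
done

# HAProxy statistics page: also confirms that the ccm2 back end is reported UP.
check "HAProxy reports ccm2 UP" \
  "curl -k -s --user $CREDENTIALS http://haproxy.example.com:1936/ | grep -q 'ccm2.*UP'"

# IBM IoT MessageSight server status.
check "IBM IoT MessageSight" \
  "curl -k -s https://messagesight.example.com:9089/ima/v1/service/status/Server | grep -q 'Status.*Running'"

# Distributed Cache Microservice heartbeat.
check "Distributed Cache Microservice" \
  "curl -k -s -I --user $CREDENTIALS https://dcm.example.com:10001/dcm/ | grep -q '200 OK'"
```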
Diagnostics poll the heartbeat of each node and application every 70 minutes, which can leave your system down for a long time before you are alerted. Shortening the diagnostics interval is not recommended, because running diagnostics is a resource-intensive operation.
The Engineering Workflow Management cluster publishes some important statistics about its operation and performance, as well as a subset of data from other components (with the exception of HAProxy, which currently has no additional instrumentation).
The collection and publishing of this data is turned off by default. When this option is enabled in the advanced properties, the data is published as managed beans.
Applications in the clustered system fall into two categories: shared and individual. Shared applications can service one or more nodes or clusters (examples include HAProxy and IBM IoT MessageSight), while individual applications are unique (unshared) pieces of a specific cluster (such as the Engineering Workflow Management nodes).
HAProxy and IBM IoT MessageSight can be shared, but that is not an absolute requirement. The decision to share depends largely on the load you are trying to service. If you have a large number of users, sharing might not scale. At this time, sharing the Distributed Cache Microservice is not supported.
When you monitor shared applications, understand whether you are collecting shared or individual data. Sometimes you require both; at other times you are troubleshooting a specific node and require node-specific data. To help identify the node that data comes from, a common attribute named nodeId is available.
IBM IoT MessageSight has both individual application data, such as node-specific subscription health, and shared data, such as resource utilization.
Engineering Workflow Management also contains individual (node-specific) metrics and shared (cluster-specific) metrics. For details, see JMX MBeans for ELM Application Monitoring.
Managed beans to monitor
In a clustered environment, six managed beans are important to monitor in addition to the six described in CLM Monitoring.
Note: Though the following sections name the MBeans to monitor and give some details on their object names, collection frequency, attributes, and how to enable them, the definitive source of truth is the JMX MBean reference documentation, particularly the v7.0.1 Common application MBean reference guide.
1. Cluster member data metrics
This bean monitors the state of each node.
| Attribute | Description | Threshold |
| --- | --- | --- |
| nodeState | The state of the node. Values are INACTIVE=0, STALE=1, and ACTIVE=2. | Warn on 1; alert on 0 |
| nodeURL | The URL for the specified node ID. | |
| nodeId | The application node ID when the application is clustered. | |
To enable the Cluster Metrics MBean, on the serviceability page, set the Enable Cluster Metrics MBean property of CommonMetricsCollectorTask to true.
The task that collects this data runs once every 60 minutes by default. This bean will only report the state if users are actively using the node before the data collection task runs.
The node becomes stale when it does not receive any traffic (incoming requests) for more than 30 seconds. MQTT message broker messages do not cause updates to a node’s active status, only HTTP requests from clients or other nodes do.
The Cluster Nodes page in the Web Admin UI shows the table with all the application nodes. This bean provides the same information. If a node becomes inactive, you must remove its entry from the table by using the Admin UI. This removal is important because it signals the remaining nodes to reschedule background tasks that would be executed by the node that is not running.
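How you read the attributes depends on your JMX monitoring tool, but the alerting logic implied by the thresholds above is simple. The following sketch assumes the nodeState and nodeURL values are supplied by whatever collector you use (the script arguments are placeholders) and maps the state codes to severities.

```bash
#!/bin/bash
# Sketch: map the nodeState codes from the Cluster Metrics MBean to monitoring
# severities. The state and URL arguments are placeholders supplied by your
# JMX collector.
#   usage: node-state-check.sh <nodeState> <nodeURL>
state=$1; url=$2

case "$state" in
  2) echo "OK    $url is ACTIVE" ;;
  1) echo "WARN  $url is STALE" ;;
  0) echo "ALERT $url is INACTIVE" ;;
  *) echo "WARN  $url reported an unknown state: '$state'" ;;
esac
```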
2. MQTT service metrics
This bean provides specific information about the MQTT client that each node uses to communicate with IBM IoT MessageSight. This is not a comprehensive list of the managed beans available from IBM IoT MessageSight; it includes just the beans recommended for monitoring and alerting.
| Attribute | Description | Threshold |
| --- | --- | --- |
| Message Sent Result – Success Ratio | Ratio of messages that were published successfully to messages that were lost. | Formula: Successful Messages / (Successful Messages + Lost Messages) |
| Message Handling – Thread Pool Usage Percentage | Percentage of active threads handling incoming messages versus the configured thread pool size. | Formula: Active Threads / Thread Pool Size |
| nodeId | Application node ID when the application is clustered. | |
To enable the MQTT Service Metrics MBean, on the serviceability page, set the MQTT Service Metrics MBean property under NodeMetricsTask to true.
The MQTT service provides many different counters about its performance and states, but this information is summarized as just two values. The values provide the overall picture about sending and processing MQTT messages. In a well-tuned, well-functioning system, the success ratio should be 100%.
The number of threads should never grow uncontrollably. The default maximum is 200 threads, so warn at 80% (160 threads) and alert at 90% (180 threads). If the node continues to run out of threads, an administrator can increase the default maximum.
The success indicator for the send rate can signal problems with the message broker or issues with network connectivity. While the MQTT service can reconnect and resend messages, it is important to understand why this metric shows a degradation.
The thread pool used by the MQTT service adjusts its size based on the volume of incoming messages: under heavy traffic it temporarily grows beyond the configured size, and as the volume drops it returns to the default. A thread pool value above the configured maximum therefore indicates that the node is having difficulty processing or managing the volume of incoming messages, and a value that stays above the maximum points to a problem or imbalance in the system.
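The two summary values map directly onto the formulas and thresholds above. The following sketch assumes the raw counters (successful messages, lost messages, active threads) are supplied as arguments by your monitoring tool; the 200-thread maximum is the default described above.

```bash
#!/bin/bash
# Sketch: evaluate the MQTT Service Metrics summary values against the
# thresholds described above. The raw counters (placeholder arguments) are
# whatever your monitoring tool reads from the MBean on each node.
#   usage: mqtt-service-check.sh <successful_msgs> <lost_msgs> <active_threads>
successful=$1; lost=$2; active=$3
pool_max=200   # default maximum thread pool size

# Success ratio = Successful / (Successful + Lost); it should stay at 100%.
ratio=$(awk -v s="$successful" -v l="$lost" \
  'BEGIN { printf "%.1f", (s + l) ? 100 * s / (s + l) : 100 }')
echo "MQTT send success ratio: ${ratio}%"
awk -v r="$ratio" 'BEGIN { exit !(r < 100) }' && echo "WARN: some MQTT messages are being lost"

# Thread pool usage = Active Threads / pool size; warn at 80%, alert at 90%.
usage=$(awk -v a="$active" -v m="$pool_max" 'BEGIN { printf "%d", 100 * a / m }')
echo "MQTT thread pool usage: ${usage}%"
if   [ "$usage" -ge 90 ]; then echo "ALERT: thread pool usage is above 90% (180 threads)"
elif [ "$usage" -ge 80 ]; then echo "WARN: thread pool usage is above 80% (160 threads)"
fi
```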
3. Distributed Cache Microservice online status
This bean monitors the number of times the Distributed Cache Microservice is unable to service a request. The microservice is a critical part of the clustered ecosystem and must be online.
| Facet | Attribute | Description | Threshold |
| --- | --- | --- | --- |
| Downtime Counts | totalOverInterval | Number of times the connection to the Distributed Cache Microservice was lost over the specified interval. | |
To enable the Distributed Cache Microservice online status MBean, on the serviceability page, set Distributed Cache Microservice online status MBean property under HighFrequencyMetricsClusterScopedTask to true.
This metric is collected and reported by the REST client that communicates with the microservice. When a connection cannot be established or fails, a flag is raised, and this bean reports the value of that flag. The flag remains raised for as long as the microservice is inaccessible.
Clustered applications can tolerate short disruptions in connectivity with the Distributed Cache Microservice, but eventually stop retrying and produce errors. Application threads are blocked until the microservice is back online, and the client/server connection can time out during this period, so it is important to monitor this metric. The Engineering Workflow Management Active Services MBean count increases when services are waiting for the Distributed Cache Microservice.
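The table above does not prescribe a threshold for the downtime count. Because the microservice must stay online, a reasonable starting point, and it is only an assumption rather than a documented threshold, is to alert whenever the count for an interval is nonzero, as in this sketch.

```bash
#!/bin/bash
# Sketch: alert when the connection to the Distributed Cache Microservice was
# lost at all during the last interval. totalOverInterval is read from the
# Distributed Cache Microservice online status MBean by your monitoring tool;
# alerting on any nonzero value is an assumption, not a documented threshold.
#   usage: dcm-downtime-check.sh <totalOverInterval>
total_over_interval=${1:-0}

if [ "$total_over_interval" -gt 0 ]; then
  echo "ALERT: connection to the Distributed Cache Microservice was lost $total_over_interval time(s) in the last interval"
fi
```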
4. Distributed cache metrics
In addition to the downtime counts, monitor the following data as part of the Distributed Cache Metrics bean.
| Attribute | Description | Threshold |
| --- | --- | --- |
| Percent of JVM Memory Used | Percentage of the JVM memory used by the DCM process. The JVM memory size is configured in the DCM startup script. | Warn > 80%; alert > 90% |
| CPU | Percentage of CPU used by the DCM process. | Warn > 80%; alert > 90% |
| contextRoot | The application root context for the CLM application. | |
| domain | The namespace for the application under which the MBean data is published. | |
| host | The host name where the CLM application is running. | |
| intervalDuration | The collection time interval, in seconds. | |
| mbeanCreationTimestamp | The time when the MBean was updated with a snapshot of the relevant data. | |
| nodeId | The application node ID when the CLM application is clustered. | |
| Not enough threads | Jetty server indicator that the thread pool is running low on threads. | |
| Percent of Server Thread Pool Used | Percentage of the Jetty server thread pool used by the DCM process. | Warn > 80%; alert > 90% |
| Response Size | Average size (in bytes) of all responses received from the DCM during the interval. | Watch for abnormal growth; alert on a 10%–15% deviation above baseline |
| Request Size | Average size (in bytes) of all requests made to the DCM during the interval. | Watch for abnormal growth; alert on a 10%–15% deviation above baseline |
| Elapsed Time | Average time (in milliseconds) taken to serve all requests received during the interval. | Watch for spikes or abnormal growth; alert on a 10%–15% deviation above baseline |
| Transaction Rate | Number of transactions handled by the DCM during the interval. | Watch for abnormal growth; alert on a 10%–15% deviation above baseline |
| Transaction Count | Total number of transactions handled by the DCM. | Watch for abnormal growth; alert on a 10%–15% deviation above baseline |
To enable the Distributed Cache Metrics MBean, on the serviceability page, set the Distributed Cache Metrics MBean property under ClusterMetricsTask to true.
Deviations in transaction rates might require further tuning of the Distributed Cache Microservice in terms of memory and CPU if the usage patterns are legitimate.
The data published by the Distributed Cache Metrics MBean comes from the Distributed Cache Microservice. Publishing is disabled by default. To enable it, modify the Distributed Cache Microservice configuration file as shown in the example below, save the file, and restart the microservice. Clustered applications can tolerate short connectivity disruptions with the Distributed Cache Microservice, so you can restart the microservice without restarting the rest of the cluster.
Example (modified configuration values in bold):
The CPU usage data might not be available on all operating systems.
The default number of threads is between 8 and 256. You can change this value by modifying the microservice properties in its configuration file.
Because the microservice is used heavily, it is normal to see high numbers in transaction rate and count. It is important for the “Elapsed Time” to stay low to ensure system responsiveness.
The request and response sizes, transaction rate, and elapsed times are generally for information purposes only, but you can also use them to detect some abnormal behavior, such as unexpected changes in usage patterns. Baselines for these numbers should be created in production and measured against to detect deviations.
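The baseline comparison behind the 10%–15% thresholds is simple arithmetic. The following sketch computes the deviation of a current value from its production baseline and flags anything more than 10% above it; the metric name, baseline, and current value are placeholders supplied by your monitoring tool.

```bash
#!/bin/bash
# Sketch: compare a current metric value (for example, Elapsed Time or
# Transaction Rate from the Distributed Cache Metrics MBean) against a
# production baseline and flag deviations above 10%. The metric name,
# baseline, and current value are placeholder arguments.
#   usage: dcm-baseline-check.sh <metric-name> <baseline> <current>
metric=$1; baseline=$2; current=$3

deviation=$(awk -v b="$baseline" -v c="$current" \
  'BEGIN { printf "%.1f", 100 * (c - b) / b }')

echo "$metric: baseline=$baseline current=$current deviation=${deviation}%"
if awk -v d="$deviation" 'BEGIN { exit !(d > 10) }'; then
  echo "ALERT: $metric is more than 10% above its baseline"
fi
```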
5. MQTT broker subscription metrics
Note: This managed bean was introduced in version 6.0.6.
The subscriptions monitoring bean reports subscriptions where buffering occurs. The number of subscriptions to monitor is configurable. Multiple clustered applications can use a single MQTT broker, and the information contains enough details to identify which cluster (application instance) each subscription belongs to.
| Attribute | Description | Threshold |
| --- | --- | --- |
| subscriptionName | Name of the message broker subscription. | |
| clusterName | Name of the cluster in which the application is running. | |
| bufferedMessages | Number of published messages that are waiting to be sent to clients. | Warn > 10%; alert > 25% |
| bufferedPercent | Percentage of the maximum buffered messages that the current buffered messages represent. | Warn > 1%; alert > 50% |
| rejectedMessages | Number of messages that were rejected because the maximum number of buffered messages was reached. | |
| discardedMessages | Number of messages that were not delivered because they were discarded when the buffer became full. | |
| intervalDuration | Collection interval, in seconds. | |
| nodeId | Application node ID when the application is clustered. | |
To enable the MQTT Broker Metrics MBean (MQTTBrokerSubscriptionMetrics), on the serviceability page, set the Enable MQTT Broker Metrics MBean property under HighFrequencyMetricsNodeScopedTask to true. In addition, on the advanced properties page, set Enable MQTT broker statistics under com.ibm.team.repository.service.mqtt.statistics.MQTTStatsService to true.
If no subscriptions are experiencing buffering, the subscriptions monitoring MBean does not report anything. A node reports only its own subscriptions; it does not report those created by other nodes. As a result, if a node goes offline without unsubscribing (for example, because the server process fails or the machine is rebooted), messages sent to that node are buffered but not reported by this bean. The bean reports buffering only for nodes that are online but experiencing a slowdown or some other trouble.
Buffering indicates that the application on the node cannot process messages quickly enough. This issue might require tuning the system characteristics of the node or increasing the thread pools for message handling.
6. MQTT broker resource metrics
Note: This managed bean was introduced in version 6.0.6.
This bean provides memory information about the message broker server. This data is a shared metric: it represents resource use by the message broker itself, not by a separate node or cluster.
| Attribute | Description | Threshold |
| --- | --- | --- |
| MemoryUsedPercent | Percentage of the total system memory used by the broker. | Warn > 80%; alert > 90% |
To enable the MQTT Broker Metrics MBean (MQTTBrokerSubscriptionMetrics), on the serviceability page, set the Enable MQTT Broker Metrics MBean property under ClusterMetricsTask to true. In addition, on the advanced properties page, set Enable MQTT broker statistics under com.ibm.team.repository.service.mqtt.statistics.MQTTStatsService to true.
Diagnostics in a clustered environment
The CLM Monitoring article describes diagnostics and how to monitor them. When clustering is enabled, the diagnostics detect and report some problems that can occur in a clustered application. You only see these additional checks in a clustered application.
The diagnostics managed bean is part of the server health metrics and performs a series of diagnostic tests on the system. This is also an administrative task in the Admin UI. This MBean updates every 70 minutes. To enable the Diagnostics Metrics MBean, on the serviceability page, set the Enable Diagnostics Metrics MBean property of DiagnosticsMetricsCollectorTask to true.
| MBean | Attribute | Description | Threshold |
| --- | --- | --- | --- |
| Diagnostics | com.ibm.team.foundation.diagnostic:name=<<contextRoot>>,type=diagnosticMetrics,testId=* | Provides the results of the server diagnostics that run every hour by default. This information is useful for tracking the status of the periodic execution of server diagnostics. | ALERT: Provide an alert if any of the diagnostic results show a failure. |
The diagnostics contain several tests that vary by application. To monitor your diagnostic output and alert only on failures, build a pattern-based rule. How you build these rules depends on the monitoring tool you use; a generic sketch follows.
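As a sketch of such a rule, assume your monitoring agent exports the collected diagnostics MBean data to a text file; the file name and the failure pattern below are assumptions that you adjust to match what your tooling actually records.

```bash
#!/bin/bash
# Sketch of a pattern-based alerting rule over exported diagnostics data.
# diagnostics-dump.txt is a placeholder for wherever your monitoring tool
# writes the collected diagnostics MBean attributes; the failure pattern is an
# assumption and should be adjusted to the markers your diagnostics report.
DUMP=diagnostics-dump.txt

failures=$(grep 'diagnosticMetrics' "$DUMP" | grep -Eic 'fail|error')
if [ "${failures:-0}" -gt 0 ]; then
  echo "ALERT: $failures diagnostic test(s) reported a failure"
fi
```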
Additional diagnostic checks:
- Connectivity problems with MQTT broker. If the MQTT broker is unavailable, the entire application is unusable.
- Count and status of application cluster nodes.
- Issues detected by the MQTT service and its use of resources, thread pools, queues, and messages sent and received over time.
- In CLM 6.0.6 and later, subscriptions dropped by the broker. If dropped subscriptions are found, the diagnostics attempt to repair them by resubscribing (if the Repair broken and missing MQTT subscriptions option is set to true).
Troubleshooting
HAProxy
You can use the HAProxy console to verify the online status and overall health of the cluster from the load-balancer’s point of view. Access to the console is configured in the HAProxy configuration file in the “listen statistics” section.
URL: http://haproxy.server.com:1936/
Documentation: http://www.haproxy.org/#docs
IBM IoT MessageSight
IBM IoT MessageSight also has an administrative console where you can review message broker details.
URL: https://messagesight.server.com:9087/
Documentation: https://www.ibm.com/support/knowledgecenter/SSCGGQ_1.2.0/com.ibm.ism.doc/Administering/ad00199_.html
Engineering Workflow Management
Engineering Workflow Management has a Clustering Administration UI.
URL: https://<ccm-public-url>:[port]/ccm/admin
Documentation: https://jazz.net/wiki/bin/view/Deployment/DeploymentTroubleshooting
Log locations
Application logs can be an invaluable source of information about the overall health of your clustered application.
| Application | Default Log Location |
| --- | --- |
| Engineering Workflow Management (Nodes) | <INSTALL_LOCATION>/server/liberty/servers/clm/logs |
| Jazz Team Server | <INSTALL_LOCATION>/server/liberty/servers/clm/logs |
| Distributed Cache Microservice | <INSTALL_LOCATION>/server/clustering/cache/logs by default. If the microservice runs from another location, check the logs subfolder of that location. |
| HAProxy | HAProxy logging is configured in its configuration file. For details, see the HAProxy documentation. |
| IBM IoT MessageSight | /var/messagesight/diag/logs on the host machine. For details, see Log files. |
Reference
JMX MBeans for ELM Application Monitoring
Distributed Cache Microservice for clustered applications
IBM IoT MessageSight documentation
About the authors
Alexander Bernstein is a Senior Software Engineer in the Jazz Foundation development group in Persistent Systems. He has more than 30 years of experience designing and developing software solutions in a variety of fields of applications and industries, ranging from device drivers to graphical user interfaces to enterprise-wide informational systems. He led a team of developers that made clustering of Collaborative Lifecycle Management (CLM) applications possible.
Vishy Ramaswamy is a Senior Technical Staff Member in the Jazz Foundation development group in Persistent Systems. In this capacity, he is responsible for defining and managing the overall architecture of the Jazz Application Framework services and capabilities for CLM products. Vishy’s career in software development spans 21 years, during which time Vishy has worked in a technical leadership role on products in the Application Lifecycle Management space and software services related to the telecom, wireless, health care, federal government, and defense industries.
Vaughn Rokosz is a Senior Technical Staff Member (STSM) and technical lead of the performance engineering team for Jazz products, where he looks at the performance and reliability of the Jazz products at scale. He curates the Performance pages on the Jazz.net Deployment wiki, and has published a number of articles on capacity planning. He has worked in a variety of areas, including factory automation, statistical analysis, document and knowledge management, and Collaborative Lifecycle Management, and is deeply interested in the dynamics of complex software systems.
Richard Watts is the Jazz Foundation Program Director and Product Delivery Lead for the Jazz Team Server, part of the CLM team in Persistent Systems. He has over 25 years of experience building commercial software and leading development teams. He has worked on various products in application lifecycle management, application development tooling, messaging, calendar and scheduling, and financial applications.
© Copyright IBM Corporation 2020