Engineering Lifecycle Management Wiki - Deployment

Revision 1 - 2016-07-18 - 16:32:10 - Main.fryerk

Monitoring error messages

Authors: KathrynFryer, TimFeeney
Build basis: Rational Collaborative Lifecycle Management (CLM) and Watson Internet of Things Continuous Engineering (IoTCE) solutions 6.0.2

As part of their monitoring strategy for the CLM or IoTCE solution, some clients opt to use third-party tools to parse and monitor messages logged in log files to identify issues or potential issues. If you choose to monitor the logs, this page describes some messages and scenarios to watch for. If the log is not specifically identified, the message could appear in any of the CLM/CE application logs.

Out-of-memory errors

Look for the text OutOfMemoryError. The accompanying details indicate whether the error relates to native memory (Native Memory) or to the Java heap (Java Heap).

Native memory is the memory reserved for the classes. When you include compressedrefs in the JVM settings, native memory is the first 4 GB of memory reserved. If you continue to see native out-of-memory errors, you might need more than that 4 GB reserve; in that case, consider switching to the nocompressedrefs setting.

For out-of-memory errors in general, you can look at the verbose garbage collection logs to see when memory started to increase, and then examine the application logs or a thread dump for that time period to see what was happening. You could use a tool like IBM Memory Analyzer to drill into the heap dump to find leak suspects and classes consuming the most memory.

In addition, the error message CRJAZ2264W could be an early indicator of out-of-memory errors; if you see this message, please report it to your IBM support contact.
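As a rough illustration of the checks described in this section, the following Python sketch classifies a single log line. The function name classify_oom and the sample detail strings are hypothetical; the exact wording that follows OutOfMemoryError depends on the JVM, so treat the substring matches as assumptions to adjust against your own logs.

```python
from typing import Optional


def classify_oom(line: str) -> Optional[str]:
    """Classify one log line for out-of-memory monitoring.

    Returns 'early-warning' for CRJAZ2264W, 'native' or 'heap' when the
    OutOfMemoryError details name the memory area, 'unknown' when they do
    not, and None for lines that are not relevant.
    """
    if "CRJAZ2264W" in line:
        return "early-warning"
    if "OutOfMemoryError" not in line:
        return None
    lowered = line.lower()
    # Assumption: the detail text contains "native memory" or "java heap".
    if "native memory" in lowered:
        return "native"
    if "java heap" in lowered:
        return "heap"
    return "unknown"
```

A monitoring script could call classify_oom on each new line and raise an alert for any value other than None.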

Java dumps

In the WebSphere Application Server (WAS) native_stderr.log, JVMDUMP032I is logged any time a Java dump is requested, including thread dump, system dump, and memory dump. JVMDUMP030W is logged if the dump cannot be written, which could be due to issues with permissions, disk space, or another reason.

Because users can trigger dumps with various tools and from the WAS console, you probably do not want to monitor all the JVMDUMP messages, which are documented in the IBM Knowledge Center: http://www.ibm.com/support/knowledgecenter/SSYKE2_7.1.0/com.ibm.java.messages/diag/appendixes/messages/dump/messages_dump.html. Over time, you might add selected messages to your watch-list based on your experience. You would need to analyze the dump to determine the root cause and take action accordingly.
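One way to keep such a watch-list is sketched below in Python. It extracts the JVMDUMP message ID from each line and reports only the IDs you have chosen to watch. The function name dump_alerts and the default watch-list (only JVMDUMP030W) are illustrative choices, not a recommendation from the product documentation.

```python
import re

# JVMDUMP IDs have three digits and a severity letter, e.g. JVMDUMP032I.
DUMP_ID = re.compile(r"\b(JVMDUMP\d{3}[A-Z])\b")


def dump_alerts(lines, watch_list=("JVMDUMP030W",)):
    """Return (message_id, line) pairs for dump messages on the watch-list."""
    alerts = []
    for line in lines:
        match = DUMP_ID.search(line)
        if match and match.group(1) in watch_list:
            alerts.append((match.group(1), line.rstrip()))
    return alerts
```

You could pass the open native_stderr.log file object as lines, and extend watch_list as you learn which messages matter in your environment.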

Hung threads

Look in the WAS SystemOut.log file for the text "thread(s) in total in the server that may be hung".

A support article describes how to generate a javacore or thread dump when WebSphere Application Server detects a hung thread, so that the dump can be used for further analysis. In some cases, if the alerts are for jobs that simply require more time to complete, you might need to increase the hung thread detection threshold from its default.
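The text match for hung threads could be sketched in Python as follows. The sketch assumes that a thread count may appear immediately before the searched text; that prefix and the function name hung_threads are assumptions, so verify the pattern against a real log line before relying on the count.

```python
import re

# The literal text comes from the hung-thread message; the optional
# leading number is an assumption about how the full line is written.
HUNG = re.compile(
    r"(?:(\d+)\s+)?thread\(s\) in total in the server that may be hung"
)


def hung_threads(line):
    """Return (found, count); count is None when no number precedes the text."""
    match = HUNG.search(line)
    if not match:
        return (False, None)
    count = int(match.group(1)) if match.group(1) else None
    return (True, count)
```

Tracking the count over time can help separate a single slow job from a growing backlog of hung threads.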

Long-running RM SPARQL queries

In the RM log, warning message CRJZS5819W or CRJZS5820W is logged when a query runs longer than the threshold setting (rm/admin > Advanced Properties > Jena query operation logging time threshold (in ms)). Another message, CRJZS5742W, is logged when the query completes. The full message text appears as follows:

CRJZS5819W Query {id} is still running after {n} seconds and has now exceeded the maximum amount of time allowed by the server. This query may be jeopardizing the memory and CPU health of the server. It has {x} user contexts: {contexts}
CRJZS5820W Query {id} is still running after {n} seconds and has now exceeded the maximum amount of time allowed by the server. This query may be jeopardizing the memory and CPU health of the server. It is scoped to {scope} and has {x} user contexts: {context}
CRJZS5742W Query {id} had an execution time of {n} ms, and produced {x} results

The default query threshold is 45 seconds, so not every instance of the warning message indicates an issue. However, if you find multiple sequential instances of the same query id, especially without any completion message, you could note the query id and investigate the cause. You cannot kill the query while it is still running.
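The following Python sketch shows one way to find query ids that were warned about repeatedly and never logged a completion message. The patterns are derived from the message templates above; the assumption that a query id contains no whitespace, the function name unfinished_queries, and the min_warnings parameter are all illustrative. For simplicity, the sketch counts warnings per query id across the whole input rather than checking that they are strictly sequential.

```python
import re
from collections import Counter

# Patterns follow the CRJZS5819W/CRJZS5820W and CRJZS5742W templates.
WARNING = re.compile(r"CRJZS58(?:19|20)W Query (\S+) is still running")
COMPLETED = re.compile(r"CRJZS5742W Query (\S+) had an execution time")


def unfinished_queries(lines, min_warnings=2):
    """Map query id -> warning count for queries with no completion message."""
    warnings = Counter()
    completed = set()
    for line in lines:
        match = WARNING.search(line)
        if match:
            warnings[match.group(1)] += 1
            continue
        match = COMPLETED.search(line)
        if match:
            completed.add(match.group(1))
    return {
        query_id: count
        for query_id, count in warnings.items()
        if count >= min_warnings and query_id not in completed
    }
```

Running this over the most recent portion of the RM log gives a short list of query ids to investigate.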

SocketTimeoutException occurrences

Look for the text SocketTimeoutException; it is logged as a Java exception and typically looks something like:

<Error>
    ...
    Caused by: java.net.SocketTimeoutException

This exception indicates that something has taken longer than one of the many configurable socket timeouts set throughout the stack. The exception might indicate an abnormally long-running operation, or you might need to increase a socket timeout because of the amount of data or heavy activity.

SqlTimeoutException occurrences

Look for the text SqlTimeoutException; the error usually looks something like:

<Error>
    ...
    Caused by: com.ibm.db2.jcc.am.SqlTimeoutException: DB2 SQL Error: SQLCode = -{code}, SQLSTATE={code}, SQLERRMC={error}

This error indicates that a JDBC timeout has been reached and the database did not return the result. You would need to investigate the details to determine the cause.
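Both timeout exceptions appear on "Caused by:" lines, so a single parser can cover them. The Python sketch below extracts the exception class and, for the DB2 case, the SQLCode and SQLSTATE values to help with the investigation. The function name parse_timeout and the returned dictionary layout are hypothetical, and the detail pattern assumes the field layout shown in the example above.

```python
import re

CAUSED_BY = re.compile(
    r"Caused by:\s*([\w.$]+\.(SocketTimeoutException|SqlTimeoutException))"
)
# Assumption: the DB2 details follow the "SQLCode = ..., SQLSTATE=..." layout.
DB2_DETAIL = re.compile(
    r"SQLCode\s*=\s*(-?\d+),\s*SQLSTATE\s*=\s*(\w+)", re.IGNORECASE
)


def parse_timeout(line):
    """Return details for a timeout 'Caused by:' line, or None if no match."""
    match = CAUSED_BY.search(line)
    if not match:
        return None
    info = {"class": match.group(1), "kind": match.group(2)}
    detail = DB2_DETAIL.search(line)
    if detail:
        info["sqlcode"] = int(detail.group(1))
        info["sqlstate"] = detail.group(2)
    return info
```

Grouping the results by kind and sqlcode can show whether the timeouts cluster around one operation or are spread across the system.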

Corrupt indexes

If an index is corrupted and needs to be rebuilt, the message CRJZS5821E appears in the server log. If the server is online and starts a re-index, the message CRJZS5762W is logged. Users can expect to see missing data until the re-indexing has completed.

FATAL messages

Any message logged at the log4j FATAL level warrants investigation. In addition, the following messages indicate fatal errors and may warrant monitoring (TBD):

CRJAZ2189E A fatal error occurred while trying to fetch the internal jpi mapping table. Consult the log files for further details
CRJAZ2826E Some fatal problem occurred while attempting to communicate with an Authorization Server.
CRJAZ2833E Some fatal problem occurred while attempting to communicate with an Authorization Server. (Code "{0}").
CRJSS0020E A fatal error occurred trying to start cache: "{0}". Make sure the directory specified has the correct permissions and exists. All monitoring will now be disabled. The server will need to be restarted after the problem is fixed to enable server monitoring.
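A combined check for the log4j FATAL level and the message IDs listed above might look like the following Python sketch. It assumes that the log level appears in the line as the uppercase word FATAL, which depends on your log4j layout pattern; the function name fatal_reason is hypothetical.

```python
import re

# Message IDs taken from the list above.
FATAL_IDS = ("CRJAZ2189E", "CRJAZ2826E", "CRJAZ2833E", "CRJSS0020E")

# Assumption: the log4j level is written as the standalone word FATAL.
# The match is case-sensitive so that the word "fatal" inside ordinary
# message text does not trigger it.
FATAL_LEVEL = re.compile(r"\bFATAL\b")


def fatal_reason(line):
    """Return the matching message ID, 'FATAL' for the log level, or None."""
    for message_id in FATAL_IDS:
        if message_id in line:
            return message_id
    if FATAL_LEVEL.search(line):
        return "FATAL"
    return None
```

Checking the message IDs first means a listed error is reported by its ID even when it is also logged at the FATAL level.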

Copyright © by IBM and non-IBM contributing authors. All material on this collaboration platform is the property of the contributing authors.