As part of their monitoring strategy for the CLM or IoTCE solution, some clients opt to use third-party tools to parse and monitor messages logged in log files to identify issues or potential issues. If you choose to monitor the logs, this page describes some messages and scenarios to watch for. If the log is not specifically identified, the message could appear in any of the CLM/CE application logs.
Out-of-memory errors
Look for the text "OutOfMemoryError", with the additional parameters of "Native Memory" and "Java Heap".
Native memory is the memory reserved for classes and other JVM metadata. When the compressedrefs option is included in the JVM settings, native memory must fit within the first 4 GB of the address space. If you continue to see native out-of-memory errors, you might need more than that 4 GB reserve, in which case you can switch to the nocompressedrefs setting.
For out-of-memory errors in general, you can look at the verbose garbage collection logs to see when memory started to increase, and then examine the application logs or a thread dump for that time period to see what was happening. You could use a tool like IBM Memory Analyzer to drill into the heap dump to find leak suspects and classes consuming the most memory.
In addition, the error message CRJAZ2264W could be an early indicator of out-of-memory errors; if you see this message, please report it to your IBM support contact.
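As a starting point for automated monitoring, the checks above can be sketched as a small classifier. This is a minimal sketch, not a supported tool: classify_oom is a hypothetical helper, and the "native memory" / "Java heap" patterns are assumptions based on the message wording described above, which varies by product version.

```python
import re

# Hypothetical helper: classify an OutOfMemoryError log line as native
# or heap, based on the wording described above. Real log formats vary,
# so treat these patterns as a starting point to tune.
def classify_oom(line):
    if "OutOfMemoryError" not in line:
        return None
    if re.search(r"native memory", line, re.IGNORECASE):
        return "native"
    if re.search(r"java heap", line, re.IGNORECASE):
        return "heap"
    return "unknown"
```

A monitoring script could raise a different alert for the "native" case, since the remedy (adjusting compressedrefs) differs from a heap-sizing problem.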
Java dumps
In the WebSphere Application Server (WAS) native_stderr.log, JVMDUMP032I is logged any time a Java dump is requested, including thread dump, system dump, and memory dump. JVMDUMP030W is logged if the dump cannot be written, which could be due to issues with permissions, disk space, or another reason.
Because users can trigger dumps with various tools and from the WAS console, you probably do not want to monitor all the JVMDUMP messages, which are documented in the IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/SSYKE2_7.1.0/com.ibm.java.messages/diag/appendixes/messages/dump/messages_dump.html. Over time, you might add selected messages to your watch-list based on your experience. You would need to analyze the dump to determine the root cause and take action accordingly.
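Because JVMDUMP032I is routine, a selective filter like the following sketch can alert only on a watch-list of message IDs (here just JVMDUMP030W, the failed-write warning). The watch-list contents are an assumption you would tune from your own experience, as suggested above.

```python
# Alert only on JVMDUMP message IDs in a watch-list, ignoring routine
# dump-request messages such as JVMDUMP032I. Extend WATCHLIST over time
# based on experience.
WATCHLIST = {"JVMDUMP030W"}

def jvmdump_alerts(lines):
    return [ln for ln in lines if any(code in ln for code in WATCHLIST)]
```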
Hung threads
Look in the WAS SystemOut.log file for the text "thread(s) in total in the server that may be hung".
This support article describes how to generate a javacore or thread dump when WebSphere Application Server detects a hung thread, so that the dump can be used for further analysis. In some cases, if the alerts are for jobs that simply need more time to complete, you may need to increase the default hung thread detection threshold.
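If you want to extract the reported count rather than just match the text, a sketch like the following pulls the number out of the warning line. The regex is an assumption based on the quoted phrase; verify it against your actual SystemOut.log entries.

```python
import re

# Parse the WAS hung-thread warning to extract the reported thread
# count; the pattern follows the phrase quoted above.
HUNG_RE = re.compile(r"(\d+)\s+thread\(s\) in total in the server that may be hung")

def hung_thread_count(line):
    m = HUNG_RE.search(line)
    return int(m.group(1)) if m else None
```

Tracking the count over time lets you distinguish a single slow job from a growing backlog of stuck threads.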
Long-running RM SPARQL queries
In the RM log, warning message CRJZS5819W or CRJZS5820W is logged when a query runs longer than the threshold setting (rm/admin > Advanced Properties > Jena query operation logging time threshold (in ms)). Another message CRJZS5742W is logged when the query completes. The full message text appears as follows:
- CRJZS5819W Query {id} is still running after {n} seconds and has now exceeded the maximum amount of time allowed by the server. This query may be jeopardizing the memory and CPU health of the server. It has {x} user contexts: {contexts}
- CRJZS5820W Query {id} is still running after {n} seconds and has now exceeded the maximum amount of time allowed by the server. This query may be jeopardizing the memory and CPU health of the server. It is scoped to {scope} and has {x} user contexts: {context}
- CRJZS5742W Query {id} had an execution time of {n} ms, and produced {x} results
The default query threshold is 45 seconds, so not every instance of the warning message indicates an issue. However, if you find multiple sequential instances of the same query id, especially without any completion message, note the query id and investigate the cause. You cannot kill the query while it is still running.
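The correlation described above, repeated warnings for a query id with no matching completion message, can be sketched as follows. The "Query {id}" capture patterns are assumptions derived from the message texts listed above.

```python
import re

# Flag query ids with repeated CRJZS5819W/CRJZS5820W warnings and no
# CRJZS5742W completion message, per the guidance above.
WARN_RE = re.compile(r"CRJZS58(?:19|20)W Query (\S+)")
DONE_RE = re.compile(r"CRJZS5742W Query (\S+)")

def stuck_queries(lines, min_warnings=2):
    warnings, done = {}, set()
    for ln in lines:
        m = WARN_RE.search(ln)
        if m:
            warnings[m.group(1)] = warnings.get(m.group(1), 0) + 1
        m = DONE_RE.search(ln)
        if m:
            done.add(m.group(1))
    return [qid for qid, n in warnings.items() if n >= min_warnings and qid not in done]
```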
SocketTimeoutException occurrences
Look for the text "SocketTimeoutException"; it is logged as a Java exception and typically looks something like:
<Error>
...
Caused by: java.net.SocketTimeoutException
This exception indicates that something has taken longer than one of the many configurable socket timeouts set throughout the stack. The exception might indicate an abnormally long running operation, or you might need to increase a socket timeout due to the amount of data or heavy activity.
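To tell an isolated slow operation from heavy-load behavior that justifies raising a timeout, it can help to bucket occurrences by hour. This sketch assumes each line begins with a WAS-style bracketed timestamp; the crude split is an assumption to adjust for your log format.

```python
from collections import Counter

# Bucket SocketTimeoutException occurrences by the hour portion of the
# timestamp so spikes stand out. Assumes a "[M/D/YY H:MM:SS ...]" style
# prefix; everything before the first ':' covers date plus hour.
def timeouts_by_hour(lines):
    counts = Counter()
    for ln in lines:
        if "SocketTimeoutException" in ln:
            hour = ln.split(":", 1)[0]  # crude hour bucket; adjust per format
            counts[hour] += 1
    return counts
```

A cluster of timeouts in one hour points to a load or data-volume issue; scattered single occurrences are more likely one-off slow operations.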
SqlTimeoutException occurrences
Look for the text "SqlTimeoutException"; the error usually looks something like:
<Error>
...
Caused by: com.ibm.db2.jcc.am.SqlTimeoutException: DB2 SQL Error: SQLCODE=-{code}, SQLSTATE={state}, SQLERRMC={error}
This error indicates that a JDBC timeout has been reached and the database did not return the result. You would need to investigate the details to determine the cause.
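Since the investigation usually starts from the SQLCODE, a monitor can extract it so alerts are grouped by code. The pattern below is an assumption based on the message shape shown above; the case-insensitive match tolerates variants such as "SQLCode =".

```python
import re

# Extract the SQLCODE from an SqlTimeoutException line so alerts can be
# grouped by code. Case-insensitive to tolerate formatting variants.
SQLCODE_RE = re.compile(r"SQLCODE\s*=\s*(-?\d+)", re.IGNORECASE)

def sqlcode(line):
    m = SQLCODE_RE.search(line)
    return int(m.group(1)) if m else None
```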
Corrupt indexes
If an index is corrupted and needs to be rebuilt, the message CRJZS5821E appears in the server log. If the server is online and starts a re-index, the message CRJZS5762W is logged. Users can expect to see missing data until the re-indexing has completed.
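The two index-health messages above map naturally to operator-facing statuses, as in this sketch; the wording of each status is an assumption, only the message IDs come from the text.

```python
# Map the two index-health message IDs described above to an
# operator-facing status string.
INDEX_EVENTS = {
    "CRJZS5821E": "index corrupt: rebuild required",
    "CRJZS5762W": "online re-index started: expect missing data until it completes",
}

def index_events(lines):
    return [(code, desc) for ln in lines
            for code, desc in INDEX_EVENTS.items() if code in ln]
```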
FATAL messages
Any messages logged using log4j FATAL would warrant investigation. In addition, the following messages indicate fatal errors and may warrant monitoring (TBD):
- CRJAZ2189E A fatal error occurred while trying to fetch the internal jpi mapping table. Consult the log files for further details
- CRJAZ2826E Some fatal problem occurred while attempting to communicate with an Authorization Server.
- CRJAZ2833E Some fatal problem occurred while attempting to communicate with an Authorization Server. (Code "{0}").
- CRJSS0020E A fatal error occurred trying to start cache: "{0}". Make sure the directory specified has the correct permissions and exists. All monitoring will now be disabled. The server will need to be restarted after the problem is fixed to enable server monitoring.
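A combined check for both conditions above, the log4j FATAL level and the specific fatal message IDs, can be sketched as follows. The ID list is taken directly from the messages listed above; the word-boundary match on FATAL is an assumption to avoid false positives on words containing "fatal".

```python
import re

# Match either a log4j FATAL level or one of the specific fatal message
# IDs listed above.
FATAL_IDS = ("CRJAZ2189E", "CRJAZ2826E", "CRJAZ2833E", "CRJSS0020E")
FATAL_RE = re.compile(r"\bFATAL\b")

def is_fatal(line):
    return bool(FATAL_RE.search(line)) or any(code in line for code in FATAL_IDS)
```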