What does do?

long TRUONG (3654120147) | asked Apr 19 '16, 10:38 p.m.
This is a different question/aspect of the same server hang issue as post 220317.
RTC/RRC 5.0.2 on Windows/Tomcat
For the second day (plus one night), our server has hung almost every 2 hrs with CPU at 100%, requiring a restart each time.

We tried starting with the startup script instead of the Tomcat service, hoping for more stability. We got to 3 hrs without any issue, with a still very good CPU profile and immediate responses from RTC (under the Tomcat service, it would become sluggish at around 2 hrs or less, with CPU on a flat plateau at 100%). We kept our fingers crossed.

This was a time when nobody was really working, so we could match the CPU response to every single operation we initiated. All of a sudden, 3 users with the same activity appeared in the CCM active services list. The operation seemed to last less than 0.2 secs for each of them, but the CPU immediately spiked to a flat plateau of 100% and stayed there for good, even after they disappeared from the list. Of course, that led to sluggish responses and a Site Scope alert.

Any idea what this activity would accomplish and how it could bring RTC to its knees?


Accepted answer

long TRUONG (3654120147) | answered Apr 28 '16, 2:02 a.m.
This is not quite the direct answer to this question; however, it is the answer to the severity 1 issue that prompted this question in the first place.

After 8 consecutive days and nights of hangs (every 2 hrs via the Tomcat service and every 3 hrs via command-line starts with the server.startup.bat script), each requiring an app restart and constant monitoring/attention;
After IBM support identified that API class, along with other error messages about licenses and more, as results of the maxed-out CPU;
After IBM pinned down the severity 1 issue as the result of indexing catch-up jobs, which may resume where they left off or may restart from scratch;
After our hangs progressed to OOM heapdumps (x4) and javacores (x4), and also a DB snap, with every hang:

JVMDUMP032I JVM requested Snap dump using 'D:\IBM\rtc_502\JazzTeamServer\server\Snap.20160423.112844.5904.0007.trc' in response to an event

JVMDUMP010I Snap dump written to {nothing to snap}

JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".

JVMDUMP032I JVM requested Java dump using 'D:\IBM\rtc_502\JazzTeamServer\server\javacore.20160423.112844.5904.0001.txt' in response to an event

JVMDUMP032I JVM requested Heap dump using 'D:\IBM\rtc_502\JazzTeamServer\server\heapdump.20160423.112844.5904.' in response to an event
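
For anyone sifting through similar console output, here is a minimal Python sketch (not something we or IBM support used) that tallies the dump requests and OutOfMemoryError events in a log, based only on the message shapes quoted above. The function name `summarize_dumps` and the sample paths are hypothetical placeholders.

```python
import re
from collections import Counter

# Shape of the "dump requested" message quoted above (JVMDUMP032I).
REQUESTED = re.compile(r"JVMDUMP032I JVM requested (\w+) dump using '([^']*)'")
# Shape of the "processed dump event" message quoted above (JVMDUMP013I).
PROCESSED = re.compile(r'JVMDUMP013I Processed dump event "([^"]+)", detail "([^"]*)"')


def summarize_dumps(log_text):
    """Return (dump counts by type, number of OutOfMemoryError events)."""
    dump_counts = Counter(m.group(1) for m in REQUESTED.finditer(log_text))
    oom_events = sum(
        1
        for m in PROCESSED.finditer(log_text)
        if m.group(2) == "java/lang/OutOfMemoryError"
    )
    return dump_counts, oom_events


# Hypothetical sample; the file paths are placeholders, not the real ones.
SAMPLE_LOG = """\
JVMDUMP032I JVM requested Snap dump using 'C:\\example\\Snap.0001.trc' in response to an event
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
JVMDUMP032I JVM requested Java dump using 'C:\\example\\javacore.0001.txt' in response to an event
JVMDUMP032I JVM requested Heap dump using 'C:\\example\\heapdump.0001.phd' in response to an event
"""

if __name__ == "__main__":
    counts, ooms = summarize_dumps(SAMPLE_LOG)
    for dump_type, count in sorted(counts.items()):
        print(f"{dump_type} dumps requested: {count}")
    print(f"OutOfMemoryError events: {ooms}")
```

Running it over the server's console log for each hang would show at a glance whether every hang ended in an OutOfMemoryError, as ours did.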

While closely monitoring the app, we observed that there was just one active service, apparently performing the indexing catch-up job for our huge import of WIs and new custom attributes, done the weekend before the issues started. It kept restarting with each app restart after a hang, most of the time all by its lonesome during the weekend: internal.lucene.WorkItemIndexingParticipant.updateIndex

This active service ran out of memory every time and hung RTC.

We had an emergency temporary memory upgrade to the maximum 32 GB allowed by Windows Server 2008, but allocating according to the 50/50 heap/OS rule of thumb (-Xmx16G out of 32 GB of memory) still did not allow this job to complete.

We finally broke out of severity 1 by using a skewed allocation with a higher heap; the job completed in 4 hrs 26 min.

We were not aware of the indexing catch-up job(s), which can be long-running after our routine huge imports of WIs, although the project team confirmed after the fact that this import was bigger than our usual huge ones. We had always thought that indexing would complete along with each successful import.
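
To make the arithmetic concrete, here is a small Python sketch of how the heap flag follows from total memory and the chosen heap share. The helper `heap_flag` is hypothetical, and the 0.75 share is only an illustration of a skewed split, not the value we actually used.

```python
def heap_flag(total_ram_gb, heap_fraction):
    """Build a -Xmx value (whole GB) from total RAM and the heap's share of it."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1 (exclusive)")
    heap_gb = int(total_ram_gb * heap_fraction)  # round down to whole GB
    return f"-Xmx{heap_gb}G"


if __name__ == "__main__":
    # The 50/50 heap/OS rule of thumb on a 32 GB server.
    print(heap_flag(32, 0.5))
    # A hypothetical skewed split that gives the heap a larger share.
    print(heap_flag(32, 0.75))
```

Whatever share is chosen, the OS still needs enough memory left over for itself and for the JVM's non-heap usage, so a skewed split is a temporary workaround to watch closely.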

Ralph Schoon selected this answer as the correct answer

One other answer

Ralph Schoon (63.2k33646) | answered Apr 20 '16, 3:09 a.m.
edited Apr 20 '16, 3:12 a.m.
The API class you are referring to here is an internal service (related to accessing the database). It is not very likely that anyone in the community has a lot of experience with it. This, and the rest of your question, pretty much tells me that you should probably be talking to support instead of trying to address this in the forum.

Depending on your topology and the applications you have actually installed, candidates for high loads are the Data Collection Component, BIRT reports, and potentially others. It would also be a good idea to look at the database and search for errors there, because if the DB starts failing, the application might show bad performance as well.

The CPU load alone does not really help either; that is why you should have server monitoring set up to see how the server performance changes over time.

It is unlikely that the root cause can be found without a lot more communication and information, which support is able to get and share but the forums likely are not.

long TRUONG commented Apr 20 '16, 8:49 a.m.

Thx Ralph,

Indeed, we had already bumped up our existing PMR to severity 1, and all logs and info, including 2 sets of heapdumps and javacores, had been in Escalation's hands for a day before this post was put up. We were hoping for some direct experience with the same issue.

I was talking to a fresh IBM support engineer (not on the long-running case), and he ruled out that this activity (picking/saving query results) could have done it directly and by itself. We did see a lot of the same activities running concurrently without them shooting the CPU up to the point of no return.

The startup script seems to buy us over an extra hour compared to the Tomcat service.
