GC pauses and SSL issues - could they be related?
I've read many articles about GC, along with the guides posted here. In general we follow the recommendations for configuring WebSphere; the one deviation is that -Xms is roughly half of -Xmx. Setting all our applications to equal initial and maximum heap sizes makes the LPAR thrash with swapping. This has worked well, except as follows:
Related prior questions:
https://jazz.net/forum/questions/163157/message-re-maxdirectmemorysize-in-jazz-logs-change-needed
https://jazz.net/forum/questions/163472/should-rfc1323-be-1-related-to-remote-host-closed-connection-during-handshake

After we upgraded to 4.0.7 and moved 100% to WebSphere 8.5.5.2 on AIX, we have had a perplexing, persistent issue with users receiving SSL handshake exceptions at random times. I opened a PMR to troubleshoot, and that one got pushed up to WebSphere.

On a couple of the affected WebSphere profiles, I enabled verbose GC, found GCMV ( https://www.ibm.com/developerworks/community/blogs/troubleshootingjava/entry/gcmv_in_eclipse?lang=en ) and looked at some of the analysis. In some cases the time of the SSL handshake failure is roughly associated with one of the many short but dense bursts of pause time. The average pause is 40 ms, with a maximum of almost 8 s. Total pause time is about 2 h over a 14-day period.

The GCMV summary for one affected profile was:

Tuning recommendation

Your application appears to be leaking memory. This is indicated by the used heap increasing at a greater rate than the application workload (measured by the amount of data freed). To investigate further see Diagnostics Guide.

At one point 32275 objects were queued for finalization. Using finalizers is not recommended as it can slow garbage collection and cause wasted space in the heap. Consider reviewing your application for occurrences of the finalize() method. You can use the ISA Tool Add-on, IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer to list objects that are only retained through finalizers.

Your application is allocating many large objects, which affects performance. Consider increasing the size of the heap.

Garbage collection is causing some large pauses. The largest pause was 7743 ms. This may affect application responsiveness. If responsiveness is a concern then a switch of policy or reduction in heap size may be helpful.

Your application appears to be relying on class unloading during global collects. Consider using the Balanced GC policy for applications deployed on a 64-bit platform with a heap size greater than 4GB, which performs class unloading incrementally.

198 global garbage collects took on average 1,366% longer than the average nursery collect. If you believe this is abnormally high and unacceptable, consider using the Balanced GC policy for applications deployed on a 64-bit platform with a heap size greater than 4GB.

Are those pauses a potential cause of the SSL handshake issues we've been getting?
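One way to make "roughly associated" more concrete is to line up the pause timestamps from the verbose GC log against the timestamps of the handshake failures. The following is only a minimal sketch of that idea: the class name, the `Pause` type, the minimum-pause threshold and the tolerance window are all hypothetical, and parsing the verbose GC log and the server logs into these lists is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of GCMV, WebSphere, or RTC. */
final class PauseCorrelator {

    /** One GC pause: start time (epoch millis) and duration (millis). */
    static final class Pause {
        final long startMillis;
        final long durationMillis;

        Pause(long startMillis, long durationMillis) {
            this.startMillis = startMillis;
            this.durationMillis = durationMillis;
        }
    }

    /**
     * Returns the failure timestamps that fall inside a pause of at least
     * minPauseMillis, widened by toleranceMillis on both sides to allow for
     * clock and logging skew (the tolerance is an assumption to tune).
     */
    static List<Long> failuresNearPauses(List<Pause> pauses, List<Long> failureTimes,
                                         long minPauseMillis, long toleranceMillis) {
        List<Long> hits = new ArrayList<>();
        for (long failure : failureTimes) {
            for (Pause p : pauses) {
                if (p.durationMillis < minPauseMillis) {
                    continue; // ignore the many short pauses
                }
                long windowStart = p.startMillis - toleranceMillis;
                long windowEnd = p.startMillis + p.durationMillis + toleranceMillis;
                if (failure >= windowStart && failure <= windowEnd) {
                    hits.add(failure);
                    break; // count each failure once
                }
            }
        }
        return hits;
    }

    /** Fraction of wall-clock time spent in GC pauses. */
    static double pauseFraction(long totalPauseMillis, long periodMillis) {
        return (double) totalPauseMillis / periodMillis;
    }
}
```

With the figures quoted above, `pauseFraction` for 2 h of pause over 14 days comes out at roughly 0.6% of wall-clock time, so the overall average says little; the interesting input is the set of multi-second pauses such as the 7743 ms one.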
Comments
What is the detailed exception? A timeout? If so, I lean toward the idea that the GC pauses and the SSL handshake failures are related, but the only relevant setting I can find in WAS is com.ibm.ws.orb.transport.SSLHandshakeTimeout, with a default value of 0 (no timeout). I'm not sure whether the client application sets an SSL handshake timeout, though.
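To illustrate how a client-side timeout could turn a stalled server into a handshake failure, here is a small sketch using plain JSSE. It is not the RTC client's code, and the class and method names are hypothetical. A local server socket that never answers stands in for a server frozen by a long GC pause; the client sets a read timeout with `setSoTimeout` before calling `startHandshake`. The exact exception type is deliberately left open, since it may differ between JSSE implementations.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical demo class; names are illustrative only. */
final class HandshakeTimeoutDemo {

    /** Attempts a TLS handshake; returns null on success, else the exception that ended it. */
    static IOException tryHandshake(String host, int port, int timeoutMillis) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            // The read timeout also applies to reads performed during the handshake.
            socket.setSoTimeout(timeoutMillis);
            socket.startHandshake();
            return null;
        } catch (IOException e) {
            return e;
        }
    }

    /** Handshakes against a local socket that accepts TCP connections but never replies. */
    static IOException demoAgainstSilentServer(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return tryHandshake(server.getInetAddress().getHostAddress(),
                                server.getLocalPort(), timeoutMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Handshake ended with: " + demoAgainstSilentServer(500));
    }
}
```

If the real client uses a timeout shorter than the longest pause (almost 8 s here), a pause that lands mid-handshake would be enough to fail the connection even with the server-side timeout left at 0.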
See the work item mentioned in the 2nd post, comment #19
https://jazz.net/jazz/web/projects/Rational%20Team%20Concert#action=com.ibm.team.workitem.viewWorkItem&id=195587
This IS NOT an answer; it is the only place I can fit stack examples ...
Kevin, you can always convert an "answer" to a comment after you put a long text in the answer, so it ends up in the right place.
If the problem you are facing is anything like the one mentioned in WI 195587, you have a big task ahead - data collecting will be the toughest job.
In comment 20 of WI 195587, it is mentioned that Story 201208 introduced a retry mechanism to overcome such errors, and I'm wondering why it does not cover your scenarios. The code change is in BasicVersionedContentManager, which also appears in the stack trace you posted, so I would expect the retry to kick in. Apparently that is not the case.
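For readers unfamiliar with the idea, a retry around a handshake failure generally looks something like the sketch below. This is not the code from Story 201208 or BasicVersionedContentManager, which I have not seen; the class name, the attempt limit and the doubling backoff are assumptions chosen for illustration.

```java
import java.util.concurrent.Callable;
import javax.net.ssl.SSLHandshakeException;

/** Hypothetical helper; not the RTC implementation. */
final class HandshakeRetry {

    /**
     * Runs the operation, retrying only when it fails with SSLHandshakeException.
     * The delay doubles after each failed attempt; any other exception propagates at once.
     */
    static <T> T callWithRetry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (SSLHandshakeException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

One thing a sketch like this makes visible: if the retries are spaced more closely than the length of a pause, every attempt can land inside the same multi-second pause and the retry will not help.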
Thanks for those tips [ forum type ]. As I have been seeing out-of-memory issues mentioning MaxDirectMemory, I searched far and wide and found this article:
https://jazz.net/library/article/1430
I added the custom property to the two WebSphere profiles that have been the problem children. They were restarted on the 10th and 11th. To date neither has emitted the OOM, garbage collection pauses have dropped to much more tolerable levels, and thus far there have been no user reports of SSL handshake exceptions.
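For anyone who wants to watch direct (off-heap) buffer usage alongside the verbose GC data, the standard `java.lang.management` API can report it. The sketch below is only an illustration and does not reproduce the custom property from the article. It assumes a Java 7 or later runtime that exposes a buffer pool named `direct`; the class name is hypothetical.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

/** Hypothetical probe; assumes the runtime reports a "direct" buffer pool. */
final class DirectMemoryProbe {

    /** Returns bytes used by the "direct" buffer pool, or -1 if no such pool is reported. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MiB off-heap
        long after = directBytesUsed();
        System.out.println("direct bytes before=" + before + " after=" + after
                + " (buffer capacity " + buffer.capacity() + ")");
    }
}
```

Logging this value periodically would show whether direct memory climbs toward its limit before the OOM messages appear.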