[closed] GC pauses and SSL issues - could they be related?

Related prior questions:
https://jazz.net/forum/questions/163157/message-re-maxdirectmemorysize-in-jazz-logs-change-needed
https://jazz.net/forum/questions/163472/should-rfc1323-be-1-related-to-remote-host-closed-connection-during-handshake
Since we upgraded to 4.0.7, running entirely on WebSphere 8.5.5.2 on AIX, we have had a perplexing, persistent issue with users receiving SSL Handshake exceptions at random times. I opened a PMR to troubleshoot, and it was escalated to WebSphere.
On a couple of the affected WebSphere profiles, I enabled verbose GC, then found GCMV ( https://www.ibm.com/developerworks/community/blogs/troubleshootingjava/entry/gcmv_in_eclipse?lang=en ) and looked at some of its analysis. In some cases the time of an SSL handshake failure roughly coincides with one of the many short but dense bursts of pause time. The average pause is 40 ms, with a maximum of almost 8 s. Total pause time is about 2 h over a 14-day period.
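As a quick sanity check, the figures quoted above can be turned into an overall pause fraction and an approximate pause count. This is only back-of-the-envelope arithmetic on the rounded values (40 ms average, about 2 h total, 14 days), not on the actual log data, and the variable names are illustrative.

```python
# Rough arithmetic on the rounded figures quoted above (not exact log data).
avg_pause_s = 0.040            # average pause: about 40 ms
total_pause_s = 2 * 3600       # total pause time: about 2 hours
window_s = 14 * 24 * 3600      # observation window: 14 days

pause_fraction = total_pause_s / window_s     # share of wall-clock time spent paused
approx_pauses = total_pause_s / avg_pause_s   # implied number of pauses
pauses_per_day = approx_pauses / 14

print(f"time spent paused: {pause_fraction:.2%}")
print(f"approx. pauses:    {approx_pauses:,.0f} (~{pauses_per_day:,.0f} per day)")
```

On these numbers the overall GC overhead is well under 1%, which suggests that the rare multi-second pauses, rather than total GC time, are the ones worth lining up against the handshake failures.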
GCMV summary for one affected WAS profile:
Tuning recommendation
Your application appears to be leaking memory. This is indicated by the used heap
increasing at a greater rate than the application workload (measured by the amount of data
freed). To investigate further see Diagnostics Guide.
At one point 32275 objects were queued for finalization. Using finalizers is not
recommended as it can slow garbage collection and cause wasted space in the heap.
Consider reviewing your application for occurrences of the finalize() method. You can use
the ISA Tool Add-on, IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer to list
objects that are only retained through finalizers.
Your application is allocating many large objects, which affects performance. Consider
increasing the size of the heap.
Garbage collection is causing some large pauses. The largest pause was 7743 ms. This
may affect application responsiveness. If responsiveness is a concern then a switch of policy
or reduction in heap size may be helpful.
Your application appears to be relying on class unloading during global collects. Consider
using the Balanced GC policy for applications deployed on a 64-bit platform with a heap size
greater than 4GB, which performs class unloading incrementally.
198 global garbage collects took on average 1,366% longer than the average nursery
collect. If you believe this is abnormally high and unacceptable, consider using the Balanced
GC policy for applications deployed on a 64-bit platform with a heap size greater than 4GB.
Are those pauses a potential cause of the SSL Handshake issues we've been getting?
The question has been closed for the following reason: "Problem is not reproducible or outdated" by rschoon May 16, 7:05 a.m.
One answer

Yes. Based on your description and the GCMV analysis, the long GC pauses you observed, with a maximum of nearly 8 seconds, may well be a cause of the SSL Handshake exceptions you are experiencing. SSL handshakes are time-sensitive operations: if the JVM is in a stop-the-world GC pause when a handshake is initiated, the handshake can time out or fail, which would produce sporadic errors like the ones you reported. Several other issues flagged in the analysis make matters worse: dense bursts of GC activity, high object finalization counts, a possible memory leak, and reliance on class unloading. Together they make both memory behavior and responsiveness less predictable.
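One way to test this is to compare the timestamps of the handshake failures against the pause intervals from the verbose GC log. The following is a minimal sketch in Python, assuming the pauses and failures have already been extracted as epoch-second values. The function name, the 1-second threshold, the slack window, and the sample data are all hypothetical and are not taken from the logs in this thread.

```python
from typing import List, Tuple

def failures_near_pauses(
    pauses: List[Tuple[float, float]],   # (start_epoch_s, duration_s) per GC pause
    failures: List[float],               # epoch seconds of each handshake failure
    min_pause_s: float = 1.0,            # ignore short pauses (assumed threshold)
    slack_s: float = 2.0,                # tolerance around each pause (assumed)
) -> List[Tuple[float, Tuple[float, float]]]:
    """Return (failure_time, pause) pairs where a failure falls within a long pause."""
    long_pauses = [p for p in pauses if p[1] >= min_pause_s]
    matches = []
    for t in failures:
        for start, duration in long_pauses:
            if start - slack_s <= t <= start + duration + slack_s:
                matches.append((t, (start, duration)))
                break  # one matching pause per failure is enough
    return matches

# Made-up sample data for illustration only.
sample_pauses = [(1000.0, 0.04), (2000.0, 7.7), (3000.0, 0.05), (4000.0, 1.5)]
sample_failures = [1500.0, 2004.2, 4002.9, 5000.0]

matched = failures_near_pauses(sample_pauses, sample_failures)
print(f"{len(matched)} of {len(sample_failures)} failures fall near a long pause")
```

If most failures line up with long pauses, that supports the GC explanation. If they do not, the cause more likely lies elsewhere, for example in the network or in a timeout on the peer side.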
Given the GCMV recommendations and your heap usage profile, I would suggest the following:
1. Investigate the potential memory leak and try to reduce finalizer usage.
2. Consider tuning the heap size, or reducing it slightly, to help avoid long global collection pauses.
3. Evaluate switching to the Balanced GC policy, which is designed to handle large heaps more efficiently and performs class unloading incrementally.
4. Monitor object allocation patterns, particularly for large objects, which may be triggering major collections more frequently.
If you or anyone else is interested, this blog gives a fuller explanation of Java garbage collection fundamentals. It is a helpful read, especially if you are trying to understand how GC behavior can affect performance: What is Java Garbage Collection?
Comments
Donald Nong
Oct 13 '14, 1:42 a.m. What is the detailed exception? A timeout? If so, I would lean toward the idea that the GC pause and the SSL handshake failure are related, but the only setting I can find in WAS is com.ibm.ws.orb.transport.SSLHandshakeTimeout, with a default value of 0 (no timeout). I am not sure whether the client application sets an SSL handshake timeout value, though.
Kevin Ramer
Oct 13 '14, 10:32 a.m. See the work item mentioned in the 2nd post, comment #19
https://jazz.net/jazz/web/projects/Rational%20Team%20Concert#action=com.ibm.team.workitem.viewWorkItem&id=195587
Kevin Ramer
Oct 13 '14, 7:58 p.m. This IS NOT an answer; it is the only place I can fit stack examples ...
Donald Nong
Oct 13 '14, 8:22 p.m. Kevin, you can always convert an "answer" to a comment after you put a long text in the answer, so it ends up in the right place.
If the problem you are facing is anything like the one mentioned in WI 195587, you have a big task ahead - data collecting will be the toughest job.
Comment 20 of WI 195587 mentions that Story 201208 introduced a retry mechanism to overcome such errors, and I'm wondering why it does not cover your scenarios. The code change is in BasicVersionedContentManager, which also appears in the stack trace you posted, so I would expect the retry code to be in effect. Apparently that is not the case.
Kevin Ramer
Oct 14 '14, 10:15 a.m. Thanks for those tips [ forum type ]. As I have been seeing Out of Memory issues mentioning MaxDirectMemory, I searched far and wide about that and found this article:
https://jazz.net/library/article/1430
I added the custom property to two of the WebSphere profiles that have been the problem children. They were restarted on the 10th and 11th. To date neither has emitted the OOM, garbage collection pauses have dropped to much more tolerable intervals, and thus far there have been no user reports of SSL Handshake exceptions.