GC Pauses and SSL issues - could they be related?

Kevin Ramer | asked Oct 10 '14, 1:24 p.m.
edited Oct 10 '14, 1:30 p.m.

I've read many articles about GC, along with the guides posted here. In general we follow the recommendations for configuring WebSphere, the one deviation being that -Xms is roughly half of -Xmx. Setting all our applications to have equal initial and maximum heap sizes makes the LPAR thrash with swapping. This has worked well except as follows:

Related prior questions:
https://jazz.net/forum/questions/163157/message-re-maxdirectmemorysize-in-jazz-logs-change-needed
https://jazz.net/forum/questions/163472/should-rfc1323-be-1-related-to-remote-host-closed-connection-during-handshake

After we upgraded to 4.0.7 and went 100% WebSphere 8.5.5.2 on AIX, we have had a perplexing, persistent issue with users receiving SSL handshake exceptions at random times. I opened a PMR to troubleshoot, and that one got pushed up to WebSphere.

On a couple of the affected WebSphere profiles, I enabled verbose GC, found GCMV ( https://www.ibm.com/developerworks/community/blogs/troubleshootingjava/entry/gcmv_in_eclipse?lang=en ), and looked at some of the analysis. In some cases the time of the SSL handshake failure roughly coincides with one of the many short but dense bursts of pause time. Pause time averages 40 ms, with a maximum of almost 8 s. Total pause time is about 2 h over a 14-day period.
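One way to make "roughly associated" more concrete is to check each handshake failure timestamp against the GC pause windows. The following is only a minimal sketch under stated assumptions: the class and method names (GcPauseCorrelation, Pause, failuresNearPauses) are hypothetical, the timestamps in main are made up for illustration, and it assumes the pause start times and durations have already been extracted from the verbose GC log by some other means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GCMV or WebSphere.
final class GcPauseCorrelation {

    /** One stop-the-world pause: start time (epoch ms) and duration (ms). */
    static final class Pause {
        final long startMillis;
        final long durationMillis;

        Pause(long startMillis, long durationMillis) {
            this.startMillis = startMillis;
            this.durationMillis = durationMillis;
        }

        long endMillis() {
            return startMillis + durationMillis;
        }
    }

    /**
     * Returns the failure timestamps that fall inside any pause window,
     * widened by slackMillis on both sides to allow for clock skew and
     * for the delay before the client reports the error.
     */
    static List<Long> failuresNearPauses(List<Pause> pauses,
                                         List<Long> failureTimes,
                                         long slackMillis) {
        List<Long> hits = new ArrayList<>();
        for (long t : failureTimes) {
            for (Pause p : pauses) {
                if (t >= p.startMillis - slackMillis
                        && t <= p.endMillis() + slackMillis) {
                    hits.add(t);
                    break; // one matching pause is enough for this failure
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up data: a single 7743 ms pause starting at t = 10000 ms.
        List<Pause> pauses = Arrays.asList(new Pause(10_000L, 7_743L));
        List<Long> failures = Arrays.asList(5_000L, 12_000L, 30_000L);
        System.out.println(failuresNearPauses(pauses, failures, 1_000L));
    }
}
```

If most failures land inside a pause window and few land outside, that supports the correlation; a match rate close to what random timing would give does not.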

GCMV summary for one affected WAS profile:

Tuning recommendation
Your application appears to be leaking memory. This is indicated by the used heap
increasing at a greater rate than the application workload (measured by the amount of data
freed). To investigate further see Diagnostics Guide.

At one point 32275 objects were queued for finalization. Using finalizers is not
recommended as it can slow garbage collection and cause wasted space in the heap.
Consider reviewing your application for occurrences of the finalize() method. You can use
the ISA Tool Add-on, IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer to list
objects that are only retained through finalizers.

Your application is allocating many large objects, which affects performance. Consider
increasing the size of the heap.

Garbage collection is causing some large pauses. The largest pause was 7743 ms. This
may affect application responsiveness. If responsiveness is a concern then a switch of policy
or reduction in heap size may be helpful.

Your application appears to be relying on class unloading during global collects. Consider
using the Balanced GC policy for applications deployed on a 64-bit platform with a heap size
greater than 4GB, which performs class unloading incrementally.

198 global garbage collects took on average 1,366% longer than the average nursery
collect. If you believe this is abnormally high and unacceptable, consider using the Balanced
GC policy for applications deployed on a 64-bit platform with a heap size greater than 4GB.
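To put the figures quoted above in perspective, the arithmetic can be sketched as follows. This is a rough calculation only: the inputs are the approximate values from the post (about 2 h of total pause over 14 days, a 40 ms average pause) and the "1,366% longer" figure from the report; the class and method names are hypothetical.

```java
// Hypothetical helper for back-of-the-envelope GC figures.
final class GcPauseArithmetic {

    /** Fraction of wall-clock time spent paused. */
    static double pauseFraction(double totalPauseSeconds, double windowSeconds) {
        return totalPauseSeconds / windowSeconds;
    }

    /** "N% longer than X" means a duration of (1 + N/100) times X. */
    static double ratioFromPercentLonger(double percentLonger) {
        return 1.0 + percentLonger / 100.0;
    }

    public static void main(String[] args) {
        double totalPauseSeconds = 2 * 3600.0;      // about 2 h (approximate)
        double windowSeconds = 14 * 24 * 3600.0;    // 14 days
        double averagePauseSeconds = 0.040;         // 40 ms average (approximate)

        // 7200 / 1209600 = 1/168, i.e. roughly 0.6% of the time is spent paused.
        System.out.printf("pause fraction: %.4f%n",
                pauseFraction(totalPauseSeconds, windowSeconds));

        // Rough pause count implied by the total and the average.
        System.out.printf("approx. pause count: %.0f%n",
                totalPauseSeconds / averagePauseSeconds);

        // 1,366% longer means a global collect takes about 14.66 times
        // as long as an average nursery collect.
        System.out.printf("global/nursery ratio: %.2f%n",
                ratioFromPercentLonger(1366.0));
    }
}
```

The overall overhead is small, so the averages are unlikely to be the problem; the concern is the tail, such as the 7743 ms maximum pause reported above.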

Are those pauses a potential cause of the SSL handshake issues we've been getting?

Comments

Donald Nong commented Oct 13 '14, 1:42 a.m.

What is the detailed exception? A timeout? If so, I lean toward the idea that the GC pauses and the SSL handshake failures are related, but the only setting that I can find in WAS is com.ibm.ws.orb.transport.SSLHandshakeTimeout, with a default value of 0 (no timeout). I'm not sure whether the client application sets an SSL handshake timeout value, though.

Kevin Ramer commented Oct 13 '14, 10:32 a.m.

See the work item mentioned in the 2nd post, comment #19

https://jazz.net/jazz/web/projects/Rational%20Team%20Concert#action=com.ibm.team.workitem.viewWorkItem&id=195587

Kevin Ramer commented Oct 13 '14, 3:09 p.m. | edited Oct 13 '14, 7:58 p.m.

This IS NOT an answer; it is the only place I can fit stack examples ...

com.ibm.team.filesystem.client.FileSystemException: Error while writing to /Unicorn/resources/messages/messages.xml : Expected content hash of 11cttddWUCLvNyDBa0BNkmvG0Un5GoAfNgHYXCM5Glk but got bsqBaPBbMY5tQ6_WUmHuc6pRnero7f4EUamWaO8t7PM]
    at com.ibm.team.build.internal.scm.SourceControlUtility.updateFileCopyArea(SourceControlUtility.java:625)
    at com.ibm.team.build.internal.engine.JazzScmPreBuildParticipant.preBuild(JazzScmPreBuildParticipant.java:238)
    at com.ibm.team.build.internal.engine.BuildLoop.invokePreBuildParticipants(BuildLoop.java:844)
    at com.ibm.team.build.internal.engine.BuildLoop$2.run(BuildLoop.java:650)
    at java.lang.Thread.run(Thread.java:662)
Contains : 0 Failed to download /Unicorn/resources/messages/messages.xml
com.ibm.team.filesystem.client.FileSystemException: Error while writing to /Unicorn/resources/messages/messages.xml : Expected content hash of 11cttddWUCLvNyDBa0BNkmvG0Un5GoAfNgHYXCM5Glk but got bsqBaPBbMY5tQ6_WUmHuc6pRnero7f4EUamWaO8t7PM
    at com.ibm.team.filesystem.client.internal.load.MergeLoadMutator$6.run(MergeLoadMutator.java:1580)
    at com.ibm.team.filesystem.client.internal.SharingManager.doSilentChange(SharingManager.java:633)
    at com.ibm.team.filesystem.client.internal.load.MergeLoadMutator.storeFileContents(MergeLoadMutator.java:1545)
    at com.ibm.team.filesystem.client.internal.load.MergeLoadMutator.access$0(MergeLoadMutator.java:1542)
    at com.ibm.team.filesystem.client.internal.load.MergeLoadMutator$DownloadHandler.downloadStreamAcquired(MergeLoadMutator.java:150)
    at com.ibm.team.filesystem.client.FileDownloadHandler.downloadStreamAcquired(FileDownloadHandler.java:86)
    at com.ibm.team.filesystem.client.FileDownloadHandler.downloadStreamAcquired(FileDownloadHandler.java:1)
    at com.ibm.team.scm.client.content.BasicAsyncVersionedContentManagerSession.internalDoRetrieveContent(BasicAsyncVersionedContentManagerSession.java:117)
    at com.ibm.team.scm.client.content.BasicAsyncVersionedContentManagerSession$1.run(BasicAsyncVersionedContentManagerSession.java:190)
    at com.ibm.team.repository.client.util.ThreadPool$PoolJob.run(ThreadPool.java:129)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
Caused by: java.io.IOException: Expected content hash of 11cttddWUCLvNyDBa0BNkmvG0Un5GoAfNgHYXCM5Glk but got bsqBaPBbMY5tQ6_WUmHuc6pRnero7f4EUamWaO8t7PM

CRJAZ0099I When accessing the URL "https://rtp-rtc8:9443/jazz/service/com.ibm.team.scm.common.IVersionedContentService/content/com.ibm.team.filesystem/FileItem/_fzligE5YEd-CxNVgiobsPg/_AV7Ejk7WEd-wI66fLcYyDA/9V3jEdu72MsHbrW2N69Afw6oX695Rq0kGlj1iDX1kRI" the following HTTP error occurred: "Remote host closed connection during handshake"
    at com.ibm.team.repository.transport.client.ClientHttpUtil.executePrimitiveRequest(ClientHttpUtil.java:1179)
    at com.ibm.team.repository.transport.client.ClientHttpUtil.executeHttpMethod(ClientHttpUtil.java:328)
    at com.ibm.team.repository.transport.client.ClientHttpUtil.executeHttpMethod(ClientHttpUtil.java:290)
    at com.ibm.team.repository.transport.client.ClientHttpUtil.executeHttpMethod(ClientHttpUtil.java:204)
    at com.ibm.team.scm.client.content.BasicVersionedContentManager.executeMethod(BasicVersionedContentManager.java:446)
    at com.ibm.team.scm.client.content.BasicVersionedContentManager.invokeContentGet(BasicVersionedContentManager.java:1230)
    at com.ibm.team.scm.client.content.BasicVersionedContentManager$6.run(BasicVersionedContentManager.java:1006)
    at com.ibm.team.scm.client.content.BasicVersionedContentManager$6.run(BasicVersionedContentManager.java:1)
    at com.ibm.team.repository.client.internal.TeamRepository$3.run(TeamRepository.java:1261)
    at com.ibm.team.repository.common.transport.CancelableCaller.call(CancelableCaller.java:79)
    at com.ibm.team.repository.client.internal.TeamRepository.callCancelableService(TeamRepository.java:1254)
    at com.ibm.team.scm.client.internal.ScmClientLibraryContext.callCancelableService(ScmClientLibraryContext.java:70)
    at com.ibm.team.scm.client.content.BasicVersionedContentManager.internalRetrieveContentStream(BasicVersionedContentManager.java:997)
    at com.ibm.team.filesystem.client.internal.content.FileContentManager.internalRetrieveContentStream(FileContentManager.java:90)
    at com.ibm.team.filesystem.client.internal.content.FileContentManager.internalRetrieveContentStream(FileContentManager.java:1)
    at com.ibm.team.scm.client.content.BasicVersionedContentManager.retrieveContentStream(BasicVersionedContentManager.java:978)
    at com.ibm.team.scm.client.content.BasicAsyncVersionedContentManagerSession.getInputStream(BasicAsyncVersionedContentManagerSession.java:88)
    at com.ibm.team.scm.client.content.BasicAsyncVersionedContentManagerSession.internalDoRetrieveContent(BasicAsyncVersionedContentManagerSession.java:108)
    at com.ibm.team.scm.client.content.BasicAsyncVersionedContentManagerSession$1.run(BasicAsyncVersionedContentManagerSession.java:190)
    at com.ibm.team.repository.client.util.ThreadPool$PoolJob.run(ThreadPool.java:129)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
Caused by: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake

Donald Nong commented Oct 13 '14, 8:22 p.m.

Kevin, you can always convert an "answer" to a comment after you put a long text in the answer, so it ends up in the right place.
If the problem you are facing is anything like the one mentioned in WI 195587, you have a big task ahead - data collecting will be the toughest job.
In comment 20 of WI 195587, it is mentioned that Story 201208 introduced a retry mechanism to overcome such errors, and I'm wondering why it does not cover your scenario. The code change is in BasicVersionedContentManager, which also appears in the stack trace you posted, so I suppose the code should be working, but apparently that's not the case.
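For readers unfamiliar with the idea, a retry on handshake failure can be sketched generically as below. This is an illustration only and not the Story 201208 change or any code from BasicVersionedContentManager: the class name HandshakeRetry, the method callWithRetry, and the linear backoff are all assumptions made for this example.

```java
import java.util.concurrent.Callable;
import javax.net.ssl.SSLHandshakeException;

// Hypothetical helper; not the RTC client's actual retry code.
final class HandshakeRetry {

    /**
     * Runs op, retrying only when it fails with SSLHandshakeException.
     * Waits backoffMillis * attempt between attempts and rethrows the
     * last handshake failure once maxAttempts is exhausted. Any other
     * exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SSLHandshakeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (SSLHandshakeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated operation: fails twice with a handshake error, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SSLHandshakeException(
                        "Remote host closed connection during handshake");
            }
            return "content retrieved";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A retry like this only masks the symptom. If the server is stalled in a multi-second GC pause, retries that all fall inside the same pause can still fail, which may be one reason a retry mechanism does not fully cover the scenario.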

Kevin Ramer commented Oct 14 '14, 10:15 a.m.

Thanks for those tips [ forum type ]. As I have been seeing Out of Memory issues mentioning MaxDirectMemory, I searched far and wide about that and found this article:

https://jazz.net/library/article/1430

I added the custom property to two of the WebSphere profiles that have been the problem children. They were restarted on the 10th and 11th. To date neither has emitted the OOM, garbage collection pauses have dropped to much more tolerable intervals, and thus far there have been no user reports of SSL handshake exceptions.
