Friday, January 15, 2010

Possible Memory Leak in BPEL Process Manager

A while back, we had a nasty issue with all of our BPEL instances in Production. What we saw, and eventually remedied, were the following symptoms:

Each instance would go down roughly every 14 hours, independent of load on the system.
An instance would start by dropping OPMN ping requests, and, eventually, OPMN would declare the instance dead and force a restart.

By enabling remote monitoring in opmn.xml and attaching JConsole to the instance, we determined that PS MarkSweep never ran. Consequently, everything that made it into Eden space made it into Survivor, and then, of course, into OldGen. Eventually, the instance would get memory-bound and stop responding to OPMN pings. Interestingly enough, manually performing a garbage collection through JConsole did free up Eden and Survivor.
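
If you'd rather check this from code than from the JConsole GUI, the same numbers are exposed through the standard java.lang.management beans. Here is a minimal sketch, not what we actually ran: the class name GcProbe and its helper method are made up for illustration. It prints how many times each collector has run, asks for a collection the same way JConsole's "Perform GC" button does (via the MemoryMXBean), and prints the counts again.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of OAS/BPEL.
public class GcProbe {

    /** Returns collector name -> number of collections so far in this JVM. */
    static Map<String, Long> collectionCounts() {
        Map<String, Long> counts = new LinkedHashMap<String, Long>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 if the collector does not report a count.
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("before: " + collectionCounts());
        // MemoryMXBean.gc() is documented as equivalent to System.gc().
        ManagementFactory.getMemoryMXBean().gc();
        System.out.println("after:  " + collectionCounts());
    }
}
```

With the parallel collector, the old-generation entry shows up under the name "PS MarkSweep". A count for that collector that never moves while OldGen keeps filling is the symptom described above.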

Oracle pointed to a potential flaw in Sun's JDK, which I believe is a good part of the answer, but we saw this issue even with the most current supported version of Sun's JDK. We also ran the JRockit JDK and did not see the issue, which does lend some credence to that theory. None of that really fixed the problem, though, because JRockit wasn't supported under OAS/BPEL 10.1.3.3.

The fix finally came when we built a new instance from scratch. One of the steps we take in building a new instance is to disable unused services, like OWSM, ESB, and JavaSSO. Apparently, OWSM does something akin to scheduling an explicit garbage collection, because as soon as we re-enabled that application, PS MarkSweep began running as regular as clockwork (literally).
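
To be clear about what I mean by a "scheduled explicit Garbage Collection": I have no idea how OWSM actually does it, and the following is not OWSM code. It is only a sketch of the general pattern, with a made-up class name (PeriodicGc) and a caller-supplied interval, showing a background timer that requests a full collection at a fixed rate.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a periodic explicit GC; not taken from OWSM.
public class PeriodicGc {

    /** Starts a daemon timer that calls System.gc() every 'period' units. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-gc");
            t.setDaemon(true); // do not keep the JVM alive just for this timer
            return t;
        });
        // System.gc() is only a request; the JVM may ignore it
        // (for example when started with -XX:+DisableExplicitGC).
        timer.scheduleAtFixedRate(System::gc, period, period, unit);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = start(1, TimeUnit.SECONDS);
        Thread.sleep(3500); // let a few collections happen
        timer.shutdownNow();
    }
}
```

Something along these lines running inside one application would explain the clockwork-regular PS MarkSweep runs we saw once OWSM was back on, but that part is a guess.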

Just something you might want to know...