July 8, 2010

com.oracle.bpel.client.delivery.ReceiveTimeOutException in 50% of instances in a BPEL Process Manager cluster

An issue started occurring in some of a recent customer's BPEL processes when we migrated from a single-node Oracle BPEL Process Manager 10g environment to a 2-node cluster.

The Problem

1. BPEL process SubscriberActivation is instantiated (instance 4961221)

2. It synchronously calls GetDeviceState (instance 4961222)

3. GetDeviceState completes successfully within 6.106 seconds


4. Around 50% of the time, the response is never received by the first process, which then returns the following BPEL fault:
<summary>
  when invoking locally the endpoint 'http://oradev1.thisisahmed.com:7777/orabpel/default/GetDeviceState/1.0', ; nested exception is
  com.oracle.bpel.client.delivery.ReceiveTimeOutException: Waiting for response has timed out
</summary>
With debugging enabled, the domain.log shows the following:
<2010-06-25 07:55:49,226> <DEBUG> <default.collaxa.cube.engine.delivery> <DeliveryHandler::initialRequestAnyType>
com.oracle.bpel.client.delivery.ReceiveTimeOutException: Waiting for response has timed out. The conversation id is bpel://localhost/default/SubscriberActivation~1.0/5040805-BpInv1-BpSeq4.7-2. Please check the process instance for detail.
at com.collaxa.cube.engine.delivery.DeliveryHandler.initialRequestAnyType(DeliveryHandler.java:543)
at com.collaxa.cube.engine.delivery.DeliveryHandler.initialRequest(DeliveryHandler.java:457)
5. The parent process times out after 1000 seconds, which is the value of our syncMaxWaitTime (as shown in the Tree Finder figure above).

This only happens when both nodes of the cluster are up and running. If only a single node is running, this issue does not occur.
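
To put a number on how often the fault occurs, the timeouts can be tallied straight from domain.log. The following is a minimal sketch, assuming the log line format shown in the excerpt above; the class TimeoutLogScanner and its method are hypothetical helpers written for this post, not part of Oracle BPEL PM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Oracle BPEL PM: pulls conversation ids
// out of ReceiveTimeOutException lines in a domain.log excerpt.
final class TimeoutLogScanner {

    // Matches "The conversation id is <id>." as it appears in the excerpt above.
    // The id itself contains dots, so the id ends at a dot followed by
    // whitespace or end of line.
    private static final Pattern CONVERSATION_ID = Pattern.compile(
        "ReceiveTimeOutException:.*?The conversation id is (\\S+?)\\.(?:\\s|$)");

    static List<String> timedOutConversations(List<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = CONVERSATION_ID.matcher(line);
            if (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "<2010-06-25 07:55:49,226> <DEBUG> <default.collaxa.cube.engine.delivery> "
                + "<DeliveryHandler::initialRequestAnyType>",
            "com.oracle.bpel.client.delivery.ReceiveTimeOutException: "
                + "Waiting for response has timed out. The conversation id is "
                + "bpel://localhost/default/SubscriberActivation~1.0/5040805-BpInv1-BpSeq4.7-2. "
                + "Please check the process instance for detail.");
        System.out.println(timedOutConversations(sample));
    }
}
```

Comparing the number of ids returned against the number of parent instances started in the same window gives the failure rate.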

Troubleshooting Efforts

The BPEL Process Manager Developer's Guide 10g (10.1.3.1.0) suggests trying the following, none of which is applicable to our situation:
(a) Increasing transaction-timeout="7200" in $ORACLE_HOME/j2ee/oc4j_soa/config/transaction-manager.xml 
(b) Setting transaction-timeout="3600", a lower value than in (a), for CubeEngineBean, DispatcherBean, CubeDeliveryBean, DeliveryBean, DomainManagerBean, and ProcessManagerBean in
$ORACLE_HOME/j2ee/oc4j_soa/application-deployments/orabpel/ejb_ob_engine/orion-ejb-jar.xml 
(c) Increasing syncMaxWaitTime to 1000 in $ORACLE_HOME/bpel/domains/default/config/domain.xml
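
My reading of the guide is that these three values are meant to nest: syncMaxWaitTime below the EJB transaction-timeout, which in turn sits below the global transaction-timeout. Treat that ordering as an assumption rather than a quoted rule. A small sketch of the check, using a hypothetical class TimeoutOrderingCheck (not an Oracle utility) and the values from (a) through (c):

```java
// Hypothetical sanity check, not an Oracle utility. All values are in seconds.
final class TimeoutOrderingCheck {

    // Assumed rule: syncMaxWaitTime < EJB transaction-timeout
    //               < global transaction-timeout.
    static boolean isConsistent(int syncMaxWaitTime,
                                int ejbTransactionTimeout,
                                int globalTransactionTimeout) {
        return syncMaxWaitTime < ejbTransactionTimeout
            && ejbTransactionTimeout < globalTransactionTimeout;
    }

    public static void main(String[] args) {
        // 1000 from domain.xml, 3600 from orion-ejb-jar.xml,
        // 7200 from transaction-manager.xml.
        System.out.println(isConsistent(1000, 3600, 7200));
    }
}
```

Our values already satisfy this ordering, which fits with timeout tuning having no effect on the problem.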
Using TCP instead of UDP for the BPEL PM cluster (in $ORACLE_HOME/bpel/system/config/jgroups-protocol.xml) has no bearing either.

Adding the transaction participate property to the partnerlink won't help either:  <property name="transaction">participate</property>


Cause of Problem & Analysis

This problem is caused by the flawed design of the flow in that it doesn't support operating in a BPEL Process Manager cluster.

Both BPEL1 (the parent, SubscriberActivation) and BPEL2 (the child, GetDeviceState) are designed as synchronous processes.

Success scenario:
  1. Client makes sync request to BPEL1.
  2. BPEL1 makes sync request to BPEL2.
  3. BPEL2 makes sync request to an external service (and BPEL2 receives the sync response back).
  4. BPEL2 has an async “receive” activity.
  5. BPEL2 responds to BPEL1 synchronously.
  6. BPEL1 responds to client synchronously.
Timeout scenario:
  1. Client makes sync request to BPEL1.
  2. BPEL1 makes sync request to BPEL2.
  3. BPEL2 makes sync request to an external service (and BPEL2 receives the sync response back).
  4. BPEL2 has an async “receive” activity, but receives the response on Node 2.
  5. BPEL2 tries to reply to BPEL1, but there is no link back to BPEL1 (BPEL2 completes successfully though).
  6. BPEL1 times out.
Even though BPEL2 is designed as a synchronous process, the onWait forces it to become asynchronous: it dehydrates, then rehydrates when it receives the callback. The problem is that the callback can be received on the other node, where no caller is waiting for the reply, which is exactly the behavior we are seeing here 50% of the time.
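
The timeout scenario can be illustrated with a toy model. This is only a sketch under stated assumptions: the class ClusterCallbackSketch, its Node type, and the method names are hypothetical and are not Oracle BPEL PM internals. The one thing it assumes about the engine is that a synchronous caller waits in a table that exists only inside its own JVM.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model only: names are hypothetical, not Oracle BPEL PM internals.
final class ClusterCallbackSketch {

    // One engine node. Its table of waiting synchronous callers is local to this JVM.
    static final class Node {
        private final Map<String, CompletableFuture<String>> waiters =
            new ConcurrentHashMap<>();

        // The synchronous caller (BPEL1) registers here and waits for a reply.
        CompletableFuture<String> awaitReply(String conversationId) {
            CompletableFuture<String> reply = new CompletableFuture<>();
            waiters.put(conversationId, reply);
            return reply;
        }

        // BPEL2 rehydrates on this node and tries to hand its reply to the caller.
        boolean deliverReply(String conversationId, String payload) {
            CompletableFuture<String> reply = waiters.remove(conversationId);
            if (reply == null) {
                return false; // no local waiter: the reply has nowhere to go
            }
            reply.complete(payload);
            return true;
        }
    }

    // Returns true if the caller on callerNode receives the reply before the timeout.
    static boolean roundTrip(Node callerNode, Node callbackNode, long timeoutMillis) {
        CompletableFuture<String> reply = callerNode.awaitReply("conv-1");
        callbackNode.deliverReply("conv-1", "device-state");
        try {
            reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // stands in for ReceiveTimeOutException after syncMaxWaitTime
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node node1 = new Node();
        Node node2 = new Node();
        System.out.println("callback on same node:  " + roundTrip(node1, node1, 100));
        System.out.println("callback on other node: " + roundTrip(node1, node2, 100));
    }
}
```

In this model, BPEL2 finishes its work in both cases, but only the same-node callback wakes the caller. If callbacks are spread evenly across two nodes (an assumption about how requests are balanced), about half of them land on the node with no waiter, which lines up with the 50% failure rate.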

In fact, this two-year-old blog post of mine discusses the same issue:
http://blog.thisisahmed.com/2008/10/behavior-of-bpel-processes-in-bpel.html

5 comments:

  1. Hi Ahmed,

    I had the same thing happening as well. The reason for this is that synchronous bpel process instances 'wait' through a local JVM mutex. So even if you 'cluster' your bpel container with multiple JVMs on one machine, you can experience this problem. We discussed this as well on Mark Kelderman's blog: http://orasoa.blogspot.com/2010/03/bpel-10g-clustering-with-jgroups-on.html
    To prevent thread starvation in the JVM, there is a timeout parameter maxsyncwait of 45 seconds. So after 45 seconds the synchronous process instance will time out.

    regards,
    Tony van Esch

  2. Hi Ahmed,

    Will setting the parameter requiresNew solve this issue, as mentioned in the blog http://javaoraclesoa.blogspot.com.au/2012/08/issues-and-solutions-when-testing-and.html ?

    Thanks,
    Radhakrishnan

  3. Hi Ahmed,

    Can you please look into this url? Does changing deliveryPersistPolicy to off.immediate help?

    https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=109165549251265&id=1292463.1&displayIndex=1&_afrWindowMode=0&_adf.ctrl-state=17m2k6khep_188

    Thanks,

    Murali
