Saturday, February 27, 2016

Async Job Lessons


On my project at work, we are improving our HTTP request processing times by extracting expensive update operations and offloading them to a distributed job queuing system that executes them asynchronously. While this has helped reduce our system's response times, we have run into a couple of issues that are a consequence of the asynchronous design.
  1. Consider an async job that is enqueued while a database transaction is still in progress. The transaction inserts row A; the job reads row A as input, performs a calculation on it, and records the result elsewhere. If the job runs before the transaction commits, it will fail, since row A is not yet visible outside the transaction. One solution is to delay enqueuing the job until after the transaction has committed. This has the advantage that the job is never enqueued at all if the transaction aborts, but it may be difficult to implement. A less desirable solution is to enqueue the job immediately but delay its execution by a fixed amount of time (a feature commonly supported by job queuing systems); one must then decide what a reasonable delay should be. A third solution is simply to rely on the job being retried, assuming the job queuing system supports retries. The possible downside is that errors are reported when the job first runs, which may be unnecessarily alarming depending on how the ops environment is configured.
  2. Async jobs that perform updates can suffer from race conditions. If multiple jobs of the same type are enqueued in quick succession and the job queue does not guarantee ordering, incorrect updates may be made. In particular, the order can be reversed if the first job suffers a transient failure (e.g. a network communication error) and is retried after the second job has run successfully. If the two jobs record different values, the recorded value will be wrong after the retried job succeeds, since the value it records is stale. One solution is to have such jobs calculate the correct value at the time the job runs. Implemented this way, the ordering does not matter, since the last job to run always calculates the correct value from up-to-date inputs. As a corollary, it is bad design to have a job record a value that was passed to it as a parameter at the time of invocation, since that value may become stale.
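To make the first issue concrete, here is a minimal sketch of deferring the enqueue until after commit. Everything in it is hypothetical and in-memory: the `Transaction` class, the `job_queue` list, and the `enqueue` and `create_row` helpers are stand-ins for illustration, not the API of any particular database library or job queue. Some frameworks offer a hook of this kind out of the box (Django's `transaction.on_commit`, for example).

```python
class Transaction:
    """Toy transaction that holds callbacks until commit (hypothetical helper)."""

    def __init__(self):
        self._after_commit = []

    def on_commit(self, callback):
        # Remember the callback; do not run it yet.
        self._after_commit.append(callback)

    def commit(self):
        # A real implementation would commit to the database first.
        callbacks, self._after_commit = self._after_commit, []
        for callback in callbacks:
            callback()

    def rollback(self):
        # Discard pending callbacks: the jobs are never enqueued.
        self._after_commit.clear()


job_queue = []  # stand-in for a real distributed job queue


def enqueue(job_name, row_id):
    job_queue.append((job_name, row_id))


def create_row(txn, row_id):
    # ... insert row A inside the transaction here ...
    # Defer the enqueue so the job cannot run before row A is visible.
    txn.on_commit(lambda: enqueue("process_row", row_id))


committed = Transaction()
create_row(committed, 1)
queued_before_commit = list(job_queue)  # nothing has been enqueued yet
committed.commit()                      # the job for row 1 is enqueued now

aborted = Transaction()
create_row(aborted, 2)
aborted.rollback()                      # the job for row 2 is never enqueued
```

After this runs, the queue holds only the job for row 1; the aborted transaction leaves no job behind.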
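The second issue can be illustrated with a small simulation. This is only a sketch: the `source` and `recorded` dictionaries and both job functions are hypothetical stand-ins for real tables and real job classes, and the "retry" is simulated simply by running the first job after the second.

```python
source = {"balance": 0}        # authoritative input data
recorded = {"balance": None}   # where the job records its result


def record_given_value(value):
    """Fragile job: records the value captured when the job was enqueued."""
    recorded["balance"] = value


def record_current_value():
    """Robust job: reads the input at the time the job runs."""
    recorded["balance"] = source["balance"]


# Two updates happen in quick succession, each enqueuing a job.
source["balance"] = 10
job1_param = source["balance"]  # value captured for job 1
source["balance"] = 20
job2_param = source["balance"]  # value captured for job 2

# Job 1 fails transiently, so job 2 runs first and job 1 is retried afterwards.
record_given_value(job2_param)
record_given_value(job1_param)
stale_result = recorded["balance"]  # the retried job overwrote 20 with 10

# Same reversed ordering, but each job recalculates from current data.
record_current_value()  # job 2
record_current_value()  # job 1, retried
fresh_result = recorded["balance"]  # 20, regardless of ordering
```

With the parameter-passing job, the retried job leaves the stale value 10 behind. With the recalculating job, the result is 20 whichever job runs last.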