In our Fork-Join article, we described how parallel programming can be exploited to accomplish complex tasks using the “divide-and-conquer” paradigm. In this article, we’ll talk about the “race condition,” a problem unique to parallel programming contexts that is difficult to troubleshoot but can be avoided with proper planning. You will see that an innocuous fragment of code can have unexpected consequences.
We are building a financial services application in which customers apply for loans (e.g., a small business loan). These loans can be automatically renewed at the end of their term with potential changes to rates, terms, and conditions. On a monthly basis, we are handed a set of prefetched loan records that are ready to renew. Our requirement is to build an automated loan renewal workflow.
A critical stage in this workflow is requesting a fresh credit report for the loan applicants. The code fragment below illustrates these steps:
In the code, we construct a request per applicant and fetch the report from a managed package service (line 13) that handles communication with different credit agencies (e.g., Experian and TransUnion). The service fetches the credit report asynchronously and returns a transaction log id in the response that we can use to track the status of our request. We store this id on each loan (line 19).
When the credit report for an applicant is eventually received, it is stored in the database by a different program flow (thread). We piggy-back on this write-back (via a simple after-update trigger) to link the newly fetched report with the loans for that applicant. This is shown in the code fragment below.
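As a rough, language-neutral model of that linking step (not the trigger code itself), the Python sketch below uses in-memory lists in place of database tables. Every name in it (`loans`, `on_report_stored`, the field names, the ids) is a hypothetical stand-in chosen for illustration.

```python
# Hypothetical in-memory stand-in for the loan table; all names are illustrative.
loans = [
    {"id": "loan-1", "applicant_id": "app-7", "credit_report_id": "rep-old"},
    {"id": "loan-2", "applicant_id": "app-7", "credit_report_id": "rep-old"},
    {"id": "loan-3", "applicant_id": "app-9", "credit_report_id": "rep-x"},
]


def on_report_stored(report):
    """Model of the write-back hook: link a stored report to its applicant's loans."""
    touched = []
    for loan in loans:
        if loan["applicant_id"] == report["applicant_id"]:
            loan["credit_report_id"] = report["id"]
            touched.append(loan["id"])
    return touched


# A new report for applicant app-7 arrives and is linked to that applicant's loans only.
linked = on_report_stored({"id": "rep-new", "applicant_id": "app-7"})
print(linked)
```

Only the loans belonging to the report's applicant are touched; the loan for the other applicant keeps its existing report.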
During our tests, we found that the new credit report was not consistently attached to our loans. Sometimes it showed up; sometimes it didn’t. There were no errors in any of the logs, and we verified that the new credit report was fetched and stored properly. We even verified that our trigger code was executing successfully! Still, the new credit report was missing from some of these loans. How can this be?
After pulling our hair out for a while, we realized that we had been burned by a “race condition,” a problem in which the order of execution between parallel threads (the “race”) causes unpredictable outcomes. An illustration of the scenario is shown below:
Unbeknownst to us, a race had started as soon as we made the request to fetch credit reports. There are two threads of execution: the blue thread that made the request and the green thread that stores the received credit report. We might anticipate that the blue thread has far less to do and will most likely complete first. But there are factors within and outside our control (e.g., network latency, resource caching, and database automations) that can decide which thread wins the race each time. If the green thread unexpectedly completes before the blue thread, we store the transaction log id on the loan after the credit report is saved.
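To make the point about latency concrete, here is a minimal Python sketch in which two threads race and the simulated delay alone decides the winner. The thread names mirror the blue and green threads above; the `finish_order` helper and the delay values are assumptions made up for this illustration, not measurements from the real system.

```python
import threading
import time


def finish_order(blue_delay, green_delay):
    """Run two threads with simulated latency and return the order they finish in."""
    order = []
    lock = threading.Lock()

    def worker(name, delay):
        time.sleep(delay)  # stands in for network latency, caching, automations
        with lock:
            order.append(name)

    threads = [
        threading.Thread(target=worker, args=("blue", blue_delay)),
        threading.Thread(target=worker, args=("green", green_delay)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


# Same code, different latencies, opposite winners.
usual = finish_order(blue_delay=0.0, green_delay=0.3)
surprise = finish_order(blue_delay=0.3, green_delay=0.0)
print(usual, surprise)
```

Nothing in the code changes between the two calls. Only the timing does, which is why this kind of problem can appear and disappear between test runs.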
You might ask: Why is this relevant? Aren’t they updating different fields on the loan? Well, it so happens that when the loans were handed to us for renewal, the current credit report had also been fetched from the database onto the loan sObject. So, when we updated the loans in the blue thread (Figure 1: line 22), we were inadvertently writing both the transaction log id and the old credit report. The order of operations is illustrated below.
We set the new credit report properly on the loan (green thread) and then promptly revert it to the old one (blue thread with the red outline)!
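The sequence above can be replayed deterministically in a few lines. The Python sketch below is a simplified model under stated assumptions: a dictionary stands in for the database, and `full_record_update`, the field names, and the ids are all hypothetical. It runs the steps in the losing order on purpose rather than relying on real threads.

```python
import copy

# Hypothetical in-memory stand-in for the loan table; all names are illustrative.
database = {"loan-1": {"credit_report": "report-old", "transaction_log_id": None}}


def full_record_update(loan_id, record):
    """Write every field held in memory, as a whole-record update would."""
    database[loan_id] = copy.deepcopy(record)


# Blue thread: holds a prefetched copy of the loan, old credit report included.
prefetched = copy.deepcopy(database["loan-1"])
prefetched["transaction_log_id"] = "txn-42"

# Green thread wins the race: the new report is linked first.
database["loan-1"]["credit_report"] = "report-new"

# Blue thread finishes last and writes its stale copy back.
full_record_update("loan-1", prefetched)

print(database["loan-1"])
# {'credit_report': 'report-old', 'transaction_log_id': 'txn-42'}
```

No step fails and no error is raised, yet the new report is gone. That matches what we saw in testing: clean logs and a missing report.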
A “race condition” rears its head when all three of these circumstances are in play:
Awareness is the first step to resolving race conditions. Understanding when a parallel programming context is triggered can help you focus on shared resource updates.
Race conditions are particularly difficult to troubleshoot because there are no immediate failures, only unexpected behaviors.
Once we understand what’s happening, the solution is simple. We cannot avoid the race, but we can ensure that its outcome does not affect us. We should limit our update in the blue thread to only the transaction log id, making the order of finish irrelevant. The updated code fragment is shown below (line 21 in bold).
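The idea behind the fix can also be checked with a small, language-neutral model (again, not the updated fragment itself). In this Python sketch each thread writes only the field it owns, and both possible finishing orders are replayed. The helpers `update_fields`, `blue_step`, `green_step`, and `final_state` are hypothetical names introduced for the illustration.

```python
def update_fields(db, loan_id, **fields):
    """Write only the named fields, leaving every other field untouched."""
    db[loan_id].update(fields)


def blue_step(db):
    # Blue thread: record the transaction log id and nothing else.
    update_fields(db, "loan-1", transaction_log_id="txn-42")


def green_step(db):
    # Green thread: link the freshly stored credit report.
    update_fields(db, "loan-1", credit_report="report-new")


def final_state(steps):
    """Replay the steps in the given order against a fresh hypothetical table."""
    db = {"loan-1": {"credit_report": "report-old", "transaction_log_id": None}}
    for step in steps:
        step(db)
    return db["loan-1"]


blue_first = final_state([blue_step, green_step])
green_first = final_state([green_step, blue_step])
print(blue_first == green_first)  # True
```

Because the two writes no longer touch the same field, either order leaves the loan with both the new report and the transaction log id.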
We could also solve it by guaranteeing order using “join” techniques discussed in our Fork-Join article. That would be needlessly complex and inefficient in this case, but a different scenario might require it.
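For comparison, here is a minimal sketch of the ordering approach using Python’s `threading` module. It assumes we control both threads, which was not the case in our scenario because the green thread belongs to the managed package. The `blue` and `green` functions and the in-memory `database` are hypothetical.

```python
import threading

# Hypothetical in-memory stand-in for the loan table; all names are illustrative.
database = {"loan-1": {"credit_report": "report-old", "transaction_log_id": None}}

# Blue thread's prefetched copy, taken before either thread runs.
prefetched = dict(database["loan-1"])


def blue():
    prefetched["transaction_log_id"] = "txn-42"
    database["loan-1"] = dict(prefetched)  # whole-record write, stale report included


blue_thread = threading.Thread(target=blue)


def green():
    blue_thread.join()  # wait until the blue thread's write has finished
    database["loan-1"]["credit_report"] = "report-new"


green_thread = threading.Thread(target=green)

blue_thread.start()  # must be started before another thread can join it
green_thread.start()
green_thread.join()

print(database["loan-1"])
# {'credit_report': 'report-new', 'transaction_log_id': 'txn-42'}
```

The join forces the green write to land last, so even the whole-record update is harmless. The cost is that one thread now blocks on the other, which is the inefficiency mentioned above.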
Race conditions are a classic byproduct of programming assumptions that (wrongly) anticipate and depend on a particular order of finish in parallel contexts where the order is not guaranteed. They are difficult to troubleshoot because they don’t always fail, and they get even harder when we don’t know that a race is happening. For instance, in our example, the race was created by a managed package service that we did not have visibility into. Awareness of the problem can help us resolve race conditions safely. Happy coding!