Slash processed over $4 billion in transactions in 2025.¹ We are one of the largest US-based corporate card fintechs on the Visa network by processing volume. Since we fully launched our charge card program in January 2024, we have grown our monthly transaction volume by 30x.
This incredible growth in processing volume created challenges for our most critical system: card transaction processing. This is the story of scaling that system from zero to hundreds of millions of dollars in monthly volume, and the lessons we learned along the way.
Context
Before 2024, Slash was a banking platform that offered a debit card product. Transactions were funded by a checking account, and our transaction processing was relatively straightforward.
We wanted to give our users real credit card benefits (such as higher rewards) without the credit risk, so we partnered with VisaDPS and Column NA to build a secured card program. In January 2024, we launched the new program with a much more complicated flow of funds. To understand some of the challenges we faced when scaling our transaction processing system, you first need to understand how card payments actually work under the hood.
Part 1: The Network
Have you ever wondered how money leaves your credit card when you tap it at a grocery store? There is a whole card payment ecosystem around this seemingly simple flow, with many parties involved in facilitating money movement between merchants and cardholders, Slash being one of them.
We play an important role in enabling our users to spend with their cards. When a card is swiped, a message containing payment information, such as the merchant name, amount, and POS location, is sent from the merchant to the issuer through the card network (Visa/Mastercard). The issuer makes a decision, and the response flows back up the same path.
VisaDPS is the issuer processor we use to communicate with the card network. The important thing for us is that once VisaDPS forwards the ISO message to us, we must respond with an approval or a decline within 3 seconds. Within that window, we check the user's balance, evaluate spending controls, run fraud checks, and send back a decision. If we don't respond in time, Visa marks the authorization as a timeout, resulting in a declined transaction.
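The pattern is simple but unforgiving: answer within the window, or the answer is chosen for you. Here's a minimal sketch of deadline-bounded decisioning -- not our actual RTA code; the check function and cutoff value are illustrative:

```typescript
// Sketch of deadline-bounded decisioning (illustrative, not our RTA code).
// If the checks don't finish inside the window, we fail closed with a
// decline rather than let Visa time the authorization out.

type Decision = "approve" | "decline";

const AUTH_DEADLINE_MS = 2500; // illustrative: leave headroom under the 3s cutoff

async function decideWithDeadline(
  runChecks: () => Promise<Decision>, // balance, spending controls, fraud
  deadlineMs: number = AUTH_DEADLINE_MS,
): Promise<Decision> {
  const timeout = new Promise<Decision>((resolve) =>
    setTimeout(() => resolve("decline"), deadlineMs),
  );
  // Whichever settles first wins: a slow check path degrades into an
  // explicit decline instead of a network-level timeout.
  return Promise.race([runChecks(), timeout]);
}
```

The key design choice is failing closed: an explicit decline is recoverable, while a network timeout is not.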
A decline may not matter much to an individual consumer; if my card is declined at a grocery store, I can just retry or pull out a different card. But many of Slash's customers are high-spend businesses, and for them, it's mission-critical. Many merchants don't allow retries for declined cards, and some will even flag the cardholder as potentially fraudulent after a decline, which can block future transactions or freeze the merchant relationship entirely. A single timeout can cause real damage to a client's business.
Our Real-Time Authorization service (RTA) is the most latency-sensitive part of our infrastructure, handling authorization decisioning within the 3-second window. We deployed some very cool batching tricks to reduce the number of balance checks; they aren't the focus of today's post, but we'll write about them at some point in the future.
Part 2: Merchants Can Do (Almost) “Anything”
After the initial authorization is approved, any of the following can happen:
1. Merchants can modify the original authorized amount:
- Incremental Auth: The merchant needs more money. Tips at restaurants usually come through as an incremental auth.
- Pre-authorization: The merchant authorizes a small amount just to verify the card is valid. Gas stations do this. They pre-authorize a small amount, then send the actual charge once you're done pumping.
2. Merchants can choose to capture or reverse any amount, multiple times:
- Partial capture: The merchant captures less than the authorized amount. You authorized $200 at a restaurant, but your final bill was $180. The merchant only captures $180, and we need to release the remaining $20 hold.
- Full capture/settlement: The merchant captures the full authorized amount. This is the happy path.
- Reversal: The entire authorization is voided. Maybe the transaction was a mistake, or the customer cancelled their order before it shipped.
- Partial reversal: Part of the authorization is reversed. You authorized $200 for 4 items, but returned 1 item worth $50 before the order shipped, so the merchant reversed $50 of the original authorization.
3. Other events that can happen at any time:
- Refund: Money flows back to the cardholder after a previous capture.
- Aging/expiration: Authorization was never captured and has aged out. Visa sends us an aging advice with a zero amount, telling us to release the hold.
- Force capture: The merchant captures a transaction that was never authorized through us. Yes, merchants can do this.
- Declines: We also receive decline messages for transactions that were declined by Visa before they even reached us.
In the simplest flow, we see one authorization event followed by a capture event, but this is rarely the case. We parse the raw ISO20022 messages we receive from VisaDPS and store them as AuthorizationObject entities.
This immutable log provides a complete audit trail and allows us to reconstruct the exact state of any authorization at any point in time.
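To make the reconstruction idea concrete, here's a toy version of replaying an append-only event log. The field names and event kinds are illustrative, not our actual schema:

```typescript
// Illustrative shape of an append-only authorization event log. Replaying
// events up to a timestamp reconstructs the held amount at that point in time.

type CardEvent =
  | { kind: "auth"; amount: number; at: number }
  | { kind: "incremental_auth"; amount: number; at: number }
  | { kind: "capture"; amount: number; at: number }
  | { kind: "reversal"; amount: number; at: number };

function heldAmountAt(events: CardEvent[], at: number): number {
  let held = 0;
  for (const e of events) {
    if (e.at > at) break; // events are stored in order; stop at the cutoff
    switch (e.kind) {
      case "auth":
      case "incremental_auth":
        held += e.amount; // authorizations add to the hold
        break;
      case "capture":
      case "reversal":
        held -= e.amount; // captures and reversals release the hold
        break;
    }
  }
  return held;
}
```

Because the log is immutable, the same replay answers "what was the hold at any point in time" without any extra bookkeeping.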
Part 3: Money Movement
So far, we've discussed authorization and holds, but as an issuer, we also need to facilitate real-money movements when card transactions settle.
Because Slash operates a secured card program, every card transaction is a lending transaction. There are four accounts involved:
- Card Collateral -- where the user deposits funds. Acts as collateral against their credit line, and is the balance we check during RTA.
- Loan Account -- tracks the user's outstanding loan principal.
- Visa Settlement -- where disbursed funds accumulate until Visa draws down at end of day to pay merchants.
- Repayment -- holds funds from collateral and uses them to repay the loan at end of day.
When an authorization comes in, two transfers are created in a hold state: a loan disbursement (Loan -> Visa Settlement) and a book transfer (Collateral -> Repayment). When a capture event arrives, both transfers settle and real money moves. At end of day, the cycle closes: Visa draws down from the settlement account, and a loan payment (Repayment -> Loan) brings the principal back to zero.
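A toy model of that cycle, using the four accounts from above. The transfer and balance mechanics are simplified for illustration (loan principal is tracked as a negative balance):

```typescript
// Toy model of the four-account flow. Account names come from the post;
// the mechanics are simplified for illustration.

type Account = "collateral" | "loan" | "visaSettlement" | "repayment";
type Balances = Record<Account, number>;

interface Transfer {
  from: Account;
  to: Account;
  amount: number;
  state: "hold" | "settled";
}

// Authorization: create both transfers in a hold state; no money moves yet.
function onAuthorization(amount: number): Transfer[] {
  return [
    { from: "loan", to: "visaSettlement", amount, state: "hold" }, // disbursement
    { from: "collateral", to: "repayment", amount, state: "hold" }, // book transfer
  ];
}

// Capture: settle both transfers; real money moves.
function onCapture(transfers: Transfer[], balances: Balances): void {
  for (const t of transfers) {
    t.state = "settled";
    balances[t.from] -= t.amount;
    balances[t.to] += t.amount;
  }
}

// End of day: Visa draws down the settlement account, and a loan payment
// (Repayment -> Loan) brings the principal back to zero.
function endOfDay(balances: Balances): void {
  balances.visaSettlement = 0; // Visa drawdown
  balances.loan += balances.repayment; // loan payment
  balances.repayment = 0;
}
```

Running a $180 capture through this model ends the day with the loan at zero and collateral reduced by exactly the settled spend, which is the invariant described below.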
As the issuer, it is our responsibility to ensure that by the end of the day, the exact amount of all settled spend is
- disbursed into the Visa settlement account for daily drawdown
- moved from card collateral and used to complete the daily loan payment
This means our system must guarantee 100% correctness in the transfers, since any failure at any step will cause a balance mismatch. Since we process a high volume of transactions every day, we also need to ensure that the system has a high throughput.
Initial Design
With these objectives in mind, we designed our initial solution using our flow-of-funds service to orchestrate all these money movements. Flow of funds is a declarative, rule-based orchestration system. Instead of handlers calling each other with state scattered across tables, we define an explicit flow of events triggering side effects.
We have a separate blog post that goes into the design of the flow of funds in much more detail (link to FoF blog).
In our case, the entire flow of funds is triggered by VisaDPS card events, which get mapped to FlowOfFund events; the flow defines, for each event, the side effects it triggers.
Each AsyncSideEffect is executed asynchronously via an abstraction we call an action intent, which is essentially a database record for an action that needs to be executed immediately but doesn't require a database transaction. The objective of using these action intents is to avoid long-running transactions. Each action intent gets executed in a Temporal workflow.
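The action-intent pattern can be sketched in a few lines. The in-memory table and function names here are stand-ins for the real database and worker:

```typescript
// Sketch of the action-intent pattern (in-memory stand-ins; names are
// illustrative). Inside the short DB transaction we only persist a record of
// work to do; a worker executes the side effect afterwards.

interface ActionIntent {
  id: string;
  action: string;
  status: "pending" | "executed";
}

const intentTable: ActionIntent[] = []; // stand-in for a database table

// Called inside the transaction: a cheap insert, no side effects yet.
function recordIntent(id: string, action: string): void {
  intentTable.push({ id, action, status: "pending" });
}

// Called by a worker after commit (in production, a Temporal workflow).
function executePendingIntents(run: (action: string) => void): number {
  let executed = 0;
  for (const intent of intentTable) {
    if (intent.status !== "pending") continue;
    run(intent.action); // the actual side effect happens here
    intent.status = "executed";
    executed += 1;
  }
  return executed;
}
```

The point of the split is that the database transaction stays short: it only records intent, while the slow or failure-prone work runs outside of it.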
In this initial design, processing was split across four asynchronous queues.
This design has a few flaws:
- Database contention when ensuring correctness:
  - Ledger-level contention: if a user has 10 transactions coming in at the same time, their balance checks contend with each other, because they all need to read and write the same balance. (We might write another blog post about how we optimized our ledger for this issue!)
  - TransferEntity-level contention: multiple queues can read and write the same non-ledger entities (transfer_intent, authorization_object, authorization_set) at the same time, so database-level locks are required to guarantee correctness. During a partial capture, say capture 1 comes in, immediately followed by capture 2. Since these are triggered by separate VisaDPS webhooks, two workflows in the IssuerEvent queue will both try to capture the transfers. The only way to guarantee ordering is database locks, which create database contention.
- Low throughput, hard to scale: As soon as one queue falls behind, it blocks items in other queues due to the implicit dependencies between queues. This makes horizontal scaling very difficult, since scaling up one queue might flood another. There's no single knob to control the throughput.
- Bad observability: To understand the state of a single card authorization lifecycle, engineers had to look across multiple database entities, Datadog metrics, and Temporal workflows. We even had to build a dedicated health-check service to monitor card authorization states (link to Kevin's healthcheck blog).
All of these issues hit us at once during our first P0 incident after launching the charge card program. We came in one morning and found all four queues backed up with hundreds of thousands of items -- and the numbers kept climbing.
A few of our largest users had significantly increased their spend over the weekend, causing a burst of volume our system wasn't built to handle. Workflows in different queues fought for locks on the same entities, causing transactions to roll back. Failed transfers piled into the retry queue, adding even more load. We provisioned more Temporal workers, thinking it would help, but that only created more concurrent database access. This quickly became a vicious cycle that degraded the entire database.
We went back to the drawing board, reasoned from first principles about how transaction processing should really work, and built a new solution we now call TPS.
Solution: Transaction Processing Service (TPS)
The core concept of TPS is that all card events within a given lifecycle are processed by a single Temporal workflow. Each subsequent card event is a signal to that workflow, driving state transitions until it reaches a terminal state.
A card authorization lifecycle is essentially a finite state machine.
For each state transition to happen, two things must have taken place:
- A new card event webhook has arrived from VisaDPS
- All tasks from the previous transition have completed (e.g. banking event processing, action intent execution)
If there are still pending tasks, the state transition doesn't happen, and the lifecycle stays in a pending state. We only start processing a new DPS card event once the previous transition has fully completed.
Each task is abstracted into a Temporal activity.
A task has two parts: fetch checks whether there's any work to do in this task for this lifecycle (e.g., are there pending action intents? any unprocessed account events?). If there's nothing, the task is skipped. If there is, process executes the actual work.
Each task returns a state that tells the workflow what to do next:
- continueAsNew: This task may have spawned new work (e.g., executing an action intent triggered a new side effect). Go back to the top of the task list and re-process from the beginning.
- nothingToProcess: fetch found nothing to do. Skip this task and move on to the next one.
- processed: The task completed its work. Continue to the next task.
This is how the ordering guarantee works. If executeActionIntentActivity returns continueAsNew, the workflow restarts from the top, ensuring all downstream work is finished before moving on to the next card event.
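A minimal sketch of that loop -- illustrative, not the actual workflow code:

```typescript
// Sketch of the task loop. Tasks run in order; a "continueAsNew" result
// restarts the pass from the top so any work spawned by that task is
// picked up before the workflow moves on.

type TaskResult = "continueAsNew" | "nothingToProcess" | "processed";
type Task = () => TaskResult;

function runTasks(tasks: Task[], maxPasses = 100): number {
  let passes = 0;
  outer: while (passes < maxPasses) {
    passes += 1;
    for (const task of tasks) {
      if (task() === "continueAsNew") continue outer; // restart from the top
      // "processed" and "nothingToProcess" both fall through to the next task
    }
    return passes; // a full pass completed without spawning new work
  }
  throw new Error("task loop did not converge");
}
```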
executeActionIntentTask is a representative example of this pattern.
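A hedged sketch of what such a task could look like, following the fetch/process split described above. The IntentStore interface and names are invented for illustration:

```typescript
// Hedged sketch of a task like executeActionIntentTask, following the
// fetch/process split. The IntentStore interface is invented for illustration.

type TaskResult = "continueAsNew" | "nothingToProcess" | "processed";

interface IntentStore {
  fetchPending(lifecycleId: string): string[]; // fetch: any work to do?
  execute(intentId: string): void; // may itself enqueue new intents
}

function executeActionIntentTask(
  store: IntentStore,
  lifecycleId: string,
): TaskResult {
  // fetch: skip the task entirely if there are no pending intents
  const pending = store.fetchPending(lifecycleId);
  if (pending.length === 0) return "nothingToProcess";

  // process: executing an intent can spawn new side effects, so tell the
  // workflow to restart from the top of the task list
  for (const id of pending) store.execute(id);
  return "continueAsNew";
}
```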
Conceptually, a running workflow means the lifecycle it represents is still in a pending state, and one of the following is happening:
- Transfer entities are being created
- Action intents are being executed
- Transfers are being retried
- Banking events are being processed
When the lifecycle reaches a new state, one of two things can happen:
- If no new signal arrives, the workflow completes, and a new workflow instance with the same workflowId will spawn on subsequent card events.
- If we receive a new card event webhook for the same lifecycle while the TPSWorkflow is processing a previous event, our signal handler sets hasNewSignal to true, and a re-run happens automatically.
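A toy model of that coalescing behavior (illustrative; the real version lives in a Temporal signal handler):

```typescript
// Toy model of the re-run flag: a signal arriving mid-run flips hasNewSignal,
// and the workflow loops once more instead of completing.

class TpsRunner {
  runs = 0;
  private hasNewSignal = false;

  // Signal handler: called when a new card event webhook arrives.
  signal(): void {
    this.hasNewSignal = true;
  }

  // Main loop: keep re-running while signals arrived during the last run.
  run(processOnce: () => void): void {
    do {
      this.hasNewSignal = false;
      this.runs += 1;
      processOnce(); // a signal may arrive while this executes
    } while (this.hasNewSignal);
  }
}
```

Note that the flag is cleared before each pass, so any number of signals during a run collapse into exactly one re-run -- the re-run itself picks up all the new events.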
Now, how exactly did this solve the flaws in the original design?
- Database contention is eliminated. In the original design, contention arose from checks ensuring transfers weren't in an intermediate state before starting new ones. For example, during a partial capture, we had to verify that the transfer_intent was in a hold state, so two different card events couldn't capture it at the same time. Since only one workflow now processes a single card authorization lifecycle at a time, there's no contention on non-ledger tables like transfer_intent, authorization_object, or authorization_set. Sequential task execution guarantees ordering, eliminating the need for guardrail code to check entity states.
- Abstracting the state machine with Temporal also gives us all the benefits of Temporal workflows and activities. Every step of transaction processing is now durable: if a workflow fails, Temporal resumes exactly where it left off on the next retry and self-heals.
- Observability became much better. Every step of the execution now lives in a single unified view. If a workflow is stuck in a failed/retry loop, it means one of the async tasks is failing, preventing the transition to the next state in the state machine. This made it easy to set up alerts and metrics to monitor workflow health.
- Horizontal scaling is now easy because everything related to transaction processing lives on a single Temporal task queue; scaling up is as simple as a Kubernetes command. This also allowed us to build auto-scalers based on Temporal task queue metrics.
Takeaways and next steps
Even with all the improvements from the TPS workflow, TPS is by no means guaranteed to handle 100x our current volume. There isn't really an "end game" when it comes to scaling our systems, and as our growth team continues to cook, the engineering team has to assume that our system will need constant improvement to handle future volume.
There will always be new engineering challenges in issuing and banking. If you made it here, thank you for reading. If you're interested in the real-world engineering problems we're solving as a fast-growing startup, apply here. We'd love to hear from you.