Building Saga Patterns with Step Functions: Distributed Transactions and Compensations

Building Saga Patterns with Step Functions: Distributed Transactions and Compensations#

Imagine you’re building an e-commerce platform. A customer places an order, and your system needs to validate the order, reserve inventory, charge a payment, and ship the goods. In a monolithic world, this would be a single database transaction—all or nothing. But in a distributed microservices architecture, each of these steps might live in a different service, each with its own database. If payment fails after inventory is already reserved, you’ve got a problem: the customer never got charged, but inventory is gone.

This is where the saga pattern comes in. A saga is a long-running transaction broken into a series of smaller, local transactions coordinated across multiple services. When something goes wrong, instead of rolling back in the traditional sense, the saga triggers compensating transactions—explicit rollback logic that undoes previous steps. AWS Step Functions provides a powerful way to orchestrate these sagas, giving you visibility, control, and resilience when building distributed systems.

In this article, we’ll explore how to build robust saga patterns using Step Functions, dive into handling failures with compensating actions, examine the trade-offs between choreography and orchestration, and discover the patterns that make distributed transactions reliable.

Understanding the Saga Pattern#

Before we jump into Step Functions, let’s establish a mental model of what a saga actually does. In traditional ACID transactions, you have atomicity: either everything commits or everything rolls back. Sagas give up immediate atomicity in exchange for availability and resilience. Instead, they guarantee eventual consistency by executing a series of steps, each with its own local transaction, and orchestrating compensating actions when failures occur.

Think of a saga as a storyline. The happy path is the main narrative—execute step one, then step two, then step three. But the story also needs alternate endings. If step three fails, you need chapters that undo steps one and two. These are your compensating transactions.

There are two primary ways to implement sagas: choreography and orchestration. In choreography, services communicate with each other through events. Service A completes a step and publishes an event. Service B listens for that event and performs the next step. This is loosely coupled but can be harder to understand and debug because the flow is distributed across multiple services. Orchestration, by contrast, uses a central coordinator—like Step Functions—that explicitly calls services in sequence and decides what happens next based on responses and failures. This gives you a clear, centralized view of the entire transaction flow.

Why Step Functions for Saga Orchestration#

Step Functions is AWS’s managed service for coordinating multi-step workflows. It offers several advantages for saga implementation. First, you define your workflow as a JSON state machine—a declarative representation that’s easy to visualize and understand. Second, Step Functions handles retries, timeouts, and error handling natively, without boilerplate code. Third, the AWS Console provides a visual representation of your workflow execution, showing you exactly where things succeeded or failed. Finally, Step Functions integrates seamlessly with nearly every AWS service and supports HTTP endpoints, making it agnostic to your actual business logic implementation.

For sagas, this means you get a single source of truth for your distributed transaction logic. The state machine becomes your transaction definition. If something breaks, you can look at the execution history and see exactly which step failed and what state the system was in at that moment.

Building an E-Commerce Order Saga#

Let’s walk through a concrete example: a multi-step order processing saga. The happy path involves four steps: validate the order, reserve inventory, charge the payment, and ship the order. But we’ll also define compensating steps that run if something goes wrong.

{
  "Comment": "E-commerce order saga with compensation",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "validateOrderFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "customerId.$": "$.customerId",
          "items.$": "$.items"
        }
      },
      "ResultPath": "$.validationResult",
      "Next": "ReserveInventory",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "OrderValidationFailed"
        }
      ]
    },
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "reserveInventoryFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "items.$": "$.items"
        }
      },
      "ResultPath": "$.reservationResult",
      "Next": "ChargePayment",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CompensateReservation"
        }
      ]
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "chargePaymentFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "amount.$": "$.totalAmount",
          "customerId.$": "$.customerId"
        }
      },
      "ResultPath": "$.paymentResult",
      "Next": "ShipOrder",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CompensatePayment"
        }
      ]
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "shipOrderFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "shippingAddress.$": "$.shippingAddress"
        }
      },
      "ResultPath": "$.shipmentResult",
      "Next": "OrderSucceeded",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CompensateShipment"
        }
      ]
    },
    "CompensatePayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "refundPaymentFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "paymentId.$": "$.paymentResult.Payload.paymentId"
        }
      },
      "ResultPath": "$.refundResult",
      "Next": "CompensateReservation"
    },
    "CompensateReservation": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "releaseInventoryFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "reservationId.$": "$.reservationResult.Payload.reservationId"
        }
      },
      "ResultPath": "$.releaseResult",
      "Next": "OrderFailed"
    },
    "CompensateShipment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "cancelShipmentFunction",
        "Payload": {
          "orderId.$": "$.orderId",
          "shipmentId.$": "$.shipmentResult.Payload.shipmentId"
        }
      },
      "ResultPath": "$.cancelResult",
      "Next": "CompensatePayment"
    },
    "OrderSucceeded": {
      "Type": "Succeed"
    },
    "OrderValidationFailed": {
      "Type": "Fail",
      "Error": "ValidationError",
      "Cause": "Order validation failed"
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "SagaFailed",
      "Cause": "Order processing failed and was rolled back"
    }
  }
}

This state machine encodes the entire saga logic. Notice how each task has a ResultPath that captures the output and merges it into the state. This is crucial for sagas because you need to preserve information about what succeeded so you can compensate it later. For example, the payment function returns a paymentId, and we need that ID later to issue a refund.

Leveraging Catch Blocks for Error Handling and Compensation#

The Catch block is your compensation trigger. When a task fails—whether it’s a timeout, an exception, or an explicit error thrown by your Lambda function—the Catch block catches it and redirects execution to a compensating state.

In our example, if ReserveInventory fails, we jump to CompensateReservation, which releases the inventory. If ChargePayment fails, we jump to CompensatePayment, which issues a refund, and then we proceed to CompensateReservation to release inventory as well. Notice the order: we compensate in reverse order. This matters because if the reservation succeeded and payment failed, we need to undo the payment before undoing the reservation.

The ErrorEquals parameter in the Catch block determines which errors trigger the compensation. Using ["States.ALL"] catches all errors. In production, you might be more selective:

"Catch": [
  {
    "ErrorEquals": ["PaymentDeclined", "InsufficientFunds"],
    "Next": "CompensatePayment"
  },
  {
    "ErrorEquals": ["ServiceUnavailable"],
    "Next": "RetryPayment"
  },
  {
    "ErrorEquals": ["States.ALL"],
    "Next": "OrderFailed"
  }
]

This allows you to handle different failure modes differently. A payment decline might be unrecoverable and should immediately trigger compensation. A service unavailable error might warrant a retry before giving up.

Tracking State with ResultPath#

ResultPath is a powerful feature that deserves special attention. It controls what data flows from the output of a task into the state. By default, a task’s output completely replaces the state. For sagas, this would be disastrous—you’d lose all your accumulated data. ResultPath solves this by letting you merge task outputs into specific paths in the state.

In our example, we use "ResultPath": "$.reservationResult", which means: take the entire output from the ReserveInventory Lambda function and store it at the path $.reservationResult in the state. The rest of the state (like $.orderId, $.customerId, etc.) is preserved. This allows each step to contribute its results to a growing accumulation of data that represents the complete saga state.

You can also use ResultPath: null to discard a task’s output entirely if you don’t need it, or "ResultPath": "$" to replace the entire state. This gives you fine-grained control over what data flows through your saga.

Handling Partial Failures#

One of the trickiest scenarios in saga patterns is partial failure: some steps succeed before one fails. In a traditional transaction, the database would automatically roll back everything. In a saga, you must explicitly decide what to roll back.

Consider this scenario: validation passes, inventory is reserved, but the payment service is temporarily down. At this point, you have successfully reserved inventory but failed to charge payment. The saga must now:

Recognize that payment failed
Compensate the inventory reservation
Leave validation alone (it doesn’t require compensation)
Transition to a failed state

Our state machine handles this by having the ChargePayment Catch block point to CompensatePayment, which then chains to CompensateReservation. Critically, we don’t compensate validation because there’s nothing to undo there.

But what if a later step fails? Consider if shipment fails after everything else succeeds. Now we need to compensate in reverse order:

Shipment fails
  → Cancel shipment
    → Refund payment
      → Release inventory
        → Fail the saga

This is exactly what our state machine does with the CompensateShipment state chaining to CompensatePayment, which chains to CompensateReservation.

The key insight is that your compensating states must form a chain that mirrors the reversal of the happy path. Design your compensation flow to undo steps in reverse order of their execution.

Ensuring Idempotency in Compensation#

Here’s a critical requirement that’s easy to overlook: compensating actions must be idempotent. Idempotency means that performing the same action multiple times produces the same result as performing it once.

Why does this matter? Because Step Functions retries failed steps by default. Imagine your refund operation succeeds, but the response is lost due to a network glitch. Step Functions retries the refund. If your refund function isn’t idempotent, it might refund the customer twice.

Here’s what an idempotent refund function might look like in Python:

import boto3

dynamodb = boto3.resource('dynamodb')
refunds_table = dynamodb.Table('ProcessedRefunds')
payments_table = boto3.resource('dynamodb').Table('Payments')

def lambda_handler(event, context):
    order_id = event['orderId']
    payment_id = event['paymentId']
    
    # Check if we've already processed this refund
    response = refunds_table.get_item(Key={'orderId': order_id})
    if 'Item' in response:
        return {
            'statusCode': 200,
            'refundId': response['Item']['refundId'],
            'message': 'Refund already processed'
        }
    
    # Get payment details to verify
    payment = payments_table.get_item(Key={'paymentId': payment_id})['Item']
    if payment['status'] != 'charged':
        return {
            'statusCode': 400,
            'message': 'Payment not in chargeable state'
        }
    
    # Issue the refund
    stripe = initialize_stripe()
    refund = stripe.Refund.create(charge=payment['stripeChargeId'])
    
    # Record that we've processed this refund
    refunds_table.put_item(Item={
        'orderId': order_id,
        'refundId': refund.id,
        'timestamp': int(time.time())
    })
    
    return {
        'statusCode': 200,
        'refundId': refund.id
    }

The function checks if it’s already processed this refund by looking it up in a refunds table. If it has, it returns the same result without processing again. This ensures that even if Step Functions retries, the customer is only refunded once.

Orchestration Versus Choreography: Trade-offs#

Now that we understand orchestration through Step Functions, let’s compare it to choreography, the event-driven alternative.

In a choreography-based saga, services communicate through events. When the order service validates an order, it publishes an OrderValidated event. The inventory service listens for this event and processes it, then publishes an InventoryReserved event. The payment service listens for that event, and so on. If payment fails, the payment service publishes an OrderCancelled event, which triggers compensations in other services.

The advantage of choreography is loose coupling. Services don’t call each other directly; they react to events. This makes it easy to add new services to the saga without modifying existing ones. It also distributes intelligence across services, which some teams prefer from an organizational perspective.

However, choreography has significant drawbacks. The saga flow is implicit and distributed across multiple services. If something goes wrong, understanding what happened requires tracing through event logs across multiple services. There’s no single state machine that represents your transaction. Debugging is harder. Testing is harder because you need multiple services running. And distributed event flows can be surprisingly tricky to reason about—circular dependencies and race conditions can sneak in.

Orchestration with Step Functions inverts these trade-offs. The entire saga is centralized in one state machine. The flow is explicit and visible. Debugging is straightforward because Step Functions provides execution history showing exactly which state failed and why. Testing is easier because you can mock service calls. However, you create tighter coupling—the orchestrator calls services directly, and services don’t need to know about the saga at all (which is actually good), but the orchestrator becomes a critical dependency.

For most teams building distributed systems, orchestration with Step Functions is the better choice. The visibility and debugging benefits usually outweigh the coupling concerns. That said, if you have hundreds of microservices and need extreme decoupling, choreography might be worth the complexity.

A hybrid approach is also viable: use Step Functions to orchestrate your main saga, but have services communicate asynchronously through events for non-critical notifications. For example, Step Functions orchestrates the core order flow, but also publishes events that trigger downstream systems like analytics or customer notifications.

Implementing Long-Running Sagas with Wait States#

Not all sagas complete quickly. Sometimes you need to wait for external events. For example, after shipping an order, you might want to wait for customer confirmation before finalizing the transaction.

Step Functions includes a Wait state that pauses execution for a specified duration:

{
  "ShipOrder": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "shipOrderFunction",
      "Payload": {
        "orderId.$": "$.orderId"
      }
    },
    "Next": "WaitForDelivery"
  },
  "WaitForDelivery": {
    "Type": "Wait",
    "Seconds": 604800,
    "Next": "ConfirmDelivery"
  },
  "ConfirmDelivery": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "confirmDeliveryFunction",
      "Payload": {
        "orderId.$": "$.orderId"
      }
    },
    "Next": "OrderSucceeded"
  }
}

This waits seven days (604800 seconds) before checking if the delivery was confirmed. If you need to wait for an external event rather than a fixed duration, you can use a Task state with the WaitForTaskToken parameter to pause and wait for a callback:

{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
      "QueueUrl": "https://sqs.region.amazonaws.com/account/approvalQueue",
      "MessageBody": {
        "taskToken.$": "$$.Task.Token",
        "orderId.$": "$.orderId"
      }
    },
    "Next": "OrderSucceeded"
  }
}

With waitForTaskToken, the saga pauses and waits. Some external system (perhaps a human or another service) processes the message and then calls the Step Functions SendTaskSuccess API with the task token, resuming execution. This is powerful for sagas that require human intervention or coordination with systems outside AWS.

Monitoring and Debugging Saga Executions#

One of Step Functions’ greatest strengths is observability. Every execution is logged, and the AWS Console provides a visual representation of execution flow.

When a saga fails, you can navigate to the Step Functions console, select the execution, and see:

A visual diagram showing which state failed and why
The complete input and output of each state
Timestamps for each transition
Error messages and stack traces

This level of visibility is invaluable when debugging distributed transactions. You can see exactly which step failed, what data it received, what it returned, and when things went wrong.

For production systems, you should integrate Step Functions with CloudWatch. Enable logging at the execution level to capture full details:

{
  "loggingConfiguration": {
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [
      {
        "cloudWatchLogsLogGroup": {
          "logGroupName": "/aws/stepfunctions/order-saga"
        }
      }
    ]
  }
}

With this enabled, all execution data flows to CloudWatch Logs, where you can search, analyze, and alert on saga patterns. You can use CloudWatch Insights to query execution data, find failed sagas, and analyze failure patterns.

Testing and Validation Strategies#

Testing sagas is more complex than testing single services because they coordinate multiple dependencies. Here are strategies that work well:

First, mock external service calls in development. When you’re testing locally or in a test environment, have your Lambda functions return canned responses. This lets you test the saga logic without needing all dependent services running. Use environment variables to switch between mock and real implementations:

import os
import boto3

def get_payment_service():
    if os.environ.get('USE_MOCK_PAYMENT') == 'true':
        return MockPaymentService()
    else:
        return RealPaymentService()

Second, test each compensation path explicitly. Create test cases for each failure scenario: validation fails, inventory can’t be reserved, payment declines, shipment fails, etc. Verify that the correct compensations run in the correct order.

Third, use Step Functions’ execution history to validate behavior. After running a saga (real or mocked), check the execution history to ensure the expected states were visited and data flowed correctly:

aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:region:account:execution:stateMachine:executionName \
  --query 'events[*].[timestamp,type,stateEnteredEventDetails.name]' \
  --output table

Fourth, run chaos engineering tests in staging. Randomly inject failures in service calls and verify that sagas compensate correctly. This surfaces unexpected failure modes before they hit production.

Best Practices for Production Sagas#

Building robust sagas requires attention to several details:

Design for partial failures. Don’t assume that either everything succeeds or everything fails. Design each compensating action to work independently, even if other compensations have failed.

Make compensations idempotent. As discussed, ensure that running compensation multiple times is safe. Use request deduplication, idempotency tokens, or lookup checks.

Set appropriate timeouts. Each task should have a timeout. If a service is hanging, you don’t want your saga stuck forever. Use the TimeoutSeconds parameter:

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:invoke",
  "TimeoutSeconds": 30,
  "Next": "NextState"
}

Implement sensible retry policies. Some failures are transient and should be retried; others are permanent and should immediately compensate. Use the Retry clause:

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:invoke",
  "Retry": [
    {
      "ErrorEquals": ["ServiceUnavailable"],
      "IntervalSeconds": 1,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "Compensate"
    }
  ]
}

Keep sagas focused. A saga should represent a single business transaction. If you’re tempted to orchestrate dozens of steps, consider breaking it into multiple sagas that call each other.

Monitor saga duration. Long-running sagas consume execution history and can become expensive at scale. If a saga should complete in minutes, set a hard timeout to catch infinite waits.

Conclusion#

The saga pattern is essential for building reliable distributed systems. AWS Step Functions provides an excellent platform for orchestrating sagas, giving you centralized visibility, built-in error handling, and straightforward compensation logic.

The key to success is thinking carefully about your happy path, explicitly designing compensating actions for each step, and ensuring those actions are idempotent. Use ResultPath to accumulate state as your saga progresses, and use Catch blocks to route failures to the appropriate compensation logic. Remember that compensation must happen in reverse order, and that partial failures—where some steps succeed before failure—are the norm, not the exception.

Whether you’re building e-commerce platforms, financial systems, or any distributed workflow, the saga pattern transforms complex coordination problems into explicit, testable, and observable flows. Step Functions brings that pattern to life with minimal boilerplate, so you can focus on your business logic rather than transaction management.