TL;DR: Every time you write a Temporal workflow, you're writing to a durable, automatically replayed, crash-proof state machine. Most of the database tables developers reflexively create — status columns, retry counters, pending job queues, audit logs — are already sitting inside your workflow history. This post is about recognising that, and pushing the boundary of what you can avoid persisting externally.
The reflex
You're building an order processing system. An order comes in. You need to:
- Track whether it's been paid
- Wait for a fulfillment confirmation that might arrive hours later
- Retry failed payment calls
- Time out if the customer never confirms their address
- Record every state transition for support queries
So you reach for a database. You create an orders table. You add a status column. You build a jobs table for retries, a scheduled task for timeouts, a status_history table for the audit log. You wire it all together with a mix of cron jobs and polling loops. You ship it.
Six months later it's a mess. Your status column has twelve possible values and nobody's quite sure which ones are reachable from which. The polling loop has a subtle race condition. The cron job sometimes fires twice.
The thing is — if you were already using Temporal, you built most of that infrastructure for free and you may not have noticed.
What Temporal actually persists
Temporal's persistence model is built on event sourcing. Every meaningful thing that happens in a workflow is appended to an immutable event log called the workflow history. This includes:
- Workflow started (including all input arguments)
- Activity scheduled, started, completed, failed, or timed out
- Timer created and fired
- Signal received
- Child workflow started and completed
- Local side effects recorded
When a Temporal worker crashes mid-execution — or is simply restarted for a deploy — the next available worker picks up the task and replays the history from scratch. Your workflow code runs again, but this time every await-on-activity call or sleep is short-circuited by the recorded result. By the time replay catches up to where execution stopped, every local variable is back to exactly the state it was in. No manual state restoration, no checkpoint files, no database reads.
```
Workflow starts
│
├─ [Event] WorkflowExecutionStarted { input: {...} }
├─ [Event] ActivityTaskScheduled { activity: "ChargePayment" }
├─ [Event] ActivityTaskStarted
├─ [Event] ActivityTaskCompleted { result: { txId: "abc123" } }
├─ [Event] TimerStarted { fireAt: +24h }
│
-- worker crashes here --
│
├─ [Replay] WorkflowExecutionStarted → restore input
├─ [Replay] ActivityTaskCompleted → short-circuit, return "abc123"
├─ [Replay] TimerStarted → short-circuit, return future timer handle
│
├─ [Event] TimerFired
├─ [Event] ActivityTaskScheduled { activity: "SendConfirmationEmail" }
...
```
The history is the state. Your local variables are just an in-memory materialisation of it.
Your workflow variables are already persisted
This is the bit that most people don't fully internalise until they've been burned by overengineering something.
Consider a workflow that tracks an order through several stages:
```go
func OrderWorkflow(ctx workflow.Context, order Order) error {
    status := "pending"

    // Charge the customer
    var paymentResult PaymentResult
    if err := workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &paymentResult); err != nil {
        return err
    }
    status = "paid"

    // Wait up to 48 hours for warehouse confirmation
    var confirmation WarehouseConfirmation
    confirmationCh := workflow.GetSignalChannel(ctx, "warehouse-confirmed")
    timeout := workflow.NewTimer(ctx, 48*time.Hour)

    selector := workflow.NewSelector(ctx)
    selector.AddReceive(confirmationCh, func(c workflow.ReceiveChannel, more bool) {
        c.Receive(ctx, &confirmation)
        status = "confirmed"
    })
    selector.AddFuture(timeout, func(f workflow.Future) {
        // 48h passed with no confirmation; status stays "paid"
    })
    selector.Select(ctx)

    if status != "confirmed" {
        return workflow.ExecuteActivity(ctx, CancelOrder, order).Get(ctx, nil)
    }

    status = "fulfilled"
    return workflow.ExecuteActivity(ctx, ShipOrder, order, confirmation).Get(ctx, nil)
}
```
`status` is a plain Go string. There is no database write anywhere in this code. But if the worker running this workflow crashes between the payment and the warehouse confirmation, `status` will be "paid" when it resumes, because the activity completion event for `ChargePayment` is in the history, and replay reconstructs it.
The status column you were about to add to your orders table? It's already here.
Signals replace your inbound queues
The most common external-DB pattern I see developers reaching for is a pending events table: something that external systems write to, and your workflow polls. A warehouse system fires an HTTP call that writes a row, your worker picks it up, marks it handled.
Temporal has a first-class primitive for this: signals.
A signal is a named, typed, asynchronous message sent to a running workflow. It's stored in the workflow history. It's never lost even if the workflow is mid-replay when it arrives. And you can send it from anywhere — an HTTP endpoint, another workflow, a Lambda, a Pub/Sub consumer.
```go
// Receiving end — inside the workflow
signalCh := workflow.GetSignalChannel(ctx, "warehouse-confirmed")
var confirmation WarehouseConfirmation
signalCh.Receive(ctx, &confirmation)
```

```go
// Sending end — from your HTTP handler
client.SignalWorkflow(ctx, workflowID, "", "warehouse-confirmed", WarehouseConfirmation{
    WarehouseID:   "LHR-02",
    EstimatedShip: time.Now().Add(3 * 24 * time.Hour),
})
```
No `pending_events` table. No polling loop. No hand-rolled idempotency guards against double-processing. Temporal handles delivery and ordering.
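One race that a `pending_events` table "solves" is the confirmation arriving before the order workflow exists. Temporal's answer is `SignalWithStartWorkflow`, which atomically starts the workflow if needed and delivers the signal. A sketch, reusing the types from the examples above (the task queue name and workflow ID scheme are illustrative):

```go
package main

import (
	"context"

	"go.temporal.io/sdk/client"
)

// ConfirmOrder delivers a warehouse confirmation whether or not the
// order workflow has started yet. Order, WarehouseConfirmation, and
// OrderWorkflow are assumed from the surrounding examples.
func ConfirmOrder(ctx context.Context, c client.Client, order Order, confirmation WarehouseConfirmation) error {
	_, err := c.SignalWithStartWorkflow(ctx,
		"order-"+order.ID,     // deterministic workflow ID doubles as an idempotency key
		"warehouse-confirmed", // signal name
		confirmation,          // signal payload, recorded in history
		client.StartWorkflowOptions{TaskQueue: "orders"},
		OrderWorkflow, order,
	)
	return err
}
```

If the workflow is already running, this is an ordinary signal; if not, the start and the signal happen as one atomic operation.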
Queries replace your status read API
The other side of the coin: if signals are the write path, queries are the read path.
A query is a synchronous, read-only RPC into a running (or recently completed) workflow. The worker reconstructs state from history and calls your query handler without advancing the history at all. From the caller's perspective it's just an API call that returns the current state.
```go
// Register a query handler inside the workflow
workflow.SetQueryHandler(ctx, "get-status", func() (OrderStatus, error) {
    return OrderStatus{
        Status:       status,
        PaymentTxID:  paymentResult.TxID,
        Confirmation: confirmation,
    }, nil
})
```

```go
// Call it from anywhere
resp, err := client.QueryWorkflow(ctx, workflowID, "", "get-status")
if err != nil {
    return err
}
var orderStatus OrderStatus
resp.Get(&orderStatus)
```
Your "what's the current state of order X?" endpoint doesn't need a database read. It needs a query.
Timers replace your scheduler
workflow.Sleep(ctx, 48*time.Hour) is not a thread sleep. It's a durable timer, persisted in the workflow history. The worker that was running your workflow can restart, redeploy, or disappear entirely — the timer still fires at the right time. A different worker picks it up, replays to the sleep point, and resumes.
This eliminates a whole class of infrastructure:
| What you might have built | Temporal equivalent |
|---|---|
| Cron job that polls for expired orders | workflow.Sleep + cancellation logic |
| Scheduled retry table with next_attempt_at | Activity retry policy with backoff |
| Delayed email queue | workflow.Sleep → send email activity |
| SLA breach monitor | Timer + escalation signal handler |
For the recurring/cron case, Temporal has Schedules — a built-in scheduler that creates new workflow runs on a cron expression. No external scheduler, no missed fires, full history of past runs.
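Creating a Schedule from the Go SDK looks roughly like this; the workflow name, schedule ID, and task queue are illustrative:

```go
package main

import (
	"context"

	"go.temporal.io/sdk/client"
)

// createNightlySchedule registers a Schedule that starts a new run of a
// hypothetical ReconcileOrdersWorkflow every night at 03:00.
func createNightlySchedule(ctx context.Context, c client.Client) error {
	_, err := c.ScheduleClient().Create(ctx, client.ScheduleOptions{
		ID: "nightly-order-reconciliation",
		Spec: client.ScheduleSpec{
			CronExpressions: []string{"0 3 * * *"},
		},
		Action: &client.ScheduleWorkflowAction{
			ID:        "reconcile-orders", // base ID for each spawned run
			Workflow:  ReconcileOrdersWorkflow,
			TaskQueue: "orders",
		},
	})
	return err
}
```

Each run the Schedule spawns is an ordinary workflow with its own history, so "what did last night's run do?" is answered by the history browser, not a log-scraping exercise.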
Activity retries replace your retry infrastructure
Activities have a configurable retry policy — max attempts, initial interval, backoff coefficient, max interval, non-retryable error types. All of this is encoded in the history. You don't need to track retry counts in a column or implement backoff logic yourself.
```go
activityOptions := workflow.ActivityOptions{
    StartToCloseTimeout: 30 * time.Second,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:        time.Second,
        BackoffCoefficient:     2.0,
        MaximumInterval:        time.Minute,
        MaximumAttempts:        5,
        NonRetryableErrorTypes: []string{"PaymentDeclinedError"},
    },
}
ctx = workflow.WithActivityOptions(ctx, activityOptions)
```
Each attempt, each failure, each eventual success is an event in the history. That history IS your audit log for the payment flow.
A worked example: the order lifecycle
Here's what a realistic order workflow looks like in practice, with no external database managing workflow state:
```go
func OrderWorkflow(ctx workflow.Context, order Order) error {
    logger := workflow.GetLogger(ctx)

    // State lives here
    var (
        paymentResult    PaymentResult
        warehouseConfirm WarehouseConfirmation
        shipmentID       string
    )

    // --- Expose read API via query ---
    workflow.SetQueryHandler(ctx, "order-status", func() (map[string]any, error) {
        return map[string]any{
            "orderID":    order.ID,
            "payment":    paymentResult,
            "warehouse":  warehouseConfirm,
            "shipmentID": shipmentID,
        }, nil
    })

    // --- Payment (auto-retried, result in history) ---
    payCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
        RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 3},
    })
    if err := workflow.ExecuteActivity(payCtx, ChargePayment, order).Get(ctx, &paymentResult); err != nil {
        return fmt.Errorf("payment failed: %w", err)
    }

    // --- Wait for warehouse (signal), with 48h timeout ---
    confirmCh := workflow.GetSignalChannel(ctx, "warehouse-confirmed")
    timeoutTimer := workflow.NewTimer(ctx, 48*time.Hour)
    sel := workflow.NewSelector(ctx)
    confirmed := false
    sel.AddReceive(confirmCh, func(c workflow.ReceiveChannel, _ bool) {
        c.Receive(ctx, &warehouseConfirm)
        confirmed = true
    })
    sel.AddFuture(timeoutTimer, func(f workflow.Future) {
        logger.Warn("warehouse confirmation timed out")
    })
    sel.Select(ctx)

    if !confirmed {
        return workflow.ExecuteActivity(ctx, CancelOrder, order, paymentResult).Get(ctx, nil)
    }

    // --- Ship ---
    shipCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Minute,
    })
    return workflow.ExecuteActivity(shipCtx, ShipOrder, order, warehouseConfirm).Get(ctx, &shipmentID)
}
```
What this workflow replaces:
- `orders.status` column → workflow completion/query state
- `order_events` audit table → workflow history
- `retry_jobs` table → activity retry policy
- `pending_confirmations` inbox → signal channel
- `order_timeout_jobs` cron table → durable timer
- The polling loop → selector on signal + timer
Where you still need a database
I'm not going to pretend Temporal replaces everything. There are genuine cases where external storage is the right call.
Large payloads. Temporal enforces a payload size limit (2 MB per payload on a default server configuration). If you're moving around large files, binary blobs, or bulk records, put them in object storage and pass references through the workflow.
Cross-workflow aggregate queries. "Show me all orders in 'awaiting_warehouse' state for the last 30 days" — this is an analytics/reporting query that spans many workflow instances. Temporal's visibility layer (via Elasticsearch) can help with basic filtering by custom search attributes, but for complex reporting you'll want a dedicated store. The pattern here is an activity that writes a compact summary to your reporting DB on each state transition — a narrow, intentional write, not a full data model.
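That "narrow, intentional write" might look like the sketch below: an activity struct holding the reporting DB connection, with one upsert per transition. The table name, struct fields, and SQL dialect (Postgres-style) are all illustrative:

```go
package main

import (
	"context"
	"database/sql"
)

// OrderSnapshot is a compact projection of workflow state for reporting.
type OrderSnapshot struct {
	OrderID string
	Status  string
}

// ReportingActivities holds dependencies; register an instance of this
// struct with your worker so activities can reach the reporting DB.
type ReportingActivities struct {
	DB *sql.DB
}

// RecordOrderSnapshot upserts one summary row per order. The workflow
// calls this activity on each transition it wants visible to reporting;
// everything else stays in the history.
func (a *ReportingActivities) RecordOrderSnapshot(ctx context.Context, snap OrderSnapshot) error {
	_, err := a.DB.ExecContext(ctx,
		`INSERT INTO order_snapshots (order_id, status, updated_at)
		 VALUES ($1, $2, NOW())
		 ON CONFLICT (order_id) DO UPDATE SET status = $2, updated_at = NOW()`,
		snap.OrderID, snap.Status)
	return err
}
```

Because it runs as an activity, the write gets retries and an audit trail in the history for free, and the workflow itself stays deterministic.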
Data that outlives your workflow history. Temporal has a configurable retention period (often 7–30 days after workflow close). If you need to look up the details of a completed order years later, either extend retention, archive the history on completion, or write to a system of record at close time.
Non-Temporal consumers. If your mobile app, your support tool, or a third-party integration needs to read order state — and they can't talk to Temporal — you'll need to project state somewhere they can reach it. An on-completion write or an event-driven projection is the cleanest pattern.
The key question to ask is: does this state change exist to drive the workflow, or does it need to exist independently of it? If it's the former, it probably belongs in the workflow. If it's the latter, write it externally — but do it from an activity, not from the workflow directly.
The mental model shift
The core shift is this: stop thinking of Temporal as an orchestration layer on top of your data model, and start thinking of it as the data model itself — for the duration of the process it's running.
Your instinct will be to write status to a database so that "things are safe". But Temporal's event history is safer than most databases you'll deploy. It's replicated, append-only, crash-consistent, and automatically reconstructed on failure. The status in your orders table is a stale projection of reality; the status in your workflow is authoritative.
Push your state into the workflow first. Pull it out to an external store only when there's a concrete reason to — a query pattern it can't serve, a retention requirement it can't meet, or a consumer it can't reach.
You'll end up writing far less infrastructure, and what you do write will be a lot harder to get wrong.
Chris
