
Persisting Stateless Workflows

If you want to build production-grade applications with Microsoft’s Agent Framework Workflows, those workflows must be stateless.

In the Microsoft Agent Framework (MAF), workflows provide the backbone for building reasoning and decision-driven systems. They enable agents to execute complex tasks, coordinate multiple components, and persist progress over time. The framework’s documentation and samples show how to use the checkpoint store, configure a JSON-based backing store, and create a checkpoint manager to persist workflow progress.

What’s missing, though, is how to bring all of those parts together into a complete, stateless, production-ready design. In real-world systems, workflows need to pause, persist, and later resume.

This post focuses on combining the existing Microsoft examples into a cohesive architecture that uses an ASP.NET API and a persistent store to create a fully stateless workflow.

We’ll explore how to coordinate checkpoints, persist workflow context externally, and safely resume execution without losing continuity. Rather than restating existing documentation, the goal is to show how to turn those building blocks into a seamless workflow that can be halted and resumed repeatedly, giving you a clear pattern for robust stateless agents.

You can get the code here

Understanding MAF Checkpoints

The Microsoft Agent Framework relies on two core components to manage workflow state:

  1. Checkpoint Store: Persists workflow snapshots. The default store keeps them in memory, but a custom store can write to external storage. Many examples use a JSON-based store, which serializes workflow state into a structured, human-readable format. The checkpoint store is the persistence layer: a durable backing where workflow data lives outside the runtime process.
  2. Checkpoint Manager: Coordinates the creation, updating, and retrieval of these persisted states. Each time a workflow reaches a logical completion point, often called a Super Step, the checkpoint manager creates a checkpoint that captures the workflow’s current state, including variables, messages, and any context needed for resumption.

When a workflow resumes, the checkpoint manager retrieves the latest checkpoint from the store and reconstructs the workflow state in memory. This enables execution to continue from the exact point it left off, even if the application has restarted or the process has moved to another machine. By persisting state externally and reconstructing it on demand, MAF effectively decouples workflow execution from process lifetime — a foundational principle of stateless orchestration.
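The save-and-reconstruct cycle described above can be illustrated with a toy model. This is a minimal sketch that does not use any MAF API: `CounterState` and `ToyRunner` are hypothetical names invented for illustration, and the dictionary stands in for an external store.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Toy model only: none of these types are MAF APIs.
public sealed record CounterState(int Step, int Total);

public sealed class ToyRunner(IDictionary<string, string> store)
{
    // Runs `steps` supersteps, resuming from the saved snapshot if one exists.
    public CounterState Run(string runId, int steps)
    {
        // Reconstruct state from the store, or start fresh for a new run.
        var state = store.TryGetValue(runId, out var json)
            ? JsonSerializer.Deserialize<CounterState>(json)!
            : new CounterState(0, 0);

        for (var i = 0; i < steps; i++)
        {
            var nextStep = state.Step + 1;
            state = new CounterState(nextStep, state.Total + nextStep);

            // Snapshot at the end of every superstep, so a later runner
            // (possibly in another process) can pick up from here.
            store[runId] = JsonSerializer.Serialize(state);
        }

        return state;
    }
}
```

Running two steps with one `ToyRunner`, then one more step with a brand-new `ToyRunner` over the same store, yields Step = 3 and Total = 6, exactly as if a single runner had executed all three steps. The runner itself holds no state between calls, which is the property the checkpoint manager provides for real workflows.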

Azure Storage Checkpoint Store

Implementing a checkpoint store for stateless workflows in MAF doesn’t require an elaborate rewrite of the framework. It only involves extending the existing JsonCheckpointStore and providing custom persistence logic. The code below shows how this can be achieved with a lightweight wrapper around a repository interface:

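// Namespaces as found in the MAF preview packages; adjust if your version differs.
using System.Text.Json;
using Microsoft.Agents.AI.Workflows;
using Microsoft.Agents.AI.Workflows.Checkpointing;

public class CheckpointStore(ICheckpointRepository checkpointRepository) : JsonCheckpointStore
{
    // Lists every checkpoint saved for the run. Note that withParent is not used to filter here.
    public override async ValueTask<IEnumerable<CheckpointInfo>> RetrieveIndexAsync(string runId, CheckpointInfo? withParent = null)
    {
        var stateByRunId = await checkpointRepository.GetAsync(runId);
        return stateByRunId.Select(x => x.CheckpointInfo);
    }

    // Assigns a new checkpoint ID and persists the serialized state. The parent is not recorded.
    public override async ValueTask<CheckpointInfo> CreateCheckpointAsync(string runId, JsonElement value, CheckpointInfo? parent = null)
    {
        var checkpointInfo = new CheckpointInfo(runId, Guid.NewGuid().ToString());
        await checkpointRepository.SaveAsync(new StoreStateDto(checkpointInfo, value));
        return checkpointInfo;
    }

    // Loads the serialized state for one checkpoint of the run.
    public override async ValueTask<JsonElement> RetrieveCheckpointAsync(string runId, CheckpointInfo key)
    {
        var stateDto = await checkpointRepository.LoadAsync(key.CheckpointId, runId);
        return stateDto.JsonElement;
    }
}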

Behind this implementation is an ICheckpointRepository abstraction, which in this case writes to Azure Blob Storage. This approach allows you to persist checkpoints externally and retrieve them later, enabling stateless workflows that can safely pause, stop, and restart without holding any memory-resident state.
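The repository contract itself is not shown in this post, so here is a minimal sketch of what it might look like. The shapes of `ICheckpointRepository` and `StoreStateDto` are assumptions inferred from the three calls `CheckpointStore` makes. `CheckpointKey` is a hypothetical stand-in for MAF's `CheckpointInfo` so the sketch compiles without the framework package, and `FileCheckpointRepository` swaps Azure Blob Storage for the local file system to stay self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

// Stand-in for MAF's CheckpointInfo (assumption: it exposes RunId and CheckpointId).
public sealed record CheckpointKey(string RunId, string CheckpointId);

// Hypothetical DTO shape, inferred from how CheckpointStore uses StoreStateDto.
public sealed record StoreStateDto(CheckpointKey CheckpointInfo, JsonElement JsonElement);

// Hypothetical contract, inferred from the calls made by CheckpointStore.
public interface ICheckpointRepository
{
    Task<IReadOnlyList<StoreStateDto>> GetAsync(string runId);
    Task SaveAsync(StoreStateDto state);
    Task<StoreStateDto> LoadAsync(string checkpointId, string runId);
}

// One JSON file per checkpoint at {root}/{runId}/{checkpointId}.json.
// A blob-backed version could use the same layout for blob names.
public sealed class FileCheckpointRepository(string rootDirectory) : ICheckpointRepository
{
    private string PathFor(string runId, string checkpointId) =>
        Path.Combine(rootDirectory, runId, checkpointId + ".json");

    public async Task SaveAsync(StoreStateDto state)
    {
        var path = PathFor(state.CheckpointInfo.RunId, state.CheckpointInfo.CheckpointId);
        Directory.CreateDirectory(Path.GetDirectoryName(path)!);
        await File.WriteAllTextAsync(path, state.JsonElement.GetRawText());
    }

    public async Task<StoreStateDto> LoadAsync(string checkpointId, string runId)
    {
        var json = await File.ReadAllTextAsync(PathFor(runId, checkpointId));
        using var document = JsonDocument.Parse(json);
        // Clone so the element stays valid after the JsonDocument is disposed.
        return new StoreStateDto(new CheckpointKey(runId, checkpointId), document.RootElement.Clone());
    }

    public async Task<IReadOnlyList<StoreStateDto>> GetAsync(string runId)
    {
        var runDirectory = Path.Combine(rootDirectory, runId);
        if (!Directory.Exists(runDirectory))
        {
            return Array.Empty<StoreStateDto>();
        }

        // Enumeration order is not guaranteed; sort by a timestamp if order matters.
        var states = new List<StoreStateDto>();
        foreach (var file in Directory.EnumerateFiles(runDirectory, "*.json"))
        {
            states.Add(await LoadAsync(Path.GetFileNameWithoutExtension(file), runId));
        }
        return states;
    }
}
```

Keying storage by run ID and then checkpoint ID keeps listing a run's checkpoints cheap, since it maps to a single directory (or blob prefix) scan. The sketch assumes run and checkpoint IDs are safe to use as path segments; a real implementation should validate or encode them.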

By leveraging the existing MAF infrastructure and plugging in a persistence layer, this code demonstrates how to achieve a complete separation between workflow logic and runtime state.

Once the CheckpointStore is in place, initializing the Checkpoint Manager becomes a single, straightforward step:

var checkpointManager = CheckpointManager.CreateJson(new CheckpointStore(repository));

This line creates a CheckpointManager instance backed by the CheckpointStore you’ve defined. The manager now handles checkpoint creation, retrieval, and coordination. You then pass this checkpointManager to your workflow runner or orchestrator so the workflow can persist and resume as it executes. This connects the persistence layer to the orchestration layer and completes the stateless workflow pipeline: a small amount of custom code turns a stateful execution into a stateless, resumable process.

Until Next Time

Stateless workflows represent the next evolution in how agents manage long-running or multi-step reasoning tasks. By decoupling execution state from the runtime process, developers gain the ability to scale horizontally, recover from interruptions, and distribute workloads without sacrificing continuity.

The Microsoft Agent Framework already provides the primitives needed for this: checkpoints, serialization, and workflow management. With a small extension — like the CheckpointStore implementation shown here — you can achieve full durability and resumability across distributed environments.

In production systems, this approach means that agents can reliably handle complex workflows that span hours or days, survive restarts, and even move seamlessly between hosts. It’s the foundation for building robust, fault-tolerant agents capable of real-world reasoning at scale.

In the next post, we’ll extend this architecture by wiring the checkpoint store into an ASP.NET API and expanding the workflow into a stateless, human-in-the-loop ReAct (Reason and Act) workflow.