Implementing State Machines in Go for Cleaner Call Chains
Background
Many projects require sequential operations based on runtime state. Common scenarios include:
- Parsing configuration formats or programming languages
- Executing operations on systems, routers, or clusters
- ETL pipelines for data extraction, transformation, and loading
Rob Pike's classic talk on lexical scanning in Go introduced a powerful state machine pattern that has influenced many implementations. The key insight is leveraging Go's ability to treat functions as first-class values—each state returns the next state to execute, breaking the traditional if/else chain of function calls.
This approach decouples call chains into testable units, making it significantly easier to verify individual components without mocking entire hierarchies.
The Problem with Traditional Call Chains
Conventional approaches to sequential operations typically look like this:
func Execute(args Config) {
    stepOne(args)
    stepTwo(args)
}
Or with distributed responsibilities:
func Execute(args Config) {
    stepOne(args)
}

func stepOne(args Config) {
    stepTwo(args)
}

func stepTwo(args Config) {
    // completion
}
Both patterns work, but they create challenges when testing. If Execute makes remote calls to external systems, you must mock or stub every downstream function to write isolated tests. With 50 call levels, this becomes unmanageable—testing a single function requires mocking everything underneath it.
Pike's pattern solves this by making each function return the next function to execute.
State Machine Pattern
Define a state as a function type:
type State[T any] func(ctx context.Context, input T) (T, State[T], error)
Each state receives input of type T and returns three values: the (possibly modified) input, the next state to execute, and an error. Returning nil for the next state terminates the machine normally; returning a non-nil error terminates it immediately with that error.
This design differs from Pike's original by including generics and returning the input parameter, enabling pure functional state machines where each state can transform data and pass it to the next.
The state driver is straightforward:
func Run[T any](ctx context.Context, input T, initial State[T]) (T, error) {
    current := initial
    var err error
    for {
        select {
        case <-ctx.Done():
            return input, ctx.Err()
        default:
        }
        // Use = rather than := here: a short variable declaration would
        // shadow input and current inside the loop body, so every
        // iteration would re-run the initial state forever.
        input, current, err = current(ctx, input)
        if err != nil {
            return input, err
        }
        if current == nil {
            return input, nil
        }
    }
}
This minimal implementation handles context cancellation, error propagation, and state transitions.
Practical Example: Cluster Service Cleanup
Consider a scenario where you need to remove a service from a cluster while handling associated storage:
package cleanup

// storageProvider defines operations for managing backup storage.
type storageProvider interface {
    DeleteOldBackups(ctx context.Context, serviceID string, retainCount int) error
    PurgeContainer(ctx context.Context, serviceID string) error
}

// clusterOps defines operations for cluster service management.
type clusterOps interface {
    Drain(ctx context.Context, serviceID string) error
    Unregister(ctx context.Context, serviceID string) error
    Enumerate(ctx context.Context) ([]string, error)
    HasAttachedStorage(ctx context.Context, serviceID string) (bool, error)
}
These unexported interfaces define exactly the operations the state machine needs, so callers can supply small implementations rather than a full client.
Configuration for the operation:
// Request holds parameters for service cleanup.
type Request struct {
    ServiceID string
    Storage   storageProvider
    Cluster   clusterOps
}

func (r Request) validate() error {
    if r.ServiceID == "" {
        return errors.New("service ID is required")
    }
    if r.Storage == nil {
        return errors.New("storage provider is required")
    }
    if r.Cluster == nil {
        return errors.New("cluster operations are required")
    }
    return nil
}
Note that Request is a value type, not a pointer. Because each state receives and returns its own copy, the value can stay off the heap, avoiding a per-call allocation and the garbage collector pressure that comes with it—negligible for simple operations but meaningful in high-throughput ETL pipelines.
The public API wraps the state machine:
// Cleanup removes a service from the cluster and cleans up storage.
// The three most recent backups are preserved.
func Cleanup(ctx context.Context, req Request) error {
if err := req.validate(); err != nil {
return err
}
_, err := Run(ctx, req, drainPhase)
if err != nil {
return fmt.Errorf("failed to cleanup service %q: %w", req.ServiceID, err)
}
return nil
}
Users interact with Cleanup, not the underlying state machine. The function validates input, selects the initial state (drainPhase), and executes the machine.
The first state drains the service:
func drainPhase(ctx context.Context, req Request) (Request, State[Request], error) {
    services, err := req.Cluster.Enumerate(ctx)
    if err != nil {
        return req, nil, err
    }
    found := false
    for _, id := range services {
        if id == req.ServiceID {
            found = true
            break
        }
    }
    if !found {
        return req, nil, fmt.Errorf("service %q not found in cluster", req.ServiceID)
    }
    if err := req.Cluster.Drain(ctx, req.ServiceID); err != nil {
        return req, nil, fmt.Errorf("failed to drain service: %w", err)
    }
    return req, unregisterPhase, nil
}
This state verifies the service exists, drains it, and returns unregisterPhase as the next state.
The second state unregisters the service:
func unregisterPhase(ctx context.Context, req Request) (Request, State[Request], error) {
    if err := req.Cluster.Unregister(ctx, req.ServiceID); err != nil {
        return req, nil, fmt.Errorf("failed to unregister service: %w", err)
    }
    hasStorage, err := req.Cluster.HasAttachedStorage(ctx, req.ServiceID)
    if err != nil {
        return req, nil, fmt.Errorf("storage check failed: %w", err)
    }
    if hasStorage {
        return req, cleanupPhase, nil
    }
    return req, nil, nil
}
This state removes the service from the cluster, checks for attached storage, and either proceeds to cleanup or terminates the machine by returning nil, nil.
The conditional branching demonstrates how state machines handle different execution paths based on runtime conditions.
Testing Benefits
This pattern naturally encourages small, testable units. Modules split easily when they grow too large—just create new states to isolate functionality.
The primary advantage emerges when testing. Traditional approaches require either sequential calls or functional chains, both of which push you toward end-to-end tests where isolated unit tests should suffice:
// Sequential approach - testing Execute requires mocking everything
func Execute(ctx context.Context, req Request) error {
if err := drainPhase(ctx, req); err != nil {
return err
}
if err := unregisterPhase(ctx, req); err != nil {
return err
}
// ... more phases
return nil
}
With a state machine, each state is independently testable. Test drainPhase in isolation by verifying it returns the correct next state and input modifications. The top-level Cleanup function doesn't need testing—it simply validates input and delegates to Run, which is also testable separately.
A table-driven test for drainPhase:
func TestDrainPhase(t *testing.T) {
    t.Parallel()

    cases := []struct {
        desc       string
        request    Request
        expectErr  bool
        expectNext State[Request]
    }{
        {
            desc: "Enumerate returns error",
            request: Request{
                Cluster: &mockCluster{
                    enumerateErr: errors.New("network failure"),
                },
            },
            expectErr: true,
        },
        {
            desc: "Service not in cluster",
            request: Request{
                ServiceID: "api-gateway",
                Cluster: &mockCluster{
                    services: []string{"auth-service", "db-primary"},
                },
            },
            expectErr: true,
        },
        {
            desc: "Drain operation fails",
            request: Request{
                ServiceID: "api-gateway",
                Cluster: &mockCluster{
                    services: []string{"api-gateway", "auth-service"},
                    drainErr: errors.New("timeout"),
                },
            },
            expectErr: true,
        },
        {
            desc: "Successful drain",
            request: Request{
                ServiceID: "api-gateway",
                Cluster: &mockCluster{
                    services: []string{"api-gateway", "auth-service"},
                },
            },
            expectNext: unregisterPhase,
        },
    }

    for _, tc := range cases {
        t.Run(tc.desc, func(t *testing.T) {
            _, next, err := drainPhase(context.Background(), tc.request)
            if tc.expectErr && err == nil {
                t.Errorf("expected error, got nil")
                return
            }
            if !tc.expectErr && err != nil {
                t.Errorf("expected no error, got %v", err)
                return
            }
            if err != nil {
                return
            }
            // Function values cannot be compared with ==, so compare
            // their code pointers via reflect.
            if tc.expectNext != nil && reflect.ValueOf(next).Pointer() != reflect.ValueOf(tc.expectNext).Pointer() {
                t.Errorf("state mismatch: got %p, want %p", next, tc.expectNext)
            }
        })
    }
}
Each test verifies that a state produces the correct output and transitions to the expected next state. No mocking of entire call chains required.
Alternative Implementations
A variant pattern stores the next state in the input parameter itself, useful for dynamic state selection:
type Transition[T any] struct {
    Data T
    Next State[T]
}

type State[T any] func(ctx context.Context, t Transition[T]) (Transition[T], error)

func Run[T any](ctx context.Context, data T, initial State[T]) (T, error) {
    trans := Transition[T]{Data: data, Next: initial}
    for trans.Next != nil {
        if ctx.Err() != nil {
            return trans.Data, ctx.Err()
        }
        current := trans.Next
        trans.Next = nil // a state must set Next again to keep running
        var err error
        trans, err = current(ctx, trans)
        if err != nil {
            return trans.Data, err
        }
    }
    return trans.Data, nil
}
This approach simplifies integration with distributed tracing systems or logging frameworks, as state transitions are explicit in the input structure.
For high-throughput scenarios requiring concurrency, the stagedpipe library provides advanced features including parallel processing and backpressure management.
Conclusion
State machines in Go provide a clean mechanism for managing sequential operations with testable components. By representing each step as a function that returns the next step, call chains become explicit, configurable, and straightforward to verify. This pattern scales from simple parsers to complex multi-stage workflows, making it a valuable addition to any Go developer's toolkit.