ML API Design Principles

Overview

A fundamental principle in ML API design is the strict architectural decoupling of Inference (Prediction) from Training (Learning). These two workloads have diametrically opposed system requirements.

| Feature | Inference API | Training API |
|---|---|---|
| Latency | Critical (<100ms usually required). | Flexible (hours to days). |
| Compute | CPU/small GPU (bursty, high concurrency). | Heavy GPU/TPU (sustained, batch processing). |
| State | Stateless (ideally). | Stateful (checkpoints, logs). |
| Protocol | gRPC / HTTP/2 (minimize overhead). | Async HTTP / webhooks (long-running). |

The “Command-Query” Separation for ML

Treat the ML system as a variation of CQRS (Command Query Responsibility Segregation); a minimal code sketch follows the list:

  • Queries (Inference): Read-only operations that return predictions. Optimized for Throughput ($\lambda$) and Latency ($W$).
  • Commands (Training/Fine-tuning): Write operations that update model state. Optimized for reliability and resource utilization.
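
As a minimal illustration, here is one way to split the two responsibilities into separate endpoints that can later be deployed and scaled independently. This assumes a FastAPI-style HTTP service; the route names and payload fields are hypothetical, not a prescribed API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictIn(BaseModel):
    features: list[float]

# Query side: read-only, latency-sensitive, ideally stateless.
@app.post("/v1/predict")
def predict(body: PredictIn) -> dict:
    # Model lookup and inference would happen here.
    return {"prediction": 0.0, "model_version": "v1"}

class FineTuneIn(BaseModel):
    dataset_uri: str

# Command side: mutates model state, long-running, returns a job handle.
@app.post("/v1/jobs/fine-tune", status_code=202)
def start_fine_tune(body: FineTuneIn) -> dict:
    # Enqueue the training job; never block the request thread on it.
    return {"job_id": "job-123", "status": "queued"}
```

In practice the two sides often live in entirely separate services, so the inference fleet can autoscale on RPS while training runs on a separate GPU pool.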

Protocol Selection & Payload Optimization

“JSON over REST” is often insufficient for heavy tensor payloads. Protocol evaluation should be based on serialization overhead and transport efficiency.

The Protocol Decision Matrix

  • REST (JSON): Use for low-frequency management APIs (e.g., list_models, update_config) or public-facing APIs where developer experience (DX) > raw performance.

  • gRPC (Protobuf): The standard for internal service-to-service inference (a client sketch follows this list).

    • Why: HTTP/2 multiplexing and binary serialization.
    • Performance: Benchmarks typically show a 7-10x reduction in latency compared to REST for payload-heavy requests.
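
For illustration, a minimal Python gRPC inference client might look like the sketch below. It assumes stubs generated from a hypothetical predict.proto (the predict_pb2 / predict_pb2_grpc module names and the Predict RPC are assumptions, not any specific product's API).

```python
import grpc
# Generated by protoc from a hypothetical predict.proto
# (see the schema sketch later in this section).
import predict_pb2
import predict_pb2_grpc

def run_inference(host: str = "localhost:50051") -> None:
    # A single HTTP/2 channel multiplexes many concurrent RPCs.
    with grpc.insecure_channel(host) as channel:
        stub = predict_pb2_grpc.PredictionServiceStub(channel)
        request = predict_pb2.PredictRequest(
            model_name="my-model",
            inputs=[0.1, 0.2, 0.3],  # flattened float tensor
        )
        # Binary Protobuf on the wire; the deadline enforces the latency SLA.
        response = stub.Predict(request, timeout=0.1)
        print(response.outputs)
```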

Binary Serialization Deep Dive

When designing the schema for your PredictRequest, the serialization format dictates the “tax” you pay on every call (a schema sketch follows the list below).

  1. Protocol Buffers:

    • Pros: Strongly typed, backward compatible, excellent tooling (gRPC).
    • Cons: Requires a deserialization step (parsing).
  2. FlatBuffers (https://en.wikipedia.org/wiki/FlatBuffers):

    • Mechanism: Accesses serialized data without parsing/unpacking. It uses offset tables to read data directly from the buffer.
    • Use Case: Mobile/Edge ML deployment where CPU cycles for parsing are expensive.
    • Trade-off: Slightly larger payload size on the wire compared to Protobuf, but effectively zero-latency parsing.
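
To make the trade-offs concrete, here is a minimal Protobuf schema sketch for the PredictRequest discussed above. The field names and the tensor-as-repeated-float encoding are illustrative assumptions; production schemas often use a dedicated tensor message with dtype metadata.

```protobuf
syntax = "proto3";

package inference.v1;

// Hypothetical schema sketch, not a standard inference API.
message PredictRequest {
  string model_name = 1;
  repeated int64 shape = 2;   // tensor shape, e.g. [1, 224, 224, 3]
  repeated float inputs = 3;  // flattened tensor payload (packed on the wire)
}

message PredictResponse {
  repeated float outputs = 1;
  string model_version = 2;
}

service PredictionService {
  rpc Predict(PredictRequest) returns (PredictResponse);
}
```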

Queueing Theory for Inference

To rigorously design for an SLA (Service Level Agreement), apply queueing theory. An inference server can be modeled as an M/M/c queue (Markovian arrivals, Markovian service times, $c$ servers).

Little’s Law

The fundamental theorem governing your API’s concurrency:

$$L = \lambda W$$

Where:

  • $L$ = Average number of requests in the system (Concurrency).
  • $\lambda$ = Average arrival rate (Requests per second - RPS).
  • $W$ = Average time a request spends in the system (Latency).

Design Implication: If your model takes $W = 0.2$ s (200ms) to infer, and you need to handle $\lambda = 1000$ RPS: $L = 1000 \times 0.2 = 200$. You need system capacity (concurrency) to handle 200 active requests simultaneously. This dictates your GPU memory sizing and worker count.
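
A back-of-the-envelope capacity calculation based on Little’s Law (the per-worker concurrency of 8 is an illustrative assumption; measure it for your model and hardware):

```python
import math

def required_workers(rps: float, latency_s: float, concurrency_per_worker: int) -> int:
    """Little's Law: L = lambda * W, then divide by per-worker capacity."""
    in_flight = rps * latency_s  # average number of concurrent requests
    return math.ceil(in_flight / concurrency_per_worker)

# 1000 RPS at W = 0.2 s -> L = 200 concurrent requests.
# At an assumed 8 concurrent requests per GPU worker -> 25 workers.
print(required_workers(rps=1000, latency_s=0.2, concurrency_per_worker=8))  # 25
```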

The Batching Cost Function

Batching improves throughput but harms latency. We can define a cost function to find the optimal batch size $B$:

$$\mathrm{Cost}(B) = w_L \cdot \mathrm{Latency}(B) + w_T \cdot \frac{1}{\mathrm{Throughput}(B)}$$

Where:

  • $\mathrm{Latency}(B) \approx c_0 + c_1 \cdot B$ (simplified linear approx).
  • $\mathrm{Throughput}(B) = B / \mathrm{Latency}(B)$.
  • $w_L, w_T$ are weights based on business priority (e.g., Real-time user vs. Offline job).
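
A sketch of sweeping this cost function over candidate batch sizes. The constants $c_0$, $c_1$ and the weights here are illustrative assumptions; in practice they are measured per model and per hardware target.

```python
def cost(batch_size: int,
         c0: float = 0.01, c1: float = 0.002,   # assumed latency constants (s)
         w_latency: float = 1.0, w_throughput: float = 10.0) -> float:
    latency = c0 + c1 * batch_size     # simplified linear latency model
    throughput = batch_size / latency  # items per second at this batch size
    return w_latency * latency + w_throughput / throughput

# Sweep candidate batch sizes and pick the minimum-cost one (B = 7 here).
best = min(range(1, 65), key=cost)
print(best, round(cost(best), 4))
```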

Actionable Insight: Expose max_batch_size and batch_timeout parameters in your API configuration (or a dynamic batching sidecar like Triton) to tune this curve.

Asynchronous Patterns for Long-Running Operations (LROs)

For Generative AI (e.g., image generation) or Batch Processing, a synchronous 200 OK is an anti-pattern. Use the Polled Async Request-Reply pattern.

  1. Client POSTs request: POST /v1/jobs/generate
  2. Server accepts immediately: Returns 202 Accepted with a Location header pointing to a status endpoint.

```http
HTTP/1.1 202 Accepted
Location: /v1/jobs/12345/status
Retry-After: 5
```

  3. Client polls: GET /v1/jobs/12345/status returns {"status": "processing"}.
  4. Completion: Eventually returns 303 See Other (redirect to the result) or 200 OK with the payload.
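
A minimal polling client for this pattern, using the requests library. The endpoint paths come from the example above; the status field value is an assumption about the server's response body.

```python
import time
import requests

def submit_and_wait(base_url: str, payload: dict) -> dict:
    # Steps 1-2: submit the job; the server replies 202 with a Location header.
    resp = requests.post(f"{base_url}/v1/jobs/generate", json=payload)
    resp.raise_for_status()
    status_url = base_url + resp.headers["Location"]

    # Steps 3-4: poll, honoring Retry-After, until the job leaves "processing".
    while True:
        status = requests.get(status_url, allow_redirects=True)
        status.raise_for_status()
        body = status.json()
        if body.get("status") != "processing":
            return body  # done (303 redirects to the result were followed above)
        time.sleep(int(status.headers.get("Retry-After", 5)))
```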

Advanced Variation: Use Webhooks for the completion signal to avoid “chatty” polling if the job duration is highly variable (minutes to hours).

LLM Specifics: Evaluation & Feedback Loops

For LLMs, the API must support the Data Flywheel (using user data to create a continuous improvement cycle). You are not just serving predictions; you are harvesting data for future fine-tuning (e.g., RLHF).

The “Feedback” Endpoint

Every generation endpoint should return a request_id or trace_id. The API must have a companion endpoint to capture human feedback on that specific trace.
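
A sketch of such a companion endpoint. The route, field names, and storage behavior are hypothetical; the key design point is keying feedback to the trace_id returned by the generation call.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Feedback(BaseModel):
    trace_id: str           # returned by the original generation call
    rating: int             # e.g., -1 (thumbs down) or +1 (thumbs up)
    comment: str | None = None

@app.post("/v1/feedback", status_code=202)
def submit_feedback(fb: Feedback) -> dict:
    # In production: append to a feedback log joined to the original
    # prompt/response by trace_id, feeding the RLHF / fine-tuning pipeline.
    return {"accepted": True, "trace_id": fb.trace_id}
```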

Back to: ML & AI Index