Think twice before deploying your AI Agent PoC to Prod
What changes when input size changes from 10 to 10K? Here's a story of the pains I encountered at scale, and how you could avoid them.
Over the past eighteen months almost every data-science or platform team has built at least one “single-file” agent demo: drop in a PDF, get a summary, maybe run a retrieval-augmented chat. These experiments validated raw capability but hid three things that only show up in production: sustained throughput, failure handling and cost control.
Adoption numbers
51% of the 1,300 professionals surveyed by LangChain already run at least one agent in production, and 78% have concrete deployment plans. Mid-sized companies (100-2,000 employees) are the most aggressive, with 63% already live (source: langchain.com).
IBM’s June 2025 enterprise study found 76% of executives have an agentic proof-of-concept running but are now asking “how do we scale and govern this?” (source: IBM).
What breaks when you scale
Moving from one document to tens of millions, or from a single tool invocation to hundreds per minute, exposes three recurrent pain points:
Latency balloons from seconds to minutes
Root cause: Serial tool calls and synchronous I/O
Why the PoC missed it: Demo only processed one request at a time
Costs explode unpredictably
Root cause: Exponential token growth when agents recurse or retry
Why the PoC missed it: PoC ran on free-tier limits
Silent accuracy regressions
Root cause: No tracing or automated evals (Ref vol 1)
Why the PoC missed it: Manual eyeballing felt “good enough”
Let's take an example.
Picture a steady stream of 20 000 PDF documents flowing into your pipeline, each with more than ten pages and an average of one image per page. Every page must pass through OCR to harvest text, and each image needs separate extraction. After pulling this multimodal content, the workflow stitches the text and images back together, runs a custom summarisation step, and finally stores the results in your datastore, all while you juggle a very modest pool of CPU and GPU resources.
To understand how such a workload scales, we will estimate the total number of discrete processing nodes, project the compute time and power these nodes consume under realistic parallelism, and factor in retry overhead for inevitable failures. From there, we can outline the minimal yet resilient infrastructure, think task orchestrator, worker pools, message broker, observability stack, that keeps this many moving parts humming in production.
How many workflow nodes will be spun up?
A. OCR + layout on each page
Granularity: page-level
Workload: 20 000 PDFs × 10 pages ≈ 200 000 pages
Nodes needed: 200 000
B. Image extraction
Granularity: page-level (≈1 image per page)
Items: 200 000 pages
Nodes needed: 200 000
C. Concatenate text + images
Granularity: per PDF
Items: 20 000 PDFs
Nodes needed: 20 000
D. Summarise PDF
Granularity: per PDF
Items: 20 000 PDFs
Nodes needed: 20 000
E. Push to datastore
Granularity: per PDF
Items: 20 000 PDFs
Nodes needed: 20 000
Subtotal: 460 000 nodes
Retry buffer (≈ 2 % for network/OCR hiccups): ≈ 9 200 nodes
Grand total: ≈ 469 200 nodes
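The arithmetic above is easy to sanity-check with a few lines of Python, using the constants from the example workload:

```python
# Back-of-envelope node count for the 20,000-PDF pipeline described above.
PDFS = 20_000
PAGES_PER_PDF = 10
IMAGES_PER_PAGE = 1

pages = PDFS * PAGES_PER_PDF              # 200,000 pages
ocr_nodes = pages                         # A: OCR, page-level
image_nodes = pages * IMAGES_PER_PAGE     # B: image extraction, page-level
concat_nodes = PDFS                       # C: concatenate, per PDF
summarise_nodes = PDFS                    # D: summarise, per PDF
insert_nodes = PDFS                       # E: datastore insert, per PDF

subtotal = ocr_nodes + image_nodes + concat_nodes + summarise_nodes + insert_nodes
retry_buffer = int(subtotal * 0.02)       # ~2% headroom for network/OCR hiccups
grand_total = subtotal + retry_buffer

print(subtotal, retry_buffer, grand_total)  # 460000 9200 469200
```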
Computational Cost Assessment
OCR processing
Compute: ~0.79 s per page (CPU) → 158 000 s raw runtime
Parallelization: 8 vCPUs handle ≈10 pages / s
Wall-clock time: ≈ 5 ½ hours
Image extraction
Compute: ~60 ms per image → 12 000 s total
Runs concurrently with OCR, so the extra wall-clock time is negligible
Concatenation
Compute: <10 ms per document
Kicked off asynchronously; effectively hidden behind other work
Summarization
Compute: ~25.5 s per document on a single GPU → 142 h raw runtime
CPU fallback: ~73 h but at higher token cost
With one GPU, summarization dominates the schedule: ≈ 6 days!
Data insertion
Compute: ~50 ms per document
Buffered and streamed asynchronously; latency is masked
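A quick check on the wall-clock figures, assuming the stated per-item timings and parallelism:

```python
# Rough wall-clock estimates for the two dominant stages.
PAGES, PDFS = 200_000, 20_000

ocr_raw_s = PAGES * 0.79            # 158,000 s of raw CPU time
ocr_wall_h = PAGES / 10 / 3600      # 8 vCPUs ≈ 10 pages/s → ≈ 5.5 h wall clock
summarise_wall_h = PDFS * 25.5 / 3600  # one GPU, serial → ≈ 142 h wall clock

print(round(ocr_wall_h, 1), round(summarise_wall_h))  # 5.6 142
```

Summarisation outweighs OCR by more than 25×, which is why it dominates the schedule below.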
Optimization Strategies
Implementing GPU-accelerated OCR reduces CPU demands significantly.
Utilizing quantized or smaller-scale language models substantially decreases summarization durations.
Leveraging speculative decoding and batch processing enhances GPU throughput.
Adopting incremental summarization strategies reduces memory load and promotes parallel execution.
Employing spot instances mitigates expenses for stateless, computation-heavy processes.
Execution Strategy Examples
Baseline (32 CPUs + 1 GPU): approximately 148 hours (~6 days)
Optimized (32 CPUs + 2 GPUs or GPU OCR): approximately 48 hours
Budget-focused (8 CPUs only): approximately 95 hours (~4 days)
To summarise:
Expect ~4.7 × 10⁵ discrete workflow nodes.
Summarisation dictates total elapsed time.
One modest GPU often beats many CPU cores for LLM work, but only if you can batch or run speculative decoding; otherwise adding CPU workers is still competitive.
Keep the pipeline idempotent and checkpoint after every fan-in so partial failures don’t force a restart of earlier stages.
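One minimal way to get that idempotency is a checkpoint marker per (stage, document) pair, sketched here with local files; a real pipeline would keep this state in the datastore or the orchestrator, and the names here are illustrative:

```python
import json
import tempfile
from pathlib import Path

CHECKPOINT_DIR = Path(tempfile.mkdtemp())  # stand-in for durable storage

def run_once(stage: str, doc_id: str, fn):
    """Execute fn at most once per (stage, document); replays are no-ops."""
    marker = CHECKPOINT_DIR / f"{stage}_{doc_id}.json"
    if marker.exists():                      # already checkpointed: skip the work
        return json.loads(marker.read_text())
    result = fn()
    marker.write_text(json.dumps(result))    # checkpoint after the stage completes
    return result

first = run_once("summarise", "doc-42", lambda: {"summary": "v1"})
second = run_once("summarise", "doc-42", lambda: {"summary": "recomputed"})
print(second)  # {'summary': 'v1'} — the retry reused the checkpoint
```

After a partial failure, re-running the whole batch only redoes the documents whose markers are missing.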
So how do you solve this simply?
Effectively processing gigabyte-scale or high-document-count workloads means treating your agent pipeline like a miniature data platform rather than a single script. The following patterns have emerged as the most reliable ways to keep throughput high without astronomical costs.
1. Batch Processing APIs
Implement batch endpoints providing asynchronous job identifiers.
Utilize asynchronous callbacks or polling for result retrieval.
Incorporate adaptive throttling for back-pressure management.
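The submit-then-poll pattern can be sketched in-process with a queue and a job registry; this is a toy illustration, and a real service would put `submit_batch` behind an HTTP endpoint backed by a durable queue:

```python
import queue
import threading
import time
import uuid

jobs: dict = {}             # job_id -> {"status", "result"}
work_q: queue.Queue = queue.Queue()

def submit_batch(docs: list) -> str:
    """Batch endpoint: accept the work, return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending", "result": None}
    work_q.put((job_id, docs))
    return job_id

def worker():
    while True:
        job_id, docs = work_q.get()
        jobs[job_id]["result"] = [d.upper() for d in docs]  # stand-in for real processing
        jobs[job_id]["status"] = "done"
        work_q.task_done()

threading.Thread(target=worker, daemon=True).start()

job = submit_batch(["page one", "page two"])
while jobs[job]["status"] != "done":   # the client polls instead of holding a connection open
    time.sleep(0.01)
print(jobs[job]["result"])  # ['PAGE ONE', 'PAGE TWO']
```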
2. Intelligent Task Batching
Cluster tasks semantically to maximize embedding cache effectiveness.
Dynamically adjust batch sizes to resource constraints (e.g., via Ray Data, Torch DataLoader).
Employ predictive token estimation (e.g., TikToken) for optimized model usage.
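Token-aware batching can be as simple as greedy packing under a budget. The sketch below uses a crude characters-per-token heuristic as a stand-in; in production you would swap in a real tokenizer such as TikToken:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token); replace with a real tokenizer.
    return max(1, len(text) // 4)

def pack_batches(docs: list, max_tokens: int = 100) -> list:
    """Greedily pack documents into batches that stay under a token budget."""
    batches, current, used = [], [], 0
    for doc in docs:
        t = estimate_tokens(doc)
        if current and used + t > max_tokens:  # budget exceeded: start a new batch
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += t
    if current:
        batches.append(current)
    return batches

docs = ["a" * 200, "b" * 200, "c" * 40]   # ~50, ~50, ~10 estimated tokens
print([len(b) for b in pack_batches(docs, max_tokens=60)])  # [1, 2]
```

Keeping every batch under the model's context budget avoids both truncation errors and the retry storms they trigger.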
3. Advanced Scheduling Techniques
Utilize explicit Directed Acyclic Graph (DAG) structures for precise failure recovery (Airflow, Exosphere, Prefect, Argo).
Enable dynamic task generation to ensure accurate task tracking.
Leverage cost-effective scheduling through the intelligent allocation of spot instances.
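Dynamic task generation means the DAG is built per document at runtime rather than hard-coded. A minimal sketch of the pipeline's per-PDF graph, using only the standard library's `graphlib` (the task names are illustrative):

```python
from graphlib import TopologicalSorter

def build_dag(doc_id: str, n_pages: int) -> dict:
    """Generate the per-document task graph: node -> set of predecessors."""
    pages = [f"ocr:{doc_id}:{p}" for p in range(n_pages)]
    imgs = [f"img:{doc_id}:{p}" for p in range(n_pages)]
    dag = {t: set() for t in pages + imgs}          # fan-out: independent leaves
    dag[f"concat:{doc_id}"] = set(pages + imgs)     # fan-in on all pages + images
    dag[f"summarise:{doc_id}"] = {f"concat:{doc_id}"}
    dag[f"store:{doc_id}"] = {f"summarise:{doc_id}"}
    return dag

order = list(TopologicalSorter(build_dag("doc-1", 2)).static_order())
print(order[-3:])  # ['concat:doc-1', 'summarise:doc-1', 'store:doc-1']
```

Because each node is addressable by name, a failed `ocr:doc-1:7` can be retried alone instead of restarting the whole document.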
4. Effective Parallelization and Autoscaling
Implement distributed queuing mechanisms for automated scaling (Kubernetes HPA, AWS Batch, Azure Container Apps).
Clearly delineate GPU- and CPU-intensive tasks.
Employ concurrency controls to mitigate overload scenarios.
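The simplest concurrency control is a semaphore sized to the scarce resource. A sketch, assuming a hypothetical two-GPU budget and a sleep standing in for inference:

```python
import threading
import time

MAX_GPU_JOBS = 2                       # hypothetical number of GPU slots
gpu_slots = threading.BoundedSemaphore(MAX_GPU_JOBS)
peak, active = 0, 0
lock = threading.Lock()

def summarise(doc: str):
    global peak, active
    with gpu_slots:                    # blocks while all GPU slots are busy
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)               # stand-in for model inference
        with lock:
            active -= 1

threads = [threading.Thread(target=summarise, args=(f"doc-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAX_GPU_JOBS
```

The same gate works for any overload-prone dependency: model endpoints, OCR workers, or the datastore's write path.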
5. Comprehensive Observability and Fault Tolerance
Generate structured logs and metrics (via OpenTelemetry, Prometheus).
Establish isolation protocols for recurrent task failures.
Introduce budget monitoring checkpoints for proactive cost management.
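A budget checkpoint can piggyback on the structured logs you already emit. A minimal sketch; the budget, price, and function name are illustrative, and a real setup would route the JSON lines through OpenTelemetry or Prometheus:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

BUDGET_USD = 50.0                      # hypothetical per-run spending cap
spent = 0.0

def record_llm_call(doc_id: str, tokens: int, usd_per_1k: float = 0.002) -> bool:
    """Emit a structured log line; return False once the budget is exhausted."""
    global spent
    cost = tokens / 1000 * usd_per_1k
    spent += cost
    log.info(json.dumps({"doc": doc_id, "tokens": tokens, "cost_usd": round(cost, 4)}))
    return spent <= BUDGET_USD         # the scheduler stops dispatching on False

ok = record_llm_call("doc-1", tokens=120_000)
```

Checking the running total on every call, rather than once a day, is what turns a surprise invoice into a paused pipeline.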
Tools at a glance, by capability
Batch APIs
Existing Solutions: AWS SageMaker, Google AI Platform, vLLM KV cache
Comprehensive Integration with Exosphere: one place to run batch jobs across different models and API formats
Scheduling
Existing Solutions: Airflow, Prefect, Argo
Comprehensive Integration with Exosphere: built-in DAG plus dynamic task orchestration that fits agent-style workflows
Autoscaling
Existing Solutions: Kubernetes HPA, AWS Batch
Comprehensive Integration with Exosphere: resource-aware scaling tuned separately for GPUs and CPUs
Observability
Existing Solutions: Prometheus, Grafana, OpenTelemetry
Comprehensive Integration with Exosphere: agent-specific metrics pushed straight into the same dashboards you already use
We have finally crossed a threshold where AI agents are no longer weekend experiments but full-fledged production workloads moving terabytes of text, images, and embeddings every day. The patterns we covered (batch-first APIs, size-aware micro-batches, dynamic fan-out schedulers, and spot-friendly autoscaling) turn what used to be fragile, one-off scripts into resilient data factories. They also surface a new set of questions that teams rarely faced at smaller scale:
How do you budget for transient GPU spikes when token usage can quadruple overnight?
Which retry policy balances cost against accuracy when your upstream model provider silently throttles half your requests?
Where do you draw the line between “smart batching” and needlessly complex micro-batch orchestration?
These questions will shape the next wave of infrastructure tooling just as early CI/CD tools reshaped software delivery.
I would love to hear your war stories. What tactics brought your throughput from hundreds to millions of tokens per minute? Which tools saved the day, and which surprised you by becoming bottlenecks? Have you abandoned certain libraries entirely, or patched together bespoke glue code that still feels irreplaceable?



