<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Nikita Agarwal]]></title><description><![CDATA[AI Infra and startups]]></description><link>https://www.nikiagarwal.com</link><image><url>https://substackcdn.com/image/fetch/$s_!VxCP!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b6f277-d0a8-42d3-9efa-2e3ce31bf637_1280x1280.png</url><title>Nikita Agarwal</title><link>https://www.nikiagarwal.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 14:23:03 GMT</lastBuildDate><atom:link href="https://www.nikiagarwal.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Nikita Agarwal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[nikiagarwal@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[nikiagarwal@substack.com]]></itunes:email><itunes:name><![CDATA[Nikita Agarwal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nikita Agarwal]]></itunes:author><googleplay:owner><![CDATA[nikiagarwal@substack.com]]></googleplay:owner><googleplay:email><![CDATA[nikiagarwal@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nikita Agarwal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Skills vs Tools. Is it just packaging?]]></title><description><![CDATA[Every few months, there&#8217;s a new keyword that is buzzing AI town.]]></description><link>https://www.nikiagarwal.com/p/skills-vs-tools-is-it-just-packaging</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/skills-vs-tools-is-it-just-packaging</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Mon, 29 Dec 2025 07:09:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VxCP!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b6f277-d0a8-42d3-9efa-2e3ce31bf637_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every few months, there&#8217;s a new keyword that is buzzing AI town. Started with llms, then it was prompts, evals, agents, the MCP, A2A and most recently the new entrant: SKILLS.</p><p>First heard this floating around October when I was deep into the pit of legal stuff for my business, without bandwidth for the leisure of curiosity and I dismissed it as another hot word for tools: - an external tool calling interface to extend model capabilities.</p><p>The slowness of the New Year gifted me the privilege of reading and I found myself digging into announcements, repos and projects using Skills, was pleasantly surprised to find it is more than a rebrand of the MCP tool.</p><p>Yes, tools have been inherently flawed. Bloated contexts, models getting confused, a wholly separate scaffolding to make them WORK. From auth, to orchestration. The efforts we have put to make them work, only to have at max 4-5 tools that can be exposed at a time to a model, lest the possibility of model confusion!</p><p>We are yet to see the true protocol of communication for agentic systems. 
It mostly won&#8217;t even be English.</p><p>But Skills do take a good bite out of the flaws of the simple tooling design introduced by MCP.</p><p>When a new member joins any organisation, we have DOCUMENTATION prepped for their onboarding. The more the merrier; the less, the more hallucinatory. The same holds for artificially intelligent team members.</p><p>But we have hands; they do not. Along with a brain, you need execution, and the two need to share brainspace: a brain tuned for performing actions, not independent of them.</p><p>Thus the introduction of Skills.</p><p>Alongside the brainpower of a model, you have now extended it with the power of true execution with instruction.</p><p>Provide instructions, and the model layer between the user and the model can figure out what to do, and do it with a deterministic execution runtime in a sandboxed environment.</p><p>With tools, models would ask the user to execute things for them and wait on the results. With skills, the model layer comes as a package with execution built in. For the model layer to execute effectively, it must first be able to discover what it can execute, instead of being bombarded with an arsenal of tools whose relations and boundaries it never fully understands.</p><p>Thus, the skills folder: a tree-like organisation with exploratory prompt formation, coupled with a skill execution runtime.</p>
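<p><em>A rough sketch of what that folder-plus-runtime shape could look like. The skills/ layout, the idea of reading only the top of each SKILL.md, and every function name here are my own illustration, not any vendor&#8217;s spec:</em></p><pre><code class="language-python">
# Illustrative sketch only: assume a skills/ tree where each skill folder holds a
# SKILL.md (name and description up front) and a run.py entry point that a
# sandboxed runtime can execute deterministically.
from pathlib import Path
import subprocess

SKILLS_DIR = Path("skills")

def discover_skills() -> list[dict]:
    """Read just the top of each SKILL.md so the catalog stays small in context."""
    catalog = []
    for skill_md in sorted(SKILLS_DIR.glob("*/SKILL.md")):
        header = " ".join(skill_md.read_text().splitlines()[:5])
        catalog.append({"name": skill_md.parent.name, "summary": header})
    return catalog

def skills_prompt(catalog: list[dict]) -> str:
    """Exploratory prompt: tell the model what it could do, not how to do it."""
    lines = ["You can invoke these skills by name:"]
    lines += [f"- {skill['name']}: {skill['summary']}" for skill in catalog]
    return "\n".join(lines)

def run_skill(name: str, *args: str) -> str:
    """Deterministic execution step; in practice this runs inside a sandbox."""
    result = subprocess.run(
        ["python", str(SKILLS_DIR / name / "run.py"), *args],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout
</code></pre>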
<p>Is this just packaging or functionality? I would say the latter.</p><p>It enables me to be lazier: create onboarding docs/scripts and start asking for things to be done, instead of plumbing out the runtime and thinking thrice before exposing capabilities.</p><p>With Codex recently also pushing Skills into the hat of &#8216;languages&#8217; understood by the OpenAI ecosystem, I see a possible new pattern for achieving the meta goal of non-deterministically reliable systems: known working patterns are baked in, while reasoning and execution sit entirely with a model that understands the task end to end, instead of just a single question and answer.</p><p>Bake in memory and self-learning, and you are much closer to a system that mimics how humans attempt work.</p><div><hr></div><p><em>What do you think? </em></p><p><em><br>What would be a good design to take this to production from local? </em></p><p><em>What about opensource models? Without such platform benefits, are they missing out?</em></p><p><em>Are there any differences at all between skills and tools?</em><br></p>]]></content:encoded></item><item><title><![CDATA[Where Is the Next.js for AI Workflows?]]></title><description><![CDATA[Every technology wave eventually produces a framework that crystallizes best practices, hides away the ugly parts, and lets developers move faster than they thought possible.]]></description><link>https://www.nikiagarwal.com/p/where-is-the-nextjs-for-ai-workflows</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/where-is-the-nextjs-for-ai-workflows</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Sun, 05 Oct 2025 05:58:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Zgrz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577a2191-68f9-44ef-acca-d425e633f589_940x624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every technology wave eventually produces a framework that crystallizes best practices, hides away the ugly parts, and lets developers move faster than they thought possible. For the web, we saw WordPress turn the act of publishing into something anyone could do. We saw HTML standardize the primitives of the browser. 
And more recently, we saw Next.js reshape frontend development into something opinionated, integrated, and production-ready.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Zgrz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577a2191-68f9-44ef-acca-d425e633f589_940x624.png" alt=""></figure></div><p>We are now watching a similar story unfold in artificial intelligence. The applications are here: copilots, recommendation engines, AI assistants, multi-agent workflows that coordinate tools on behalf of users. The infrastructure is here too: GPUs, TPUs, vector databases, orchestration layers, workflow engines. And yet, the developer experience of putting these pieces together feels awkward, messy, and inconsistent. It is as if we are still hand-rolling PHP blogs before WordPress, or manually wiring React components before Next.js arrived.</p><p>The truth is that AI still lacks its Next.js moment. We have plenty of powerful libraries and promising frameworks, but none of them yet gives developers the seamless, opinionated defaults that make building distributed AI workflows as natural as building a modern web app. The absence is striking because the need is obvious. Building a serious AI system today means stitching together multiple tools: an orchestration engine like Ray or Temporal, a chaining framework like LangChain or LlamaIndex, a message queue like Kafka or Redis, an inference runtime like vLLM or TensorRT, and various bits of glue code for retries, scaling, and observability. Each component is strong on its own, but together they feel more like a collection of raw materials than a framework.</p><p>This is where the analogy to earlier waves of software is useful. Think about what WordPress did: it democratized publishing by abstracting away complexity, at the cost of some flexibility. Think about HTML: it defined universal primitives so that browsers and developers could speak the same language. Think about Next.js: it did not try to solve everything, but it standardized the 80 percent of patterns developers needed such as routing, server-side rendering, APIs, and deployment. It gave developers a coherent, opinionated way to work. </p><p>That is exactly what is missing for AI workflows. We have tools like n8n that play the role of WordPress, letting people drag and drop pipelines visually. We have base languages and libraries like PyTorch and JAX that play the role of HTML, giving us primitives for expressing computation. 
But what we don&#8217;t have is the Next.js layer: a framework that takes the complexity of distributed AI workflows and wraps it into a set of defaults and conventions that just work.</p><p>Why does this matter? Because without it, developers are spending too much time on infrastructure problems instead of application problems. When you want to build an AI agent that scrapes documents, summarizes them, and posts results to a channel, you should not need to be an expert in distributed consensus or checkpointing. You should not have to wire up retries manually or tune autoscaling policies by hand. You should be able to declare what the workflow is, and let the framework take care of the infrastructure details. Just as Next.js made deployment, routing, and rendering invisible, the missing AI framework should make distributed execution, autoscaling, retries, and observability invisible.</p><p>This is more than convenience, it is about resilience. The difference between a toy demo and a production system is not just accuracy, it is durability. AI workflows fail constantly. APIs time out. GPUs become unavailable. Inputs arrive in bursts. If you want to run a serious multi-agent architecture, you need autoscaling as a default. You need retries as a default. You need checkpointing as a default. These should not be optional extras. They should be embedded assumptions, invisible to the developer unless they choose to override them. Without these defaults, every team ends up reinventing the same reliability mechanisms, often imperfectly. With them, the ecosystem can move faster and focus on the unique parts of their applications rather than the generic challenges of infrastructure.</p><p>There are promising directions. LangGraph is trying to make agent workflows declarative, representing them as graphs of nodes that can recover from interruptions. Ray is giving us the substrate for distributed execution, letting workloads stretch elastically across heterogeneous clusters. Temporal is providing durable workflows with retries and resumability. Exosphere is building AI-first orchestration with distributed pipelines treated as native constructs rather than special cases. What makes Exosphere interesting is that it is not just borrowing patterns from traditional workflow engines but rethinking orchestration for the AI era. It assumes that LLM workloads are bursty, that they need parallelism at scale, that checkpoints and recovery are non-negotiable, and that developers want a declarative way to define their workflow graphs without wading through Kubernetes manifests. It is early, but it points toward the type of opinionated defaults that could evolve into a Next.js-style framework for AI systems.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/where-is-the-nextjs-for-ai-workflows/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/where-is-the-nextjs-for-ai-workflows/comments"><span>Leave a comment</span></a></p><p>Still, none of these efforts yet provides the integrated developer experience layer that turns powerful primitives into a coherent whole. They are closer to React before Next.js: flexible, expressive, but leaving too much wiring in the hands of developers. 
What is missing is the unifying layer that integrates these ideas into a single workflow-first experience.</p><p>If you look closely, you&#8217;ll see why the analogy to web frameworks is so apt. Early web development required manually managing state, routing, and server logic. Today&#8217;s AI development requires manually managing state persistence, task queues, and distributed scaling. Early web frameworks removed boilerplate by introducing conventions such as folder structures, default rendering strategies, and automatic bundling. The missing AI framework must do the same: define conventions for agent graphs, standardize error handling, bake in monitoring, and make distributed execution an assumption. Developers should not wonder whether their summarization node will autoscale, they should expect that it will. They should not write retry logic for API calls, they should assume that retries with backoff are built in.</p><p>It is worth noting that low-code tools like n8n or Zapier for AI are not the answer here. They play an important role, just as WordPress still does for websites, but they are not enough for developers building production-grade systems. Professional engineers want composability, performance, and reliability. They want to live close to code, but not so close that they drown in Kubernetes manifests or custom orchestration. They want what Next.js gave to the web: sensible defaults, escape hatches when necessary, and an integrated path from development to deployment.</p><p>The reason this framework does not exist yet is partly cultural. AI is still in its experimental phase, where new architectures, models, and workflows are being invented weekly. Frameworks harden conventions, and conventions require stability. But the demand is growing. As more applications move from proof-of-concept to production, the pain of hand-rolled infra is becoming acute. Just as developers grew tired of configuring webpack, they are now growing tired of wiring together distributed systems for every AI workflow. At some point, someone will ship a framework that says: here is the obvious way to do it. And once that happens, the ecosystem will rally.</p><p>The framework does not need to cover every corner case. It does not need to replace every orchestration engine or inference runtime. It just needs to capture the common patterns such as data ingestion, retrieval, summarization, classification, and multi-step reasoning, and make them trivial. It needs to wrap infrastructure concerns into conventions, so developers can think about the flow of information and logic rather than the plumbing beneath. It needs to define not just APIs but assumptions, the same way Next.js assumed you wanted file-based routing and server-side rendering until you told it otherwise.</p><p>The strategic stakes are high. The team that creates this framework will not just ship another library. They will define the developer experience of AI for the next decade. They will be the ones who decide what default means for distributed agent workflows. They will influence not just how engineers code, but how infrastructure evolves. Cloud providers will optimize for the framework&#8217;s conventions. Monitoring tools will adapt to its abstractions. Developers will standardize on its patterns. 
That is what frameworks do when they hit at the right moment: they stop being optional and start being the air everyone breathes.</p><p>In hindsight, we will look back at this era of AI systems, the glue code, the manual retries, the hand-tuned autoscaling policies, the fragmented orchestration, the scattered observability, and we will see it as necessary but temporary. Just as the web moved from static HTML to frameworks that embedded production reality into developer defaults, AI will move from stitching together disparate components to building on rails. The frameworks are coming. The only question is who will ship the Next.js of AI workflows first, and how quickly the ecosystem will converge once it does.</p><p>Until then, building AI systems will remain partly an act of infrastructure engineering, partly an act of application design. But the future is clear. </p><p>Just as Next.js gave developers rails for the web, the missing AI framework will give us rails for distributed agents and workflows. And when it arrives, it will not just make developers faster. </p><p>It will change what kinds of AI applications are possible, because the invisible tax of infrastructure will finally be lifted.</p><p><em>Happy building!</em></p><p><em>-Nikita Ag</em></p>]]></content:encoded></item><item><title><![CDATA[A swarm of SLMs vs an LLM]]></title><description><![CDATA[Another chapter in the small vs mighty]]></description><link>https://www.nikiagarwal.com/p/a-swarm-of-slms-vs-an-llm</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/a-swarm-of-slms-vs-an-llm</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Mon, 08 Sep 2025 17:41:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pO7P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32ea50b0-24fa-4b14-a45a-afa82dd5d6d0_3840x1076.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!TYJq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70722e7a-a676-4bed-87cb-9518d7dea0f1_1200x591.png" alt=""></figure></div>
13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Small things often move the biggest mountains. Ants build cities, raindrops carve canyons, tiny services power the apps we use all day.</p><p>Small language models fit that same pattern. They are fast, frugal, and easy to place close to users or data. In the right roles they punch far above their size, and ignoring them now would be like ignoring microservices when monoliths felt inevitable.</p><p>Large language models still shine at broad reasoning and open-ended synthesis, but a swarm of small models can collaborate, specialize, and route work with tight control over cost, latency, and privacy. This essay looks at how SLMs and LLMs complement each other, why a many-small approach changes what you can build, and how to design systems where the small parts add up to something surprisingly powerful.</p><p>The past year has been fascinating to watch. Everyone's been obsessing over the latest frontier models GPT-4, Claude, Gemini while quietly, a different conversation has been brewing in production environments. Teams are discovering that smaller language models aren't just budget alternatives; they're often the better choice.</p><p>I've been working with teams deploying both small and large models at scale, and the pattern is clear: the decision isn't about settling for less capability. It's about matching the right tool to the job. And increasingly, that tool is a small language model.</p><p>Let's talk about when, why, and how to make this choice.</p><h3><strong>Defining SLMs vs LLMs</strong></h3><p>The lines aren't as clear as you might think. It's not just about parameter count it's about deployment philosophy, operational constraints, and what you're trying to achieve.</p><p><strong>Small Language Models (SLMs):</strong></p><ul><li><p>0.1B to 7B parameters</p></li><li><p>Run on edge devices, mobile, single GPUs</p></li><li><p>Sub-50ms latency typical</p></li><li><p>Examples: Phi-3 Mini, Llama 3.2, Gemma 2B</p></li></ul><p><strong>Large Language Models (LLMs):</strong></p><ul><li><p>13B+ parameters</p></li><li><p>Require cloud infrastructure, GPU clusters</p></li><li><p>100ms to 5s+ latency</p></li><li><p>Examples: GPT-4, Claude 3.5, Llama 70B</p></li></ul><p>The interesting boundary is around 7B parameters. That's where deployment constraints kick in hard memory limits, quantization effectiveness, mobile viability. It's not arbitrary; it's where physics meets practicality.</p><h3><strong>The Easy Route: Just Use LLMs</strong></h3><p>Let's be honest starting with LLMs is the path of least resistance. Call OpenAI's API, get great results, ship fast. 
For most teams getting started, this makes perfect sense.</p><p>LLMs give you:</p><ul><li><p><strong>Broad capability</strong>: Handle almost any task reasonably well</p></li><li><p><strong>Zero infrastructure</strong>: Someone else's problem</p></li><li><p><strong>Rapid prototyping</strong>: From idea to demo in hours</p></li><li><p><strong>Continuous improvement</strong>: Models get better without you doing anything</p></li></ul><p>The math is simple: $0.01-0.10 per 1K tokens, no infrastructure headaches, predictable scaling. For many applications, this is the right choice and you should stop here.</p><p>But if you're processing millions of requests, need sub-100ms latency, have strict privacy requirements, or want to deploy to edge devices, the economics change quickly.</p><h3><strong>Why SLMs? The Compelling Scenarios</strong></h3><p>SLMs are not just cheaper LLMs. They invite different system designs. Benchmark scores alone hide how they reshape latency budgets, privacy posture, and total cost of ownership. The architectural shift is the story: place a capable small model close to the data and the user, and entire classes of experiences stop feeling laggy, leaky, or overpriced.</p><h3><strong>Latency-Critical Applications</strong></h3><p>Latency is the first unlock. Some interactions cannot wait half a second. Real-time voice agents feel human only when responses arrive inside a breath. Coding assistants preserve flow when completions appear faster than a keystroke pause. Simultaneous translation should track a speaker, not a transcript. Even humble edge workloads like sensor triage need decisions inside a control cycle. In these loops an SLM is not a compromise. It is often the only path to sub-50 ms without exotic caching.</p><p>Some applications simply can't wait 500ms for a response:</p><ul><li><p><strong>Real-time conversation</strong>: Voice assistants, gaming NPCs</p></li><li><p><strong>Interactive coding</strong>: Code completion that doesn't break flow</p></li><li><p><strong>Live translation</strong>: Simultaneous interpretation</p></li><li><p><strong>IoT processing</strong>: Sensor data analysis at the edge</p></li></ul><h3><strong>Privacy-First Deployments</strong></h3><p>Privacy is the next one. When data stays on a device or inside a facility, on-device SLMs keep the entire interaction local. A bedside clinical tool that parses notes, an internal finance screener that classifies transactions, a legal reviewer that summarizes privileged documents, a truly personal assistant that never calls home. Zero transmission by default. Fewer hard vendor dependencies because the boundary is physical rather than contractual.</p><p>When data can't leave the device or premises:</p><ul><li><p><strong>Healthcare</strong>: Patient records processing</p></li><li><p><strong>Finance</strong>: Transaction analysis</p></li><li><p><strong>Legal</strong>: Document review</p></li><li><p><strong>Personal assistants</strong>: Truly private AI</p></li></ul><h3><strong>Cost at Scale</strong></h3><p>Then comes cost at scale. The math is dull and decisive. If an SLM runs around 0.0001 to 0.001 dollars per 1K tokens and an LLM sits near 0.01 to 0.10, the gap is two to three orders of magnitude. At ten million tokens a month the spend is roughly 1 to 10 dollars for an SLM versus about 100 to 1000 dollars for an LLM. That delta buys observability, canaries, evals, and a thick layer of QA. 
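</p><p><em>The arithmetic behind that claim is easy to sanity-check. A back-of-the-envelope sketch, using the per-1K-token prices quoted above purely as illustrative assumptions:</em></p><pre><code class="language-python">
# Back-of-the-envelope monthly spend at the per-1K-token prices quoted above.
TOKENS_PER_MONTH = 10_000_000

def monthly_cost(low_per_1k: float, high_per_1k: float) -> tuple:
    thousands_of_tokens = TOKENS_PER_MONTH / 1_000
    return low_per_1k * thousands_of_tokens, high_per_1k * thousands_of_tokens

print(monthly_cost(0.0001, 0.001))  # SLM: roughly $1 to $10 per month
print(monthly_cost(0.01, 0.10))     # LLM: roughly $100 to $1,000 per month
</code></pre>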
<p>At large volumes, economics become architecture.</p><p>The economics flip when you're processing high volumes:</p><ul><li><p>SLM: $0.0001-0.001 per 1K tokens</p></li><li><p>LLM: $0.01-0.10 per 1K tokens</p></li></ul><p>At 10M tokens/month, you're looking at $1-10 vs $100-1K. The difference pays for a lot of infrastructure.</p><h3><strong>Edge Computing</strong></h3><p>Edge computing ties it together. Put the model where data is born, and mobile apps work without a network, manufacturing lines run quality checks in real time, autonomous systems make split-second calls, and remote deployments keep operating through poor connectivity. This is not only about saving bandwidth. It reduces failure modes, removes round trips, and turns AI from a request into a capability.</p><p>Running AI where the data lives:</p><ul><li><p><strong>Mobile apps</strong>: No network dependency</p></li><li><p><strong>Manufacturing</strong>: Real-time quality control</p></li><li><p><strong>Autonomous systems</strong>: Split-second decisions</p></li><li><p><strong>Remote locations</strong>: Limited connectivity</p></li></ul><h3><strong>So how to choose?</strong></h3><p>Choosing between SLMs and LLMs comes down to four things. First, task complexity. SLMs excel at classification, routing, structured extraction, templated responses, and tightly scoped domain tasks, especially with light fine-tuning. LLMs win when multi-step reasoning, broad knowledge synthesis, open-ended creativity, or robust few-shot generalization across messy inputs is required.</p><p>Second, latency. Under 50 ms tends to be SLM territory. Between 100 ms and one second depends on other constraints. Beyond a second, LLMs are usually acceptable.</p><p>Third, privacy and compliance. On-device or on-prem SLMs provide zero data transmission by default, which simplifies regulatory conversations and builds trust while avoiding dependency on a single provider&#8217;s data path.</p><p>Fourth, the economic model. SLMs favor high-volume simple tasks, predictable costs, and long-term steady deployments. LLMs favor bursty or rare workloads, complex infrequent tasks, and situations where value per request dwarfs cost.</p><h3><strong>How-To: Building with SLMs</strong></h3><p>Building with SLMs benefits from a different playbook than working with large models. The goal is simple: pick the smallest model that clears the quality bar for the job, shape the memory and inference path so it stays fast and cheap, then add routing so bigger models are used only when there is real uncertainty or complexity.</p><h3><strong>1. Model Selection</strong></h3><p><strong>For general tasks:</strong></p><ul><li><p>Phi-3 Mini (3.8B): Best balance of capability and efficiency</p></li><li><p>Llama 3.2 (3B): Strong performance, good ecosystem</p></li><li><p>Gemma 2B: Lightweight, good for mobile</p></li></ul><p><strong>For specific domains:</strong></p><ul><li><p>Code: CodeLlama 7B, StarCoder</p></li><li><p>Math/Reasoning: DeepSeek-Math, Llama specialized variants</p></li><li><p>Multilingual: mGLM, multilingual Llama variants</p></li></ul>
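<p><em>To make the shortlist concrete, here is a minimal sketch of running one of these small models locally with the Hugging Face transformers stack; the model id, prompt and generation settings are illustrative assumptions, not a recommendation:</em></p><pre><code class="language-python">
# Minimal local-inference sketch; assumes transformers, torch and accelerate are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed Hub id for Phi-3 Mini
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory versus FP32
    device_map="auto",          # place on GPU if available, otherwise CPU
)

prompt = "Classify the sentiment of this review: great battery, awful screen."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
</code></pre>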
<h3><strong>2. Optimization Techniques</strong></h3><p><strong>Memory optimization:</strong></p><ul><li><p>Gradient checkpointing</p></li><li><p>Mixed precision training</p></li><li><p>Parameter-efficient fine-tuning (LoRA, AdaLoRA)</p></li></ul><p><strong>Inference optimization:</strong></p><ul><li><p>Dynamic batching</p></li><li><p>KV cache optimization</p></li><li><p>Speculative decoding</p></li></ul><h3><strong>Smart Routing: The Best of Both Worlds</strong></h3><p>The most sophisticated systems don't choose between SLMs and LLMs; they use both intelligently.</p><h3><strong>Cascade Routing</strong></h3><p>Start with SLM, escalate to LLM when needed:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RqdT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e6a832-cc34-4cad-8767-cfca9c078020_3840x738.png" alt=""></figure></div><p>A cascade pattern handles the majority case with an SLM and escalates only when needed. The request hits the SLM first, the system checks its confidence score against a threshold around 0.8, and if the bar is met the response returns in something like 45 milliseconds. If not, the same request moves to an LLM that may take closer to 800 milliseconds but delivers the depth required. Latency stays low for the common path and quality stays high for the hard cases.</p>
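<p><em>A minimal sketch of that cascade. It assumes each model call returns its text plus some confidence signal (a verifier score, mean token logprob, or similar); the 0.8 threshold is just the illustrative value from above:</em></p><pre><code class="language-python">
# Cascade routing sketch: try the small model first, escalate on low confidence.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    confidence: float  # however you score it: verifier model, logprobs, heuristics

def cascade(prompt: str,
            slm: Callable[[str], Answer],
            llm: Callable[[str], Answer],
            threshold: float = 0.8) -> Answer:
    draft = slm(prompt)                # fast path: tens of milliseconds
    if draft.confidence >= threshold:  # confident enough, return immediately
        return draft
    return llm(prompt)                 # escalate the hard cases to the large model
</code></pre>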
<h3><strong>Parallel Processing</strong></h3><p>Run both, take the first good result:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pO7P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32ea50b0-24fa-4b14-a45a-afa82dd5d6d0_3840x1076.png" alt=""></figure></div><p>A racing pattern trades a little extra compute for predictability. An SLM and an LLM start in parallel. If the SLM produces a candidate inside 100 milliseconds, a quick quality gate evaluates it. Passing the gate cancels the LLM and returns the fast result. Failing the gate lets the LLM finish and return its answer. This keeps p95 and p99 latencies tight while preserving quality on the edge cases that truly need more capacity.</p>
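<p><em>A sketch of the race with asyncio; the 100 ms budget and the quality gate are placeholders for whatever your application actually uses:</em></p><pre><code class="language-python">
# Racing sketch: start both, keep the SLM answer only if it is fast and good enough.
import asyncio

async def race(prompt, slm_call, llm_call, passes_gate, slm_budget_s=0.1):
    slm_task = asyncio.create_task(slm_call(prompt))
    llm_task = asyncio.create_task(llm_call(prompt))
    try:
        draft = await asyncio.wait_for(slm_task, timeout=slm_budget_s)
        if passes_gate(draft):
            llm_task.cancel()      # fast result won: stop paying for the LLM call
            return draft
    except asyncio.TimeoutError:
        pass                       # SLM missed its budget; fall through to the LLM
    return await llm_task          # hard or slow cases get the LLM answer
</code></pre>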
srcset="https://substackcdn.com/image/fetch/$s_!pO7P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32ea50b0-24fa-4b14-a45a-afa82dd5d6d0_3840x1076.png 424w, https://substackcdn.com/image/fetch/$s_!pO7P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32ea50b0-24fa-4b14-a45a-afa82dd5d6d0_3840x1076.png 848w, https://substackcdn.com/image/fetch/$s_!pO7P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32ea50b0-24fa-4b14-a45a-afa82dd5d6d0_3840x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!pO7P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32ea50b0-24fa-4b14-a45a-afa82dd5d6d0_3840x1076.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>A racing pattern trades a little extra compute for predictability. An SLM and an LLM start in parallel. If the SLM produces a candidate inside 100 milliseconds, a quick quality gate evaluates it. Passing the gate cancels the LLM and returns the fast result. Failing the gate lets the LLM finish and return its answer. 
<h3><strong>Best SLMs Right Now</strong></h3><p>The landscape is moving fast, but here are the current standouts:</p><h3><strong>General Purpose</strong></h3><ol><li><p><strong>Phi-3 Mini (3.8B)</strong>: Microsoft's efficiency champion</p></li><li><p><strong>Llama 3.2 (3B)</strong>: Meta's balanced approach</p></li><li><p><strong>Gemma 2B</strong>: Google's lightweight option</p></li></ol><h3><strong>Specialized Models</strong></h3><ol><li><p><strong>Code</strong>: CodeLlama 7B, StarCoder 3B</p></li><li><p><strong>Math</strong>: DeepSeek-Math 7B</p></li><li><p><strong>Multilingual</strong>: mGLM 6B</p></li><li><p><strong>Conversational</strong>: Vicuna 7B, Alpaca variants</p></li></ol><h3><strong>Emerging Contenders</strong></h3><ol><li><p><strong>Qwen 2.5</strong>: Strong reasoning capabilities</p></li><li><p><strong>StableLM</strong>: Stability AI's latest</p></li><li><p><strong>TinyLlama</strong>: Ultra-lightweight at 1.1B parameters</p></li></ol>
<p>Realistically, most teams should start with LLMs and only move to SLMs when they hit specific constraints:</p><ul><li><p><strong>Latency</strong>: Can't wait 200ms+ for responses</p></li><li><p><strong>Privacy</strong>: Data can't leave premises</p></li><li><p><strong>Cost</strong>: Processing millions of requests monthly</p></li><li><p><strong>Connectivity</strong>: Need offline capability</p></li></ul><p>If none of these apply, stick with LLMs. 
The operational complexity of SLMs (model selection, fine-tuning, deployment, monitoring) isn't worth it unless you have a compelling reason.</p><p>But when you do have that reason, SLMs can be transformational. They enable applications that simply aren't possible with cloud-based LLMs: truly private AI, real-time interaction, offline operation, and cost-effective scale.</p><p>The future isn't SLMs vs LLMs; it's intelligent orchestration of both. The teams building that capability now will have a significant advantage as AI becomes infrastructure.</p><p>What's your experience been with SLMs in production? Are you seeing the same patterns, or different constraints driving your decisions?</p><div><hr></div><p><em>Thanks for reading,</em></p><p><em>Nikita Agarwal, AI Infra Weekly</em></p>]]></content:encoded></item><item><title><![CDATA[More GPUs Won't Save You]]></title><description><![CDATA[The Memory Wall Crisis That's Reshaping AI Infrastructure]]></description><link>https://www.nikiagarwal.com/p/more-gpus-wont-save-you</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/more-gpus-wont-save-you</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Tue, 02 Sep 2025 03:47:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ji02!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ji02!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png" alt=""></figure></div>
https://substackcdn.com/image/fetch/$s_!ji02!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ji02!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png" width="1456" height="1122" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1122,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:333943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/172465670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ji02!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png 424w, https://substackcdn.com/image/fetch/$s_!ji02!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png 848w, https://substackcdn.com/image/fetch/$s_!ji02!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png 1272w, https://substackcdn.com/image/fetch/$s_!ji02!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedbd1518-00de-49c8-807a-8608bf3f7e9b_3840x2960.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><code> </code></p><p>The bottleneck has shifted. While everyone's been obsessing over compute scaling more GPUs, bigger clusters, faster interconnects a quieter crisis has been building in the shadows. We're hitting the memory wall in AI inference, and it's fundamentally changing how we need to think about serving large language models at scale.</p><p>Having worked on inference infrastructure at scale, I've watched this transition happen in real-time. The symptoms are everywhere: GPU utilization dropping despite queues backing up, latency spikes that don't correlate with compute load, and throughput plateaus that no amount of additional hardware seems to break through.</p><p>The math is unforgiving, and it's time we talked about it.</p><h2><strong>The Fundamental Problem</strong></h2><p>Modern LLMs are memory-bound, not compute-bound. This isn't immediately obvious because training workloads which dominate the narrative are the opposite. Training saturates compute with dense matrix multiplications across massive batches. Inference, particularly for generation tasks, tells a different story entirely.</p><p>Consider the memory requirements for serving a 70B parameter model:</p><ul><li><p><strong>Model weights</strong>: 70B parameters &#215; 16 bits = 140 GB (FP16)</p></li><li><p><strong>KV cache per sequence</strong>: ~2 GB for 4K context (depending on model architecture)</p></li><li><p><strong>Activation memory</strong>: Variable, but typically 1-4 GB during forward pass</p></li></ul><p>For a single H100 with 80GB HBM, you're already at the limit with just the model weights. Add concurrent sequences with their KV caches, and you quickly understand why memory bandwidth not FLOPS determines your serving capacity.</p><p>Let's formalize this. For autoregressive generation, each token requires:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Memory Access} = W + KV_{\\text{read}} + KV_{\\text{write}} + A&quot;,&quot;id&quot;:&quot;SVFQBKXXIW&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Where:</p><ul><li><p>W = Model weights read (constant per token)</p></li><li><p>KVread = Key-value cache read (grows with sequence length)</p></li><li><p>KVwrite = Key-value cache write (new key-value pairs)</p></li><li><p>A = Activation memory (varies by layer)</p></li></ul><p>The key insight: W dominates early in generation, but as sequences grow longer, KV the bottleneck. 
This creates a non-linear relationship between sequence length and memory pressure that most serving systems handle poorly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DJ5q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DJ5q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 424w, https://substackcdn.com/image/fetch/$s_!DJ5q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 848w, https://substackcdn.com/image/fetch/$s_!DJ5q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 1272w, https://substackcdn.com/image/fetch/$s_!DJ5q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DJ5q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png" width="1456" height="845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/172465670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DJ5q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 424w, https://substackcdn.com/image/fetch/$s_!DJ5q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 848w, https://substackcdn.com/image/fetch/$s_!DJ5q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 1272w, https://substackcdn.com/image/fetch/$s_!DJ5q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5324732-48e7-4b63-bb36-0f77cdf539a0_3840x2229.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><em>Memory access components during autoregressive generation. Note how KV cache reads dominate as sequence length increases.</em></p><h2><strong>The Arithmetic of Memory Bandwidth</strong></h2><p>Let's work through a concrete example. An H100 has roughly 3 TB/s of memory bandwidth. For a 70B parameter model at FP16:</p><ul><li><p><strong>Weight access per token</strong>: 140 GB</p></li><li><p><strong>Theoretical max tokens/sec</strong>: 3,000 GB/s &#247; 140 GB = ~21 tokens/sec</p></li></ul><p>This is the absolute ceiling for a single sequence, assuming perfect memory access patterns and zero overhead. In practice, you get maybe 60-70% of this due to:</p><ul><li><p>Memory access inefficiencies</p></li><li><p>KV cache overhead growing with sequence length</p></li><li><p>Attention computation patterns</p></li><li><p>System overhead</p></li></ul><p>So realistic peak throughput is closer to 12-15 tokens/sec for long sequences on a single H100. Now consider batching multiple sequences:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Effective Bandwidth per Sequence} = \\frac{\\text{Total Bandwidth}}{N + \\alpha \\cdot \\text{KV Overhead}}&quot;,&quot;id&quot;:&quot;FDATUSPWHZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where N<em> </em>is batch size and &#945; represents the KV cache access pattern efficiency. As batch size increases, the weight reads are amortized across sequences, but KV cache pressure grows quadratically with both batch size and sequence length.</p><p>This creates an optimization surface that most serving frameworks haven't properly mapped. The sweet spot isn't just about maximizing batch size it's about finding the optimal balance between batch size, sequence length, and memory access patterns.<code> </code></p><h2><strong>Why Current Solutions Fall Short</strong></h2><p>Most inference serving solutions treat memory as an afterthought. 
They focus on request routing, model loading, and compute scheduling, but ignore the fundamental memory access patterns that determine actual performance.</p><p><strong>Static Batching</strong> assumes uniform sequence lengths and fails catastrophically when real traffic arrives with mixed lengths. A batch with one 8K sequence and seven 1K sequences performs worse than eight 1K sequences due to memory access patterns.</p><p><strong>Naive KV Caching</strong> stores key-value pairs contiguously in memory, creating fragmentation and inefficient access patterns as sequences grow and shrink dynamically.</p><p><strong>Compute-First Scheduling</strong> allocates GPUs based on model size and expected compute load, ignoring memory bandwidth utilization. You end up with GPUs sitting idle while memory controllers are saturated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i9uw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i9uw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 424w, https://substackcdn.com/image/fetch/$s_!i9uw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 848w, https://substackcdn.com/image/fetch/$s_!i9uw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!i9uw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i9uw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png" width="1456" height="1516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1516,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:511552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/172465670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i9uw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 424w, 
https://substackcdn.com/image/fetch/$s_!i9uw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 848w, https://substackcdn.com/image/fetch/$s_!i9uw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!i9uw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9daf6b4f-9206-4492-b65c-eee3967603ad_3689x3840.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><em>Typical production scenario: GPU compute utilization remains low while memory bandwidth saturates, creating a hidden bottleneck.</em></p><p>The result? Serving systems that look good on paper but collapse under real production loads.</p><h2><strong>Memory-Aware Serving Patterns</strong></h2><p>The path forward requires treating memory bandwidth as a first-class resource, just like compute. Here are the patterns that actually work at scale:</p><h3><strong>1. Dynamic Memory-Aware Batching</strong></h3><p>Instead of fixed batch sizes, implement dynamic batching that considers memory access patterns:</p><pre><code><code># Exosphere Node: Memory-Aware Dynamic Batching
class MemoryAwareBatchingNode(BaseNode):
    class Inputs(BaseModel):
        requests: list
        memory_budget_gb: float
        
    class Outputs(BaseModel):
        optimized_batch: list
        estimated_memory_usage: float
    
    async def execute(self) -&gt; Outputs:
        # Sort requests by sequence length for memory locality
        sorted_requests = sort_by_sequence_length(self.inputs.requests)
        
        # Greedy batch composition within memory budget
        batch = []
        memory_used = 0.0
        
        for request in sorted_requests:
            memory_cost = estimate_memory_cost(request, len(batch))
            if memory_used + memory_cost &lt;= self.inputs.memory_budget_gb:
                batch.append(request)
                memory_used += memory_cost
            else:
                break  # Budget exceeded
        
        return self.Outputs(
            optimized_batch=batch,
            estimated_memory_usage=memory_used
        )
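
# --- Hedged sketch (not from the original post): one way the two helpers used
# above might look. The field name and constants are illustrative assumptions;
# the ~0.0005 GB/token figure follows the "~2 GB per 4K context" rule of thumb.
SEQ_LEN_KEY = "sequence_length"      # assumed key on each request dict
KV_CACHE_GB_PER_TOKEN = 0.0005       # ~2 GB / 4096 tokens
ACTIVATION_GB_PER_SEQ = 0.1          # assumed per-sequence activation allowance

def sort_by_sequence_length(requests: list) -&gt; list:
    # Packing near-equal lengths together keeps KV cache access patterns regular.
    return sorted(requests, key=lambda r: r[SEQ_LEN_KEY])

def estimate_memory_cost(request: dict, current_batch_size: int) -&gt; float:
    # Marginal memory a sequence adds to the batch: its KV cache plus an
    # activation allowance. Weights are resident regardless of batch size, so
    # they are assumed to sit outside memory_budget_gb. current_batch_size is
    # available for models whose attention workspace grows with the batch;
    # this simple sketch does not use it.
    kv_cache_gb = request[SEQ_LEN_KEY] * KV_CACHE_GB_PER_TOKEN
    return kv_cache_gb + ACTIVATION_GB_PER_SEQ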
</code></code></pre><p>The key insight: sequence length homogeneity within batches dramatically improves memory efficiency.</p><h3><strong>2. Streaming KV Cache Management</strong></h3><p>Rather than allocating fixed KV cache blocks, implement streaming cache management that adapts to actual usage patterns:</p><ul><li><p><strong>Chunked allocation</strong>: Allocate KV cache in fixed-size chunks rather than per-sequence</p></li><li><p><strong>Memory pooling</strong>: Reuse deallocated chunks across sequences</p></li><li><p><strong>Predictive prefetching</strong>: Prefetch likely-to-be-accessed cache lines based on attention patterns</p></li></ul><h3><strong>3. Memory Bandwidth Scheduling</strong></h3><p>Schedule requests based on memory bandwidth utilization, not just compute availability:</p><pre><code><code># Exosphere Node: Memory Bandwidth Scheduler
class MemoryBandwidthSchedulerNode(BaseNode):
    class Inputs(BaseModel):
        request: InferenceRequest
        peak_bandwidth_gb_per_sec: float
        current_utilization: float
        
    class Outputs(BaseModel):
        can_schedule: bool
        estimated_bandwidth_usage: float
        utilization_after_scheduling: float
    
    async def execute(self) -&gt; Outputs:
        # Calculate bandwidth requirements for this request
        bandwidth_needed = calculate_bandwidth_requirements(
            model_weights=MODEL_SIZE_GB,
            kv_cache_size=self.inputs.request.sequence_length,
            expected_tokens=self.inputs.request.expected_tokens
        )
        
        # Apply safety margin (80% max utilization)
        max_allowed = self.inputs.peak_bandwidth_gb_per_sec * 0.8
        
        # Check if we can schedule without exceeding bandwidth limits
        can_schedule = (
            self.inputs.current_utilization + bandwidth_needed &lt;= max_allowed
        )
        
        new_utilization = (
            self.inputs.current_utilization + bandwidth_needed 
            if can_schedule else self.inputs.current_utilization
        )
        
        return self.Outputs(
            can_schedule=can_schedule,
            estimated_bandwidth_usage=bandwidth_needed,
            utilization_after_scheduling=new_utilization
        )
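
# --- Hedged sketch (assumption, not the post's implementation) of the helper
# referenced above, following the article's per-token access model: every
# generated token re-reads the weights plus the growing KV cache. It ignores
# weight-read amortization across a batch for simplicity.
MODEL_SIZE_GB = 140.0                # 70B params at FP16, as in the example
KV_CACHE_GB_PER_TOKEN = 0.0005       # ~2 GB per 4K-token sequence
TARGET_TOKENS_PER_SEC = 20.0         # assumed per-request decode rate target

def calculate_bandwidth_requirements(model_weights: float,
                                     kv_cache_size: int,
                                     expected_tokens: int) -&gt; float:
    # Average KV footprint over the generation: current cache plus half of the
    # tokens still to be produced.
    avg_kv_tokens = kv_cache_size + expected_tokens / 2
    kv_read_gb = avg_kv_tokens * KV_CACHE_GB_PER_TOKEN
    # GB touched per generated token: full weight read + KV cache read.
    per_token_gb = model_weights + kv_read_gb
    # Sustained GB/s needed to hold the target decode rate for this request.
    return per_token_gb * TARGET_TOKENS_PER_SEC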
</code></code></pre><h3><strong>4. Speculative Memory Allocation</strong></h3><p>For multi-turn conversations and long-form generation, implement speculative memory allocation that reserves bandwidth for likely future requests:</p><pre><code><code># Exosphere Node: Speculative Memory Allocation
class SpeculativeMemoryNode(BaseNode):
    class Inputs(BaseModel):
        conversation_history: str
        current_request: InferenceRequest  # assumed: same request object as in the scheduler node; provides .prompt
        available_memory_gb: float
        
    class Outputs(BaseModel):
        memory_reservation: float
        continuation_probability: float
        predicted_output_length: float
    
    async def execute(self) -&gt; Outputs:
        # Predict likelihood of conversation continuation
        continuation_prob = estimate_continuation_probability(
            self.inputs.conversation_history
        )
        
        # Predict output length based on prompt characteristics
        predicted_length = predict_output_length(
            self.inputs.current_request.prompt
        )
        
        # Calculate memory reservation strategy
        base_memory = calculate_base_memory_requirement(
            self.inputs.current_request
        )
        
        speculative_memory = (
            continuation_prob * 
            predicted_length * 
            MEMORY_PER_TOKEN_GB
        )
        
        # Cap speculative allocation at 30% of available memory
        total_reservation = min(
            base_memory + speculative_memory,
            self.inputs.available_memory_gb * 0.3
        )
        
        return self.Outputs(
            memory_reservation=total_reservation,
            continuation_probability=continuation_prob,
            predicted_output_length=predicted_length
        )
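
# --- Hedged sketch (illustrative heuristics, not the post's implementation) of
# the predictors and constant referenced above.
MEMORY_PER_TOKEN_GB = 0.0005         # ~2 GB per 4K-token sequence, as earlier

def estimate_continuation_probability(conversation_history: str) -&gt; float:
    # Crude heuristic: longer multi-turn histories are more likely to continue.
    turns = conversation_history.count("\n") + 1
    return min(0.9, 0.3 + 0.1 * turns)

def predict_output_length(prompt: str) -&gt; int:
    # Assume replies scale loosely with prompt size, clamped to a sane range.
    approx_prompt_tokens = max(1, len(prompt) // 4)
    return min(2048, max(64, 2 * approx_prompt_tokens))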
</code></code></pre><h3><strong>Orchestrating Memory-Aware Nodes</strong></h3><p>These individual nodes can be composed into a complete memory-aware inference pipeline using Exosphere's runtime:</p><pre><code><code># Exosphere Runtime: Memory-Aware Inference Pipeline
Runtime(
    namespace="MemoryAwareInference",
    name="memory-optimization-pipeline",
    nodes=[
        MemoryAwareBatchingNode,      # Dynamic batching optimization
        MemoryBandwidthSchedulerNode, # Bandwidth-aware scheduling  
        SpeculativeMemoryNode         # Predictive memory allocation
    ]
).start()

# Alternative: Separate runtimes for different concerns
Runtime(
    namespace="MemoryAwareInference",
    name="batch-optimizer",
    nodes=[MemoryAwareBatchingNode]
).start()

Runtime(
    namespace="MemoryAwareInference", 
    name="bandwidth-scheduler",
    nodes=[MemoryBandwidthSchedulerNode]
).start()
</code></code></pre><p>This modular approach allows you to scale different components independently based on your specific bottlenecks.</p><h2><strong>The Infrastructure Implications</strong></h2><p>Memory-aware serving changes infrastructure requirements fundamentally:</p><p><strong>Hardware Selection</strong>: Memory bandwidth becomes the primary metric, not FLOPS. A GPU with 2x memory bandwidth but 0.8x compute is often better for inference workloads.</p><p><strong>Cluster Architecture</strong>: Network topology matters less; memory hierarchy matters more. NUMA-aware scheduling and memory-local processing become critical.</p><p><strong>Monitoring and Observability</strong>: Traditional metrics like GPU utilization become misleading. Memory bandwidth utilization, cache hit rates, and access pattern efficiency become the key indicators.</p><p><strong>Cost Optimization</strong>: The cost model shifts from compute-hours to memory-bandwidth-hours. Suddenly, techniques like model quantization and sparse attention aren't just about model size they're about memory access efficiency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vHFf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vHFf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 424w, https://substackcdn.com/image/fetch/$s_!vHFf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 848w, https://substackcdn.com/image/fetch/$s_!vHFf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 1272w, https://substackcdn.com/image/fetch/$s_!vHFf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vHFf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png" width="1456" height="1153" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1153,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:432543,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/172465670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!vHFf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 424w, https://substackcdn.com/image/fetch/$s_!vHFf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 848w, https://substackcdn.com/image/fetch/$s_!vHFf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 1272w, https://substackcdn.com/image/fetch/$s_!vHFf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0a9cca7-f588-4d95-b528-db721952deab_3840x3041.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><em>Traditional compute-centric cost model vs memory-bandwidth-aware cost model. The optimization targets shift dramatically.</em></p><h2><strong>What This Means for the Next Wave</strong></h2><p>The memory wall in AI inference is forcing a fundamental rethink of serving infrastructure. Just as the transition from CPU to GPU changed everything about training, the transition from compute-bound to memory-bound inference is reshaping the serving landscape.</p><p><strong>Model Architecture Evolution</strong>: Future models will be designed with memory access patterns in mind, not just parameter count. Techniques like mixture-of-experts and dynamic routing become essential for memory efficiency.</p><p><strong>Serving Framework Consolidation</strong>: Frameworks that understand memory bandwidth as a first-class resource will dominate. 
Those that don't will be relegated to toy demos and benchmarks.</p><p><strong>Hardware-Software Co-design</strong>: The next generation of inference accelerators will be designed around memory bandwidth, with novel memory hierarchies and access patterns optimized for transformer workloads.</p><h2><strong>The Questions We Need to Answer</strong></h2><p>As we navigate this transition, several critical questions emerge:</p><ul><li><p>How do we build serving systems that gracefully degrade under memory pressure rather than failing catastrophically?</p></li><li><p>What's the right abstraction for memory bandwidth scheduling across heterogeneous hardware?</p></li><li><p>How do we balance the tension between batch efficiency and tail latency when memory access patterns are non-uniform?</p></li><li><p>Can we develop memory access pattern prediction that's accurate enough to drive real-time scheduling decisions?</p></li></ul><p>The memory wall isn't just a technical challenge it's an opportunity to build fundamentally better inference infrastructure. The teams that solve this first will have a massive advantage in the next phase of AI deployment.</p><p>What patterns have you seen in your inference workloads? Are you hitting memory bandwidth limits, or are there other bottlenecks I'm missing? The infrastructure that emerges from this transition will shape how we deploy AI for the next decade.</p><p><em>Thanks for reading,</em></p><p><em>Nikita Ag, AI Infra Weekly</em></p>]]></content:encoded></item><item><title><![CDATA[How are intelligent systems redefining what's possible with large data?]]></title><description><![CDATA[Agents at Internet Scale data]]></description><link>https://www.nikiagarwal.com/p/when-agents-go-beyond-human-scale</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/when-agents-go-beyond-human-scale</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Tue, 26 Aug 2025 03:45:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IXjC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The quiet revolution happening in AI infrastructure isn't just about faster models or cheaper compute. It's about <strong>scale transcendence</strong>, the moment when AI agents stop being tools that extend human capability and become systems that operate entirely beyond human comprehension and capacity.</p><p>This week, I want to walk you through a fascinating case study that perfectly illustrates this transition: WhatPeopleWant, an AI agent that processes Hacker News discussions at internet scale to uncover entrepreneurial opportunities. But more importantly, I want to show you how platforms like Exosphere are making this kind of superhuman data processing accessible to any developer.</p><p><strong>The Scale Problem That Humans Can't Solve</strong></p><p>Let's start with a thought experiment. Imagine you wanted to manually analyze every Hacker News comment posted in the last 24 hours to identify unmet market needs. 
Here's what you'd be up against:</p><ul><li><p><strong>~500 new posts per day</strong> across all categories</p></li><li><p><strong>~15,000 comments generated daily</strong> across active threads</p></li><li><p><strong>Nested conversation trees</strong> that can go 10+ levels deep</p></li><li><p><strong>Real-time updates</strong> every few minutes as discussions evolve</p></li><li><p><strong>Pattern recognition</strong> across thousands of simultaneous conversations</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uHYG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uHYG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!uHYG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!uHYG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!uHYG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uHYG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evolution of Internet-Scale Data Processing: From Human-Scale to AI Agent-Scale Operations&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evolution of Internet-Scale Data Processing: From Human-Scale to AI Agent-Scale Operations" title="Evolution of Internet-Scale Data Processing: From Human-Scale to AI Agent-Scale Operations" srcset="https://substackcdn.com/image/fetch/$s_!uHYG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!uHYG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 848w, 
https://substackcdn.com/image/fetch/$s_!uHYG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!uHYG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5002cace-9cfe-4279-9f6c-cb06fdaab45b_2400x1600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Evolution of Internet-Scale Data Processing: From Human-Scale to AI Agent-Scale Operations</p><p>A human analyst, working at superhuman speed, might process 100 comments per hour with decent comprehension. At that rate, analyzing a single day's worth of Hacker News data would take <strong>150 hours</strong>&#8212;nearly a month of full-time work. By the time you finish, you'd have 29 days of new data waiting.</p><p>This is where internet-scale data processing fundamentally breaks human-centric workflows. The traditional approach of "humans + tools" hits a hard ceiling when data velocity exceeds human cognitive throughput by orders of magnitude.</p><p>The WhatPeopleWant project demonstrates a fundamentally different approach. Instead of augmenting human analysis, it replaces it entirely with an autonomous agent pipeline that operates at machine speed.</p><p>Let's dissect the architecture:</p><h3><strong>The Data Ingestion Layer</strong></h3><pre><code><code>class GetMaxItemNode(BaseNode):
    async def execute(self) -&gt; Outputs:
        # ClientSession is aiohttp's async HTTP client; MAX_ITEM_ENDPOINT points
        # at the Hacker News Firebase "max item" API, the high-water mark for
        # new content.
        async with ClientSession() as session:
            async with session.get(MAX_ITEM_ENDPOINT) as response:
                max_item = await response.json()
        return self.Outputs(max_item=str(max_item))
</code></code></pre><p>The agent starts by hitting the Hacker News Firebase API to get the current maximum item ID, essentially the "water mark" for new content. This simple operation reveals something profound: the agent doesn't process data in human-friendly batches. It processes everything, continuously, as a streaming workflow.</p><p><strong>Parallel Data Processing at Scale</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!07TU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!07TU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!07TU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!07TU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!07TU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!07TU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;WhatPeopleWant Agent: Data Processing Pipeline Volume Analysis&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="WhatPeopleWant Agent: Data Processing Pipeline Volume Analysis" title="WhatPeopleWant Agent: Data Processing Pipeline Volume Analysis" srcset="https://substackcdn.com/image/fetch/$s_!07TU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!07TU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!07TU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!07TU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d7ca6e-083c-4fc1-9a0b-7526f97c7f34_2400x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>WhatPeopleWant Agent: Data Processing Pipeline Volume Analysis</p><p>The real magic happens in the <code>GenerateItemsNode</code>, which creates processing tasks for every single item ID in a range:</p><pre><code><code>async def execute(self) -&gt; list[Outputs]:
    outputs = []
    for item_id in range(int(self.inputs.start_id), int(self.inputs.end_id) + 1):
        outputs.append(self.Outputs(item_id=str(item_id)))
    return outputs
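
# For context, a reconstruction (an assumption, based on the fields used above
# and the declaration style of the other nodes) of the surrounding class:
#
# class GenerateItemsNode(BaseNode):
#     class Inputs(BaseModel):
#         start_id: str
#         end_id: str
#
#     class Outputs(BaseModel):
#         item_id: str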
</code></code></pre><p>This isn't just batch processing; it's <strong>dynamic parallelization</strong>. The agent spawns individual processing tasks for each piece of content, allowing Exosphere's runtime to distribute work across multiple containers automatically. A human would process items sequentially; the agent processes them as a massive parallel operation.</p><p><strong>Graph-Based Relationship Mapping</strong></p><p>The agent doesn't just read comments; it reconstructs entire conversation graphs, identifies discussion clusters, and spots trending topics based on engagement patterns.</p><p>This type of multi-dimensional analysis would require a team of human analysts working with specialized tools for weeks.</p><p><strong>The Exosphere Advantage: Infrastructure That Thinks</strong></p><p>What makes this architecture possible isn't just clever code but the underlying platform. Exosphere provides three critical capabilities that transform this from an interesting prototype to a production-scale system:</p><p><strong>1. Automatic Orchestration</strong></p><p>With a simple deployment:</p><pre><code><code>runner-1:
  build: .
  container_name: whatpeoplewant-runner-1
  environment:
    - EXOSPHERE_STATE_MANAGER_URI=http://exosphere-state-manager:8000
    - EXOSPHERE_API_KEY=exosphere@123

# ... repeated for runner-2, runner-3, runner-4
</code></code></pre><p>The system automatically distributes work across four runner instances. When processing load increases, Exosphere can spawn additional runners dynamically. The developer doesn't manage containers, queues, or load balancing&#8212;the platform handles orchestration automatically.</p><p><strong>2. Stateful Workflow Management</strong></p><p>The <code>register.py</code> file defines a complex directed acyclic graph (DAG) of processing nodes:</p><pre><code><code>graph_nodes=[
    {
        "node_name": GetMaxItemNode.__name__,
        "identifier": "GetMaxItem",
        "inputs": {},
        "next_nodes": ["AddDatabasePointer"]
    },
    {
        "node_name": AddDatabasePointerNode.__name__,
        "identifier": "AddDatabasePointer", 
        "inputs": {"item_id": "${{GetMaxItem.outputs.max_item}}"},
        "next_nodes": ["GenerateItems"]
    }
    # ... continues for 8+ nodes
]
</code></code></pre><p>Each node can fail, restart, or scale independently while maintaining state continuity. Traditional processing systems require complex error handling and checkpointing logic. Exosphere provides this as a platform primitive.</p><p>The MongoDB aggregation pipelines create temporal patterns:</p><ul><li><p>Which types of problems get discussed repeatedly?</p></li><li><p>How does sentiment around specific technologies evolve?</p></li><li><p>Which conversation patterns predict successful product launches?</p></li></ul><p>A human analyst might spot individual opportunities. The agent spots <strong>meta-patterns</strong> across thousands of conversations over months of operation. It develops institutional memory that no individual human could maintain</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IXjC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IXjC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IXjC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IXjC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IXjC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IXjC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg" width="1080" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177611,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/171880854?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IXjC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!IXjC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IXjC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IXjC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abec2c4-cab7-47c7-be65-c230707da091_1080x813.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>The Technical Architecture Deep Dive</strong></p><p>Let's examine how Exosphere enables this kind of sophisticated workflow with remarkably simple code:</p><p><strong>Node-Based Processing</strong></p><p>Each processing step is implemented as a <code>BaseNode</code> with standardized inputs and outputs:</p><pre><code><code>class AddItemToDatabaseNode(BaseNode):
    class Inputs(BaseModel):
        item_id: str
    
    class Outputs(BaseModel):
        object_id: str
    
    async def execute(self) -&gt; Outputs:
        item_id = int(self.inputs.item_id)
        item_data = await get_item_from_hacker_news(item_id)
        object_id = await add_item_to_database(item_id, item_data)
        return self.Outputs(object_id=object_id)
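
# --- Hedged sketch (assumption, not the original repo's code) of the two
# helpers used above. Assumes aiohttp's ClientSession is imported and that
# `items` is a Motor (async MongoDB) collection; the endpoint is the public
# Hacker News Firebase item API.
HN_ITEM_ENDPOINT = "https://hacker-news.firebaseio.com/v0/item/{id}.json"

async def get_item_from_hacker_news(item_id: int) -&gt; dict:
    # Fetch a single post/comment as raw JSON from Hacker News.
    async with ClientSession() as session:
        async with session.get(HN_ITEM_ENDPOINT.format(id=item_id)) as response:
            return await response.json()

async def add_item_to_database(item_id: int, item_data: dict) -&gt; str:
    # Store the raw item and hand the new document id to downstream nodes.
    result = await items.insert_one({"item_id": item_id, "data": item_data})
    return str(result.inserted_id)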
</code></code></pre><p>This abstraction allows complex workflows to be composed from simple, testable components. Each node can be developed, tested, and deployed independently while maintaining type safety through Pydantic models.</p><p><strong>Distributed State Management</strong></p><p>The platform maintains a centralized state manager that tracks workflow execution across all nodes:</p><pre><code><code>StateManager(namespace="WhatPeopleWant").trigger(
    graph_name="ScrapeYC",
    state=TriggerState(
        identifier="GetMaxItem",
        inputs={}
    )
)
</code></code></pre><p>This enables sophisticated workflow patterns like conditional branching, parallel execution, and automatic retry logic without requiring developers to implement distributed systems primitives.</p><p><strong>Resource Optimization</strong></p><p>The scheduler automatically batches work based on resource availability and deadline requirements. During high-load periods, it may delay non-critical processing. During low-load periods, it accelerates processing to stay ahead of schedule.</p><p>This dynamic resource allocation is invisible to the developer but critical for cost-effective internet-scale operation.</p><p><strong>The Economics </strong></p><p>Here's where the story gets really interesting. The WhatPeopleWant agent processes more data in an hour than a team of human analysts could handle in weeks. But it doesn't just process more, it processes <strong>differently</strong></p><p>Traditional market research might involve:</p><ul><li><p>Survey design and distribution</p></li><li><p>Response collection and validation</p></li><li><p>Statistical analysis and reporting</p></li><li><p>Weeks of human effort across multiple specialists</p></li></ul><p>The agent approach involves:</p><ul><li><p>Continuous data ingestion from public discussions</p></li><li><p>Real-time sentiment and pattern analysis</p></li><li><p>Automated insight generation and distribution</p></li><li><p>Zero ongoing human effort after initial setup</p></li></ul><p>The unit economics are transformative. Instead of paying $10,000+ for a market research report that's outdated by publication, you get continuous intelligence for the cost of cloud compute&#8212;typically under $50/month for this type of workload.</p><p><strong>What This Means for AI Infrastructure</strong></p><p>The WhatPeopleWant project is a neat preview of how AI infrastructure will evolve. We're moving from platforms that help humans work faster to platforms that enable entirely non-human workflows.</p><p>Key architectural patterns emerging:</p><ol><li><p><strong>Agent-Native Design</strong>: Instead of building human-centric interfaces with API access, we're building agent-native platforms where the primary users are other AI systems.</p></li><li><p><strong>Workflow Declarativity</strong>: Developers describe what they want accomplished, not how to accomplish it. The platform handles optimization, scaling, and reliability automatically.</p></li><li><p><strong>Intelligence Integration</strong>: LLM capabilities are built into the infrastructure layer, not added as external services. This enables more sophisticated processing without complex integration overhead.</p></li><li><p><strong>Scale Economics</strong>: The marginal cost of additional data processing approaches zero, enabling workflows that were previously economically impossible.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/when-agents-go-beyond-human-scale?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/when-agents-go-beyond-human-scale?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></li></ol><p>Platforms like <a href="https://exosphere.host/">Exosphere </a>represent the beginning of a broader infrastructure shift. 
Today, we're still in the "human + AI tools" era for most applications. But projects like WhatPeopleWant show us what the "AI-native workflow" era looks like.</p><p>The implications are staggering:</p><ul><li><p><strong>Research and Analysis</strong>: Continuous monitoring and analysis of any topic across all available data sources</p></li><li><p><strong>Business Intelligence</strong>: Real-time market sensing and competitive analysis that updates faster than human decision-making cycles</p></li><li><p><strong>Content Creation</strong>: Automated generation of insights, reports, and creative content based on real-time data synthesis</p></li><li><p><strong>System Optimization</strong>: Self-managing infrastructure that optimizes performance, cost, and reliability without human intervention</p></li></ul><p>We're not just building better tools for humans. We're building the foundation for a new class of applications that operate entirely beyond human scale and comprehension.</p><p>The next time someone asks you about AI infrastructure, don't just think about faster GPUs or cheaper inference. Think about platforms that enable entirely superhuman workflows. Because that's where the real transformation is happening.</p><div><hr></div><p><em>Want to explore building your own internet-scale AI workflows? Check out the <a href="https://github.com/NiveditJain/WhatPeopleWant">WhatPeopleWant repository</a> and <a href="https://exosphere.host/">Exosphere's</a> platform to get started. The future is autonomous, and it's arriving faster than most people realize.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/when-agents-go-beyond-human-scale/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/when-agents-go-beyond-human-scale/comments"><span>Leave a comment</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Does inference have an answer to your hallucination problem?]]></title><description><![CDATA[Yes, and the strategy is called Test Time Compute Scaling.]]></description><link>https://www.nikiagarwal.com/p/does-inference-have-an-answer-to</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/does-inference-have-an-answer-to</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Mon, 18 Aug 2025 16:15:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GA9D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GA9D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!GA9D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 424w, https://substackcdn.com/image/fetch/$s_!GA9D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 848w, https://substackcdn.com/image/fetch/$s_!GA9D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 1272w, https://substackcdn.com/image/fetch/$s_!GA9D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GA9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png" width="1190" height="574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1190,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:834513,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/171287414?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GA9D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 424w, https://substackcdn.com/image/fetch/$s_!GA9D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 848w, https://substackcdn.com/image/fetch/$s_!GA9D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 1272w, https://substackcdn.com/image/fetch/$s_!GA9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bb5d00-ee50-4781-af9a-53aacec403af_1190x574.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 
8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The last three years were dominated by the race to scale model training. Bigger datasets on larger, more GPUs drove the leap from GPT-3 to GPT-4, from early LLaMAs to Gemini, from &#8220;just text&#8221; to multimodal reasoning. That era gave us the raw capability.</p><p>But once a model is trained, where is its value truly realized?</p><p>Not in the datacenter that produced it, but in the billions of inferences it serves daily. This is where the real frontier lies: inference compute.</p><p>Being on the team handling inference compute at Azure OpenAI, I had the privilege of being on the frontline handling the sheer volume that inference drives, which is growing at rocket speed, not by months but by days. We are only crawling into the inference landscape in my mind.</p><p>Inference has a fundamentally different characteristic than training: it is elastic. Training is a one-shot event where you scale up until the model converges. Inference, by contrast, happens query by query, workflow by workflow. Each request can be cheap or expensive depending on how much &#8220;thinking time&#8221; you allow the model.</p><p><em>This is what we mean by <strong>test-time compute scaling</strong>: deciding at runtime how much computation to spend per query.</em></p><p>That elasticity creates a central tension every infra team now feels: <strong>latency vs reliability vs cost</strong>. 
Push for low latency and you risk brittle, hallucinated outputs.</p><p>Push for reliability and you must tolerate higher token counts, retries, or parallel search.</p><p>Push for cost efficiency and you need clever policies that dynamically adapt compute to the hardness of the query.</p><p>Balancing these forces is quickly becoming as important as training breakthroughs themselves.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MGfW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MGfW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!MGfW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!MGfW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!MGfW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MGfW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Estimated global token usage by LLMs: 2023-2025 (in trillions of tokens)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Estimated global token usage by LLMs: 2023-2025 (in trillions of tokens)" title="Estimated global token usage by LLMs: 2023-2025 (in trillions of tokens)" srcset="https://substackcdn.com/image/fetch/$s_!MGfW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 424w, https://substackcdn.com/image/fetch/$s_!MGfW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 848w, https://substackcdn.com/image/fetch/$s_!MGfW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 
1272w, https://substackcdn.com/image/fetch/$s_!MGfW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7604c5-850a-4d6e-9251-be56aec50c7d_2400x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Estimated global token usage by LLMs: 2023-2025 (in trillions of tokens)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/does-inference-have-an-answer-to?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/does-inference-have-an-answer-to?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>What is Test-Time Compute Scaling?</h2><p>At its core, <strong>test-time compute scaling</strong> is about spending more computational effort during inference to get a better answer. Instead of firing off a single forward pass and taking the first decoded response, we allow the model to &#8220;think longer&#8221; by exploring multiple reasoning paths, searching, verifying, and then selecting the most reliable outcome. Crucially, none of this requires retraining the model, the weights remain frozen. <strong>What changes is the strategy we apply at inference?</strong></p><p>It is almost a spectrum. On one end, you have the cheapest possible decode: one forward pass, greedy decoding, instant output. On the other end, you have heavyweight inference: dozens of parallel samples, beam search over reasoning chains, verifier models cross-checking results, even iterative self-repair loops.</p><p>The space in between is where most real-world applications will land, allocating compute dynamically based on query difficulty, user expectations, or system policy.</p><p>This makes test-time compute scaling a knob we can dial per request, contrary to training, where scale is fixed once you launch a run. 
This knob is becoming an essential infrastructure piece for anyone orchestrating production-grade AI workflows.</p><p>AI workflows have evolved to have a large number of steps: retrieval, reasoning, code execution, summarization, routing between agents, and handoffs to humans or external APIs. In these workflows, errors are multiplicative <em><a href="https://nikiagarwal.substack.com/p/make-your-ai-agent-fail-fast-to-succeed">(ref #1)</a></em>. A single weak step, a misparsed field, a hallucinated fact, a failed function call, can derail the entire DAG. The deeper the pipeline, the more fragile it becomes.</p><p>This is where test-time compute scaling becomes essential. By allocating more inference compute <strong>at the leaves of the workflow</strong>, the points where correctness matters most, one can catch failures before they cascade.</p><p>One of the underappreciated aspects of test-time compute scaling is that it maps perfectly onto <strong>parallel execution patterns</strong> we already know from distributed systems. Multi-sample voting, best-of-N reranking, tree-of-thoughts search: these aren&#8217;t sequential processes. They are inherently parallelizable.</p><p>That parallelism changes the character of test-time scaling. Instead of thinking of it as &#8220;retry until you get a good answer,&#8221; you can treat it as <strong>fanout followed by aggregation</strong>. Fire off ten reasoning paths in parallel, score them as they arrive, and stop early if a quorum emerges. Or run multiple candidates asynchronously, let verifiers score them on the side, and promote a winner as soon as confidence is high enough. This is the same pattern we&#8217;ve been using for years in web services (hedged requests, quorum reads in distributed databases, speculative execution in MapReduce), only now applied at the inference layer.</p><p>The unlock is that these techniques let us trade off latency against reliability without always paying the full sequential cost. From an infrastructure perspective, this means test-time scaling is not a modeling hack; it&#8217;s a <strong>workflow execution problem</strong>.</p><p>How you shard inference requests across GPUs, how you pipeline verifiers, how you detect stragglers: these become first-class concerns. Just as batch processing frameworks like Spark made parallel data transformations tractable, inference orchestrators will need to make <strong>parallel reasoning at test time</strong> a built-in capability.</p><p>So how do you go about implementing this? Let&#8217;s build some intuition to get you started.</p><h3>Multi-Sample + Majority Vote</h3><p>The simplest baseline: ask the model the same question multiple times and see what answer comes up most often. This works because LLMs are stochastic; when the same output emerges across independent samples, it&#8217;s usually more reliable.</p><pre><code><code>answers = [model(query, temperature=0.7) for _ in range(K)]
final = Counter(answers).most_common(1)[0][0]   # majority vote (requires: from collections import Counter)

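# --- Illustrative extension, not from the post: run the K samples in parallel and
# --- stop early once a quorum of identical answers emerges (the fanout-plus-
# --- aggregation pattern described above). Assumes model is thread-safe;
# --- K and quorum are example parameters.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def sample_with_quorum(query, K=10, quorum=4):
    votes = Counter()
    with ThreadPoolExecutor(max_workers=K) as pool:
        futures = [pool.submit(model, query, temperature=0.7) for _ in range(K)]
        for fut in as_completed(futures):
            votes[fut.result()] += 1
            answer, count = votes.most_common(1)[0]
            if count >= quorum:        # stop rule: enough agreement, no need to wait
                for f in futures:
                    f.cancel()         # cancels only samples that have not started yet
                return answer
    return votes.most_common(1)[0][0]  # no quorum reached: fall back to plain majority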
</code></code></pre><p>Like asking five smart people the same question and going with the consensus.</p><div><hr></div><h3>Best-of-N with a Verifier</h3><p>Here, instead of blindly voting, we <strong>score</strong> candidates against some verifier. The verifier might be another model, a set of heuristics, or even ground-truth tests (in coding tasks).</p><pre><code><code>candidates = [model(query) for _ in range(N)]
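# verifier(cand) returns a scalar score; as noted above it could be another model,
# heuristics, or ground-truth tests (illustrative, not a fixed API)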
scored = [(cand, verifier(cand)) for cand in candidates]
final = max(scored, key=lambda cs: cs[1])[0]   # candidate with the highest verifier score</code></code></pre><p>Think of it as generating multiple drafts, then letting an editor pick the best.</p><div><hr></div><h3>Search over Thoughts (Tree/Beam Search)</h3><p>Rather than producing whole answers at once, generate reasoning step by step. Branch on partial outputs, prune weak paths, and expand promising ones.</p><pre><code><code>frontier = [""]
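# frontier holds the current partial reasoning chains; we start from one empty chain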
for step in range(max_steps):
    expanded = []
    for partial in frontier:
        # model_step: ask the model for up to beam_width continuations of this partial chain
        continuations = model_step(query, partial, beam_width)
        expanded.extend(continuations)
    # prune: keep only the top_k most promising partial chains, by score or heuristic
    frontier = prune(expanded, top_k=beam_width)
final = best(frontier)</code></code></pre><p>Instead of one linear train of thought, imagine exploring multiple branches of reasoning and following the ones that &#8220;look right.&#8221;</p><div><hr></div><h3>Self-Critique and Repair Loops</h3><p>Ask the model to check its own output and revise. This loop often catches simple arithmetic or logical slips.</p><pre><code><code>draft = model(query)
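# Write-critique-revise loop: stop when the critique reports no issues or when
# max_iters is exhausted (both stop conditions here are illustrative)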
for _ in range(max_iters):
    critique = model(f"Critique this answer: {draft}")
    if "no issues" in critique: break
    draft = model(f"Revise based on critique: {draft}\\n{critique}")
final = draft</code></code></pre><p>Write, proofread, revise, repeat.</p><div><hr></div><h3>Tool-Augmented Inference</h3><p>Bring in external checks: calculators for math, retrievers for grounding, unit tests for code.</p><pre><code><code>draft = model(query)
if has_equations(draft):
    result = external_calc(draft)          # verify any math with an external calculator
    draft = inject_result(draft, result)   # splice the checked result back into the draft
if not passes_tests(draft):                # e.g. unit tests for generated code
    draft = repair(draft)</code></code></pre><p>Give the model instruments instead of trusting it to play by ear.</p><div><hr></div><p>Each of these techniques <strong>scales with parallelism:</strong> more samples, more branches, more verifier calls. And each one benefits from smart stop rules: don&#8217;t always burn the full budget; quit early when confidence is high.</p><p>The shift from training scale to inference scale is an answer to the determinism problem. Now we are no longer dependent only on bigger models for reliability. We can derive it from <strong>smarter orchestration at runtime</strong>, where parallelism, verifiers, and stop rules combine into systems that steer models toward trustworthy results.</p><p>Test-time compute scaling doesn&#8217;t make models smarter; it makes the <em>system</em> around them smarter. And in the new era of AI, that&#8217;s where the real breakthroughs will come from.</p><p><em>What do you think?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/does-inference-have-an-answer-to/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/does-inference-have-an-answer-to/comments"><span>Leave a comment</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Artificially Intelligent Orchestration of Agents]]></title><description><![CDATA[Centralized policies. Decentralized agents. 
Deterministic outcomes.]]></description><link>https://www.nikiagarwal.com/p/artificially-intelligent-orchestration</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/artificially-intelligent-orchestration</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Mon, 11 Aug 2025 12:54:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CFGD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Intelligence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a8rS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a8rS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 424w, https://substackcdn.com/image/fetch/$s_!a8rS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 848w, https://substackcdn.com/image/fetch/$s_!a8rS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 1272w, https://substackcdn.com/image/fetch/$s_!a8rS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a8rS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png" width="1456" height="572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/170681006?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a8rS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 424w, https://substackcdn.com/image/fetch/$s_!a8rS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 848w, 
https://substackcdn.com/image/fetch/$s_!a8rS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 1272w, https://substackcdn.com/image/fetch/$s_!a8rS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b02aed4-9903-45ab-ab20-9e0219604706_1706x670.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>&#8216;The ability to understand, learn and think&#8217;. My mind is plagued by the implications of &#8216;intelligence&#8217; being available perpetually, autonomously: what changes? How should we build to cater to such change?</p><p>Last week, it was interesting to work out how simple, innocent-looking graphs could explode into 6 figure node beasts with even mid-scale production data. It was mind blowing to imagine intelligence being made available to each of those nodes, independently, reliably, at scale.</p><p><em><strong>But is intelligence to be limited only to node-level functionality? Why should we limit ourselves to such boundaries?</strong></em></p><p>Humans have probably been one of the most efficient species to have been able to organise and coordinate large groups of people into sub groups and orchestrate accomplishing a large goal. Some call it religion, others call it countries(jk!) That has been our mark of intelligence.</p><p>Now with Artificial Intelligence being made so much more mature, it is exciting to imagine AI coordinating large groups of AI units into subgroups to orchestrate accomplishing a large goal. At scale. Autonomously. 
<em>Now that&#8217;s exciting.</em></p><p><em><strong>Today, we talk about such intelligent orchestration.</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CFGD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CFGD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 424w, https://substackcdn.com/image/fetch/$s_!CFGD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 848w, https://substackcdn.com/image/fetch/$s_!CFGD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 1272w, https://substackcdn.com/image/fetch/$s_!CFGD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CFGD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png" width="1200" height="481" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/170681006?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e653149-4c75-4cd7-8e62-a8ae06614343_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CFGD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 424w, https://substackcdn.com/image/fetch/$s_!CFGD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 848w, https://substackcdn.com/image/fetch/$s_!CFGD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 1272w, https://substackcdn.com/image/fetch/$s_!CFGD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b77484-4d23-46bc-8656-3dfa6cd96cd2_1200x481.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Intelligent Orchestration</h2><p>We have accomplished many a complex tasks with many sophistacated workflows, familarly called &#8216;DAGs&#8217; They are almost the basis of any mature system, Kafka queues, Redis streams, all buzzing to keep your system guzzling. But here&#8217;s the interesting thing: with intelligence at our disposal, we now are free from the shackles of a &#8216;DAG&#8217;. We can now create sequences of actions to accomplish tasks at runtime - popularly called &#8216;<em><strong>agents</strong></em>&#8217;. However, only so much can be achieved by a single agent, we need &#8216;<em><strong>multi-agentic</strong></em>&#8217; solutions going forward to accomplish complex tasks for us.</p><p>Agentic systems rarely follow a single, fixed path. Inputs vary, models disagree, and tools fail in surprising ways. Static DAGs force you to anticipate every branch in advance. <em><strong>Dynamic graphs</strong></em> invert that constraint. Edges are chosen at runtime by policies that read live state, which allows fanouts, model A or B trials, mid-run rewires, and graceful failure handling without redeploys.</p><p>Each step produces signals that should change where you go next: confidence scores, schema checks, cost budgets, latency SLOs, abuse filters. Hard-coding branches in code or YAML does not survive the real world where:</p><ul><li><p>Quality or cost policies shift daily</p></li><li><p>Vendors or models need rapid A or B switches</p></li><li><p>Failures are non-uniform and require specific fallbacks</p></li><li><p>Partial progress must be reused to avoid recomputation</p></li></ul><p>We can now treat routing as data that is evaluated at runtime against the current run state. Nodes are long-lived capabilities. 
Edges are conditional, late-bound, and modifiable while the run is in flight.</p><p>But how do you go about doing this, heck, should you even do this for your use case?</p><p><strong>What scenarios would realistically even need such a &#8216;dynamic&#8217; agent across defined agents?</strong></p><p>Let us look at some concrete scenarios:</p><ul><li><p><strong>Highly variable inputs:</strong> You do not know which path is correct until you inspect the data.</p></li><li><p><strong>Open-ended goals or exploratory tasks:</strong> You discover subgoals as you go.</p></li><li><p><strong>Continuous optimisation:</strong> You want online A/Bs, bandits, or context-aware model selection without redeploys</p></li><li><p><strong>Long-running or event-driven work:</strong> State evolves over hours or days.</p></li><li><p><strong>Combinatorial explosion of possible flows:</strong> The number of potential edges is huge, for example: knowledge graph construction. As new entity or relation types are discovered, the manager dynamically fans out to relation-specific extractors, runs validators, and rewires the next steps based on confidence and coverage gaps.</p></li></ul><p><strong>When is this a bad idea?</strong></p><ul><li><p>Strictly regulated, safety-critical flows that require fully deterministic, auditable paths per version.</p></li><li><p>Very short, stable workflows where a static DAG is simpler to reason about and cheaper to operate.</p></li><li><p>Teams without strong observability or replay. Debugging emergent routing without traces, metrics, and replays is painful.</p></li></ul><p>Now we have a new tool in our box, the power to recruit a team of agents and set up a &#8216;manager&#8217; that would eternally assign tasks to these agents. What would building infra to achieve this usecase of AI look like?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/artificially-intelligent-orchestration?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/artificially-intelligent-orchestration?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3>The vanilla way of setting this up</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CBvT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CBvT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 424w, https://substackcdn.com/image/fetch/$s_!CBvT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 848w, https://substackcdn.com/image/fetch/$s_!CBvT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CBvT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CBvT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png" width="1536" height="922" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:922,&quot;width&quot;:1536,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:793414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/170681006?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b71a6c3-d37b-48da-b2c6-0e20247a69e8_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CBvT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 424w, https://substackcdn.com/image/fetch/$s_!CBvT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 848w, https://substackcdn.com/image/fetch/$s_!CBvT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 1272w, https://substackcdn.com/image/fetch/$s_!CBvT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6da6db-538e-426d-a55d-53998237b23d_1536x922.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Artist: Sora, Chatgpt (please spare the spelling it is only 3 years old xD)</em></figcaption></figure></div><ul><li><p>Router API owns run state and chooses the next hop. Keep the router small, stateless per request, and lock per run_id to avoid double routing.</p></li><li><p>Durable state in a database. Tables for runs, steps, agents, and policies. Store input or output as JSONB, plus metrics like cost and latency for policy checks.</p></li><li><p>Queues per capability. Use Redis Streams, Kafka, or RabbitMQ. Each agent worker consumes its queue, heartbeats, writes outputs, and emits step_done events.</p></li><li><p>Policy engine in data. Start with rule rows that match on features like confidence, elapsed time, tenant, budget remaining. Action is route-to, params, or fanout list.</p></li><li><p>Controls. Retries with backoff, budget guards, latency SLO guards, circuit breakers, and idempotency on step_id. Instrument with OpenTelemetry.</p></li></ul><h3>Tools to get you started</h3><ul><li><p><strong>OpenAI Agents SDK and MCP</strong></p><p><strong>What you get</strong>: Agent runtimes with tool calling, MCP connectors for tools and data, server side execution primitives.</p><p><strong>You implement</strong>: Central router and run level state, policy engine and budgets or SLOs, queues and worker processes, long run persistence and full observability.</p></li><li><p><strong>Temporal</strong></p><p><strong>What you get</strong>: Durable workflows and activities with retries, timers, Signals or Updates, deterministic execution, task queues, visibility and versioning.</p><p><strong>You implement</strong>: Data driven policy engine, agent registry and schemas, budget or SLO guards, experimentation, routing UI, plus the activity workers that call models or tools.</p></li><li><p><strong>Exosphere</strong></p><p><strong>What you get</strong>: Central manager pattern for dynamic graphs, durable runs and steps, policy based routing, queues per capability, background agent execution, budgets or SLO gates, experiments, observability.</p><p><strong>You implement</strong>: Define your concrete agents, schemas and guardrails, integrate specific tools or models.</p></li><li><p><strong>Hatchet</strong></p><p><strong>What you get</strong>: Background task platform with spawning of child tasks, dynamic fanout, retries, concurrency limits, schedules.</p><p><strong>You implement</strong>: Router level policy and state aggregation across tasks, budget or SLO gates, experiments or bandits, agent registry and guardrails.</p></li><li><p><strong>Inngest</strong></p><p><strong>What you get</strong>: Event driven functions with fan out, waits, retries, scheduling, replay or DLQ managed for you.</p><p><strong>You implement</strong>: Cross function run state, routing policies, budget or SLO enforcement, agent registry and worker code for AI calls.</p></li><li><p><strong>AWS Step Functions</strong></p><p><strong>What you get</strong>: Choice and Map or Distributed Map for conditional or parallel routing, retries or catches, deep AWS service integrations, CloudWatch metrics.</p><p><strong>You implement</strong>: Express runtime policies as Choice JSON or Lambdas, external agent workers, cost or latency budgeting, experiment control, richer tracing beyond CloudWatch.</p></li><li><p><strong>Dagster</strong></p><p><strong>What you get</strong>: Dynamic ops or graphs, assets, sensors, type 
checks, IO managers, orchestration and scheduling.</p><p><strong>You implement</strong>: Central router abstractions, runtime rewiring via API, budget or SLO gates, multi tenant policy tables, queue backed agent workers.</p></li><li><p><strong>Prefect</strong></p><p><strong>What you get</strong>: Python flows with dynamic branching, retries, caching, deployments, result persistence, UI.</p><p><strong>You implement</strong>: Cross flow router and policy in data, per capability queues, budgets or SLOs, experiments, registry of agents or tools.</p></li><li><p><strong>Camunda 8 with DMN</strong></p><p><strong>What you get</strong>: BPMN orchestrations with Zeebe workers, DMN decision tables so routing rules live in data, message correlation, operations UI.</p><p><strong>You implement</strong>: AI agent workers and schema validation, mapping model metrics to DMN inputs, budgeting logic, autoscaling by queue depth, observability integration for AI signals.</p></li><li><p><strong>Ray Serve</strong></p><p><strong>What you get</strong>: Programmable request routers, dynamic backend selection, autoscaling, model composition, metrics.</p><p><strong>You implement</strong>: Multi step run state and history, policy or experiment DB, budgets or SLOs across steps, tool integrations beyond model inference.</p></li><li><p><strong>LangGraph</strong></p><p><strong>What you get</strong>: Node graphs with conditional edges, checkpoints, streaming, interrupt or resume, tool calling.</p><p><strong>You implement</strong>: If avoiding explicit graphs you still need a central manager, plus policies in data, budgets, experiments, queue scaled workers and global run state.</p></li><li><p><strong>Argo Workflows</strong></p><p><strong>What you get</strong>: Kubernetes native DAGs with loops, artifacts, retries, templates, parallelization, K8s autoscaling.</p><p><strong>You implement</strong>: Externalized policy engine for runtime rewiring, agent registry and schemas, budgets or SLOs, cross run observability and replay.</p></li></ul><p>With n independent components in a graph, possibilities of graph executions grow exponentially in order of n, to optimally <em><strong>know</strong></em> what should be done is a hard task, so is managing such <em><strong>variable execution</strong></em>.</p><p>There is no easy way today to get this going, and honestly the models are still maturing to handle such complex context and intelligently manage traffic realtime. I talk to founders each week building such solutions from ground up to support their use case, sharing the same story: <em>a duct-taped framework which is &#8216;running&#8217; and they are choosing to look the other way as capacity constrains.</em></p><p>I also think of the possibility that we might see models meant for such decision making. Or could we have AGI that could itself execute a task end to end without needing an external &#8216;manager&#8217;? 
Tools and mcps are unlocking such capabilities with higher order models as we go, what&#8217;s your bet?</p><p></p><p><em>-thanks for reading,</em></p><p><em>Nikita Ag</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/artificially-intelligent-orchestration/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/artificially-intelligent-orchestration/comments"><span>Leave a comment</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Think twice before deploying your AI Agent PoC to Prod]]></title><description><![CDATA[What changes when input size changes from 10 to 10K? Here's a story of the pains I encountered at scale, and how you could avoid them.]]></description><link>https://www.nikiagarwal.com/p/think-twice-before-deploying-your</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/think-twice-before-deploying-your</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Mon, 04 Aug 2025 17:45:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!urp3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!urp3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!urp3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 424w, https://substackcdn.com/image/fetch/$s_!urp3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 848w, https://substackcdn.com/image/fetch/$s_!urp3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 1272w, https://substackcdn.com/image/fetch/$s_!urp3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!urp3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png" width="724.3472290039062" height="397.78735326131186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:659,&quot;width&quot;:1200,&quot;resizeWidth&quot;:724.3472290039062,&quot;bytes&quot;:1260508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/170017050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162a576-71ad-4b48-91fb-bc23e991642a_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!urp3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 424w, https://substackcdn.com/image/fetch/$s_!urp3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 848w, https://substackcdn.com/image/fetch/$s_!urp3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 1272w, https://substackcdn.com/image/fetch/$s_!urp3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1ed5de6-9ea4-4cef-b289-89442246e98e_1200x659.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Over the past eighteen months almost every data-science or platform team has built at least one &#8220;single-file&#8221; agent demo: drop in a PDF, get a summary, maybe run a retrieval-augmented chat. 
These experiments validated raw capability but hid three things that only show up in production: <strong>sustained throughput, failure handling and cost control.</strong></p><ul><li><p><strong>Adoption numbers</strong></p><ul><li><p>51 % of the 1 300 professionals surveyed by LangChain already run at least one agent in production, and 78 % have concrete deployment plans. Mid-sized companies (100-2 000 employees) are the most aggressive, with 63 % already live. <a href="https://www.langchain.com/stateofaiagents">langchain.com</a></p></li><li><p>IBM&#8217;s June 2025 enterprise study found 76 % of executives have an agentic proof-of-concept running but are now asking &#8220;how do we scale and govern this?&#8221; <a href="https://www.ibm.com/think/insights/scale-ai-agents-business">IBM</a></p></li></ul></li></ul><h3>What breaks when you scale</h3><p>Moving from <em>one</em> document to <em>tens of millions</em>, or from a <em>single</em> tool invocation to <em>hundreds</em> per minute, exposes three recurrent pain points:</p><h4>Latency balloons from seconds to minutes</h4><ul><li><p><strong>Root cause:</strong> Serial tool calls and synchronous I/O</p></li><li><p><strong>Why the PoC missed it:</strong> Demo only processed one request at a time</p></li></ul><h4>Costs explode unpredictably</h4><ul><li><p><strong>Root cause:</strong> Exponential token growth when agents recurse or retry</p></li><li><p><strong>Why the PoC missed it:</strong> PoC ran on free-tier limits</p></li></ul><h4>Silent accuracy regressions</h4><ul><li><p><strong>Root cause:</strong> No tracing or automated evals <em>(Ref vol 1)</em></p></li><li><p><strong>Why the PoC missed it:</strong> Manual eyeballing felt &#8220;good enough&#8221;</p></li></ul><div><hr></div><h1>Lets take an example.</h1><p>Picture a steady stream of <strong>20 000 PDF documents</strong> flowing into your pipeline, each with <strong>more than ten pages</strong> and an average of <strong>one image per page</strong>. Every page must pass through OCR to harvest text, and each image needs separate extraction. After pulling this multimodal content, the workflow stitches the text and images back together, runs a custom summarisation step, and finally stores the results in your datastore, all while you juggle a very modest pool of CPU and GPU resources.</p><p>To understand how such a workload scales, we will estimate the total number of discrete processing nodes, project the compute time and power these nodes consume under realistic parallelism, and factor in retry overhead for inevitable failures. From there, we can outline the minimal yet resilient infrastructure, think task orchestrator, worker pools, message broker, observability stack, that keeps this many moving parts humming in production.</p><h3>How many workflow-nodes will be spin up?</h3><p><strong>A. OCR + layout on each page</strong></p><ul><li><p>Granularity: page-level</p></li><li><p>Workload: 20 000 PDFs &#215; 10 pages &#8776; <strong>200 000 pages</strong></p></li><li><p>Nodes needed: <strong>200 000</strong></p></li></ul><p><strong>B. Image extraction</strong></p><ul><li><p>Granularity: page-level (&#8776;1 image per page)</p></li><li><p>Items: 200 000 pages</p></li><li><p>Nodes needed: <strong>200 000</strong></p></li></ul><p><strong>C. Concatenate text + images</strong></p><ul><li><p>Granularity: per PDF</p></li><li><p>Items: 20 000 PDFs</p></li><li><p>Nodes needed: <strong>20 000</strong></p></li></ul><p><strong>D. 
Summarise PDF</strong></p><ul><li><p>Granularity: per PDF</p></li><li><p>Items: 20 000 PDFs</p></li><li><p>Nodes needed: <strong>20 000</strong></p></li></ul><p><strong>E. Push to datastore</strong></p><ul><li><p>Granularity: per PDF</p></li><li><p>Items: 20 000 PDFs</p></li><li><p>Nodes needed: <strong>20 000</strong></p></li></ul><p><strong>Subtotal:</strong> <strong>460 000 nodes</strong></p><p><strong>Retry buffer (&#8776; 2 % for network/OCR hiccups):</strong> <strong>&#8776; 9 200 nodes</strong></p><p><strong>Grand total:</strong> <strong>&#8776; 469 000 nodes!!</strong></p><div><hr></div><h3>Computational Cost Assessment</h3><p><strong>OCR processing</strong></p><ul><li><p>Compute: ~0.79 s per page (CPU) &#8594; 158 000 s raw runtime</p></li><li><p>Parallelization: 8 vCPUs handle &#8776;10 pages / s</p></li><li><p>Wall-clock time: <strong>&#8776; 5 &#189; hours</strong></p></li></ul><p><strong>Image extraction</strong></p><ul><li><p>Compute: ~60 ms per image &#8594; 12 000 s total</p></li><li><p>Runs concurrently with OCR, so the extra wall-clock time is <strong>negligible</strong></p></li></ul><p><strong>Concatenation</strong></p><ul><li><p>Compute: &lt;10 ms per document</p></li><li><p>Kicked off asynchronously; effectively hidden behind other work</p></li></ul><p><strong>Summarization</strong></p><ul><li><p>Compute: ~25.5 s per document on a single GPU &#8594; 142 h raw runtime</p><ul><li><p>CPU fallback: ~73 h but at higher token cost</p></li></ul></li><li><p>With one GPU, summarization dominates the schedule: <strong>&#8776; 6 days!</strong></p></li></ul><p><strong>Data insertion</strong></p><ul><li><p>Compute: ~50 ms per document</p></li><li><p>Buffered and streamed asynchronously; latency is masked</p></li></ul><h3>Optimization Strategies</h3><ul><li><p>Implementing <strong>GPU-accelerated OCR</strong> reduces CPU demands significantly.</p></li><li><p>Utilizing <strong>quantized or smaller-scale language models</strong> substantially decreases summarization durations.</p></li><li><p>Leveraging <strong>speculative decoding and batch processing</strong> enhances GPU throughput.</p></li><li><p>Adopting <strong>incremental summarization strategies</strong> reduces memory load and promotes parallel execution.</p></li><li><p>Employing <strong>spot instances</strong> mitigates expenses for stateless, computation-heavy processes.</p></li></ul><h3>Execution Strategy Examples</h3><ul><li><p><strong>Baseline</strong> (32 CPUs + 1 GPU): approximately 148 hours (~6 days)</p></li><li><p><strong>Optimized</strong> (32 CPUs + 2 GPUs or GPU OCR): approximately 48 hours</p></li><li><p><strong>Budget-focused</strong> (8 CPUs only): approximately 95 hours (~4 days)</p></li></ul><p>To summarise:</p><ul><li><p>Expect <strong>~4.7 &#215; 10&#8309; discrete workflow nodes!!</strong></p></li><li><p><strong>Summarisation dictates total elapsed time.</strong></p></li><li><p><strong>One modest GPU often beats many CPU cores for LLM work,</strong> but only if you can <strong>batch or run speculative decoding</strong>; otherwise adding CPU workers is still competitive.</p></li><li><p>Keep the pipeline <strong>idempotent and checkpoint after every fan-in</strong> so partial failures don&#8217;t force a restart of earlier stages.</p></li></ul>
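<p>The arithmetic above is easy to sanity-check; here is a minimal Python sketch using the same assumptions (the constants are the ones stated in this example, not measurements):</p><pre><code># Back-of-envelope check of the node counts and wall-clock estimates above.
pdfs, pages_per_pdf = 20_000, 10
pages = pdfs * pages_per_pdf                       # 200,000 pages

# One node per page for OCR and image extraction; one per PDF for the rest.
nodes = pages + pages + pdfs + pdfs + pdfs         # 460,000 nodes
retry_buffer = round(0.02 * nodes)                 # ~2% retries, about 9,200 nodes
print(nodes, retry_buffer, nodes + retry_buffer)   # 460000 9200 469200

ocr_hours = pages / 10 / 3600                      # 8 vCPUs at ~10 pages/s, about 5.6 h
summarise_gpu_hours = pdfs * 25.5 / 3600           # one GPU at ~25.5 s/doc, about 142 h
print(round(ocr_hours, 1), round(summarise_gpu_hours))
</code></pre>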
data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/think-twice-before-deploying-your?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/think-twice-before-deploying-your?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h1><strong>So how to solve this simply?</strong></h1><p>Effectively processing gigabyte-scale or high-document-count workloads means treating your agent pipeline like a miniature data-platform rather than a single script. The following patterns have emerged as the most </p><p>reliable way to keep throughput high without astronomical costs.</p><h3>1. Batch Processing APIs</h3><ul><li><p>Implement batch endpoints providing asynchronous job identifiers.</p></li><li><p>Utilize asynchronous callbacks or polling for result retrieval.</p></li><li><p>Incorporate adaptive throttling for back-pressure management.</p></li></ul><h3>2. Intelligent Task Batching</h3><ul><li><p>Cluster tasks semantically to maximize embedding cache effectiveness.</p></li><li><p>Dynamically adjust batch sizes to resource constraints (e.g., via Ray Data, Torch DataLoader).</p></li><li><p>Employ predictive token estimation (e.g., TikToken) for optimized model usage.</p></li></ul><h3>3. Advanced Scheduling Techniques</h3><ul><li><p>Utilize explicit Directed Acyclic Graph (DAG) structures for precise failure recovery (Airflow, Exosphere, Prefect, Argo).</p></li><li><p>Enable dynamic task generation to ensure accurate task tracking.</p></li><li><p>Leverage cost-effective scheduling through the intelligent allocation of spot instances.</p></li></ul><h3>4. Effective Parallelization and Autoscaling</h3><ul><li><p>Implement distributed queuing mechanisms for automated scaling (Kubernetes HPA, AWS Batch, Azure Container Apps).</p></li><li><p>Clearly delineate GPU- and CPU-intensive tasks.</p></li><li><p>Employ concurrency controls to mitigate overload scenarios.</p></li></ul><h3>5. 
Comprehensive Observability and Fault Tolerance</h3><ul><li><p>Generate structured logs and metrics (via OpenTelemetry, Prometheus).</p></li><li><p>Establish isolation protocols for recurrent task failures.</p></li><li><p>Introduce budget monitoring checkpoints for proactive cost management.</p></li></ul><h2>Tools in a snapshot by Capability</h2><p><strong>Batch APIs</strong></p><ul><li><p><em>Existing Solutions:</em> AWS SageMaker, Google AI Platform, vLLM KV cache</p></li><li><p><em>Comprehensive Integration with Exosphere:</em> one place to run batch jobs across different models and API formats</p></li></ul><p><strong>Scheduling</strong></p><ul><li><p><em>Existing Solutions:</em> Airflow, Prefect, Argo</p></li><li><p><em>Comprehensive Integration with Exosphere: </em>built-in DAG plus dynamic task orchestration that fits agent-style workflows</p></li></ul><p><strong>Autoscaling</strong></p><ul><li><p><em>Existing Solutions:</em>  Kubernetes HPA, AWS Batch</p></li><li><p><em>Comprehensive Integration with Exosphere: </em>resource-aware scaling tuned separately for GPUs and CPUs</p></li></ul><p><strong>Observability</strong></p><ul><li><p><em>Existing Solutions:</em> Prometheus, Grafana, OpenTelemetry</p></li><li><p><em>Comprehensive Integration with Exosphere:</em> agent-specific metrics pushed straight into the same dashboards you already use</p></li></ul><div><hr></div><p>We have finally crossed a threshold where AI agents are no longer weekend experiments but full-fledged production workloads moving terabytes of text, images, and embeddings every day. The patterns we covered: batch-first APIs, size-aware micro-batches, dynamic fan-out schedulers, and spot-friendly autoscaling turn what used to be fragile, one-off scripts into resilient data factories. They also surface a new set of questions that teams rarely faced at smaller scale:</p><ul><li><p><em>How do you budget for transient GPU spikes when token usage can quadruple overnight?</em></p></li><li><p><em>Which retry policy balances cost against accuracy when your upstream model provider silently throttles half your requests?</em></p></li><li><p><em>Where do you draw the line between &#8220;smart batching&#8221; and needlessly complex micro-batch orchestration?</em></p></li></ul><p>These questions will shape the next wave of infrastructure tooling just as early CI/CD tools reshaped software delivery. </p><p>I would love to hear your war stories. What tactics brought your throughput from hundreds to millions of tokens per minute? Which tools saved the day, and which surprised you by becoming bottlenecks? 
Have you abandoned certain libraries entirely, or patched together bespoke glue code that still feels irreplaceable?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/think-twice-before-deploying-your/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/think-twice-before-deploying-your/comments"><span>Leave a comment</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/subscribe?"><span>Subscribe now</span></a></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Make your AI agent fail fast to succeed]]></title><description><![CDATA[Per node failure detection and resurrection is a necessity, not an option, here's the math worked out to simplify it for you.]]></description><link>https://www.nikiagarwal.com/p/make-your-ai-agent-fail-fast-to-succeed</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/make-your-ai-agent-fail-fast-to-succeed</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Mon, 28 Jul 2025 03:35:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2XLJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am betting on the increasing autonomy of computer systems going forward, driven by rapid advances in artificial intelligence. This implies that many interacting agents and solutions will collaboratively work towards common goals, potentially running for extended periods, hours or even days in the background.</p><p>While we haven't fully reached this stage yet, we are currently experiencing its precursor: long-running AI workflows. These workflows are well-defined processes with fixed inputs, sequential steps, and expected outputs. They're widely used in various applications today, such as:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><ul><li><p>Database queries</p></li><li><p>Deep research across multiple domains</p></li><li><p>Coding agents and assistants</p></li></ul><p>As we push the limits of multi-model workflows, it is crucial to pause and critically assess their current performance and reliability. 
Given the substantial cost associated with these workflows, including heavy CPU/GPU compute, extensive data handling, and extended runtimes, identifying failures only at the final stages can lead to exponentially costly repercussions.</p><h3>Defining Key Terminology</h3><p>To understand this better, let's define some working terminology:</p><ul><li><p><strong>Node</strong>: An atomic unit within a workflow, such as a model call, a data formatting operation, or an aggregation step.</p></li><li><p><strong>Workflow</strong>: An end-to-end sequence of nodes with defined entry and exit points.</p></li><li><p><strong>Metrics</strong>: Hyperparameters influencing workflow performance:</p><ul><li><p>Probability that a node successfully completes a task (defining success and failure itself is an intricate topic addressed later):</p></li></ul></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(node_i) = P_i&quot;,&quot;id&quot;:&quot;LUADINRNDK&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><ul><li><p>Maximum number of retries configured for each node:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Max(Retry)_{node_i} = r_i , ~~\nr_i>1&quot;,&quot;id&quot;:&quot;MDVGRQYAJP&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><ul><li><p>Probability that the workflow ends in a successful state, given n nodes:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(workflow) = f(\\{ P_i \\},\\{r_i \\})&quot;,&quot;id&quot;:&quot;XNTAPLQNIC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h3>Probability and Workflow Reliability</h3><p>Considering a base scenario, we calculate the probability of node failure after retries:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(node)_i'=(1-P_i)^{r_i} \n&quot;,&quot;id&quot;:&quot;FSVHNTPPAG&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&quot;,&quot;id&quot;:&quot;ZPPPOCSCOL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Hence, the probability of node success becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;=> P(node)_i = 1-(1-P_i)^{r_i}&quot;,&quot;id&quot;:&quot;IUOZUJIQRH&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Thus, the workflow's overall success probability is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(workflow) = \\prod_{i=1}^{n}(1- (1-P_i)^{r_i})&quot;,&quot;id&quot;:&quot;URGPMFGBFA&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h3>Observations from Probability Analysis</h3><p>Analyzing relationships between metrics reveals critical insights (x axis denotes number of steps n):</p><ul><li><p><strong>Scenario 1 (Single Attempt, Moderate Probability)</strong>:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; {\\forall} ~r_i=1 ,  P_i = 0.8 &quot;,&quot;id&quot;:&quot;RPLWFGBBBW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2XLJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2XLJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 424w, https://substackcdn.com/image/fetch/$s_!2XLJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 848w, https://substackcdn.com/image/fetch/$s_!2XLJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 1272w, https://substackcdn.com/image/fetch/$s_!2XLJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2XLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png" width="1456" height="644" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/169363799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2XLJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 424w, https://substackcdn.com/image/fetch/$s_!2XLJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 848w, https://substackcdn.com/image/fetch/$s_!2XLJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 1272w, https://substackcdn.com/image/fetch/$s_!2XLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7bc1c2-c163-4d15-a6a5-3afead38a344_1918x848.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 
8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><p>A workflow with even two sequential steps has less than a 50% chance of success if each step has an 80% success rate.</p><ul><li><p><strong>Scenario 2 (Single Attempt, High Probability)</strong>:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; {\\forall} ~r_i=1 ,  P_i = 0.9:&quot;,&quot;id&quot;:&quot;MMGGTQXZWU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T5LE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T5LE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 424w, https://substackcdn.com/image/fetch/$s_!T5LE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 848w, https://substackcdn.com/image/fetch/$s_!T5LE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 1272w, https://substackcdn.com/image/fetch/$s_!T5LE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T5LE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png" width="1456" height="656" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/169363799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T5LE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 424w, https://substackcdn.com/image/fetch/$s_!T5LE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 848w, https://substackcdn.com/image/fetch/$s_!T5LE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 1272w, https://substackcdn.com/image/fetch/$s_!T5LE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b00ef92-39d9-4ecd-8ff0-b3a4c498cc22_1919x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>In this case 2-step workflow reaches around 80% reliability, but reliability drops significantly beyond six steps</p><h3>Impact of Retries</h3><p>What if we are able to accurately identify failures and trigger immediate retries, how do the numbers change? 
<br><br>Including retries dramatically improves reliability:</p><ul><li><p><strong>Moderate Probability (80%) with Retries</strong>:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; {\\forall} ~r_i=2 ,  P_i = 0.8 :&quot;,&quot;id&quot;:&quot;DCLUWHZIGB&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rEtV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rEtV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 424w, https://substackcdn.com/image/fetch/$s_!rEtV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 848w, https://substackcdn.com/image/fetch/$s_!rEtV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 1272w, https://substackcdn.com/image/fetch/$s_!rEtV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rEtV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png" width="1456" height="662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/169363799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rEtV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 424w, https://substackcdn.com/image/fetch/$s_!rEtV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 848w, https://substackcdn.com/image/fetch/$s_!rEtV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 1272w, 
https://substackcdn.com/image/fetch/$s_!rEtV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf146c2-f1e3-45e0-8aed-f178360a75c6_1913x870.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>With just a single retry on failure, at accuracy as low as 80% we are able to run workflows with much larger sequences! Comparing to the 2-step workflow failing with no checks, this is a significant improvement. 
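</p><p>For readers who want to check these curves themselves, here is a minimal sketch of the formula above, simplified to n identical nodes (variable names are mine):</p><pre><code># P(workflow) = product over nodes of (1 - (1 - P_i)^(r_i)),
# simplified here to n identical nodes with per-attempt success p and r attempts.
def workflow_success(n_steps, p, r):
    node_success = 1 - (1 - p) ** r    # node survives at least one of r attempts
    return node_success ** n_steps     # every step must succeed

print(workflow_success(10, p=0.8, r=1))    # ~0.11: single attempts, 10 steps
print(workflow_success(10, p=0.8, r=2))    # ~0.66: one retry already helps a lot
print(workflow_success(100, p=0.9, r=4))   # ~0.99: long workflows become viable
</code></pre><p>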
<br><br>And if we consider higher probablity of per node success, numbers shine even brighter:</p><ul><li><p><strong>Higher Probability (90%) with Retries</strong>:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; {\\forall} ~r_i=2 ,  P_i = 0.9 :&quot;,&quot;id&quot;:&quot;RHMMVYUNED&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uIYy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uIYy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 424w, https://substackcdn.com/image/fetch/$s_!uIYy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 848w, https://substackcdn.com/image/fetch/$s_!uIYy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 1272w, https://substackcdn.com/image/fetch/$s_!uIYy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uIYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png" width="1456" height="628" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/169363799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uIYy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 424w, https://substackcdn.com/image/fetch/$s_!uIYy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 848w, https://substackcdn.com/image/fetch/$s_!uIYy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 1272w, 
https://substackcdn.com/image/fetch/$s_!uIYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccdcdc8-f01d-47da-9882-cf149306b0e6_1916x826.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Only now do we start seeing longer (&gt;20 step workflows) to have a chance at running reliably.</p><p>Furthermore, with higher intelligence and increased retries (e.g., r=4), even extensive workflows (e.g., 100-step workflows) achieve remarkably high reliability, approaching 99.9%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yibe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yibe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 424w, https://substackcdn.com/image/fetch/$s_!Yibe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 848w, https://substackcdn.com/image/fetch/$s_!Yibe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 1272w, https://substackcdn.com/image/fetch/$s_!Yibe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yibe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png" width="1456" height="524" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77443,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikiagarwal.substack.com/i/169363799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yibe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 424w, https://substackcdn.com/image/fetch/$s_!Yibe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 848w, https://substackcdn.com/image/fetch/$s_!Yibe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 1272w, https://substackcdn.com/image/fetch/$s_!Yibe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe357ba9c-bace-46e2-b1df-58b5b90b8c88_1915x689.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Plotting P(W) against number of retries at number of steps = 100. 
We see P(W) stabilises at r=4 considering P=0.9</p><p></p><p>What has been your experience working with AI agents with successive non-deterministic steps?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/make-your-ai-agent-fail-fast-to-succeed/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/make-your-ai-agent-fail-fast-to-succeed/comments"><span>Leave a comment</span></a></p><p></p><h2>Why Node-level Failure Detection Matters</h2><p>Effective AI workflows rely heavily on accurately detecting and addressing failures at each individual step or node. Implementing robust node-level checks rather than solely depending on end-to-end workflow validations provides substantial benefits:</p><ul><li><p><strong>Reduced Resource Wastage:</strong> Quickly identifying and resolving node failures prevents repeated, costly retries of the entire workflow.</p></li><li><p><strong>Improved Reliability:</strong> Early detection enables granular retry logic, improving workflow resiliency and uptime.</p></li><li><p><strong>Enhanced Debugging Capabilities:</strong> Pinpointing failures at the node level simplifies debugging, offering clearer visibility into which component failed and why.</p></li></ul><p>In large-scale AI workflows, single-node failures if left unchecked can propagate silently and magnify resource usage exponentially, leading to cascading failures and degraded performance.</p><h2>How to?</h2><h3>Identifying Failures in Non-deterministic Nodes</h3><p>Non-deterministic nodes such as LLM-generated content, probabilistic algorithms, and stochastic processes introduce unique challenges in identifying failures accurately. Unlike deterministic tasks, outputs from these nodes vary naturally, complicating the differentiation between acceptable variance and genuine failures.</p><p>Several effective strategies to tackle these challenges include:</p><h4>Hybrid Evaluation Techniques</h4><p>Hybrid evaluation blends deterministic checks with statistical or heuristic indicators. This method improves robustness by assessing multiple indicators rather than relying on binary pass/fail outcomes. Techniques include:</p><ul><li><p><strong>Confidence Score Thresholding:</strong> Using model confidence metrics to evaluate if node outputs are within acceptable thresholds.</p></li><li><p><strong>Statistical Heuristics:</strong> Establishing expected distributions or statistical bounds (e.g., KL-divergence, Jensen-Shannon divergence) to identify anomalies or outliers.</p></li><li><p><strong>Drift Detection Algorithms:</strong> Implementing methods like Adaptive Windowing (ADWIN) or Kolmogorov&#8211;Smirnov tests to dynamically assess node performance against expected behavior.</p></li></ul><h4>LLM-based Judging Systems</h4><p>Utilizing advanced language models as evaluative judges provides a powerful mechanism for validating outputs of non-deterministic nodes. 
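</p><p>As a rough illustration of the idea, a prompt-based judge can be as small as the sketch below; the judge prompt, threshold, and call_llm helper are placeholders rather than any specific vendor API:</p><pre><code># Illustrative prompt-based validation for a non-deterministic node.
# call_llm(prompt) is a placeholder for whatever model client you already use.
import json

JUDGE_PROMPT = """You are a strict reviewer. Given the TASK and the OUTPUT,
return JSON: {{"pass": true or false, "confidence": 0 to 1, "reason": "..."}}
TASK: {task}
OUTPUT: {output}"""

def judge_node_output(task, output, call_llm, threshold=0.7):
    """Return (ok, reason); production code should also handle malformed JSON."""
    raw = call_llm(JUDGE_PROMPT.format(task=task, output=output))
    verdict = json.loads(raw)
    ok = verdict.get("pass", False) and verdict.get("confidence", 0) &gt;= threshold
    return ok, verdict.get("reason", "")
</code></pre><p>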
In practice, this involves:</p><ul><li><p><strong>Prompt-based Validation:</strong> Generating structured prompts to solicit detailed evaluations of task outcomes from an LLM (e.g., GPT-4, GPT-4o).</p></li><li><p><strong>Self-consistency Checks:</strong> Using multiple samples or chain-of-thought reasoning to improve the reliability of the evaluation.</p></li><li><p><strong>Meta-evaluation Layers:</strong> Applying hierarchical evaluations, where one LLM judges node outputs and another assesses the reliability of the first evaluator, thus improving overall accuracy.</p></li></ul><p>While highly effective, these methods may introduce additional latency and computational costs. Balancing these trade-offs requires careful calibration to workflow demands.</p><h2>Intelligent Failure Handling</h2><p>Beyond detection, intelligently handling failures is essential for robust workflows. Strategies include:</p><ul><li><p><strong>Dynamic Retry Logic:</strong> Implementing adaptive retry mechanisms, such as exponential backoff combined with smart checkpointing, to reduce resource usage without compromising reliability.</p></li><li><p><strong>Task-Specific Error Handling:</strong> Developing tailored responses to node-specific failures, for example, fallback methods, alternative models, or parameter adjustments to minimize workflow interruptions.</p></li><li><p><strong>Resource-aware Retries:</strong> Integrating resource estimation models (e.g., predictive cost and compute time) into retry decision-making, ensuring retries occur only when justified by resource efficiency and probability of success.</p></li></ul><h2>Choosing the Right Evaluation Method</h2><p>Selecting suitable evaluation metrics or criteria involves detailed trade-off analysis:</p><ul><li><p><strong>Accuracy vs. Overhead:</strong> Choosing metrics that offer the best compromise between evaluation accuracy and computational or latency overhead. High-stakes nodes may justify resource-intensive evaluations, whereas lower-impact nodes require lightweight heuristics.</p></li><li><p><strong>Node-specific Metrics:</strong> Employing specialized metrics aligned closely with node functionality. Examples include BLEU scores for language generation tasks, Structural Similarity Index (SSIM) for visual outputs, or F1 scores for classification tasks.</p></li><li><p><strong>Operational Benchmarks:</strong> Integrating operational metrics such as memory usage, inference latency, or CPU/GPU utilization to proactively identify performance degradation that precedes outright failures.</p></li></ul><h2>Parallelization: Maximizing Reliability Without Latency</h2><p>Parallelization significantly enhances workflow robustness by addressing multiple potential failure points simultaneously. 
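</p><p>A minimal sketch of the redundancy idea, using only the Python standard library (the function names are illustrative, and real pipelines would add timeouts and cancellation of the losing replicas):</p><pre><code># Run a failure-prone node as several redundant replicas; keep the first success.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_redundancy(node_fn, payload, replicas=3):
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(node_fn, payload) for _ in range(replicas)]
        errors = []
        for fut in as_completed(futures):
            try:
                return fut.result()      # first replica to succeed wins
            except Exception as exc:     # this replica failed; wait for the next
                errors.append(exc)
        raise RuntimeError(f"all {replicas} replicas failed: {errors}")
</code></pre><p>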
Effective parallelization involves:</p><ul><li><p><strong>Concurrent Execution of Critical Nodes:</strong> Ensuring that critical or failure-prone tasks run in parallel with redundancy (replicated nodes), preventing single points of failure.</p></li><li><p><strong>Preemptive Checks and Parallel Validation:</strong> Performing concurrent validation tasks to promptly isolate failures, allowing recovery measures before the primary workflow is disrupted.</p></li><li><p><strong>Fine-Grained Node Splitting:</strong> Breaking down complex nodes into smaller, concurrent tasks that can be executed and validated independently, facilitating quicker failure detection and reducing rollback overhead.</p></li></ul><p>Advanced orchestration tools (e.g., Apache Airflow, Kubernetes Jobs, Exosphere, AWS Step Functions) are often leveraged to efficiently manage parallel execution, providing built-in fault tolerance, automatic retry mechanisms, and granular monitoring.</p><h2>Make your AI agent fail fast to succeed</h2><p>As AI workflows scale in complexity and criticality, implementing sophisticated node-level failure detection, intelligent retry strategies, and well-orchestrated parallelization becomes imperative. These technical measures ensure robust, efficient operations, reduce computational waste, and mitigate costly downtime.</p><p>Embracing these best practices brings us closer to achieving truly autonomous, reliable, and resource-efficient AI systems, capable of handling increasingly complex, long-duration tasks at scale. </p><p>How are you measuring success and achieving it?</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/p/make-your-ai-agent-fail-fast-to-succeed?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nikiagarwal.com/p/make-your-ai-agent-fail-fast-to-succeed?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.nikiagarwal.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Bangalore Startup Experience]]></title><description><![CDATA[An anecdote of my first startup experience as a college undergrad]]></description><link>https://www.nikiagarwal.com/p/the-bangalore-startup-experience</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/the-bangalore-startup-experience</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Wed, 09 Oct 2024 14:12:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XuJs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>A 9 to 5 isn&#8217;t enough for me.</em></p><p><em>Nah, I will</em> f<em>ollow my *passion*</em></p><p><em>Idea chahiye bas, fir toh aish hi aish</em></p><p>Startup is a trend today. It has evolved into a culture. Everyone wants to be their own boss, everyone looks at it with lusty eyes, but can you do it? Do you have it in you to actually startup? Is there anything special even, separating CEOs from normal janta?</p><p>I, like any other engineer, am a part of the startup-hype clan. Even I had these exact questions, looking for answers for which, I was fortunate enough to get an internship at a startup in Bangalore, the Silicon Valley of India.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XuJs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XuJs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!XuJs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!XuJs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!XuJs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XuJs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg" width="1200" height="900" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!XuJs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!XuJs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!XuJs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!XuJs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa5a759-1aff-4dfc-af7b-0e049cde7989_1200x900.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Touchdown</strong></figcaption></figure></div><blockquote><p>Two months back, I touched down in this city, not knowing what to imagine or expect in the coming months. This was a city of complete strangers, without a single point of fully known contact at the time of landing, other than my Dad, who had come to set things up for me. I was excited, hopeful, nervous and cautioned with scepticism. I was sure of only one thing: to get to know startups as well as it may be possible. 
I wanted to meet and get to know the people who dare to actually embark and continue on this fogged journey, albeit of self-satisfaction. What vision do they have, what belief systems they function on, how do they stay sane!</p><p>In the very first week, it was evident how Bangalore is different from any other city. It is a city of youths. Almost everyone here has a rented space somewhere in the city; many wear glasses and carry backpacks while travelling on autos with plugged in ear-pods, blasting music. Most swiggy their meals every day, making each restaurant serve their own &#8220;Combo for 1&#8221;. Almost all everyday services like laundry, point-to-point delivery, and transportation are decentralised and optimised for the city&#8217;s young town boy. Just like that, there are countless PGs; each lane has &#8220;To-let&#8221; boards inviting guests over and also co-working spaces.&nbsp;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lIhe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lIhe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lIhe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lIhe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lIhe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lIhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg" width="1200" height="1199" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1199,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lIhe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac807759-1c4e-4604-ab42-08fd89456923_1200x1199.jpeg 424w, 
<blockquote><p>Multiple offices functioning from a single facility, benefitting from common, well-equipped conference rooms and Chai Point dispensers: this is the ideal space for startups to flourish (after moving out of your own house-office). I remember being stunned by the look of my own office space; it was modern, well-lit, and had vending machines and great coffee. To top it all, my particular office had the cutest husky to grab all attention and send me on a guilt trip every day for eating fruits alone without succumbing to his accosting demands (only for your good, Dexter!).</p><p>Just like that, I got my first desk at a physical office. After initially fumbling to work with multiple screens, I made myself comfortable and got into the grind of morning meetings and code-ridden evenings. I learnt how offices function and what goes into envisioning a product. How the work is assigned, coordinated and accomplished. Every morning my team would meet to discuss the progress made the previous day and the plan for the current day. I loved the independence given to each developer regarding their own work. Once assigned to a developer, a task was theirs, and it was up to them to achieve it most efficiently. This was the trust that the seniors had in each developer, giving them the space to work however suited them best.
Guidance and teamwork were also practised very efficiently. Being an intern, I was given a seat right next to the team lead to make it easy for me to bug him for help whenever I got stuck, which he always gave, without question. My best learning happened during the morning meetings, where you hear about projects other than your own. It was while intently listening to those discussions that I started building my understanding of the product we wished to deliver and of the methods to deliver it.</p><p>Further, I saw how value is consistently added through the detailed work being done here. Doing something which has not been done before comes with its own challenges. It makes you rethink solutions and look at available technology more minutely, to find ways to squeeze out your own path to your goals and expectations. Low-level development, shredding the &#8220;benefit&#8221; of high-level frameworks, gives you this power to achieve the unachieved. This particularly requires developing great expertise in the domain, to know the what&#8217;s what of the system and build on it.</p><p>Apart from the routine work that goes on from 10 to 7, there was the office banter, which taught me its own share of things. Morning traffic woes, new tech releases, cricket analyses (which I was so out of), Hindutva and ottomans made the day whizz past in what felt like an hour on most days. During this time, I also got the chance to squeeze some time out with our CEO, going off-topic discussing one thing and another. Being the founder of a company requires the right mix of mind-bending philosophy and lightheadedness. Knowing that doing nothing is the ideal way of life, and making what you do count, go hand in hand. Fully believing that there is no outcome possible other than success, and accepting any other outcome, also go hand in hand. Being an open book while being mysterious is just the same.&nbsp;</p><p><strong>Making it work does not depend on an idea</strong>. It depends on how much you believe that the idea will work. That establishes a direct dependency on the people working with you: how much do they believe in it? It is the team and the value they add, that is, the technology that is created. Belief is not plucked off a bush; it is this added value which is the asset on which you can stand tall without batting an eye, making you believe, for real. This asset is backed by the team you put your trust in. If you believe, you find yourself working towards it, building stuff and making it happen, which is when you have <em>started up</em>.</p><p>I had heard people say, &#8220;The idea is just 1% of the startup.&#8221; I think I understand this statement to some extent now, if not entirely. So while starting up, focus on one thing: <strong>what value are you adding with this name</strong>?
The answer to this one question would then be enough to make you wield your stick on the foggy road and rocky terrain.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WoAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1297c09-3e89-4022-9b9f-e33ebd6f4447_1024x866.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!WoAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1297c09-3e89-4022-9b9f-e33ebd6f4447_1024x866.jpeg" width="1024" height="866" alt=""></a><figcaption class="image-caption"><strong>The Trek</strong></figcaption></figure></div><blockquote><p>Incidentally, I happened to go on a rocky trek with my team during my time here, and it was a challenging trek, both up and down. Going up was difficult because of the uncertainty. I did not know how steep it would get, how slippery the rocks ahead of me were and how much distance was left to cover. It became considerably easier when I believed that I could reach the top, exactly when uncertainty turned into conviction. However weary, sweated out or dehydrated I got, I would undoubtedly reach. Coming down was easy because I was sure of my way down, but difficult because I was already tired and shaky, with my grip risking failure at some points. It became much easier when I started focusing one hundred per cent on the way ahead of me: finding the best way down, optimising it based on my recent experience of handling such unwelcome rocks.</p><p>On the drive back home, only one thought looped in my head: that trek was exactly the story I had come to experience in Bangalore. The story of a startup, from zero to the top, and further ahead. The three-hour trek was such a succinct description of every startup, from start to finish.</p><p>Both your laughter and your cry count during this journey are greatly affected by the hands holding you from slipping down and stealing your food when you&#8217;re looking away. It depends on the team. How do you choose this team? I think this is the one question left to answer that cannot have an objective method mapped out. I imagine it to be a gut instinct that instantly says yes or no to the person in front of you; an acquired skill, in my opinion.</p><p>A <strong>starting point</strong>, a <strong>vision</strong> (even if blurred in spots) and <strong>some hands to hold</strong>: that could be listed as the potion for a startup, to hopefully, someday, have you seated in a Bugatti, whizzing through the streets of Bangalore to your home, where, when you lie in bed, you can tell yourself, &#8220;Yes, I add some value to technology, and to life in general.&#8221;</p></blockquote><p><em>Published originally on https://hillstotech.wordpress.com/</em></p><p><em><strong>October 25, 2021</strong></em></p>]]></content:encoded></item><item><title><![CDATA[Where's your gaze?
]]></title><description><![CDATA[The chasm between who I am and who I want to be]]></description><link>https://www.nikiagarwal.com/p/wheres-your-gaze</link><guid isPermaLink="false">https://www.nikiagarwal.com/p/wheres-your-gaze</guid><dc:creator><![CDATA[Nikita Agarwal]]></dc:creator><pubDate>Wed, 09 Oct 2024 14:01:55 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1517185051431-f92ca046f4f5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMzJ8fGdhemV8ZW58MHx8fHwxNzI4NDgyNDIxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We all want to achieve a list of things: this, that and then the next. We probably even jot down a well-thought-out plan and envision ourselves on track to the next goalpost. But often, we find ourselves falling behind on the plans we chalked out for ourselves. There&#8217;s the gap between plans and reality, and while it could arise from a concoction of things, I find a big reason for my falling behind to be where my eyes are locked in. Let me explain.</p><p>In the last few months, I have come up with a list of lists of steps to achieve a list of things, of which only a meagre percentage has been actualised. That&#8217;s embarrassing, to say the least. Why?</p><p>My gaze. I realised that while following my plan, my eyes are looking outward, at something, someone or some place outside of myself, to take a call or follow through on accepted task splits. And I am waiting for this external force to take action while my plans go stale, and I stagnate in the same pool of quicksand, once again crying for help rather than walking myself out.</p><p>This is not a self-journal, but an observation on how we tend to abandon ourselves because we are not self-aware. We often choose goals that are not our own but some other agent&#8217;s, and then go on to wait for those external forces to push us towards action. That, when I say it out loud, is downright foolish, and puts you in a weak and vulnerable position.
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="6000" height="4000" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:4000,&quot;width&quot;:6000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;brown concrete wall with human face embossed&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="brown concrete wall with human face embossed" title="brown concrete wall with human face embossed" srcset="https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1623473757643-0c63b7fe7f4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxtYWhhYmhhcmF0fGVufDB8fHx8MTcyODQ4MjQ1N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="true">Sumit Mangela</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>This reminds me of a rather famous anecdote from Mahabharata, where Arjun was poised with an impossible challenge of shooting a moving fish in the eye by just the reflection in the water bowl, with one of the heaviest bows of the time. While most princes at the competition were unable to even flinch the bow, let alone complete the challenge, Arjun took the challenge and with his eye focused on the fish&#8217;s, he landed the arrow at his target in the very first shot. This oversimplified version of the story tells just one thing, do you have your eye fixated on the goal, or on the onlookers? Is your focus undivided on the target ahead of you to have you unbeatable or are you deterred by the task and your eyes are now scanning the arena looking for a helping hand?</p><p>This brings to my mind, a similar anecdote from the same epic, that of Draupadi and the infamous &#8220;vastra-haran&#8221;. Her plight continued on as long as her fearful eyes kept scanning the courtyard. The moment she reflected on herself, and remembered &#8220;Krishna,&#8221; the story goes on to tell a miracle saved her. But I find it intriguing how Draupadi was also called &#8220;Krishnaa&#8221; fondly by Krishna himself. Is this a way to hint on the same principle, of self-reflection, self-resiliency and the truth that true power only comes from yourself. Only you, can save yourself. Yudishthir, let his eyes wander to people in the courtyard and took decisions to satisfy their vision of him, and not to help himself and his own. Focus, gaze, should always be inward. What could I do better? What is in my power? What can I change? What do I want to achieve? 
</p><p>I notice that dependencies, partnerships and their falling out often leave us looking outward for action, and instead of taking up the sword and fighting the situation out, well, we fall behind.</p><p>This makes me question: was picking this goal the right call? If I cannot put up a fight for something to happen, did I really want it in the first place?</p><p>So how do we unknowingly land up with such convoluted goals? Again, those wandering eyes are to blame, looking for shiny things that catch their attention. What does the world want, what do &#8220;they&#8221; think and, thus, what achievement would have me shining before the world? Gaze.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://images.unsplash.com/photo-1517185051431-f92ca046f4f5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMzJ8fGdhemV8ZW58MHx8fHwxNzI4NDgyNDIxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080"><img src="https://images.unsplash.com/photo-1517185051431-f92ca046f4f5?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMzJ8fGdhemV8ZW58MHx8fHwxNzI4NDgyNDIxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5184" height="3456" alt="Leopard walking on grass" title="Leopard walking on grass"></a><figcaption class="image-caption">Photo by Lareised Leneseur on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>Where is my gaze?</p><p>Is it at myself, at my interests, passions, time and desires, or am I gazing at the guy next door and inferring what I should be doing instead? Where is my gaze, and where&#8217;s yours?</p><p>Where is the focus? On the world, or on yourself?</p><p>That&#8217;s something I will be acting upon, until next time :)</p><div><hr></div>]]></content:encoded></item></channel></rss>