Mar 21, 2026
The AI Engineering Stack in Production: What Actually Breaks at 10 Million Requests a Day
Development environments hide the failure modes that production reveals. GPU memory pressure spikes, embedding model drift, context budget exhaustion, and tool call storms are invisible at 100 requests per day and routine at 10 million. Here is the failure taxonomy and the runbook structure that prevents incidents.
production, infrastructure, reliability, llm-ops, scaling
Mar 16, 2026
Hallucination Is a Distribution Problem, Not a Bug to Patch
Hallucination is not a model defect waiting to be fixed in the next release. It is a predictable consequence of how language models work. Understanding the three distinct failure modes tells you which mitigation applies and what reduction you can realistically expect.
hallucination, production, reliability, rag, alignment
Mar 11, 2026
Prompt Engineering Has a Ceiling: Here Is Where It Is
Zero-shot, few-shot, chain-of-thought, self-consistency: each rung of the prompting stack costs more and returns less than the one before. Knowing where the ceiling is tells you when to stop prompting and start building a different solution.
prompting, fine-tuning, rag, llm-ops, production
Mar 6, 2026
Why Your Agent Loops Forever (And How to Design One That Doesn't)
An agent loop that lacks explicit termination conditions, typed state, and bounded tool schemas will loop until it hits a token limit or runs up a bill. The failure modes are predictable. So are the fixes.
agents, tool-use, inference, architecture, production
Mar 1, 2026
Chunking Is an Engineering Decision, Not a Preprocessing Step
Splitting at 512 characters is not a chunking strategy. It is a bet that your documents have no structure worth preserving. Chunk boundaries destroy subject-predicate relationships, split tables down the middle, and separate code from its context. Here is how to chunk for recall instead of convenience.
rag, chunking, retrieval, embeddings, nlp
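A minimal sketch of the alternative that post argues for: pack whole paragraphs into chunks instead of cutting at a fixed offset, so a boundary never lands mid-sentence or mid-table. The 512-character budget is the teaser's example; the function itself is illustrative.

```python
def chunk_by_paragraphs(text: str, max_chars: int = 512) -> list[str]:
    """Split on blank lines first, then pack whole paragraphs into
    chunks, preserving document structure at every boundary."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)  # close the chunk at a paragraph edge
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

A paragraph longer than `max_chars` still becomes its own oversized chunk here; a production version would recurse into sentences, but never into raw character offsets.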
Feb 24, 2026
Embedding Models Are Not Interchangeable: Choosing One That Won't Sink Your RAG Pipeline
MTEB aggregate score is the wrong metric for embedding model selection. A model that ranks first on the leaderboard may recall fewer than 60% of relevant documents in your domain. Here is how to select, evaluate, and extend an embedding model for production RAG.
embeddings, rag, retrieval, vector-search, nlp
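Evaluating on your own domain, as that post recommends, comes down to measuring recall@k against a labeled query set rather than trusting a leaderboard aggregate. A minimal sketch of the metric:

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of relevant docs appearing in the top-k retrieved,
    averaged over queries. `retrieved` maps query -> ranked doc ids;
    `relevant` maps query -> gold doc ids."""
    scores = []
    for q, rel in relevant.items():
        top_k = set(retrieved.get(q, [])[:k])
        scores.append(len(top_k & rel) / len(rel))
    return sum(scores) / len(scores)
```

Run this per candidate embedding model over the same labeled set; the teaser's "fewer than 60%" scenario is exactly what this number surfaces before you ship.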
Feb 19, 2026
Quantization Without Regret: Picking the Right Precision for Your Model
Dropping from fp16 to int4 cuts memory by 4x and increases throughput by 2-3x. The accuracy cost ranges from imperceptible to catastrophic depending on model size, task, and quantization method. Here is the framework for making this decision correctly.
inference, quantization, precision, serving, hardware
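The 4x memory claim is straight arithmetic over bits per weight; a quick back-of-envelope check, using a hypothetical 70B-parameter model:

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Weight memory only; activations and KV cache are extra."""
    return n_params * bits / 8 / 1e9

fp16 = model_memory_gb(70e9, 16)  # 140.0 GB
int4 = model_memory_gb(70e9, 4)   # 35.0 GB
ratio = fp16 / int4               # 4.0 -- the 4x cut from the teaser
```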
Feb 14, 2026
KV Cache Is the Bottleneck You're Not Measuring
At batch size 16 and 4096-token context, LLaMA-2-70B needs 40 GB of KV cache on top of 140 GB of weights. Most inference bottlenecks trace back to memory pressure from the cache, not model compute. Here is how to measure it, size it, and reduce it.
inference, kv-cache, serving, throughput, transformers
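Sizing the cache is a one-line formula. This sketch uses a hypothetical 7B-class full-MHA config rather than the post's 70B numbers, since the exact figure depends on the attention variant (GQA shrinks the cache by the ratio of query heads to KV heads) and precision:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """2x for K and V; fp16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch

# Hypothetical 7B-class config: 32 layers, 32 KV heads (full MHA), head_dim 128
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=16)
size_gib = size / 2**30  # 32.0 GiB -- already comparable to the weights
```

Note that the cache grows linearly in both batch size and context length, which is why it overtakes weight memory as soon as you batch long sequences.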
Feb 9, 2026
Preference Optimization Without the Pain: DPO vs. RLHF in Production
RLHF requires three model copies in memory and a full PPO training loop. DPO reduces alignment to a supervised classification objective. Understanding the math behind both explains when DPO fails and when the complexity of RLHF is actually worth it.
fine-tuning, rlhf, dpo, alignment, training
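The "supervised classification objective" that post refers to is the DPO loss: per preference pair, a logistic loss on the log-probability margin between chosen and rejected completions, measured against a frozen reference model. A scalar sketch (batching and autograd omitted):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    where _w is the chosen (winning) completion and _l the rejected one."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

When the policy agrees with the reference (zero margin) the loss is log 2; as the policy pushes the chosen completion's log-probability up relative to the rejected one's, the loss falls. No reward model, no PPO loop, just this classifier-style objective over preference pairs.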
Feb 4, 2026
The Eval Trap: Why Your Fine-Tuned Model Scores 94% and Still Fails in Production
A 94% benchmark score is not a production readiness signal. It is often a contamination signal. Building an evaluation pipeline that actually predicts production behavior requires understanding why standard benchmarks fail and what to measure instead.
fine-tuning, evals, testing, production, llm-ops
Jan 30, 2026
LoRA From First Principles: Why Low-Rank Adaptation Works and When It Breaks
A LoRA adapter for a 4096x4096 weight matrix at rank 8 reduces 16.7 million trainable parameters to 65,536. The math behind why this works reveals exactly when it will fail and how to configure it without guessing.
fine-tuning, lora, adaptation, training, transformers
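The parameter arithmetic in that teaser is easy to verify: a full d_in x d_out update versus the two low-rank factors A (d_in x r) and B (r x d_out) that LoRA trains instead.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters: full matrix vs low-rank factor pair."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A: d_in x r, plus B: r x d_out
    return full, lora

full, lora = lora_params(4096, 4096, 8)
# full = 16_777_216 (~16.7M), lora = 65_536 -- a 256x reduction
```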
Jan 25, 2026
What the FFN Layers Are Actually Storing (The Transformer as a Key-Value Memory)
More than 40 percent of a transformer's parameters live in its FFN layers. Research published in 2021 showed these layers function as key-value memories storing factual associations. Understanding this explains why hallucination happens and why retrieval beats scale for knowledge-intensive tasks.
transformers, architecture, ffn, hallucination, rag
Jan 20, 2026
Tokenization Is Not Preprocessing: It Is a Hard Constraint on What Your Model Can Reason About
Before the model sees a single word, the tokenizer has already decided what it can and cannot reason about. Arithmetic failures, non-English degradation, and structured data hallucinations all trace back to the same mechanism.
transformers, tokenization, nlp, architecture, inference
Jan 15, 2026
Why Your 1M-Token Context Window Is Mostly Wasted
A million-token context window sounds like a superpower. In practice, most of what you put in the middle gets ignored. Here is the research behind why, and how to engineer around it.
transformers, context-window, rag, inference, positional-encoding
Jan 10, 2026
How Attention Actually Works (And Why It Costs You O(n²) Every Time)
Before you pick a context size or a model, you need to understand why attention's memory bill grows with the square of sequence length, and what Flash Attention and GQA actually do about it.
transformers, attention, inference, architecture
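The quadratic bill falls out of materializing the n x n score matrix per head (the allocation Flash Attention avoids by tiling, and GQA does not touch, since it shrinks KV heads, not query-key scores). A back-of-envelope sketch with a hypothetical 32-head fp16 config:

```python
def attn_scores_bytes(n_heads: int, seq_len: int, dtype_bytes: int = 2) -> int:
    """Memory to materialize one layer's attention score matrices for one
    sequence: n_heads separate (seq_len x seq_len) matrices."""
    return n_heads * seq_len * seq_len * dtype_bytes

a = attn_scores_bytes(32, 4096)  # 1 GiB per layer per sequence
b = attn_scores_bytes(32, 8192)  # 4 GiB: doubling context quadruples it
```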