Chunking Is an Engineering Decision, Not a Preprocessing Step

· 15 min read
rag · chunking · retrieval · embeddings · nlp

In most RAG implementations encountered in production, chunking is treated as a one-time preprocessing decision made at project start and never revisited. A fixed size of 256, 512, or 1024 characters is chosen, a sliding window overlap of 10 to 20 percent is added, and the pipeline is shipped. The implicit assumption is that chunking is a technical detail with minor impact on retrieval quality. The assumption is wrong. Chunking strategy is the single most impactful configurable parameter in a RAG pipeline, because it determines what unit of meaning the embedding model encodes. A chunk that splits a sentence in the middle encodes an incomplete thought, and no amount of reranking or prompting can recover the meaning that was cut away.

Why chunk boundaries matter

When an embedding model encodes a chunk, it produces a vector representing the semantic content of that chunk as a unit. The embedding captures the meaning of the full input, weighted by the attention mechanism across all tokens in the chunk. A chunk that contains a complete thought, whether a full sentence, a coherent paragraph, or a function with its docstring, produces an embedding that accurately represents that thought. A chunk split mid-sentence produces an embedding representing a partial thought, which will match queries about that partial thought less reliably.

The practical consequence is retrieval failure at the boundary. Consider a passage: "The treatment was not effective in patients with stage 3 disease. However, when combined with radiotherapy, outcomes improved by 34%." If this is split between "stage 3 disease." and "However, when combined," the second chunk embeds the improvement statistic without its critical qualification. A query about "radiotherapy outcome improvement" correctly retrieves the second chunk but misses the context that this applies only when combined with another treatment. The answer the model generates is technically grounded in retrieved content but misleadingly incomplete.

This is the insidious failure mode of bad chunking: it does not surface as an obvious retrieval miss. The retrieved chunk contains words that match the query. The LLM produces a confident, coherent answer. The answer is wrong. Without a task-specific evaluation set that tests for completeness and qualification, the failure is invisible until a user catches it. The fix is not downstream, in the prompt or the reranker. The fix is at the boundary.

Fixed-size vs. sentence-aware vs. semantic chunking

Fixed-size chunking splits text every N characters or tokens, optionally with a stride for overlap. It is fast, predictable, and wrong for nearly every document type. The split point is determined by character count, not by linguistic structure, so it routinely splits sentences, severs compound predicates from their subjects, and divides enumeration items from their headers. The overlap window partially mitigates this by duplicating text across adjacent chunks, but it does not restore the semantic unit: the duplicated text appears in a different linguistic context in each chunk, and neither context is the correct one.
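The mechanics are trivial, which is part of the appeal. A minimal sketch (function name and parameters are illustrative, not from any particular library) shows how character-count boundaries cut straight through words:

```python
def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text every `size` characters, repeating `overlap` characters
    between adjacent chunks. Boundaries ignore linguistic structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Run on the medical passage from earlier with a 30-character size, the first chunk ends mid-word at "effectiv", exactly the boundary failure described above.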

Sentence-aware chunking splits at sentence boundaries detected by a sentence tokenizer (spaCy, NLTK's Punkt model, or regex-based heuristics), then groups sentences into chunks of a target size, never splitting within a sentence. This is substantially better than fixed-size for prose documents. The cost is that sentence tokenization is imperfect on real-world text: abbreviations, decimal numbers, and domain-specific punctuation cause false sentence boundaries. "Dr. Smith found that pH 7.4 was optimal. vs. Fig. 3 shows the result." contains three false sentence boundaries (after "Dr.", "vs.", and "Fig.") in addition to the actual one. Production sentence tokenizers handle common abbreviations from a built-in dictionary, but domain-specific text (medical records, legal documents, scientific papers) will always have a long tail of edge cases. For most use cases the imperfection is acceptable: a false sentence boundary creates a slightly suboptimal chunk boundary, not a catastrophic split. The improvement over fixed-size is large enough that the imperfection is worth tolerating.
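A sentence-aware packer is only slightly more code. The sketch below uses a naive regex boundary in place of a real tokenizer such as spaCy or Punkt, so the abbreviation problem just described applies to it in full:

```python
import re

def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Group whole sentences into chunks of at most max_chars, never
    splitting within a sentence. A single oversized sentence becomes
    its own chunk rather than being cut."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Note the design choice: when one sentence exceeds the budget, it is emitted whole and over-length, because an over-length chunk is preferable to a mid-sentence split.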

Semantic chunking clusters adjacent sentences by embedding similarity, splitting only when similarity drops below a threshold. Consecutive sentences that discuss the same topic are kept together; a topic shift triggers a chunk boundary. Greg Kamradt's implementation, using cosine similarity between consecutive sentence embeddings with a threshold at the 95th percentile of the similarity distribution, produces chunks that correspond roughly to coherent semantic segments. The cost is higher preprocessing compute, because each sentence must be embedded to determine boundaries, and variable chunk size, because some topics span many sentences and others span few.
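A minimal version of the percentile-threshold approach, in the spirit of Kamradt's notebook but not a copy of it; the `embed` callable is a placeholder for a real sentence embedder:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunks(sentences: list[str], embed, percentile: float = 95) -> list[str]:
    """Start a new chunk wherever the embedding distance between
    neighbouring sentences exceeds the given percentile of all
    neighbour distances."""
    vecs = [embed(s) for s in sentences]
    gaps = [1 - cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    if not gaps:
        return [" ".join(sentences)]
    cutoff = sorted(gaps)[min(len(gaps) - 1, int(len(gaps) * percentile / 100))]
    chunks, current = [], [sentences[0]]
    for sent, gap in zip(sentences[1:], gaps):
        if gap >= cutoff:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The per-sentence `embed` calls are exactly the extra preprocessing compute mentioned above: for a corpus of N sentences, boundary detection alone costs N embedding calls before any chunk is indexed.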

In recall experiments at equivalent embedding model and index configuration, semantic chunking typically outperforms sentence-aware chunking by 5 to 10 percentage points on recall@5 for question answering tasks, and sentence-aware chunking outperforms fixed-size by 8 to 15 percentage points. The improvement from fixed-size to sentence-aware is larger than the improvement from sentence-aware to semantic, which means sentence-aware chunking is the practical minimum for production RAG systems. If preprocessing compute is the binding constraint, sentence-aware chunking gets you most of the benefit at a fraction of the cost.

[Figure: Chunking strategy comparison: fixed-size vs. sentence-aware vs. semantic, with recall scores]

The chunk size vs. recall tradeoff

Recall and precision trade off with chunk size in opposite directions. Small chunks, in the 64 to 128 token range, have high precision when retrieved: a small, dense chunk is likely to be entirely relevant if it matches the query. But small chunks have lower recall because a small chunk often omits necessary context. The context needed to fully answer a question about "radiotherapy outcomes" may span 300 tokens, but if your chunks are 128 tokens, the necessary content is split across three separate chunks that may be retrieved separately, ranked below irrelevant chunks, or not retrieved at all if the top-k budget is tight.

Large chunks, in the 512 to 1024 token range, have higher recall: the necessary context is more likely to be within a single chunk. But large chunks have lower precision. A chunk containing 800 tokens may contain only 100 tokens of relevant content, and the remaining 700 tokens of irrelevant text dilute the embedding. The query vector must match not just the relevant 100 tokens but the weighted average of all 800 tokens worth of semantics, which reduces the similarity score of the chunk against specific queries. Beyond embedding dilution, large chunks waste LLM context budget: the retrieved chunks are placed in the model's input, and including 700 tokens of irrelevant text for every 100 tokens of relevant content burns context window space that could fit additional relevant chunks.

The empirical sweet spot for most question answering workloads is 256 to 512 tokens with sentence-aware boundaries. Below 128 tokens, precision is high but recall drops significantly because answer spans exceed chunk size. Above 1024 tokens, the embedding starts to represent an average over too much content to be precisely matchable to specific queries. The right choice within this range depends on the average length of the relevant unit in your document type. For long-form prose where a relevant answer typically spans a full paragraph, 512 tokens is appropriate. For FAQ-style documents where answers are self-contained in 2 to 3 sentences, 256 tokens suffices.

[Figure: Chunk size vs. recall and precision: the sweet spot curve for question answering workloads]

One calibration technique that is underused in practice: measure the distribution of answer span lengths in your document corpus before choosing a chunk size. If your documents are a legal FAQ where answers average 150 tokens, chunking at 512 tokens creates chunks that contain three to four answers, diluting the embedding for each. If your documents are research papers where the relevant methodology section averages 800 tokens, chunking at 256 tokens guarantees that every retrieved chunk contains a fragment, not a complete unit of meaning. The right chunk size is a property of your document type, not a universal constant.
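Under that heuristic, the calibration itself is a few lines; the 1.75 multiplier is simply the midpoint of the 1.5 to 2x rule of thumb, and the function name is illustrative:

```python
def recommend_chunk_size(answer_span_tokens: list[int]) -> int:
    """Target a chunk size of roughly 1.5-2x the average answer span
    length measured over a labeled sample of the corpus."""
    avg = sum(answer_span_tokens) / len(answer_span_tokens)
    return int(avg * 1.75)
```

For the legal FAQ example (spans averaging 150 tokens) this suggests a target near 262 tokens, close to the 256 bucket; for 800-token methodology sections it suggests 1400, a signal that flat chunking alone is the wrong tool and hierarchical indexing (next section) is worth considering.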

Hierarchical indexing: paragraph retrieval, sentence extraction

Hierarchical indexing decouples the embedding unit from the extraction unit. Documents are embedded at two granularities: large parent chunks, typically a full paragraph or section, for embedding and retrieval, and small child chunks, typically individual sentences, for extracting the precise answer span once a parent chunk has been retrieved.

The retrieval query matches against parent chunk embeddings, which are long enough to represent coherent context. Once a parent chunk is retrieved, the relevant sentence within it is identified either by a secondary embedding lookup within the parent or by asking the LLM to identify and quote the relevant portion. This preserves the context richness of large embeddings while enabling precise answer extraction from small chunks. The tradeoff at each level is handled optimally: the parent chunk is large enough to avoid the recall failure of small chunks, and the sentence extraction is precise enough to avoid wasting LLM context on irrelevant content.
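The two-step lookup can be sketched with a toy similarity function standing in for embedding search (word overlap here; a real system would query the vector index at both levels):

```python
def build_hierarchical_index(documents: list[str]):
    """Parent paragraphs for retrieval; child sentences for extraction,
    keyed by parent id so the two levels stay linked."""
    parents, children = [], {}
    for doc in documents:
        for para in doc.split("\n\n"):
            pid = len(parents)
            parents.append(para)
            children[pid] = [s.strip() for s in para.split(". ") if s.strip()]
    return parents, children

def overlap_score(query: str, text: str) -> int:
    # Stand-in for embedding similarity: count of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, parents, children):
    """Step 1: match the query against parent chunks.
    Step 2: extract the best child sentence within the winner."""
    best_pid = max(range(len(parents)), key=lambda i: overlap_score(query, parents[i]))
    best_sentence = max(children[best_pid], key=lambda s: overlap_score(query, s))
    return parents[best_pid], best_sentence
```

The returned pair is the point of the design: the parent supplies context for the LLM, the child supplies the precise span to quote.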

Microsoft's GraphRAG and LlamaIndex's hierarchical node parser implement variants of this approach. The GraphRAG paper reports consistent improvements on multi-hop question answering tasks where a single retrieved chunk rarely contains all the information needed, and the hierarchical structure allows evidence to be assembled from multiple parent chunks whose relevant sentences complement each other. In production RAG systems with long-form documents, particularly legal briefs, research papers, and technical documentation, hierarchical indexing consistently outperforms flat chunking at any fixed chunk size.

The implementation cost is real: you need to build and maintain two indices, parent and child, with a mapping between them. The retrieval logic requires two steps instead of one. Frameworks like LlamaIndex expose this as a first-class retrieval strategy with built-in support, which reduces the implementation overhead to a configuration choice. For document corpora where answer spans are consistently much shorter than the context needed to interpret them, the accuracy improvement justifies the architectural complexity.

Document-type-specific strategies

PDF documents introduce column layout, headers and footers, and footnotes that break natural text flow. Naive text extraction from PDFs concatenates across columns, merges headers and footers into the body text, and drops the structural relationship between a table's column headers and its data rows. A two-column paper extracted naively will have the left column's sentences interleaved with the right column's sentences at each line boundary. Chunking extracted PDF text without first cleaning the extraction produces chunks full of pagination artifacts, merged column fragments, and footer text mixed into body paragraphs. The minimum requirement for PDF-sourced RAG is a structured PDF parser that preserves reading order and separates page metadata from content (pdfplumber, Adobe PDF Extract API, or a vision-language model for scanned PDFs). Tables specifically should be extracted and embedded as structured text descriptions rather than as raw cell contents, because the spatial relationship between headers and values is lost in raw extraction and is only recovered if the table is reconstructed as "column name: value" pairs.
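The "column name: value" reconstruction is simple once the parser has delivered the header and rows separately (function name illustrative; the header/rows split is what tools like pdfplumber's table extraction provide):

```python
def table_to_text(header: list[str], rows: list[list[str]]) -> str:
    """Render each table row as 'column: value' pairs so the
    header-cell relationship survives embedding as plain text."""
    return "\n".join(
        "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        for row in rows
    )
```

Each output line is now a self-contained statement that an embedding model can represent, instead of a bare cell value whose meaning depended on its position on the page.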

HTML content from web pages typically contains navigation menus, advertisements, sidebars, and boilerplate that should be excluded before chunking. A chunk containing "2024 Company Name | Privacy Policy | Terms of Use | Cookie Settings" will match queries about policy-related content even though it contains no useful information. More dangerously, navigation menus often contain anchor text summarizing every section of a site, which produces chunks that appear to match almost any query with moderate confidence. HTML preprocessing should extract the main content region using the main, article, or content-specific div elements identified by class or ID, and strip nav, footer, aside, and header elements before chunking. The heading hierarchy should be preserved: H1, H2, and H3 text should be included as a prefix in the chunks below them to provide section context that would otherwise be lost when the heading and its content land in separate chunks.
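A stripped-down version of this preprocessing, using only the standard library's html.parser: boilerplate subtrees are dropped and each text block is prefixed with its nearest heading. A production pipeline would likely use a dedicated main-content extraction library instead; this is a sketch of the two rules, not a robust extractor:

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "footer", "aside", "header", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Skip boilerplate elements; prefix body text with the last heading seen."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a boilerplate subtree
        self.heading_tag = None    # currently open h1/h2/h3, if any
        self.current_heading = ""
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.heading_tag = tag
            self.current_heading = ""

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.heading_tag:
            self.heading_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.heading_tag:
            self.current_heading += text
        else:
            prefix = f"{self.current_heading}: " if self.current_heading else ""
            self.blocks.append(prefix + text)
```

The heading prefix is the part that pays off at retrieval time: a chunk reading "Pricing: Plans start at $10." matches pricing queries even when the H2 and its paragraph would otherwise have landed in separate chunks.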

[Figure: Document-type-specific chunking strategies: PDF, HTML, and code require different approaches]

Code requires AST-aware chunking, not text-based chunking. Splitting a Python function in the middle of its body creates a chunk that does not execute and cannot be meaningfully embedded as a complete unit of behavior. A chunk containing the second half of a function body with no function signature and no imports is not interpretable in isolation. Code should be chunked at function and class boundaries, detected by parsing the abstract syntax tree with a library like tree-sitter, which supports over 40 languages with a consistent API. The docstring and signature must always be included in the same chunk as the function body: a docstring in isolation is ambiguous, and a function body without its signature loses argument names and type annotations that are often the most query-relevant content. For large classes, split by method but include the class name, base classes, and class-level docstring as a prefix in every method chunk, so that each chunk is self-contained even when retrieved without its siblings. Module-level imports should similarly be included as a prefix in each function chunk to preserve the information needed to understand the function's dependencies.
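For Python sources specifically, the standard library's ast module is enough to sketch the idea without tree-sitter (which you would want for multi-language support). This toy version chunks at top-level function and class boundaries and carries the module imports as a prefix, per the rules above:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class, each prefixed with the
    module's import statements so the chunk carries its dependencies."""
    tree = ast.parse(source)
    imports = [ast.get_source_segment(source, node) for node in tree.body
               if isinstance(node, (ast.Import, ast.ImportFrom))]
    prefix = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = ast.get_source_segment(source, node)  # signature + docstring + body
            chunks.append(f"{prefix}\n\n{body}" if prefix else body)
    return chunks
```

Because ast.get_source_segment returns the node's exact source span, the signature, docstring, and body always travel together, which is precisely the invariant text-based chunking cannot guarantee.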

Metadata-aware chunking and filtered retrieval

Every chunk should carry metadata: source document identifier, section heading, page number for PDFs, creation date, author, and document type. This metadata enables filtered retrieval, where a query can be constrained to chunks from specific documents, date ranges, or document types before similarity search is applied.

Filtered retrieval dramatically improves precision for use cases where the relevant content is known to be in a specific subset of the corpus. A financial analyst asking about "Q3 2024 revenue" should retrieve only from documents tagged as financial reports from Q3 2024, not from all documents in the corpus that mention revenue. Without metadata filtering, the similarity search returns chunks from various documents that mention revenue, requiring the model to sort out which ones are relevant to the specific quarter. The model may succeed at this, but it is more likely to mix results from different quarters or include analyst commentary about revenue rather than the primary report figures.

The embedding index must be built with metadata fields enabled and indexed. Pinecone, Weaviate, Qdrant, and Chroma all support metadata filtering at query time with efficient indexed filters. The correct query pattern is filter first, then similarity search: reduce the search space to the relevant document subset using metadata predicates, then run similarity search within that filtered set. This is both more accurate and faster than searching the full corpus when the relevant subset is small, because approximate nearest neighbor search on a smaller corpus has fewer candidates to rank and a lower probability of false positives from documents outside the relevant subset.
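The filter-first pattern reduces to a predicate pass before the similarity ranking. An in-memory sketch with plain dicts (real vector stores apply the metadata predicate inside the ANN index rather than as a Python list comprehension):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def filtered_search(query_vec, records, filters, top_k=3):
    """Apply exact-match metadata predicates first, then rank the
    surviving records by cosine similarity."""
    candidates = [r for r in records
                  if all(r["meta"].get(k) == v for k, v in filters.items())]
    return sorted(candidates,
                  key=lambda r: cosine(query_vec, r["vec"]), reverse=True)[:top_k]
```

In the Q3 2024 example, the `{"quarter": "Q3-2024"}` predicate removes every other quarter's documents before similarity is computed, so no chunk from the wrong quarter can outrank the right one.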

The metadata should be populated at indexing time from the document's own structure where possible. A PDF with a title page, section headings, and author list provides this metadata directly. A web page provides it in its <head> metadata and URL structure. Source code provides module path, function name, and class context from the AST. For document types without embedded metadata, external metadata must be maintained in a document registry and joined with chunk records at indexing time. The operational overhead is real but the precision improvement makes it worthwhile for any corpus with natural substructure that queries exploit.

One often-overlooked dimension of chunk metadata is versioning. When documents are updated, the chunks derived from them must be updated too, and the old chunks must be removed from the index. A naive append-only indexing strategy accumulates stale chunks that remain retrievable long after the source document has changed. For RAG systems over living document corpora, the indexing pipeline must support document-level deletion and re-indexing, keyed by document identifier. This is an operational concern rather than a chunking concern, but it is inseparable from the decision to use rich metadata: the document identifier that enables deletion is the same identifier that enables metadata filtering.
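The delete-then-reinsert pattern keyed by document identifier can be sketched as follows (class and method names illustrative):

```python
class ChunkIndex:
    """Minimal index keyed by document id, so re-indexing a document
    replaces its chunks instead of accumulating stale ones."""
    def __init__(self):
        self.chunks = {}   # chunk_id -> (doc_id, text)
        self.by_doc = {}   # doc_id -> set of chunk_ids
        self._next = 0

    def upsert_document(self, doc_id: str, chunk_texts: list[str]):
        # Drop every chunk derived from the previous version first.
        for cid in self.by_doc.pop(doc_id, set()):
            del self.chunks[cid]
        ids = set()
        for text in chunk_texts:
            self.chunks[self._next] = (doc_id, text)
            ids.add(self._next)
            self._next += 1
        self.by_doc[doc_id] = ids
```

An append-only index is this same class with the pop-and-delete loop removed, which is exactly how stale chunks stay retrievable after the source document changes.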

Putting it together

The decision sequence for chunking in a new RAG project is concrete and ordered. First, classify your document types and choose a type-appropriate parsing strategy before any chunking decision: PDFs need structured extraction, HTML needs boilerplate removal, code needs AST parsing. Second, set sentence-aware chunking as the baseline regardless of document type, since it is always better than fixed-size and its cost is negligible. Third, measure the answer span length distribution in your corpus to calibrate the chunk size target: aim for chunks that are 1.5 to 2x the average answer span. Fourth, evaluate whether the compute cost of semantic chunking is justified by the recall improvement, which is typically 5 to 10 percentage points on recall@5. Fifth, if your documents are long-form with short answer spans, implement hierarchical indexing rather than accepting the recall-precision tradeoff of any single flat chunk size. Sixth, add metadata to every chunk at indexing time and build your query interface to accept metadata filters.

None of these steps are complicated in isolation. The system-wide consequence of skipping them is a RAG pipeline that appears to work in demos on simple queries and fails on the long tail of real queries where the answer spans a sentence boundary, requires qualification from an adjacent chunk, or is buried in a document type that naive extraction corrupts. Chunking is not preprocessing. It is the architecture of your retrieval index, and it deserves the same design attention as any other architectural decision.


References

  1. Lewis, P., Perez, E., Piktus, A., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020. https://arxiv.org/abs/2005.11401

  2. Edge, D., Trinh, H., Cheng, N., et al. "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv, 2024. https://arxiv.org/abs/2404.16130

  3. Gao, Y., Xiong, Y., Gao, X., et al. "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv, 2023. https://arxiv.org/abs/2312.10997

  4. Reimers, N. and Gurevych, I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019. https://arxiv.org/abs/1908.10084

  5. Kamradt, G. "Semantic Chunking." Notebook / Tutorial, 2023. https://github.com/FullStackRetrieval-com/RetrievalTutorials

  6. Borgeaud, S., Mensch, A., Hoffmann, J., et al. "Improving Language Models by Retrieving from Trillions of Tokens." ICML, 2022. https://arxiv.org/abs/2112.04426