Tokenization Is Not Preprocessing: It Is a Hard Constraint on What Your Model Can Reason About


Every large language model pipeline starts the same way: raw text goes in, tokens come out, and then the model begins. Most engineers treat this first step as plumbing. The tokenizer runs, a list of integer IDs is produced, and the interesting work begins downstream with attention layers and generation. The tokenizer is thought of as a translation layer, a necessary conversion step before the real computation starts.

This framing is wrong, and the wrongness has concrete consequences. The tokenizer does not neutrally convert text into a representation the model can process. It makes decisions, baked in at training time and unchangeable at inference time, about which sequences of characters constitute meaningful units of language. Those decisions determine whether your model can add two numbers reliably, how many tokens a page of Thai text consumes versus a page of English, whether indented Python uses 30% of its token budget on whitespace, and whether switching from a 32k to a 128k vocabulary makes non-English tasks meaningfully cheaper to run. None of these are quirks or edge cases. They are direct, predictable consequences of how modern tokenizers work, and understanding the mechanism is the prerequisite for reasoning about any of them.

BPE from scratch: how a vocabulary is built from character frequency

The dominant tokenization method used in GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and most other major models is Byte-Pair Encoding (BPE), originally proposed for neural machine translation by Sennrich et al. in 2016. The algorithm is simple enough to trace through by hand, which is exactly what reveals why the resulting vocabulary has the properties it does.

BPE starts with the most granular possible representation: every word in the training corpus is decomposed into individual characters, with a special end-of-word symbol appended to track word boundaries. For a toy corpus where "lower" appears 40 times, "newer" 30 times, "wider" 25 times, and "higher" 20 times, the initial character representation of "lower" is l o w e r, each character a separate token.

The algorithm then counts every adjacent pair of tokens across the entire corpus. In this example, the pair e r appears in all four words and accumulates 40 + 30 + 25 + 20 = 115 counts. It is the most frequent pair, so it gets merged first: everywhere e r appears, it is replaced by the single token er, and er is added to the vocabulary. The corpus is updated, and the process repeats.

On the next iteration, the pair l o might be the most frequent, so l and o merge into lo. Then lo and w merge into low. After three merges, "lower" is represented not as five separate characters but as two tokens: [low][er]. After 32,000 such merges on a web-scale corpus, "lower" would be a single token [lower], because its frequency was high enough that it got absorbed into a single vocabulary entry. Less common words, technical terms, and text in languages underrepresented in the training corpus remain fragmented into many pieces.
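The merge loop just described can be traced in a few lines of Python. This is a toy sketch of the counting and merging steps on the four-word corpus above (it omits the end-of-word symbol, and ties between pairs are broken arbitrarily), not a production tokenizer:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count each adjacent token pair across the corpus, weighted by word frequency."""
    counts = Counter()
    for tokens, freq in corpus:
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = pair[0] + pair[1]
    out_corpus = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        out_corpus.append((out, freq))
    return out_corpus

# Toy corpus from the text: (character tokens, word frequency).
corpus = [
    (list("lower"), 40),
    (list("newer"), 30),
    (list("wider"), 25),
    (list("higher"), 20),
]

best_pair, count = get_pair_counts(corpus).most_common(1)[0]
# best_pair is ('e', 'r') with count 115, matching the hand trace above.
corpus = merge_pair(corpus, best_pair)
# "lower" is now represented as ['l', 'o', 'w', 'er'].
```

Running the loop to completion, recounting and merging until a target vocabulary size is reached, is all a real BPE trainer does, just at web scale.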

The critical insight is that frequency in the training corpus determines which sequences become tokens. Not linguistic meaning. Not conceptual boundaries. Frequency. The token [the] exists because the three-character sequence appeared billions of times. The token [tokenization] may or may not exist depending on how often it appeared; in GPT-4's tokenizer, the word splits into [token] and [ization] because the full word never appeared often enough to earn a single vocabulary entry, while each half did. The word [deserialization] fragments further still, because it is a technical term appearing in a relatively small slice of the training corpus.

BPE merge process: how character sequences merge into vocabulary tokens based on corpus frequency

This frequency-driven construction has an immediate practical consequence: the tokenizer has been optimized for the distribution of text it was trained on. For a tokenizer trained predominantly on English web text, English is cheap. For everything else, you are paying the cost of a vocabulary that was not designed with your use case in mind.

Why arithmetic fails: opaque token IDs carry no place-value structure

The most commonly cited symptom of tokenization-as-constraint is arithmetic failure, and it is worth understanding precisely why it happens rather than dismissing it as a model capability issue.

Consider the addition problem 197 + 642 = 839. GPT-4's cl100k_base tokenizer never groups more than three digits into one token, so this particular problem is "lucky": 197, 642, and 839 each tokenize as a single token with its own opaque ID. The model sees the tokens [197][+][642][=] and is asked to predict that [839] follows. If it has seen enough similar examples in training, it might get this right through pattern matching. But this is not arithmetic. The model has no representation that tells it the token for 197 contains a hundreds digit of 1, a tens digit of 9, and a units digit of 7.

Now consider 19876 + 24531. GPT-4's tokenizer chunks runs of digits left to right into groups of at most three, so 19876 splits as [198][76] and 24531 as [245][31]. The token for 198 has some ID, say 3753. The token for 199 has a completely different ID, say 3754. The two IDs happen to be adjacent integers, but the model's embedding table assigns them independently learned vectors with no geometric relationship. The model cannot infer from the representation of 198 what 199 means, because the tokenizer never encoded that 198 and 199 share all but one digit.

The problem is compounded by inconsistency. The digits 7 and 6 at the end of 19876 are fused into the single token [76]. The same digits appearing in a different context (say, in 176 or 7.6) land in entirely different tokens, because the surrounding characters determine how the digit run is chunked. The digit is not a stable atomic unit; it is a byproduct of whichever chunking happened to apply to the surrounding context. Column arithmetic is impossible when the same symbol has different representations depending on what is next to it.

The model sees opaque token IDs, not digit structure. Token IDs 3753 and 3754 differ by 1, but their embedding vectors may be as far apart as those of any two arbitrary tokens.
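A tiny sketch makes the independence concrete. At initialization, embedding vectors are drawn independently, so the vectors for adjacent IDs start out as unrelated as any two random directions; training can move them, but nothing in the architecture ties the learned geometry to ID adjacency. The IDs, dimension, and seed below are arbitrary:

```python
import random

random.seed(0)
DIM = 64  # illustrative embedding width

def random_embedding():
    """One independently initialized embedding vector."""
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

# Independent vectors for two adjacent token IDs (IDs are arbitrary).
embeddings = {3753: random_embedding(), 3754: random_embedding()}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

# Adjacent IDs, yet the cosine similarity is near zero, like any
# two random directions in 64 dimensions.
similarity = cosine(embeddings[3753], embeddings[3754])
```

Any numeric structure the trained model exhibits has to be learned statistically from co-occurrence, token by token, rather than read off the representation.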

Chain-of-thought prompting (Wei et al., NeurIPS 2022) is frequently described as the fix for arithmetic failures. What it actually does is work around the tokenization problem by forcing the model to externalize the intermediate reasoning steps in text. When the model writes "units digit: 7 + 6 = 13, write 3, carry 1" in its output, it is now operating on single-digit tokens that do exist as stable units in the vocabulary, and it can apply the addition rules it learned from worked examples in its training data. The model is not doing arithmetic in the sense a calculator does. It is pattern-matching on a sequence of digit-level tokens that it has been trained to produce correctly. This works much better than asking for a direct answer, but it is a patch, not a solution. For very large numbers, for multiplication, or for floating-point operations, the model will still fail because the chain-of-thought steps themselves eventually require operations that exceed what the model can reliably pattern-match on token sequences.

How large numbers fragment into inconsistent tokens, preventing systematic column arithmetic

The practical upshot for production systems: never trust a model to perform arithmetic directly on token sequences. Route numeric computation to a code interpreter, a calculator tool, or any external system that operates on actual numbers. The model's role is to understand what computation needs to happen, not to perform it.
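A minimal version of that routing is easy to sketch: have the model emit the expression as text, then evaluate it with real integer arithmetic instead of asking the model for the answer. This sketch uses Python's ast module to evaluate basic arithmetic safely; the function name and the set of supported operators are illustrative choices:

```python
import ast
import operator

# Binary operators supported by this sketch.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str):
    """Evaluate a basic arithmetic expression with real number semantics."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

safe_eval("19876 + 24531")  # 44407, computed, not pattern-matched
```

Unlike the model, this evaluator operates on actual numbers, so its accuracy does not degrade with operand length or tokenization luck.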

Language disparity: fertility and the hidden context budget tax

Fertility is the term researchers use for the number of tokens required to represent a word. An English word requires on average 1.3 tokens with a 32k BPE vocabulary trained on web data. A French word requires about 1.4, German about 1.5 (compounding adds tokens for inflections), and Mandarin Chinese about 1.8 (characters are often single tokens, but word-level fertility depends on how you count words). Arabic requires roughly 3.2 tokens per word because the language is morphologically rich, with complex prefixing and suffixing, and its script, with optional vowel diacritics, is underrepresented in typical training corpora. Thai requires 5 to 6 tokens per word with a 32k vocabulary, partly because Thai has no word-boundary spaces (the tokenizer has to segment a continuous stream of characters) and partly because Thai script is severely underrepresented in English-dominated training data.

These fertility numbers have immediate cost and context implications. A 4,096-token context window holds approximately 3,150 English words. The same context window holds approximately 700 Thai words. If you are building a customer service system for a Thai-speaking market and you size your context budget based on English estimates, you will be wrong by a factor of 4. The prompt that fits comfortably in a 2,000-token window in English will overflow a 4,096-token window in Thai. The API cost you estimated per conversation will be 4 times higher. The documents you can fit into a retrieval-augmented generation context will be one quarter as long.
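These back-of-the-envelope numbers are easy to fold into a planning helper. The fertility table below restates the per-language estimates from the text (Thai at the 5.5 midpoint); for real budgeting, the values should be measured on your own corpus with the target model's tokenizer:

```python
# Approximate tokens per word with a 32k English-heavy BPE vocabulary
# (estimates from the text; measure on your own data for real planning).
FERTILITY = {"en": 1.3, "fr": 1.4, "de": 1.5, "zh": 1.8, "ar": 3.2, "th": 5.5}

def words_that_fit(context_tokens: int, lang: str) -> int:
    """Rough number of words a context window holds in a given language."""
    return int(context_tokens / FERTILITY[lang])

words_that_fit(4096, "en")  # ~3150 words
words_that_fit(4096, "th")  # ~744 words: roughly a 4x budget tax
```

The same helper, pointed at your actual token prices, turns the fertility gap directly into a per-conversation cost multiplier.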

Tokens per word by language: fertility comparison showing context budget disparity

Petrov et al. (NeurIPS 2023) documented this disparity systematically across a large set of languages and showed that it does not just affect cost: it affects quality. Models perform worse on tasks in high-fertility languages not only because the context window runs out faster, but because the fragmented sub-word representations carry less semantic information per token. A language model trained on a corpus where 95% of tokens are English has seen far fewer examples of Thai morphological patterns than English morphological patterns. The subword pieces it creates for Thai are lower-quality semantic units, and the representations of Thai words tend to be noisier. Rust et al. (ACL 2021) showed this directly: monolingual models trained in a specific language consistently outperform multilingual models on tasks in that language, and the performance gap correlates with the multilingual model's fertility for that language. High fertility is both a symptom of poor vocabulary coverage and a cause of poor downstream task performance.

The practical implication for multilingual applications is that model selection cannot be made on English benchmarks alone. A model that performs well on English arithmetic, reading comprehension, and summarization may perform significantly worse on the same tasks in Arabic or Thai because its tokenizer was not designed for those languages. LLaMA 3's expansion from 32k to 128k vocabulary was motivated in part by this exact problem: the 128k tokenizer allocates more vocabulary entries to non-Latin scripts, reducing Thai fertility to approximately 3 to 4 tokens per word and Arabic fertility to approximately 2 to 3 tokens per word. This brings non-English languages closer to the efficiency English users take for granted, at the cost of a larger embedding table.

Code tokenization: the indentation tax and why delimiter choice matters

Code is a domain where tokenization patterns have large, measurable practical effects that differ significantly from natural language. Three effects are worth knowing precisely.

The first is the Python indentation tax. Python's indentation-based block structure means that deeply nested code contains long runs of leading whitespace before any semantically meaningful token. In a function nested three levels deep inside a class method, each line begins with twelve spaces. GPT-4's tokenizer compresses runs of spaces into multi-space tokens (it has tokens for two spaces, four spaces, eight spaces, and other common indentation widths), but deeply nested code with irregular indentation can still spend 25 to 35% of its token budget on whitespace. A 200-line function at four levels of nesting spends at least one token per indented line on whitespace, on the order of 200 tokens before a single identifier is encoded. This is not catastrophic, but it matters when you are trying to fit a complete Python file into a context window for code review or refactoring.
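One way to gauge the indentation tax on your own code is to measure what fraction of its characters are leading whitespace, a crude lower-bound proxy for the token share. This helper is an illustrative sketch; the actual token cost depends on how the specific tokenizer's multi-space tokens line up with your indentation:

```python
def indentation_char_fraction(source: str) -> float:
    """Fraction of non-blank-line characters that are leading spaces."""
    total = indent = 0
    for line in source.splitlines():
        if not line.strip():
            continue  # skip blank lines
        indent += len(line) - len(line.lstrip(" "))
        total += len(line)
    return indent / total if total else 0.0

sample = (
    "def f():\n"
    "    for i in range(3):\n"
    "        if i:\n"
    "            print(i)\n"
)
indentation_char_fraction(sample)  # ~0.38 for this deeply nested snippet
```

Running this over a repository before committing to a code-analysis pipeline gives a quick sense of how much of the context budget structure alone will consume.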

The second effect concerns data formats. YAML and JSON encode the same structured data with very different token costs. YAML uses significant whitespace, newlines for structure, and implicit typing, all of which fragment into more tokens than JSON's explicit delimiters. A configuration file in YAML that uses 400 tokens may encode identical information in JSON at 280 tokens. For applications that feed large configuration files or API schemas into a model's context (which is common in tool-use and function-calling pipelines), serializing to JSON rather than YAML meaningfully extends what you can fit in the window.
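Character counts are only a rough proxy for token counts, but even that proxy shows the spread between serializations of the same data. A sketch comparing pretty-printed and compact JSON (the configuration content here is made up; for real budgeting, count tokens with the target model's tokenizer):

```python
import json

config = {
    "service": {
        "name": "checkout",
        "replicas": 3,
        "env": {"LOG_LEVEL": "info", "TIMEOUT_S": 30},
    }
}

pretty = json.dumps(config, indent=2)                # newline- and space-heavy
compact = json.dumps(config, separators=(",", ":"))  # minimal delimiters

len(pretty), len(compact)  # compact is meaningfully shorter
```

Both strings decode to the identical object, so for machine-consumed context the compact form is free savings; keep the indented form only where a human needs to read the prompt.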

The third effect is between programming language families. Languages that use explicit delimiters for block structure (JavaScript, Go, Rust, Java with braces) are more token-efficient per line of logic than languages that rely on whitespace (Python, CoffeeScript, YAML, Makefile). This is not a strong reason to write JavaScript over Python, but it is relevant when you are building a code analysis system and choosing which languages to support at scale. At 10 million API calls per month processing code files, even a 20% difference in tokens per character is a substantial line item.

Vocabulary size: LLaMA 2, LLaMA 3, and the embedding memory trade-off

The vocabulary size is a hyperparameter chosen at tokenizer training time, and it has two competing effects that pull in opposite directions. A larger vocabulary allows common multi-character sequences (including non-English words and technical terms) to be assigned their own token, reducing fertility and improving efficiency. A smaller vocabulary forces more fragmentation but keeps the embedding table small, which matters for memory-constrained deployments.

LLaMA 2 used a 32,000-token vocabulary, consistent with the SentencePiece BPE tokenizers used by earlier models. This produces good efficiency for English but poor efficiency for non-English scripts and technical domains with specialized terminology. LLaMA 3 expanded to 128,256 tokens specifically to address language coverage and code quality. Qwen-series models from Alibaba use even larger vocabularies, around 151,643 tokens, with extensive coverage of Chinese characters and character combinations.

The embedding memory cost is concrete. Each vocabulary entry requires one embedding vector of dimension equal to the model's hidden size. For a 70B-parameter model with hidden dimension 8,192, each vocabulary entry consumes 8,192 x 2 bytes (in bfloat16) = 16,384 bytes. LLaMA 2's 32k vocabulary requires approximately 0.5 GB for the embedding table. LLaMA 3's 128k vocabulary requires approximately 2 GB. Qwen's 151k vocabulary requires approximately 2.3 GB. For a 70B model that already occupies 140 GB in bfloat16, this is a small but non-trivial addition, and on memory-constrained edge deployments (running a 7B model at 4-bit quantization on 8 GB), the embedding table can represent a meaningful fraction of total memory.
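The arithmetic generalizes to any model. A small helper makes the trade-off explicit (vocabulary sizes are each model's published tokenizer sizes, hidden dimension 8,192 matches the 70B example, and bfloat16 is assumed at 2 bytes per parameter):

```python
def embedding_table_bytes(vocab_size: int, hidden_dim: int,
                          bytes_per_param: int = 2) -> int:
    """Memory for the input embedding table (bfloat16 by default)."""
    return vocab_size * hidden_dim * bytes_per_param

# Hidden dimension 8,192, as in the 70B-parameter example.
for name, vocab in [("LLaMA 2", 32_000), ("LLaMA 3", 128_256), ("Qwen", 151_643)]:
    gib = embedding_table_bytes(vocab, 8_192) / 2**30
    print(f"{name}: {gib:.2f} GiB")  # ~0.49, ~1.96, ~2.31 GiB
```

Note that models typically also carry an output projection of the same shape; when it is not weight-tied to the input embeddings, the figures above roughly double.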

The fertility gains from larger vocabularies are real but not uniform. For English-heavy workloads, moving from 32k to 128k vocabulary provides minimal improvement because English was already well-covered at 32k. The gains concentrate in languages that were underrepresented in the original vocabulary and in technical domains like code, math, and scientific notation. For a team running English-only customer service, the memory cost of a 128k vocabulary buys almost nothing. For a team running multilingual document processing across European and Asian languages, it can meaningfully reduce both cost and quality degradation.

Practical consequences for production systems

The tokenization constraint surfaces in three areas that directly affect production system design.

Cost estimation is the most immediately practical. API pricing is per token, not per word or per character. If you estimate costs using English fertility numbers (1.3 tokens per word) and your actual workload is Arabic, you will underestimate API costs by roughly 2.5x. If your workload is Thai, by 4x or more. Before committing to a model provider or a pricing tier, measure actual token counts on a representative sample of your actual data in your actual languages. The tiktoken library for OpenAI models and the tokenizer classes in Hugging Face Transformers allow you to count tokens for any text before sending it to the API. This is a five-minute exercise that can prevent significant budget surprises.

Prompt design for structured data should account for tokenization efficiency. When injecting JSON schemas, database schemas, configuration files, or API documentation into a context window, the serialization format affects how much fits. Prefer compact JSON over verbose YAML for injected data. Prefer shorter key names in JSON objects that will be injected frequently. A key named "user_authentication_status" uses more tokens than "auth_status", and in a prompt that includes dozens of such keys, the savings compound. When injecting code for analysis, consider whether the full file is necessary or whether extracting the relevant function signatures and docstrings produces a more token-efficient representation of the same semantic content.

Model selection for non-English tasks should begin with tokenizer inspection, not benchmark comparison. Before evaluating model quality on Thai or Arabic tasks, measure the fertility of each candidate model's tokenizer on your actual data. A model with lower fertility will cost less per query, fit more context, and typically perform better on that language because higher vocabulary coverage correlates with better training data coverage. The Hugging Face tokenizer page for any model will typically report its vocabulary size, and you can measure fertility directly by running tokenization on a sample and dividing token count by word count. Models that explicitly advertise multilingual support (Qwen, BLOOM, mBART) generally have better fertility profiles for non-English languages than models optimized primarily for English.
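Measuring fertility requires nothing more than a tokenizer callable and a representative text sample. A sketch, with a toy tokenizer standing in for a real one (with tiktoken or a Hugging Face tokenizer you would pass its encode method instead):

```python
def fertility(tokenize, text: str) -> float:
    """Average tokens per whitespace-delimited word under a tokenizer."""
    words = text.split()
    return len(tokenize(text)) / len(words)

# With a real tokenizer (assumed installed), e.g.:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   fertility(enc.encode, sample_text)

# Toy stand-in tokenizer: fixed two-character chunks.
toy_tokenize = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]
fertility(toy_tokenize, "hello world")  # 3.0
```

Whitespace splitting is itself a simplification; for languages without word-boundary spaces, such as Thai, pair this with a language-appropriate word segmenter before trusting the ratio.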

The broader principle is that the tokenizer is not a detail to be delegated to a library call. It is an architectural decision made at model training time that shapes everything downstream: what representations are available to the model, how efficiently different languages and domains use the context window, what computations the model can and cannot perform reliably on its native representations, and what the true cost of running a workload will be. Understanding it at the level of the BPE algorithm is the minimum required to make informed choices about model selection, prompt design, and system architecture.


References

  1. Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016. https://arxiv.org/abs/1508.07909

  2. Petrov, A., La Malfa, E., Torr, P., and Bibi, A. "Language Model Tokenizers Introduce Unfairness Between Languages." NeurIPS, 2023. https://arxiv.org/abs/2305.15425

  3. Rust, P., Pfeiffer, J., Vulić, I., Ruder, S., and Gurevych, I. "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models." ACL, 2021. https://arxiv.org/abs/2012.15613

  4. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. "LLaMA: Open and Efficient Foundation Language Models." arXiv, 2023. https://arxiv.org/abs/2302.13971

  5. Meta AI. "Meta LLaMA 3." Meta AI Blog, 2024. https://ai.meta.com/blog/meta-llama-3/

  6. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS, 2022. https://arxiv.org/abs/2201.11903