Tokenization
The process of breaking text into smaller units (tokens) that language models can process.
Tokenization converts raw text into sequences of tokens — the fundamental units that language models operate on. A token might be a word, a subword, or even a single character, depending on the tokenization algorithm. Common approaches include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.
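To make the BPE idea concrete, here is a minimal sketch of the classic merge loop: start from characters, repeatedly find the most frequent adjacent symbol pair in the corpus, and merge it into a new symbol. The corpus and function names are illustrative, not any specific library's API.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: learn merge rules from a list of words."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Example: frequent character pairs like ('e', 's') get merged first.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(corpus, num_merges=2)
```

On this toy corpus the first two learned merges are `('e', 's')` and then `('es', 't')`, so a word like "newest" would tokenize as `n`, `e`, `w`, `est` — a subword split rather than whole words or raw characters.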
Tokenization directly affects model performance, cost, and multilingual capability. The token count of a prompt determines API cost and how much of the context window it consumes. Tokenizers also vary in how they split different languages, code, and special characters, which can influence model behavior.
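Because billing and context limits are denominated in tokens, a rough back-of-the-envelope estimate is often useful before calling an API. The sketch below uses the common rule of thumb that English text averages about four characters per token under typical BPE vocabularies; the exact ratio, and the price used, are assumptions — real counts require the model's actual tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Heuristic token estimate (~4 chars/token for English prose).

    Actual counts vary by tokenizer and by content: code, non-Latin
    scripts, and rare words usually produce more tokens per character.
    """
    return max(1, round(len(text) / chars_per_token))

def estimate_cost(text: str, price_per_1k_tokens: float) -> float:
    """Estimated cost for the input text at a hypothetical per-1k-token rate."""
    return estimate_tokens(text) * price_per_1k_tokens / 1000.0

prompt = "Summarize the following report in three bullet points."
tokens = estimate_tokens(prompt)
cost = estimate_cost(prompt, price_per_1k_tokens=0.01)  # illustrative rate
```

For production use, prefer counting with the provider's own tokenizer, since heuristics can be off by a large factor on code or multilingual input.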
Understanding tokenization is important for anyone working with LLMs — from engineers optimizing API costs to researchers studying model behavior. Token-level analysis helps debug unexpected model outputs and design more effective prompts.
Related Terms
Large Language Model (LLM)
A neural network trained on massive text datasets that can understand and generate human language.
Transformer
The neural network architecture behind modern LLMs, using self-attention mechanisms to process sequences in parallel.
Natural Language Processing (NLP)
The branch of AI focused on enabling computers to understand, interpret, and generate human language.
Prompt Engineering
The practice of designing and refining inputs to AI models to produce desired outputs.