Tokenization

The process of breaking text into smaller units (tokens) that language models can process.

Tokenization converts raw text into sequences of tokens — the fundamental units that language models operate on. A token might be a word, a subword, or even a single character, depending on the tokenization algorithm. Common approaches include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.
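The core idea behind BPE can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a single symbol. This is a minimal, illustrative implementation (the function names are our own, and real tokenizers add a learned merge table, byte-level fallbacks, and special tokens), not any production library's code:

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_tokenize(text, num_merges):
    """Greedy BPE sketch: begin with characters, then repeatedly
    merge the most frequent adjacent pair of symbols."""
    tokens = list(text)
    for _ in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        best_pair = counts.most_common(1)[0][0]
        tokens = merge_pair(tokens, best_pair)
    return tokens

print(bpe_tokenize("low lower lowest", 4))
```

After a few merges, frequent character sequences like "low" collapse into single tokens while rarer suffixes stay split — which is why common words usually cost one token and unusual strings cost several.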

Tokenization directly affects model performance, cost, and multilingual capability. The number of tokens in a prompt determines API billing and how much of the context window it consumes. Tokenizers also handle languages, code, and special characters unevenly: a common English word is often a single token, while the same word in a less-represented language may split into several, inflating cost and shrinking effective context.
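Because billing is typically quoted per thousand (or million) tokens, estimating cost from token counts is simple arithmetic. A minimal sketch, where the per-1k-token prices and the function name are hypothetical placeholders rather than any provider's actual rates:

```python
def estimate_cost(prompt_tokens, output_tokens,
                  price_per_1k_prompt=0.003, price_per_1k_output=0.015):
    """Estimate a request's cost in dollars from its token counts.
    The default prices are illustrative placeholders only."""
    return (prompt_tokens / 1000) * price_per_1k_prompt \
         + (output_tokens / 1000) * price_per_1k_output

# A 2,000-token prompt with a 500-token completion at these example rates:
print(round(estimate_cost(2000, 500), 4))
```

The same arithmetic explains why trimming boilerplate from prompts, or choosing a tokenizer-friendly phrasing, translates directly into lower spend at scale.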

Understanding tokenization is important for anyone working with LLMs — from engineers optimizing API costs to researchers studying model behavior. Token-level analysis helps debug unexpected model outputs and design more effective prompts.
