
Inference

The process of running a trained AI model to generate predictions or outputs from new input data.

Inference is the production phase of the ML lifecycle — using a trained model to process new inputs and generate outputs. While training happens once (or periodically), inference happens continuously as users interact with the model.

Inference optimization is critical for production AI systems. Techniques include model quantization (reducing numerical precision), pruning (removing unnecessary parameters), distillation (training smaller models to mimic larger ones), batching, caching, and hardware acceleration (GPUs, TPUs, specialized chips).
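Quantization is among the simplest of these techniques to illustrate. The sketch below shows symmetric per-tensor int8 quantization with NumPy; the function names and the toy weight tensor are illustrative, not from any particular framework.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantization: map float weights into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

# int8 storage is 4x smaller than float32, at a small accuracy cost:
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
max_error = np.abs(w - dequantize(q, scale)).max()  # bounded by ~0.5 * scale
```

Real systems refine this idea (per-channel scales, calibration data, quantization-aware training), but the core trade of numerical precision for memory and speed is the same.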

Inference infrastructure is a major cost center for AI companies. Engineers who can reduce latency, increase throughput, and lower serving costs — while maintaining model quality — are highly valued. This work spans MLOps, systems engineering, and model optimization.
