What is it?
Inference is what an AI model does the moment you use it. After the training phase, in which the model has learned, comes the inference phase: the model receives new data and generates output based on what it has learned.
Every time you ask a question to ChatGPT, have a document summarised, or let an AI agent execute a task, inference takes place. Inference is the active use of the model, as opposed to training, which happens once or periodically.
Why it matters for SMEs
For SMEs, inference is the phase that is directly visible in cost and speed. The more efficiently a model runs inference, the faster and cheaper your AI applications operate.
- Every API call to a language model is an inference call: the cost per use, the latency, and the scalability of your AI solution are directly tied to how inference is structured.
- The choice between models is partly about inference cost: a smaller model that infers quickly and cheaply can offer better economics for routine tasks than a large model.
- At high volume, such as processing thousands of documents, inference speed determines whether a process is practically feasible or not.
Understanding what inference is helps when comparing AI services on price and speed, and when building scalable workflows.
How it works
During inference, the model processes the input through its learned parameters and generates output step by step. For language models, this means predicting the most likely text token by token. This process runs on the provider's servers or, for smaller models, locally.
- Receive input: the prompt, document, or data is passed to the model.
- Processing via parameters: the model processes the input through its layers of learned weights.
- Token predictions: for language models, the model generates the answer token by token.
- Return output: the result is sent back to the calling application.
- Cost and latency: the size of the model and the number of tokens determine how fast and expensive the inference is.
Inference is essentially stateless: each request is handled independently. Memory and context for longer conversations are managed externally, not inside the model itself.
Example in practice
Picture a staffing agency processing hundreds of CVs each day through an AI system that automatically highlights relevant experience and skills. Each time the system processes a CV, the model runs inference: it reads the text, applies its learned knowledge, and generates a structured summary. At one hundred CVs per day, that is one hundred inference calls; at one thousand, it is ten times the cost and ten times the processing time, unless the system is built to handle that volume.
Comparison and misconceptions
Training is the learning process in which the model sets its parameters based on data: it happens once or periodically and requires significant compute. Inference is the use of the trained model on new data: it happens with every call and is considerably cheaper and faster than training.

