Concept

Inference and inference hosting.

Training builds the model once. Inference is every time you actually use it, and hosting is the infrastructure that keeps it fast.

The short version

A model's life has two phases. Training is the one-time, expensive process of building it. Inference is everything after: actually using the trained model to get outputs. When you send a prompt and read the reply, that is inference.

Inference happens millions of times where training happened once, so making it fast and affordable is its own engineering problem. That is what inference hosting solves.

Inference is the "using" phase

Once training is done, the weights are frozen. Inference feeds your prompt through those fixed weights to generate a response. The model is not learning during inference; it is applying what it already learned. Each request is independent, which is why a base model does not remember your last conversation unless that history is fed back in.

Why inference needs specialized hardware

Generating each token means multiplying through billions of parameters. Doing that fast enough to feel instant takes GPUs or similar accelerators with lots of fast memory. A large model can require multiple high-end GPUs just to hold its weights, before serving a single user.

What a hosting layer actually does

Inference hosting is the infrastructure that runs models for many users at once. It loads the weights onto GPUs, batches requests together for efficiency, streams tokens back as they are produced, and scales capacity up and down with demand. Good hosting is the difference between a snappy assistant and one that stalls.

Why it costs what it costs

GPU time is expensive, and a bigger model or a longer conversation means more computation per reply. That is why providers price by tokens and why very long contexts cost more: every token in the prompt and the response has to be processed through the whole network.

An analogy

Training is writing and printing a textbook: slow, costly, done once. Inference is a student opening that book to answer a question, which happens constantly. Inference hosting is the well-run library that keeps thousands of students reading at the same time without a queue.

Where Berges AI fits

When you chat with Berges AI, each message is an inference request routed to a hosted model. Part of what Berges AI does is pick and serve the right model for the job so responses stay fast without you thinking about any of the infrastructure.

Try Berges AI
Keep going

Related concepts

Questions

Things people ask.

What is the difference between training and inference?

Training builds the model by adjusting its parameters from data, once and at great cost. Inference uses the finished model to answer, over and over. Training teaches; inference applies.

Why do longer prompts cost more?

Every token in your prompt and the reply is processed through the entire network. More tokens means more computation, so longer contexts and longer answers take more time and money.

Does the model learn from my messages during inference?

No. During inference the weights are frozen. It can use your conversation as context within a session, but it is not updating itself from what you say.