Logits, Temperature, and Top-P in LLM Token Selection

In this article, you will learn how logits, temperature, and top-p sampling work together to control next-token prediction in large language models.

Topics covered include:

What logits are and how they are produced by a transformer’s final linear layer.
How temperature and top-p (nucleus sampling) shape the probability distribution used for token selection.
How these three components fit into a sequential pipeline that governs LLM output generation.

The Statistics of Token Selection: Logits, Temperature, and Top-P Walkthrough

Introduction

When large language models (LLMs) produce outputs, several criteria are at stake, including overall response relevance, coherence, and creativity. Since models operate by building their response token by token, capturing these desirable properties is a matter of mathematically adjusting the output probability distributions that govern the next-token prediction process.

This article introduces the mechanics behind LLM decoding strategies from a statistical vantage point. In particular, we will explore how raw model scores, known as logits, interact with two decoding settings, temperature and top-p. Together, these three components control the token selection process.

Although we focus on the final stages of the transformer architecture, this article also gives a concise overview of the whole process and the journey tokens make from beginning to end.

Token selection process in LLMs

What Are Logits?

In neural networks, logits are the raw, unnormalized scores produced by the final linear layer before they are converted into probabilities. Logits have been used since the era of classical machine learning models like softmax regression, and the same principle applies to the final linear layer of transformer models. This final layer processes hidden states, which contain gradually accumulated linguistic knowledge about the input text, and outputs a vector of logits, one for each token in the model’s vocabulary.

For example, if an LLM trained for English-to-Spanish translation is predicting the next word after “me gusta mucho,” it might output a raw logit score of 12.5 for “viajar” (travel), 8.2 for “jugar” (play), and -3.1 for “dormir” (sleep). These raw values are unbounded and difficult to interpret directly, so a softmax function is applied on top of the final linear layer to transform them into a standard probability distribution over vocabulary tokens, such that all values sum to 1.
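The softmax step can be sketched in a few lines of Python. This is a minimal illustration that assumes a toy vocabulary of only the three tokens from the example above; a real model would apply the same computation over its full vocabulary. The function name `softmax` is chosen here for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow in exp() without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the example above, over an assumed three-token vocabulary.
tokens = ["viajar", "jugar", "dormir"]
logits = [12.5, 8.2, -3.1]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token}: {p:.4f}")
```

Because softmax exponentiates the logits, the gap of 4.3 between "viajar" and "jugar" turns into a large gap in probability: in this toy setting "viajar" receives more than 98% of the probability mass.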

What Are Temperature and Top-P?

Once we have a probability distribution over the target vocabulary, LLMs typically do not simply choose the highest-probability token every time. Instead, the next token is sampled from the distribution, and how this sampling works depends on several decoding parameters. Two of the most important are temperature and top-p.

Temperature is a scaling factor applied to the logits before the softmax step: each logit is divided by the temperature value. A high temperature (e.g., above 1) flattens the resulting probabilities, making them more uniform, which increases unpredictability and causes the model to behave more creatively. A low temperature (e.g., well below 1) sharpens the differences between high- and low-probability tokens, strongly favoring the most likely tokens.
Top-p, also called nucleus sampling, controls randomness by limiting the pool of candidate tokens rather than scaling probabilities. While similar strategies like top-k consider only the k highest-probability tokens, top-p identifies the smallest set of tokens whose cumulative probability meets or exceeds a threshold p, making it more adaptive and flexible. For example, setting p=0.9 causes the model to sort tokens by probability and keep adding them to the candidate pool until their cumulative probability reaches 0.9.
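The top-p rule described above can be sketched as follows. This is an illustrative sketch, not any particular library's implementation: the function name `top_p_filter` and the probability values are hypothetical, and renormalizing the surviving probabilities so they sum to 1 again is an assumption about the step that precedes sampling.

```python
def top_p_filter(probs, p):
    """Return the smallest set of tokens whose cumulative probability reaches p.

    `probs` maps token -> probability. The result is renormalized to sum to 1.
    """
    # Sort tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # threshold reached: stop adding tokens
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

# Hypothetical distribution, for illustration only.
probs = {"viajar": 0.50, "jugar": 0.30, "leer": 0.12, "cantar": 0.05, "dormir": 0.03}
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

With p=0.9, the cumulative sums are 0.50, 0.80, and 0.92, so the pool stops growing after the third token and the two least likely tokens are excluded. A top-k rule with k=3 would give the same pool here, but unlike top-p it would keep three tokens even when one token already dominates the distribution.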

The Full Walkthrough: How These Concepts Fit Together

Logit-to-probability calculation, temperature, and top-p combine into a sequential multi-step pipeline for producing next-token predictions.

First, the model generates raw logits for all possible tokens. Temperature then scales these raw logits — note that this happens before the softmax function converts them into probabilities. Depending on the temperature value, the resulting distribution will appear more uniform (high temperature) or sharper (low temperature).
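The effect of temperature on the distribution can be seen in a small sketch. It assumes the usual formulation in which each logit is divided by the temperature before softmax; the function name `softmax_with_temperature` and the logit values are hypothetical, and rejecting non-positive temperatures is a choice made for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the largest scaled logit for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

For these logits, the most likely token holds roughly 87% of the probability at a temperature of 0.5, about 67% at 1.0, and about 51% at 2.0, which is the sharpening and flattening behavior described above.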

Token selection walkthrough based on logits, temperature, and top-p

Once the scaled logits are converted into probabilities, top-p is applied to filter the resulting distribution, calculating cumulative probabilities to retain only a core nucleus pool of the most likely tokens. Finally, the model samples randomly from within that pool to select the next token.
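Put together, the pipeline might look like the sketch below. It is a simplified illustration under stated assumptions rather than how any specific inference library implements decoding: the function name `sample_next_token`, its parameters, and the three-token vocabulary are all hypothetical, and production systems operate on tensors over the full vocabulary.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token from a {token: logit} mapping.

    Steps: temperature scaling -> softmax -> top-p filtering -> random draw.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # 1. Scale logits by the temperature, then 2. apply a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # 3. Keep the smallest high-probability set whose cumulative mass reaches top_p.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 4. Draw from the nucleus; random.choices treats the weights as relative,
    #    so explicit renormalization is not needed here.
    tokens = [tok for tok, _ in nucleus]
    weights = [prob for _, prob in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Logits from the earlier example, over an assumed three-token vocabulary.
logits = {"viajar": 12.5, "jugar": 8.2, "dormir": -3.1}
rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
print([sample_next_token(logits, temperature=0.8, top_p=0.95, rng=rng) for _ in range(5)])
```

With these logits, a temperature of 0.8 leaves "viajar" with more than 99% of the probability mass, so the nucleus at p=0.95 contains only that token and every draw returns it. Raising the temperature flattens the distribution enough for "jugar" to enter the pool as well.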

Closing Remarks

Understanding the statistical process behind token selection helps inform practical decisions about how to configure these parameters. For factual, high-stakes scenarios like coding or legal analysis, a low temperature and a stricter top-p are advisable (for example, t=0.1 and p=0.5), which yields nearly deterministic model responses. For creative domains like poetry generation or brainstorming, a higher temperature and top-p, such as t=0.8 and p=0.95, allow for a richer variety of candidate tokens in the selection pool. Choosing the right balance between predictability and creativity for your use case is the key practical takeaway from understanding how these three mechanisms work together.