Have you ever entered the same prompt into an AI tool, under the same context and settings, only to get two different answers? That’s because large language models (LLMs) are not inherently deterministic, and that can be a serious challenge in high-stakes environments.
As more of us build tools on top of LLMs, it’s tempting to assume that with the right settings, these models will behave predictably. In most systems, if you provide the same input and configure things properly, you expect the same output: a property known as determinism, where a system’s output is fully determined by its input and configuration, with no randomness involved.
But LLMs don’t quite work that way. They are stochastic by design, meaning they sample from a probability distribution to decide each word they generate, introducing inherent randomness into their behavior. As I’ve learned while building an AI tool for professionals in high-stakes contexts, this distinction between deterministic and stochastic behavior matters more than you might think.
What we mean by deterministic and stochastic
Although the concept of determinism is straightforward (same input, same output), achieving it with LLMs is complicated. LLMs are designed to be stochastic by default, meaning they generate each word by sampling from a range of possible next words, weighted by probability. Even when you feed them the same prompt twice, the model may pick different words because of the probabilistic nature of its design, which makes the output unpredictable.
You can try to reduce this randomness by adjusting certain parameters that control how much of the probability distribution the model explores. For example (a toy sampling sketch follows the list):
- Temperature flattens or sharpens the distribution, making the model more or less adventurous in its choices
- Top-p (nucleus sampling) limits the model to only the smallest set of words whose combined probability is at least p
- Top-k restricts the model to choosing from just the k most likely next words
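To make these knobs concrete, here is a toy, framework-free sketch of how temperature, top-k, and top-p reshape a next-token distribution before sampling. The function name and example logits are illustrative only, not any particular vendor’s implementation.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Toy next-token sampler illustrating temperature, top-k, and top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature 0 is treated as greedy (argmax) decoding: no randomness at all.
    if temperature == 0:
        return int(np.argmax(logits))

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-k: zero out everything except the k most likely tokens.
    if top_k > 0:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-p (nucleus): keep the smallest set whose cumulative probability >= p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()
    # The random draw below is where nondeterminism enters: identical inputs
    # can still yield different tokens across calls.
    return int(rng.choice(len(probs), p=probs))

# Example: with the same logits, repeated calls can pick different tokens.
logits = [2.0, 1.5, 0.3, -1.0]
print([sample_next_token(logits, temperature=0.8, top_p=0.9) for _ in range(5)])
```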
In theory, pushing these parameters to their most restrictive settings (temperature = 0, top-k = 1, and a top-p low enough to keep only the single most likely token) should make the model’s outputs more predictable. In practice, I found that this still doesn’t completely eliminate variability, and it constrains the model’s ability to use its full capabilities, trading richness of response for predictability.
This raises a bigger question: is the trade-off between reliability and sophistication one you’re willing to accept?
Why this came up for me and why it might matter to you
- If predictability comes at the cost of nuance and depth, is that a trade-off you can afford?
- Will your business case hold if every interaction needs fine-tuning?
- Can you trust the model to handle new, unforeseen contexts, or even know if its answers are correct?
- Or does chasing reliability risk undermining the very promise of AI?
These were the questions I faced while building an AI tool for professionals who are accountable for every decision they make. They operate in environments where every recommendation, flagged risk, or policy interpretation must be justified, documented, and defensible even months later. For them, unpredictability erodes trust. If identical inputs produce inconsistent outputs, how can they rely on the tool to support their judgment?
That’s what drove me to test just how deterministic today’s LLMs really are.
What I tested
I began with the obvious (sketched in code after the list):
- temperature set to zero
- fixed random seeds at all levels
- normalized input formatting
- pinned model and tokenizer versions
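Here is a minimal sketch of what those controls can look like, assuming a Hugging Face transformers stack. The model name, revision hash, and prompt are placeholders rather than the actual tool I built.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)  # seeds Python's random, NumPy, and torch in one call

# Pin model and tokenizer to an exact revision so every run loads identical weights.
MODEL = "my-org/my-model"   # placeholder
REVISION = "abc123"         # placeholder commit hash
tokenizer = AutoTokenizer.from_pretrained(MODEL, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL, revision=REVISION)
model.eval()

# Normalize input formatting (collapse stray whitespace) before tokenizing.
prompt = " ".join("Summarize the attached policy in one paragraph.".split())
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False is greedy decoding, the practical equivalent of temperature = 0.
with torch.no_grad():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```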
When that didn’t fully eliminate drift, I dug deeper (see the sketch after this list):
- enforced deterministic GPU kernels
- ran on CPU with full-precision inference to avoid GPU nondeterminism
- controlled for hardware differences across runs
- restricted top-k to 1 (with top-p left at 1, where it has no further effect) to force the model to select the most probable token at each step
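A sketch of what the lower-level controls can look like, assuming a PyTorch backend; the exact flags vary by library version and hardware, so treat this as a starting point rather than a guaranteed recipe.

```python
import os
import torch

# Must be set before CUDA kernels launch for deterministic cuBLAS behavior.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Error out if an operation has no deterministic implementation, rather than
# silently falling back to a nondeterministic kernel.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Or sidestep GPU nondeterminism entirely: run on CPU in full precision.
# model = model.to(device="cpu", dtype=torch.float32)  # model from the earlier sketch

# With sampling enabled, top_k=1 also pins each step to the single most likely token:
# output = model.generate(**inputs, do_sample=True, top_k=1)
```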
These steps helped, but inconsistencies still appeared, especially on longer, more realistic prompts.
What I found in the research
I looked for papers to help explain what I was seeing. The best I found was this one: link, which documented similar variability even with all known controls applied. The authors suggested this might be a consequence of how models are trained and deployed, not just a question of runtime configuration.
What surprised me was how little else has been written about this, which makes it all the more important for practitioners to test and validate the behavior of their own tools.
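One simple way to do that is a repeatability check against your own stack. In the sketch below, `generate_answer` is a placeholder for whatever call your tool actually makes (hosted API, local model, etc.).

```python
from collections import Counter
from typing import Callable

def repeatability_report(generate_answer: Callable[[str], str],
                         prompt: str, runs: int = 20) -> Counter:
    """Call the model repeatedly with identical input and tally distinct outputs."""
    outputs = Counter(generate_answer(prompt) for _ in range(runs))
    print(f"{runs} runs produced {len(outputs)} distinct output(s)")
    for text, count in outputs.most_common():
        preview = text[:80].replace("\n", " ")
        print(f"  {count:>3}x  {preview}")
    return outputs

# Hypothetical usage, assuming a client with a complete() method:
# repeatability_report(lambda p: client.complete(p, temperature=0), "Summarize policy X")
```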
Why it might matter more than we think
Not every application needs determinism. Many use cases benefit from variability; it makes outputs feel more natural and creative.
Stochastic systems are commonplace. Think of:
- Finance: stock prices, investment returns
- Biology: protein fluctuations, population dynamics
- Computer science: network traffic, algorithm analysis
But for some domains (regulated industries, documentation, compliance, audit, and even healthcare), variability can create risks and headaches. For example, if a model diagnosing a patient suggests a condition in one generation and rules it out in another, trust quickly erodes. Even if it hasn’t been a priority for much of the field, it’s worth understanding and addressing when designing AI tools for these contexts.
Open questions
So far I’ve been able to mitigate variability but not eliminate it completely.
That raises a few open questions:
- What are the implications of non-deterministic models once they are deployed in real-world, high-stakes scenarios like healthcare or fraud detection?
- Should future models explicitly support deterministic modes, and what would it take to achieve them?
- What mitigation strategies can practitioners apply to improve consistency in critical applications?
If you’ve explored this yourself or have ideas on tackling it, I’d love to compare notes.
July 2025