
The Fragility Problem: Why AI Visibility Is Unstable

Paraphrasing a prompt can shift brand recommendations by 100%. Cold start biases lock in early favorites. Research shows AI visibility is far more fragile than search rankings ever were.

Mack Grenfell
April 11, 2026
10 min read
The Science Behind AI Visibility: Part 4 of 4

Everything in the first three parts of this series might lead you to think that AI visibility is a puzzle to solve. Understand the biases, optimize your content, track across models - and you're set.

The final piece of the puzzle is less comfortable: AI visibility is inherently fragile. Even when you do everything right, your position can shift dramatically from forces entirely outside your control.

The 100% difference

The most striking finding in the entire literature comes from a paper with the evocative title "Sales Whisperer." The researchers tested what happens when you paraphrase a product recommendation prompt: the same question, with the same intent, asked in different words.

Sales Whisperer: A Human-Inconspicuous Attack on LLM Brand Recommendations

Simply paraphrasing a prompt - synonym-level word substitutions that preserve the original meaning - can cause up to 100% difference in which brands get mentioned. The perturbations are invisible to human users.

Carnegie Mellon / CHI 2025
100% difference in brand recommendations achievable through synonym-level prompt paraphrasing - same intent, different words, entirely different brands mentioned.
Sales Whisperer, 2024

Let that sink in. A user asking "What's the best project management tool?" and another user asking "Which project management software would you recommend?" - functionally identical questions - can receive completely different brand recommendations. Not different rankings. Different brands.

This isn't about clever prompt engineering or adversarial attacks. It's about the basic mechanics of how language models process text. Small changes in input tokens cascade through the model's attention layers and can tip the final recommendation in a completely different direction.
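One way to see how fragile this makes visibility measurement is to quantify the overlap between the brand sets returned for paraphrased prompts. The sketch below is illustrative, not the paper's methodology: the prompt-to-response mapping is hardcoded with hypothetical brand lists, and the overlap metric (Jaccard similarity) is one reasonable choice among several.

```python
# Hypothetical sketch: quantify prompt sensitivity by comparing the brand
# sets an assistant returns for paraphrased versions of the same question.
# The brand lists below are made up for illustration, not real model output.

def recommendation_overlap(brands_a: list[str], brands_b: list[str]) -> float:
    """Jaccard similarity between two recommendation sets (1.0 = identical)."""
    set_a, set_b = set(brands_a), set(brands_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Two functionally identical prompts, paired with hypothetical responses.
paraphrases = {
    "What's the best project management tool?":
        ["Asana", "Trello", "Monday"],
    "Which project management software would you recommend?":
        ["ClickUp", "Notion", "Monday"],
}

responses = list(paraphrases.values())
overlap = recommendation_overlap(responses[0], responses[1])
print(f"Brand overlap across paraphrases: {overlap:.2f}")  # 0.20 here
```

An overlap near 1.0 would mean the phrasing barely matters; values near zero are the "100% difference" regime the paper describes. Running many paraphrases and averaging pairwise overlap gives a simple sensitivity score for a prompt family.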

Cold start lock-in

Prompt sensitivity is about what happens within a single query. But there's a separate instability that operates across queries: the cold start problem.

Revealing Potential Biases in LLM-Based Recommender Systems in the Cold Start Setting

Without user context, LLMs default to 91.3% Western content. Non-linear relationship between model size and bias. Larger models don't necessarily reduce cold start bias - in some cases they amplify it.

Georgia Tech / RecSys 2025

When an LLM has no user context - no conversation history, no stated preferences - it falls back on its training data biases. The research shows those defaults skew to 91.3% Western content, strongly favoring established brands in major markets. If you're a local brand, a new entrant, or operating outside the US/UK/EU, you start with a structural disadvantage.

The relationship between model size and this bias is non-linear, which is the polite way of saying "bigger models don't fix this." In some configurations, larger models actually amplify cold start biases rather than reducing them.

Conversation drift

Beyond single queries and cold starts, there's a third instability: what happens during multi-turn conversations. And it's arguably the most important one, because AI interactions are increasingly conversational rather than one-shot.

Large Language Models Develop Novel Social Biases Through Adaptive Exploration

LLMs develop new biases through multi-turn interaction that weren't present in their training data. Newer and larger models show increased stratification over the course of conversations.

Princeton, 2025

Through multi-turn conversations, LLMs develop new biases that didn't exist in their training data. They don't just reproduce learned preferences - they generate novel ones through the interaction process itself. And the effect gets stronger with newer, larger models.

This connects to another finding: once an LLM "commits" to recommending a brand within a conversation, it becomes increasingly resistant to changing its mind.

Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation

Once an LLM selects a brand, it systematically inflates positive assessments of that choice and downplays alternatives. The model becomes its own echo chamber within a single conversation.

South China University of Technology, 2025

Choice supportive bias in LLMs

In human psychology, choice supportive bias is the tendency to retroactively attribute positive qualities to a decision you've already made. LLMs exhibit the same pattern: once they recommend a brand, subsequent responses in the conversation inflate the brand's strengths and diminish its weaknesses. First mention becomes self-reinforcing.

Why continuous tracking matters

Put these instability mechanisms together and the picture is clear:

  • Prompt sensitivity means your visibility varies with how users phrase their questions - and you can't control that.
  • Cold start bias means new users see a systematically skewed view of the market - and it's hard to break in.
  • Conversation drift means that even within a session, the model's preferences evolve - and initial recommendations become self-reinforcing.
  • Model updates mean that a training data refresh or alignment change can shift your visibility overnight - and you won't know unless you're watching.

This is qualitatively different from search rankings, which change gradually and visibly. AI visibility can shift without warning, without explanation, and without any change on your end. A model update, a competitor's content improvement, or even a change in how users phrase their questions can move the needle.
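In practice, "watching" means keeping a time series of visibility and alerting on abrupt shifts. The sketch below is a minimal, assumed design: `detect_shifts` and the daily figures are hypothetical, and a real system would track per model, per prompt set, with statistical noise handling rather than a fixed threshold.

```python
# Hypothetical sketch of continuous tracking: store a daily "visibility
# share" (fraction of tracked prompts that mention the brand) and flag any
# day-over-day shift beyond a threshold. All data below is made up.

def detect_shifts(series: list[float], threshold: float = 0.15) -> list[int]:
    """Return indices where visibility moved more than `threshold` in one step."""
    return [
        i for i in range(1, len(series))
        if abs(series[i] - series[i - 1]) > threshold
    ]

# Illustrative daily visibility for one brand on one model.
daily_visibility = [0.42, 0.44, 0.41, 0.22, 0.24]  # day 3: abrupt drop

for day in detect_shifts(daily_visibility):
    print(f"Day {day}: visibility moved "
          f"{daily_visibility[day] - daily_visibility[day - 1]:+.2f}")
```

Here the day-3 drop from 0.41 to 0.22 trips the alert - the kind of overnight shift a model update can cause, which you'd never see without a baseline to compare against.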

The series takeaway

Across four posts, the academic research tells a consistent story:

AI recommendations are biased in systematic, measurable ways. Different models have different favorites, shaped by their training data and alignment processes. There are evidence-based tactics for improving visibility - but the results are fragile, subject to prompt sensitivity, cold start effects, and conversation drift.

The implication isn't that optimization is futile. It's that optimization is necessary but insufficient. You also need to measure, continuously, across models and over time. The brands that treat AI visibility as a dynamic, ongoing discipline rather than a one-time optimization will be the ones that maintain their position as the landscape continues to shift.

Mack Grenfell
Founder

Founder of Trakkr. Previously built Byword, one of the most widely-used AI writing tools. Writes about AI visibility, brand strategy, and the shifting landscape of search.

