Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Northeastern University · Stanford University · West Virginia University
Equal contribution
Figure 1 & Demo: Overview of Verbalized Sampling (VS) for unlocking LLM diversity. Demo video by Qihui Fan.

Abstract

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar, fluent text, consistent with well-established findings in cognitive psychology. We formalize this bias theoretically, verify it empirically on preference datasets, and show that it plays a central role in mode collapse.

Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting method to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy or safety. For instance, in creative writing, VS increases diversity by 1.6-2.1× over direct prompting. We further observe an emergent trend: more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.

Make Your LLM Output More Diverse With Verbalized Sampling

Run Verbalized Sampling and unlock diverse LLM generations in seconds. Just install and use our open-source package!

Check our GitHub for more details.

$ pip install verbalized-sampling
$ verbalize run \
    --task joke \
    --prompt "Tell me a joke about coffee." \
    --model "gpt-4.1" \
    --methods "direct vs_standard" \
    --num-responses 10 \
    --metrics "diversity joke_quality"

Why Does Mode Collapse Happen?
Typicality Bias

Cognitive psychology shows that people prefer text that is familiar, fluent, and predictable. Using base-model log-likelihood as a proxy for typicality, we verify this empirically on multiple preference datasets and base models, confirming that typicality bias exists: the human-preferred response tends to be the more typical one (see Figure 2).

During RLHF-style post-training, this bias sharpens the model's probability distribution towards a few stereotypical completions. When many high-quality completions are possible (e.g., in story generation), typicality becomes the tie-breaker, and the distribution collapses onto a few modes.
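For intuition, here is a sketch of that sharpening effect under the standard KL-regularized RLHF objective; the decomposition of the reward into a true-quality term plus an α-weighted typicality term paraphrases the paper's formalization (α is the bias strength, β the KL coefficient):

% Typicality-biased reward: true quality plus an alpha-weighted typicality term.
%   r(x, y) = r_{\mathrm{true}}(x, y) + \alpha \log \pi_{\mathrm{ref}}(y \mid x)
% The KL-regularized RLHF objective
%   \max_{\pi} \; \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
% has the well-known optimum \pi^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp(r(x, y)/\beta).
% Substituting the biased reward:
\[
  \pi^{*}(y \mid x) \;\propto\;
  \pi_{\mathrm{ref}}(y \mid x)^{\,1 + \alpha/\beta} \,
  \exp\!\bigl( r_{\mathrm{true}}(x, y) / \beta \bigr).
\]
% For \alpha > 0 the exponent exceeds 1, so among completions with equal
% r_true, probability mass is pulled toward the most typical ones:
% mode collapse as a tie-breaking effect.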

Cognitive Bias and Typicality in Preference Data

Figure 2: How often the human-preferred response in a preference pair is assigned a higher log likelihood by a base model.
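To make the measurement behind Figure 2 concrete, here is a minimal sketch (not the paper's evaluation code): it scores each response of a preference pair by its total log-likelihood under a base model via Hugging Face transformers and counts how often the human-chosen response scores higher. The model name and the toy pair below are placeholders.

# Minimal sketch: does a base model assign higher log-likelihood to the
# human-preferred response in a preference pair? Illustrative only; the
# model and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; stands in for a larger base (pre-RLHF) model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def response_loglik(prompt: str, response: str) -> float:
    """Total log-probability of `response` conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict token t+1, so shift by one.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.numel()), targets]
    # Keep only the tokens belonging to the response.
    return token_lp[prompt_ids.shape[1] - 1 :].sum().item()

# Toy preference pair: (prompt, human-chosen, human-rejected).
pairs = [
    ("Write an opening line for a story about the sea.",
     "The sea was calm that morning, and the boats went out early.",  # familiar
     "Brine-light stitched the gulls' shadows to the hulls."),        # atypical
]
wins = sum(
    response_loglik(p, chosen) > response_loglik(p, rejected)
    for p, chosen, rejected in pairs
)
print(f"Base model prefers the human-chosen response in {wins}/{len(pairs)} pairs")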

How to Mitigate Mode Collapse?
Verbalized Sampling

Qualitative Examples Across Multiple Tasks

Figure 3: Qualitative and quantitative examples of Verbalized Sampling on creative writing, dialogue simulation, and enumerative open-ended QA.

Motivated by this theoretical understanding of mode collapse, we propose Verbalized Sampling (VS). Through comprehensive experiments across multiple tasks, we demonstrate that VS significantly improves the diversity-quality trade-off across model families without compromising factual accuracy or safety.

  • For story writing, VS substantially improves output diversity (1.6-2.1× over direct prompting on creative-writing tasks).
  • For dialogue simulation, VS simulates a donation-amount distribution much closer to the human distribution and generates more realistic persuasion behaviors.
  • For enumerative open-ended QA, we ask the model to generate US states. The verbalized probability distribution produced by VS, averaged over 10 trials, closely aligns with the reference pretraining distribution (queried from RedPajama). In contrast, direct prompting collapses onto a few modes, repeatedly outputting states like California and Texas. A minimal sketch of this comparison follows the list.
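The sketch below makes stated assumptions: VS outputs are parsed from the <response>/<text>/<probability> tag format used in the prompt further down this page; verbalized probabilities are averaged across trials and renormalized; and scipy's Jensen-Shannon distance stands in for the paper's divergence metric. The trial outputs and reference counts are illustrative placeholders, not actual RedPajama statistics.

# Minimal sketch: aggregate verbalized probabilities over trials and compare
# them to a reference distribution. Tag format follows the VS prompt; the
# reference counts are placeholders, NOT actual RedPajama statistics.
import re
from collections import defaultdict

from scipy.spatial.distance import jensenshannon

def parse_vs_output(text: str) -> dict[str, float]:
    """Extract {answer: probability} from <response><text>...<probability>... tags."""
    pat = re.compile(
        r"<response>.*?<text>(.*?)</text>.*?<probability>([\d.]+)</probability>",
        re.DOTALL,
    )
    return {t.strip(): float(p) for t, p in pat.findall(text)}

def averaged_distribution(trials: list[dict[str, float]]) -> dict[str, float]:
    """Average verbalized probabilities across trials, then renormalize."""
    acc = defaultdict(float)
    for trial in trials:
        for answer, prob in trial.items():
            acc[answer] += prob / len(trials)
    total = sum(acc.values())
    return {a: p / total for a, p in acc.items()}

# Placeholder trials and reference counts (illustrative values only).
trials = [
    {"California": 0.20, "Texas": 0.18, "Ohio": 0.10, "Vermont": 0.05},
    {"California": 0.22, "New York": 0.15, "Ohio": 0.08, "Vermont": 0.06},
]
reference_counts = {"California": 900, "Texas": 800, "New York": 700,
                    "Ohio": 400, "Vermont": 100}

vs_dist = averaged_distribution(trials)
support = sorted(set(vs_dist) | set(reference_counts))
ref_total = sum(reference_counts.values())
p = [vs_dist.get(s, 0.0) for s in support]
q = [reference_counts.get(s, 0) / ref_total for s in support]
print("Jensen-Shannon distance to reference:", jensenshannon(p, q))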

What Else We Discovered: Emergent Trends

We also observe an emergent trend where larger models benefit more from VS. Figure 5 shows the diversity gain over direct prompting, which suffers from mode collapse. Across all VS variants, larger models (GPT-4.1, Gemini-2.5-Pro) achieve diversity gains 1.5 to 2 times greater than smaller models (GPT-4.1-Mini, Gemini-2.5-Flash).

Emergent Trend: Larger Models Benefit More from VS

Figure 5: Emergent trend in which larger models benefit more from VS. Panels (e) and (f) show differences in diversity and quality, respectively, over Direct prompting for small and large models.

How to Maximize Diversity: Probability Tuning

Unlike baseline methods, Verbalized Sampling allows us to tune the output diversity by adjusting the probability threshold directly in the prompt (e.g., "Generate five responses with probabilities below <threshold>"), without altering decoding parameters. As shown in Figure 6, diversity increases as the probability threshold decreases.
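As a sketch of what this looks like in practice, the hypothetical helper below (not part of the released package's API) templates a threshold into a VS-style prompt; lowering the threshold steers the model away from its modal responses.

# Hypothetical helper (not the released package's API): template a probability
# threshold into a VS-style prompt. Lower thresholds push the model away from
# its modal responses, increasing diversity.
def vs_prompt_with_threshold(query: str, k: int = 5, threshold: float = 0.10) -> str:
    return (
        f"Generate {k} responses to the user query, each within a separate "
        "<response> tag. Each <response> tag must include a <text> and a "
        "numeric <probability>. "
        f"Only sample responses whose probability is below {threshold}.\n\n"
        f"<user_query>{query}</user_query>"
    )

# Sweep thresholds to trade diversity against typicality.
for t in (0.5, 0.2, 0.05):
    print(vs_prompt_with_threshold("Tell me a joke about coffee.", threshold=t))
    print("---")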

Probability Tuning for Maximum Diversity

Figure 6: Tunable diversity. Diversity-tuning results on Gemini-2.5-Flash across tasks.

Try It Yourself: The Magic Prompt

Verbalized Sampling provides a training-free, model-agnostic approach to mitigating mode collapse by prompting the model to generate response distributions with verbalized probability estimates.

Check our Paper for more details on the prompt.
Generate 10 responses to the user query, each within a separate <response> tag.
Each <response> tag must include a <text> and a numeric <probability>.
Randomly sample the responses from the full distribution.

<user_query>Write a 100-word story about a bear.</user_query>
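To run the prompt programmatically, here is a minimal sketch using the OpenAI Python client; any chat API would work. The model name and the regex-based parsing are our illustrative choices, not part of the released package.

# Minimal sketch: send the VS prompt to a chat model and parse the verbalized
# distribution. Uses the OpenAI Python client; model name and parsing are
# illustrative choices, not part of the released package.
import re
from openai import OpenAI

VS_PROMPT = """Generate 10 responses to the user query, each within a separate <response> tag.
Each <response> tag must include a <text> and a numeric <probability>.
Randomly sample the responses from the full distribution.

<user_query>Write a 100-word story about a bear.</user_query>"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": VS_PROMPT}],
)
output = completion.choices[0].message.content

# Pull out (text, probability) pairs from the tagged output.
pairs = re.findall(
    r"<response>.*?<text>(.*?)</text>.*?<probability>([\d.]+)</probability>",
    output,
    re.DOTALL,
)
for text, prob in pairs:
    print(f"[p={float(prob):.2f}] {text.strip()[:80]}")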

📌 BibTeX Citation

If you find our project useful, please consider citing:

@misc{zhang2025verbalizedsamplingmitigatemode,
      title={Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity}, 
      author={Jiayi Zhang and Simon Yu and Derek Chong and Anthony Sicilia and Michael R. Tomz and Christopher D. Manning and Weiyan Shi},
      year={2025},
      eprint={2510.01171},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.01171}, 
}