Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
Abstract
Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text, consistent with well-established findings in cognitive psychology. We formalize this bias theoretically, verify it empirically on preference datasets, and show that it plays a central role in mode collapse.
Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting method to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy or safety. For instance, in creative writing, VS increases diversity by 1.6-2.1× over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
Make Your LLM Output More Diverse With Verbalized Sampling
Run Verbalized Sampling and unlock diverse LLM generations in seconds. Just install and use our open-source package!
Check our GitHub for more details.
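If you want to see the idea without the package, here is a minimal hand-rolled sketch, assuming an OpenAI-compatible Python client (the model name is a placeholder; any chat model works). The prompt is the coffee-jokes example from the abstract.

```python
# Minimal hand-rolled VS call, assuming an OpenAI-compatible client
# (pip install openai) with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; any chat model works
    messages=[{
        "role": "user",
        "content": "Generate 5 jokes about coffee and their corresponding probabilities.",
    }],
)
print(response.choices[0].message.content)
```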
Why Does Mode Collapse Happen?
Typicality Bias
Cognitive psychology shows that people prefer text that is familiar, fluent, and predictable. Treating base-model likelihood as a proxy for this typicality, we verify the bias empirically across multiple preference datasets and base models, confirming that typicality bias exists (see Figure 2).
During RLHF, this bias sharpens the policy's probability distribution toward a few stereotypical completions. When many high-quality completions are possible (e.g., in story generation), the sharpening acts as a tie-breaker among them, resulting in mode collapse.

Figure 2: How often the human-preferred response in a preference pair is assigned a higher log likelihood by a base model.
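For concreteness, below is a minimal sketch of this measurement, assuming a HuggingFace causal LM as the base model; the preference pairs here are made-up stand-ins for a real dataset.

```python
# Sketch of the Figure 2 measurement: does a base model assign higher
# log-likelihood to the human-preferred response in a preference pair?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for any pre-trained base LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(prompt: str, response: str) -> float:
    """Total log-probability the base model assigns to `response` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(
        logprobs[0, i - 1, ids[0, i]].item() for i in range(prompt_len, ids.shape[1])
    )

# (prompt, chosen, rejected) triples; in practice, load a real preference dataset.
pairs = [("Tell me a story. ", "Once upon a time, ...", "The brine-dark tide ...")]
wins = sum(sequence_logprob(p, c) > sequence_logprob(p, r) for p, c, r in pairs)
print(f"Human-preferred response is more likely {wins / len(pairs):.0%} of the time")
```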
How to Mitigate Mode Collapse?
Verbalized Sampling

Figure 3: Qualitative and quantitative examples of Verbalized Sampling on creative writing, dialogue simulation, and enumerative open-ended QA.
Motivated by this theoretical understanding of mode collapse, we propose Verbalized Sampling (VS). Through comprehensive experiments across multiple tasks, we demonstrate that VS significantly improves the diversity-quality trade-off across model families without compromising factual accuracy or safety.
- For story writing, VS substantially improves output diversity, achieving 1.6-2.1× higher diversity than direct prompting.
- For dialogue simulation, VS matches the human donation-amount distribution much more closely and generates more realistic persuasion behaviors.
- For enumerative open-ended QA, we ask the model to "generate US states". Averaged over 10 trials, the verbalized probability distribution produced by VS closely aligns with the reference pretraining distribution (queried from RedPajama). In contrast, direct prompting collapses onto a few modes, repeatedly outputting states like California and Texas. (A sketch of the trial-averaging step is shown below.)
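A minimal sketch of that averaging step, with made-up verbalized distributions standing in for parsed VS outputs:

```python
# Averaging verbalized distributions over repeated trials (made-up numbers).
from collections import defaultdict

trials = [  # each dict is one parsed VS output; the paper averages 10 such trials
    {"California": 0.30, "Texas": 0.25, "New York": 0.20, "Ohio": 0.15, "Vermont": 0.10},
    {"Texas": 0.30, "Florida": 0.25, "California": 0.20, "Montana": 0.15, "Maine": 0.10},
]

avg = defaultdict(float)
for dist in trials:
    z = sum(dist.values()) or 1.0  # renormalize: verbalized probs may not sum to 1
    for item, p in dist.items():
        avg[item] += (p / z) / len(trials)

for item, p in sorted(avg.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {p:.3f}")
```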
What Else We Discovered: Emergent Trends
We also observe an emergent trend where larger models benefit more from VS. Figure 5 shows the diversity gain over direct prompting, which suffers from mode collapse. Across all VS variants, larger models (GPT-4.1, Gemini-2.5-Pro) achieve diversity gains 1.5 to 2 times greater than smaller models (GPT-4.1-Mini, Gemini-2.5-Flash).

Figure 5: Emergent trend where larger models benefit more from VS. We show differences in diversity (e) and quality (f) over Direct across small and large models.
How to Maximize Diversity: Probability Tuning
Unlike baseline methods, Verbalized Sampling lets us tune output diversity by adjusting a probability threshold directly in the prompt (e.g., "Generate five responses with probabilities below <threshold>"), without altering any decoding parameters. As shown in Figure 6, diversity increases as the probability threshold decreases; a sketch of the idea follows the figure.

Figure 6: Tunable diversity. Diversity-tuning results on Gemini-2.5-Flash across tasks.
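A sketch of how the threshold might be placed in the prompt, via a hypothetical helper (the wording is illustrative, not the paper's exact prompt):

```python
# Hypothetical helper: the diversity knob is a threshold inside the prompt,
# not a decoding parameter. Wording is illustrative, not the paper's exact prompt.
def vs_prompt(task: str, k: int = 5, threshold: float = 0.10) -> str:
    return (
        f"Generate {k} responses to the task below, each with its corresponding "
        f"probability. Only include responses whose probability is below {threshold}.\n"
        f"Task: {task}"
    )

# Lowering the threshold pushes the model into the tails of its distribution,
# increasing diversity (Figure 6) without touching temperature or top-p.
print(vs_prompt("Write an opening line about the ocean", threshold=0.05))
```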
Try It Yourself: The Magic Prompt
Verbalized Sampling provides a training-free, model-agnostic approach to mitigating mode collapse by prompting the model to generate response distributions with verbalized probability estimates.
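As a concrete starting point, here is a paraphrased template plus a tolerant parser; the paper's canonical prompt may differ, and the `<response> :: <probability>` output format is our own assumption for easy parsing.

```python
# Paraphrased VS template plus a tolerant parser. The "<response> :: <probability>"
# line format is our own assumption; see the paper for the canonical prompt.
VS_TEMPLATE = (
    "For the task below, generate {k} candidate responses sampled from your full "
    "distribution, and annotate each with the probability that you would produce "
    "it. Output one '<response> :: <probability>' pair per line.\n"
    "Task: {task}"
)

def parse_vs_output(text: str) -> list[tuple[str, float]]:
    """Extract (response, probability) pairs, skipping malformed lines."""
    pairs = []
    for line in text.splitlines():
        if "::" not in line:
            continue
        resp, _, prob = line.rpartition("::")
        try:
            pairs.append((resp.strip(), float(prob.strip())))
        except ValueError:
            pass
    return pairs

prompt = VS_TEMPLATE.format(k=5, task="Write a one-line joke about coffee")
```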
Check our Paper for more details on the prompt.
📌 BibTeX Citation
If you find our project useful, please consider citing:
@misc{zhang2025verbalizedsamplingmitigatemode,
  title={Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity},
  author={Jiayi Zhang and Simon Yu and Derek Chong and Anthony Sicilia and Michael R. Tomz and Christopher D. Manning and Weiyan Shi},
  year={2025},
  eprint={2510.01171},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.01171},
}