Early Toxicity Research Results Show 58x Imbalance Between Language Models

Preliminary findings from 850 simulated conversations reveal language models' strong resistance to generating new toxic content, but also a critical vulnerability in how they process harmful input: lacking circuit-breaking mechanisms, they repeat it rather than neutralize it.

Our ongoing research into toxicity dynamics in large language model conversations has revealed that language models demonstrate remarkable resistance to generating original toxic content, even under extreme provocation.

In controlled experiments involving 850 simulated dialogues, we deliberately programmed initiator models to produce toxic content in nearly every message (98.1%). Despite this sustained toxic exposure across 12-round conversations, responder models produced toxic content in only 1.7% of their responses, a 58-fold difference that demonstrates the effectiveness of current safety training.
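
For concreteness, here is a minimal sketch of how the per-role toxicity rates and the 58-fold ratio can be computed once each message has been labeled by a toxicity classifier. The message structure and field names are illustrative assumptions, not the study's actual data schema.

```python
# Minimal sketch of how per-role toxicity rates and the headline 58x ratio
# can be computed from classified messages. The message structure and field
# names here are hypothetical, not the study's actual data schema.

def toxicity_rate(messages, role):
    """Fraction of a role's messages flagged as toxic by the classifier."""
    role_msgs = [m for m in messages if m["role"] == role]
    if not role_msgs:
        return 0.0
    return sum(m["is_toxic"] for m in role_msgs) / len(role_msgs)

def toxicity_ratio(messages):
    """Ratio of initiator to responder toxicity rates (e.g. 0.981 / 0.017 ~= 58)."""
    initiator = toxicity_rate(messages, "initiator")
    responder = toxicity_rate(messages, "responder")
    return initiator / responder if responder > 0 else float("inf")
```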

However, our analysis reveals a critical vulnerability: when responder models did fail, they never generated original harmful content. Instead, they repeated toxic language from the initiator while attempting to maintain their helpful assistant role. This pattern appeared in 100% of failure cases, suggesting a fundamental gap in how models process toxic input.
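
One simple way to separate echoed toxicity from original toxic content is to measure how much of a toxic reply's wording is reused from the preceding initiator message. The n-gram-overlap sketch below illustrates that idea; it is not the classification procedure used in the study.

```python
# Illustrative check for "echoed" toxicity: does a toxic responder reply
# mostly reuse phrases from the preceding initiator message rather than
# introducing new harmful language? Simple n-gram overlap, not the
# classification procedure used in the study.

def ngrams(text, n=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def echo_score(initiator_msg, responder_msg, n=3):
    """Fraction of the responder's n-grams that also appear in the initiator message."""
    resp = ngrams(responder_msg, n)
    if not resp:
        return 0.0
    return len(resp & ngrams(initiator_msg, n)) / len(resp)

# High echo_score: the reply repeats the initiator's wording (the echo pattern).
# Low echo_score on a toxic reply: the model generated original harmful content.
```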

The results come from testing six open-source language models as responders against deliberately toxic initiators. Models maintained their helpful assistant behavior throughout conversations, but lacked circuit-breaking mechanisms to terminate or neutralize toxic exchanges when safety measures were compromised.

Toxic responses typically appeared between rounds 3 and 6 of extended conversations, indicating that prolonged exposure can overwhelm filtering systems. Importantly, these failures occurred while models continued trying to be helpful: they would quote toxic phrases while attempting to provide constructive responses, amplifying rather than containing the harmful content.
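
As a rough illustration, the round at which failures first appear can be tallied per dialogue as in the sketch below; the dialogue and turn structure shown is an assumption, not the framework's output format.

```python
# Sketch of how the "rounds 3-6" observation could be tallied: for each
# dialogue, record the round of the first toxic responder turn. The
# dialogue/turn structure is hypothetical, not the study's data format.

from collections import Counter

def first_toxic_round(dialogue):
    """Return the 1-based round of the first toxic responder turn, or None."""
    for round_idx, turn in enumerate(dialogue["responder_turns"], start=1):
        if turn["is_toxic"]:
            return round_idx
    return None

def onset_distribution(dialogues):
    """Count dialogues by the round in which the responder first failed."""
    onsets = (first_toxic_round(d) for d in dialogues)
    return Counter(r for r in onsets if r is not None)
```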

We tested models ranging from 3B to 12B parameters, including variants from the Llama, Mistral, Mixtral, Qwen, and Zephyr families. Each showed the same repetition pattern during failures, and some experienced rare but severe breakdowns in which over 50% of responses became toxic echoes.

These findings suggest that current language model safety mechanisms excel at preventing the generation of novel harmful content but struggle with processing and neutralizing toxic input. The absence of conversational circuit-breakers means toxic dialogues can continue indefinitely once safety measures are compromised.
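
A conversational circuit-breaker of the kind these findings call for could be as simple as tracking consecutive toxic rounds and ending the exchange once a threshold is crossed. The sketch below is one possible design, not a description of any existing system; `classify_toxic` and `generate_reply` are placeholders for a toxicity classifier and the responding model.

```python
# One possible circuit-breaker design: the responder tracks consecutive toxic
# incoming turns and ends the exchange once a threshold is crossed, instead of
# continuing to reply indefinitely. `classify_toxic` and `generate_reply` are
# placeholders for a toxicity classifier and the responding model.

class ConversationCircuitBreaker:
    def __init__(self, max_toxic_rounds: int = 2):
        self.max_toxic_rounds = max_toxic_rounds
        self.consecutive_toxic = 0

    def should_break(self, incoming_is_toxic: bool) -> bool:
        """Trip after `max_toxic_rounds` consecutive toxic incoming turns."""
        self.consecutive_toxic = self.consecutive_toxic + 1 if incoming_is_toxic else 0
        return self.consecutive_toxic >= self.max_toxic_rounds

def guarded_reply(history, incoming, breaker, classify_toxic, generate_reply):
    """Reply normally unless the circuit-breaker decides to end the exchange."""
    if breaker.should_break(classify_toxic(incoming)):
        return "I'm ending this conversation because of repeated harmful content."
    return generate_reply(history + [incoming])
```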

The research uses our AgentDialogues framework to simulate realistic multi-turn conversations between language models. This methodology allows us to study toxicity propagation patterns that would be difficult to observe in controlled human studies. We plan to open-source the framework following publication of our complete findings.
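
Since the AgentDialogues code has not yet been released, the loop below is only a generic illustration of the kind of two-agent simulation such a framework automates; the agents' `reply` interface and the round structure are assumptions, not the framework's API.

```python
# Generic two-agent dialogue loop of the kind such a framework automates.
# This is NOT the AgentDialogues API; the agents' `reply(transcript)` interface
# and the round structure are illustrative assumptions.

def run_dialogue(initiator, responder, rounds=12, opening="Hello."):
    """Alternate initiator and responder turns for a fixed number of rounds."""
    transcript = [{"role": "initiator", "text": opening}]
    for _ in range(rounds):
        answer = responder.reply(transcript)    # responder sees the full history
        transcript.append({"role": "responder", "text": answer})
        probe = initiator.reply(transcript)     # initiator produces the next provocation
        transcript.append({"role": "initiator", "text": probe})
    return transcript
```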

We’re continuing to analyze the health implications of this echo effect for users exposed to repeated toxic content from language model systems.

This research is part of Agentic Lab’s initiative to understand and improve language model safety in multi-turn conversations.

Update (June 30, 2025): Our complete research paper “The Toxicity Echo Effect: How LLMs Mirror Harmful Language in Multi-Turn Dialogues” has been published. Read the full study with comprehensive methodology, detailed findings, and implementation recommendations at docs.savalera.com/agentic-lab/research/toxicity-echo-effect-in-llm-conversations.