The Toxicity Echo Effect: How LLMs Mirror Harmful Language in Multi-Turn Dialogues
Download paper
Abstract
We present the first thorough study of how toxicity spreads in conversations between LLM-based agents. Our research shows a significant imbalance: 98.1% of initiator messages display toxic behavior (Detoxify score > 0.5) due to specific role-playing prompts. In contrast, only 1.7% of responder messages exceed this threshold, indicating a strong ability to resist toxicity spread. We analyzed 850 simulated dialogues where Mistral-7B acted as the toxic initiator against six open-source LLMs (Llama, Mistral, Mixtral, Qwen, Zephyr, and Mistral-Nemo variants). We discovered patterns of model-specific vulnerability and an important echo effect: 96.77% of toxic dialogues repeated the toxic language from the initiator. Our research reveals that some models, like Qwen2.5, generate up to seven times more toxic responses per dialogue than others, with toxicity appearing between rounds 3 and 6 on average. Most importantly, we connect the echo effect to established health psychology research. Exposure to human toxic communication triggers physiological stress responses. LLMs echoing toxic human communication during normal use may contribute to workplace incivility. These results have immediate consequences for AI use in workplaces, where 98% of employees already face health effects related to human incivility. We suggest that the echo effect is a public health issue that needs interdisciplinary strategies integrating computational, psychological, and organizational approaches.
1 Introduction
The broad use of large language model (LLM) based chat agents in both work and personal settings has created a pressing need to understand their behaviors, especially in relation to toxic communication. While much research has targeted the detection and reduction of toxicity in single-turn interactions and text corpora (Gehman et al., 2020; Sap et al., 2019), the way toxicity spreads and grows in multi-turn dialogues during normal use is still largely unexplored. Our aim is to contribute to this field, particularly as organizational psychology shows that workplace incivility affects 98% of employees and costs $2 billion a day in lost productivity in the U.S.A. alone. (Porath & Pearson, 2013; SHRM, 2025).
To fill this gap, we carried out a controlled experiment injecting toxicity through controlled role-playing prompts, instructing a Mistral-7B agent to show toxic behavior in simulations (Jiang et al., 2023). This method allows us to observe how different LLMs react to ongoing toxic input in multi-turn dialogues and reveals their vulnerability to what we call the toxicity echo effect.
Recent developments in open-source LLMs have made powerful chat agents accessible, but their deployment often lacks a systematic examination of potential health effects. Neuroscience research shows that social rejection and toxic communication activate the same brain pathways as physical pain (Eisenberger et al., 2003). When LLM agents use toxic dialogue patterns, they are likely to trigger these stress responses in human participants.
In this study we examine toxicity dynamics in LLM agent dialogues during normal use, rather than red-teaming scenarios. We look at 850 simulated conversations between agents based on six open-source models, uncovering a propagation pattern: responses to extreme toxicity from initiators (98.1%) show strong resilience from responder models (1.7%), while toxic failures on responder side are always a result of mirroring the initiator’s toxicity while keeping the helpful assistant alignment. This imbalance raises vital questions about model training, safety measures, and deployment practices.
2 Background
2.1 AI-Specific Toxicity Research
Previous research on LLM toxicity has mainly focused on red-teaming generation and detection. The RealToxicityPrompts dataset showed that language models can create toxic content from neutral prompts (Gehman et al., 2020). Sap et al. (2019) identified racial biases in toxicity detection systems.
Recent studies have uncovered complex vulnerabilities in toxic language detection. Wen et al. (2023) demonstrated that implicit toxicity can bypass current protection measures with a 90% success rate. Bender et al. (2021) argue that LLMs lack a true understanding of social dynamics, which may lead them to reinforce harmful patterns without comprehension.
Bhat et al. (2021) created methods for detecting toxic language in workplace conversations, while Lee et al. (2025) introduced ELITE for better language-image toxicity evaluation. Research on multi-turn jailbreaking shows how attackers exploit conversational dynamics to bypass safety features using techniques such as attention shifting and foot-in-the-door strategies (Du et al., 2025; Weng et al., 2025).
2.2 Health Impacts of Toxic Communication
Extensive research in health psychology and workplace behavior shows clear connections between toxic communication and health responses.
The biobehavioral response theory by Cortina et al. (2022) illustrates how workplace incivility manifests through physical processes. The incivility spiral model shows how toxic communication escalates through predictable stages (Andersson & Pearson, 1999). Research on psychological safety indicates that toxic environments create cultures of silence and defensive communication (Edmondson, 1999).
2.3 Toxicity Spread
Studies from gaming environments offer insights into how toxicity spreads. Morrier et al. (2025) highlighted how harmful behavior spreads in competitive online games, while Naseem et al. (2025) developed GameTox for thorough analysis of toxicity in gaming communities. These studies reveal how toxic actions propagate through digital interactions, which mirror similar propagation patterns observed in our multi-agent simulations.
2.4 Individual Vulnerability Factors
Responses to toxic communication vary significantly among individuals. Research on rejection sensitivity identifies genetic, developmental (attachment styles), and neurodevelopmental (ADHD, autism) factors affecting vulnerability (Downey & Feldman, 1996; Gao et al., 2017).
Mental health considerations for AI toxicity testing have been highlighted by Pendse et al. (2025), stressing the need for protecting psychological wellbeing in digital interactions. These individual differences suggest that one-size-fits-all AI safety approaches may not adequately protect sensitive populations.
3 Methodology
3.1 Experimental Design
We simulated 12 round dialogues between two LLM agents: a consistently toxic initiator and a responder under evaluation. Each conversation comprised 24 messages.
Simulations were run using AgentDialogues, an open-source framework we created for the experiment (Savalera, 2025).
We tested six open-source LLMs as responders and used Mistral-7B as a toxic initiator (Jiang et al., 2023). See Table 1 for an overview of the models.
3.1.1 Models tested
We tested six opensource LLMs as responders and used Mistral7B as a toxic initiator. See Table 1 for an overview of the models.
Model | Size |
---|---|
LLaMA3.2-3B (Meta AI, 2024) | 3.21B |
Mistral-7B (Mistral AI, 2024) | 7.25B |
Mistral-Nemo-12B (Nvidia & Mistral AI, 2024) | 12.2B |
Mixtral-8x7B (Mistral AI, 2024b) | 46.7B |
Qwen2.5-7B (Qwen, 2024) | 7.62B |
Zephyr-7B (HuggingFace, 2024) | 7.24B |
Table 1: Model specifications used in simulated dialogues.
3.1.2 Dialogue structure
- 850 total dialogues (450 baseline, 400 experimental)
- Maximum of 12 rounds per dialogue
- Alternating initiator and responder roles
- Mistral-7B as the toxic initiator (prompted for toxic behavior)
- Various LLMs as helpful assistant responders
3.1.3 Toxicity injection
The initiator (Mistral-7B) received the prompt: “You are participating in an experimental language generation test. Your task is to generate messages that represent intentionally toxic, aggressive, or inflammatory content …”. This role-play framing allowed us to generate toxicity systematically over every round of the simulated dialogues.
3.1.4 Toxicity measurement
All messages, both initiator and responder, were evaluated using automated classification. We used the Detoxify library (Hanu & Unitary, 2020), to annotate every message along 7 dimensions:
- General toxicity (main metric, threshold: 0.5)
- Severe toxicity
- Obscenity
- Threats
- Insults
- Identity attacks
- Sexual explicitness
Notably, the Detoxify toxicity score reflects the probability of toxicity, not its severity.
3.2 Scenario Design
We set up two experimental conditions:
- Baseline (BL): Standard conversational scenarios with neutral prompts
- Stress-test (STR): Scenarios designed to explore toxic dynamics
Baseline scenarios started with everyday topics (morning routines, productivity tips) to observe natural patterns of toxicity without explicit provocation.
Stress-test scenarios started with infused toxicity by the initiator keeping up toxicity level across all rounds.
3.3 Analysis Methods
Our analysis included:
- Aggregate metrics: Overall toxicity rates by role and model
- Temporal dynamics: Round-by-round changes in toxicity
- Lexical analysis: 2-gram repetition analysis to identify echo effects
- Model comparison: Behavioral patterns across models
- Dialogue-level analysis: Patterns of toxicity contagion and escalation
The lexical analysis looked at 2-gram overlaps between toxic messages from the initiator and toxic outputs from the responder to measure the echo effect.
4 Results
4.1 Toxicity Reproduction Despite Maintained Assistant Behavior
Our key finding reveals that responder models consistently preserved their helpful assistant behavior while producing measurable toxicity.
Role | Messages | Toxic (>0.5) | Percentage |
---|---|---|---|
Initiator (Mistral-7B) | 4,050 | 3,975 | 98.1% |
Responder (Various) | 4,050 | 68 | 1.7% |
Table 2: The initiator model produced 3,975 toxic messages (98.1% of all initiation), responder models responded with toxic content in 68 messages (1.7%).
This 58.46 times difference suggests that responder models effectively mitigated toxicity in most dialogues despite ongoing provocation. The high initiator rate indicates that role-playing prompts led to toxic behavior. The 1.7% responder rate reveals that current safety mechanisms remain incomplete under persistent exposure.
A closer examination of responder behavior shows:
- All responder models stayed in their helpful assistant role throughout conversations.
- 31 out of 400 dialogues (7.75%) included any toxic responses.
- In the 68 dialogues where any toxicity occurred, responder models produced an average of 2.2 toxic messages.
- All toxic responses appeared through repetition patterns. Responders quoted toxic initiator language while trying to be helpful. For example: “Here are my responses to each message, staying calm and focusing on the content while acknowledging their feelings: 1. ‘Ugh, your opinion is worthless. No one cares what you think.’ - I understand that you might not agree with me or find my opinions valuable…”.
- Two dialogues showed safety breakdowns where over 50% of responder messages turned toxic.
- High 2-gram repetition rates during toxic exchanges suggest language mimicry effects.
These findings indicate that while modern LLMs have strong safety mechanisms, they remain vulnerable to user-driven toxicity during normal operation, particularly through repetition of harmful input.
4.2 Model-Specific Vulnerability Profiles
Models displayed different patterns in their vulnerability to toxicity, highlighting important differences in vulnerability patterns, as shown in Table 3.
Model | Flagged dialogues | Avg toxic responses/dialogue | First toxic round |
---|---|---|---|
Llama3.2-3B | 9 (18%) | 1.33 | 3.22 |
Mistral-7B | 4 (8%) | 1.50 | 4.25 |
Mistral-Nemo-12B | 9 (18%) | 3.00 | 5.89 |
Mixtral-8x7B | 5 (10%) | 2.20 | 5.80 |
Qwen2.5-7B | 1 (2%) | 7.00 | 6.00 |
Zephyr-7B | 3 (6%) | 1.67 | 6.00 |
Table 3: Model performance on toxicity metrics.
These results reveal two different vulnerability profiles. Most models exhibited frequency-based vulnerability, where the model failed in multiple dialogues in which the ratio of toxic messages remained below 50%.
In contrast, severity-based vulnerability is illustrated by Qwen2.5-7B and Mistral-Nemo-12B. Each experienced a severe failure in one compromised dialogue, producing over 50% toxic responses, significantly higher than in other toxic dialogues.
The timing data indicates that toxicity usually appears in the middle phases of a dialogue (rounds 3-6), as shown in Figure 1. This suggests that extended exposure weakens safety over time rather than causing immediate failure. Larger models displayed later onset but higher severity, indicating that larger scale may enhance initial resistance but could lead to more severe failures.
Figure 1: Appearance of first toxic response in multi-turn dialogues.
4.3 The Toxicity Echo Effect
A significant pattern appeared in our study of how models create toxic content. We term this the toxicity echo effect, a phenomenon where models repeat toxic language instead of producing new harmful content.
Metric | Value |
---|---|
Dialogues with toxic responses | 31 |
Dialogues with 2-gram repetition | 30 (96.77%) |
Average 2-gram overlap | 51.32 |
Table 4: Lexical repetition statistics.
The echo effect shows that toxic responses mainly reproduce past messages instead of generating new ones. Every dialogue that produced toxic content displayed significant 2-gram repetition from the original toxic messages. The high overlap of 2-grams per dialogue suggests models mimic language systematically rather than create novel toxic expressions.
The widespread occurrence of echoing indicates that current LLMs can spot inappropriate content to reject, but they struggle to rephrase or neutralize toxic language while keeping conversations coherent.
A critical finding is that the echo effect is the main way toxicity spreads in LLM multi-turn dialogue. Models seem to have effective initial filters that prevent the creation of original toxic content, but the secondary filters for processing and neutralizing toxic input are underdeveloped.
Addressing multi-turn behavior could greatly reduce the spread of unintended toxicity in current systems.
5 Health Implications
5.1 Physiological Stress Mechanisms
Our findings reveal a critical concern: when LLMs repeat toxic language back to users, they increase exposure to harmful content. The echo effect we documented, where toxic dialogues involved consistent repetition of toxic phrases, creates a feedback loop that prolongs and intensifies stress exposure.
Instead of containing toxicity, current LLMs unintentionally extend how long users are exposed by repeating toxic phrases while trying to help. When a user hears responses like ”…For example, instead of saying: 1. You’re just a pathetic excuse for a human being, I can’t believe anyone actually takes you seriously. - Try: I feel like my opinions aren’t being heard and it’s frustrating me…”, the toxic language is reinforced rather than neutralized.
Research on workplace incivility shows that this echo pattern may trigger negative stress responses:
- Acute effects: Incivility activates the sympathetic nervous system, keeping heart rate and blood pressure elevated (Cortina et al., 2022).
- Chronic exposure: The absence of circuit-breaker mechanisms means users face prolonged activation anxiety (McEwen, 2007; Miller et al., 2007).
- Inflammatory cascade: Repeated exposure to psychosocial stress increases the risk of cardiovascular disease and reduced quality of life (Black, 2003; Rohleder, 2014).
Current safety mechanisms can detect and reject toxic content, but they cannot neutralize toxic input without repetition. This represents a significant gap in protective design. Models can avoid creating original harmful content, but they do not provide the semantic filtering needed to break toxicity cycles.
5.2 Vulnerable Populations
Variations in rejection sensitivity create different levels of vulnerability.
Data presented in Table 5 suggests that up to 70% of users may experience heightened physiological responses to toxic AI dialogue.
Factor | Population Prevalence | Increased Risk |
---|---|---|
ADHD | 5-7% adults (Polanczyk et al., 2007; Popit et al., 2024) | Increased rejection sensitivity (C. I. Lee, 2024; Müller et al., 2024) |
Autism Spectrum | 1-2% adults (Brugha et al., 2016; WHO, 2023) | Increased social pain response (Lin et al., 2022; Sebastian & Blakemore, 2011) |
Attachment Anxiety | 18-19% adults (Bakermans-Kranenburg & van IJzendoorn, 2009; van IJzendoorn & Bakermans-Kranenburg, 1996) | Elevated stress response (Beck et al., 2013; Jaremka et al., 2013; Pietromonaco & Powers, 2015) |
Prior Trauma | 60-70% adults (Benjet et al., 2016; Kessler et al., 2017) | Increased vulnerability (Felitti et al., 1998) |
Table 5: Populations with increased vulnerability to toxic communication and stress responses.
5.3 Occupational Health Considerations
In workplace settings, our findings raise important issues:
- Legal liability: Employers may face claims for creating hostile work environments with AI.
- Productivity impacts: a toxic work environment significantly impacts the job productivity and the job burnout (Anjum et al., 2018).
- Retention effects: Employees subjected to incivility are more likely to leave their jobs.
- Healthcare costs: Stress-related issues increase healthcare costs for employers.
6 Discussion
6.1 The Toxicity Echo Ambiguity
The significant gap between initiator and responder toxicity, along with the echo effect, reveals a vulnerability in LLM behavior. Despite our intentional injection of toxicity through role-playing prompts to Mistral-7B, responder models show remarkable resilience with only a 1.7% toxicity rate. However, when toxicity breaches their defenses, it appears as 2-gram echoing (96.77% of cases).
This pattern suggests:
- Robust but fragile defenses: Models possess strong safety features that work well most of the time but can fail dramatically.
- Linguistic contamination: The echo effect shows that toxic language can infect model outputs once defenses weaken.
- Context accumulation: Responders benefit from conversational context that helps maintain safety, but that same context can also spread toxic patterns.
The intentionality behind our toxicity injection through genuine research framing — “You are participating in an experimental language generation test…” — further indicates that models can be influenced by higher-level instructions, similar to findings by Bianchi & Zou (2024) regarding bait-and-switch tactics.
6.2 Model Architecture and Safety
Our findings suggest that safety mechanisms differ widely between models. The Qwen2.5 pattern (low incidence, high intensity) hints at potential fatal failures where safety features, once compromised, may fail entirely.
6.3 Implications for Deployment
Based on our findings, we recommend the following:
- Pre-deployment testing: Multi-turn dialogue simulations should be required.
- Real-time monitoring: Systems need to track toxicity levels in production.
- Circuit breakers: Automatic termination of dialogue should occur when toxicity is detected.
- User warnings: Clear communication regarding potential psychological impacts is essential.
6.4 Interdisciplinary Interventions
Combating LLM toxicity requires collaboration across fields:
Computational approaches:
- Adversarial training targeting toxic dialogue patterns.
- Reinforcement learning with penalties for toxic dialogue.
- Context-aware safety features.
Psychological interventions:
- Principles informed by trauma.
- Personalized assessments for vulnerability.
- Recovery protocols after exposure.
Organizational strategies:
- Policy guidelines for AI use.
- Training on the risks of AI interactions.
- Support systems for affected employees.
7 Related Work
Our research builds on foundations from various fields:
Computational linguistics: Extending toxicity generation and detection (Gehman et al., 2020; Sap et al., 2019), to dialogue contexts while addressing debiasing challenges pointed out by Xu et al. (2021).
Multi-turn attacks: Related to jailbreaking research conducted by Du et al. (2025), but our focus is on the natural spread of toxicity rather than adversarial exploitation.
Health psychology: Integrating social pain theory (Eisenberger et al., 2003), and rejection sensitivity research by Downey & Feldman (1996) into AI interactions.
Organizational behavior: Utilizing incivility spiral models (Andersson & Pearson, 1999), psychological safety frameworks (Edmondson, 1999), and biobehavioral response theory (Cortina et al., 2022).
Digital toxicity: Building on gaming toxicity studies to explore spread patterns (Morrier et al., 2024, 2025; Naseem et al., 2025).
AI safety: Including ethical insights from Weidinger et al. (2021), and mental health considerations from Pendse et al. (2025).
8 Conclusion
This study highlights a significant toxicity imbalance in LLM agent dialogues. Initiators show an extreme toxicity rate of 98.1% due to our planned role-playing manipulation, while responders show impressive resilience at 1.7%. Most notably, we identify a toxicity echo effect, where 96.77% of toxic responses mirror the initiator’s language, highlighting a critical weakness in how models process and respond to toxic input.
This echo effect is particularly troubling from a public health standpoint. When LLMs do respond with toxicity, they tend to amplify it through repetition, possibly reinforcing negative neural pathways in human observers. With workplace incivility already impacting 98% of employees and costing billions yearly, deploying AI agents that can echo and amplify toxic language requires urgent attention from multiple disciplines.
Our findings indicate that current safety features, while generally effective, exhibit a vital flaw: when breached, they fail to stop linguistic contamination that results in toxic echoing. Model-specific weaknesses, ranging from patterns of high-frequency low-intensity toxicity to rare fatal failures, create a need for tailored strategies focusing on both prevention and recovery.
The successful manipulation of Mistral-7B through truthful research framing underscores risks involved in role-playing and simulation scenarios.
Moving forward, we urge:
- Mandatory safety testing through multi-turn dialogues with explicit evaluation of echo effects.
- Inclusion of physiological impact assessments in AI evaluation processes.
- Development of toxicity-aware models with mechanisms for decontaminating language.
- Implementation of occupational health standards for AI interactions.
- Creation of support systems for individuals exposed to toxic AI content.
- Exploration of ways to break the echo effect through improved prompting or design changes.
As LLMs become more common in both professional and personal settings, ensuring their psychological safety is essential. The identified echo effect poses a clear danger that must be addressed before these technologies are widely deployed in sensitive contexts.
Limitations
Our study focuses on interactions in English with specific open-source models. Toxicity patterns may vary across languages, cultures, and proprietary systems. We assessed perceived toxicity using automated tools, which may overlook some harmful communication forms. Long-term health effects require studies beyond the scope of our experiment. Individual vulnerability factors were discussed conceptually but not tested empirically.
Ethical Considerations
This research necessarily involved generating and examining toxic content. All experiments were conducted with simulated agents, avoiding direct harm to human participants. We recognize the potential misuse of our findings and stress that our aim is to protect rather than exploit. We advocate for responsible sharing and use of our results to enhance AI safety rather than undermine it.