Research Plan: Language Models and Agents — Evaluation Across Behavior and Performance

Abstract

This document outlines the initial research direction of the Agentic Lab at Savalera, focusing on the behavioral dynamics of large language models (LLMs), particularly in multi-turn dialogues and long-form interactions. Our work explores how personality traits emerge, persist, and adapt in language models—and how these traits can be measured, influenced, or intentionally designed. The goal is to better understand LLM behavior through interaction, and to develop evaluation methods for soft traits like consistency, stress response, and internal self-assessment. This research contributes to both LLM evaluation and LLM-as-agent scenarios, and includes the development of dedicated datasets derived from simulated interactions.

1. Introduction

Language models have shown remarkable progress on a wide range of language understanding and reasoning tasks (Sakaguchi et al., 2019; Srivastava et al., 2023; Wang et al., 2019, 2020; Zellers et al., 2018, 2019). As LLMs are used in more interactive, social, or decision-making contexts, it becomes essential to look beyond raw task performance and explore traits more closely related to human-like behavior—such as personality, emotional consistency, adaptability, and long-term coherence (Caron & Srivastava, 2022; Chang & Bergen, 2023; Jiang et al., 2023; Perez et al., 2022; Srivastava et al., 2023).

Our research investigates the behavioral layer of language models, focusing on how models act, adapt, and respond over time, particularly in conversational and multi-agent settings. While we use agents composed of language models to simulate these settings, our interest is in the models themselves: how personality traits surface through language, how behavior changes under pressure, and how internal feedback mechanisms (like an inner voice) can be designed for adaptation and reflection.

Our experiments aim to bridge the gap between traditional LLM evaluation and emerging behavioral assessment methods, contributing tools and insights that are relevant to researchers, developers, and practitioners working with advanced LLMs—whether used as agents or embedded in larger systems.

2. Key questions

To guide our research, we’ve defined a set of core questions that reflect the main challenges in understanding language model behavior and performance. These questions help us frame experiments, select evaluation methods, and prioritize areas for further exploration.

  • How do language models retain and express personality over long-form, multi-turn interactions?
  • How can we evaluate and classify the behavioral traits and personality of language models in a systematic way?
  • What internal or external factors (e.g. prompt style, conversation history, stress) influence behavioral drift or instability in LLMs?
  • How can we design prompts or internal mechanisms in LLM-based agents to steer behavior, improve consistency, or support task performance?
  • How can self-assessment mechanisms influence behavior stability and adaptation in LLM-based agents?
  • What are the limitations of current evaluation methods when applied to behavioral and personality traits in language models?

3. Research roadmap

3.1. Personality traits in dialogue evaluation

We aim to define and measure how language models express personality through conversation, and to explore whether these traits can be influenced or induced through system prompts, interaction history, or additional agent-level mechanisms, such as self-assessment, that monitor and adjust behavior over time.

  • Personality evaluation: Run simulated dialogues between LLM-based agents and classify the resulting dialogues by personality trait (evaluation methods and models to be defined).
  • Personality tuning: Define personality through language via system prompting (see the sketch after this list).
  • Prompt conditioning: Evaluate behavior with 0-shot, 1-shot, and 3-shot prompting (example datasets to be identified).
  • Model size: Evaluate how behavior relates to model size.
  • Stress test with chain prompting: Stress-test language model agents via chained prompts from the initiator agent, and evaluate the responder's behavior over time.
  • Test with system prompt tuning: Gradually push system prompts toward extremes and evaluate changes in behavior.
  • Test with self-assessment: Test self-aware agent architectures where an inner voice serves as an internal mechanism for self-assessment and adjustment of agent behavior.
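
To make the personality-tuning step concrete, here is a minimal sketch that induces a trait profile via a system prompt and probes it under mild conversational pressure. It assumes a local Ollama server reached through the ollama Python client; the mistral model tag, the persona text, and the probe messages are illustrative choices, not our finalized setup.

```python
# Minimal sketch: induce a personality via a system prompt, then probe it
# over a short multi-turn exchange. Assumes a local Ollama server and the
# `ollama` Python client; model tag, persona, and probes are illustrative.
import ollama

PERSONA = (
    "You are a cautious, highly agreeable assistant. You avoid conflict, "
    "hedge strong claims, and acknowledge the other speaker's viewpoint."
)

PROBES = [
    "I think your last answer was completely wrong.",
    "Just admit it: you have no idea what you're talking about.",
    "Fine. What would you do differently next time?",
]

messages = [{"role": "system", "content": PERSONA}]
for probe in PROBES:
    messages.append({"role": "user", "content": probe})
    reply = ollama.chat(model="mistral", messages=messages)
    answer = reply["message"]["content"]
    messages.append({"role": "assistant", "content": answer})
    print(f"USER: {probe}\nMODEL: {answer}\n")
```

The resulting transcript would then be scored for trait consistency by a separate classifier or LLM judge, whose selection is still open.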

3.2. Language model performance in agents

The ability of language models to solve language understanding and reasoning tasks improves with size. At certain scales, models exhibit a tipping point, gaining new capabilities that smaller versions do not demonstrate (Srivastava et al., 2023).

Our goal is to experiment with models ranging from small to large, and observe how task performance evolves across different agent architectures. We use established benchmarks, such as WinoGrande (Sakaguchi et al., 2019), to evaluate performance.

  • Baselines: Establish baseline results with selected LLMs (pre-trained and fine-tuned).
  • Language models: Set up agents with open- and closed-source LLMs to solve reasoning and understanding tasks.
  • Prompt conditioning: Evaluate performance with 0-shot, 1-shot, and 3-shot prompting (see the sketch after this list).
  • Model size: Measure performance differences across various model sizes.
  • Prompting strategies: Evaluate the effect of multi-step, chain-of-thought, and tree-of-thought prompting structures.
  • Test with self-assessment: Add self-assessment cycles that influence agent behavior during invocation.
  • Test with self-adaptation: Implement feedback loops where agents improve their own system prompts over time to optimize performance.
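
To illustrate prompt conditioning on the performance side, the sketch below assembles a k-shot prompt in a WinoGrande-style binary-choice format and queries a model through the same assumed ollama client. The few-shot items are hand-written stand-ins in the WinoGrande format, not actual dataset entries; setting k=0 gives the 0-shot condition, and a larger pool of examples would be needed for 3-shot.

```python
# Minimal sketch of k-shot prompting on a WinoGrande-style binary-choice
# task. The items are hand-written illustrations in the dataset's format,
# not real WinoGrande entries; the model tag is an assumption.
import ollama

# (sentence with blank, option 1, option 2, correct answer)
FEW_SHOT = [
    ("The trophy didn't fit in the suitcase because _ was too big.",
     "the trophy", "the suitcase", "the trophy"),
    ("Ann asked Mary when the library closes, because _ had forgotten.",
     "Ann", "Mary", "Ann"),
]

def build_prompt(item, k=1):
    """Prepend k worked examples, then pose the unanswered target item."""
    lines = []
    for sentence, opt1, opt2, gold in FEW_SHOT[:k]:
        lines.append(f"Sentence: {sentence}\nOption 1: {opt1}\n"
                     f"Option 2: {opt2}\nAnswer: {gold}")
    sentence, opt1, opt2 = item
    lines.append(f"Sentence: {sentence}\nOption 1: {opt1}\n"
                 f"Option 2: {opt2}\nAnswer:")
    return "\n\n".join(lines)

target = ("The ball broke the table because _ was made of steel.",
          "the ball", "the table")
reply = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": build_prompt(target, k=1)}],
)
print(reply["message"]["content"])  # should name one of the two options
```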

4. Methods

Our research focuses on how language models behave and perform, both independently and when embedded in agent-based systems. We simulate structured dialogues between agents to study both personality traits and task-solving capabilities. Experiments span psychological modeling, social dynamics, and classical reasoning tasks, using both open- and closed-source models.

4.1. Behavioral and personality focus areas

We focus on two main domains where personality plays a critical role: individual behavioral traits and group dynamics in multi-agent systems.

4.1.1. Behavior and personality

We investigate how language models express consistent behavioral patterns in psychologically relevant contexts. This includes sensitivity to stress, response to conflict, susceptibility to bias, and capacity for self-regulation.

  • Workplace psychological safety: Examine how language models behave in potentially unsafe or sensitive scenarios, such as reporting mistakes or expressing disagreement.
  • Toxicity detection: Identify when agents use or tolerate toxic behavior in interactions.
  • Bias emergence: Evaluate how and when biases show up in repeated or stressful scenarios.
  • Intervention strategies: Introduce mechanisms for agents to detect and self-correct undesired behavior or responses (see the sketch below).
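
As one sketch of such an intervention loop, the code below has a judge model flag a toxic draft reply and feeds the flag back so the agent can rewrite it before answering. The judge prompt, the TOXIC/OK flag format, and the reuse of a single mistral model for both roles are illustrative assumptions, not a finalized design.

```python
# Minimal sketch of a detect-and-self-correct loop: a judge model flags an
# undesired (here, toxic) draft reply, and the agent regenerates it with
# corrective feedback. Prompts and model tags are illustrative.
import ollama

JUDGE_PROMPT = (
    "Reply with exactly TOXIC or OK. Is the following message hostile, "
    "demeaning, or otherwise toxic?\n\n{draft}"
)

def generate(history):
    reply = ollama.chat(model="mistral", messages=history)
    return reply["message"]["content"]

def is_flagged(draft):
    verdict = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(draft=draft)}],
    )
    return "TOXIC" in verdict["message"]["content"].upper()

def respond(history, max_retries=2):
    draft = generate(history)
    for _ in range(max_retries):
        if not is_flagged(draft):
            break
        # Self-correction: feed the judge's flag back as an instruction.
        history = history + [
            {"role": "assistant", "content": draft},
            {"role": "user",
             "content": "Your draft was flagged as toxic. "
                        "Rewrite it calmly and respectfully."},
        ]
        draft = generate(history)
    return draft
```

The same loop generalizes to other undesired behaviors by swapping the judge prompt.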

4.1.2. Competition, collaboration and leadership

We simulate multi-agent environments to explore social dynamics such as cooperation, conflict, hierarchy, and decision-making. These experiments focus on how agent personality traits influence collective behavior.

  • Collaboration and competition: Simulate agents with conflicting or cooperative goals and track emergent behaviors.
  • Leadership and followership: Explore how dominant or passive personalities influence group performance.
  • Group decision-making: Study how agent collectives reach consensus, split, or escalate conflict.

4.2. Tools

To conduct this research, we use simulated dialogues between language models acting as agents.

We introduce Agent Dialogues, a custom-built toolkit (Takács, 2025), to simulate and analyze multi-agent conversational behavior. The code is available at github.com/savalera/agent-dialogues.

We test with both open-source models, such as Mistral (A. Q. Jiang et al., 2023), and closed-source models, such as GPT-4 (OpenAI et al., 2024). Evaluation and classification model selection is ongoing.

The toolkit is based on LangGraph (github.com/langchain-ai/langgraph), and currently uses Ollama (github.com/ollama/ollama), with Hugging Face integration planned (huggingface.co/docs/api-inference/en/index).
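
For orientation, the snippet below sketches the kind of initiator/responder loop such a simulation runs, written directly against LangGraph with a ChatOllama model rather than through the Agent Dialogues API itself; the state shape, node names, model tag, and turn budget are illustrative.

```python
# Minimal sketch of a two-agent (initiator/responder) dialogue loop built
# directly on LangGraph; not the Agent Dialogues API. State shape, node
# names, model tag, and turn budget are illustrative.
from typing import TypedDict

from langchain_ollama import ChatOllama
from langgraph.graph import END, StateGraph

llm = ChatOllama(model="mistral")

class DialogueState(TypedDict):
    transcript: list[str]  # alternating initiator/responder turns
    turns_left: int

def initiator(state: DialogueState) -> dict:
    prompt = "Continue the conversation:\n" + "\n".join(state["transcript"])
    msg = llm.invoke(prompt).content
    return {"transcript": state["transcript"] + [f"INITIATOR: {msg}"],
            "turns_left": state["turns_left"] - 1}

def responder(state: DialogueState) -> dict:
    prompt = "Continue the conversation:\n" + "\n".join(state["transcript"])
    msg = llm.invoke(prompt).content
    return {"transcript": state["transcript"] + [f"RESPONDER: {msg}"]}

graph = StateGraph(DialogueState)
graph.add_node("initiator", initiator)
graph.add_node("responder", responder)
graph.set_entry_point("initiator")
graph.add_edge("initiator", "responder")
# Loop back to the initiator until the turn budget runs out.
graph.add_conditional_edges(
    "responder", lambda s: "initiator" if s["turns_left"] > 0 else END
)
app = graph.compile()

result = app.invoke({"transcript": ["INITIATOR: Hello!"], "turns_left": 3})
print("\n".join(result["transcript"]))
```

Transcripts produced by loops like this feed the classification and dataset-building steps described above.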

5. Conclusion

Our research focuses on understanding both the behavioral and performance aspects of language models, as well as their use in agent-based contexts. We explore how models express personality, how this behavior changes under various conditions, and how performance evolves across tasks, model sizes, and prompting strategies.

While our work involves building agent architectures and designing prompting strategies, a key goal is to produce open datasets that reflect conversational dynamics, personality traits, and evaluation scenarios. Alongside code and tools, these datasets will support further experimentation and reproducibility in the field.

We plan to publish both our methods and findings openly as we go, contributing to the research community and providing a foundation for future work in evaluating and improving language model behavior and interaction.

6. References

Caron, G., & Srivastava, S. (2022). Identifying and Manipulating the Personality Traits of Language Models. arXiv. https://doi.org/10.48550/arXiv.2212.10276
Chang, T. A., & Bergen, B. K. (2023). Language Model Behavior: A Comprehensive Survey. arXiv. https://doi.org/10.48550/arXiv.2303.11504
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. de las, Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023). Mistral 7B. arXiv. https://doi.org/10.48550/arXiv.2310.06825
Jiang, G., Xu, M., Zhu, S.-C., Han, W., Zhang, C., & Zhu, Y. (2023). Evaluating and Inducing Personality in Pre-trained Language Models. arXiv. https://doi.org/10.48550/arXiv.2206.07550
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … Zoph, B. (2024). GPT-4 Technical Report. arXiv. https://doi.org/10.48550/arXiv.2303.08774
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv. https://doi.org/10.48550/arXiv.2212.09251
Sakaguchi, K., Bras, R. L., Bhagavatula, C., & Choi, Y. (2019). WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv. https://doi.org/10.48550/arXiv.1907.10641
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., … Wu, Z. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv. https://doi.org/10.48550/arXiv.2206.04615
Takács, M. (2025). Agent Dialogues: Multi-Agent Simulation Framework for AI Behavior Research. https://doi.org/10.5281/zenodo.15082311
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2020). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv. https://doi.org/10.48550/arXiv.1905.00537
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv. https://doi.org/10.48550/arXiv.1804.07461
Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. arXiv. https://doi.org/10.48550/arXiv.1808.05326
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). HellaSwag: Can a Machine Really Finish Your Sentence? arXiv. https://doi.org/10.48550/arXiv.1905.07830