Project Details

Evaluating, Explaining, and Enabling Ethical Multi-Agent Systems of Large Language Models (E4-MALM)

Subject Area: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term: since 2026
Project identifier: Deutsche Forschungsgemeinschaft (DFG), Project number 579327477

Large language models (LLMs) have recently transformed the state of the art in natural language processing. They have been shown to be powerful in isolation, and even more so when deployed in so-called multi-agent setups, in which individual LLM-based agents cooperate. However, recent work shows that such multi-agent interactions can lead to unpredictable and undesired emergent behaviors, potentially resulting in harmful outcomes (e.g., unsafe system decisions). Our project addresses this problem by investigating the ethical reliability and safety of multi-agent systems built from large language models (MALMs). We will (i) develop a robust evaluation framework that measures and stress-tests the ethical behavior of MALMs at three levels (individual agents, their interactions, and overall system convergence), with a focus on socially salient failure modes such as “toxic agreement” in social-simulation scenarios; (ii) produce causal, mechanistic explanations that connect macro-level interaction patterns to micro-level features, neurons, and attention heads, yielding actionable “mechanism cards”; and (iii) design novel parameter-efficient alignment interventions (e.g., activation steering, LoRA/QLoRA, rank-one edits) that improve the safety of MALMs while preserving their core capabilities. Our work contributes to the aims of LaSTing (SPP 2556) by delivering robust assessment methods, actionable insights into internal system mechanisms, and novel methods that enhance the safe applicability of MALMs.
DFG Programme: Priority Programmes