Project Details
Trustworthy Reinforcement Learning from Human Feedback
Applicant
Professor Dr. Eyke Hüllermeier
Subject Area
Methods in Artificial Intelligence and Machine Learning
Term
since 2026
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 566029805
Artificial intelligence (AI) has made impressive progress in recent years. In the field of reinforcement learning (RL), much of this progress has been achieved by integrating humans into the learning process: Reinforcement Learning from Human Feedback (RLHF) allows human experts to convey domain knowledge and feedback in the form of preferences and to steer the learning process with comparatively little cognitive effort. The potential of this approach has recently been demonstrated impressively by the revolution in the training of large language models such as GPT. However, the RLHF learning process still involves certain risks, which have become apparent, for example, with the spread of powerful chatbots, and which need to be reduced as far as possible.

Accordingly, the aim of this project is to make important contributions to the trustworthiness of RLHF. By developing advanced methods for uncertainty quantification, we will increase the reliability and robustness of RLHF, an important requirement for safety-critical applications. By integrating methods of explainable AI (XAI), we will also improve the acceptance of RL and the communication between the human expert and the AI. Finally, more expressive feedback models and methods for dealing with time-dependent preferences will be developed in order to improve the applicability of RLHF and to extend its range of applications.
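For illustration only, and not necessarily the exact model adopted in this project: preference feedback in RLHF is commonly formalized with a Bradley-Terry-type model, in which the probability that a human annotator prefers one trajectory (or response) $\tau_1$ over another $\tau_2$ depends on a learned utility or reward function $R$,

\[
\Pr(\tau_1 \succ \tau_2) \;=\; \frac{\exp\bigl(R(\tau_1)\bigr)}{\exp\bigl(R(\tau_1)\bigr) + \exp\bigl(R(\tau_2)\bigr)} \;=\; \sigma\bigl(R(\tau_1) - R(\tau_2)\bigr),
\]

where $\sigma$ denotes the logistic function. In such a setting, uncertainty quantification concerns, for instance, the epistemic uncertainty about $R$ that remains after observing only finitely many, possibly noisy, pairwise comparisons.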
DFG Programme
Research Grants
