Project Details
Application of Large Language Models for Archiving File Systems
Applicants
Professor Dr. Sven Groppe; Dr. Andreas Marquet
Subject Area
Methods in Artificial Intelligence and Machine Learning
Term
since 2026
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 570892866
Despite numerous specialist applications and document management systems, file systems remain a widespread technical environment worth archiving. From an archival perspective, the challenge lies in the largely unconstrained way file systems are used, which often results in weak structure, inconsistent file naming, redundancies, etc. The document format of the transmission (file-process-document) is not guaranteed, which often leads to a loss of the context of origin. In contrast to files, file systems place no fundamental restriction on the quality of the documents to be stored that would correspond to the characteristic of file relevance. Because the amount of data is hardly limited and the media are fundamentally different, the archival methods of appraisal and archival description can, in their current form, be used only to a limited extent or require methodological and technical extension. The proposed project will develop AI-supported methods that address this problem and can be used universally in all archive sectors.
Thanks to their impressive ability to generate human-like, relevant answers while taking context into account, for tasks as varied as translation, summarization, question answering, and writing poetry or code, large language models are now used in many areas of daily life. They can perform a wide range of tasks when given zero or only a few examples. This means, among other things, that they can be used flexibly when archiving heterogeneous documents, and archivists can, where necessary, intervene in the process with minimal effort to improve the results and reduce subsequent work. In particular, the proposed project will test and evaluate the use of large language models for classifying the file systems to be archived, masking sensitive data, extracting data and analyzing archived file systems.
To meet the special requirements of archiving with regard to result quality, traceability and control, new methods will be developed that use multi-agent frameworks consisting of large language model (LLM) agents, human feedback options and code. These components create automatic feedback loops, such as iterative self-refinement, reinforcement via self-reflection and correction through criticism from external tools, in order to improve the results.
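As a purely illustrative sketch, and not the project's actual method, a feedback loop of the kind described above could look as follows for the masking of sensitive data. All names here (`critic`, `stub_agent`, `refine`) are hypothetical. The LLM agent is replaced by a deliberately imperfect stub so that the example runs without a model, and the "external tool" is assumed to be a simple regular-expression check for e-mail addresses.

```python
import re
from typing import Callable, List, Tuple

# Assumption: the external tool is a plain regex check for e-mail addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def critic(text: str) -> List[str]:
    """External-tool criticism: list every e-mail address still visible."""
    return EMAIL.findall(text)


def stub_agent(text: str, feedback: List[str]) -> str:
    """Hypothetical stand-in for an LLM agent.

    It masks only the first reported item per round, imitating a model
    that needs several refinement rounds to get everything right.
    """
    return text.replace(feedback[0], "[MASKED]")


def refine(
    text: str,
    agent: Callable[[str, List[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, List[str]]], bool]:
    """Iteratively refine `text` until the critic has no complaints.

    Returns the final text, a log of (text, feedback) pairs for
    traceability, and a flag telling whether a human should review it.
    """
    log: List[Tuple[str, List[str]]] = []
    for _ in range(max_rounds):
        feedback = check(text)
        log.append((text, feedback))
        if not feedback:
            break
        text = agent(text, feedback)
    needs_review = bool(check(text))  # hand over to an archivist if still unclean
    return text, log, needs_review


masked, log, needs_review = refine(
    "Contact anna@example.org or bob@example.com.", stub_agent, critic
)
print(masked)  # Contact [MASKED] or [MASKED].
```

The log keeps every intermediate state together with the criticism that triggered the next round, which is one simple way to support traceability. The `needs_review` flag marks the point at which human feedback would enter the loop.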
DFG Programme
Research data and software (Scientific Library Services and Information Systems)
