
MIDAS: Generation of Large and Heterogeneous Test Data for Duplicate Detection and Elimination

Applicant: Dr. Fabian Panse
Subject Area: Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term: since 2022
Project identifier: Deutsche Forschungsgemeinschaft (DFG), project number 495170629
 
Detecting and eliminating duplicate data records are important tasks in data management. As the requirements for data management change with the increasing volume, volatility, and diversity of data, the requirements for duplicate detection and elimination algorithms change accordingly. While research is already intensively addressing the adaptation of such algorithms to these conditions, existing test data generators are still designed for small, mostly relational datasets and therefore no longer meet today's requirements. Because evaluating such algorithms is an important part of both research and practice, new methods for test data generation are indispensable.

In this project, we will develop and implement a new approach to test data generation that allows the creation of large test datasets with complex data schemas using different data models and with realistic error patterns, such as those that result from copying processes and outdated values. Moreover, we will develop and implement a concept for automatic preconfiguration that helps users adjust the parameter settings of the resulting generation system to their particular use case, enabling efficient and effective use even by inexperienced users.

The main research challenges of this project are: (i) the profiling of non-relational and temporal data, (ii) the efficient generation of realistic data histories to simulate copying processes and outdated values, (iii) the automatic and customizable calculation of parameter settings (including a target-driven transformation of data schemas), and (iv) the scalable injection of realistic data errors and error patterns into existing datasets.
DFG Programme: Research Grants
International Connection: Australia
Cooperation Partner: Professor Dr. Peter Christen
 
 

