Project Details
TDC 2 - Creating Tractable Data Curation Workflows
Applicant
Professor Dr. Ziawasch Abedjan
Subject Area
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term
since 2017
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 387872445
Data cleansing is a crucial step in data integration that makes datasets consumable for an application or a particular analytics task. There are several types of data cleaning algorithms, and each usually covers only a subset (or a category) of all data quality problems in a dataset. Thus, multiple iterations of various cleaning algorithms have to be applied to a dataset until it is sufficiently clean. Choosing which algorithms to apply, and in which order, on a new dataset is a challenging task that usually requires exhaustive data profiling. In the initial phase of this project we tackled the problem of suggesting data cleaning workflows for the cleaning task at hand by comparing the cleaning requirements of a new dataset with those of previously cleaned datasets. In doing so, we identified a learning model that describes the prevalent types of errors in a dataset and dataset profiles that carry information about its dirtiness. Ultimately, we built the holistic error detection and correction systems Raha and Baran, which combine multiple base detectors/correctors and benefit from prior cleaning efforts. We would like to continue this research and explore how this approach can be leveraged to improve data cleaning beyond the one-dataset-at-a-time scenario. In particular, we want to expand our approach to clean entire databases or to prepare the required cleaning routines in advance, so as to reduce the necessary cleaning effort at query time.
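To illustrate the ensemble idea behind holistic error detection, the following minimal Python sketch runs several simple base detectors over a toy relation and flags a cell when at least one detector votes for it. This is an illustrative assumption, not the published Raha/Baran method: the detector names, the per-column rules, and the plain voting rule are hypothetical, whereas the actual systems learn over richer detector-output features and prior cleaning signals.

```python
import re

# Toy relation: each row is a dict of attribute -> string value.
rows = [
    {"id": "1", "city": "Berlin",  "zip": "10115"},
    {"id": "2", "city": "berlin",  "zip": "1O115"},   # casing issue, letter O in zip
    {"id": "3", "city": "Hanover", "zip": ""},        # missing value
]

def missing_value_detector(value, column):
    """Flags empty or placeholder values."""
    return value.strip() in {"", "?", "N/A", "null"}

def pattern_detector(value, column):
    """Flags values violating a simple per-column format rule (assumed rules)."""
    patterns = {"zip": r"\d{5}", "id": r"\d+"}
    pattern = patterns.get(column)
    return pattern is not None and re.fullmatch(pattern, value) is None

def dictionary_detector(value, column):
    """Flags values outside a small reference dictionary (assumed domains)."""
    dictionaries = {"city": {"Berlin", "Hanover", "Munich"}}
    domain = dictionaries.get(column)
    return domain is not None and value not in domain

BASE_DETECTORS = [missing_value_detector, pattern_detector, dictionary_detector]

def detect_errors(rows, min_votes=1):
    """Return (row_index, column, votes) for cells flagged by >= min_votes detectors."""
    flagged = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            votes = sum(detector(value, column) for detector in BASE_DETECTORS)
            if votes >= min_votes:
                flagged.append((i, column, votes))
    return flagged

if __name__ == "__main__":
    for row_index, column, votes in detect_errors(rows):
        print(f"row {row_index}, column '{column}' flagged by {votes} detector(s)")
```

In this sketch, raising `min_votes` trades recall for precision; the project's systems instead learn how to weigh and combine detector signals, using feedback and previously cleaned datasets rather than a fixed threshold.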
DFG Programme
Research Grants