Project Details
Creating and optimizing data preparation pipelines for cluster analyses based on complex data characteristics
Applicant
Professor Dr.-Ing. Bernhard Mitschang
Subject Area
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term
since 2025
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 556739690
In real-world use cases, data preparation accounts for by far the greatest effort in implementing data analysis processes. In addition, the data available in such use cases often have complex data characteristics, which further complicates data preparation. If application-specific complex data characteristics are not addressed during data preparation, the analysis results are often inaccurate and lead to incorrect conclusions. Many related approaches, e.g., AutoML or meta-learning, focus primarily on model building, i.e., on the selection of suitable analysis algorithms. They therefore provide insufficient support for data preparation and do not solve the fundamental problem that a huge amount of effort is required to ensure that the data are prepared accurately and in line with the application-specific objective of the data analysis. For each use case, the data preparation operations (DVOs) suitable for the respective data must be selected from a large number of possible DVOs, e.g., for sampling or feature engineering. Moreover, these DVOs must be configured precisely and applied in the correct order in a data preparation pipeline (DVP). This project will investigate this problem and possible solution approaches in detail. The initial focus of the project is on data preparation for cluster analyses. As complex data characteristics, we consider in particular complex shapes and distributions of clusters in the data, which make it difficult for clustering algorithms to detect the clusters correctly. For example, clusters that overlap in feature space or an uneven distribution of clusters pose typical problems for cluster analyses. Relevant DVOs that must be applied properly in a DVP to address such complex data characteristics include sampling of data instances, outlier detection, and various feature engineering techniques.
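To make the notion of a DVP concrete, the following minimal sketch models a pipeline as an ordered list of configured DVOs. It is an illustration under stated assumptions, not the project's implementation: the names `DVO`, `drop_outliers`, `min_max_scale`, and `run_dvp` are hypothetical, and the z-score filter and min-max scaling merely stand in for the outlier detection and feature engineering operations mentioned above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Tuple

Point = Tuple[float, ...]
Dataset = List[Point]


@dataclass(frozen=True)
class DVO:
    """One configured data preparation operation (hypothetical structure)."""
    name: str
    apply: Callable[[Dataset], Dataset]


def drop_outliers(max_z: float) -> DVO:
    """Assumed stand-in for outlier detection: a per-feature z-score filter."""
    def run(data: Dataset) -> Dataset:
        cols = list(zip(*data))
        mus = [mean(c) for c in cols]
        # Fall back to 1.0 for constant features to avoid division by zero.
        sigmas = [pstdev(c) or 1.0 for c in cols]
        return [
            p for p in data
            if all(abs(x - m) / s <= max_z for x, m, s in zip(p, mus, sigmas))
        ]
    return DVO(f"drop_outliers(max_z={max_z})", run)


def min_max_scale() -> DVO:
    """Assumed stand-in for feature engineering: rescale each feature to [0, 1]."""
    def run(data: Dataset) -> Dataset:
        cols = list(zip(*data))
        lows = [min(c) for c in cols]
        spans = [(max(c) - min(c)) or 1.0 for c in cols]
        return [
            tuple((x - lo) / span for x, lo, span in zip(p, lows, spans))
            for p in data
        ]
    return DVO("min_max_scale()", run)


def run_dvp(dvp: List[DVO], data: Dataset) -> Dataset:
    """Apply the DVOs of a pipeline in the given order."""
    for op in dvp:
        data = op.apply(data)
    return data
```

Running `run_dvp([drop_outliers(2.0), min_max_scale()], data)` on a data set that contains one extreme point yields a different result than running the same two DVOs in reverse order, because the outlier then distorts the scaling. This is the ordering and configuration problem described above, in miniature.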
The primary goal of this project is to develop and evaluate methods that, in contrast to existing approaches, explicitly support data preparation in addition to model building. These novel methods will recommend DVPs that are tailored to the new data to be analyzed and that address their complex data characteristics. Our methodological approach is to determine and evaluate the effects of certain DVOs and DVPs on specific data characteristics. Based on this, we will develop a case base, i.e., a repository of data sets, DVPs, and descriptive metadata, that compiles the insights and knowledge regarding the effects of DVOs and DVPs. In addition, we will devise a novel approach that uses this structured knowledge from the case base to recommend DVPs that are suitable for new data to be analyzed, i.e., that are able to address their specific data characteristics. This will significantly reduce the effort required to design DVPs and lead to better analysis results.
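One way to picture a case-base lookup of this kind is a nearest-neighbor search over descriptive metadata. The sketch below is only an assumption about how such a recommendation could work, not the approach the project will devise: the `Case` structure, the two meta-features (cluster overlap and cluster size imbalance), the `quality_gain` score, and the function `recommend_dvp` are all hypothetical.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple


@dataclass(frozen=True)
class Case:
    """One entry of a hypothetical case base."""
    meta_features: Tuple[float, ...]  # assumed: (cluster overlap, size imbalance)
    dvp: Tuple[str, ...]              # names of the DVOs, in application order
    quality_gain: float               # assumed: observed gain in clustering quality


def recommend_dvp(
    case_base: List[Case], query: Tuple[float, ...], k: int = 3
) -> Tuple[str, ...]:
    """Return the DVP of the best-performing case among the k most similar ones.

    Similarity is the Euclidean distance between meta-feature vectors,
    which is an illustrative choice only.
    """
    if not case_base:
        raise ValueError("case base is empty")
    nearest = sorted(case_base, key=lambda c: dist(c.meta_features, query))[:k]
    return max(nearest, key=lambda c: c.quality_gain).dvp
```

Given a new data set, its meta-features would first be computed and passed as `query`; the returned tuple names the DVOs to apply in order. How the characteristics are measured and how the effects of DVPs are scored are exactly the open questions the project addresses, so both are left as plain inputs here.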
DFG Programme
Research Grants
