Project Details
Preserving Logical and Functional Dependencies in Synthetic Clinical Datasets
Applicant
Professor Dr. Olaf Wolkenhauer
Subject Area
Medical Informatics and Medical Bioinformatics
Term
since 2026
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 576429337
Synthetic data generation has gained importance across various domains, including medical research. While many advanced generative models can produce high-quality synthetic data, their effectiveness in preserving the dependencies between attributes within the data has received limited attention. Dependencies among attributes are common in tabular data. For example, attributes like gender and pregnancy are logically dependent, as a male cannot be pregnant. State-of-the-art generative models do not adequately maintain these relationships because they assume feature independence. Furthermore, functional dependencies, which are essential for database normalization and overall data quality, are frequently overlooked in synthetic data generation.

In a feasibility study, we introduced a Bayesian logic-based function to extract logical dependencies from a specified set of attributes. We also conducted a comparative analysis of seven generative models using five publicly available datasets. This analysis revealed that while some models can preserve logical dependencies, none successfully maintains the functional dependencies commonly found in real datasets.

To address these shortcomings, our project aims to develop methodologies that effectively preserve logical and functional dependencies in synthetic tabular data. The first goal of our research is to model inter-attribute dependencies. We plan to create an algorithm that identifies the top k logical dependencies in a tabular dataset, structuring it similarly to Bayesian networks. This approach will quantify the relationships between attributes by selecting the most relevant dependencies and maximizing a score based on logical associations. The second goal is to establish a hierarchical feature generation methodology that maintains inter-attribute logical and functional dependencies in synthetic data. This will be a two-stage process: first, independent features are generated; then, dependent features are mapped based on relationships derived from real data.

We will validate the effectiveness of these algorithms using real-world clinical patient data provided by our clinical partners, alongside simulated datasets, to evaluate their robustness across different kinds of data. Our project seeks to create more reliable and usable synthetic data by emphasizing the preservation of logical and functional dependencies. This focus is especially crucial in the clinical domain, where an accurate representation of dependencies can significantly influence the effectiveness of predictive models and the quality of healthcare decisions.
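The two-stage idea described above can be illustrated with a minimal sketch. It is not the project's algorithm: the toy records, the DEPENDENCIES map, and the function names holds_functional_dependency and generate_two_stage are all hypothetical, and the sketch assumes that every dependent attribute depends on exactly one independent attribute. Stage 1 samples each independent attribute from its observed values; stage 2 fills each dependent attribute only with values that co-occurred with the sampled determinant in the real data.

```python
import random
from collections import defaultdict

# Toy records, invented for illustration only (not project data).
REAL_ROWS = [
    {"sex": "female", "pregnant": "yes", "icd_code": "E11", "diagnosis": "type 2 diabetes"},
    {"sex": "female", "pregnant": "no", "icd_code": "I10", "diagnosis": "hypertension"},
    {"sex": "male", "pregnant": "no", "icd_code": "E11", "diagnosis": "type 2 diabetes"},
    {"sex": "male", "pregnant": "no", "icd_code": "I10", "diagnosis": "hypertension"},
]

# Hypothetical dependency map: dependent attribute -> attribute it depends on.
DEPENDENCIES = {"pregnant": "sex", "diagnosis": "icd_code"}


def holds_functional_dependency(rows, determinant, dependent):
    """Return True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        expected = seen.setdefault(row[determinant], row[dependent])
        if expected != row[dependent]:
            return False
    return True


def generate_two_stage(rows, dependencies, n_samples, seed=0):
    """Sample independent attributes first, then map the dependent ones."""
    rng = random.Random(seed)
    independent = [col for col in rows[0] if col not in dependencies]

    # For each dependent attribute, record which values were observed
    # together with each value of its determinant in the real data.
    observed = {dep: defaultdict(list) for dep in dependencies}
    for row in rows:
        for dep, det in dependencies.items():
            observed[dep][row[det]].append(row[dep])

    synthetic = []
    for _ in range(n_samples):
        # Stage 1: draw each independent attribute from its observed values.
        record = {col: rng.choice(rows)[col] for col in independent}
        # Stage 2: draw dependent attributes conditioned on their determinant.
        for dep, det in dependencies.items():
            record[dep] = rng.choice(observed[dep][record[det]])
        synthetic.append(record)
    return synthetic


synthetic_rows = generate_two_stage(REAL_ROWS, DEPENDENCIES, n_samples=50)
```

In this sketch, the functional dependency icd_code -> diagnosis carries over to the synthetic rows because each code was observed with a single diagnosis, and the logical dependency between sex and pregnancy is respected because no male record with a pregnancy was observed. Chained dependencies, where one dependent attribute determines another, would require generating attributes in dependency order, which this sketch does not handle.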
DFG Programme
Research Grants
International Connection
India
Cooperation Partner
Professor Dr.-Ing. Saptarshi Bej
