Project Details

Utility- and privacy-preserving synthetic health data generation through ensemble models.

Applicant Dr. Lisa Pilgram
Subject Area Medical Informatics and Medical Bioinformatics
Term since 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 530282197
 
Access to and sharing of health data offers great potential for translational medicine in terms of secondary-purpose research, innovation, reproducibility and transparency. Privacy considerations and privacy-enhancing technologies (PETs) are becoming more important given the ongoing growth of data availability and linkage, as well as multiple examples of disclosure attacks in the real world. Modern PETs include synthetic data generation (SDG) through artificial intelligence, in which data are generated from real data so that the original statistical properties are maintained but no actual patient’s information is contained. Ongoing research indicates that synthetic data can serve as a proxy for real data, but performance in terms of privacy and utility depends largely on the dataset and the chosen model, even within one class of SDG techniques. This supports the idea that there is no single superior model but rather application-specific superiority. Ensembles could be a solution to this dilemma. By combining the advantages of each model, this approach potentially overcomes the privacy-utility trade-off of individual models, in particular when working with complex data from electronic health records or from different sources. Based on these considerations, I am planning to work with the Electronic Health Information Laboratory directed by Dr. K. El Emam (Ottawa, Canada) to demonstrate the benefits of SDG through ensembles in terms of utility and privacy. We will fully synthesize datasets using ensembles of Generative Adversarial Networks, sequential synthesis and Bayesian neural networks. To capture the various dimensions of utility, broad metrics (Hellinger distance) and narrow metrics (reproducibility of results) will be combined. Privacy in the overall context of SDG will also be evaluated in a broad framework, using privacy metrics (disclosure risks) developed in the host lab.
We will also include the perspective of ethics review boards and their bioethical understanding of synthetic data, as developed during the project. This will be achieved by conducting a mixed-methods study among members of the Canadian Association of Research Ethics Boards at their annual meeting. Through a questionnaire and a series of focus groups, a precise bioethical understanding will be developed. Our holistic approach will ensure that technical optimization through ensembles is in line with actual scientific utility and the bioethical understanding of SDG. More precisely, this project will lead to a more reliable technique that delivers consistent results independent of the dataset and the model applied. This, in turn, enables non-experts to include SDG as a PET in standard data management procedures and is essential to paving the way for its wide use as an efficient and safe PET.
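As an illustration of the broad utility metric named in the abstract, the following is a minimal sketch of the Hellinger distance between the empirical distributions of one categorical variable in a real and a synthetic dataset. It is not the project's or the host lab's implementation; the function name, the toy blood-group values and the choice to compare a single variable at a time are assumptions made for illustration only.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between two empirical categorical distributions.

    Returns a value in [0, 1]: 0 means identical distributions,
    1 means the two samples share no category at all.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must contain at least one value")

    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    # Compare relative frequencies over every category seen in either sample;
    # Counter returns 0 for categories missing from one of them.
    categories = set(real_counts) | set(synth_counts)
    squared_diff = sum(
        (math.sqrt(real_counts[c] / n_real) - math.sqrt(synth_counts[c] / n_synth)) ** 2
        for c in categories
    )
    return math.sqrt(squared_diff / 2)


# Hypothetical toy data: one categorical variable from each dataset.
real_blood_group = ["A", "A", "B", "O", "O", "O"]
synthetic_blood_group = ["A", "B", "B", "O", "O", "AB"]
print(hellinger_distance(real_blood_group, synthetic_blood_group))
```

Values close to 0 indicate that the synthetic variable reproduces the real distribution well. How per-variable distances would be aggregated across a whole dataset is not specified in the abstract and is left open here.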
DFG Programme WBP Fellowship
International Connection Canada
 
 

