Enhancing the development of a cloud-based diagnostic platform through the use of GAN-generated synthetic health data
Dr Marinel Cavelaars, The HYVE, Netherlands
Jeremy obtained a bachelor’s degree in bioinformatics from Université Laval. During that degree he went on exchange to Université de Strasbourg, in the Master’s of Integrative structural biology and bioinformatics, and completed an internship at The Hyve developing a user interface prototype for TranSMART. He then completed a Master’s in computer science at McGill University, where he developed a machine learning algorithm based on phylogenetic networks that has significant advantages over commonly used genetic assignment methods. Jeremy joined The Hyve in 2018 to pursue a PhD within the AiPBAND project and, more generally, to follow his passion for multidisciplinary science.
The aim of this ESR project is to develop a big-data powered diagnostic infrastructure for brain cancer prediction. The platform will be deployed and run on cloud services. It will include a data warehouse for integrating clinical and omics data, including data generated by the innovative biosensors developed by WP2. The components of the platform will be designed as flexible microservices, and this separation of concerns will facilitate high performance, security and scalability. One of these services will host the prediction algorithms, developed in close collaboration with ESR-11 at IC. Patients, clinicians and potential industrial clients will be able to access their results from any device through an intuitive user interface.
To achieve these goals, large quantities of health data are required at every step. Whether for simple software testing or for improving machine learning accuracy, quality data remains hard to obtain. The scientific opportunities locked in the massive amounts of health data already collected are inestimable, and growing concerns about data privacy will only complicate the issue.
Anonymization is generally employed to prevent misuse of private data. It trades data utility for privacy, but does not fully prevent reidentification. As a result, access to data still requires eminent academic credentials and resources. When real data is lacking during development, the resulting delays and errors can have consequences for critical health informatics. The alternative is to produce synthetic data, thereby avoiding the privacy issues. Rigid and error-prone statistical models have long been the only generative algorithms available for this purpose.
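The trade-off between utility and privacy can be illustrated with a minimal sketch, assuming k-anonymity by generalization as the anonymization technique (the text does not name a specific method). The toy records, the `generalize` rule and the `k_anonymity` helper are all hypothetical and for illustration only.

```python
from collections import Counter

# Hypothetical toy records: (age, postal_code, diagnosis). Not real data.
records = [
    (34, "1012AB", "glioma"),
    (36, "1012CD", "meningioma"),
    (38, "1013XY", "glioma"),
    (52, "3511EF", "glioma"),
    (57, "3512GH", "healthy"),
    (59, "3512JK", "meningioma"),
]


def generalize(record):
    """Coarsen the quasi-identifiers: age to a decade band, postal code to 2 characters."""
    age, postal_code, diagnosis = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", postal_code[:2], diagnosis)


def k_anonymity(rows):
    """Return the size of the smallest group of rows sharing the same quasi-identifiers."""
    groups = Counter((age, postal) for age, postal, _ in rows)
    return min(groups.values())


# Every raw record is unique on (age, postal_code), so it can be singled out.
k_raw = k_anonymity(records)

# After generalization each record hides among others, at the cost of precision.
k_generalized = k_anonymity([generalize(r) for r in records])
```

In this toy example the raw table is only 1-anonymous, while the generalized table is 3-anonymous. The price is lost detail: exact ages and full postal codes are no longer available for analysis. Even then, k-anonymity alone does not rule out inference, for instance when all records in a group share the same diagnosis.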
Recently, generative adversarial networks (GANs) have demonstrated their ability to produce synthetic data that is difficult to distinguish from real data. Synthetic data has been gaining traction rapidly in medical imaging applications. However, due to certain complexities of observational health data, the idea is only starting to gain interest in that domain. This project will explore GAN methods that address these complexities to produce synthetic health data, and will evaluate the gains they provide to the development of the diagnostic platform.
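For reference, the adversarial training that gives GANs their name is usually written as the following minimax objective. This is the standard formulation from the general GAN literature, not necessarily the variant this project will adopt. Here $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the distribution of real records and $p_z$ a noise prior.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator is trained to tell real samples from generated ones, while the generator is trained to produce samples that the discriminator cannot tell apart from real data.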