National Health Data Science Sandbox

To support the widespread implementation of precision medicine and other data-driven advancements in the healthcare sector, we need to expand training and research in health data science. However, access to sensitive data such as patient health records or human genomic data is understandably restricted. This ethical and regulatory barrier complicates early-stage practice, experimentation and tool prototyping in fields ranging from genome sequence analysis to predictive modeling of health outcomes.

To strengthen data science training for students and analytics innovation by researchers, we can provide computing environments and resources that resemble the real-world, secure platforms used in health data projects, without the additional complications of finding, accessing, cleaning and protecting person-sensitive data in accordance with ethical guidelines and the GDPR. This ‘sandbox’ environment for exploring health data science techniques will allow low-stakes, guided learning and development, followed by a smooth transition to a secure environment where users’ knowledge and tools can be applied to sensitive data.

With support from the Novo Nordisk Foundation’s Data Science Research Infrastructure initiative, we are building a data science sandbox for students and researchers at Danish universities. The sandbox will contain non-sensitive datasets used previously in published research that have been: 1) published in anonymized form in alignment with relevant GDPR/HIPAA/ethics guidelines, 2) published and made publicly available with patient consent, or 3) simulated/synthesized via privacy-preserving techniques that leverage descriptive statistics and correlations mapped in sensitive datasets, such that no real patient data is included in the synthetic dataset.
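As a rough illustration of the third category, the sketch below draws synthetic records using only summary statistics (means, standard deviations and one correlation), so no individual-level record is ever read. It is a minimal example under stated assumptions, not the method used in the sandbox: the function names, the two-variable setup and all numeric values are hypothetical, a bivariate normal distribution is assumed for simplicity, and sampling from summary statistics alone does not by itself provide a formal privacy guarantee.

```python
import math
import random


def synthesize_pairs(mean_x, sd_x, mean_y, sd_y, corr, n, seed=0):
    """Draw n synthetic (x, y) records from a bivariate normal distribution
    defined only by summary statistics (hypothetical helper)."""
    if not -1.0 <= corr <= 1.0:
        raise ValueError("corr must lie in [-1, 1]")
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Two independent standard normal draws.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mix the draws so that x and y have the requested correlation.
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2)
        records.append((x, y))
    return records


def pearson(records):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(records)
    mx = sum(x for x, _ in records) / n
    my = sum(y for _, y in records) / n
    cov = sum((x - mx) * (y - my) for x, y in records)
    vx = sum((x - mx) ** 2 for x, _ in records)
    vy = sum((y - my) ** 2 for _, y in records)
    return cov / math.sqrt(vx * vy)


# Illustrative summary statistics only; not taken from any real dataset.
synthetic = synthesize_pairs(
    mean_x=55.0, sd_x=12.0, mean_y=130.0, sd_y=15.0, corr=0.4, n=5000, seed=42
)
```

Comparing `pearson(synthetic)` with the requested correlation gives a quick check that the synthetic records reproduce the statistical structure they were built from. Real health data are rarely this simple, so practical synthesis methods would also need to handle categorical variables, non-normal distributions and explicit disclosure-risk assessment.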

These curated datasets will span key health data domains – electronic health records, omics data such as genomics and transcriptomics, images, and wearable device data. Each dataset will be paired with recommended analysis tools, pipelines, and learning materials/tutorials in a portable, containerized format. We are partnering with Elixir-DK given our shared goals in resource curation and computational life sciences training.

This is a national project coordinated by the Center for Health Data Science at the University of Copenhagen, with stakeholders, advisors and project scientists located at five Danish universities. Our health data science sandbox is under active development and hosted at Computerome (the Danish National Life Science Supercomputing Center). Our initial aim is to support university courses and programs in health data science and personalized medicine, with broader access for researchers and university students planned for the future.

If you are a professor/lecturer with an idea for a sandbox-hosted course or a researcher with a pilot project concept, please contact us at