Like our fingerprint, our DNA sequence is unique. Only 0.1% of genetic information differs from one person to another, yet this small fraction makes us who we are. Our DNA carries information about disease likelihood, sex, ethnicity, and other sensitive traits. Hence, special care must be taken when handling human omics data. Data collected for research purposes are typically pseudonymised; however, individual gene sequences could potentially still lead to re-identification.
GHGA takes a multilayered approach to data security. We build advanced infrastructure to allow the data to be archived and shared safely. In addition, we develop a framework for GDPR-compliant data processing and help data producers to inform patients and navigate consent. Enabling controlled, yet FAIR, data access is the last layer, ensuring that data is protected while still fulfilling its potential to advance research.
The prevention of data misuse is a primary objective in GHGA's mission of building a national omics data infrastructure.
The castle and the moat
In explaining GHGA's approach to data security, it is helpful to first talk about the traditional security model, its assumptions, and its implications.
We predominantly see a security model of four walls around a data centre, where all traffic passes through the front door and firewalls and application filters safeguard the network. Everything outside this castle is potentially malicious and untrusted; everything inside is good and receives high trust. Unfortunately, it has taken many data breaches for the industry to recognise that this model is fundamentally flawed and that there are good reasons to move away from it.
First of all, the castle-and-moat model assumes an unrealistic degree of perfection: it assumes that there will never be a mistake, such as a misconfigured firewall or application filter or an unpatched vulnerability. Secondly, the traditional model does not consider insider threats. It does not matter how tall your walls are if your adversary has already sneaked in behind them.
Zero-trust networking
Most data breaches result from stolen credentials, application vulnerabilities, malware, social engineering, insider threats, physical theft, or human error. Of course, reducing the probability of any of these attacks remains highly important. However, we cannot assume that the perimeter-based security model described above is all-encompassing. Our perimeter might cover 95%, maybe 99.9%, but we can never guarantee 100%. We must therefore assume that our adversary is already inside the network. By default, we can trust no one, whether internal or external. This is what leads us to zero-trust networking.
At GHGA, we build on a private cloud infrastructure. All our applications, data processing steps, databases and file storage systems are physically located, controlled, and operated at our data hub locations. But the fact that these environments are private does not automatically make them highly trusted. We design the GHGA applications, services, data storage and network components according to the zero-trust networking model. No one is trusted by default. Both humans and machines require strict verification of their identity before access is granted.
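The principle described above can be illustrated with a minimal sketch. It is not GHGA's implementation; the names `SECRET_KEY`, `GRANTS`, `Request`, `sign` and `is_allowed` are hypothetical, and a keyed hash stands in for whatever credential a real system would use. The point it shows is that a request is denied unless the caller's identity is verified and an explicit grant exists, and that being inside the private network plays no role in the decision.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical secret, used only for this illustration.
SECRET_KEY = b"demo-key-not-for-production"

# Hypothetical explicit grants: (identity, resource) pairs that are allowed.
GRANTS = {("alice", "dataset-1"), ("pipeline-7", "dataset-1")}


@dataclass(frozen=True)
class Request:
    identity: str                # human user or machine/service name
    resource: str                # what the caller wants to access
    signature: str               # proof of identity sent with the request
    from_internal_network: bool  # deliberately ignored by the check below


def sign(identity: str) -> str:
    """Issue a proof of identity (a stand-in for a real credential)."""
    return hmac.new(SECRET_KEY, identity.encode(), hashlib.sha256).hexdigest()


def is_allowed(request: Request) -> bool:
    """Deny by default; allow only verified identities with an explicit grant."""
    expected = sign(request.identity)
    if not hmac.compare_digest(expected, request.signature):
        return False  # identity could not be verified
    # Network location is never consulted: internal callers get no bonus trust.
    return (request.identity, request.resource) in GRANTS
```

In this sketch a machine identity such as `pipeline-7` is treated exactly like a human one: both must present a valid proof of identity on every request.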
Because the German Human Genome-Phenome Archive is a joint effort between multiple omics data processors, the system design poses several challenges. As a nationwide initiative, we aim to build an open system that enables universities, research institutions and other interested parties to join and submit data or to become a subprocessor node, which we call a data hub. This demand for adaptability encourages us to build on open standards for data encryption, authentication and authorisation and, last but not least, on data privacy regulations. By implementing open standards and transparent systems, we also move away from obsolete paradigms such as "security through obscurity". Good system security should not depend on the secrecy of the implementation or its components.
Managing Information Security
The secure storage of sensitive omics data is more than a matter of technical measures. It also includes organisational, personnel and physical controls. Therefore, GHGA implements an Information Security Management System (ISMS) oriented on industry standards such as ISO/IEC 27001. This ISMS supports GHGA in meeting high security standards for the whole organisation and will be the basis for regular security audits. Moreover, the ISMS sets the guidelines and controls under which the federated data hubs operate.
GDPR-compliant data sharing
Sensitive personal data, such as human omics data, is protected under the European General Data Protection Regulation (GDPR). The interpretation of the GDPR depends on the country in which the data is handled. As a German initiative, GHGA addresses the legal basis for data processing and consent in the national context.
Going one step further, sharing data across national borders empowers international collaboration, which makes German research more visible while also improving the quality of science and ensuring a better return for the people it serves. GHGA therefore works on legal interoperability for data processing within the EU and in international data spaces. GHGA is part of EU initiatives like FAIR DataSpaces, GDI and the global initiative GA4GH.
Navigate consent - a guide for researchers and clinicians
Typically, informed consent from patients and research participants is required in order to share omics and related health data for research. To provide guidance to clinicians, researchers and institutions wanting to share data via GHGA, we have developed modules that can be integrated into consent forms. These modules inform patients and research participants about the possibility of sharing their omics data with genome archives such as GHGA.
To assist researchers and clinicians with the evaluation of so-called legacy consent (consent obtained in the past), GHGA has developed an app that helps assess whether sharing data is legally valid under a given past consent form. Additionally, GHGA experts work on risk assessments, de-identification and anonymisation methods, and a possible code of conduct for data sharers.
Omics data is sensitive, and patients know that. Yet patients are willing to donate their data to science - hoping to help future patients with new developments and treatment options. A study involving cancer patients found that 97 percent are generally willing to make clinical data available for biomedical research purposes. The major condition for their consent? High data security. A goal GHGA strives towards.
The patient view on data sharing
In addition to complying with data protection laws, GHGA considers the ethical and social implications of human omics data sharing. Patients and other data donors are at the heart of these efforts. GHGA is actively exploring strategies to involve patients in the conception and governance of GHGA to achieve broad and sustained societal support for the project.
This open dialogue and collaboration was initiated via deliberative forums, in which participants agreed to establish a patient advisory board and to provide close input on outreach measures.
Not everyone can access data stored at GHGA. Only non-personal metadata is publicly available within the GHGA data portal. Researchers intending to use any of the archived data or view personal metadata must apply for access to the responsible legal person - typically the person or institution that submitted the data. Often a data access committee (DAC) reviews the legitimacy of the request before access is granted and ensures that it is aligned with the consent originally given by the patient from whom the data was generated. This step ensures that only researchers with a valid research purpose gain access to sensitive data - adding another layer of protection.
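The access rules in this paragraph can be sketched as a small decision model. This is an illustrative simplification, not the GHGA data portal's actual logic: in practice the DAC review is a decision made by people, and all names here (`Dataset`, `MetadataKind`, `dac_review`, `can_view_metadata`, `can_access_data`) are hypothetical. The sketch assumes that consent can be represented as a set of permitted research purposes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class MetadataKind(Enum):
    NON_PERSONAL = "non_personal"
    PERSONAL = "personal"


@dataclass
class Dataset:
    dataset_id: str
    # Assumption: the original consent is modelled as a set of purposes.
    consented_purposes: set[str]
    # Researchers whose requests have been approved by the DAC.
    approved_researchers: set[str] = field(default_factory=set)


def dac_review(dataset: Dataset, researcher: str, purpose: str) -> bool:
    """Approve a request only if its purpose is covered by the consent."""
    if purpose in dataset.consented_purposes:
        dataset.approved_researchers.add(researcher)
        return True
    return False


def can_view_metadata(
    dataset: Dataset, kind: MetadataKind, researcher: str | None = None
) -> bool:
    """Non-personal metadata is public; personal metadata needs approval."""
    if kind is MetadataKind.NON_PERSONAL:
        return True
    return researcher in dataset.approved_researchers


def can_access_data(dataset: Dataset, researcher: str) -> bool:
    """Archived data is available only to approved researchers."""
    return researcher in dataset.approved_researchers
```

Under these assumptions, an anonymous visitor can browse non-personal metadata, while a researcher gains access to personal metadata and archived data only after `dac_review` has approved a request whose purpose matches the consent.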
Researchers or institutions submitting data to GHGA remain the controllers of the data, and it is their decision who is granted access to the data. GHGA serves as a mediator in this process. Dedicated data stewards at the GHGA data hubs, trained in technical and ethico-legal aspects of managing omics data, will assist users in submitting data, offer guidance on how to manage data access requests, and enable secure access via encrypted downloads (see Cybersecurity above).