It’s the stuff of science fiction:  adversaries extract DNA information from a coffee cup or postage stamp and use it to infer one’s most private traits.  Yet a recently released study entitled “Data Sanitization to Reduce Private Information Leakage from Functional Genomics” discusses how this can be achieved, along with privacy measures that the life sciences and research community can use to help limit the risks to identifiable health information.

DNA information extracted from coffee cups and other environmental samples is “noisy” because of, for example, potential contamination by multiple individuals.  However, in the recently published study, researchers using sophisticated statistical techniques report that they were able to reliably link information about known individuals derived from these “noisy” environmental samples with whole-genome sequencing reads and even partial genomic assays.  According to the study, this linkage allowed for inferences about the individuals’ sensitive phenotypic information, such as information about mental health.

The study proposes techniques that can help anonymize, or otherwise protect the privacy of, genomic information by removing certain observable variants from genomic datasets.  According to the study authors, certain parts of genomic datasets tend to contain large amounts of variant information, and their tool targets those parts to help protect against the risk of re-identification.
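For readers who want a concrete picture, the idea of removing selected variants while keeping them available under tighter controls can be illustrated with a minimal Python sketch.  This is not the study's tool or method: the `Variant` record, the `sanitize` function, and the positions in `sensitive_sites` are all hypothetical and invented for illustration, and a real pipeline would operate on sequencing data rather than a short list.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical, simplified variant record (not a real file format)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def sanitize(variants, sensitive_sites):
    """Split variants into a shareable set and a restricted-access set.

    A variant is withheld when its (chromosome, position) pair appears in
    sensitive_sites, an assumed list of sites chosen for masking.
    """
    shareable, restricted = [], []
    for v in variants:
        if (v.chrom, v.pos) in sensitive_sites:
            restricted.append(v)   # retained, but not released broadly
        else:
            shareable.append(v)    # safe to include in the sanitized dataset
    return shareable, restricted


# Toy data: every value below is made up for the example.
variants = [
    Variant("chr1", 101, "A", "G"),
    Variant("chr2", 202, "C", "T"),
    Variant("chr7", 303, "G", "A"),
]
sensitive_sites = {("chr2", 202)}

shareable, restricted = sanitize(variants, sensitive_sites)
```

In this sketch the withheld variant is kept in a separate list rather than deleted, which mirrors the general notion of placing it behind more limited access controls instead of discarding it.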

What are the key takeaways for the privacy professional?

  • The researchers expressly recognize the scientific and public health value of genomic research, as well as the concern that data anonymization must be balanced against the reduced utility of more limited datasets.  Thus, the study contemplates that not all variant information would be removed from all genomic datasets.  For example, the study authors contemplate that some participants may simply mask variants that leak information about their susceptibility to stigmatizing phenotypes.
  • In addition, the study contemplates that information about the variants would be retained — not deleted altogether — but subject to more limited access controls based on research need.
  • The researchers relied on sophisticated forensic, statistical, and sequencing techniques.  Indeed, the authors were not able to recreate certain results using lower-cost and more portable genotypic methods.  This is relevant because most privacy frameworks consider whether information can reasonably be linked to an identifiable individual.