Genotype Extraction and False Relative Attacks: Security Risks to Third-Party Genetic Genealogy Services Beyond Identity Inference

Paul G. Allen School of Computer Science & Engineering, University of Washington

Peter Ney, Luis Ceze, Tadayoshi Kohno


Frequently Asked Questions

This FAQ references an academic paper written by Peter Ney, Luis Ceze, and Tadayoshi Kohno available here.

Who are you and what is this research about?

We are a group of researchers in the Paul G. Allen School of Computer Science & Engineering at the University of Washington who specialize in computer security and molecular information processing.

Direct-to-consumer genetic testing with companies like 23andMe and AncestryDNA has become common in the last few years. Users often take the genetic data generated from these genetic tests and send it to third-party companies (like GEDmatch) to learn more about their ancestry and health. In this study, we focus on a particular type of genetic genealogy analysis done by third-party companies that is used to identify close genetic relatives. This is called relative matching. These algorithms are sensitive enough to reliably find 3rd or more distant relatives in large genetic databases.

Given the prominence of genetic genealogy analysis by third-party companies and the sensitivity of the data that is analyzed, including sensitive genetic markers, we believe that it is essential that these services remain secure and maintain the privacy of users. The goal of our research was to study potential security and privacy risks associated with third-party genetic genealogy companies. This type of security research in the scientific and academic computer security field is meant to bring awareness of potential security issues and spur the development of more secure systems. In this work, we focused our efforts on the largest, pure third-party genetic genealogy website, called GEDmatch, that also plays a significant role in criminal investigations.

What's already known about genetic genealogy and computer security?

In April 2018, it was revealed that relative matching on the GEDmatch genetic genealogy website played a crucial role in identifying an unknown DNA sample that was used to solve the Golden State Killer case. Since then, DNA samples in dozens of cold cases have been identified via relative matching.

In these cases, an unknown individual’s DNA sample was identified by finding relatives of that individual via relative matching on GEDmatch. Then, working backwards through genealogy records, investigators narrowed down the list of possible identities to a small number that could be tested using traditional DNA forensic methods.

This technique is not limited to just law enforcement; researchers have already demonstrated that participants in public, anonymous research genetic datasets can be identified using these same approaches. This result raises significant privacy concerns because malicious actors could use these same identification techniques to violate the privacy of anonymous genetic data. More generally, these findings raise the question of whether genetic data itself is inherently identifiable. The effectiveness of genetic genealogy identification depends on a number of factors like the family tree of the individual and the number of relatives in the genetic database. However, as third-party genetic genealogy databases are expected to grow, we expect that this type of identification will only become easier.

Another recent paper considered whether genetic privacy could be breached using the matching segments that are returned by relative matching algorithms. For more details on this work, see this question below.

Can you summarize your findings? What is new about your results?

We suspected that genetic genealogy services like GEDmatch might be vulnerable to security and privacy threats. We focused in particular on two issues: genetic marker theft and falsified genetic data.

Genetic Marker Extraction

Third-party genetic genealogy services maintain databases that hold customers' genetic data. This data is used by relative matching algorithms to find genetic relatives that are in the genetic database. In services like GEDmatch, the identity of a match (a genetic relative in the database) is revealed to a user along with some details about where the two individuals share DNA; however, the raw genetic information of other users is considered private and not revealed. We hypothesized that relative matching queries, which work by comparing raw genetic data between two users, could be leveraged by an adversary to steal the private genetic markers of other users. Through an experiment using the GEDmatch service, we demonstrated that an adversary could upload artificial genetic data to extract the large majority of private genetic markers of other users on the service with standard DNA comparisons used for basic relative matching. The visualizations and other results leak enough information for an adversary to infer over 90% of the underlying genetic markers, known as single-nucleotide polymorphisms (SNPs), of any user whose genetic data the adversary can compare against the artificial genetic data. Therefore, this information leakage potentially poses a significant privacy violation for the affected individuals.

Falsified Genetic Relations

Now that GEDmatch and other genetic genealogy services provide a powerful method to identify genetic data, we explored what risks might exist if artificial genetic data was used to falsify relationships. We developed methods to construct falsified relatives for any individual when the genotype of the individual is known. We evaluated and demonstrated this technique on GEDmatch. By combining this approach with the genetic marker theft, described above, an adversary can create false relations between arbitrary users on GEDmatch. False genetic relatives pose a number of risks and could be used by an adversary to defraud victims or possibly make a person's anonymous genetic data more difficult to identify.

How were the experiments designed? What was done to ensure that this research was done ethically?

We take ethics very seriously and took care to ensure that our experiments respected the privacy of GEDmatch users, minimized the impact to GEDmatch services, and respected the user terms-of-service. All experiments used artificial genetic data derived from publicly available datasets, and we only extracted the genotype of files that we uploaded to accounts we controlled (not other GEDmatch users). All uploaded kits were set to the 'Research' privacy setting so they would not affect public matching results. Furthermore, we did not view any public matches or attempt to identify the source of any public genetic data we uploaded. The research was disclosed ethically and responsibly, as described below.

Is there reason for immediate concern? Is there anything the users of GEDmatch should do?

By the time this work is public, the vulnerabilities we uncovered during our carefully controlled experiments will have been disclosed to GEDmatch. It is common for the academic research community to publicly disseminate security analyses to facilitate an open discussion about privacy and security issues affecting a technology. We believe that a public dissemination of this work will have this same effect and make the broader genetic genealogy community more aware of potential security concerns in these services, with the eventual goal of leading to the design and implementation of secure genetic genealogy services. Our research group has found this approach to be effective with other technologies we have previously studied, including automobiles and implanted medical devices.

Security is a difficult problem for Internet companies in every industry, and genetic genealogy is no different. The choice to share data is a personal decision, and anytime users share data there is always a potential risk of data security issues. We have had extensive communication with GEDmatch for months to help them resolve these issues, and they have been aware that these results would be publicly released. However, since we are not affiliated with GEDmatch and do not have access to their systems, we cannot comment on whether sufficient security mitigations have been put into place to resolve the security problems we uncovered. If users have concerns about the privacy of their genetic data then they always have the option to delete it from the site. We want to stress that the particular security issues we studied were specific to the implementation of GEDmatch and so our experimental results do not directly apply to other services.

Why are you releasing these results?

This is a common question with computer security research. As a general rule, we believe that the benefits of early research and disclosure of security problems outweigh the risks. Historically, motivated malicious actors have demonstrated significant skill compromising computer systems and services, and there is no guarantee that attackers won’t soon develop this knowledge on their own, if they haven’t already. The security community has found that early analysis and disclosure of security problems leads to more secure systems in the long run. This process gives engineers and system designers a chance to design and implement defenses before security attacks manifest. We have taken care to give GEDmatch advance notice of our findings to allow them to mitigate any potential security problems before the results of this work became public. Releasing this paper now is also especially timely and valuable because a separate research effort, recently made public, has suggested that third-party genetic uploads may have privacy implications for many genetic genealogy services because of the way matching algorithms work. Therefore, it is even more imperative to further inform a public debate. This form of disclosure, first to the relevant stakeholders and then to the public, is common in the computer security research field. These results have been peer-reviewed and will appear in the 2020 Network and Distributed System Security Symposium (NDSS).

In light of your research, how can we strengthen the security and integrity of genetic genealogy services?

We have a number of recommendations for genetic genealogy services. These suggestions are not meant to be comprehensive, necessary, or sufficient for security. Rather, these recommendations provide a starting point for thinking about secure system design. We encourage further research into the design of secure genetic genealogy services.

Support Only Authenticated Genetic Data Files

The threats that we experimentally demonstrated were made possible because there are no technical restrictions on uploading artificial genetic data to GEDmatch. This capability lets an adversary upload highly irregular or falsified data, which was a necessary requirement to extract genetic markers or create false relations. As others have suggested, we encourage the genetic genealogy industry to develop methods to ensure the authenticity of the data that is uploaded to third-party websites. In other words, genetic genealogy websites should confirm that the genetic data used in relative matching originated from reputable direct-to-consumer testing companies and has not been subsequently manipulated. Cryptographic techniques, like digital signatures, can be used to enforce this restriction in addition to providing other traceability benefits, for example, tracing a digital genetic data file to a particular company, genotyping instrument, and date. Authenticity restrictions could be overruled in special circumstances, like authorized law enforcement queries.
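As a minimal sketch of what an authenticity check does, the Python example below tags a genetic data file and rejects any copy whose bytes have changed. It is an illustration under stated assumptions, not a description of how any testing company or GEDmatch works. Python's standard library has no public-key signature primitive, so the sketch substitutes an HMAC-SHA256 tag for a true digital signature; a real deployment would more likely use an asymmetric scheme so that the genealogy service holds only a verification key and cannot create tags itself. The key, the file layout, and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the testing company. With real digital
# signatures this would be a private signing key, and the genealogy
# service would hold only the matching public key.
COMPANY_KEY = b"demo-key-not-for-production"


def tag_file(raw_data: bytes, key: bytes) -> str:
    """Return a hex authentication tag over the exact bytes of the file."""
    return hmac.new(key, raw_data, hashlib.sha256).hexdigest()


def is_authentic(raw_data: bytes, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = tag_file(raw_data, key)
    return hmac.compare_digest(expected, tag)


# Made-up two-marker file in a tab-separated layout (assumed for illustration).
original = b"rs0001\t1\t1000\tAG\nrs0002\t1\t2000\tCC\n"
tag = tag_file(original, COMPANY_KEY)

# Changing even one genotype call invalidates the tag.
tampered = original.replace(b"AG", b"AA")

print(is_authentic(original, tag, COMPANY_KEY))
print(is_authentic(tampered, tag, COMPANY_KEY))
```

The tag covers the whole file, so an upload assembled from scratch or edited after export would fail verification unless the uploader also held the key.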

Ensure Data Integrity

Like any service that accepts data uploads from users, genetic genealogy services should ensure that data is not corrupted and looks “reasonable.” Some of the security risks we demonstrated required an adversary to upload highly anomalous genetic data files. Genetic genealogy services should consider doing some form of anomaly detection on newly uploaded data to prohibit the inclusion of highly irregular data in the genetic database.
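One simple form such a check could take is a summary-statistics filter on each upload. The Python sketch below computes the fraction of heterozygous calls and the fraction of missing or malformed calls, then flags files that fall outside a plausible range. The thresholds, the two-letter genotype encoding, and the function names are assumptions chosen for illustration; they are not taken from the paper, and a production detector would need thresholds calibrated on real data and likely more signals than these two.

```python
VALID_BASES = set("ACGT")


def is_valid_call(genotype: str) -> bool:
    """A call is valid if it is exactly two letters drawn from A, C, G, T."""
    return len(genotype) == 2 and set(genotype) <= VALID_BASES


def genotype_stats(genotypes):
    """Return (heterozygous fraction of valid calls, invalid fraction of all calls)."""
    total = len(genotypes)
    if total == 0:
        return 0.0, 1.0
    valid = [g for g in genotypes if is_valid_call(g)]
    het = sum(1 for g in valid if g[0] != g[1])
    het_rate = het / len(valid) if valid else 0.0
    invalid_rate = (total - len(valid)) / total
    return het_rate, invalid_rate


def looks_anomalous(genotypes, het_range=(0.15, 0.45), max_invalid=0.05):
    """Flag files whose statistics fall outside the (hypothetical) plausible range."""
    het_rate, invalid_rate = genotype_stats(genotypes)
    low, high = het_range
    return invalid_rate > max_invalid or not (low <= het_rate <= high)


# A small made-up file with a mix of calls, and one that is entirely heterozygous.
plausible = ["AA", "AG", "CC", "CT", "GG", "TT", "AA", "CC", "AG", "GG"]
all_het = ["AG"] * 10

print(looks_anomalous(plausible))
print(looks_anomalous(all_het))
```

A filter like this would only catch crude irregularities; data crafted to stay within the expected ranges would pass, which is why it complements rather than replaces authenticity checks.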

Restrict Comparisons

It is risky to let an adversary run queries against arbitrary users in the database. Instead, we suggest that services only allow users to run DNA comparisons and view details for high-quality matches. This will minimize the risk of potential security issues because adversaries can only target users that are closely related matches instead of large portions of the entire genetic database, which can contain the data of hundreds of thousands of users.
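A restriction of this kind could be enforced as a server-side gate in front of the detailed comparison tool. The Python sketch below assumes the service has already computed the total shared DNA (in centimorgans) for each pair of kits and allows a detailed comparison only above a threshold. The threshold value, the data structure, and the names are hypothetical; what counts as a "high-quality match" is a policy choice that the text above does not specify.

```python
# Hypothetical cutoff for a "high-quality" match, in centimorgans.
MIN_SHARED_CM = 50.0


def may_view_details(shared_cm_by_pair, requester, target, min_cm=MIN_SHARED_CM):
    """Allow a detailed comparison only for pairs the service already
    considers strong matches; unknown pairs default to no shared DNA."""
    pair = frozenset((requester, target))
    return shared_cm_by_pair.get(pair, 0.0) >= min_cm


# Made-up precomputed match table keyed by unordered kit pairs.
matches = {
    frozenset(("kitA", "kitB")): 120.0,
    frozenset(("kitA", "kitC")): 8.5,
}

print(may_view_details(matches, "kitA", "kitB"))
print(may_view_details(matches, "kitA", "kitC"))
print(may_view_details(matches, "kitA", "kitD"))
```

Keying the table by an unordered pair means the check gives the same answer regardless of which user initiates the comparison.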

Minimize Information Leakage

Adversaries have historically shown the ability to leverage small leaks of information to steal data or compromise systems. In the case of DNA comparisons, third-party genetic genealogy services should be careful about showing fine-grained results, including visualizations like high-resolution chromosome paintings and precise matching segment coordinates. In our study, we were able to use these fine-grained results to extract individual SNPs from arbitrary users in the database.
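One way to reduce the precision of reported results is to snap matching segment boundaries to coarse bins before showing them to users. The Python sketch below rounds a segment's start down and its end up to a fixed bin size. The one-megabase bin size and the function name are assumptions for illustration only; the text above does not state what granularity would be safe, and coarsening alone may not prevent every inference.

```python
def coarsen_segment(start_bp: int, end_bp: int, bin_size: int = 1_000_000):
    """Widen a matching segment to the enclosing bin boundaries so that
    exact base-pair coordinates are not revealed."""
    coarse_start = (start_bp // bin_size) * bin_size
    coarse_end = -(-end_bp // bin_size) * bin_size  # ceiling division
    return coarse_start, coarse_end


# A made-up segment with precise coordinates.
print(coarsen_segment(12_345_678, 18_900_001))
# prints (12000000, 19000000)
```

Rounding outward keeps the reported segment a superset of the true one, so users still see roughly where they share DNA while the exact endpoints stay hidden.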

Please see the academic paper here for more details on these recommendations.

I heard about another recent paper that looked at the security of genetic genealogy uploads. What was that?

On October 22, 2019, a paper was released by Michael Edge and Graham Coop that considered whether identical-by-state (IBS) segments could be leveraged to extract private SNPs from other files in the genetic database. This work was done independently of ours and was made public after our work was accepted for publication. The authors developed three new attacks (IBS tiling, IBS probing, and IBS baiting) that take advantage of different aspects of matching algorithms to extract raw SNPs from other files. These attacks were not evaluated on live services but with large public datasets. We believe that these results are important and complementary to ours because they give another example of how third-party uploads can pose privacy risks to genetic genealogy services if they are not handled carefully. Like Edge and Coop, we propose security measures such as digital signatures, detection of anomalous genetic data files, and restrictions on arbitrary direct comparisons. For more details on the paper see their FAQ.


Paper

For more details on our findings and recommendations see our peer-reviewed technical paper that will be published at the 2020 Network and Distributed System Security Symposium (NDSS).

This research was supported in part by the University of Washington Tech Policy Lab, which receives support from: the William and Flora Hewlett Foundation, the John D. and Catherine T. MacArthur Foundation, Microsoft, and the Pierre and Pamela Omidyar Fund at the Silicon Valley Community Foundation. It was also supported by a grant from the DARPA Molecular Informatics Program.