Computer Security and Privacy in DNA Sequencing

Paul G. Allen School of Computer Science & Engineering, University of Washington

The cost and time needed to sequence and analyze DNA have fallen rapidly. In the past decade, the cost to sequence a human genome has decreased 100,000-fold or more. This improvement was made possible by faster, massively parallel processing. Modern sequencing techniques can sequence hundreds of millions of DNA strands simultaneously, resulting in a proliferation of new applications in domains ranging from personalized medicine and ancestry to the study of the microorganisms that live in your gut.

Computers are needed to process, analyze, and store the billions of DNA bases that can be sequenced from a single DNA sample. Even the sequencing machines themselves run on computers. New and unexpected interactions may be possible at this boundary between electronic and biological systems. As a multi-disciplinary group of researchers who study both computer security and DNA manipulation, we wanted to understand what new computer security risks are possible in the interaction between biomolecular information and the computer systems that analyze it.

We highlight two key examples of our research below: (1) the failure of DNA sequencing programs to follow best practices in computer security and (2) the possibility of encoding malware in DNA sequences. See our paper for more detailed information on our findings. This paper will appear at the peer-reviewed USENIX Security Symposium in August 2017.

Computer Security Analysis of DNA Sequencing Programs

After DNA is sequenced, it is usually processed and analyzed by a number of computer programs through what is called the DNA data processing pipeline. We analyzed the computer security practices of commonly used, open-source programs in this pipeline and found that they did not follow computer security best practices. Many were written in programming languages in which security problems are common, and we found early indicators of security problems and vulnerable code. This basic security analysis suggests that the security of the sequencing data processing pipeline would be insufficient if attackers were to target it.

DNA Encoded Malware

DNA is made up of standard nucleotides, its basic structural units, which are represented as letters such as A, C, G, and T. After sequencing, this DNA data is processed and analyzed using many computer programs. It is well known in computer security that any data used as input into a program may contain code designed to compromise a computer. This led us to question whether it is possible to produce DNA strands containing malicious computer code that, if sequenced and analyzed, could compromise a computer.

To assess whether this is theoretically possible, we introduced a known security vulnerability into a DNA processing program, similar to the vulnerabilities we found in our earlier security analysis. We then designed and created a synthetic DNA strand that contained malicious computer code encoded in the bases of the DNA strand. When this physical strand was sequenced and processed by the vulnerable program, it gave remote control of the computer doing the processing. That is, we were able to remotely exploit and gain full control over a computer using adversarial synthetic DNA.
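
To make the idea of computer code "encoded in the bases of the DNA strand" concrete, here is a minimal Python sketch that maps arbitrary bytes to a string of bases and back. The two-bits-per-base mapping (A=00, C=01, G=10, T=11) and the function names bytes_to_dna and dna_to_bytes are assumptions chosen for illustration, not details stated on this page, and the data being encoded is harmless text.

```python
# Hypothetical 2-bits-per-base mapping, chosen only for illustration.
BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # A base outside A/C/G/T raises KeyError here.
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)
```

With this mapping, bytes_to_dna(b"hi") returns "CGGACGGC", and dna_to_bytes("CGGACGGC") returns b"hi". The sketch ignores the practical constraints of synthesizing and sequencing a physical strand.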

No Reason for Concern

Note that there is no present cause for alarm. We have no evidence to believe that the security of DNA sequencing or DNA data in general is currently under attack. Instead, we view these results as a first step toward thinking about computer security in the DNA sequencing ecosystem. One theme from computer security research is that it is better to consider security threats early in emerging technologies, before the technology matures, since security issues are much easier to fix before real attacks manifest.

We again stress that there is no cause for people to be alarmed today. That said, it is time to improve the state of DNA security, and we encourage the DNA sequencing community to proactively address computer security risks before any adversaries manifest.

We encourage the DNA sequencing community to follow secure software best practices when coding bioinformatics software, especially if it is used for commercial or sensitive purposes. It is also important to consider threats from all sources, including the DNA strands being sequenced, as potential vectors for computer attacks. See our research paper for a more detailed discussion of threats to the DNA sequencing pipeline and potential defenses.

FAQ

Is it possible to exploit a computer program with synthesized DNA?

The results from our study show that it is theoretically possible to produce synthetic DNA that is capable of compromising a computer system. For now, these attacks are difficult in practice because it is challenging to synthesize malicious DNA strands and to find relevant vulnerabilities in DNA processing programs. Thus, while scientifically interesting, we stress that people today should not necessarily be alarmed, as we discuss both above and below.

What are your findings regarding leading open-source computational biology software packages?

We analyzed open-source bioinformatics tools that are commonly used by researchers to analyze DNA data. Many of these are written in languages like C and C++, in which programs are prone to security vulnerabilities unless they are carefully written. In this case, the programs did not follow computer security best practices. For example, most had little input sanitization and used insecure functions. Others had static buffers that could overflow. A lack of input sanitization, the use of insecure functions, and buffers that can overflow can each make a program vulnerable to attackers; modern computer security best practice is to sanitize inputs and to avoid insecure functions and fixed-size buffers, or to use them with caution.
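
As an illustration of what input sanitization can look like for sequence data, here is a minimal Python sketch that checks a read before any later stage processes it. The accepted alphabet (A, C, G, T, plus N for an unknown base), the length limit, and the name validate_read are assumptions made for this example; they are not taken from any of the tools we analyzed.

```python
import re

# Hypothetical limits for this sketch; a real tool would choose its own.
MAX_READ_LENGTH = 1000
VALID_READ = re.compile(r"[ACGTN]+")


def validate_read(read: str, max_length: int = MAX_READ_LENGTH) -> str:
    """Return the read unchanged if it passes all checks, else raise ValueError."""
    if not read:
        raise ValueError("empty read")
    if len(read) > max_length:
        raise ValueError(f"read longer than {max_length} bases")
    if VALID_READ.fullmatch(read) is None:
        raise ValueError("read contains characters other than A, C, G, T, N")
    return read
```

Rejecting malformed input at the boundary means later stages can rely on the length and alphabet of every read they receive, rather than each stage trusting its input implicitly.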

Is there any reason for immediate concern?

No. We have no reason to believe that there have been any attacks against DNA sequencing or analysis programs. A primary goal of this study was to better understand the feasibility of DNA-based code injection attacks. Our DNA-based exploit is hypothetical, compromising a program that we intentionally modified to include a vulnerability. We also know of no efforts by adversaries to compromise computational biology programs.

However, since DNA sequencing technologies are maturing and becoming more ubiquitous, we do believe that these types of issues could pose a growing problem in the future if left unaddressed. We therefore believe that now is the right time to begin hardening the computational biology ecosystem against cyber attacks.

Are there any risks to people with DNA-based exploits? Will this infect my genome?

The answers to both questions are no. Your genome is untouched. Our exploit shows that specifically designed DNA can be used to affect computer programs, not living organisms themselves. Said another way, our exploit is designed to compromise a computer program involved in the DNA sequencing pipeline (and a program intentionally modified to include a vulnerability). The DNA sequence we designed for this paper does not have any biological significance. We further stress that researchers often synthesize DNA with non-biological functions, e.g., when using DNA for digital data storage.

Are you helping the bad guys?

As computer security researchers, we are interested in understanding the security risks of emerging technologies, with the goal of helping improve the security of future versions of those technologies.

The security research community has found that evaluating the security risks of a new technology while it is being developed makes it much easier to confront and address security problems before adversarial pressure manifests. One example has been the modern automobile and another the modern wireless implantable medical device. In both cases, the government and industry responded to security research uncovering potential risks, and as a result both the modern automotive industry and the medical device industry have significantly increased their computer security protections. We encourage the computational biology community to do the same.

What is the DNA data processing pipeline?

DNA sequencing is a complicated process that begins with physical DNA samples that are prepared in a laboratory. These prepared samples are then run through a machine that produces raw DNA sequence output. To make this data useful, it is manipulated and analyzed through a number of different programs that process the data in stages. These programs constitute the DNA data processing pipeline.

Do you have any advice for governments?

The government is currently involved in regulating the production of synthetic DNA products that may be used to generate dangerous compounds (e.g., infectious diseases or toxins), and federal law requires adequate security in connection with some types of health information. At this point, we are not in a position to propose any specific additional regulations. However, we intend to analyze the law and policy ramifications of this work in partnership with the UW Tech Policy Lab and encourage regulators to consider this area moving into the future.

Do you have any advice for biology researchers and the computational biology community?

The DNA sequencing community, and especially the programmers of bioinformatics tools, should consider computer security when developing software. In particular, we encourage the wide adoption of security best practices like the use of memory-safe languages or bounds checking on buffers, input sanitization, and regular security audits.
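
To show what bounds checking on a buffer means in practice, here is a minimal Python sketch of a fixed-capacity buffer that refuses any write that would run past its end. Python already checks bounds on its own containers; the explicit check below spells out the test that a program using a fixed-size buffer in a language like C would need to perform itself. The class name FixedReadBuffer and its methods are hypothetical and exist only for this illustration.

```python
class FixedReadBuffer:
    """Fixed-capacity byte buffer that rejects writes past its end."""

    def __init__(self, capacity: int) -> None:
        self._data = bytearray(capacity)
        self._used = 0

    def append(self, chunk: bytes) -> None:
        """Copy chunk into the buffer, or raise OverflowError if it does not fit."""
        remaining = len(self._data) - self._used
        if len(chunk) > remaining:
            # The bounds check: refuse the write instead of overrunning the buffer.
            raise OverflowError(
                f"chunk of {len(chunk)} bytes exceeds {remaining} bytes remaining"
            )
        self._data[self._used:self._used + len(chunk)] = chunk
        self._used += len(chunk)

    def contents(self) -> bytes:
        """Return the bytes written so far."""
        return bytes(self._data[:self._used])
```

A caller that receives OverflowError can reject the oversized input outright, which is usually safer than truncating it silently.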

Another issue to consider is how best to maintain and patch bioinformatics software. Its development and maintenance are spread across many entities, which makes it difficult to patch and has led to a high prevalence of out-of-date software.

Please see the research paper for a detailed threat analysis and additional security recommendations.

Do you have recommendations for the computer security community?

DNA synthesis and sequencing are very important tools in molecular and synthetic biology, and over time, we expect that they will increase in prevalence, especially as they move into new commercial domains. This study is just a first attempt to consider the security risks of this field. Given the importance of these technologies and their close connection to computers, it is important that the security community consider the broad threats to this ecosystem.

Should I avoid genetic testing because of these findings?

No, not at all. Genetic sequencing and testing have many important benefits, and the risks we describe in this study are far from practical.

Paper

For more details on our findings and recommendations for the DNA sequencing community, see our peer-reviewed technical paper published at the 2017 USENIX Security Symposium.

This research was supported in part by the University of Washington Tech Policy Lab, the Short-Dooley Professorship, and the Torode Family Professorship.

Our Team

We are researchers at the University of Washington's Paul G. Allen School of Computer Science & Engineering.

Peter Ney (right)
Doctoral Student in Computer Security and Privacy Research Lab

Karl Koscher (middle)
Research Scientist in Computer Security and Privacy Research Lab

Lee Organick (left)
Doctoral Student in Molecular Information Systems Lab

Tadayoshi Kohno
Faculty in Computer Science & Engineering

Luis Ceze
Faculty in Computer Science & Engineering