Synthetic nucleic acid sequences are a staple of modern biology. Whether in cell and gene therapies, biofuels, or foods like the Impossible Burger, products using synthetic DNA are everywhere. Over the last two decades, synthesizing DNA has become faster and easier, but researchers worry that this will make it easier for people to access potentially dangerous products. While many experts call for more federal guidance and regulation over the production of synthetic nucleic acid sequences, others have drawn focus to biosecurity concerns that are a little closer to home: in research labs. Jean Peccoud, a synthetic biologist at Colorado State University, and Casey-Tyler Berezin, a molecular biologist on Peccoud’s team, discussed the biggest biosecurity issue facing research, approaches for encrypting messages into DNA sequences, and the importance of sequencing technologies for mitigating biosecurity risks.
How do you define biosecurity in the context of DNA?
Peccoud: An internet search for biosecurity will return stories about security measures aimed at preventing pests or infectious agents from entering the country. However, gene synthesis has transformed biosecurity because an infectious agent can now come through the computer network. What I mean by that is that countries can build walls across their borders, fill airports with dogs, and introduce more screening of passengers, but an infectious agent can come from a database and then get synthesized. However, the topic has evolved significantly; rather than solely focusing on whether someone can use a benchtop DNA printer to create Ebola in the lab, it’s much more interesting to broaden the scope of risks that we’re exposed to. Some of the problems that people are starting to pay attention to with respect to the biosecurity of gene synthesis are way too narrow. Life sciences enterprises need to develop more awareness of what they’re actually doing: How do we know that what we have in our labs is what we think it is? The community would really benefit from more security awareness.
For example, every research sample, such as a tube with a DNA plasmid, has two facets: a computer record that contains information about the sequence or provides a plasmid map and then there’s the content of the tube. When the two don’t match, there are all sorts of potential problems that arise. This may not be a biosecurity problem in the regular sense because you’re not dealing with infectious agents, but people are spending millions of dollars on research that they cannot reproduce because they don’t know what they have in their flasks. It’s a security problem that comes from the fact that what you’re working with is not what you think it is. That’s something that is happening in every lab, every day, and we have very few tools to figure out what’s going on in our own lab.
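The record-versus-tube mismatch described above can be illustrated with a minimal Python sketch. It assumes the tube's contents have already been sequenced and assembled into a single string that is the same length and orientation as the recorded sequence; real plasmid comparisons would need alignment and handling of circular rotation. The function names `fingerprint`, `matches_record`, and `mismatches` are hypothetical and are not part of any tool mentioned in the interview.

```python
import hashlib


def normalize(seq: str) -> str:
    """Strip whitespace and upper-case a nucleotide string."""
    return "".join(seq.split()).upper()


def fingerprint(seq: str) -> str:
    """SHA-256 digest of the normalized sequence (hypothetical helper)."""
    return hashlib.sha256(normalize(seq).encode("ascii")).hexdigest()


def matches_record(recorded: str, observed: str) -> bool:
    """True when the sequenced content is identical to the computer record."""
    return fingerprint(recorded) == fingerprint(observed)


def mismatches(recorded: str, observed: str) -> list[tuple[int, str, str]]:
    """List (position, recorded base, observed base) for each difference.

    Assumes equal-length, already-aligned sequences; this is a simplification.
    """
    recorded, observed = normalize(recorded), normalize(observed)
    if len(recorded) != len(observed):
        raise ValueError("sequences differ in length; align them first")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(recorded, observed))
        if a != b
    ]


if __name__ == "__main__":
    record = "ATGGCCATTGTAATGGGCCGC"
    tube = "ATGGCCATTGTTATGGGCCGC"  # one substitution, same length
    print(matches_record(record, tube))
    print(mismatches(record, tube))
```

A single substitution like the one in this example leaves the sequence length unchanged, so a size-based check would not flag it, while a base-by-base comparison does.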
Jean Peccoud, a synthetic biologist at Colorado State University, is interested in the security implications of synthetic biology.
Tim Gillies
How did you become interested in studying biosecurity in the lab?
Berezin: When it comes to biosecurity with respect to synthetic DNA sequences, the onus is really on the gene synthesis companies that are making the sequences, but these security efforts are limited in scope and can’t protect against other events that can come up later in the life of the product, whether it’s the intentional manipulation of a sequence or plasmid mutations that naturally occur over time. Biosecurity is not something that was part of my PhD training, nor something that ever came up in any of the labs that I worked in. That is really a missed opportunity to educate people on these potential security issues. I became interested in the topic when I joined Peccoud’s synthetic biology team. I realized that a lot of the methods that we’re using, such as polymerase chain reaction (PCR) and bacterial transformations, are methods I had used before without ever wondering where the DNA sequences came from or how I would know if something had changed in the sequences. This is the status quo: we work with DNA and take for granted that it’s going to be what we think it is. Once you are aware of the biosecurity issues, it’s something you can’t turn your back on. Now, I see those issues everywhere.
How can DNA sequencing prevent potential biosecurity events?
Berezin: Initial screenings of DNA products may not be enough to prevent problems from arising years down the line. DNA is going to mutate. That’s what it likes to do. It likes to replicate and sometimes that doesn’t go perfectly. So even if you have something safe in a tube in your lab, after you propagate it 100 times or 1,000 times, you might not have what you think you do. Whether that’s dangerous or not really depends on the specific scenario, but that uncertainty of not knowing what you have is very prevalent across academic research labs. It takes a lot of work on the part of the user to ensure that they’re tracking all the sequences that they have and that they are sequencing their plasmids as they go on.
Even outside of just mutations, anytime you’re making something new, there’s a chance that it’s not exactly what you’re expecting. For example, when someone creates a new plasmid, they often take a cut-and-paste approach where they insert their gene of interest into an existing plasmid and then run PCR on the inserted section to check that the product is the expected size. However, people rarely sequence the entire plasmid because it’s expensive or it might not be accessible or even seem worth it. We’re really trying to push for people to be sequencing everything they have as often as they can to catch those scenarios.
Casey-Tyler Berezin, a molecular biologist at Colorado State University, is developing digital signature techniques that would allow scientists to trace and authenticate synthetic nucleic acid products.
Colorado State University
What is DNA cryptography?
Berezin: Cryptography involves encoding a secret message in something that looks like regular text and sending it to someone who has the information needed to decode the hidden message. This can extend to DNA where letters are translated into nucleotides—for example, ‘E’ becomes ‘GGT’.1 On the face of it, the sequence would look like any other DNA sequence, but someone who has a specific set of primers can isolate a segment of the sequence and use a cipher table to decode the nucleotides. This is a standard DNA barcoding scheme that scientists have used for decades to show that DNA can encode information. However, more recently, that system has become easier to break. For example, E is the most common letter in the English language so if someone determines the sequence for E, that might make it easier to figure out what the rest of the message is supposed to say.
We’re interested in encoding encrypted messages that provide the user with information about the authenticity of the materials they’re working with. For this, our group has been developing a digital signature approach called DNA Identification Number (DIN), which is a more complex cryptography approach that makes it even more difficult for the receiver to open unless they know what they’re looking for.
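The letter-to-nucleotide substitution described in this answer, and the reason letter frequency makes it easy to break, can be sketched in a few lines of Python. Only the mapping of ‘E’ to ‘GGT’ comes from the interview; every other entry in the cipher table below is an arbitrary assumption made for illustration, and the primer-based step of locating the message within a longer sequence is left out.

```python
from collections import Counter
from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

# Build a hypothetical cipher table: one three-nucleotide code per character.
# Only E -> GGT is taken from the interview; the rest is arbitrary.
_codons = ["".join(p) for p in product("ACGT", repeat=3)]
_codons.remove("GGT")
TABLE = dict(zip((c for c in ALPHABET if c != "E"), _codons))
TABLE["E"] = "GGT"
REVERSE = {codon: letter for letter, codon in TABLE.items()}


def encode(message: str) -> str:
    """Translate each character into its three-nucleotide code."""
    return "".join(TABLE[ch] for ch in message.upper())


def decode(dna: str) -> str:
    """Read the sequence three nucleotides at a time and look up each letter."""
    triplets = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(REVERSE[t] for t in triplets)


def most_common_codon(dna: str) -> str:
    """Frequency analysis: the most frequent triplet is a likely candidate for E."""
    triplets = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return Counter(triplets).most_common(1)[0][0]


if __name__ == "__main__":
    secret = encode("MEET ME HERE")
    print(secret)
    print(decode(secret))
    print(most_common_codon(secret))
```

Because each letter always maps to the same triplet, counting triplets in a long enough English message reveals which one stands for E, which is the weakness Berezin describes.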
How does the DNA Identification Number system work?
Peccoud: In any mature industry, you have some kind of product identification system in place that makes it possible to authenticate an individual product. We don’t have that in biotech. The DIN system is a digital certificate technology for verifying synthetic sequences.2 The system borrows from the Vehicle Identification Number (VIN), a unique code used by the automotive industry to identify individual vehicles and access information about the car’s manufacturer, make, and model. The DIN approach creates a digital signature cassette that is embedded into a DNA plasmid.3 Once identified using primers that recognize a start and stop sequence, the user can extract information such as the developer’s ORCID ID, a plasmid ID, and a digital signature from the creator.
Berezin: Currently, the DIN cassette is 512 base pairs in length, irrespective of the length of the incoming message: whether the encrypted information uses 5,000 or 10,000 base pairs, a hash function compresses the data into the same 512-nucleotide-long sequence. The output of the hash function is a random-looking, computer-generated string of nucleotides that is harder to break. To confirm that the hash sequence isn’t inadvertently interfering with function or introducing an unwanted mutation, we can compare it to a database of sequences and introduce it to bacteria to see how it affects gene expression and behavior. This is something that we are continuing to test, along with creating an even smaller signature cassette.
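The fixed-length property Berezin describes, where inputs of any size map to the same 512-nucleotide output, can be illustrated with a small Python sketch. The choice of SHAKE-256 and the two-bits-per-base mapping are assumptions made for this example and are not taken from the DIN system; a real digital signature would also involve signing with a private key, which is not shown here. The function name `hash_to_dna` is hypothetical.

```python
import hashlib

BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def hash_to_dna(message: bytes, length_nt: int = 512) -> str:
    """Hash a message of any length into a fixed-length nucleotide string.

    Each byte of the digest yields four nucleotides (two bits per base),
    so a 512-nucleotide output needs a 128-byte digest.
    """
    if length_nt % 4 != 0:
        raise ValueError("length_nt must be a multiple of 4")
    digest = hashlib.shake_256(message).digest(length_nt // 4)
    bases = []
    for byte in digest:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


if __name__ == "__main__":
    short_input = b"ACGT" * 1250   # 5,000 characters
    long_input = b"ACGT" * 2500    # 10,000 characters
    print(len(hash_to_dna(short_input)), len(hash_to_dna(long_input)))
```

Both inputs produce a 512-character string over A, C, G, and T, and changing the input changes the output, which is why the result looks random to anyone who does not know what was hashed.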
What kind of infrastructure is needed in laboratories to support the wider adoption of these tools?
Berezin: The continued verification of DNA sequences relies on us having good sequencing technology. To get as accurate a sequencing read as possible, we use a combination of sequencing technologies: short-read sequencing, long-read sequencing, and Sanger sequencing. This also requires users to have programming and sequence analysis skills, which could introduce accessibility issues. At the moment, it’s difficult to analyze sequencing reads if you’ve never worked with bioinformatics. Something we are working on is a user-friendly website where users can upload their raw data and our bioinformatics pipeline returns information about their sequence.
Disclosure of Conflicts of Interest: Jean Peccoud is the chief operating officer of GenoFAB, Inc., a biofoundry that provides plasmid management services, and an inventor on a patent related to the research discussed in this story.