Now imagine another story written in DNA: somewhere in the universe, humanity, the species of the "Anthropocene," has died out. Another intelligent species arises and sets out to explore the remains of ancient human civilization. What still carries humanity's memory? The climate has changed, and the enormous data centers on Earth have left only ruins.
But in the permafrost lies a cache of DNA. It is light, just 1 kg, and looks like white powder sealed in capsules. Once read, it turns out to record an enormous trove of information from Earth: videos, texts, and code documenting countless inventions and literary works across human history. The traces of that distant civilization unfold once again in the universe.
This is another science-fiction setting. The technology behind it is a cutting-edge research direction attracting real attention today: storing information in DNA. In nature, DNA's job is to store genetic information. A single human cell, typically 5 to 200 microns across, contains DNA encoding a person's entire genome: 3 billion base pairs.
So why couldn't those bases store other kinds of information? This sci-fi idea is moving out of the lab and being explored as a future solution for data storage.
01 Too much genomic data: what to do with it?
Originally, biologists were just trying to solve a problem in their own field.
Eleven years ago, a group of bioinformaticians sat in a hotel in Germany discussing "the data storage problem." Among them was Nick Goldman, then in his second year as a senior scientist at the European Bioinformatics Institute (EBI).
Large-scale genome sequencing was underway, and the resulting data was growing rapidly. Storing and compressing it was a headache, and existing technical solutions didn't seem up to the task. It has been estimated that genomic data could require 2 to 40 EB of storage capacity. That could exceed the cloud footprint of a world-class tech company: the total amount of data Apple users store on Google Cloud is about 8 EB, at a monthly storage bill of about $218 million. (1 EB = 1024^3 GB)
The biologists despaired.
Nick Goldman holding the DNA that stores all of Shakespeare's sonnets, a photo, and a clip of Martin Luther King Jr.'s "I Have a Dream" speech | Source: EBI
Someone had an epiphany: What’s stopping us from using DNA to store data?
It sounded like a joke, but the biologists realized it wasn't one. They grabbed a napkin and carefully worked out the feasibility with a ballpoint pen.
The principle by which DNA stores genetic information is not complicated. DNA is built from four nucleotides, A, T, G, and C, which pair up with one another to form the double-helix structure. It is the sequence of these nucleotides that records genetic information.
In the digital world, all information is ultimately a string of 0s and 1s. To make DNA store digital information, the simple idea is to convert that sequence of 0s and 1s into a sequence of nucleotides. The advantage of DNA storage is its density: a speck of DNA about the size of a comma on this page, roughly 1 cubic millimeter, can hold 9 TB (1 TB = 1024 GB) of information.
Using DNA to store data is not a completely new idea; scientists had tried it before. But those early attempts were pioneering crossover experiments between science and art.
In 1988, artist Joe Davis and Harvard researchers stored a pattern called "Microvenus" in short strands of DNA.
Image of Microvenus stored in DNA | Source: related paper
The encoding of this pattern is simple: white pixels are marked 0, black lines are marked 1. The file is only 35 bits, stored in a DNA strand of 28 nucleotides.
Two years after that hotel discussion, in 2013, Goldman's team published their research. This time they stored files in 5 different formats, totaling 0.75 MB. To ensure the information could be read back without error, each piece of information was stored with fourfold redundancy.
The five files are:
• All 154 of Shakespeare's sonnets (ASCII-encoded text)
• The classic paper describing the DNA double-helix structure (PDF)
• One photo (JPEG)
• A 26-second clip of Martin Luther King Jr.'s "I Have a Dream" speech (MP3)
• The Huffman code used in the encoding scheme
In recent years, records for DNA storage capacity have been broken repeatedly. In 2019, US startup Catalog stored 16 GB of Wikipedia in DNA. The company says it is building the world's first DNA-based platform for large-scale digital data storage and computing.
02 Encoding and decoding: plenty to deal with
To some biologists, using DNA for storage is a very natural fit. "Nature's coding language is very similar to the binary language we use in computing. On hard drives we use 0s and 1s to represent data, whereas in DNA we have 4 forms of nucleotides, A, C, T and G," says biologist Robert Grass of ETH Zurich, the Swiss Federal Institute of Technology.
One key to DNA storage is mapping the digits 0 and 1 onto the four nucleotides. The scheme can be very simple, for example: A for 00, C for 01, G for 10, and T for 11. The nucleotides are then strung together like beads in the required order (this is DNA synthesis). When the information needs to be read, gene-sequencing technology reads back the nucleotide sequence, which is then translated back into a string of 0s and 1s. The full pipeline is encoding, DNA synthesis, sequencing, and decoding.
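The simple 2-bits-per-base mapping above can be sketched in a few lines of code. This is purely illustrative: real schemes layer biochemical constraints and error correction on top, which this toy codec omits.

```python
# Toy DNA codec using the mapping A=00, C=01, G=10, T=11.
# Illustrative only: real schemes add constraints and error correction.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Convert bytes to a nucleotide string, 2 bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> bytes:
    """Reverse step: nucleotide string back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"DNA")
print(strand)                      # "CACACATGCAAC"
print(decode(strand))              # b"DNA"
```

Note that each byte becomes exactly four bases, so the 0.75 MB stored by Goldman's team would need roughly three million bases under this naive mapping (their actual scheme was different).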
This sounds as easy as the old joke about "putting an elephant in the refrigerator," a three-step recipe that hides all the hard parts. In practice, many issues must be dealt with; otherwise scientists wouldn't still be devising new coding schemes.
In natural DNA, A pairs with T and C pairs with G, and the proportions of GC and AT pairs are roughly even, about 50% each. If the C and G content is too high, the strand can fold into complex physical structures, which complicates sequencing (and thus decoding).
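A coding scheme therefore has to screen candidate strands for biochemical friendliness. Below is a minimal sketch of such a check; the 40-60% GC window and the homopolymer-run limit are illustrative thresholds I chose, not values from any published scheme.

```python
# Sketch of a biochemical-constraint check on candidate strands.
# The thresholds (40-60% GC, runs of at most 3) are illustrative.

def gc_fraction(dna: str) -> float:
    """Fraction of bases that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna)

def is_biochemically_friendly(dna: str, low=0.4, high=0.6, max_run=3) -> bool:
    """Reject strands with skewed GC content or long single-base runs."""
    if not (low <= gc_fraction(dna) <= high):
        return False
    run = 1
    for prev, cur in zip(dna, dna[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:          # e.g. "AAAA" is hard to synthesize/read
            return False
    return True

print(is_biochemically_friendly("ACGTACGTACGT"))  # True: balanced, no runs
print(is_biochemically_friendly("GGGGGGGGGGAC"))  # False: GC-heavy, long run
```

An encoder that hits an unfriendly strand typically re-maps the offending bits (for example by rotating the base assignment) until the constraints are satisfied.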
Steps to DNA Storage | Source: DNA Data Storage Alliance
And in the "beading" process, that is, synthesizing DNA strands, errors are inevitable. This is the bottleneck of current chemical synthesis technology. Each base is added correctly more than 99.9% of the time, which sounds precise; but as the strand grows longer, the small per-base error probability (under 0.1%) compounds, working out to roughly one error per thousand bases, and mistakes become unavoidable. At present a single synthesized DNA strand generally runs no more than 100 bases, with a practical limit of about 300. In nature, DNA often has thousands of base pairs.
That is, while DNA offers enormous storage density, the data must live in many short strands. A large file stored this way is like a loose-leaf book: it holds a lot of information, but as separate numbered pages. Short strands can of course be spliced into long ones, but that adds another processing step. And during sequencing, long strands must be broken back into short ones, because current technology cannot read a long strand in one pass.
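The compounding effect is easy to see with arithmetic: if each base is correct with probability 0.999 independently, the chance that an entire strand comes out error-free shrinks quickly with length.

```python
# How per-base errors compound with strand length.
# Assumes independent errors at 99.9% per-base accuracy (the figure
# quoted above); real error processes are more complicated.

per_base_accuracy = 0.999

for n in (100, 300, 1000):
    p_error_free = per_base_accuracy ** n
    print(f"{n:>5} bases: {p_error_free:.1%} chance of a perfect strand")
```

At 100 bases about 90% of strands are perfect, at 300 bases about 74%, and at 1000 bases only about 37%, which is one reason synthesized strands are kept short.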
Sequencing has its own error rate as well. Although it is now as low as about 10^-3, that is still at least 9 orders of magnitude worse than the read/write error rates of commercial hard drives.
Since accuracy is limited by both synthesis and sequencing, scientists design the encoding to work around it: they build error-correction mechanisms into the code. That way, even if errors occur during base synthesis and sequencing, the content stored in the DNA can still be read out correctly.
03 Out of the laboratory, but speed and cost still matter
DNA storage is also trying to move out of the lab.
In October 2020, Microsoft and Western Digital, together with gene-sequencing giant Illumina and DNA-synthesis startup Twist Bioscience, founded the DNA Data Storage Alliance.
It is the world's first alliance in this field spanning academia and the industry chain. The consortium hopes to develop technical and format standards that eventually lead to a viable commercial ecosystem.
Microsoft Research launched its DNA storage project in 2015, hiring Karin Strauss, a professor in the School of Computer Science and Engineering at the University of Washington, as Senior Principal Research Manager.
In 2013, Strauss and her colleagues visited EBI in the United Kingdom, learned about Goldman and his colleagues' research on DNA storage, and became very interested in the direction. "We're excited about the density, stability and maturity of DNA," Strauss said.
In their own research, they wanted to add another capability: random access. With common DNA sequencing technologies, all the strands must be read at once to recover the information. It's all or nothing. If you only want a small piece of the data, that is very cumbersome.
In 2016, they published a study showing they could locate a given image within data already stored in DNA, use enzymes to copy just the desired DNA segments, and then read only that small portion.
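Conceptually, this works because each file's strands carry an identifying address sequence, and the enzymatic copying step (PCR) amplifies only strands matching a chosen address. The sketch below mimics that selection with a lookup; the tags and payload names are hypothetical, and real PCR primer design is of course far more involved.

```python
# Conceptual sketch of random access in a DNA pool: every strand carries
# an address tag, and "reading" a file selects only matching strands,
# the way PCR amplifies strands with a chosen primer sequence.
# Tags and payload names below are hypothetical.

POOL = [
    ("ATCG", "photo-fragment-1"),
    ("ATCG", "photo-fragment-2"),
    ("GGCC", "document-fragment-1"),
]

def random_read(pool, tag):
    """Return only the payloads whose address tag matches."""
    return [payload for t, payload in pool if t == tag]

print(random_read(POOL, "ATCG"))  # just the photo fragments
```

The key saving is that sequencing then touches only the amplified subset rather than the whole pool.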
Karin Strauss (right) and two research collaborators | Source: csenews
Synthesis speed and cost must also improve to bring DNA storage closer to commercial use. Today's write (synthesis) speeds are on the order of kilobytes (KB) per second, while mature cloud storage systems already exceed gigabytes (GB) per second.
This means DNA write speeds need to improve by about 6 orders of magnitude. How can throughput be increased? Just as parallel computing speeds up data processing, scientists hope to synthesize many DNA strands in parallel, processing them at the same time.
In 2021, Microsoft demonstrated the first nanoscale DNA storage writer, capable of synthesizing 25×10^6 (25 million) base sequences simultaneously on each square centimeter. The new technique raised the number of sequences synthesized in parallel by several orders of magnitude, a throughput that would push DNA writing toward megabytes per second (MB/s).
New method greatly increases the number of arrays for DNA synthesis | Source: Microsoft Research
Greater throughput means lower cost. Today, DNA storage costs about $800 million per terabyte (TB), while tape storage has fallen below $16 per terabyte. By that comparison, DNA looks hopelessly uncompetitive. But large data centers are extremely expensive to maintain, and their hardware must be replaced regularly; DNA's density, tiny physical size, and ability to last for ages without degrading could prove overwhelming advantages.
"Cold data," data that is huge in volume but rarely accessed, is therefore considered the most promising application scenario for DNA storage. Twist Bioscience noted in a recent market report that the technology could help tech companies deploy storage more efficiently at "large scale, low power."
Other, more optimistic scientists place their faith in continued technological progress.
Since the completion of the Human Genome Project in 2003, the cost of sequencing has fallen by a factor of about 2 million. In 2016, asked about kilobyte-per-second write speeds, Goldman said: "[Improving read and write speeds by] 6 orders of magnitude is no big deal for genomics. You just have to wait a little longer."
So how long is "a little longer"? The field seems to be on the verge of a breakthrough.