DNA Digital Data Storage

DNA digital data storage is the process of encoding and decoding binary data to and from synthesized strands of DNA. While DNA as a storage medium has enormous potential because of its high storage density, its practical use is currently severely limited because of its high cost and very slow read and write times. In June 2019, scientists reported that all 16 GB of text from Wikipedia's English-language version had been encoded into synthetic DNA. In 2021, scientists reported that a custom DNA data writer had been developed that was capable of writing data into DNA at 18 Mbps.


Encoding methods

Many methods for encoding data in DNA are possible. The best methods make economical use of DNA and protect against errors. If the message DNA is intended to be stored for a long period of time, for example 1,000 years, it also helps if the sequence is obviously artificial and the reading frame is easy to identify.


Encoding text

Several simple methods for encoding text have been proposed. Most of these involve translating each letter into a corresponding "codon", consisting of a unique small sequence of nucleotides in a lookup table. Some examples of these encoding schemes include Huffman codes, comma codes, and alternating codes.
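
The letter-to-codon idea can be illustrated with a minimal sketch. The alphabet, the fixed codon length of three nucleotides, and the order in which codons are assigned are all assumptions made for illustration; this is a plain fixed-length table, not one of the Huffman, comma, or alternating codes named above.

```python
from itertools import product

# Hypothetical 29-symbol alphabet; any set of up to 64 symbols would fit.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ .,"

# All 64 three-nucleotide codons in a fixed order: AAA, AAC, AAG, AAT, ACA, ...
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

# The lookup table and its inverse (assumed assignment: first symbol -> first codon).
ENCODE = dict(zip(ALPHABET, CODONS))
DECODE = {codon: symbol for symbol, codon in ENCODE.items()}


def text_to_dna(text: str) -> str:
    """Translate each character into its codon; raises KeyError on unknown symbols."""
    return "".join(ENCODE[ch] for ch in text.upper())


def dna_to_text(dna: str) -> str:
    """Read the sequence three nucleotides at a time and look each codon up."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


if __name__ == "__main__":
    encoded = text_to_dna("Hello, DNA.")
    print(encoded)
    print(dna_to_text(encoded))
```

Because every codon has the same length, the reading frame is recovered simply by counting from the start of the sequence; variable-length schemes such as Huffman codes trade that simplicity for shorter sequences.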


Encoding arbitrary data

To encode arbitrary data in DNA, the data is typically first converted into ternary (base 3) data rather than binary (base 2) data. Each digit (or "trit") is then converted to a nucleotide using a lookup table. To prevent homopolymers (repeating nucleotides), which can cause problems with accurate sequencing, the result of the lookup also depends on the preceding nucleotide. In one example lookup table, if the previous nucleotide in the sequence is T (thymine) and the trit is 2, the next nucleotide will be G (guanine). Various systems may be incorporated to partition and address the data, as well as to protect it from errors. One approach to error correction is to regularly intersperse synchronization nucleotides between the information-encoding nucleotides. These synchronization nucleotides can act as scaffolds when reconstructing the sequence from multiple overlapping strands.
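
The trit-to-nucleotide step can be sketched as follows. The table `NEXT_BASE` is an assumed rotating table, chosen only so that it agrees with the single entry given in the text (previous T, trit 2, next G). The fixed six trits per byte and the virtual starting nucleotide are further simplifications for illustration, not a description of any published scheme.

```python
# Assumed lookup table: NEXT_BASE[previous][trit] -> next nucleotide.
# No row contains its own key, so the same base never appears twice in a row.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}

TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six trits hold any byte value


def bytes_to_trits(data: bytes) -> list[int]:
    """Write each byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits: list[int], start: str = "A") -> str:
    """Map trits to bases; `start` is a virtual previous base (an assumption)."""
    prev, out = start, []
    for trit in trits:
        prev = NEXT_BASE[prev][trit]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna: str, start: str = "A") -> list[int]:
    """Invert the lookup by finding each base's position in the previous base's row."""
    prev, trits = start, []
    for base in dna:
        trits.append(NEXT_BASE[prev].index(base))
        prev = base
    return trits


def trits_to_bytes(trits: list[int]) -> bytes:
    """Reassemble bytes from groups of six trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


if __name__ == "__main__":
    dna = trits_to_dna(bytes_to_trits(b"DNA"))
    print(dna)
    print(trits_to_bytes(dna_to_trits(dna)))
```

Since three choices remain after excluding the previous base, each nucleotide carries exactly one trit, which is why ternary rather than binary is the natural intermediate representation here.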


In vivo

The genetic code within living organisms can potentially be co-opted to store information. Furthermore, synthetic biology can be used to engineer cells with "molecular recorders" that allow information to be stored in and retrieved from the cell's genetic material. CRISPR gene editing can also be used to insert artificial DNA sequences into the genome of the cell. For encoding developmental lineage data (a molecular flight recorder), roughly 30 trillion cell nuclei per mouse × 60 recording sites per nucleus × 7–15 bits per site is reported to yield about 2 terabytes per mouse written (but only very selectively read).
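
As a rough check on the capacity estimate above, the product can be computed directly. This sketch assumes decimal terabytes (10^12 bytes) and 8 bits per byte. With the figures exactly as quoted, the result is roughly 1,575 to 3,375 terabytes rather than about 2; a count nearer 30 billion nuclei would give roughly 1.6 to 3.4 terabytes, so one of the quoted figures may be misstated.

```python
TB = 10**12  # decimal terabyte, an assumption about the unit intended


def capacity_bytes(nuclei: int, sites_per_nucleus: int, bits_per_site: int) -> float:
    """Total recordable capacity in bytes: nuclei x sites x bits, divided by 8."""
    return nuclei * sites_per_nucleus * bits_per_site / 8


if __name__ == "__main__":
    # First the count as quoted (30 trillion), then a hypothetical 30 billion.
    for nuclei in (30 * 10**12, 30 * 10**9):
        low = capacity_bytes(nuclei, 60, 7) / TB
        high = capacity_bytes(nuclei, 60, 15) / TB
        print(f"{nuclei:.0e} nuclei: {low:,.1f} to {high:,.1f} TB")
```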


History

The idea of DNA digital data storage dates back to 1959, when the physicist Richard P. Feynman, in "There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics", outlined the general prospects for creating artificial objects similar to objects of the microcosm (including biological ones) with similar or even more extensive capabilities. In 1964–65, the Soviet physicist Mikhail Samoilovich Neiman published three articles about microminiaturization in electronics at the molecular-atomic level, which independently presented general considerations and some calculations regarding the possibility of recording, storing, and retrieving information on synthesized DNA and RNA molecules. After Neiman's first paper had been published, and after the editor had received the manuscript of his second paper (on January 8, 1964, as indicated in that paper), an interview with the cybernetician Norbert Wiener was published. Wiener expressed ideas about the miniaturization of computer memory that were close to those Neiman had proposed independently. Neiman mentioned Wiener's ideas in the third of his papers. This story has been described in detail.

One of the earliest uses of DNA storage occurred in a 1988 collaboration between the artist Joe Davis and researchers from Harvard. The image, stored in a DNA sequence in ''E. coli'', was organized in a 5 × 7 matrix that, once decoded, formed a picture of an ancient Germanic rune representing life and the female Earth. In the matrix, ones corresponded to dark pixels and zeros to light pixels. In 2007, a device was created at the University of Arizona using addressing molecules to encode mismatch sites within a DNA strand. These mismatches could then be read out by performing a restriction digest, thereby recovering the data.

In 2011, George Church, Sri Kosuri, and Yuan Gao carried out an experiment to encode a 659-kb book co-authored by Church. To do this, the research team used a two-to-one correspondence in which a binary zero was represented by either an adenine or a cytosine and a binary one by either a guanine or a thymine. After examination, 22 errors were found in the DNA. In 2012, George Church and colleagues at Harvard University published an article in which DNA was encoded with digital information that included an HTML draft of a 53,400-word book written by the lead researcher, eleven JPG images, and one JavaScript program. Multiple copies were added for redundancy, and 5.5 petabits can be stored in each cubic millimeter of DNA. The researchers used a simple code in which bits were mapped one-to-one to bases, which had the shortcoming that it led to long runs of the same base, the sequencing of which is error-prone. This result showed that, besides its other functions, DNA can also serve as a storage medium, like hard drives and magnetic tapes.

In 2013, an article led by researchers from the European Bioinformatics Institute (EBI), submitted at around the same time as the paper by Church and colleagues, detailed the storage, retrieval, and reproduction of over five million bits of data. All the DNA files reproduced the information with between 99.99% and 100% accuracy. The main innovations in this research were the use of an error-correcting encoding scheme to ensure the extremely low data-loss rate, and the idea of encoding the data in a series of overlapping short oligonucleotides identifiable through a sequence-based indexing scheme. The sequences of the individual strands of DNA also overlapped in such a way that each region of data was repeated four times to avoid errors. Two of these four strands were constructed backwards, also with the goal of eliminating errors. The costs per megabyte were estimated at $12,400 to encode data and $220 for retrieval. However, it was noted that the exponential decrease in DNA synthesis and sequencing costs, if it continued, should make the technology cost-effective for long-term data storage by 2023. Also in 2013, software called DNACloud was developed by Manish K. Gupta and co-workers to encode computer files into their DNA representation. It implements a memory-efficient version of the algorithm proposed by Goldman et al. to encode (and decode) data to DNA (.dnac files).

The long-term stability of data encoded in DNA was reported in February 2015, in an article by researchers from ETH Zurich. The team added redundancy via Reed–Solomon error-correction coding and encapsulated the DNA within silica glass spheres via sol-gel chemistry. In 2016, research by Church and Technicolor Research and Innovation was published in which 22 MB of an MPEG-compressed movie sequence were stored in and recovered from DNA. The sequence was recovered with zero errors.

In March 2017, Yaniv Erlich and Dina Zielinski of Columbia University and the New York Genome Center published a method known as DNA Fountain that stored data at a density of 215 petabytes per gram of DNA. The technique approaches the Shannon capacity of DNA storage, achieving 85% of the theoretical limit. The method was not ready for large-scale use, as it cost $7,000 to synthesize 2 megabytes of data and another $2,000 to read it.

In March 2018, the University of Washington and Microsoft published results demonstrating the storage and retrieval of approximately 200 MB of data. The research also proposed and evaluated a method for random access to data items stored in DNA. In March 2019, the same team announced that they had demonstrated a fully automated system to encode and decode data in DNA. Research published by Eurecom and Imperial College in January 2019 demonstrated the ability to store structured data in synthetic DNA. The research showed how to encode structured or, more specifically, relational data in synthetic DNA and also demonstrated how to perform data processing operations (similar to SQL) directly on the DNA as chemical processes.

In June 2019, scientists reported that all 16 GB of text from the English-language Wikipedia had been encoded into synthetic DNA. The first article describing data storage on native DNA sequences via enzymatic nicking was published in April 2020. In the paper, scientists demonstrated a new method of recording information in the DNA backbone that enables bit-wise random access and in-memory computing. In 2021, CATALOG reported that they had developed a custom DNA writer capable of writing data into DNA at 18 Mbps.
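
The two-to-one correspondence described above for the 2011 experiment (zero as adenine or cytosine, one as guanine or thymine) can be sketched as follows. The rule used here to pick between the two candidate bases, namely avoiding the base just written, is an assumption for illustration; the text does not say how the choice was made.

```python
ZERO_BASES = "AC"  # a binary zero may be written as A or C
ONE_BASES = "GT"   # a binary one may be written as G or T


def encode_bits(bits: str) -> str:
    """Encode a string of '0'/'1' characters, one base per bit."""
    out = []
    prev = ""
    for bit in bits:
        if bit not in "01":
            raise ValueError(f"not a bit: {bit!r}")
        options = ZERO_BASES if bit == "0" else ONE_BASES
        # Assumed selection rule: take the first option unless it repeats
        # the previous base, which avoids runs of a single nucleotide.
        base = options[0] if options[0] != prev else options[1]
        out.append(base)
        prev = base
    return "".join(out)


def decode_bases(dna: str) -> str:
    """Decoding ignores which of the two bases was chosen."""
    return "".join("0" if base in ZERO_BASES else "1" for base in dna)


if __name__ == "__main__":
    message = "0000111101"
    dna = encode_bits(message)
    print(dna)
    print(decode_bases(dna) == message)
```

The freedom to choose between two bases for each bit costs half of the theoretical two bits per nucleotide, but it lets the encoder break up homopolymer runs without changing the decoded data.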


Davos Bitcoin Challenge

On January 21, 2015, Nick Goldman from the European Bioinformatics Institute (EBI), one of the original authors of the 2013 ''Nature'' paper, announced the Davos Bitcoin Challenge at the World Economic Forum annual meeting in Davos. During his presentation, DNA tubes were handed out to the audience, with the message that each tube contained the private key of exactly one bitcoin, all coded in DNA. The first person to sequence and decode the DNA could claim the bitcoin and win the challenge. The challenge was set for three years and would close if nobody claimed the prize before January 21, 2018. Almost three years later, on January 19, 2018, the EBI announced that a Belgian PhD student, Sander Wuyts, of the University of Antwerp and Vrije Universiteit Brussel, was the first to complete the challenge. In addition to the instructions on how to claim the bitcoin (stored as a plain text file and a PDF file), the logo of the EBI, the logo of the company that printed the DNA (CustomArray), and a sketch of James Joyce were retrieved from the DNA.


The Lunar Library

The Lunar Library, launched on the Beresheet lander by the Arch Mission Foundation, carries information encoded in DNA, including 20 famous books and 10,000 images. DNA was considered a well-suited storage medium for this purpose because it can last for an immense period of time. The Arch Mission Foundation suggests that it can still be read after billions of years.


DNA of Things

The concept of the DNA of Things (DoT) was introduced in 2019 by a team of researchers from Israel and Switzerland, including Yaniv Erlich and Robert Grass. DoT encodes digital data into DNA molecules, which are then embedded into objects. This makes it possible to create objects that carry their own blueprint, similar to biological organisms. In contrast to the Internet of things, which is a system of interrelated computing devices, DoT creates objects that are independent storage objects, completely off-grid. As a proof of concept for DoT, the researchers 3D-printed a Stanford bunny that contains its blueprint in the plastic filament used for printing. By clipping off a tiny bit of the bunny's ear, they were able to read out the blueprint, multiply it, and produce a next generation of bunnies. In addition, the ability of DoT to serve steganographic purposes was shown by producing indistinguishable lenses that contain a YouTube video integrated into the material.


See also

* DNA computing
* DNA nanotechnology
* Nanobiotechnology
* Natural computing
* Plant-based digital data storage
* 5D optical data storage


Further reading

* "DNA Sequencing Caught in Deluge of Data", The New York Times (NYTimes.com).