Nvidia Parabricks

Parabricks started as a company at the University of Michigan, founded by Mehrzad Samadi, Ankit Sethia, and Scott Mahlke. It was acquired by Nvidia in 2020. Nvidia Parabricks is a suite of free software for genome analysis developed by Nvidia, designed to deliver high throughput by using graphics processing unit (GPU) acceleration. Parabricks offers workflows for DNA and RNA analyses and for the detection of germline and somatic mutations, using open-source tools. It is designed to reduce the computing time of genomic data analysis while maintaining the flexibility required for various bioinformatics experiments. Along with the speed of GPU-based processing, Parabricks aims to provide high accuracy, compliance with standard genomic formats, and the ability to scale to very large datasets. Users can download and run Parabricks pipelines locally or deploy them directly on cloud providers such as Amazon Web Services, Google Cloud, Oracle Cloud Infrastructure, and Microsoft Azure.


Accelerated genome analysis fundamentals

The massive reduction in sequencing costs has led to a significant increase in the size and availability of genomics data, with the potential to revolutionize many fields, from medicine to drug design. Starting from a biological sample (e.g., saliva or blood), it is possible to extract an individual's DNA and sequence it, translating the biological information into a textual sequence of bases. Once the entire genome has been obtained through the genome assembly process, the DNA can be analyzed to extract information that is key in several domains, including personalized medicine and medical diagnostics.

Typically, genomics data analysis is performed with tools that run on central processing units (CPUs). Recently, several researchers in the field have underlined the limits in computing power of these tools and have focused their efforts on boosting the performance of the applications. The issue has been addressed in two ways: developing more efficient algorithms, or accelerating the compute-intensive parts with hardware accelerators such as GPUs, FPGAs, and ASICs.

In this context, GPUs have revolutionized genomics by exploiting their parallel processing power to accelerate computationally intensive tasks. GPUs deliver promising results in these scenarios thanks to their architecture, composed of thousands of small cores capable of performing computations in parallel. This parallelism significantly speeds up computations that can be broken down into independent units. For instance, aligning millions of sequencing reads against a reference genome or performing statistical analyses on large genomic datasets can be completed much faster on GPUs than on CPUs. This facilitates the rapid analysis of genomic data from diverse sources, ranging from individual genomes to large-scale population studies, accelerating the understanding of genetic diseases, genetic diversity, and more complex biological systems.
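As a rough illustration of why read-level work parallelizes well, the sketch below scores each read against a reference window independently of all the others. It runs on CPU threads, not on a GPU, and the function names (`mismatches`, `score_reads`) and the toy sequences are hypothetical; it only shows the decomposition into independent units that a GPU would spread across its cores.

```python
from concurrent.futures import ThreadPoolExecutor


def mismatches(reference: str, read: str, position: int) -> int:
    """Count mismatching bases between a read and the reference at a given position."""
    window = reference[position:position + len(read)]
    return sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)


def score_reads(reference, placed_reads, workers=4):
    """Score every (position, read) pair; no pair depends on any other pair."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda item: mismatches(reference, item[1], item[0]), placed_reads)
        )


# Toy data, made up for illustration only.
reference = "ACGTACGTTAGC"
placed_reads = [(0, "ACGT"), (4, "ACGA"), (8, "TTGC")]
scores = score_reads(reference, placed_reads)
print(scores)
```

Because no read's score depends on another's, the same work can be split across any number of workers without coordination, which is the property GPU-based tools rely on.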


Featured pipelines

Parabricks offers end users ''pipelines'': collections of tools organized sequentially to analyze raw data according to the user's requirements. Users can also run the tools provided by Parabricks standalone, still exploiting GPU acceleration to overcome possible computational bottlenecks. Only some of the tools in the suite are GPU-based. All the pipelines share a standard structure. Most of them are built to analyze FASTQ data produced by various sequencing technologies (e.g., short- or long-read). Input genomic sequences are first aligned and then undergo a quality control process. These two steps produce a BAM or CRAM file as an intermediate result. Based on this data, the variant calling step that follows employs high-accuracy tools that are already widely used. As output, these pipelines provide the identified mutations in a VCF (or gVCF) file.
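To make the pipeline output concrete, the following minimal sketch reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and labels each alternate allele as a SNP or an indel. The helper name `classify_vcf_line` and the two records are invented for illustration, and symbolic alleles (such as `<DEL>`) and header lines are deliberately not handled.

```python
def classify_vcf_line(line: str) -> dict:
    """Parse one VCF data line and label each ALT allele as SNP, indel, or other."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    alleles = alt.split(",")
    kinds = []
    for allele in alleles:
        if len(ref) == 1 and len(allele) == 1:
            kinds.append("SNP")
        elif len(ref) != len(allele):
            kinds.append("indel")
        else:
            kinds.append("other")  # e.g. a multi-nucleotide substitution
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alleles, "kinds": kinds}


# Made-up records in VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO
records = [
    "chr1\t10177\t.\tA\tAC\t50\tPASS\t.",
    "chr1\t10352\t.\tT\tG\t99\tPASS\t.",
]
parsed = [classify_vcf_line(record) for record in records]
for entry in parsed:
    print(entry["chrom"], entry["pos"], entry["kinds"])
```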


Germline pipeline

The germline pipeline offered by Parabricks follows the ''best practices'' proposed by the Broad Institute for its Genome Analysis Toolkit (GATK). The pipeline operates on the FASTQ files provided by the user to call germline variants, i.e., variants that can be inherited. It aligns the reads with BWA-MEM and calls variants with GATK HaplotypeCaller, one of the most relevant tools in the domain for germline variant calling.
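A minimal sketch of how such a run might be launched is shown below. It only assembles the command line and does not execute it. The flag names (`--ref`, `--in-fq`, `--out-bam`, `--out-variants`) are given as recalled from the Parabricks documentation and should be treated as assumptions to verify against the installed version; the file names and the helper `germline_command` are hypothetical.

```python
import shlex


def germline_command(ref, fastq_r1, fastq_r2, out_bam, out_vcf):
    """Assemble (but do not run) a `pbrun germline` invocation as an argv list."""
    return [
        "pbrun", "germline",
        "--ref", ref,                   # reference genome (FASTA)
        "--in-fq", fastq_r1, fastq_r2,  # paired-end FASTQ input
        "--out-bam", out_bam,           # aligned reads (BWA-MEM stage)
        "--out-variants", out_vcf,      # called variants (HaplotypeCaller stage)
    ]


cmd = germline_command(
    "ref.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.bam", "sample.vcf"
)
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.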


DeepVariant germline pipeline

Besides the pipeline that uses HaplotypeCaller to call variants, Parabricks also offers an alternative germline pipeline based on DeepVariant. DeepVariant is a variant caller, developed and maintained by Google, that identifies mutations using a deep learning-based approach. The core of DeepVariant is a convolutional neural network (CNN) that identifies variants by transforming the task into an image classification operation. In Parabricks, the inference process is accelerated in hardware. For this pipeline, only T4, V100, and A100 GPUs are supported. Analyses performed with this pipeline use BWA-MEM for alignment, followed by Google's CNN for variant calling.
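The idea of turning variant calling into image classification can be sketched with a toy pileup encoder: reads overlapping a candidate site become the rows of a small matrix, which a CNN could then classify. This is not DeepVariant's actual encoding (which uses several channels per position); the function `pileup_matrix`, the base codes, and the example reads are all hypothetical.

```python
# Hypothetical integer codes for bases; 0 means "no read coverage at this column".
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def pileup_matrix(reads, start, width):
    """Encode reads as rows of a matrix covering reference positions [start, start + width).

    `reads` is a list of (read_start, sequence) pairs in reference coordinates.
    """
    rows = []
    for read_start, sequence in reads:
        row = [0] * width
        for offset, base in enumerate(sequence):
            column = read_start + offset - start
            if 0 <= column < width:
                row[column] = BASE_CODE.get(base, 0)
        rows.append(row)
    return rows


# Two made-up reads overlapping a window that starts at position 11.
matrix = pileup_matrix([(10, "ACGT"), (12, "GTTA")], start=11, width=4)
for row in matrix:
    print(row)
```

A classifier working on such a matrix would output a genotype label for the candidate site instead of applying hand-written statistical rules.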


Human_par pipeline

Still compliant with GATK best practices, the ''human_par'' pipeline allows users to identify mutations in the entire human genome, including the sex chromosomes X and Y, while respecting their ploidy. For male samples, the pipeline first runs HaplotypeCaller with ploidy 2 on all the regions that do not belong to the X and Y chromosomes and on the pseudoautosomal region. Then, HaplotypeCaller analyzes the X and Y regions outside the pseudoautosomal region with ploidy 1. For female samples, instead, the pipeline runs HaplotypeCaller on the entire genome with ploidy 2. The sex of the sample can be determined in two ways:
# Set it manually with the --sample-sex option.
# Specify the X vs. Y ratio with the range options --range-male and --range-female and let the tool infer the sex of the sample from the X and Y read counts.
The pipeline requires the user to specify at least one of these three options. As in the germline pipeline, since this pipeline targets germline variants, it uses BWA-MEM for alignment, followed by HaplotypeCaller for variant calling.
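The sex inference and ploidy rules can be sketched as follows, assuming the usual convention that male X and Y regions outside the pseudoautosomal region (PAR) are called as haploid and everything else as diploid. The function names, the chromosome labels, and the numeric ranges in the example are illustrative assumptions, not the tool's actual defaults or implementation.

```python
def infer_sex(x_reads, y_reads, male_range, female_range):
    """Infer sample sex from the X/Y read-count ratio; return None if undetermined."""
    ratio = x_reads / y_reads if y_reads else float("inf")
    if male_range[0] <= ratio <= male_range[1]:
        return "male"
    if female_range[0] <= ratio <= female_range[1]:
        return "female"
    return None


def ploidy(sex, chromosome, in_par=False):
    """Ploidy used for calling: haploid only for male X/Y outside the PAR."""
    if sex == "male" and chromosome in ("chrX", "chrY") and not in_par:
        return 1
    return 2


# Illustrative ranges only; choose values appropriate for the data at hand.
MALE_RANGE = (1, 10)
FEMALE_RANGE = (150, 250)

sex = infer_sex(x_reads=500, y_reads=100, male_range=MALE_RANGE, female_range=FEMALE_RANGE)
print(sex, ploidy(sex, "chrX"), ploidy(sex, "chrX", in_par=True), ploidy(sex, "chr1"))
```

Returning `None` when the ratio falls outside both ranges makes the undetermined case explicit, so a caller can fall back to a manually supplied sex.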


Somatic pipeline

Parabricks' somatic pipeline is designed to call somatic variants, i.e., mutations affecting non-reproductive (somatic) cells. This pipeline can analyze both tumor and non-tumor genomes, offering either tumor-only or tumor/normal analyses. As in the germline pipeline, alignment is carried out with BWA-MEM; variant calling then uses GATK Mutect2 to identify the possible mutations. Mutect2 is used instead of HaplotypeCaller because it targets somatic mutations, whereas HaplotypeCaller targets germline mutations.
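The difference between tumor-only and tumor/normal runs can be sketched as a command builder that adds the normal-sample arguments only when a matched normal is available. The command is assembled, not executed. The flag names (`--in-tumor-fq`, `--in-normal-fq`, `--out-vcf`, `--out-tumor-bam`, `--out-normal-bam`) are given as recalled from the Parabricks documentation and are assumptions to check against the tool's help output; the helper `somatic_command` and the file names are hypothetical.

```python
def somatic_command(ref, tumor_fq, out_vcf, out_tumor_bam,
                    normal_fq=None, out_normal_bam=None):
    """Assemble (but do not run) a `pbrun somatic` argv list.

    `tumor_fq` and `normal_fq` are (R1, R2) pairs of FASTQ paths.
    """
    if (normal_fq is None) != (out_normal_bam is None):
        raise ValueError("normal FASTQ input and normal BAM output must be given together")
    cmd = [
        "pbrun", "somatic",
        "--ref", ref,
        "--in-tumor-fq", *tumor_fq,
        "--out-vcf", out_vcf,
        "--out-tumor-bam", out_tumor_bam,
    ]
    if normal_fq is not None:
        # Tumor/normal mode: add the matched normal sample.
        cmd += ["--in-normal-fq", *normal_fq, "--out-normal-bam", out_normal_bam]
    return cmd


tumor_only = somatic_command("ref.fa", ("t_R1.fq.gz", "t_R2.fq.gz"), "out.vcf", "tumor.bam")
paired = somatic_command(
    "ref.fa", ("t_R1.fq.gz", "t_R2.fq.gz"), "out.vcf", "tumor.bam",
    normal_fq=("n_R1.fq.gz", "n_R2.fq.gz"), out_normal_bam="normal.bam",
)
print(" ".join(paired))
```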


RNA pipeline

This pipeline is optimized for short variant discovery (i.e., single-nucleotide polymorphisms (SNPs) and indels) in RNA-seq data. It follows the Broad Institute's best practices for this type of analysis. It relies on STAR, a read aligner specialized for RNA sequences, for alignment, and on HaplotypeCaller for variant calling.


Parabricks tools

Parabricks provides a collection of tools for genomics analyses, classified into six main categories according to their task. Combined, these tools constitute Parabricks' pipelines, and they can also be used on their own. The categories are:
* FASTQ and BAM file processing
* Variant calling (including the GATK germline pipeline and a somatic variant caller)
* RNA processing
* Quality control of results
* Variant processing
* gVCF file processing
Some of the tools are marked as beta, and not all the listed tools are accelerated on GPU.


Hardware support

Users can download and run Parabricks pipelines on their local servers, allowing for private, on-site data processing and analysis. They can also deploy Parabricks pipelines on cloud platforms, with improved scalability for larger datasets. Supported cloud providers include AWS, GCP, OCI, and Azure. As of release v4.3.1-1, Parabricks includes support for the Nvidia GH200 Grace Hopper Superchip, a heterogeneous platform designed for high-performance computing and artificial intelligence that combines an Nvidia Grace CPU and a Hopper GPU on a single superchip. This platform enhances application performance using both GPUs and CPUs, offering a programming model aimed at improving performance, portability, and productivity.


Applications

Due to the computational power required by genomics workloads, Parabricks has found application in several research studies across different domains, especially in cancer research. Scientists from Washington University used the Parabricks DeepVariant pipeline to identify variants (e.g., SNPs and small indels) in long-read HiFi whole-genome sequencing (WGS) data generated with PacBio's Revio SMRT Cell technology. In addition to the pipelines, individual components of Parabricks have been used as standalone tools in academic settings. For example, the accelerated DeepVariant has been employed in a novel process to further reduce the processing time for WGS Nanopore data. In 2022, Nvidia announced a collaboration with the Broad Institute to provide researchers with the benefits of accelerated computing. This partnership covers Clara, Nvidia's suite of hardware-accelerated biomedical software, which includes Parabricks and MONAI. Similarly, the Regeneron Genetics Center uses Parabricks to expedite the secondary analysis of the exomes sequenced in its high-throughput sequencing center, leveraging the DeepVariant germline pipeline inside its workflows.


See also

* List of bioinformatics software
* List of sequence alignment software


External links


NVIDIA Clara

NVIDIA Clara for Genomics