UGENE is computer software for bioinformatics. It helps biologists to analyze various biological and genetic data, such as sequences, annotations, multiple alignments, phylogenetic trees, NGS assemblies, and others. UGENE integrates dozens of well-known biological tools, algorithms, and original tools in the context of genomics, evolutionary biology, virology, and other branches of life science.
UGENE works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under the GNU General Public License (GPL) version 2. The data can be stored both locally and on shared/networked storage. The graphical user interface (GUI) provides access to pre-built tools, so users with no computer programming experience can use them easily. UGENE also has a command-line interface to execute workflows.
Using UGENE Workflow Designer, it is possible to streamline a multi-step analysis. The workflow consists of blocks such as data readers, blocks executing embedded tools and algorithms, and data writers. Blocks can be created with command line tools or a script. A set of sample workflows is available in the Workflow Designer, to annotate sequences, convert data formats, analyze NGS data, etc.
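The reader, processing, and writer structure of a workflow can be illustrated with a short Python sketch. This is not UGENE's workflow format or API: the names `read_fasta_records`, `reverse_complement_block`, `write_fasta_records`, and `run_workflow` are hypothetical and only mirror the block structure described above, with a reverse-complement step standing in for an embedded tool.

```python
import io
from typing import Callable, Iterable, Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)
Block = Callable[[Iterable[Record]], Iterator[Record]]


def read_fasta_records(handle: TextIO) -> Iterator[Record]:
    """Reader block: yield (name, sequence) pairs from FASTA text."""
    name = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_block(records: Iterable[Record]) -> Iterator[Record]:
    """Processing block (hypothetical): reverse-complement each DNA sequence."""
    for name, seq in records:
        yield name + "_rc", seq.translate(_COMPLEMENT)[::-1]


def write_fasta_records(records: Iterable[Record], handle: TextIO) -> None:
    """Writer block: write records as FASTA text."""
    for name, seq in records:
        handle.write(f">{name}\n{seq}\n")


def run_workflow(source: TextIO, blocks: List[Block], sink: TextIO) -> None:
    """Chain reader -> processing blocks -> writer, streaming record by record."""
    records: Iterable[Record] = read_fasta_records(source)
    for block in blocks:
        records = block(records)
    write_fasta_records(records, sink)


if __name__ == "__main__":
    source = io.StringIO(">seq1\nATGC\nGG\n")
    sink = io.StringIO()
    run_workflow(source, [reverse_complement_block], sink)
    print(sink.getvalue(), end="")
```

Because each block consumes and produces an iterator of records, blocks can be reordered or swapped without changing the reader or the writer, which is the property a block-based workflow relies on.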
To improve performance, UGENE uses multi-core processors (CPUs) and graphics processing units (GPUs) to optimize a few algorithms.
Key features
The software supports the following features:
* Create, edit, and annotate nucleic acid and protein sequences
* Fast search in a sequence
* Multiple sequence alignment: Clustal W and O, MUSCLE, Kalign, MAFFT, T-Coffee
* Create and use shared storage, e.g., lab database
* Search through online databases: National Center for Biotechnology Information (NCBI), Protein Data Bank (PDB), UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, DAS servers
* Local and NCBI GenBank BLAST search
* Open reading frame finder
* Restriction enzyme finder with integrated REBASE restriction enzymes list
* Integrated Primer3 package for PCR primer design
* Plasmid construction and annotation
* In silico cloning by designing cloning vectors
* Genome mapping of short reads with Bowtie, BWA, and UGENE Genome Aligner
* Visualize next generation sequencing data (BAM files) using UGENE Assembly Browser
* Variant calling with SAMtools
* RNA-Seq data analysis with Tuxedo pipeline (TopHat, Cufflinks, etc.)
* ChIP-seq data analysis with Cistrome pipeline (MACS, CEAS, etc.)
* Raw NGS data processing
* HMMER 2 and 3 packages integration
* Chromatogram viewer
* Search for transcription factor binding sites (TFBS) with weight matrix and SITECON algorithms
* Search for direct, inverted, and tandem repeats in DNA sequences
* Local sequence alignment with optimized Smith-Waterman algorithm
* Build (using integrated PHYLIP neighbor joining, MrBayes, or PhyML Maximum Likelihood) and edit phylogenetic trees
* Combine various algorithms into custom workflows with UGENE Workflow Designer
* Contigs assembly with CAP3
* 3D structure viewer for files in Protein Data Bank (PDB) and Molecular Modeling Database (MMDB) formats, with anaglyph view support
* Predict protein secondary structure with GOR IV and PSIPRED algorithms
* Construct dot plots for nucleic acid sequences
* mRNA alignment with Spidey
* Search for complex signals with ExpertDiscovery
* Search for a pattern of various algorithms' results in a nucleic acid sequence with UGENE Query Designer
* In silico PCR for primer design and mapping
* SPAdes de novo assembler
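As an illustration of the local alignment feature listed above, the following is a minimal textbook sketch of Smith-Waterman scoring in Python. It is not UGENE's optimized implementation; the function name `smith_waterman_score`, the linear gap penalty, and the default scoring values are assumptions chosen for brevity.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b.

    Uses the classic dynamic-programming recurrence with a linear gap
    penalty and keeps only two rows of the matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

With the default scores, the shared substring `ACG` in the example contributes three matches. Recovering the aligned region itself, rather than only the score, would additionally require storing the full matrix and tracing back from the best cell.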
Sequence View
The Sequence View is used to visualize, analyze, and modify nucleic acid or protein sequences. Depending on the sequence type and the options selected, the following views can be present in the Sequence View window:
* 3D structure view
* Circular view
* Chromatogram view
* Graphs view: GC-content, AG-content, and others
* Dot plot view
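A GC-content graph of the kind listed above is typically a sliding-window statistic. The sketch below shows one way to compute it in Python; the function name `gc_content_windows` and the window and step parameters are assumptions, and UGENE's own graph settings may differ.

```python
from typing import List, Tuple


def gc_content_windows(seq: str, window: int, step: int = 1) -> List[Tuple[int, float]]:
    """Return (window start, GC fraction) for every full window along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = seq.upper()
    points = []
    # Only full-length windows are reported; a trailing partial window is skipped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        points.append((start, gc / window))
    return points


if __name__ == "__main__":
    for start, fraction in gc_content_windows("GGCCAATT", window=4, step=2):
        print(start, fraction)
```

The same loop yields an AG-content graph if the counted letters are changed to `A` and `G`.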
Alignment Editor
The Alignment Editor allows working with multiple nucleic acid or protein sequences: aligning them, editing the alignment, analyzing it, storing the consensus sequence, building a phylogenetic tree, and so on.
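A consensus sequence can be derived from an alignment column by column. The following Python sketch uses a simple majority rule; the function name `consensus_sequence`, the alphabetical tie-breaking, and the treatment of gaps are assumptions, and the consensus modes offered by the Alignment Editor may work differently.

```python
from collections import Counter
from typing import Sequence


def consensus_sequence(alignment: Sequence[str], gap: str = "-") -> str:
    """Return the majority-rule consensus of equally long aligned rows."""
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    consensus = []
    for column in zip(*alignment):
        counts = Counter(ch.upper() for ch in column if ch != gap)
        if not counts:
            # A column made only of gaps stays a gap in the consensus.
            consensus.append(gap)
            continue
        top = max(counts.values())
        # Ties are broken alphabetically so the result is deterministic.
        consensus.append(min(ch for ch, n in counts.items() if n == top))
    return "".join(consensus)


if __name__ == "__main__":
    print(consensus_sequence(["ACGT", "ACGA", "AC-A"]))
```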
Phylogenetic Tree Viewer
The Phylogenetic Tree Viewer helps to visualize and edit phylogenetic trees. It is possible to synchronize a tree with the corresponding multiple alignment used to build the tree.
Assembly Browser
The Assembly Browser project was started in 2010 as an entry for the Illumina iDEA Challenge 2011. The browser allows users to visualize and browse large next-generation sequencing assemblies (up to hundreds of millions of short reads). It supports SAM, BAM (the binary version of SAM), and ACE formats. Before assembly data can be browsed in UGENE, the input file is automatically converted to a UGENE database file. This approach has trade-offs: it allows viewing the whole assembly, navigating it, and jumping to well-covered regions rapidly, but the conversion may take time for a large file and needs enough disk space to store the database.
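Finding well-covered regions depends on knowing how many reads overlap each reference position. The Python sketch below computes such a coverage profile with a difference array; it is a generic illustration, not a description of how the UGENE database stores or indexes coverage, and the function name `coverage_profile` and the `(start, length)` read representation are assumptions.

```python
from typing import Iterable, List, Tuple


def coverage_profile(reference_length: int,
                     reads: Iterable[Tuple[int, int]]) -> List[int]:
    """Return per-position read depth for reads given as (0-based start, length)."""
    diff = [0] * (reference_length + 1)
    for start, length in reads:
        # Clip each read to the reference boundaries.
        begin = max(start, 0)
        end = min(start + length, reference_length)
        if begin >= end:
            continue
        diff[begin] += 1
        diff[end] -= 1
    coverage = []
    depth = 0
    # A running sum over the difference array gives the depth at each position.
    for delta in diff[:reference_length]:
        depth += delta
        coverage.append(depth)
    return coverage


if __name__ == "__main__":
    print(coverage_profile(6, [(0, 3), (2, 3), (5, 4)]))
```

The difference array makes the cost proportional to the number of reads plus the reference length, rather than to the total number of aligned bases.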
Workflow Designer
UGENE Workflow Designer allows creating and running complex computational workflow schemas.
The distinguishing feature of Workflow Designer, relative to other bioinformatics workflow management systems, is that workflows are executed on a local computer. This avoids the data transfer issues that arise when a tool relies on remote file storage and internet connectivity.
The elements that a workflow consists of correspond to the bulk of algorithms integrated into UGENE. Using Workflow Designer also allows creating custom workflow elements. The elements can be based on a command-line tool or a script.
Workflows are stored in a special text format, which allows them to be reused and transferred between users.
A workflow can be run using the graphical interface or launched from the command line. The graphical interface also allows controlling the workflow execution, storing the parameters, and so on.
There is an embedded library of workflow samples to convert, filter, and annotate data, with several pipelines to analyze NGS data developed in collaboration with NIH NIAID. A wizard is available for each workflow sample.
Supported biological data formats
* Sequences and annotations: FASTA (.fa), GenBank (.gb), EMBL (.emb), GFF (.gff)
* Multiple sequence alignments: Clustal (.aln), MSF (.msf), Stockholm (.sto), Nexus (.nex)
* 3D structures: PDB (.pdb), MMDB (.prt)
* Chromatograms: ABIF (.abi), SCF (.scf)
* Short reads: Sequence Alignment/Map (SAM) (.sam), binary version of SAM (.bam), ACE (.ace), FASTQ (.fastq)
* Phylogenetic trees: Newick (.nwk), PHYLIP (.phy)
* Other formats: Bairoch (enzymes info), HMM (HMMER profiles), PWM and PFM (position matrices), SNP and VCF4 (genome variations)
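The extensions in the list above can be turned into a small lookup table, for example when sorting input files before loading them. The Python sketch below does only that; the names `FORMATS_BY_EXTENSION` and `guess_format` are hypothetical, the group labels are informal, and this is not a description of how UGENE itself detects file formats.

```python
from pathlib import PurePath
from typing import Optional, Tuple

# Extension table copied from the list above: extension -> (format, group).
FORMATS_BY_EXTENSION = {
    ".fa": ("FASTA", "sequences and annotations"),
    ".gb": ("GenBank", "sequences and annotations"),
    ".emb": ("EMBL", "sequences and annotations"),
    ".gff": ("GFF", "sequences and annotations"),
    ".aln": ("Clustal", "multiple sequence alignments"),
    ".msf": ("MSF", "multiple sequence alignments"),
    ".sto": ("Stockholm", "multiple sequence alignments"),
    ".nex": ("Nexus", "multiple sequence alignments"),
    ".pdb": ("PDB", "3D structures"),
    ".prt": ("MMDB", "3D structures"),
    ".abi": ("ABIF", "chromatograms"),
    ".scf": ("SCF", "chromatograms"),
    ".sam": ("SAM", "short reads"),
    ".bam": ("BAM", "short reads"),
    ".ace": ("ACE", "short reads"),
    ".fastq": ("FASTQ", "short reads"),
    ".nwk": ("Newick", "phylogenetic trees"),
    ".phy": ("PHYLIP", "phylogenetic trees"),
}


def guess_format(path: str) -> Optional[Tuple[str, str]]:
    """Return (format, group) for a known extension, or None if unknown."""
    suffix = PurePath(path).suffix.lower()
    return FORMATS_BY_EXTENSION.get(suffix)


if __name__ == "__main__":
    for name in ("reads/sample.BAM", "tree.nwk", "notes.txt"):
        print(name, guess_format(name))
```

An extension is only a hint: a robust loader would also inspect the file contents, since the same extension can be used for different formats.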
Release cycle
UGENE is primarily developed by Unipro LLC with headquarters in Akademgorodok of Novosibirsk, Russia. Each iteration lasts about 1–2 months, followed by a new release. Development snapshots may also be downloaded.
The features to include in each release are mostly initiated by users.
See also
* Sequence alignment software
* Bioinformatics
* Computational biology
* List of open source bioinformatics software