BioJava is an open-source software project dedicated to providing
Java tools for processing biological data. It is a set of library
functions written in the Java programming language for manipulating
sequences and protein structures, together with file parsers, Common
Object Request Broker Architecture (CORBA) interoperability,
Distributed Annotation System (DAS) support, access to AceDB, dynamic
programming, and simple statistical routines. BioJava supports a wide
range of data, from DNA and protein sequences up to 3D protein
structures. The BioJava libraries are useful for automating many
routine bioinformatics tasks, such as parsing a Protein Data Bank
(PDB) file.
The BioJava project grew out of work by Thomas Down and Matthew Pocock to create an API that simplifies the development of Java-based bioinformatics tools. BioJava is an active open-source project that has been developed over more than 12 years by more than 60 developers. It is one of a number of Bio* projects designed to reduce code duplication; others include Biopython, BioPerl, BioRuby and EMBOSS. Version 3.0.5 was a major update over prior versions and is organised as several independent modules. The old code base has been moved to a separate project called BioJava-legacy.
Features
BioJava provides software modules for many of the typical tasks of bioinformatics programming. These include:
Accessing nucleotide and peptide sequence data from local and remote databases
Transforming formats of database/file records
Protein structure parsing and manipulation
Manipulating individual sequences
Searching for similar sequences
Creating and manipulating sequence alignments
History and publications
In 2008, BioJava's first application note was published.
It was migrated from its original CVS repository to
This window shows two proteins with IDs "4hhb.A" and "4hhb.B" aligned against each other. The code is given on the left side. The display is produced using the BioJava libraries, which in turn use the Jmol viewer. The FATCAT rigid algorithm is used here to perform the alignment.
The protein structure modules provide tools to represent and manipulate 3D biomolecular structures. They focus on protein structure comparison. The following algorithms have been implemented and included in BioJava:
FATCAT algorithm for flexible and rigid body alignment.
The standard Combinatorial Extension (CE) algorithm.
A new version of CE that can detect circular permutations in proteins.
These algorithms are used to provide the RCSB Protein Data Bank
(PDB) Protein Comparison Tool as well as systematic comparisons of
all proteins in the PDB on a weekly basis.
Parsers for PDB and mmCIF file formats allow the loading of
structure data into a reusable data model. This feature is used by the
SIFTS project to map between
String name1 = "4hhb.A";
String name2 = "4hhb.B";
AtomCache cache = new AtomCache();
Structure structure1 = null;
Structure structure2 = null;
StructureAlignment algorithm = StructureAlignmentFactory.getAlgorithm(FatCatRigid.algorithmName);
structure1 = cache.getStructure(name1);
structure2 = cache.getStructure(name2);
// The alignment works on arrays of C-alpha atoms, one array per chain
Atom[] ca1 = StructureTools.getAtomCAArray(structure1);
Atom[] ca2 = StructureTools.getAtomCAArray(structure2);
FatCatParameters params = new FatCatParameters();
AFPChain afpChain = algorithm.align(ca1, ca2, params);
// Show the superposed chains in a viewer window
StructureAlignmentDisplay.display(afpChain, ca1, ca2);
The code aligns the two protein chains "4hhb.A" and "4hhb.B" using the FATCAT rigid algorithm.
Genome and Sequencing modules
This module focuses on creating gene sequence objects from the core module. It does so by parsing the following popular standard file formats generated by open-source gene prediction applications:
GTF files generated by GeneMark
GFF2 files generated by GeneID
GFF3 files generated by Glimmer
The gene sequence objects are then written out in GFF3 format and imported into GMOD. These file formats are well defined, but what gets written in the file is very flexible. The following code example takes a 454scaffold file that was used by GeneMark to predict genes and returns a collection of ChromosomeSequences. Each chromosome sequence maps to a named entry in the FASTA file and would contain N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts. Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences returns all protein sequences, which can then be written to a FASTA file.
LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList =
    GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(
        new File("454Scaffolds.fna"), new File("genemark_hmm.gtf"));
LinkedHashMap<String, ProteinSequence> proteinSequenceList =
    GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values());
FastaWriterHelper.writeProteinSequence(
    new File("genemark_proteins.faa"), proteinSequenceList.values());
You can also output the gene sequences to a FASTA file in which the coding regions are upper case and the non-coding regions are lower case.
LinkedHashMap<String, GeneSequence> geneSequenceHashMap =
    GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values());
Collection<GeneSequence> geneSequences = geneSequenceHashMap.values();
FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true);
You can write out a GFF3 view of a ChromosomeSequence with the following code.
FileOutputStream fo = new FileOutputStream("genemark.gff3");
GFF3Writer gff3Writer = new GFF3Writer();
gff3Writer.write(fo, chromosomeSequenceList);
fo.close();
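GFF3 itself is a tab-separated text format with nine columns per feature line (sequence ID, source, type, start, end, score, strand, phase, attributes). The following is a minimal sketch of assembling one such line in plain Java. It does not use or reproduce BioJava's GFF3Writer; the class name, method name and sample values are hypothetical, and escaping of reserved characters is not handled.

```java
/** Illustrative only: builds a single GFF3 feature line by hand. */
final class Gff3LineSketch {

    /**
     * Joins the nine GFF3 columns with tabs. Score and phase are written
     * as "." (undefined), which is an assumption made to keep the sketch short.
     */
    static String featureLine(String seqId, String source, String type,
                              int start, int end, char strand, String attributes) {
        // GFF3 coordinates are 1-based and start must not exceed end.
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("Invalid coordinates: " + start + ".." + end);
        }
        return String.join("\t",
                seqId, source, type,
                Integer.toString(start), Integer.toString(end),
                ".", String.valueOf(strand), ".", attributes);
    }

    public static void main(String[] args) {
        // Hypothetical gene feature on a hypothetical contig.
        System.out.println(featureLine("contig1", "GeneMark", "gene", 100, 900, '+', "ID=gene1"));
    }
}
```

A real writer would also emit the `##gff-version 3` header line and nested features (mRNA, exon, CDS) linked through `Parent` attributes.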
A separate sequencing module provides input-output support for several common variants of the FASTQ file format produced by next-generation sequencers. It is called the Sequence Module and is contained in the package org.biojava3.sequencing.io.fastq. Work is in progress towards providing a complete set of Java classes for converting between different file formats; the list of supported gene prediction applications and genome browsers will grow based on end-user requests.
Alignment module
This module contains several classes and methods that allow users to perform pairwise and multiple sequence alignment.
Pairwise sequence alignment
For optimal global alignment, BioJava implements the Needleman–Wunsch algorithm, and for local alignment it implements the Smith–Waterman algorithm. The outputs of both local and global alignments are available in standard formats. An example of how to use the libraries is shown below.
protected void align(String uniProtID_1, String uniProtID_2, PairwiseSequenceAlignerType alignmentType) throws IOException, Exception {
    // Download both sequences from UniProt in FASTA format
    ProteinSequence proteinSeq1 = FastaReaderHelper.readFastaProteinSequence(
        (new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtID_1))).openStream()).get(uniProtID_1);
    ProteinSequence proteinSeq2 = FastaReaderHelper.readFastaProteinSequence(
        (new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtID_2))).openStream()).get(uniProtID_2);

    // Align using the requested aligner type, a default gap penalty and a default substitution matrix
    SequencePair<ProteinSequence, AminoAcidCompound> result = Alignments.getPairwiseAlignment(
        proteinSeq1, proteinSeq2, alignmentType,
        new SimpleGapPenalty(), new SimpleSubstitutionMatrix<AminoAcidCompound>());
    System.out.println(result.toString());
}
An example call to the above function would look like this.
For global alignment:
align("Q21691", "Q21495", PairwiseSequenceAlignerType.GLOBAL);
For local alignment:
align("Q21691", "Q21495", PairwiseSequenceAlignerType.LOCAL);
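The dynamic-programming recurrence behind Needleman–Wunsch global alignment can be sketched in plain Java. This is a minimal illustration and not BioJava's implementation: it computes only the optimal score (no traceback), keeps just two rows of the matrix in memory, and uses a hypothetical scoring scheme (match +1, mismatch -1, linear gap penalty -2) instead of a substitution matrix.

```java
/** Illustrative only: Needleman-Wunsch global alignment score with two rolling rows. */
final class NeedlemanWunschSketch {

    // Hypothetical scoring scheme chosen for the example.
    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    /** Returns the optimal global alignment score of a and b. */
    static int score(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // First row: aligning the empty prefix of a against prefixes of b costs only gaps.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * GAP;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * GAP; // prefix of a against the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int diagonal = prev[j - 1] + substitution; // align a[i-1] with b[j-1]
                int up = prev[j] + GAP;                    // gap in b
                int left = curr[j - 1] + GAP;              // gap in a
                curr[j] = Math.max(diagonal, Math.max(up, left));
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "AGT"));
    }
}
```

Switching to local (Smith–Waterman style) scoring would mean initialising the borders to zero, clamping each cell at zero and tracking the maximum cell instead of the bottom-right one.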
In addition to these two algorithms, there is an implementation of the Guan–Uberbacher algorithm, which performs global sequence alignment efficiently because it uses only linear memory. For multiple sequence alignment, any of the methods discussed above can be used to progressively build a multiple sequence alignment.
ModFinder module
An example application using the ModFinder module and the protein structure module. Protein modifications are mapped onto the sequence and structure of ferredoxin I (PDB ID 1GAO). Two possible iron–sulfur clusters are shown on the protein sequence (3Fe–4S (F3S): orange triangles/lines; 4Fe–4S (SF4): purple diamonds/lines). The 4Fe–4S cluster is displayed in the Jmol structure window above the sequence display.
The ModFinder module provides new methods to identify and classify protein modifications in protein 3D structures. Over 400 different types of protein modifications, such as phosphorylation, glycosylation, disulfide bonds and metal chelation, were collected and curated based on annotations in PSI-MOD, RESID and RCSB PDB. The module also provides an API for detecting protein modifications within protein structures.
Example: identify and print all preloaded modifications from a structure
Set<ModifiedCompound> identifyAllModfications(Structure struc) {
    ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
    parser.identify(struc);
    Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
    return mcs;
}
Example: identify phosphorylation sites in a structure
List<ResidueNumber> identifyPhosphosites(Structure struc) {
    List<ResidueNumber> phosphosites = new ArrayList<ResidueNumber>();
    ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
    // Restrict the search to modifications registered under the "phosphoprotein" keyword
    parser.identify(struc, ProteinModificationRegistry.getByKeyword("phosphoprotein"));
    Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
    for (ModifiedCompound mc : mcs) {
        Set<StructureGroup> groups = mc.getGroups(true);
        for (StructureGroup group : groups) {
            phosphosites.add(group.getPDBResidueNumber());
        }
    }
    return phosphosites;
}
Demo code to run the above methods
import java.util.List;
import java.util.Set;
import org.biojava.bio.structure.ResidueNumber;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.io.PDBFileReader;
import org.biojava3.protmod.structure.ModifiedCompound;
import org.biojava3.protmod.structure.ProteinModificationIdentifier;

public static void main(String[] args) {
    try {
        PDBFileReader reader = new PDBFileReader();
        reader.setAutoFetch(true);

        // identify all modifications from PDB:1CAD and print them
        String pdbId = "1CAD";
        Structure struc = reader.getStructureById(pdbId);
        Set<ModifiedCompound> mcs = identifyAllModfications(struc);
        for (ModifiedCompound mc : mcs) {
            System.out.println(mc.toString());
        }

        // identify all phosphosites from PDB:3MVJ and print them
        pdbId = "3MVJ";
        struc = reader.getStructureById(pdbId);
        List<ResidueNumber> psites = identifyPhosphosites(struc);
        for (ResidueNumber psite : psites) {
            System.out.println(psite.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
There are plans to include further protein modifications by
integrating other resources such as UniProt
Grand average of hydropathy
The precise molecular weights for common isotopically labelled amino
acids are included in this module. There also exists flexibility to
define new amino acid molecules with their molecular weights using
Using library function calls Using command line
Making library function calls
The following examples show how to use the module and make function calls to get information about protein disorder. The first two examples make library function calls to calculate the probability of disorder for every residue in the sequence provided. The third and fourth examples demonstrate how to get the disordered regions of the protein.
Example 1: Calculate the probability of disorder for every residue in the sequence
FastaSequence fsequence = new FastaSequence("name",
    "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" +
    "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN");
// One score per residue
float[] rawProbabilityScores = Jronn.getDisorderScores(fsequence);
Example 2: Calculate the probability of disorder for every residue in the sequence for all proteins from the FASTA input file
final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
// One score array per input sequence
Map<FastaSequence, float[]> rawProbabilityScores = Jronn.getDisorderScores(sequences);
Example 3: Get the disordered regions of the protein for a single protein sequence
FastaSequence fsequence = new FastaSequence("Prot1",
    "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" +
    "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN" +
    "CQIIFEGRNAPERADPMWTGGLNKHIIARGHFFQSNKFHFLERKFCEMAEIERPNFTCRTLDCQKFPWDDP");
// One Range per disordered region
Range[] ranges = Jronn.getDisorder(fsequence);
Example 4: Calculate the disordered regions for the proteins from FASTA file
final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
// One array of disordered regions per input sequence
Map<FastaSequence, Range[]> ranges = Jronn.getDisorder(sequences);
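The relationship between per-residue scores (Examples 1 and 2) and disordered regions (Examples 3 and 4) can be pictured as a thresholding step. The sketch below is an assumption about how such regions might be derived, not JRONN's actual code: it treats every residue whose score is at or above a cutoff as disordered and reports each contiguous run as a 0-based, inclusive start/end pair. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: groups residues with score >= cutoff into contiguous regions. */
final class DisorderRegionSketch {

    /** Returns {start, end} pairs (0-based, inclusive) for each run of disordered residues. */
    static List<int[]> regions(float[] scores, float cutoff) {
        List<int[]> result = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a region"
        for (int i = 0; i < scores.length; i++) {
            boolean disordered = scores[i] >= cutoff;
            if (disordered && start < 0) {
                start = i;                              // a new region opens here
            } else if (!disordered && start >= 0) {
                result.add(new int[] {start, i - 1});   // region closed by an ordered residue
                start = -1;
            }
        }
        if (start >= 0) {
            result.add(new int[] {start, scores.length - 1}); // region runs to the end
        }
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.6f, 0.7f, 0.2f, 0.9f};
        for (int[] r : regions(scores, 0.5f)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```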
Using command line
The BioJava module biojava3-protein-disorder can be compiled into a single executable JAR file and run using the following command.
java -jar <jar_file_name>
Options supported by the command line executable
JRONN version 3.1b usage 1 August 2011:
java -jar JRONN_JAR_NAME -i=inputfile <OPTIONS>
Where -i=input file
Input file can contain one or more FASTA formatted sequences.
All OPTIONS are optional
OPTION DETAILED DESCRIPTION:
-o full path to the output file, if not specified standard out is used
-d the value of disorder, defaults to 0.5
-f output format, V for vertical, where the letters of the sequence and corresponding disorder values are output in two column layout. H for horizontal, where the disorder values are provided under the letters of the sequence. Letters and values separated by tabulation in this case. Defaults to V.
-s the file name to write execution statistics to.
-n the number of threads to use. Defaults to the number of cores available on the computer. n=1 mean sequential processing. Valid values are 1 < n < (2 x num_of_cores) Default value will give the best performance.
Examples
Predict disorder values for sequences from the input file /home/input.fasta and output the results to standard out. Use the default disorder value and utilise all CPUs available on the computer.
java -jar JRONN.JAR -i=/home/input.fasta
Predict disorder values for sequences from the input file /home/input.fasta, output the results in horizontal layout to /home/jronn.out, use a disorder value of 0.6 and limit the number of threads to two.
java -jar JRONN.JAR -i=/home/input.fasta -o=/home/jronn.out -d=0.6 -n=2 -f=H
The arguments can be provided in any order.
Web service access module
In line with current trends in bioinformatics, web-based tools are
gaining popularity. The web service module allows bioinformatics
services to be accessed using REST protocols. Currently, two services
are implemented: NCBI BLAST through the BLAST URLAPI (previously known
as QBlast) and the HMMER web service.
Comparisons with other alternatives
The need for customized software in the field of bioinformatics has
been addressed by several groups and individuals. Similar to BioJava,
open-source software projects such as BioPerl, Biopython, and BioRuby
all provide toolkits with a range of functionality that makes it
easier to create customized pipelines or analyses.
As the names suggest, the projects mentioned above use different
programming languages. All of these APIs offer similar tools, so on
what criteria should one base a choice? For programmers who are
experienced in only one of these languages, the choice is
straightforward. However, for a well-rounded bioinformatician who
knows all of these languages and wants to choose the best language for
a job, the choice can be guided by the following recommendations from
a software review of the Bio* toolkits.
In general, for small programs (<500 lines) that will be used by
only an individual or small group, it is hard to beat
Both provide comprehensive collections of methods for protein sequences.
Both are used by Java programmers to code bioinformatics algorithms.
Both separate implementations and definitions by using Java interfaces.
Both are open source projects.
Both can read and write many sequence file formats.
BioJava is applicable to nucleotide and peptide sequences and can be applied to entire genomes. STRAP cannot cope with single sequences as long as an entire chromosome. Instead, STRAP manipulates peptide sequences and 3D structures of the size of single proteins. Nevertheless, it can hold a high number of sequences and structures in memory. STRAP is designed for protein sequences but can read coding nucleotide files, which are then translated to peptide sequences.
STRAP is very fast since the graphical user interface must be highly responsive. BioJava is used where speed is less critical.
BioJava is well designed in terms of type safety, ontology and object design. BioJava uses objects for sequences, annotations and sequence positions. Even single amino acids or nucleotides are object references. To enhance speed, STRAP avoids frequent object instantiations and invocation of non-final object methods.
In BioJava, peptide sequences and nucleotide sequences are lists of symbols. The symbols can be retrieved one after the other with an iterator, or sub-sequences can be obtained. The advantages are that the entire sequence does not necessarily reside in memory and that programs are less susceptible to programming errors. Symbol objects are immutable elements of an alphabet. In STRAP, however, simple byte arrays are used for sequences and float arrays for coordinates. Besides speed, the low memory consumption is an important advantage of basic data types. Classes in STRAP expose internal data. Therefore, programmers might commit programming errors such as manipulating byte arrays directly instead of using the setter methods. Another disadvantage is that STRAP performs no checks on whether the characters in sequences are valid with respect to an underlying alphabet.
In BioJava, sequence positions are realized by the class Location. Discontiguous Location objects are composed of several contiguous RangeLocation objects or PointLocation objects. For the class StrapProtein, however, single residue positions are indicated by integer numbers between 0 and countResidues()-1. Multiple positions are given by boolean arrays. True at a given index means selected, whereas false means not selected.
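The trade-off between the two sequence representations can be sketched in plain Java. The enum alphabet, class name and method names below are hypothetical; they are neither BioJava's symbol API nor STRAP's classes, and only illustrate the idea of validated immutable symbols versus an unchecked byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: symbol-list versus byte-array sequence representations. */
final class SymbolListSketch {

    /** Hypothetical DNA alphabet; each enum constant is an immutable, shared symbol. */
    enum Nucleotide {
        A, C, G, T;

        /** Maps a character to a symbol, rejecting anything outside the alphabet. */
        static Nucleotide fromChar(char c) {
            switch (Character.toUpperCase(c)) {
                case 'A': return A;
                case 'C': return C;
                case 'G': return G;
                case 'T': return T;
                default:
                    throw new IllegalArgumentException("Not a DNA symbol: " + c);
            }
        }
    }

    /** Object-based style: every position is validated and the list cannot be modified. */
    static List<Nucleotide> toSymbols(String raw) {
        List<Nucleotide> symbols = new ArrayList<>(raw.length());
        for (char c : raw.toCharArray()) {
            symbols.add(Nucleotide.fromChar(c));
        }
        return Collections.unmodifiableList(symbols);
    }

    /** Primitive style: compact and fast, but any byte is accepted and callers can overwrite it. */
    static byte[] toBytes(String raw) {
        return raw.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(toSymbols("acgt"));
        System.out.println(toBytes("ACXT").length); // the invalid 'X' goes unnoticed here
        try {
            toSymbols("ACXT");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```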
BioJava throws exceptions when methods are invoked with invalid parameters. STRAP avoids the time-consuming creation of Throwable objects. Instead, errors in methods are indicated by the return values NaN, -1 or null. From the point of view of program design, however, Throwable objects are the cleaner choice.
In BioJava, a Sequence object is either a peptide sequence or a nucleotide sequence. A StrapProtein can hold both at the same time if a coding nucleotide sequence was read and translated into protein. Both the nucleotide sequence and the peptide sequence are contained in the same StrapProtein object. The coding or non-coding regions can be changed, and the peptide sequence changes accordingly.
Projects using BioJava
The following projects make use of BioJava.
Metabolic Pathway Builder: Software suite dedicated to the exploration of connections among genes, proteins, reactions and metabolic pathways.
DengueInfo: A Dengue genome information portal that uses BioJava in the middleware and talks to a biosql database.
Dazzle: A BioJava based DAS server.
BioSense: A plug-in for the InforSense Suite, an analytics software platform by IDBS that utilizes BioJava.
Bioclipse: A free, open source workbench for chemo- and bioinformatics with powerful editing and visualizing abilities for molecules, sequences, proteins, spectra, etc.
PROMPT: A free, open source framework and application for the comparison and mapping of protein sets. Uses BioJava for handling most input data formats.
Cytoscape: An open source bioinformatics software platform to visualize molecular interaction networks.
BioWeka: An open source biological data mining application.
Geneious: A molecular biology toolkit.
MassSieve: An open source application to analyze mass spec proteomics data.
Strap: A tool for multiple sequence alignment and sequence based structure alignment.
Jstacs: A Java framework for statistical analysis and classification of biological sequences.
jLSTM: "Long Short-Term Memory" for protein classification.
LaJolla: Structural alignment of RNA and proteins using an index structure for fast alignment of thousands of structures. Includes an easy to use command line interface. Open source at Sourceforge.
GenBeans: A rich client platform for bioinformatics primarily focused on molecular biology and sequence analysis.
References
Prlić A, Yates A, Bliven SE, et al. (October 2012). "BioJava:
an open-source framework for bioinformatics in 2012". Bioinformatics.
28 (20): 2693–5. doi:10.1093/bioinformatics/bts494.
PMC 3467744. PMID 22877863.
Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, et al.
(2008). "BioJava: an open-source framework for bioinformatics".
Bioinformatics. 24 (18): 2096–7. doi:10.1093/bioinformatics/btn397.
PMC 2530884. PMID 18689808.
VS Matha and P Kangueane, 2009, Bioinformatics: a concept-based
introduction, 2009. p26
Hanson, R.M. (2010) Jmol a paradigm shift in crystallographic
Kim T, Tyndel MS, Huang H, et al. (March 2012). "MUSI: an integrated
system for identifying multiple specificity from very large peptide or
nucleic acid data sets". Nucleic Acids Res. 40 (6): e47.
doi:10.1093/nar/gkr1294. PMC 3315295.
Paterson T, Law A (November 2012). "JEnsembl: a version-aware Java
API to Ensembl data systems". Bioinformatics. 28 (21): 2724–31.
doi:10.1093/bioinformatics/bts525. PMC 3476335.
Zajac P, Pettersson E, Gry M, Lundeberg J, Ahmadian A (February
2008). "Expression profiling of signature gene sets with trinucleotide
threading". Genomics. 91 (2): 209–17.
doi:10.1016/j.ygeno.2007.10.012. PMID 18061398.
Vernikos GS, Parkhill J (February 2008). "Resolving the structural
features of genomic islands: a machine learning approach". Genome Res.
18 (2): 331–42. doi:10.1101/gr.7004508. PMC 2203631.
Gront D, Kolinski A (February 2008). "Utility library for structural
bioinformatics". Bioinformatics. 24 (4): 584–5.
doi:10.1093/bioinformatics/btm627. PMID 18227118.
Mangalam H (2002). "The Bio* toolkits--a brief overview".
Briefings in Bioinformatics. 3 (3): 296–302.
doi:10.1093/bib/3.3.296. PMID 12230038.
Cock PJ, Antao T, Chang JT, et al. (June 2009). "Biopython: freely
available Python tools for computational molecular biology and
bioinformatics". Bioinformatics. 25 (11): 1422–3.
doi:10.1093/bioinformatics/btp163. PMC 2682512.
Stajich JE, Block D, Boulez K, et al. (October 2002). "The Bioperl