The Info List - BioJava





BioJava is an open-source software project dedicated to providing Java tools for processing biological data. It is a set of library functions written in the Java programming language for manipulating sequences and protein structures, and it also provides file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS) support, access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a wide range of data, from DNA and protein sequences up to 3D protein structures. The BioJava libraries are useful for automating many routine bioinformatics tasks, such as parsing a Protein Data Bank (PDB) file or interacting with Jmol. This application programming interface (API) provides file parsers, data models and algorithms that facilitate working with standard data formats and enable rapid application development and analysis. The libraries have also been used to develop various extended analysis tools, for example:

* MUSI: an integrated system to identify multiple specificity from very large peptide or nucleic acid data sets.
* JEnsembl: a version-aware Java API to Ensembl data systems.
* Expression profiling of signature gene sets with trinucleotide threading
* Resolving the structural features of genomic islands: a machine learning approach
* Utility library for structural bioinformatics

The BioJava project grew out of work by Thomas Down and Matthew Pocock to create an API to simplify the development of Java-based bioinformatics tools. BioJava is an active open-source project that has been developed over more than 12 years by more than 60 developers. It is one of a number of Bio* projects designed to reduce code duplication; others include BioPython, BioPerl, BioRuby and EMBOSS.

Version 3.0.5 was a major update to the prior versions and consists of several independent modules. The old code base has been moved to a separate project called BioJava-legacy.

CONTENTS

* 1 Features
* 2 History and publications
* 3 Modules
  * 3.1 Core Module
  * 3.2 Protein structure modules
  * 3.3 Genome and Sequencing modules
  * 3.4 Alignment module
  * 3.5 ModFinder module
  * 3.6 Amino acid properties module
  * 3.7 Protein disorder module
    * 3.7.1 Making library function calls
    * 3.7.2 Using command line
  * 3.8 Web service access module
* 4 Comparisons with other alternatives
* 5 Projects using BioJava
* 6 See also
* 7 External links
* 8 References

FEATURES

BioJava provides software modules for many of the typical tasks of bioinformatics programming. These include:

* Accessing nucleotide and peptide sequence data from local and remote databases
* Transforming formats of database/file records
* Protein structure parsing and manipulation
* Manipulating individual sequences
* Searching for similar sequences
* Creating and manipulating sequence alignments

HISTORY AND PUBLICATIONS

In 2008, BioJava's first application note was published. The project was migrated from its original CVS repository to GitHub in April 2013.

In October 2012, the most recent paper on BioJava was published. As of November 2012, Google Scholar counts more than 130 citations.

MODULES

During 2014-2015, large parts of the original code base were rewritten. BioJava 3 is a clear departure from the version 1 series. It now consists of several independent modules built using the build automation tool Apache Maven. These modules provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detecting protein modifications, predicting disordered regions in proteins, and parsers for common file formats using a biologically meaningful data model. The original code has been moved into a separate BioJava legacy project, which is still available for backward compatibility.

The following sections will describe several of the new modules and highlight some of the new features that are included in the latest version of BioJava.

CORE MODULE

This module provides Java classes to model amino acid or nucleotide sequences. The classes were designed so that the names are familiar and make sense to biologists and also provide a concrete representation of the steps in going from a gene sequence to a protein sequence for computer scientists and programmers.

A major change between the legacy BioJava project and BioJava 3 lies in the way the framework has been designed to exploit then-new features of Java. A sequence is defined as a generic interface, allowing the rest of the modules to create utilities that operate on all sequences. Specific classes for common sequences such as DNA and proteins have been defined in order to improve usability for biologists. The translation engine builds on this work by allowing conversions between DNA, RNA and amino acid sequences. This engine can handle details such as choosing the codon table, converting start codons to methionine, trimming stop codons, specifying the reading frame and handling ambiguous sequences.
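To make the translation options above concrete, here is a minimal, self-contained sketch in plain Java. It does not use BioJava's API: the class `TranslationSketch` and its `translate` method are hypothetical names, only the standard genetic code (NCBI translation table 1) is assumed, and any codon containing a letter other than A, C, G or T is simply reported as `X`.

```java
// Illustrative sketch only; this is not BioJava's translation engine.
// TranslationSketch and translate(...) are hypothetical names.
class TranslationSketch {

    // Standard genetic code (NCBI table 1), codons ordered by base T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /**
     * Translates a DNA string in the given reading frame (0, 1 or 2).
     * Codons with ambiguous bases become 'X'; a trailing stop ('*') is
     * removed when trimStop is true.
     */
    static String translate(String dna, int frame, boolean trimStop) {
        String seq = dna.toUpperCase();
        StringBuilder protein = new StringBuilder();
        for (int i = frame; i + 3 <= seq.length(); i += 3) {
            int b1 = BASES.indexOf(seq.charAt(i));
            int b2 = BASES.indexOf(seq.charAt(i + 1));
            int b3 = BASES.indexOf(seq.charAt(i + 2));
            boolean ambiguous = b1 < 0 || b2 < 0 || b3 < 0;
            protein.append(ambiguous ? 'X' : AMINO_ACIDS.charAt(16 * b1 + 4 * b2 + b3));
        }
        int last = protein.length() - 1;
        if (trimStop && last >= 0 && protein.charAt(last) == '*') {
            protein.setLength(last);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", 0, true));   // frame 0, stop trimmed
        System.out.println(translate("CATGGCCTAA", 1, false)); // frame 1, stop kept
    }
}
```

A real engine would additionally let the caller pick a different codon table and decide how start codons are handled; both are left out here to keep the sketch short.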

Special attention has been paid to designing the storage of sequences to minimize space needs. Design patterns such as the Proxy pattern allowed the developers to create the framework such that sequences can be stored in memory, fetched on demand from a web service such as UniProt, or read from a FASTA file as needed. The latter two approaches save memory by not loading sequence data until it is referenced in the application. This concept can be extended to handle very large genomic datasets, such as NCBI GenBank or a proprietary database.
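The lazy-loading idea behind the Proxy pattern can be sketched in a few lines of plain Java. This is not BioJava's implementation; `SequenceReader`, `LazySequenceProxy` and the loader are hypothetical names, and the loader stands in for whatever actually reads a FASTA file or calls a web service.

```java
import java.util.function.Supplier;

// Illustrative sketch of lazy loading behind a proxy; all names are hypothetical.
class LazySequenceSketch {

    /** Minimal read-only view of a sequence. */
    interface SequenceReader {
        String getSequenceAsString();
    }

    /** Proxy that defers loading until the sequence is first requested. */
    static final class LazySequenceProxy implements SequenceReader {
        private final Supplier<String> loader; // e.g. reads a file or calls a web service
        private String cached;                 // stays null until the first access

        LazySequenceProxy(Supplier<String> loader) {
            this.loader = loader;
        }

        @Override
        public synchronized String getSequenceAsString() {
            if (cached == null) {
                cached = loader.get(); // the expensive step happens only once
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        int[] loads = {0};
        SequenceReader seq = new LazySequenceProxy(() -> {
            loads[0]++; // stands in for an expensive fetch
            return "MKTAYIAKQR";
        });
        System.out.println("loads before access: " + loads[0]);
        System.out.println(seq.getSequenceAsString());
        System.out.println(seq.getSequenceAsString());
        System.out.println("loads after two reads: " + loads[0]);
    }
}
```

Code that only needs the `SequenceReader` interface cannot tell whether the data was already in memory or fetched on first use, which is what lets the storage strategy vary without changing callers.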

PROTEIN STRUCTURE MODULES

Figure: Two protein chains with IDs "4hhb.A" and "4hhb.B" aligned against each other using the FATCAT rigid algorithm. The display is produced with the BioJava libraries, which in turn use the Jmol viewer.

The protein structure modules provide tools to represent and manipulate 3D biomolecular structures. They focus on protein structure comparison.

The following algorithms have been implemented and included in BioJava.

* FATCAT algorithm for flexible and rigid body alignment. * The standard Combinatorial Extension (CE) algorithm. * A new version of CE that can detect circular permutations in proteins.

These algorithms are used to provide the RCSB Protein Data Bank (PDB) Protein Comparison Tool as well as systematic comparisons of all proteins in the PDB on a weekly basis.

Parsers for the PDB and mmCIF file formats allow structure data to be loaded into a reusable data model. This feature is used by the SIFTS project to map between UniProt sequences and PDB structures. Information from the RCSB PDB can be fetched dynamically without the need to download data manually. For visualization, an interface to the 3D viewer Jmol (http://www.jmol.org/) is provided. The team claims that work is underway to improve interaction with the RCSB PDB viewers.

Below is an outline of the code that initializes a window displaying and comparing two protein chains. This is only an outline: to make it work, one needs to import the correct classes, found in the "org.biojava.bio.structure" package, and handle exceptions with a try-catch block.

    String name1 = "4hhb.A";
    String name2 = "4hhb.B";

    // Load both chains through the atom cache
    AtomCache cache = new AtomCache();
    Structure structure1 = cache.getStructure(name1);
    Structure structure2 = cache.getStructure(name2);

    // The alignment works on the C-alpha atoms of each chain
    Atom[] ca1 = StructureTools.getAtomCAArray(structure1);
    Atom[] ca2 = StructureTools.getAtomCAArray(structure2);

    // Run the FATCAT rigid-body alignment with default parameters
    StructureAlignment algorithm = StructureAlignmentFactory.getAlgorithm(FatCatRigid.algorithmName);
    FatCatParameters params = new FatCatParameters();
    AFPChain afpChain = algorithm.align(ca1, ca2, params);
    afpChain.setName1(name1);
    afpChain.setName2(name2);

    // Show the superimposed chains in a Jmol window
    StructureAlignmentDisplay.display(afpChain, ca1, ca2);

The code aligns the two protein chains "4hhb.A" and "4hhb.B" using the FATCAT rigid algorithm.
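Separately from the alignment outline, the kind of fixed-column record that a PDB parser has to read can be sketched in plain Java. This is not BioJava's parser; `PdbAtomSketch` and its members are hypothetical names, the sample values are made up, and only the ATOM record columns for atom name, residue name, chain, residue number and coordinates are handled.

```java
import java.util.Locale;

// Illustrative sketch; not BioJava's PDB parser. All names are hypothetical.
class PdbAtomSketch {

    /** Identity and coordinates of one atom, taken from a PDB ATOM record. */
    static final class AtomRecord {
        final String atomName;
        final String residueName;
        final char chainId;
        final int residueNumber;
        final double x, y, z;

        AtomRecord(String atomName, String residueName, char chainId,
                   int residueNumber, double x, double y, double z) {
            this.atomName = atomName;
            this.residueName = residueName;
            this.chainId = chainId;
            this.residueNumber = residueNumber;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Parses a fixed-column ATOM line; returns null for other or short lines. */
    static AtomRecord parseAtomLine(String line) {
        if (line.length() < 54 || !line.startsWith("ATOM")) {
            return null;
        }
        return new AtomRecord(
                line.substring(12, 16).trim(),                      // columns 13-16: atom name
                line.substring(17, 20).trim(),                      // columns 18-20: residue name
                line.charAt(21),                                    // column 22: chain identifier
                Integer.parseInt(line.substring(22, 26).trim()),    // columns 23-26: residue number
                Double.parseDouble(line.substring(30, 38).trim()),  // columns 31-38: x
                Double.parseDouble(line.substring(38, 46).trim()),  // columns 39-46: y
                Double.parseDouble(line.substring(46, 54).trim())); // columns 47-54: z
    }

    /** Builds a made-up sample ATOM line so that the column widths are explicit. */
    static String sampleLine() {
        return String.format(Locale.ROOT, "ATOM  %5d %-4s %3s %c%4d    %8.3f%8.3f%8.3f",
                2, " CA", "VAL", 'A', 1, 6.204, 16.869, 4.854);
    }

    public static void main(String[] args) {
        AtomRecord atom = parseAtomLine(sampleLine());
        System.out.println(atom.atomName + " " + atom.residueName + " "
                + atom.chainId + atom.residueNumber + " "
                + atom.x + " " + atom.y + " " + atom.z);
    }
}
```

Selecting only the records whose atom name is `CA` from such a parse gives the per-chain atom arrays that structure alignment algorithms typically work on.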

GENOME AND SEQUENCING MODULES

This module focuses on creating gene sequence objects from the core module. It does so by parsing the following popular standard file formats generated by open-source gene prediction applications:

* GTF files generated by GeneMark
* GFF2 files generated by GeneID
* GFF3 files generated by Glimmer

The gene sequence objects can then be written out in GFF3 format and imported into GMOD. These file formats are well defined, but what gets written in the file is very flexible.

The following code example takes a 454scaffold file that was used by GeneMark to predict genes and returns a collection of ChromosomeSequences. Each chromosome sequence maps to a named entry in the FASTA file and contains N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts.

Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences returns all protein sequences. You can then write the protein sequences to a FASTA file.

    // Load the scaffolds and attach the genes predicted by GeneMark
    LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList =
            GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(
                    new File("454Scaffolds.fna"), new File("genemark_hmm.gtf"));

    // Translate every predicted gene and write the proteins to a FASTA file
    LinkedHashMap<String, ProteinSequence> proteinSequenceList =
            GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values());
    FastaWriterHelper.writeProteinSequence(
            new File("genemark_proteins.faa"), proteinSequenceList.values());

You can also output the gene sequences to a FASTA file in which the coding regions are upper case and the non-coding regions are lower case:

    LinkedHashMap<String, GeneSequence> geneSequenceHashMap =
            GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values());
    Collection<GeneSequence> geneSequences = geneSequenceHashMap.values();
    FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true);

You can easily write out a GFF3 view of a ChromosomeSequence with the following code.

    FileOutputStream fo = new FileOutputStream("genemark.gff3");
    GFF3Writer gff3Writer = new GFF3Writer();
    gff3Writer.write(fo, chromosomeSequenceList);
    fo.close();
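For orientation, a GFF3 feature line consists of nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below parses one such line in plain Java; it is not BioJava's GFF3 reader, the names `Gff3LineSketch` and `Feature` are hypothetical, the sample line is made up, and URL-escaped attribute values are not decoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch; not BioJava's GFF3 support. All names are hypothetical.
class Gff3LineSketch {

    /** A subset of the information carried by one GFF3 feature line. */
    static final class Feature {
        final String seqId;
        final String source;
        final String type;
        final int start;   // 1-based, inclusive
        final int end;     // 1-based, inclusive
        final char strand; // '+', '-', '.' or '?'
        final Map<String, String> attributes;

        Feature(String seqId, String source, String type, int start, int end,
                char strand, Map<String, String> attributes) {
            this.seqId = seqId;
            this.source = source;
            this.type = type;
            this.start = start;
            this.end = end;
            this.strand = strand;
            this.attributes = attributes;
        }
    }

    /** Parses a single feature line with nine tab-separated columns. */
    static Feature parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("expected 9 tab-separated columns");
        }
        // Column 9 holds semicolon-separated key=value pairs
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String pair : cols[8].split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return new Feature(cols[0], cols[1], cols[2],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]),
                cols[6].charAt(0), attributes);
    }

    public static void main(String[] args) {
        // Made-up example line
        Feature f = parse("scaffold1\tGeneMark\tgene\t100\t400\t.\t+\t.\tID=gene1;Name=example");
        System.out.println(f.type + " " + f.start + ".." + f.end + " " + f.strand
                + " " + f.attributes);
    }
}
```

The free-form attributes column is where most of the flexibility mentioned earlier comes from: different tools put different keys there, so a parser has to treat it as an open-ended map.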

To support input and output for several common variants of the FASTQ file format produced by next-generation sequencers, a separate sequencing module is provided. It is called the Sequence Module and is contained in the package org.biojava3.sequencing.io.fastq.
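As a rough illustration of what reading FASTQ involves, the following plain-Java sketch parses four-line records and decodes quality characters. It is not the BioJava sequencing module; `FastqSketch` and its members are hypothetical names, records are assumed to span exactly four lines, and qualities are assumed to use the Sanger variant (ASCII offset 33). Other FASTQ variants use a different offset or scale.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; not the BioJava sequencing module. Names are hypothetical.
class FastqSketch {

    /** One FASTQ record: identifier, bases and the raw quality string. */
    static final class Record {
        final String id;
        final String sequence;
        final String quality;

        Record(String id, String sequence, String quality) {
            this.id = id;
            this.sequence = sequence;
            this.quality = quality;
        }

        /** Decodes the quality string assuming the Sanger variant (offset 33). */
        int[] phredScores() {
            int[] scores = new int[quality.length()];
            for (int i = 0; i < scores.length; i++) {
                scores[i] = quality.charAt(i) - 33;
            }
            return scores;
        }
    }

    /** Parses text made of four-line records: @id, bases, +, qualities. */
    static List<Record> parse(String text) {
        List<Record> records = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return records;
        }
        String[] lines = trimmed.split("\r?\n");
        if (lines.length % 4 != 0) {
            throw new IllegalArgumentException("expected four lines per record");
        }
        for (int i = 0; i < lines.length; i += 4) {
            String header = lines[i];
            String bases = lines[i + 1];
            String separator = lines[i + 2];
            String quality = lines[i + 3];
            if (!header.startsWith("@") || !separator.startsWith("+")
                    || bases.length() != quality.length()) {
                throw new IllegalArgumentException("malformed record near: " + header);
            }
            records.add(new Record(header.substring(1), bases, quality));
        }
        return records;
    }

    public static void main(String[] args) {
        List<Record> records = parse("@read1\nACGT\n+\nII5!\n");
        Record first = records.get(0);
        System.out.println(first.id + " " + first.sequence + " "
                + java.util.Arrays.toString(first.phredScores()));
    }
}
```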

Work is in progress towards providing a complete set of Java classes for converting between different file formats; the list of supported gene prediction applications and genome browsers will grow based on end-user requests.

ALIGNMENT MODULE

This module contains several classes and methods that allow users to perform pairwise and multiple sequence alignment.

PAIRWISE SEQUENCE ALIGNMENT

For optimal global alignment, BioJava implements the Needleman–Wunsch algorithm; for local alignments, the Smith–Waterman algorithm has been implemented. The outputs of both local and global alignments are available in standard formats.

An example of how to use the libraries is shown below.

    protected void align(String uniProtID_1, String uniProtID_2,
            PairwiseSequenceAlignerType alignmentType) throws Exception {
        // Download both sequences from UniProt in FASTA format
        ProteinSequence proteinSeq1 = FastaReaderHelper.readFastaProteinSequence(
                (new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtID_1))).openStream())
                .get(uniProtID_1);
        ProteinSequence proteinSeq2 = FastaReaderHelper.readFastaProteinSequence(
                (new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtID_2))).openStream())
                .get(uniProtID_2);

        // Align them with the requested algorithm and print the result
        SequencePair<ProteinSequence, AminoAcidCompound> result = Alignments.getPairwiseAlignment(
                proteinSeq1, proteinSeq2, alignmentType,
                new SimpleGapPenalty(), new SimpleSubstitutionMatrix<AminoAcidCompound>());
        System.out.println(result.toString());
    }

An example call to the above function would look something like this:

FOR GLOBAL ALIGNMENT

align("Q21691", "Q21495", PairwiseSequenceAlignerType.GLOBAL);

FOR LOCAL ALIGNMENT

align("Q21691", "Q21495", PairwiseSequenceAlignerType.LOCAL);

In addition to these two algorithms, there is an implementation of the Guan–Uberbacher algorithm, which performs global sequence alignment very efficiently because it uses only linear memory.
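To illustrate why memory use matters, the sketch below computes a Needleman–Wunsch global alignment score while keeping only two rows of the dynamic programming table. It is plain Java, not BioJava code, and it is not the Guan–Uberbacher algorithm: it returns only the score, not the alignment itself. The class name `GlobalScoreSketch`, the simple match/mismatch scoring and the linear gap penalty are all assumptions made for brevity.

```java
// Illustrative sketch; not BioJava's aligner and not the Guan-Uberbacher algorithm.
// GlobalScoreSketch and score(...) are hypothetical names.
class GlobalScoreSketch {

    /**
     * Returns the optimal global alignment score of a and b using
     * memory proportional to the length of b (two rows of the table).
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // First row: aligning a prefix of b against nothing costs only gaps
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * gap;
            for (int j = 1; j <= b.length(); j++) {
                int diagonal = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = prev[j] + gap;       // gap in b
                int left = curr[j - 1] + gap; // gap in a
                curr[j] = Math.max(diagonal, Math.max(up, left));
            }
            // Reuse the two arrays instead of allocating a full matrix
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(score("GATTACA", "GCATGCU", 1, -1, -1));
    }
}
```

Recovering the alignment itself in linear space requires extra work (for example a divide-and-conquer strategy), which is the kind of problem a dedicated linear-memory algorithm addresses.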

For MULTIPLE SEQUENCE ALIGNMENT, any of the methods discussed above can be used to progressively perform a multiple sequence alignment.

MODFINDER MODULE

Figure: An example application using the ModFinder module and the protein structure module. Protein modifications are mapped onto the sequence and structure of ferredoxin I (PDB ID 1GAO). Two possible iron–sulfur clusters are shown on the protein sequence (3Fe–4S (F3S): orange triangles/lines; 4Fe–4S (SF4): purple diamonds/lines). The 4Fe–4S cluster is displayed in the Jmol structure window above the sequence display.

The ModFinder module provides new methods to identify and classify protein modifications in protein 3D structures. Over 400 different types of protein modifications, such as phosphorylation, glycosylation, disulfide bonds and metal chelation, were collected and curated based on annotations in PSI-MOD, RESID and RCSB PDB. The module also provides an API for detecting protein modifications within protein structures.

EXAMPLE: IDENTIFY AND PRINT ALL PRELOADED MODIFICATIONS FROM A STRUCTURE

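    Set<ModifiedCompound> identifyAllModfications(Structure struc) {
        ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
        // With no filter, every preloaded modification type is searched for
        parser.identify(struc);
        Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
        return mcs;
    }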

EXAMPLE: IDENTIFY PHOSPHORYLATION SITES IN A STRUCTURE

    List<ResidueNumber> identifyPhosphosites(Structure struc) {
        List<ResidueNumber> phosphosites = new ArrayList<ResidueNumber>();
        ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
        // Restrict the search to modifications tagged with the "phosphoprotein" keyword
        parser.identify(struc, ProteinModificationRegistry.getByKeyword("phosphoprotein"));
        Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
        for (ModifiedCompound mc : mcs) {
            // Collect the residue number of every group involved in the modification
            Set<StructureGroup> groups = mc.getGroups(true);
            for (StructureGroup group : groups) {
                phosphosites.add(group.getPDBResidueNumber());
            }
        }
        return phosphosites;
    }

DEMO CODE TO RUN THE ABOVE METHODS

    import java.util.List;
    import java.util.Set;

    import org.biojava.bio.structure.ResidueNumber;
    import org.biojava.bio.structure.Structure;
    import org.biojava.bio.structure.io.PDBFileReader;
    import org.biojava3.protmod.structure.ModifiedCompound;
    import org.biojava3.protmod.structure.ProteinModificationIdentifier;

    public static void main(String[] args) {
        try {
            PDBFileReader reader = new PDBFileReader();
            reader.setAutoFetch(true);

            // identify all modifications from PDB:1CAD and print them
            String pdbId = "1CAD";
            Structure struc = reader.getStructureById(pdbId);
            Set<ModifiedCompound> mcs = identifyAllModfications(struc);
            for (ModifiedCompound mc : mcs) {
                System.out.println(mc.toString());
            }

            // identify all phosphosites from PDB:3MVJ and print them
            pdbId = "3MVJ";
            struc = reader.getStructureById(pdbId);
            List<ResidueNumber> psites = identifyPhosphosites(struc);
            for (ResidueNumber psite : psites) {
                System.out.println(psite.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

There are plans to include further protein modifications by integrating other resources such as UniProt.

AMINO ACID PROPERTIES MODULE

This module attempts to provide accurate physicochemical properties of proteins. The properties that can be calculated using this module are as follows:

* Molecular mass
* Extinction coefficient
* Instability index
* Aliphatic index
* Grand average of hydropathy
* Isoelectric point
* Amino acid composition

The precise molecular weights of common isotopically labelled amino acids are included in this module. New amino acid molecules, with their molecular weights, can also be defined using simple XML configuration files. This can be useful where the precise mass is of high importance, such as in mass spectrometry experiments.
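As an example of how one of the listed properties is derived, the grand average of hydropathy (GRAVY) is the sum of per-residue hydropathy values divided by the sequence length. The sketch below is plain Java rather than the BioJava module; the class `GravySketch` is a hypothetical name, and the Kyte–Doolittle hydropathy scale is assumed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; not the BioJava amino acid properties module.
// GravySketch is a hypothetical name; the Kyte-Doolittle scale is assumed.
class GravySketch {

    private static final Map<Character, Double> HYDROPATHY = new HashMap<>();

    static {
        String residues = "ARNDCQEGHILKMFPSTWYV";
        double[] values = {
            1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
            3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2
        };
        for (int i = 0; i < residues.length(); i++) {
            HYDROPATHY.put(residues.charAt(i), values[i]);
        }
    }

    /** Returns the mean hydropathy over all residues of the sequence. */
    static double gravy(String protein) {
        if (protein.isEmpty()) {
            throw new IllegalArgumentException("empty sequence");
        }
        double sum = 0.0;
        for (char residue : protein.toUpperCase().toCharArray()) {
            Double value = HYDROPATHY.get(residue);
            if (value == null) {
                throw new IllegalArgumentException("unsupported residue: " + residue);
            }
            sum += value;
        }
        return sum / protein.length();
    }

    public static void main(String[] args) {
        System.out.println(gravy("MKTAYIAKQR"));
    }
}
```

Positive values indicate a more hydrophobic sequence on this scale, negative values a more hydrophilic one.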

PROTEIN DISORDER MODULE

The goal of this module is to give users ways to find disordered regions in protein molecules. BioJava includes a Java implementation of the RONN predictor. BioJava 3.0.5 makes use of Java's support for multithreading to improve performance by up to 3.2 times on a modern quad-core machine, compared with the legacy C implementation.

There are two ways to use this module:

* Using library function calls
* Using command line

Making Library Function Calls

The following examples show how to use the module and make function calls to get information about protein disorder. The first two examples make library function calls to calculate the probability of disorder for every residue in the sequence provided.

The third and fourth examples demonstrate how easily one can get the disordered regions of the protein.

EXAMPLE 1: CALCULATE THE PROBABILITY OF DISORDER FOR EVERY RESIDUE IN THE SEQUENCE

    FastaSequence fsequence = new FastaSequence("name",
            "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC"
            + "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN");
    float[] rawProbabilityScores = Jronn.getDisorderScores(fsequence);

EXAMPLE 2: CALCULATE THE PROBABILITY OF DISORDER FOR EVERY RESIDUE IN THE SEQUENCE FOR ALL PROTEINS FROM THE FASTA INPUT FILE

    final List<FastaSequence> sequences =
            SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
    Map<FastaSequence, float[]> rawProbabilityScores = Jronn.getDisorderScores(sequences);

EXAMPLE 3: GET THE DISORDERED REGIONS OF THE PROTEIN FOR A SINGLE PROTEIN SEQUENCE

    FastaSequence fsequence = new FastaSequence("Prot1",
            "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC"
            + "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN"
            + "CQIIFEGRNAPERADPMWTGGLNKHIIARGHFFQSNKFHFLERKFCEMAEIERPNFTCRTLDCQKFPWDDP");
    Range[] ranges = Jronn.getDisorder(fsequence);

EXAMPLE 4: CALCULATE THE DISORDERED REGIONS FOR THE PROTEINS FROM FASTA FILE

    final List<FastaSequence> sequences =
            SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
    Map<FastaSequence, Range[]> ranges = Jronn.getDisorder(sequences);
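The relationship between per-residue scores and disordered regions can be sketched as a simple thresholding step. The plain-Java code below is only an illustration under stated assumptions, not how Jronn is implemented: `DisorderRangeSketch` is a hypothetical name, a residue is treated as disordered when its score is greater than or equal to the threshold, and positions are 0-based and inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; not Jronn's implementation. Names are hypothetical.
class DisorderRangeSketch {

    /**
     * Groups consecutive residues whose score reaches the threshold into
     * ranges, each returned as {firstIndex, lastIndex} (0-based, inclusive).
     */
    static List<int[]> disorderedRanges(float[] scores, float threshold) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means no range is currently open
        for (int i = 0; i < scores.length; i++) {
            boolean disordered = scores[i] >= threshold;
            if (disordered && start < 0) {
                start = i;                            // open a new range
            } else if (!disordered && start >= 0) {
                ranges.add(new int[] {start, i - 1}); // close the open range
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, scores.length - 1});
        }
        return ranges;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.6f, 0.7f, 0.2f, 0.9f};
        for (int[] range : disorderedRanges(scores, 0.5f)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```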

Using Command Line

The BioJava module biojava3-protein-disorder can be compiled into a single executable JAR file and run using the following command.

java -jar

OPTIONS SUPPORTED BY THE COMMAND LINE EXECUTABLE

    JRONN version 3.1b usage 1 August 2011:
    java -jar JRONN_JAR_NAME -i=inputfile

    Where -i=input file
        Input file can contain one or more FASTA formatted sequences.

    All OPTIONS are optional
    OPTION DETAILED DESCRIPTION:
    -o  full path to the output file, if not specified standard out is used
    -d  the value of disorder, defaults to 0.5
    -f  output format, V for vertical, where the letters of the sequence and
        corresponding disorder values are output in two column layout.
        H for horizontal, where the disorder values are provided under the
        letters of the sequence. Letters and values separated by tabulation
        in this case. Defaults to V.
    -s  the file name to write execution statistics to.
    -n  the number of threads to use. Defaults to the number of cores
        available on the computer. n=1 mean sequential processing.
        Valid values are 1 < n < (2 x num_of_cores)
        Default value will give the best performance.

EXAMPLES

Predict disorder values for the sequences in the input file /home/input.fasta and write the results to standard out. Use the default disorder value and utilise all CPUs available on the computer.

java -jar JRONN.JAR -i=/home/input.fasta

Predict disorder values for the sequences in the input file /home/input.fasta, write the results in horizontal layout to /home/jronn.out, set the disorder value to 0.6 and limit the number of threads to two.

java -jar JRONN.JAR -i=/home/input.fasta -o=/home/jronn.out -d=0.6 -n=2 -f=H

The arguments can be provided in any order.

WEB SERVICE ACCESS MODULE

Web-based tools are gaining popularity in bioinformatics. The web service module allows bioinformatics services to be accessed using REST protocols. Currently, two services are implemented: NCBI BLAST through the BLAST URL API (previously known as QBlast) and the HMMER web service.
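To give a feel for what a REST-style request to such a service looks like, the sketch below only builds request URLs and makes no network call. It is plain Java and not the BioJava web service module; `BlastUrlSketch` and its methods are hypothetical names. The endpoint and the parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `RID`, `FORMAT_TYPE`) reflect the NCBI BLAST URL API as commonly documented, but they are assumptions here and should be checked against the current NCBI documentation before use.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch; not the BioJava web service module. Names are hypothetical,
// and the endpoint and parameter names are assumptions to verify against NCBI docs.
class BlastUrlSketch {

    static final String ENDPOINT = "https://blast.ncbi.nlm.nih.gov/Blast.cgi";

    /** Builds the URL that would submit a search (CMD=Put). */
    static String submitUrl(String program, String database, String query) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("CMD", "Put");
        params.put("PROGRAM", program);
        params.put("DATABASE", database);
        params.put("QUERY", query);
        return ENDPOINT + "?" + encode(params);
    }

    /** Builds the URL that would poll for results of a submitted request (CMD=Get). */
    static String resultUrl(String requestId) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("CMD", "Get");
        params.put("FORMAT_TYPE", "XML");
        params.put("RID", requestId);
        return ENDPOINT + "?" + encode(params);
    }

    /** Joins key=value pairs with '&', URL-encoding every value. */
    private static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            try {
                sb.append(entry.getKey()).append('=')
                  .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always supported", e);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(submitUrl("blastp", "nr", "MKTAYIAKQR"));
        System.out.println(resultUrl("EXAMPLE_RID")); // placeholder request ID
    }
}
```

A submission returns a request ID, and the client then polls with that ID until results are ready; wrapping this two-step exchange is the kind of work a web service access layer takes over.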

COMPARISONS WITH OTHER ALTERNATIVES

The need for customized software in the field of bioinformatics has been addressed by several groups and individuals. Similar to BioJava, open-source software projects such as BioPerl, BioPython and BioRuby all provide toolkits with a range of functionality that makes it easier to create customized pipelines or analyses.

As the names suggest, the projects mentioned above use different programming languages. All of these APIs offer similar tools, so on what criteria should one base a choice? For programmers who are experienced in only one of these languages, the choice is straightforward. However, for a well-rounded bioinformatician who knows all of these languages and wants to choose the best one for a job, the choice can be made based on the following guidelines, given by a software review of the Bio* toolkits.

In general,

.