Stockholm format

Stockholm format is a multiple sequence alignment format used by Pfam, Rfam and Dfam to disseminate protein, RNA and DNA sequence alignments. The alignment editors Ralee and Jalview support Stockholm format, as do the probabilistic database search tools Infernal and HMMER, and the phylogenetic analysis tool Xrate. Stockholm format files often have the filename extension .sto or .stk.


Syntax

A well-formed Stockholm file always contains a header that states the format and version identifier, currently "# STOCKHOLM 1.0". The header is followed by multiple lines, a mix of markup (starting with "#") and sequences. Finally, the "//" line indicates the end of the alignment. An example without markup looks like:
# STOCKHOLM 1.0
#=GF ID   EXAMPLE
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//
Sequences are written one per line. The sequence name is written first, followed by any amount of whitespace and then the sequence. Sequence names are typically in the form "name/start-end" or just "name". Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-". Mark-up lines start with "#". The "parameters" are separated by whitespace, so an underscore ("_") should be used instead of a space in the 1-char-per-column markups. Mark-up types defined include:
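#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>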
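
The syntax rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by Pfam, Rfam, HMMER or any other tool: the function name `parse_stockholm` and the returned dictionary layout are assumptions made for this example, and it assumes a single, non-interleaved alignment in which each sequence and each #=GC or #=GR label appears on one line only.

```python
def parse_stockholm(text):
    """Parse one non-interleaved Stockholm alignment into plain dictionaries.

    Hypothetical helper for illustration; the returned layout is an assumption.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM 1."):
        raise ValueError("missing '# STOCKHOLM 1.0' header line")

    aln = {"GF": {}, "GC": {}, "GS": {}, "GR": {}, "seqs": {}}

    def fields(line, count):
        # Split on whitespace into exactly `count` fields, padding with "".
        parts = line.split(None, count - 1)
        return parts + [""] * (count - len(parts))

    def store_once(table, key, value):
        # This sketch rejects repeated labels instead of merging them.
        if key in table:
            raise ValueError(f"duplicate entry: {key}")
        table[key] = value

    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue
        if line == "//":
            break  # end of the alignment
        if line.startswith("#=GF"):
            _, feature, value = fields(line, 3)
            # Free-text features such as CC or RT may span several lines.
            aln["GF"].setdefault(feature, []).append(value)
        elif line.startswith("#=GC"):
            _, feature, value = fields(line, 3)
            store_once(aln["GC"], feature, value)
        elif line.startswith("#=GS"):
            _, name, feature, value = fields(line, 4)
            aln["GS"].setdefault(name, {}).setdefault(feature, []).append(value)
        elif line.startswith("#=GR"):
            _, name, feature, value = fields(line, 4)
            store_once(aln["GR"].setdefault(name, {}), feature, value)
        elif line.startswith("#"):
            continue  # any other '#' line is ignored in this sketch
        else:
            name, residues = fields(line, 2)
            store_once(aln["seqs"], name, residues)
    return aln
```

Calling `parse_stockholm(open(path).read())` on a small file returns the sequences under `"seqs"` in file order and the four markup types under their two-letter keys.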


Recommended features

These feature names are used by Pfam and Rfam for specific types of annotation. (See the Pfam and the Rfam documentation under "Description of fields".)


#=GF

Pfam and Rfam may use the following tags:
   Compulsory fields:
   ------------------
   AC   Accession number:           Accession number in form PFxxxxx (Pfam) or RFxxxxx (Rfam).
   ID   Identification:             One word name for family.
   DE   Definition:                 Short description of family.
   AU   Author:                     Authors of the entry.
   SE   Source of seed:             The source suggesting the seed members belong to one family.
   SS   Source of structure:        The source (prediction or publication) of the consensus RNA secondary structure used by Rfam.
   BM   Build method:               Command line used to generate the model
   SM   Search method:              Command line used to perform the search
   GA   Gathering threshold:        Search threshold to build the full alignment.
   TC   Trusted Cutoff:             Lowest sequence score (and domain score for Pfam) of match in the full alignment.
   NC   Noise Cutoff:               Highest sequence score (and domain score for Pfam) of match not in full alignment.
   TP   Type:                       Type of family -- presently Family, Domain, Motif or Repeat for Pfam.
                                                   -- a tree with roots Gene, Intron or Cis-reg for Rfam.
   SQ   Sequence:                   Number of sequences in alignment.

   Optional fields:
   ----------------
   DC   Database Comment:           Comment about database reference.
   DR   Database Reference:         Reference to external database.
   RC   Reference Comment:          Comment about literature reference.
   RN   Reference Number:           Reference Number.
   RM   Reference Medline:          Eight digit medline UI number.
   RT   Reference Title:            Reference Title.
   RA   Reference Author:           Reference Author
   RL   Reference Location:         Journal location.
   PI   Previous identifier:        Record of all previous ID lines.
   KW   Keywords:                   Keywords.
   CC   Comment:                    Comments.
   NE   Pfam accession:             Indicates a nested domain.
   NL   Location:                   Location of nested domains - sequence ID, start and end of insert.
   WK   Wikipedia link:             Wikipedia page
   CL   Clan:                       Clan accession
   MB   Membership:                 Used for listing Clan membership

   For embedding trees:
   --------------------
   NH   New Hampshire:              A tree in New Hampshire eXtended format.
   TN   Tree ID:                    A unique identifier for the next tree.

   Other:
   ------
   FR   False discovery Rate:       A method used to set the bit score threshold based on the ratio of
                                    expected false positives to true positives. Floating point number between 0 and 1.
   CB   Calibration method:         Command line used to calibrate the model (Rfam only, release 12.0 and later)

Notes:
* A tree may be stored on multiple #=GF NH lines.
* If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
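
The rules for embedded trees can be sketched as follows. The function name `collect_trees` and its input, a list of (feature, value) pairs taken from the #=GF lines in file order, are assumptions made for this illustration. A single tree without a #=GF TN line is stored under the key `None`.

```python
def collect_trees(gf_pairs):
    """Join #=GF NH lines into tree strings keyed by the preceding #=GF TN value.

    gf_pairs: iterable of (feature, value) tuples in file order (hypothetical input).
    """
    trees = {}
    current = None  # stays None when a single tree has no TN line
    for feature, value in gf_pairs:
        if feature == "TN":
            current = value.strip()
            if current in trees:
                raise ValueError(f"duplicate tree identifier: {current}")
            trees[current] = ""
        elif feature == "NH":
            # A tree may be split over several NH lines; concatenate them.
            trees[current] = trees.get(current, "") + value.strip()
    return trees
```

For example, the pairs `("TN", "t1"), ("NH", "(A:0.1,"), ("NH", "B:0.2);")` yield one tree named `t1` whose text is the two NH values joined together.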


#=GS

Rfam and Pfam may use these features:
      Feature                    Description
      ---------------------      -----------
      AC <accession>             ACcession number
      DE <freetext>              DEscription
      DR <db>; <accession>;      Database Reference
      OS <organism>              Organism (species)
      OC <clade>                 Organism Classification (clade, etc.)
      LO <look>                  Look (Color, etc.)


#=GR

      Feature   Description            Markup letters
      -------   -----------            --------------
      SS        Secondary Structure    For RNA [.,;<>(){}[]AaBb.-_] --supports pseudoknot and further structure markup (see WUSS documentation)
                                       For protein [HGIEBTSCX]
      SA        Surface Accessibility  [0-9X]
                    (0=0%-10%; ...; 9=90%-100%)
      TM        TransMembrane          [Mio]
      PP        Posterior Probability  [0-9*]
                    (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
      LI        LIgand binding         [*]
      AS        Active Site            [*]
      pAS       AS - Pfam predicted    [*]
      sAS       AS - from SwissProt    [*]
      IN        INtron (in or after)   [0-2]

      For RNA tertiary interactions:
      ------------------------------
      tWW       WC/WC        in trans   For basepairs: [<>AaBb...Zz] For unpaired: [.]
      cWH       WC/Hoogsteen in cis
      cWS       WC/SugarEdge in cis
      tWS       WC/SugarEdge in trans
      notes: (1) {c,t}{W,H,S}{W,H,S} for general format.
             (2) cWW is equivalent to SS.
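
As an illustration of the PP encoding, the sketch below maps one markup character to a probability interval. The function name `pp_range` is hypothetical. The table above spells out only the symbols 0, 1 and *; the sketch assumes that the digits 2 to 9 continue the same pattern of 0.10-wide bins centred on the digit divided by ten.

```python
def pp_range(symbol):
    """Return the (low, high) posterior probability interval for one PP character.

    Assumes digits 2-9 follow the pattern given for 0 and 1 in the table.
    """
    if symbol == "*":
        return (0.95, 1.0)
    if len(symbol) == 1 and symbol in "0123456789":
        digit = int(symbol)
        low = max(0, 10 * digit - 5) / 100   # 0 -> 0.00, 1 -> 0.05, ...
        high = (10 * digit + 5) / 100        # 0 -> 0.05, 1 -> 0.15, ...
        return (low, high)
    raise ValueError(f"not a PP markup character: {symbol!r}")
```

Any other character, such as a gap symbol in an aligned PP line, raises `ValueError` here; a real reader would probably skip gap columns instead.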


#=GC

The list of valid features includes those shown below, as well as the same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".
      Feature   Description            Details
      -------   -----------            --------------
      RF        ReFerence annotation   Often the consensus RNA or protein sequence is used as a reference
                                       Any non-gap character (e.g. x's) can indicate consensus/conserved/match columns
                                       .'s or -'s indicate insert columns
                                       ~'s indicate unaligned insertions
                                       Upper and lower case can be used to discriminate strong and weakly conserved 
                                       residues respectively
      MM        Model Mask             Indicates which columns in an alignment should be masked, such
                                       that the emission probabilities for match states corresponding to
                                       those columns will be the background distribution.
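
A #=GC RF line can be used to pick out the match columns of an alignment. The sketch below follows the description in the table: '.', '-' and '~' are treated as non-match columns and every other character as a match column. The function names `match_columns` and `project` are assumptions made for this example.

```python
def match_columns(rf):
    """Return the 0-based indices of the columns marked as match columns in an RF line."""
    return [i for i, ch in enumerate(rf) if ch not in ".-~"]


def project(aligned, columns):
    """Keep only the given columns of one aligned sequence."""
    return "".join(aligned[i] for i in columns)
```

For an RF line `xx..xX~x` and an aligned sequence `ACgaGU-C`, `project` keeps columns 0, 1, 4, 5 and 7 and returns `ACGUC`.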


Notes

* Do not use multiple lines with the same #=GC label.
* For a single sequence, do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
* "X" in SA and SS means "residue with unknown structure".
* The protein SS letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
* The RNA SS letters are taken from WUSS (Washington University Secondary Structure) notation. Matching nested bracket characters <>, (), [], or {} indicate a basepair. The symbols '.', ',' and ';' indicate unpaired regions. Matched upper and lower case characters from the English alphabet indicate pseudoknot interactions. The 5' nucleotide within the knot should be in uppercase and the 3' nucleotide lowercase.
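
The WUSS pairing rule in the last note can be sketched with one stack per bracket type and one per pseudoknot letter. This is a minimal illustration under that assumption, not the routine used by Infernal; the function name `wuss_pairs` is hypothetical, and every character that is not a bracket or an ASCII letter is treated as unpaired.

```python
def wuss_pairs(ss):
    """Return sorted (i, j) 0-based base pairs described by a WUSS string."""
    closer_to_opener = {">": "<", ")": "(", "]": "[", "}": "{"}
    stacks = {}   # opener symbol or uppercase letter -> open positions
    pairs = []
    for i, ch in enumerate(ss):
        if ch in "<([{" or "A" <= ch <= "Z":
            # 5' partner: brackets open a pair, uppercase letters open a pseudoknot pair.
            stacks.setdefault(ch, []).append(i)
        elif ch in closer_to_opener or "a" <= ch <= "z":
            # 3' partner: match it with the most recent open partner of the same kind.
            key = closer_to_opener.get(ch, ch.upper())
            if not stacks.get(key):
                raise ValueError(f"unmatched {ch!r} at column {i}")
            pairs.append((stacks[key].pop(), i))
        # every other character ('.', ',', ';', '-', '_', ...) is unpaired
    if any(stacks.values()):
        raise ValueError("unclosed base pair")
    return sorted(pairs)
```

For the toy string `<<.AA.>>.aa`, the nested brackets give the pairs (0, 7) and (1, 6), and the letters give the pseudoknot pairs (3, 10) and (4, 9).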


Recommended placements

* #=GF Above the alignment
* #=GC Below the alignment
* #=GS Above the alignment or just below the corresponding sequence
* #=GR Just below the corresponding sequence


Size limits

There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:
* Line length: 10000.
* <seqname>: 255.
* <feature>: 255.
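
A reader with fixed-size buffers could check each line against these limits before parsing it. The sketch below is one possible way to do that; the constant and function names are assumptions made for this example, and the line is assumed to be passed without its trailing newline.

```python
MAX_LINE_LENGTH = 10000
MAX_SEQNAME_LENGTH = 255
MAX_FEATURE_LENGTH = 255


def limit_violations(line):
    """Return the names of the limits that a single line exceeds."""
    violations = []
    if len(line) > MAX_LINE_LENGTH:
        violations.append("line length")
    parts = line.split()
    seqname = feature = None
    if line.startswith(("#=GS", "#=GR")):
        # Per-sequence markup: tag, seqname, feature, value
        if len(parts) >= 3:
            seqname, feature = parts[1], parts[2]
    elif line.startswith(("#=GF", "#=GC")):
        # Per-file or per-column markup: tag, feature, value
        if len(parts) >= 2:
            feature = parts[1]
    elif parts and not line.startswith("#") and line.strip() != "//":
        # Sequence line: seqname, aligned sequence
        seqname = parts[0]
    if seqname is not None and len(seqname) > MAX_SEQNAME_LENGTH:
        violations.append("seqname")
    if feature is not None and len(feature) > MAX_FEATURE_LENGTH:
        violations.append("feature")
    return violations
```

An empty list means the line fits within all three limits.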


Examples

A simple example of an Rfam alignment (UPSK RNA) with a pseudoknot in Stockholm format is shown below:
# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal 
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
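
The sequence names in this example use the "name/start-end" form. A short sketch of how such names might be split, and how the coordinates could be checked against the number of ungapped residues, is given below. The function names are hypothetical, "." and "-" are counted as gaps, and the use of `abs` (so that names with start greater than end are also accepted) is an assumption of this example.

```python
import re

# name/start-end, where the name itself may contain further characters
NAME_RE = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def split_name(seqname):
    """Split 'name/start-end' into (name, start, end); plain names give (name, None, None)."""
    match = NAME_RE.match(seqname)
    if match is None:
        return seqname, None, None
    return match.group("name"), int(match.group("start")), int(match.group("end"))


def coords_consistent(seqname, aligned):
    """Check that the coordinate span equals the number of non-gap characters."""
    _, start, end = split_name(seqname)
    if start is None:
        return True  # nothing to check for names without coordinates
    residues = sum(1 for ch in aligned if ch not in ".-")
    return residues == abs(end - start) + 1
```

For the first sequence above, `split_name("AF035635.1/619-641")` gives the name `AF035635.1` with coordinates 619 and 641, a span of 23 positions that matches the 23 residues on the line.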
Here is a slightly more complex example showing the Pfam CBS domain:
# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA  9998877564535242525515252536463774777
O83071/259-312          MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS   ________________*____________________
#=GR O31699/88-139 IN   ____________1____________2______0____
//
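
Because #=GR markup carries exactly one character per residue and #=GC markup one character per column, a reader can check the lengths of these lines against the sequences. The sketch below is one way to do so; the function name `annotation_problems` and its dictionary arguments (sequence name to aligned string, sequence name to feature to markup, and feature to markup) are assumptions made for this example.

```python
def annotation_problems(seqs, gr=None, gc=None):
    """Return a list of human-readable length problems in an alignment."""
    problems = []
    widths = {len(s) for s in seqs.values()}
    if len(widths) > 1:
        problems.append("sequences have different lengths")
    # Each #=GR line must be as long as the sequence it annotates.
    for name, features in (gr or {}).items():
        if name not in seqs:
            problems.append(f"#=GR for unknown sequence {name}")
            continue
        for feature, markup in features.items():
            if len(markup) != len(seqs[name]):
                problems.append(f"#=GR {name} {feature}: length mismatch")
    # Each #=GC line must span every alignment column.
    if widths:
        width = max(widths)
        for feature, markup in (gc or {}).items():
            if len(markup) != width:
                problems.append(f"#=GC {feature}: length mismatch")
    return problems
```

An empty list means that every markup line has the same number of characters as the sequence or alignment it annotates.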


See also

* FASTA format
* Rfam
* Pfam


External links


* Erik Sonnhammer's definition of Stockholm format