A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences. PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery.

Background

Creation

Conversion of sequence to position probability matrix

A PWM has one row for each symbol of the alphabet (4 rows for nucleotides in DNA sequences or 20 rows for amino acids in protein sequences) and one column for each position in the pattern. In the first step in constructing a PWM, a basic position frequency matrix (PFM) is created by counting the occurrences of each nucleotide at each position. From the PFM, a position probability matrix (PPM) can then be created by dividing each nucleotide count by the number of sequences, thereby normalising the values. Formally, given a set ''X'' of ''N'' aligned sequences of length ''l'', the elements of the PPM M are calculated:
M_{k,j} = \frac{1}{N} \sum_{i=1}^{N} I(X_{i,j} = k),

where ''i'' ∈ (1,...,''N''), ''j'' ∈ (1,...,''l''), ''k'' ranges over the set of symbols in the alphabet, and ''I(a=k)'' is an indicator function that is 1 if ''a=k'' and 0 otherwise. For example, for a set of ten aligned DNA sequences of length nine, the corresponding PFM is:
M = \begin{matrix}
A\\
C\\
G\\
T
\end{matrix}
\begin{bmatrix}
3 & 6 & 1 & 0 & 0 & 6 & 7 & 2 & 1\\
2 & 2 & 1 & 0 & 0 & 2 & 1 & 1 & 2\\
1 & 1 & 7 & 10 & 0 & 1 & 1 & 5 & 1\\
4 & 1 & 1 & 0 & 10 & 1 & 1 & 2 & 6
\end{bmatrix}.

Therefore, the resulting PPM is:
M = \begin{matrix}
A\\
C\\
G\\
T
\end{matrix}
\begin{bmatrix}
0.3 & 0.6 & 0.1 & 0.0 & 0.0 & 0.6 & 0.7 & 0.2 & 0.1\\
0.2 & 0.2 & 0.1 & 0.0 & 0.0 & 0.2 & 0.1 & 0.1 & 0.2\\
0.1 & 0.1 & 0.7 & 1.0 & 0.0 & 0.1 & 0.1 & 0.5 & 0.1\\
0.4 & 0.1 & 0.1 & 0.0 & 1.0 & 0.1 & 0.1 & 0.2 & 0.6
\end{bmatrix}.
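The counting and normalising steps can be sketched in a few lines of Python. This is only an illustration: the ten sequences below are a hypothetical alignment chosen so that their column counts reproduce the PFM shown above, and the function names are made up for this example.

```python
# Hypothetical alignment: ten 9-mers whose column counts match the PFM in
# the text. It is an assumption, not necessarily the original dataset.
SEQUENCES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
ALPHABET = "ACGT"


def position_frequency_matrix(sequences, alphabet=ALPHABET):
    """Count how often each symbol occurs in each column of the alignment."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    pfm = {symbol: [0] * length for symbol in alphabet}
    for seq in sequences:
        for j, symbol in enumerate(seq):
            pfm[symbol][j] += 1
    return pfm


def position_probability_matrix(pfm, n_sequences):
    """Divide every count by the number of sequences N."""
    return {
        symbol: [count / n_sequences for count in row]
        for symbol, row in pfm.items()
    }


PFM = position_frequency_matrix(SEQUENCES)
PPM = position_probability_matrix(PFM, len(SEQUENCES))

for symbol in ALPHABET:
    print(symbol, PFM[symbol], PPM[symbol])
```

Storing one list per symbol mirrors the row-per-symbol layout of the matrices above; an array library would work equally well.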

Both PPMs and PWMs assume statistical independence between positions in the pattern, as the probabilities for each position are calculated independently of other positions. From the definition above, it follows that the sum of values for a particular position (that is, summing over all symbols) is 1. Each column can therefore be regarded as an independent multinomial distribution. This makes it easy to calculate the probability of a sequence given a PPM, by multiplying the relevant probabilities at each position. For example, the probability of the sequence ''S'' = GAGGTAAAC given the above PPM M can be calculated:

p(S\vert M) = 0.1 \times 0.6 \times 0.7 \times 1.0 \times 1.0 \times 0.6 \times 0.7 \times 0.2 \times 0.2 = 0.0007056.
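The product above can be reproduced with a short Python sketch. The sequence GAGGTAAAC is an assumption here, chosen because its per-position probabilities match the nine factors in the product; the function name is likewise illustrative.

```python
import math

# PPM from the text, one row per nucleotide, one entry per motif position.
PPM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}


def sequence_probability(seq, ppm):
    """Multiply the per-position probabilities, assuming independent columns."""
    width = len(next(iter(ppm.values())))
    if len(seq) != width:
        raise ValueError("sequence length must equal the motif width")
    return math.prod(ppm[symbol][j] for j, symbol in enumerate(seq))


# Assumed example sequence; see the note above.
P_S = sequence_probability("GAGGTAAAC", PPM)
print(P_S)
```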

Pseudocounts (or ''Laplace estimators'') are often applied when calculating PPMs from a small dataset, in order to avoid matrix entries having a value of 0. This is equivalent to multiplying each column of the PPM by a Dirichlet distribution and allows the probability to be calculated for new sequences (that is, sequences which were not part of the original dataset). In the example above, without pseudocounts, any sequence which did not have a G in the 4th position or a T in the 5th position would have a probability of 0, regardless of the other positions.
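One common way to apply pseudocounts is add-one smoothing of the counts before normalising. The sketch below assumes a pseudocount of 1 per symbol, which is an illustrative choice rather than a value prescribed by the text.

```python
# PFM from the text (ten sequences, nine positions).
PFM = {
    "A": [3, 6, 1, 0, 0, 6, 7, 2, 1],
    "C": [2, 2, 1, 0, 0, 2, 1, 1, 2],
    "G": [1, 1, 7, 10, 0, 1, 1, 5, 1],
    "T": [4, 1, 1, 0, 10, 1, 1, 2, 6],
}


def smoothed_ppm(pfm, pseudocount=1.0):
    """Add `pseudocount` to every count, then renormalise each column."""
    length = len(next(iter(pfm.values())))
    n_symbols = len(pfm)
    # Each column total grows by one pseudocount per symbol.
    totals = [
        sum(row[j] for row in pfm.values()) + pseudocount * n_symbols
        for j in range(length)
    ]
    return {
        symbol: [(row[j] + pseudocount) / totals[j] for j in range(length)]
        for symbol, row in pfm.items()
    }


SMOOTHED = smoothed_ppm(PFM)
# Position 4 no longer forces a G: the other symbols get a small probability.
print([round(SMOOTHED[s][3], 3) for s in "ACGT"])
```

With this smoothing every entry is strictly positive, so no sequence is assigned a probability of exactly 0.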

Conversion of position probability matrix to position weight matrix

Most often the elements in PWMs are calculated as log odds. That is, the elements of a PPM are transformed using a background model b so that:

M_{k,j} = \log_2 (M_{k,j} / b_k).

This formula describes how an element in the PWM (left), M_{k,j}, can be calculated from the corresponding PPM element (right). The simplest background model assumes that each letter appears equally frequently in the dataset. That is, b_k = 1/\vert k \vert for all symbols in the alphabet, where \vert k \vert is the number of symbols (giving 0.25 for nucleotides and 0.05 for amino acids). Applying this transformation to the PPM M from above (with no pseudocounts added) gives:

M = \begin{matrix}
A\\
C\\
G\\
T
\end{matrix}
\begin{bmatrix}
0.26 & 1.26 & -1.32 & -\infty & -\infty & 1.26 & 1.49 & -0.32 & -1.32\\
-0.32 & -0.32 & -1.32 & -\infty & -\infty & -0.32 & -1.32 & -1.32 & -0.32\\
-1.32 & -1.32 & 1.49 & 2.0 & -\infty & -1.32 & -1.32 & 1.0 & -1.32\\
0.68 & -1.32 & -1.32 & -\infty & 2.0 & -1.32 & -1.32 & -0.32 & 1.26
\end{bmatrix}.
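The log-odds transformation can be checked numerically with a minimal Python sketch. It assumes the uniform background of 0.25 per nucleotide used in the text and maps zero probabilities to negative infinity; the function name is illustrative.

```python
import math

# PPM from the text (no pseudocounts).
PPM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}


def position_weight_matrix(ppm, background=None):
    """Return log2(p / b_k) per entry; a zero probability becomes -inf."""
    if background is None:
        # Uniform background: every symbol is equally likely.
        background = {symbol: 1.0 / len(ppm) for symbol in ppm}
    return {
        symbol: [
            math.log2(p / background[symbol]) if p > 0 else -math.inf
            for p in row
        ]
        for symbol, row in ppm.items()
    }


PWM = position_weight_matrix(PPM)
for symbol, row in PWM.items():
    print(symbol, [round(value, 2) for value in row])
```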

The −∞ entries in the matrix make clear the advantage of adding pseudocounts, especially when using small datasets to construct M. The background model need not have equal values for each symbol: for example, when studying organisms with a high GC-content, the values for C and G may be increased with a corresponding decrease for the A and T values.

When the PWM elements are calculated as log-likelihood ratios, the score of a sequence can be calculated by adding (rather than multiplying) the relevant values at each position in the PWM. The sequence score gives an indication of how different the sequence is from a random sequence. The score is 0 if the sequence has the same probability of being a functional site and of being a random site. The score is greater than 0 if it is more likely to be a functional site than a random site, and less than 0 if it is more likely to be a random site than a functional site. The sequence score can also be interpreted in a physical framework as the binding energy for that sequence.
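The additive scoring and the effect of the background model can be illustrated with a small Python sketch. The GC-rich background frequencies and the example sequences are hypothetical values chosen for illustration only.

```python
import math

# PPM from the text (no pseudocounts).
PPM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical background for a GC-rich genome (assumed values).
GC_RICH = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}


def log_odds_score(seq, ppm, background):
    """Sum log2(p / b) over positions; any zero probability gives -inf."""
    score = 0.0
    for j, symbol in enumerate(seq):
        p = ppm[symbol][j]
        if p == 0.0:
            return -math.inf
        score += math.log2(p / background[symbol])
    return score


uniform_score = log_odds_score("GAGGTAAAC", PPM, UNIFORM)
gc_rich_score = log_odds_score("GAGGTAAAC", PPM, GC_RICH)
impossible = log_odds_score("GAGATAAAC", PPM, UNIFORM)  # A at position 4

print(uniform_score, gc_rich_score, impossible)
```

Because the example sequence is mostly A and T, it is less likely under the GC-rich background, so its score rises when that background is used.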

Information content

The information content (IC) of a PWM is sometimes of interest, as it says something about how different a given PWM is from a uniform distribution. The self-information of observing a particular symbol at a particular position of the motif is:

-\log(p_{i,j})

The expected (average) self-information of a particular element in the PWM is then:

-p_{i,j} \cdot \log(p_{i,j})

Finally, the IC of the PWM is then the sum of the expected self-information of every element:

\textstyle -\sum_{i,j} p_{i,j} \cdot \log(p_{i,j})

Often, it is more useful to calculate the information content using the background letter frequencies of the sequences under study rather than assuming equal probabilities for each letter (e.g., the GC-content of DNA of thermophilic bacteria ranges from 65.3% to 70.8%, thus a motif of ATAT would contain much more information than a motif of CCGG). The equation for information content thus becomes:

\textstyle -\sum_{i,j} p_{i,j} \cdot \log(p_{i,j}/p_{j})

where p_{j} is the background frequency for letter ''j''. This corresponds, up to sign, to the Kullback–Leibler divergence or relative entropy. However, it has been shown that when using PSSMs to search genomic sequences (see below) this uniform correction can lead to overestimation of the importance of the different bases in a motif, due to the uneven distribution of n-mers in real genomes, leading to a significantly larger number of false positives.
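The two quantities can be computed per column with a short Python sketch. It assumes base-2 logarithms (bits), treats terms with zero probability as contributing 0, and computes the relative entropy in the standard non-negative Kullback–Leibler form, the sum of p·log2(p/q); the function names are illustrative.

```python
import math

# PPM from the text (no pseudocounts).
PPM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}
UNIFORM = [0.25, 0.25, 0.25, 0.25]  # background in the row order A, C, G, T


def column(ppm, j):
    """Probabilities of all symbols at position j, in row order."""
    return [row[j] for row in ppm.values()]


def column_entropy(probs):
    """Expected self-information in bits; p = 0 terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def column_relative_entropy(probs, background):
    """Kullback-Leibler divergence of a column from the background, in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(probs, background) if p > 0)


WIDTH = len(PPM["A"])
ENTROPIES = [column_entropy(column(PPM, j)) for j in range(WIDTH)]
RELATIVE = [column_relative_entropy(column(PPM, j), UNIFORM) for j in range(WIDTH)]

print([round(h, 2) for h in ENTROPIES])
print([round(r, 2) for r in RELATIVE])
```

For a uniform nucleotide background the two values are complementary: each column's relative entropy equals 2 bits minus its entropy, so the fully conserved positions 4 and 5 reach the maximum of 2 bits.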

Uses

There are various algorithms to scan for hits of PWMs in sequences. One example is the MATCH algorithm, which has been implemented in ModuleMaster. More sophisticated algorithms for fast database searching with nucleotide as well as amino acid PWMs/PSSMs are implemented in the possumsearch software. The basic PWM/PSSM is unable to deal with insertions and deletions. A PSSM with additional probabilities for insertion and deletion at each position can be interpreted as a hidden Markov model. This is the approach used by Pfam.
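As a rough illustration of what scanning involves, the following Python sketch slides a PWM along a sequence and reports windows whose score reaches a threshold. It is a naive scan, not the MATCH algorithm or the possumsearch method, and the pseudocount, uniform background, threshold and test sequence are all assumptions made for the example.

```python
import math

# PFM from the example earlier in the article.
PFM = {
    "A": [3, 6, 1, 0, 0, 6, 7, 2, 1],
    "C": [2, 2, 1, 0, 0, 2, 1, 1, 2],
    "G": [1, 1, 7, 10, 0, 1, 1, 5, 1],
    "T": [4, 1, 1, 0, 10, 1, 1, 2, 6],
}


def build_pwm(pfm, pseudocount=1.0, background=0.25):
    """Log-odds PWM from smoothed counts (assumed pseudocount and background)."""
    length = len(next(iter(pfm.values())))
    totals = [
        sum(row[j] for row in pfm.values()) + pseudocount * len(pfm)
        for j in range(length)
    ]
    return {
        symbol: [
            math.log2((row[j] + pseudocount) / totals[j] / background)
            for j in range(length)
        ]
        for symbol, row in pfm.items()
    }


def scan(sequence, pwm, threshold):
    """Score every window of motif width; keep those at or above threshold."""
    width = len(next(iter(pwm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[symbol][j] for j, symbol in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = build_pwm(PFM)
# Hypothetical sequence with a motif-like 9-mer embedded at offset 3.
HITS = scan("TTTGAGGTAAACCC", PWM, threshold=0.0)
for start, window, score in HITS:
    print(start, window, round(score, 2))
```

The scan costs one table lookup per position per window; the faster tools named above avoid much of that work, which this sketch does not attempt.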


External links

3PFDB – a database of Best Representative PSSM Profiles (BRPs) of Protein Families generated using a novel data mining approach.
UGENE – PSS matrix design, integrated interface to JASPAR, UniPROBE and SITECON databases.