In
theoretical computer science
Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory.
It is difficult to circumsc ...
, the closest string is an
NP-hard
In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
computational problem,
which tries to find the geometrical center of a set of input strings.
To understand the word "center", it is necessary to define a distance between two strings. Usually, this problem is studied with the
Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chang ...
in mind.
Formal definition
More formally, given ''n'' length-''m'' strings ''s''
1, ''s''
2, ..., ''s''
''n'', the closest string problem seeks for a new length-''m'' string ''s'' such that ''d''(''s'',''s''
''i'') ≤ ''k'' for all ''i'', where ''d'' denotes the
Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chang ...
, and where ''k'' is as small as possible. A
decision problem
In computability theory and computational complexity theory, a decision problem is a computational problem that can be posed as a yes–no question of the input values. An example of a decision problem is deciding by means of an algorithm whethe ...
version of the closest string problem, which is
NP-complete
In computational complexity theory, a problem is NP-complete when:
# it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by tryin ...
, instead takes ''k'' as another input and questions whether there is a string within Hamming distance ''k'' of all the input strings.
The closest string problem can be seen as a special case of the generic
1-center problem
The 1-center problem, also known as minimax problem or minmax location problem, is a classical combinatorial optimization problem in operations research of facilities location type. In its most general case the problem is stated as follows: given ...
in which the distances between elements are measured using Hamming distance.
Motivation
In
bioinformatics
Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...
, the closest string problem is an intensively studied facet of the problem of finding signals in
DNA.
Simplifications and data reductions
Instances of closest string may contain information that is not essential to the problem. In some sense, the usual input of closest string contains information, that does not contribute to the hardness of the problem. For example, if some strings contain the character ''a'', but none contains the character ''z'', replacing all ''a''s with ''z''s would yield an essentially equivalent instance, that is: from a solution of the modified instance, the original solution can be restored, and vice versa.
Normalizing the input
When all input strings that share the same length are written on top of each other, they form a matrix. Certain row types have essentially the same implications to the solution. For example, replacing a column with entries (''a'', ''a'', ''b'') with another column (''x'', ''x'', ''y'') might lead to a different solution string, but cannot affect solvability, because both columns express the same structure, viz. the first two entries being equal, but different from the third one.
An input instance can be ''normalized'' by replacing, in each column, the character that occurs the most often with ''a'', the character that occurs the second most often with ''b'', and so forth. Given a solution to the normalized instance, the original instance can be found by remapping the characters of the solution to its original version in every column.
The order of the columns does not contribute to the hardness of the problem. That means, if we permute all input strings according to a certain permutation π and obtain a solution string ''s'' to that modified instance, then π
−1(''s'') will be a solution to the original instance.
Example

Given an instance with three input strings ''uvwx'', ''xuwv'', and ''xvwu''. This could be written as a matrix like this:
The first column has the values (''u'', ''x'', ''x''). As ''x'' is the character that appears the most often, we replace it by ''a'', and we replace ''u'', the second most often character, by ''b'', obtaining the new first column (''b'', ''a'', ''a''). The second column has the values (''v'', ''u'', ''v''). As for the first column, ''v'' is replaced by ''a'' and ''u'' is replaced by ''b'', obtaining the new second column (''a'', ''b'', ''a''). Doing the same with all columns gives the normalized instance
Data reduction obtained from normalization
Normalizing the input reduces the alphabet size to at most the number of input strings. This can be useful for algorithms whose running times depend on the alphabet size.
Approximability
Li et al. evolved a
polynomial-time approximation scheme
In computer science (particularly algorithmics), a polynomial-time approximation scheme (PTAS) is a type of approximation algorithm for optimization problems (most often, NP-hard optimization problems).
A PTAS is an algorithm which takes an in ...
which is practically unusable because of the large hidden constants.
Fixed-parameter tractability
Closest String can be solved in
, where ''k'' is the number of input strings, ''L'' is the length of all strings and ''d'' is the desired maximum distance from the solution string to any input string.
[{{citation, title=Fixed-Parameter Algorithms for Closest String and Related Problems, authors=Jens Gramm, ]Rolf Niedermeier
Rolf Niedermeier (21 July 1966 – 19 March 2022) was a professor of computer science, known for his research in computational complexity theory, especially in parameterized complexity, graph theory, computational social choice, and social netwo ...
, and Peter Rossmanith, journal=Algorithmica
''Algorithmica'' is a monthly peer-reviewed scientific journal focusing on research and the application of computer science algorithms. The journal was established in 1986 and is published by Springer Science+Business Media. The editor in chief is ...
, year=2003, volume=37, pages=25–42, doi=10.1007/s00453-003-1028-3, citeseerx=10.1.1.61.736, s2cid=8206021
Relations to other problems
Closest string is a special case of the more general
closest substring problem, which is strictly more difficult. While closest string turns out to be
fixed-parameter tractable
In computer science, parameterized complexity is a branch of computational complexity theory that focuses on classifying computational problems according to their inherent difficulty with respect to ''multiple'' parameters of the input or output. ...
in a number of ways, closest substring is
W hard">hard with regard to these parameters.
References
NP-hard problems
Formal languages