Shortest Common Superstring Problem
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as
subsequence In mathematics, a subsequence of a given sequence is a sequence that can be derived from the given sequence by deleting some or no elements without changing the order of the remaining elements. For example, the sequence \langle A,B,D \rangle is a ...
s. This is a problem closely related to the
longest common subsequence problem The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring problem: unlike substrings, subs ...
. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y. A shortest common supersequence (SCS) is a common supersequence of minimal length. In the shortest common supersequence problem, two sequences X and Y are given, and the task is to find a shortest possible common supersequence of these sequences. In general, an SCS is not unique. For two input sequences, an SCS can be formed from a
longest common subsequence A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...
(LCS) easily. For example, the longest common subsequence of X ..m= abcbdab and Y ..n= bdcaba is Z ..L= bcba. By inserting the non-LCS symbols into Z while preserving their original order, we obtain a shortest common supersequence U ..S= abdcabdab. In particular, the equation L + S = m + n holds for any two input sequences. There is no similar relationship between shortest common supersequences and longest common subsequences of three or more input sequences. (In particular, LCS and SCS are not
dual problem In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. If the primal is a minimization problem then th ...
s.) However, both problems can be solved in O(n^k) time using dynamic programming, where k is the number of sequences, and n is their maximum length. For the general case of an arbitrary number of input sequences, the problem is
NP-hard In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
.


Shortest common superstring

The closely related problem of finding a minimum-length string which is a superstring of a finite set of strings = is also NP-hard. Several constant factor approximations have been proposed throughout the years, and the current best known algorithm has an approximation factor of 2.475. However, perhaps the simplest solution is to reformulate the problem as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring . One can then use the O(log())-approximation for weighted set-cover to obtain an O(log())-approximation for the shortest superstring (note that this is ''not'' a constant factor approximation). For any string in this alphabet, define () to be the set of all strings which are substrings of . The instance of set cover is formulated as follows: * Let be empty. * For each pair of strings and , if the last symbols of are the same as the first symbols of , then add a string to that consists of the concatenation with maximal overlap of with . * Define the universe \mathcal U of the set cover instance to be * Define the set of subsets of the universe to be * Define the cost of each subset (x) to be , , , the length of . The instance can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings for which the weighted set cover algorithm outputs ().


Example

Consider the set = , which becomes the universe of the weighted set cover instance. In this case, = . Then the set of subsets of the universe is : \begin \ &= \ \\ &= \ \} \\ &= \ \} \\ \end which have costs 3, 3, 3, 5, and 4, respectively.


References

* * * {{Citation, last = Vazirani , first = Vijay V. , authorlink = Vijay Vazirani , title = Approximation Algorithms , year = 2001 , publisher = Springer-Verlag , isbn = 3-540-65367-8 , url = http://www.cc.gatech.edu/fac/Vijay.Vazirani/book.pd


External links


Dictionary of Algorithms and Data Structures: shortest common supersequence

Shortest Common Supersequence

1092. Shortest Common Supersequence

Shortest Common Supersequence

Shortest common supersequence

Shortest Common Supersequence (SCS)
Problems on strings Combinatorics Formal languages Dynamic programming NP-complete problems Approximation algorithms