HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...
, an algorithm for matching wildcards (also known as globbing) is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching
regular expressions A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" o ...
and string matching in general.


The problem

A wildcard matcher tests a wildcard pattern ''p'' against an input string ''s''. It performs an ''anchored'' match, returns true only when ''p'' matches the entirety of ''s''. The pattern can be based on any common syntax (see
globbing In computer programming, glob () patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txt textfiles/ moves (mv) all files with names ending in .txt from the current directory to the director ...
), but on Windows programmers tend to only discuss a simplified syntax supported by the native C runtime: * No escape characters are defined * Wildcards: matches exactly one occurrence of any character. matches arbitrary many (including zero) occurrences of any character. This article mainly discusses the Windows formulation of the problem, unless otherwise stated.


Definition

Stated in zero-based indices, the wildcard-matching problem can be defined recursively as: : \begin m_ &= (p_ = t_) \\ m_ &= (p_ = \text) \land m_\\ m_ & = \text \\ m_ &= \begin m_ & \text\; p_ = t_ \lor p_ = \text\\ m_ \lor m_ & \text\; p_ = \text\\ \text & \text\; p_ \neq t_ \end & & \quad \text\; 1 \leq i \le , p, , 1 \leq j \le , t, . \end where ''mij'' is the result of matching the pattern ''p'' against the text ''t'' truncated at ''i'' and ''j'' characters respectively. This is the formulation used by Richter's algorithm and the ''Snippets'' algorithm found in Cantatore's collection. This description is similar to the
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charac ...
.


Related problems

Directly related problems in computer science include: * Pattern matching with don't cares or gaps, an unanchored string search with only the equivalent of defined. * Pattern matching with wildcards, an unanchored string search with the equivalent of both wildcards defined. Has an exponential runtime unless a length-bound is given in the pattern matching with flexible wildcards variant.


History

Early algorithms for matching wildcards often relied on
recursion Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathemati ...
, but the technique was criticized on grounds of performance and reliability considerations. Non-recursive algorithms for matching wildcards have gained favor in light of these considerations. Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below.
Test case In software engineering, a test case is a specification of the inputs, execution conditions, testing procedure, and expected results that define a single test to be executed to achieve a particular software testing objective, such as to exercise ...
development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.


Recursive algorithms

The recursion generally happens on matching * when there is more suffix to match against. This is a form of
backtracking Backtracking is a class of algorithms for finding solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate ("backtracks") as soon as it d ...
, also done by some regular expression matchers. *
Rich Salz InterNetNews (INN) is a Usenet news server package, originally released by Rich Salz in 1991, and presented at the Summer 1992 USENIX conference in San Antonio, Texas. It was the first news server with integrated NNTP functionality. While pr ...
' wildmat algorithm (sh-like syntax) * Filip's algorithm and Vignesh Murugesan's algorithm * Martin Richter's algorithm (identical to ''Snippets'' and related to the 7-zip algorithm) * C library fnmatch implementations (supports ../code> and multibyte character sets): **
Guido van Rossum Guido van Rossum (; born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position on 12 July 2018 ...
's BSD libc fnmatch, also part of Apple libc **
Glibc The GNU C Library, commonly known as glibc, is the GNU Project's implementation of the C standard library. Despite its name, it now also directly supports C++ (and, indirectly, other programming languages). It was started in the 1980s by ...
fnmatch The general form of these algorithms are the same. On recursion the algorithm slices the input into substrings, and considers a match to have happened when ONE of the substrings return a positive match. For , it would greedily call , , and . They usually differ by less-important things like support for features and by more important factors such as minor but highly effective optimizations. Some of them include: * The ABORT signal against over-recursion (Lars Mathiesen 1991). While it is correct to naively recurse by the entire rest of the strings (pattern and text) on * and making sure that ONE of the substrings return a positive match, the running time becomes exponential for rejecting a match with many * in the text. Lars Mathiesen changes the return to three classes, match, no-match, and ABORT (no match possible at all for asterisk recursion.) The ABORT value is returned when the text is consumed too early or when another asterisk match has failed, guaranteeing a linear performance with respect to the number of asterisks. (The overall complexity is additionally quadratic to the number of characters left to match.) Git/Rsync's wildmatch ABORT also covers invalid inputs. The new INN uwildmat does the same. * Asterisk advancement in recursion. This wildmatch tweak is relatively more minor. It applies to when the recursion wants to match "*X" on "abcX": when an asterisk is followed by a literal like "X", it is obvious that only the last comparison with equal lengths would have a chance of producing a match. This is seen earlier in uwildmat in 2000 and more implicitly in van Rossum's fnmatch for . Martin Richter's algorithm is an exception to this pattern, although the overall operation is equivalent. On * it recurses into increasing ''either'' of the indexes, following the dynamic programming formulation of the problem. The "ABORT" technique is applicable to it as well. On typical patterns (as tested by Cantatore) it is slower than the greedy-call implementations. The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take linear-to-string-size time to match since there is a fixed one-to-one relation.


Non-recursive algorithms

The following are developed by critics of the recursive algorithms: * Kirk J. Krauss's wildcard-matching algorithm, used by IBM * Alessandro Cantatore's collection of wildcard matching algorithms * Dogan Kurt's iterative matcher and slower NFA matcher. * Siler's incorrect algorithm (fails ) The following is not: * Jack Handy's incorrect algorithm (fails ) The iterative functions above implement backtracking by saving an old set of pattern/text pointers, and reverting to it should a match fails. According to Kurt, since only one successful match is required, only one such set needs to be saved. In addition, the problem of wildcard matching can be converted into
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. (In fact, many of the algorithms above only has support for and .) The Russ Cox implementation of Thompson NFA can be trivially modified for such. Gustavo Navarro's -based nrgrep algorithm provides a more streamlined implementation with emphasis on efficient suffixes. See also .


See also

*
Pattern matching In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
* Pattern calculus *
Glob (programming) In computer programming, glob () patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txt textfiles/ moves (mv) all files with names ending in .txt from the current directory to the directory ...
*
Wildcard character In software, a wildcard character is a kind of placeholder represented by a single character, such as an asterisk (), which can be interpreted as a number of literal characters or an empty string. It is often used in file searches so the full na ...
*
List of algorithms The following is a list of well-known algorithms along with one-line descriptions for each. Automated planning Combinatorial algorithms General combinatorial algorithms * Brent's algorithm: finds a cycle in function value iterations using on ...


References

{{reflist Computer file formats Computing commands History of human–computer interaction Pattern matching Software architecture User interface techniques User interfaces