Overview
Probabilistic classifiers

Many common pattern recognition algorithms are probabilistic in nature, in that they use statistical inference to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, probabilistic algorithms often also output a probability of the instance being described by the given label. In addition, many probabilistic algorithms output a list of the N-best labels with associated probabilities, for some value of N, instead of simply a single best label. When the number of possible labels is fairly small (e.g., in the case of classification), N may be set so that the probability of all possible labels is output. Probabilistic algorithms have several advantages over non-probabilistic algorithms:

They output a confidence value associated with their choice. (Some other algorithms also output confidence values, but in general only for probabilistic algorithms is this value mathematically grounded in probability theory. Non-probabilistic confidence values generally cannot be given any specific meaning and can only be compared against other confidence values output by the same algorithm.)
Correspondingly, they can abstain when the confidence of choosing any particular output is too low.
Because they output probabilities, probabilistic pattern-recognition algorithms can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.

Number of important feature variables
When selecting among a total of $n$ features, the powerset consisting of all $2^n - 1$ non-empty subsets of features needs to be explored. The Branch-and-Bound algorithm[4] reduces this complexity but is intractable for medium to large values of the number of available features $n$. For a large-scale comparison of feature-selection algorithms, see [5].

Techniques to transform the raw feature vectors (feature extraction) are sometimes used prior to application of the pattern-matching algorithm. For example, feature extraction algorithms attempt to reduce a large-dimensionality feature vector into a smaller-dimensionality vector that is easier to work with and encodes less redundancy, using mathematical techniques such as principal components analysis (PCA). The distinction between feature selection and feature extraction is that the features resulting from feature extraction are of a different sort than the original features and may not be easily interpretable, while the features left after feature selection are simply a subset of the original features.

Problem statement (supervised version)

Formally, the problem of supervised pattern recognition can be stated as follows: given an unknown function $g : \mathcal{X} \rightarrow \mathcal{Y}$ (the ground truth) that maps input instances $\boldsymbol{x} \in \mathcal{X}$ to output labels $y \in \mathcal{Y}$, along with training data $\mathbf{D} = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_n, y_n)\}$ assumed to represent accurate examples of the mapping, produce a function $h : \mathcal{X} \rightarrow \mathcal{Y}$ that approximates as closely as possible the correct mapping $g$. (For example, if the problem is filtering spam, then $\boldsymbol{x}_i$ is some representation of an email and $y$ is either "spam" or "non-spam".)
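As a rough illustration of the supervised setup above, the following Python sketch learns a function h from a small labelled training set D and applies it to new inputs. The nearest-centroid rule, the two toy features (a count of suspicious words and a count of links), and all names are assumptions chosen for illustration; the problem statement itself does not prescribe any particular learning method.

```python
from collections import defaultdict
import math


def fit_nearest_centroid(training_data):
    """Learn h from D = [(x_1, y_1), ..., (x_n, y_n)] by averaging each class."""
    sums = {}
    counts = defaultdict(int)
    for x, y in training_data:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    # One centroid (mean feature vector) per label.
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def h(x):
        # Predict the label whose centroid is closest in Euclidean distance.
        return min(centroids, key=lambda y: math.dist(x, centroids[y]))

    return h


# Hypothetical toy data: x = (suspicious word count, link count).
D = [((8.0, 5.0), "spam"), ((6.0, 7.0), "spam"),
     ((1.0, 0.0), "non-spam"), ((0.0, 2.0), "non-spam")]

h = fit_nearest_centroid(D)
print(h((7.0, 4.0)))  # prints: spam
print(h((1.0, 1.0)))  # prints: non-spam
```

Here h only approximates the unknown ground truth g: it agrees with the training labels on this toy set, but nothing guarantees it labels unseen emails correctly.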
In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously. In decision theory, this is defined by specifying a loss function or cost function that assigns a specific value to the "loss" resulting from producing an incorrect label. The goal then is to minimize the expected loss, with the expectation taken over the probability distribution of $\mathcal{X}$. In practice, neither the distribution of $\mathcal{X}$ nor the ground truth function $g : \mathcal{X} \rightarrow \mathcal{Y}$ is known exactly; both can be computed only empirically by collecting a large number of samples of $\mathcal{X}$ and hand-labeling them using the correct value of $\mathcal{Y}$ (a time-consuming process, which is typically the limiting factor in the amount of data of this sort that can be collected). The particular loss function depends on the type of label being predicted. For example, in the case of classification, the simple zero-one loss function is often sufficient. This corresponds simply to assigning a loss of 1 to any incorrect labeling and implies that the optimal classifier minimizes the error rate on independent test data (i.e. the fraction of instances that the learned function $h : \mathcal{X} \rightarrow \mathcal{Y}$ labels wrongly, which is equivalent to maximizing the number of correctly classified instances). The goal of the learning procedure is then to minimize the error rate (maximize the correctness) on a "typical" test set.

For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.e., to estimate a function of the form

$$p(\text{label} \mid \boldsymbol{x}, \boldsymbol{\theta}) = f(\boldsymbol{x}; \boldsymbol{\theta})$$

where the feature vector input is $\boldsymbol{x}$, and the function $f$ is typically parameterized by some parameters $\boldsymbol{\theta}$.[6] In a discriminative approach to the problem, $f$ is estimated directly. In a generative approach, however, the inverse probability $p(\boldsymbol{x} \mid \text{label})$ is instead estimated and combined with the prior probability $p(\text{label} \mid \boldsymbol{\theta})$ using Bayes' rule, as follows:

$$p(\text{label} \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{x} \mid \text{label}, \boldsymbol{\theta})\, p(\text{label} \mid \boldsymbol{\theta})}{\sum_{L \in \text{all labels}} p(\boldsymbol{x} \mid L, \boldsymbol{\theta})\, p(L \mid \boldsymbol{\theta})}.$$

When the labels are continuously distributed (e.g., in regression analysis), the denominator involves integration rather than summation:

$$p(\text{label} \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{x} \mid \text{label}, \boldsymbol{\theta})\, p(\text{label} \mid \boldsymbol{\theta})}{\int_{L \in \text{all labels}} p(\boldsymbol{x} \mid L, \boldsymbol{\theta})\, p(L \mid \boldsymbol{\theta})\, \mathrm{d}L}.$$

The value of $\boldsymbol{\theta}$ is typically learned using maximum a posteriori (MAP) estimation. This finds the best value that simultaneously meets two conflicting objectives: to perform as well as possible on the training data (smallest error rate) and to find the simplest possible model. Essentially, this combines maximum likelihood estimation with a regularization procedure that favors simpler models over more complex models. In a Bayesian context, the regularization procedure can be viewed as placing a prior probability $p(\boldsymbol{\theta})$ on different values of $\boldsymbol{\theta}$. Mathematically:

$$\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathbf{D})$$

where $\boldsymbol{\theta}^{*}$ is the value used for $\boldsymbol{\theta}$ in the subsequent evaluation procedure, and $p(\boldsymbol{\theta} \mid \mathbf{D})$, the posterior probability of $\boldsymbol{\theta}$, is given, up to a normalizing constant that does not depend on $\boldsymbol{\theta}$, by

$$p(\boldsymbol{\theta} \mid \mathbf{D}) \propto \left[\prod_{i=1}^{n} p(y_i \mid \boldsymbol{x}_i, \boldsymbol{\theta})\right] p(\boldsymbol{\theta}).$$

In the Bayesian approach to this problem, instead of choosing a single parameter vector $\boldsymbol{\theta}^{*}$, the probability of a given label for a new instance $\boldsymbol{x}$ is computed by integrating over all possible values of $\boldsymbol{\theta}$, weighted according to the posterior probability:

$$p(\text{label} \mid \boldsymbol{x}) = \int p(\text{label} \mid \boldsymbol{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{D})\, \mathrm{d}\boldsymbol{\theta}.$$

Frequentist or Bayesian approach to pattern recognition

The first pattern classifier, the linear discriminant presented by Fisher, was developed in the frequentist tradition. The frequentist approach entails that the model parameters are considered unknown, but objective. The parameters are then computed (estimated) from the collected data. For the linear discriminant, these parameters are precisely the mean vectors and the covariance matrix. The probability of each class $p(\text{label} \mid \boldsymbol{\theta})$ is also estimated from the collected dataset. Note that the use of Bayes' rule in a pattern classifier does not make the classification approach Bayesian.
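A minimal sketch of the frequentist recipe described above, for one real-valued feature and two classes: the class means, a pooled variance, and the class probabilities are all estimated from the data, and Bayes' rule then turns the class-conditional densities p(x | label) into p(label | x). The function names and the toy data are hypothetical, and the one-dimensional shared-variance Gaussian model is an assumption standing in for the general linear discriminant.

```python
import math


def fit(samples):
    """Estimate theta = (class means, pooled variance, class priors) from (x, label) pairs."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    n = len(samples)
    means = {c: sum(xs) / len(xs) for c, xs in by_label.items()}
    # Pooled (shared) variance across classes, the 1-D analogue of a common covariance matrix.
    squared_dev = sum((x - means[c]) ** 2 for c, xs in by_label.items() for x in xs)
    variance = squared_dev / (n - len(by_label))
    # Class probabilities estimated as relative frequencies, not chosen a priori.
    priors = {c: len(xs) / n for c, xs in by_label.items()}
    return means, variance, priors


def posterior(x, theta):
    """Bayes' rule: p(label | x) = p(x | label) p(label) / sum over L of p(x | L) p(L)."""
    means, variance, priors = theta

    def density(value, mu):
        # Gaussian class-conditional density p(x | label).
        return math.exp(-(value - mu) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

    joint = {c: density(x, means[c]) * priors[c] for c in means}
    total = sum(joint.values())
    return {c: j / total for c, j in joint.items()}


# Hypothetical toy data: class "a" is centred at 2, class "b" at 7.
data = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (6.0, "b"), (7.0, "b"), (8.0, "b")]
theta = fit(data)
probs = posterior(4.5, theta)  # midway between the two class means
print(probs)
```

Every quantity here is estimated from the data and no prior is placed on the parameters, so the procedure remains frequentist even though it applies Bayes' rule.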
In a Bayesian classifier, by contrast, the class probabilities $p(\text{label} \mid \boldsymbol{\theta})$ can be chosen by the user, in which case they are a priori. Moreover, experience quantified as a priori parameter values can be weighted with empirical observations, using e.g. the Beta (conjugate prior) and Dirichlet distributions. The Bayesian approach facilitates a seamless intermixing between expert knowledge in the form of subjective probabilities, and objective observations. Probabilistic pattern classifiers can be used according to a frequentist or a Bayesian approach.

Uses

Figure: The face was automatically detected by special software.

Within medical science, pattern recognition is the basis for
computer-aided diagnosis (CAD) systems. CAD describes a procedure that
supports the doctor's interpretations and findings. Other typical
applications of pattern recognition techniques are automatic speech
recognition, classification of text into several categories (e.g.,
spam/non-spam email messages), the automatic recognition of
handwritten postal codes on postal envelopes, automatic recognition of
images of human faces, or handwriting image extraction from medical
forms.[7] The last two examples form the subtopic image analysis of
pattern recognition that deals with digital images as input to pattern
recognition systems.[8][9]
Optical character recognition is a classic example of the application
of a pattern classifier. The method of signing one's name was captured
with stylus and overlay starting in 1990.[citation needed] The strokes,
speed, relative minimum, relative maximum, acceleration, and pressure
are used to uniquely identify and confirm identity. Banks were first
offered this technology, but were content to collect from the FDIC for
any bank fraud and did not want to inconvenience customers.[citation needed]

Applications in image processing include:
identification and authentication: e.g., license plate recognition,[10] fingerprint analysis and face detection/verification;[11]
medical diagnosis: e.g., screening for cervical cancer (Papnet)[12] or breast tumors;
defence: various navigation and guidance systems, target recognition systems, shape recognition technology etc.

For a discussion of the aforementioned applications of neural networks in image processing, see e.g. [13].

In psychology, pattern recognition (making sense of and identifying objects) is closely related to perception, which explains how the sensory inputs humans receive are made meaningful. Pattern recognition can be thought of in two different ways: the first being template matching and the second being feature detection. A template is a pattern used to produce items of the same proportions. The template-matching hypothesis suggests that incoming stimuli are compared with templates in long-term memory. If there is a match, the stimulus is identified. Feature detection models, such as the Pandemonium system for classifying letters (Selfridge, 1959), suggest that the stimuli are broken down into their component parts for identification. For example, a capital E has three horizontal lines and one vertical line.[14]

Algorithms

Algorithms for pattern recognition depend on the type of label output, on whether learning is supervised or unsupervised, and on whether the algorithm is statistical or non-statistical in nature. Statistical algorithms can further be categorized as generative or discriminative.

Classification algorithms (supervised algorithms predicting categorical labels)

Main article: Statistical classification

Parametric:[15]
Linear discriminant analysis
Quadratic discriminant analysis
Nonparametric:[16]
Decision trees, decision lists
Kernel estimation and ...

Clustering algorithms (unsupervised algorithms predicting categorical labels)

Main article: Cluster analysis

Categorical mixture models

Ensemble learning algorithms

Boosting (meta-algorithm)

General algorithms for predicting arbitrarily-structured (sets of) labels

Bayesian networks
Markov random fields

Real-valued sequence labeling algorithms (predicting sequences of real-valued labels)

Main article: Sequence labeling

Supervised (?):
Kalman filters
Particle filters

Regression algorithms (predicting real-valued labels)

Main article: Regression analysis

Sequence labeling algorithms (predicting sequences of categorical labels)

Supervised:
Conditional random fields (CRFs)
Hidden Markov models (HMMs)
Maximum entropy Markov models (MEMMs)
Recurrent neural networks

Unsupervised:
Hidden Markov models (HMMs)

See also

Adaptive resonance theory
Black box
Cache language model
Compound term processing
Computer-aided diagnosis

References

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. p. vii.
For the linear discriminant, $\boldsymbol{\theta}$ consists of the two mean vectors $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ and the common covariance matrix $\boldsymbol{\Sigma}$.
Milewski, Robert; Govindaraju, Venu (31 March 2008). "Binarization and cleanup of handwritten text from carbon copy medical form images". Pattern Recognition. 41 (4): 1308–1315. doi:10.1016/j.patcog.2007.08.018.
Duda, Richard O.; Hart, Peter E.; Stork, David G. (2001). Pattern Classification (2nd ed.). New York: Wiley. ISBN 0-471-05669-3.
R. Brunelli,

Further reading

Fukunaga, Keinosuke (1990). Introduction to Statistical Pattern Recognition (2nd ed.). Boston: Academic Press. ISBN 0-12-269851-7.
Hornegger, Joachim; Paulus, Dietrich W. R. (1999). Applied Pattern Recognition: A Practical Introduction to Image and Speech Processing in C++ (2nd ed.). San Francisco: Morgan Kaufmann Publishers. ISBN 3-528-15558-2.
Schuermann, Juergen (1996). Pattern Classification: A Unified View of Statistical and Neural Approaches. New York: Wiley. ISBN 0-471-13534-8.
Toussaint, Godfried T., ed. (1988). Computational Morphology. Amsterdam: North-Holland Publishing Company.
Kulikowski, Casimir A.; Weiss, Sholom M. (1991). Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Machine Learning. San Francisco: Morgan Kaufmann Publishers. ISBN 1-55860-065-5.
Jain, Anil K.; Duin, Robert P. W.; Mao, Jianchang (2000). "Statistical pattern recognition: a review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 22 (1): 4–37. doi:10.1109/34.824819.
An introductory tutorial to classifiers (introducing the basic terms, with numeric example)

External links

The International Association for Pattern Recognition
List of Pattern Recognition web sites
Journal of Pattern Recognition Research
Pattern Recognition Info
Pattern Recognition (Journal of the Pattern Recognition Society)
International Journal of Pattern Recognition and Artificial Intelligence
International Journal of Applied Pattern Recognition
Open Pattern Recognition Project, intended to be an open source platform for sharing algorithms of pattern recognition
Improved Fast Pattern Matching