Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, despite the fact that the image of the objects may vary somewhat from different viewpoints, in many different sizes and scales, or even when they are translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.
Approaches based on CAD-like object models
* Edge detection
* Primal sketch
* Marr, Mohan and Nevatia
* Lowe
* Olivier Faugeras
Recognition by parts
* Generalized cylinders (Thomas Binford)
* Geons (Irving Biederman)
* Dickinson, Forsyth and Ponce
Appearance-based methods
* Use example images (called templates or exemplars) of the objects to perform recognition
* Objects look different under varying conditions:
** Changes in lighting or color
** Changes in viewing direction
** Changes in size/shape
* A single exemplar is unlikely to succeed reliably; however, it is also impossible to represent all appearances of an object, so a balance must be struck in how many exemplars are stored
Edge matching
* Uses edge detection techniques, such as Canny edge detection, to find edges.
* Changes in lighting and color usually don't have much effect on image edges
* Strategy:
*# Detect edges in template and image
*# Compare edge images to find the template
*# Must consider range of possible template positions
* Measurements:
** Good – count the number of overlapping edges. Not robust to changes in shape
** Better – count the number of template edge pixels within some distance of an edge in the search image
** Best – determine probability distribution of distance to nearest edge in search image (if template at correct position). Estimate likelihood of each template position generating image
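The "better" and "best" measurements above both rest on the distance from each template edge pixel to the nearest edge in the search image. A toy pure-Python sketch of that idea (function names are illustrative; a real system would use an efficient distance transform rather than BFS): build a chamfer-style distance map, then score each template position by the mean distance of its shifted edge pixels.

```python
from collections import deque

def distance_transform(edges, h, w):
    """BFS 4-neighbour (Manhattan) distance from each cell to the nearest edge pixel."""
    INF = float("inf")
    dist = [[INF] * w for _ in range(h)]
    q = deque()
    for (y, x) in edges:
        dist[y][x] = 0
        q.append((y, x))
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] > dist[y][x] + 1:
                dist[ny][nx] = dist[y][x] + 1
                q.append((ny, nx))
    return dist

def chamfer_score(template_edges, dist, oy, ox):
    """Mean distance from each (shifted) template edge pixel to the nearest image edge."""
    total = 0.0
    for (y, x) in template_edges:
        total += dist[y + oy][x + ox]
    return total / len(template_edges)

def best_position(template_edges, image_edges, h, w, th, tw):
    """Scan the range of possible template positions and keep the lowest score."""
    dist = distance_transform(image_edges, h, w)
    best = None
    for oy in range(h - th + 1):
        for ox in range(w - tw + 1):
            s = chamfer_score(template_edges, dist, oy, ox)
            if best is None or s < best[0]:
                best = (s, oy, ox)
    return best
```

A score of zero means every template edge pixel landed exactly on an image edge at that position.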
Divide-and-Conquer search
* Strategy:
** Consider all positions as a set (a cell in the space of positions)
** Determine lower bound on score at best position in cell
** If bound is too large, prune cell
** If bound is not too large, divide cell into subcells and try each subcell recursively
** Process stops when cell is “small enough”
* Unlike multi-resolution search, this technique is guaranteed to find all matches that meet the criterion (assuming that the lower bound is accurate)
* Finding the Bound:
** To find the lower bound on the best score, look at score for the template position represented by the center of the cell
** Subtract maximum change from the “center” position for any other position in cell (occurs at cell corners)
* Complexities arise from determining bounds on distance
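The divide-and-conquer strategy above can be sketched for a one-dimensional position space. Assumptions (all illustrative): the score is a cost (lower is better), it is Lipschitz-continuous with a known constant, and a match is any position whose score is at most a threshold.

```python
def branch_and_bound(score, lo, hi, threshold, lipschitz, min_width=1e-3):
    """Return intervals guaranteed to contain every position whose score <= threshold.

    score: cost at a position, assumed Lipschitz-continuous with constant `lipschitz`.
    A cell [a, b] is pruned when even its optimistic lower bound exceeds the threshold.
    """
    matches = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        center = (a + b) / 2
        # Best possible score anywhere in the cell: the score at the center minus
        # the maximum change over the cell (worst case at the cell corners).
        lower_bound = score(center) - lipschitz * (b - a) / 2
        if lower_bound > threshold:
            continue  # prune: no position in this cell can match
        if b - a <= min_width:
            matches.append((a, b))  # cell is "small enough"
        else:
            stack.append((a, center))  # divide into subcells and recurse
            stack.append((center, b))
    return matches
```

Because pruning only discards cells whose lower bound provably exceeds the threshold, every true match survives, which is the guarantee multi-resolution search lacks.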
Greyscale matching
* Edges are (mostly) robust to illumination changes; however, they throw away a lot of information
* Must compute pixel distance as a function of both pixel position and pixel intensity
* Can be applied to color also
Gradient matching
* Another way to be robust to illumination changes without throwing away as much information is to compare image gradients
* Matching is performed like matching greyscale images
* Simple alternative: Use (normalized) correlation
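As a concrete form of the "simple alternative", here is a zero-mean normalized correlation between two equal-length intensity patches (pure-Python sketch; a real system would vectorize this over a whole image):

```python
import math

def normalized_correlation(a, b):
    """Zero-mean normalized cross-correlation of two equal-length intensity patches.

    Subtracting the mean and dividing by the norm makes the score invariant to
    affine brightness changes (gain and offset), which is why normalized
    correlation is a common alternative to raw greyscale or gradient differences.
    Returns a value in [-1, 1]; 1 means a perfect match up to brightness.
    """
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    da = [x - ma for x in a]
    db = [x - mb for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return num / den if den else 0.0
```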
Histograms of receptive field responses
* Avoids explicit point correspondences
* Relations between different image points implicitly coded in the receptive field responses
* Swain and Ballard (1991), Schiele and Crowley (2000), Linde and Lindeberg (2004, 2012)
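A minimal illustration of the idea, using Swain and Ballard's histogram intersection on one-dimensional response values (a toy sketch; real receptive-field or colour histograms are multi-dimensional):

```python
def histogram(values, bins, lo, hi):
    """Normalized histogram of receptive-field (or colour) responses."""
    counts = [0] * bins
    for v in values:
        i = min(bins - 1, int((v - lo) / (hi - lo) * bins))
        counts[i] += 1
    total = len(values)
    return [c / total for c in counts]

def intersection(h1, h2):
    """Swain & Ballard histogram intersection: 1.0 for identical distributions,
    0.0 for histograms with no overlapping mass. No point correspondences are
    needed; spatial relations are only implicitly coded in the responses."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```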
Large modelbases
* One approach to efficiently searching the database for a specific image is to use eigenvectors of the templates (called eigenfaces)
* Modelbases are collections of geometric models of the objects that should be recognized
Feature-based methods
* A search is used to find feasible matches between object features and image features.
* The primary constraint is that a single position of the object must account for all of the feasible matches.
* Methods that extract features from the objects to be recognized and the images to be searched:
** surface patches
** corners
** linear edges
Interpretation trees
* A method for searching for feasible matches is to search through a tree.
* Each node in the tree represents a set of matches.
** Root node represents empty set
** Each other node is the union of the matches in the parent node and one additional match.
** Wildcard is used for features with no match
* Nodes are “pruned” when the set of matches is infeasible.
** A pruned node has no children
* Historically significant and still used, but less commonly
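The tree search above can be sketched with a translation-only feasibility test standing in for real geometric constraints (toy 2-D point features; all names are illustrative):

```python
WILDCARD = None

def consistent(matches):
    """Feasibility test: every non-wildcard pairing must agree on a single
    translation offset. (A stand-in for the geometric constraints a real
    system would check.)"""
    offsets = set()
    for (ox, oy), img in matches:
        if img is not WILDCARD:
            ix, iy = img
            offsets.add((ix - ox, iy - oy))
    return len(offsets) <= 1

def interpretation_tree(object_feats, image_feats, matches=()):
    """Depth-first search over the tree of partial match sets.

    The root is the empty set; each child extends its parent with one more
    (object feature, image feature) pair, with WILDCARD marking an object
    feature that has no match. Infeasible nodes are pruned and get no children.
    Returns every feasible complete interpretation.
    """
    if not consistent(list(matches)):
        return []  # pruned node
    if len(matches) == len(object_feats):
        return [list(matches)]
    obj = object_feats[len(matches)]
    results = []
    for img in list(image_feats) + [WILDCARD]:
        results += interpretation_tree(object_feats, image_feats,
                                       matches + ((obj, img),))
    return results
```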
Hypothesize and test
* General Idea:
** Hypothesize a correspondence between a collection of image features and a collection of object features
** Then use this to generate a hypothesis about the projection from the object coordinate frame to the image frame
** Use this projection hypothesis to generate a rendering of the object. This step is usually known as backprojection
** Compare the rendering to the image, and, if the two are sufficiently similar, accept the hypothesis
* Obtaining Hypotheses:
** There are a variety of different ways of generating hypotheses.
** When camera intrinsic parameters are known, the hypothesis is equivalent to a hypothetical position and orientation – pose – for the object.
** Utilize geometric constraints
** Construct a correspondence for small sets of object features to every correctly sized subset of image points. (These are the hypotheses)
* Three basic approaches:
** Obtaining Hypotheses by Pose Consistency
** Obtaining Hypotheses by Pose Clustering
** Obtaining Hypotheses by Using Invariants
* The search is expensive and redundant, but can be improved using randomization and/or grouping
** Randomization
*** Examining small sets of image features until likelihood of missing object becomes small
*** For each set of image features, all possible matching sets of model features must be considered.
*** Formula:
***: (1 − W^c)^k = Z
**** W = the fraction of image points that are “good” (w ≈ m/n)
**** c = the number of correspondences necessary
**** k = the number of trials
**** Z = the probability of every trial using one (or more) incorrect correspondences
** Grouping
*** If we can determine groups of points that are likely to come from the same object, we can reduce the number of hypotheses that need to be examined
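The randomization formula above can be turned directly into a trial count by solving (1 − W^c)^k = Z for k (a small sketch; variable names mirror the formula):

```python
import math

def trials_needed(w, c, z):
    """Number of random trials k needed so that the probability of every trial
    using one or more incorrect correspondences drops to z, from
    (1 - w**c)**k = z  =>  k = log(z) / log(1 - w**c).

    w: fraction of image points that are "good"
    c: number of correspondences needed per hypothesis
    z: acceptable probability of missing the object
    """
    return math.ceil(math.log(z) / math.log(1.0 - w ** c))
```

For example, with half the image points "good" (w = 0.5), three correspondences per hypothesis, and a 1% acceptable miss rate, 35 trials suffice.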
Pose consistency
* Also called Alignment, since the object is being aligned to the image
* Correspondences between image features and model features are not independent – Geometric constraints
* A small number of correspondences yields the object position – the others must be consistent with this
* General Idea:
** If we hypothesize a match between a sufficiently large group of image features and a sufficiently large group of object features, then we can recover the missing camera parameters from this hypothesis (and so render the rest of the object)
* Strategy:
** Generate hypotheses using small number of correspondences (e.g. triples of points for 3D recognition)
** Project other model features into image (backproject) and verify additional correspondences
* Use the smallest number of correspondences necessary to achieve discrete object poses
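A translation-only toy version of the alignment strategy (a real 3D system would hypothesize from triples of points and recover a full pose; names are illustrative): hypothesize a pose from one correspondence, backproject the remaining model points, and count how many are verified.

```python
def align_and_verify(model_points, image_points, tol=0.5):
    """Hypothesize a pose from a single (model, image) correspondence
    (translation-only for brevity), backproject the remaining model points,
    and count how many land within `tol` of an image point. Returns the best
    (verified count, pose) pair found."""
    best = (0, None)
    for mx, my in model_points:
        for ix, iy in image_points:
            tx, ty = ix - mx, iy - my  # hypothesized pose (translation)
            verified = sum(
                1 for px, py in model_points
                if any(abs(px + tx - qx) <= tol and abs(py + ty - qy) <= tol
                       for qx, qy in image_points))
            if verified > best[0]:
                best = (verified, (tx, ty))
    return best
```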
Pose clustering
* General Idea:
** Each object leads to many correct sets of correspondences, each of which has (roughly) the same pose
** Vote on pose. Use an accumulator array that represents pose space for each object
** This is essentially a Hough transform
* Strategy:
** For each object, set up an accumulator array that represents pose space – each element in the accumulator array corresponds to a “bucket” in pose space.
** Then take each image frame group, and hypothesize a correspondence between it and every frame group on every object
** For each of these correspondences, determine pose parameters and make an entry in the accumulator array for the current object at the pose value.
** If there are large numbers of votes in any object's accumulator array, this can be interpreted as evidence for the presence of that object at that pose.
** The evidence can be checked using a verification method
* Note that this method uses sets of correspondences, rather than individual correspondences
** Implementation is easier, since each set yields a small number of possible object poses.
* Improvement
** The noise resistance of this method can be improved by not counting votes for objects at poses where the vote is obviously unreliable
*** For example, in cases where, if the object were at that pose, the object frame group would be invisible.
** These improvements are sufficient to yield working systems
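A minimal accumulator-array sketch for a translation-only pose space (illustrative; a full system would vote over rotation and scale as well, and then verify the winning bucket):

```python
from collections import Counter

def pose_clustering(model_points, image_points, bucket=1.0):
    """Vote in a quantized pose (translation-only) accumulator array.

    Every (model point, image point) pairing hypothesizes a translation;
    correct pairings all vote for (roughly) the same bucket, so a large vote
    count in one bucket is evidence for the object at that pose."""
    acc = Counter()
    for (mx, my) in model_points:
        for (ix, iy) in image_points:
            tx, ty = ix - mx, iy - my
            acc[(round(tx / bucket), round(ty / bucket))] += 1
    (bx, by), votes = acc.most_common(1)[0]
    return (bx * bucket, by * bucket), votes
```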
Invariance
* There are geometric properties that are invariant to camera transformations
* Most easily developed for images of planar objects, but can be applied to other cases as well
Geometric hashing
* An algorithm that uses geometric invariants to vote for object hypotheses
* Similar to pose clustering, however instead of voting on pose, we are now voting on geometry
* A technique originally developed for matching geometric features (uncalibrated affine views of plane models) against a database of such features
* Widely used for pattern-matching, CAD/CAM, and medical imaging.
* It is difficult to choose the size of the buckets
* It is hard to be sure what “enough” means. Therefore, there may be some danger that the table will get clogged.
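A toy translation-invariant sketch of the two phases (a real geometric-hashing system would use a multi-point basis to obtain affine invariance; names are illustrative). The bucket-size difficulty noted above corresponds to the `bucket` parameter here.

```python
from collections import defaultdict

def build_table(models, bucket=1.0):
    """Preprocessing: for each model and each choice of basis point, store the
    quantized coordinates of every other point relative to that basis."""
    table = defaultdict(list)
    for name, points in models.items():
        for bx, by in points:
            for px, py in points:
                if (px, py) != (bx, by):
                    key = (round((px - bx) / bucket), round((py - by) / bucket))
                    table[key].append((name, (bx, by)))
    return table

def recognize(table, scene_points, bucket=1.0):
    """Recognition: pick each scene point as a trial basis, look up the other
    points in the table, and vote for (model, basis) hypotheses."""
    votes = defaultdict(int)
    for bx, by in scene_points:
        for px, py in scene_points:
            if (px, py) != (bx, by):
                key = (round((px - bx) / bucket), round((py - by) / bucket))
                for entry in table[key]:
                    votes[entry] += 1
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None
```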
Scale-invariant feature transform (SIFT)
* Keypoints of objects are first extracted from a set of reference images and stored in a database
* An object is recognized in a new image by individually comparing each feature from the new image to this database and finding candidate matching features based on Euclidean distance of their feature vectors.
* Lowe (2004)
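The database comparison step can be sketched with Lowe's nearest-neighbour ratio test on toy descriptors (illustrative; real SIFT descriptors are 128-dimensional and the search would use an approximate nearest-neighbour structure rather than brute force):

```python
import math

def ratio_test_matches(query_descs, db_descs, ratio=0.8):
    """Match each query descriptor to its nearest database descriptor by
    Euclidean distance, keeping the match only when the nearest neighbour is
    clearly closer than the second nearest (Lowe's ratio test), which rejects
    ambiguous candidate matches."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    matches = []
    for qi, q in enumerate(query_descs):
        ranked = sorted(range(len(db_descs)), key=lambda j: dist(q, db_descs[j]))
        best, second = ranked[0], ranked[1]
        if dist(q, db_descs[best]) < ratio * dist(q, db_descs[second]):
            matches.append((qi, best))
    return matches
```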
Speeded Up Robust Features (SURF)
* A robust local image feature detector and descriptor
* The standard version is several times faster than SIFT and claimed by its authors to be more robust against different image transformations than SIFT
* Based on sums of approximated 2D Haar wavelet responses and makes efficient use of integral images.
* Bay et al. (2008)
Bag of words representations
Genetic algorithm
Genetic algorithms can operate without prior knowledge of a given dataset and can develop recognition procedures without human intervention. A recent project achieved 100 percent accuracy on the benchmark motorbike, face, airplane and car image datasets from Caltech and 99.4 percent accuracy on fish species image datasets.
Other approaches
* 3D object recognition and reconstruction
* Biologically inspired object recognition
* Artificial neural networks and deep learning, especially convolutional neural networks
* Context [Oliva, Aude, and Antonio Torralba. "The role of context in object recognition." Trends in Cognitive Sciences 11.12 (2007): 520–527.]
* Explicit and implicit 3D object models
* Fast indexing
* Global scene representations
* Gradient histograms
* Stochastic grammars
* Intraclass transfer learning
* Object categorization from image search
* Reflectance
* Shape-from-shading
* Template matching
* Texture [Shotton, Jamie, et al. "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context." International Journal of Computer Vision 81.1 (2009): 2–23.]
* Topic models [Niu, Zhenxing, et al. "Context aware topic model for scene recognition." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.]
* Unsupervised learning
* Window-based detection
* Deformable part models
* Bingham distribution
Applications
Object recognition methods have the following applications:
* Activity recognition
* Automatic image annotation
* Automatic target recognition
* Android Eyes – Object Recognition
* Computer-aided diagnosis
* Image panoramas
* Image watermarking
* Global robot localization
* Face detection
* Optical Character Recognition
* Manufacturing quality control
* Content-based image retrieval
* Object Counting and Monitoring
* Automated parking systems
* Visual positioning and tracking
* Video stabilization
* Pedestrian detection
* Intelligent speed assistance (in cars and other vehicles)
Surveys
*Daniilides and Eklundh, Edelman.
See also
* Histogram of oriented gradients
* Convolutional neural network
* OpenCV
* Scale-invariant feature transform (SIFT)
* Object detection
* Scholarpedia article on scale-invariant feature transform and related object recognition methods
* SURF
* Template matching
* Integral channel feature
; Lists
* List of computer vision topics
* List of emerging technologies
* Outline of artificial intelligence
Notes
References
* Elgammal, Ahmed. "CS 534: Computer Vision 3D Model-based recognition". Dept. of Computer Science, Rutgers University.
* Hartley, Richard and Zisserman, Andrew. "Multiple View Geometry in Computer Vision". Cambridge University Press, 2000.
* Roth, Peter M. and Winter, Martin. "Survey of Appearance-Based Methods for Object Recognition". Technical Report ICG-TR-01/08, Inst. for Computer Graphics and Vision, Graz University of Technology, Austria; January 15, 2008.
* Collins, Robert. "Lecture 31: Object Recognition: SIFT Keys". CSE486, Penn State.
* Erhan, Dumitru, et al. "Deep Neural Networks for Object Detection". Advances in Neural Information Processing Systems 26, 2013, pp. 2553–2561.