Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories (faces, watches, ants, pianos, etc.) and a background category. Provided with the images is a set of annotations describing the outline of the object in each image, along with a Matlab script for viewing.
Purpose
Most computer vision and machine learning algorithms function by training on example inputs. They require a large and varied set of training data to work effectively. For example, the real-time face detection method used by Paul Viola and Michael J. Jones was trained on 4,916 hand-labeled faces.
Cropping, re-sizing and hand-marking points of interest is tedious and time-consuming.
Historically, most data sets used in computer vision research have been tailored to the specific needs of the project being worked on. A large problem in comparing computer vision techniques is the fact that most groups use their own data sets. Each set may have different properties that make reported results from different methods harder to compare directly. For example, differences in image size, image quality, relative location of objects within the images and level of occlusion and clutter present can lead to varying results.
The Caltech 101 data set aims at alleviating many of these common problems.
*The images are cropped and re-sized.
*Many categories are represented, which suits both single and multiple class recognition algorithms.
*Detailed object outlines are marked.
*Available for general use, Caltech 101 acts as a common standard by which to compare different algorithms without bias due to different data sets.
However, a later study demonstrates that tests based on uncontrolled natural images (like the Caltech 101 data set) can be seriously misleading, potentially guiding progress in the wrong direction.
Data set
Images
The Caltech 101 data set consists of a total of 9,146 images, split between 101 different object categories, as well as an additional background/clutter category.
Each object category contains roughly 40 to 800 images, although a few categories contain as few as 31. Common and popular categories such as faces tend to have a larger number of images than others.
Each image is about 300x200 pixels. Images of oriented objects such as airplanes and motorcycles were mirrored to be left-to-right aligned, and vertically oriented structures such as buildings were rotated to be off axis.
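The mirroring step can be illustrated with a minimal sketch. It assumes an image stored as a row-major list of pixel rows; the function name `mirror_left_right` and the tiny example image are hypothetical and are not part of the Caltech 101 tooling.

```python
def mirror_left_right(image):
    """Return a horizontally mirrored copy of a row-major image.

    `image` is assumed to be a list of rows, each row a list of pixel values.
    """
    return [list(reversed(row)) for row in image]


# Hypothetical 2x3 image used only for illustration.
tiny = [
    [1, 2, 3],
    [4, 5, 6],
]
mirrored = mirror_left_right(tiny)
```

Mirroring twice returns the original image, so the operation changes orientation without losing any pixel data.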
Annotations
A set of annotations is provided for each image. Each set of annotations contains two pieces of information: the general bounding box in which the object is located and a detailed human-specified outline enclosing the object.
A Matlab script is provided with the annotations. It loads an image and its corresponding annotation file and displays them as a Matlab figure.
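As a rough illustration of how the two pieces of annotation relate, the sketch below models an annotation as a bounding box plus an outline and derives the tightest box from the outline points. The `Annotation` class, its field names, the coordinate order and the sample points are assumptions made for this example; they do not describe the layout of the distributed annotation files.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates (assumed convention)


@dataclass
class Annotation:
    """Hypothetical container for the two pieces of information in an annotation."""
    bounding_box: Tuple[float, float, float, float]  # (left, top, right, bottom)
    outline: List[Point]


def box_from_outline(outline):
    """Return the tightest axis-aligned box enclosing all outline points."""
    if not outline:
        raise ValueError("outline must contain at least one point")
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    return (min(xs), min(ys), max(xs), max(ys))


def outline_inside_box(ann):
    """Check that every outline point lies within the annotation's bounding box."""
    left, top, right, bottom = ann.bounding_box
    return all(left <= x <= right and top <= y <= bottom for x, y in ann.outline)


# Made-up outline points, for illustration only.
outline = [(40.0, 30.0), (120.0, 25.0), (150.0, 90.0), (60.0, 110.0)]
ann = Annotation(bounding_box=box_from_outline(outline), outline=outline)
```

A consistency check such as `outline_inside_box` can be useful when loading annotations, since an outline that escapes its box usually indicates a coordinate-convention mismatch.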
Uses
The Caltech 101 data set was used to train and test several computer vision recognition and classification algorithms. The first paper to use Caltech 101 was an incremental Bayesian approach to one-shot learning [L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE CVPR 2004, Workshop on Generative-Model Based Vision. 2004], an attempt to classify an object using only a few examples, by building on prior knowledge of other classes.
The Caltech 101 images, along with the annotations, were used for another one-shot learning paper at Caltech.
Other computer vision papers that report using the Caltech 101 data set include:
*Shape Matching and Object Recognition using Low Distortion Correspondence. Alexander C. Berg, Tamara L. Berg, Jitendra Malik. CVPR, 2005
*The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005
*Combining Generative Models and Fisher Kernels for Object Class Recognition. Holub, AD. Welling, M. Perona, P. International Conference on Computer Vision (ICCV), 2005
*Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005.
*SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006
*Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006
*Empirical Study of Multi-Scale Filter Banks for Object Categorization. M.J. Marín-Jiménez and N. Pérez de la Blanca. December 2005
*Multiclass Object Recognition with Sparse, Localized Features. Jim Mutch and David G. Lowe. pp. 11-18, CVPR 2006, IEEE Computer Society Press, New York, June 2006
*Using Dependent Regions for Object Categorization in a Generative Framework. G. Wang, Y. Zhang, and L. Fei-Fei. IEEE Comp. Vis. Patt. Recog. 2006
Analysis and comparison
Advantages
Caltech 101 has several advantages over other similar data sets:
*Uniform size and presentation:
**Almost all the images within each category are uniform in image size and in the relative position of interest objects. Caltech 101 users generally do not need to crop or scale images before they can be used.
*Low level of clutter/occlusion:
**Algorithms concerned with recognition usually function by storing features unique to the object. However, most images taken have varying degrees of background clutter, which means algorithms may build incorrect models.
*Detailed annotations
Weaknesses
Some weaknesses of the Caltech 101 data set may be conscious trade-offs, but others are limitations of the data set. Papers that rely solely on Caltech 101 are frequently rejected.
Weaknesses include:
*The data set is too clean:
**Images are very uniform in presentation, aligned from left to right, and usually not occluded. As a result, the images are not always representative of practical inputs that the algorithm might later expect to see. Under practical conditions, images are more cluttered, occluded and display greater variance in relative position and orientation of interest objects. The uniformity allows concepts to be derived using the average of a category, which is unrealistic.
*Limited number of categories:
**The Caltech 101 data set represents only a small fraction of possible object categories.
*Some categories contain few images:
**Certain categories are not represented as well as others, containing as few as 31 images.
**As a result, the number of images per category used for training must be less than or equal to 30, which is not sufficient for all purposes.
*Aliasing and artifacts due to manipulation:
**Some images have been rotated and scaled from their original orientation, and suffer from some amount of artifacts or aliasing.
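The constraint on training-set size can be made concrete with a small sketch of a per-category split. It assumes the data set has already been indexed as a mapping from category name to a list of image file names; the function `split_per_category`, the category names and the file names below are hypothetical and only illustrate why 30 is the ceiling when the smallest category has 31 images.

```python
import random


def split_per_category(images_by_category, n_train=30, seed=0):
    """Pick n_train training images per category; the remainder is held out for testing.

    Raises ValueError if any category would be left with no test images.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for category, images in images_by_category.items():
        if n_train >= len(images):
            raise ValueError(
                f"category {category!r} has only {len(images)} images; "
                f"n_train={n_train} leaves nothing to test on"
            )
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[category] = shuffled[:n_train]
        test[category] = shuffled[n_train:]
    return train, test


# Hypothetical index: one category at the lower limit, one at the upper.
toy = {
    "small_category": [f"image_{i:04d}.jpg" for i in range(1, 32)],   # 31 images
    "large_category": [f"image_{i:04d}.jpg" for i in range(1, 801)],  # 800 images
}
train, test = split_per_category(toy, n_train=30)
```

With 30 training images, the smallest category keeps a single test image while the largest keeps 770, which also shows how unevenly test results can be weighted across categories unless accuracy is averaged per category.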
Other data sets
* Caltech 256 is another image data set, created in 2007. It is a successor to Caltech 101. It is intended to address some of the weaknesses of Caltech 101. Overall, it is a more difficult data set than Caltech 101, but it suffers from comparable problems. It includes
**30,607 images, covering a larger number of categories
**Minimum number of images per category raised to 80
**Images are not left-right aligned
**More variation in image presentation
* LabelMe is an open, dynamic data set created at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). LabelMe takes a different approach to the problem of creating a large image data set, with different trade-offs.
**106,739 images, 41,724 annotated images, and 203,363 labeled objects.
**Users may add images to the data set by upload, and add labels or annotations to existing images.
**Due to its open nature, LabelMe has many more images covering a much wider scope than Caltech 101. However, since each person decides what images to upload, and how to label and annotate each image, the images are less consistent.
*VOC 2008 is a European effort to collect images for benchmarking visual categorization methods. Compared to Caltech 101/256, a smaller number of categories (about 20) are collected. The number of images in each category, however, is larger.
* Overhead Imagery Research Data Set (OIRDS) is an annotated library of imagery and tools. [F. Tanner, B. Colder, C. Pullen, D. Heagy, C. Oertel, & P. Sallee, ''Overhead Imagery Research Data Set (OIRDS) – an annotated data library and tools to aid in the development of computer vision algorithms'', June 2009, (28 December 2009)] OIRDS v1.0 is composed of passenger vehicle objects annotated in overhead imagery. Passenger vehicles in the OIRDS include cars, trucks, vans, etc. In addition to the object outlines, the OIRDS includes subjective and objective statistics that quantify the vehicle within the image's context. For example, subjective measures of image clutter, clarity, noise, and vehicle color are included along with more objective statistics such as ground sample distance (GSD), time of day, and day of year.
** ~900 images, containing ~1800 annotated objects
** ~30 annotations per object
** ~60 statistical measures per object
** Wide variation in object context
** Limited to passenger vehicles in overhead imagery
*MICC-Flickr 101 is an image data set created at the Media Integration and Communication Center (MICC), University of Florence, in 2012. It is based on Caltech 101 and is collected from Flickr. MICC-Flickr 101 corrects the main drawback of Caltech 101, i.e. its low inter-class variability, and provides social annotations through user tags. It builds on a standard and widely used data set composed of a manageable number of categories (101) and therefore can be used to compare object categorization performance in a constrained scenario (Caltech 101) and object categorization "in the wild" (MICC-Flickr 101) on the same 101 categories.
See also
* List of datasets for machine learning research
* MNIST database
* LabelMe
External links
* http://www.vision.caltech.edu/Image_Datasets/Caltech101/ -Caltech 101 Homepage (Includes download)
* http://www.vision.caltech.edu/Image_Datasets/Caltech256/ -Caltech 256 Homepage (Includes download)
* http://labelme.csail.mit.edu/ -LabelMe Homepage
* http://www2.it.lut.fi/project/visiq/ -Randomized Caltech 101 download page (Includes download)
* http://www.micc.unifi.it/vim/datasets/micc-flickr-101/ -MICC-Flickr101 Homepage (Includes download)