These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.
Image data
These datasets consist primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.
Facial recognition
In computer vision, face images have been used extensively to develop facial recognition systems, face detection, and many other projects that use images of faces.
Action recognition
Object detection and recognition
Handwriting and character recognition
Aerial images
Other images
Text data
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
News articles
Messages
Twitter and tweets
Dialogues
Other text
Sound data
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
Speech
Music
Other sounds
Signal data
These datasets contain electric signal information that requires some sort of signal processing for further analysis.
Electrical
Motion-tracking
Other signals
Physical data
Datasets from physical systems.
High-energy physics
Systems
Astronomy
Earth science
Other physical
Biological data
Datasets from biological systems.
Human
Animal
Fungi
Plant
Microbe
Drug Discovery
Anomaly data
Question Answering data
This section includes datasets that deal with structured data.
Multivariate data
These datasets consist of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
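The rows-and-columns layout described above can be illustrated with a minimal sketch. The column names and values below are invented for illustration and do not come from any listed dataset. The regression step is ordinary least squares with a single predictor, chosen only because it is the simplest case.

```python
# Toy multivariate dataset: each row is one observation, each column one attribute.
# Column names and values are hypothetical, made up for this sketch.
columns = ["temperature_c", "humidity_pct", "energy_kwh"]
rows = [
    [10.0, 80.0, 32.0],
    [15.0, 70.0, 27.0],
    [20.0, 65.0, 22.0],
    [25.0, 50.0, 17.0],
]


def column(name):
    """Return the values of one attribute across all observations."""
    idx = columns.index(name)
    return [row[idx] for row in rows]


def fit_line(x, y):
    """Ordinary least squares with one predictor: y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Regress one attribute (the target) on another (the predictor).
slope, intercept = fit_line(column("temperature_c"), column("energy_kwh"))
print(f"energy_kwh ~ {slope:.2f} * temperature_c + {intercept:.2f}")
```

In practice such tables usually have many more columns, and the target column may instead hold class labels for a classification task.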
Financial
Weather
Census
Transit
Internet
Games
Other multivariate
Curated repositories of datasets
As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
* OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
* PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
* Metatext NLP: A community-maintained web repository (https://metatext.io/datasets) containing nearly 1000 benchmark datasets and counting. Covers many tasks, from classification to QA, and various languages, including English, Portuguese, and Arabic.
* Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
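As a rough illustration of what standardizing dataset formats can involve, the sketch below converts two differently laid-out text sources into one common structure of feature names, a numeric feature matrix, and a target list. The `standardize` function, the sample data, and the column names are all hypothetical. This is not the pipeline or API of any repository listed above.

```python
import csv
import io


def standardize(raw_text, delimiter, target_column):
    """Parse delimited text into (feature_names, X, y).

    X holds numeric feature rows; y holds the target values as strings.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    feature_names = [name for name in reader.fieldnames if name != target_column]
    X, y = [], []
    for record in reader:
        X.append([float(record[name]) for name in feature_names])
        y.append(record[target_column])
    return feature_names, X, y


# Two hypothetical sources: different delimiters, column order, and target name.
source_a = "length,width,label\n5.1,3.5,a\n6.2,2.9,b\n"
source_b = "class;length;width\nb;6.0;3.0\na;4.9;3.1\n"

names_a, X_a, y_a = standardize(source_a, ",", "label")
names_b, X_b, y_b = standardize(source_b, ";", "class")

# After standardization both sources share the same layout.
print(names_a == names_b)
```

Once every dataset exposes the same structure, the same evaluation code can be run across all of them, which is what makes benchmarking across many datasets practical.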
See also
* Comparison of deep learning software
* List of manual image annotation tools
* List of biological databases
References
{{Differentiable computing