These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.


Image data

These datasets consist primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.


Facial recognition

In computer vision, face images have been used extensively to develop facial recognition systems, face detection, and many other projects that use images of faces.


Action recognition


Object detection and recognition


Handwriting and character recognition


Aerial images


Other images


Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.


Reviews


News articles


Messages


Twitter and tweets


Dialogues


Other text


Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.


Speech


Music


Other sounds


Signal data

These datasets contain electrical signal information that requires some form of signal processing for further analysis.


Electrical


Motion-tracking


Other signals


Physical data

Datasets from physical systems.


High-energy physics


Systems


Astronomy


Earth science


Other physical


Biological data

Datasets from biological systems.


Human


Animal


Fungi


Plant


Microbe


Drug Discovery


Anomaly data


Question Answering data

This section includes datasets that deal with structured data.


Multivariate data

These datasets consist of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be applied. This section includes datasets that do not fit in the above categories.
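As a minimal sketch of the layout described above, the following Python snippet stores a small multivariate dataset as rows of observations and columns of attributes, then fits a one-variable least-squares regression on two of the columns. The column names and values are invented for illustration and do not come from any dataset in this list.

```python
# Hypothetical toy data: each dict is one observation (row),
# each key is one attribute (column).
rows = [
    {"area_m2": 50.0, "rooms": 2, "price": 150.0},
    {"area_m2": 70.0, "rooms": 3, "price": 210.0},
    {"area_m2": 90.0, "rooms": 4, "price": 270.0},
    {"area_m2": 110.0, "rooms": 4, "price": 330.0},
]


def column(data, name):
    """Extract one attribute (column) across all observations (rows)."""
    return [row[name] for row in data]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(column(rows, "area_m2"), column(rows, "price"))
print(f"price = {slope:.2f} * area_m2 + {intercept:.2f}")
```

Classification uses the same row-and-column layout; the only difference is that the target column holds a category rather than a number.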


Financial


Weather


Census


Transit


Internet


Games


Other multivariate


Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, considerable work has gone into curating and standardizing the format of datasets to make them easier to use for machine learning research.
* OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
* PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
* Metatext NLP: https://metatext.io/datasets is a community-maintained web repository containing nearly 1000 benchmark datasets. It covers many tasks, from classification to QA, and various languages, including English, Portuguese, and Arabic.
* Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
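To illustrate what standardizing a dataset's format can involve, here is a minimal sketch in Python that reads two differently formatted sources into one common shape: a list of feature names, a list of numeric feature rows, and a list of target labels. The helper name `to_standard_form`, the column names, and the sample data are all hypothetical; this is not the API of OpenML, PMLB, or any other repository listed above.

```python
import csv
import io


def to_standard_form(text, delimiter, target_column):
    """Parse delimited text into (feature_names, X, y).

    X is a list of rows of floats; y is the list of target labels as strings.
    Assumes every non-target column is numeric (an illustrative simplification).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    feature_names = [name for name in reader.fieldnames if name != target_column]
    X, y = [], []
    for record in reader:
        X.append([float(record[name]) for name in feature_names])
        y.append(record[target_column])
    return feature_names, X, y


# Two hypothetical sources with different delimiters and target-column positions.
source_a = "sepal_len,sepal_wid,species\n5.1,3.5,setosa\n6.2,2.9,versicolor\n"
source_b = "label;x1;x2\nspam;0.9;12\nham;0.1;3\n"

names_a, X_a, y_a = to_standard_form(source_a, ",", "species")
names_b, X_b, y_b = to_standard_form(source_b, ";", "label")

print(names_a, X_a, y_a)
print(names_b, X_b, y_b)
```

Once every source is reduced to the same three-part shape, the same evaluation code can run over all of them, which is the practical benefit a curated repository aims to provide.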


See also

* Comparison of deep learning software
* List of manual image annotation tools
* List of biological databases

