
These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.


Image data

These datasets consist primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.


Facial recognition

In computer vision, face images have been used extensively to develop facial recognition systems, face detection, and many other applications that rely on images of faces.


Action recognition


Object detection and recognition


Handwriting and character recognition


Aerial images


Other images


Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.


Reviews


News articles


Messages


Twitter and tweets


Dialogues


Other text


Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.


Speech


Music


Other sounds


Signal data

Datasets containing electrical signal information that requires signal processing for further analysis.


Electrical


Motion-tracking


Other signals


Physical data

Datasets from physical systems.


High-energy physics


Systems


Astronomy


Earth science


Other physical


Biological data

Datasets from biological systems.


Human


Animal


Fungi


Plant


Microbe


Drug discovery


Anomaly data


Question answering data

This section includes datasets that deal with structured data.


Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be applied. This section includes datasets that do not fit in the above categories.
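
The row-and-column layout described above can be illustrated with a minimal sketch. The table, its column names, and the helper functions `split_features_target` and `knn_predict` are all hypothetical and chosen only for illustration; the k-nearest-neighbours vote stands in for any classification algorithm and is not tied to a particular dataset in this list.

```python
from collections import Counter
from math import dist

# Hypothetical toy table: each row is one observation,
# each column is one attribute; the last column is the label.
COLUMNS = ["height_cm", "weight_kg", "label"]
ROWS = [
    (150.0, 50.0, "small"),
    (155.0, 55.0, "small"),
    (160.0, 58.0, "small"),
    (180.0, 85.0, "large"),
    (185.0, 90.0, "large"),
    (190.0, 95.0, "large"),
]


def split_features_target(rows):
    """Separate the attribute columns (X) from the label column (y)."""
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


def knn_predict(X, y, query, k=3):
    """Classify `query` by majority vote among its k nearest observations."""
    # Rank observation indices by Euclidean distance to the query point.
    nearest = sorted(range(len(X)), key=lambda i: dist(X[i], query))[:k]
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]


X, y = split_features_target(ROWS)
prediction = knn_predict(X, y, (158.0, 57.0))
```

For a regression task, the same split applies, except that the target column holds a number rather than a class label.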


Financial


Weather


Census


Transit


Internet


Games


Other multivariate


Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, considerable work has gone into curating and standardizing dataset formats to make them easier to use for machine learning research.

* OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
* PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
* Metatext NLP: https://metatext.io/datasets Web repository maintained by the community, containing nearly 1000 benchmark datasets and counting. Covers many tasks, from classification to QA, and various languages, including English, Portuguese, and Arabic.
* Appen: Off-the-shelf and open source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
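
As a rough illustration of programmatic access to such a repository, the sketch below downloads a dataset from OpenML through scikit-learn's `fetch_openml` loader. This is scikit-learn's loader rather than OpenML's own client library, and the dataset name `"iris"` and the helpers `load_openml` and `class_counts` are assumptions made for the example. Calling `load_openml` requires scikit-learn and network access.

```python
from collections import Counter


def load_openml(name, version=1):
    """Download a dataset from OpenML and return (features, labels)."""
    # Imported lazily so the rest of this sketch runs without scikit-learn.
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(name=name, version=version, as_frame=False)
    return bunch.data, bunch.target


def class_counts(labels):
    """Count how many observations carry each label."""
    return dict(Counter(labels))


# Example usage (performs a network download, so it is left commented out):
# X, y = load_openml("iris")
# print(len(X), "observations")
# print(class_counts(y))
```

Separating the download from the summary step keeps the summary usable on data obtained from any of the repositories listed above.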


See also

* Comparison of deep learning software
* List of manual image annotation tools
* List of biological databases


References
