Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor. Labels can be obtained by asking humans to make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data.

Crowdsourced labeled data

In 2006, Fei-Fei Li, the co-director of the Stanford Human-Centered AI Institute, set out to improve the artificial intelligence models and algorithms for image recognition by significantly enlarging the training data. The researchers downloaded millions of images from the World Wide Web, and a team of undergraduates started to apply labels for objects to each image. In 2007, Li outsourced the data labeling work to Amazon Mechanical Turk, an online marketplace for digital piece work. The 3.2 million images that were labeled by more than 49,000 workers formed the basis for ImageNet, one of the largest hand-labeled databases for object recognition.

Automated data labelling

After a labeled dataset has been obtained, machine learning models can be trained on it, so that when new unlabeled data is presented to the model, a likely label can be predicted for each piece of that unlabeled data.
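The idea of learning from labeled data and then predicting labels for unlabeled data can be sketched with a very small example. The code below is only an illustration under stated assumptions: it uses a nearest-centroid rule (one of many possible models, chosen here for brevity), and the feature values, the function names `fit_centroids` and `predict_label`, and the "horse"/"cow" measurements are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Average the feature vectors of each label to get one centroid per label."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict_label(centroids, features):
    """Return the label whose centroid is closest to the unlabeled sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled data: (height_cm, weight_kg) tagged "horse" or "cow".
labeled_samples = [(160, 500), (165, 520), (140, 700), (145, 720)]
labels = ["horse", "horse", "cow", "cow"]

# "Training": summarize the labeled data as one centroid per label.
centroids = fit_centroids(labeled_samples, labels)

# "Prediction": guess a likely label for a new, unlabeled sample.
predicted = predict_label(centroids, (162, 510))
print(predicted)
```

In practice a real model would be trained on far more samples and richer features, but the workflow is the same: the labels supply the targets during training, and the trained model supplies a guessed label for data that has none.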

Data-driven bias

Algorithmic decision-making is subject to programmer-driven bias as well as data-driven bias. Training data that relies on biased labeled data will result in prejudices and omissions in a predictive model, despite the machine learning algorithm being legitimate. The labeled data used to train a specific machine learning algorithm needs to be a statistically representative sample so as not to bias the results. Because the labeled data available to train facial recognition systems has not been representative of a population, underrepresented groups in the labeled data are later often misclassified. In 2018, a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.
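One simple way to check whether a labeled dataset is representative is to compare each group's share of the samples against a reference share for the target population. The following is a minimal sketch, not the method used in the study above: the group tags, the reference shares, the `tolerance` threshold, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def group_shares(group_tags):
    """Return each group's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(group_tags)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, reference, tolerance=0.1):
    """List groups whose dataset share falls short of the reference share
    by more than `tolerance` (an arbitrary, illustrative threshold)."""
    return [
        group
        for group, expected in reference.items()
        if expected - shares.get(group, 0.0) > tolerance
    ]


# Hypothetical group tags for ten labeled samples (illustrative numbers only).
group_tags = ["lighter"] * 8 + ["darker"] * 2
shares = group_shares(group_tags)

# Assumed reference shares for the population the model should serve.
reference = {"lighter": 0.5, "darker": 0.5}
flagged = underrepresented(shares, reference)
print(shares, flagged)
```

A check like this only detects imbalance in the group tags that are recorded; it does not by itself show how a trained model performs on each group.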
