Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor. Labels can be obtained by having humans make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data. The quality of labeled data directly influences the performance of supervised machine learning models in operation, as these models learn from the provided labels.

Crowdsourced labeled data

In 2006, Fei-Fei Li, the co-director of the Stanford Human-Centered AI Institute, initiated research to improve the artificial intelligence models and algorithms for image recognition by significantly enlarging the training data. The researchers downloaded millions of images from the World Wide Web, and a team of undergraduates started to apply labels for objects to each image. In 2007, Li outsourced the data labeling work to Amazon Mechanical Turk, an online marketplace for digital piece work. The 3.2 million images that were labeled by more than 49,000 workers formed the basis for ImageNet, one of the largest hand-labeled databases for object recognition.

Automated data labeling

After obtaining a labeled dataset, machine learning models can be trained on the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.
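As a rough illustration of this idea, the following sketch fits a very simple nearest-centroid classifier on a handful of hand-labeled samples and then guesses a label for a new, unlabeled sample. The function names (`fit_centroids`, `predict_label`), the two numeric features, and the toy values are all hypothetical and chosen only for illustration; real systems typically use far larger datasets and more capable models.

```python
import math
from collections import defaultdict


def fit_centroids(samples, labels):
    """Average the feature vectors of each label to get one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict_label(centroids, features):
    """Guess the label whose centroid is closest to the unlabeled sample."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical toy data: two made-up numeric features per animal, tagged by hand.
labeled_samples = [(160.0, 500.0), (165.0, 550.0), (140.0, 700.0), (145.0, 750.0)]
labels = ["horse", "horse", "cow", "cow"]

centroids = fit_centroids(labeled_samples, labels)

# A new sample with no label: the model proposes the most likely one.
predicted = predict_label(centroids, (162.0, 520.0))
print(predicted)
```

The predicted label is only a guess: its reliability depends on how well the labeled samples cover the data the model later encounters.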

Challenges

Data-driven bias

Algorithmic decision-making is subject to programmer-driven bias as well as data-driven bias. Training data that relies on biased labeled data will result in prejudices and omissions in a predictive model, even if the machine learning algorithm itself is legitimate. The labeled data used to train a specific machine learning algorithm needs to be a statistically representative sample to avoid biasing the results. For example, in facial recognition systems, underrepresented groups are often misclassified if the labeled data available for training has not been representative of the population. In 2018, a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.
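One simple way to spot this kind of imbalance is to compare each group's share of the labeled dataset with a reference share for the target population. The sketch below assumes that every sample carries a group tag; the function names, the 10-percentage-point tolerance, the toy dataset, and the reference shares are hypothetical and are not taken from the study mentioned above.

```python
from collections import Counter


def group_shares(group_tags):
    """Return each group's fraction of the dataset."""
    counts = Counter(group_tags)
    total = len(group_tags)
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, reference_shares, tolerance=0.10):
    """List groups whose dataset share is more than `tolerance` below the reference."""
    return sorted(
        group
        for group, expected in reference_shares.items()
        if shares.get(group, 0.0) < expected - tolerance
    )


# Hypothetical dataset of ten samples, each tagged with a group.
dataset_groups = ["lighter"] * 8 + ["darker"] * 2

# Assumed target shares for illustration only, not a real population statistic.
reference = {"lighter": 0.5, "darker": 0.5}

shares = group_shares(dataset_groups)
flagged = underrepresented(shares, reference)
print(shares, flagged)
```

A check like this only reveals imbalance in the attributes that were recorded; it cannot by itself show that a dataset is representative.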

Human error and inconsistency

Human annotators are prone to errors and biases when labeling data. This can lead to inconsistent labels and lower the quality of the dataset, which in turn can impair the model's ability to generalize.
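A common mitigation, sketched here under simplifying assumptions, is to collect several labels per item, keep the majority label, and send items with low agreement back for review. The item identifiers, the votes, and the 0.6 agreement threshold are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label and the fraction of annotators who chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical annotations: three annotators labeled each image independently.
annotations = {
    "img_001": ["cow", "cow", "cow"],
    "img_002": ["horse", "cow", "horse"],
    "img_003": ["horse", "cow", "dog"],
}

MIN_AGREEMENT = 0.6  # assumed threshold; tune for the task at hand

resolved = {}
needs_review = []
for item, votes in annotations.items():
    label, agreement = majority_label(votes)
    if agreement >= MIN_AGREEMENT:
        resolved[item] = label
    else:
        needs_review.append(item)

print(resolved, needs_review)
```

Majority voting reduces random disagreement but does not correct a bias that most annotators share.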

Domain expertise

Certain fields, such as legal document analysis or medical imaging, require annotators with specialized domain knowledge. Without that expertise, the annotations or labeled data may be inaccurate, negatively impacting the machine learning model's performance in real-world scenarios.
