Data harvesting

In metadata, metadata discovery (also metadata harvesting) is the process of using automated tools to discover the semantics of a data element in data sets. This process usually ends with a set of mappings between the data source elements and a centralized metadata registry. Metadata discovery is also known as metadata scanning.


Data source formats for metadata discovery

Data sets may be in a variety of different forms, including:
# Relational databases
# NoSQL databases
# Spreadsheets
# XML files
# Web services
# Software source code such as Fortran, Jovial, COBOL, Assembler, RPG, PL/1, EasyTrieve, Java, C# or C++ classes, and thousands of other software languages
# Unstructured text documents such as Microsoft Word or PDF files


A taxonomy of metadata matching algorithms

There are distinct categories of automated metadata discovery:


Lexical matching

# Exact match - where data element linkages are made based on the exact name of a column in a database, the name of an XML element or a label on a screen. For example, if a database column has the name "PersonBirthDate" and a data element in a metadata registry also has the name "PersonBirthDate", automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
# Synonym match - where the discovery tool is given not just a single name but a set of synonyms.
# Pattern match - where the tool is given a set of lexical patterns that it can match. For example, the tool may search for "*gender*" or "*sex*".
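
The three lexical strategies can be sketched as a single lookup that tries them in order of strictness. This is a minimal illustration, not the behaviour of any particular discovery tool: the `REGISTRY` contents, the synonym sets and the `lexical_match` function are all hypothetical, and the wildcard patterns are interpreted with Python's shell-style `fnmatch` rules.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: data element name -> synonyms and wildcard patterns.
REGISTRY = {
    "PersonBirthDate": {
        "synonyms": {"DateOfBirth", "BirthDate", "DOB"},
        "patterns": ["*birth*date*"],
    },
    "PersonGenderCode": {
        "synonyms": {"Gender", "Sex"},
        "patterns": ["*gender*", "*sex*"],
    },
}


def lexical_match(column_name):
    """Return (data_element, match_kind) for a column name, or None."""
    # 1. Exact match on the registered data element name.
    if column_name in REGISTRY:
        return column_name, "exact"

    lowered = column_name.lower()

    # 2. Synonym match, compared case-insensitively.
    for element, rules in REGISTRY.items():
        if lowered in {s.lower() for s in rules["synonyms"]}:
            return element, "synonym"

    # 3. Pattern match using shell-style wildcards on the lowercased name.
    for element, rules in REGISTRY.items():
        if any(fnmatchcase(lowered, p) for p in rules["patterns"]):
            return element, "pattern"

    return None


if __name__ == "__main__":
    for name in ["PersonBirthDate", "dob", "cust_gender_cd", "OrderTotal"]:
        print(name, "->", lexical_match(name))
```

Pattern matches are the weakest evidence of the three: a pattern such as "*sex*" would also match an unrelated column like "MiddlesexCounty", so a tool would normally treat such hits as candidates for review rather than confirmed mappings.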


Semantic matching

Semantic matching attempts to use semantics to associate target data with registered data elements.
# Semantic similarity - This algorithm relies on a database of conceptual nearness between words. For example, the WordNet system can rank how close words are conceptually to each other: the terms "Person", "Individual" and "Human" may be highly similar concepts.
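
As a rough sketch of how conceptual nearness can be scored, the example below uses a tiny hand-written is-a taxonomy in place of a real lexical database. The `PARENT` table is invented for illustration and does not reflect WordNet's actual structure; the score, 1 / (1 + path length), follows the general idea of path-based similarity measures.

```python
# Toy is-a taxonomy (child -> parent). Hypothetical data standing in for a
# lexical database such as WordNet.
PARENT = {
    "organism": "entity",
    "person": "organism",
    "individual": "person",
    "human": "person",
    "artifact": "entity",
    "vehicle": "artifact",
    "car": "vehicle",
}


def path_to_root(term):
    """Return the chain of terms from `term` up to the top of the taxonomy."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_similarity(a, b):
    """Score in (0, 1]: 1 / (1 + edges on the shortest path between a and b).

    Returns 0.0 when the two terms share no ancestor in the taxonomy.
    """
    path_a = path_to_root(a.lower())
    path_b = path_to_root(b.lower())
    steps_from_b = {node: i for i, node in enumerate(path_b)}
    for steps_from_a, node in enumerate(path_a):
        if node in steps_from_b:
            return 1.0 / (1 + steps_from_a + steps_from_b[node])
    return 0.0


def rank_elements(term, candidates):
    """Rank candidate data element terms by similarity to `term`, best first."""
    scored = [(c, path_similarity(term, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for candidate, score in rank_elements("human", ["car", "person", "individual"]):
        print(f"{candidate}: {score:.3f}")
```

In this toy taxonomy "human" ranks closest to "person", then "individual", with "car" far behind, which mirrors the kind of ranking described above.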


Statistical matching

Statistical matching uses statistics about the data in the data source itself to derive similarities with registered data elements.
# Distinct value analysis - By analyzing all the distinct values in a column, its similarity to a registered data element can be assessed. For example, if a column has only two distinct values, 'male' and 'female', it could be mapped to 'PersonGenderCode'.
# Data distribution analysis - By analyzing the distribution of values within a single column and comparing this distribution with those of known data elements, a semantic linkage could be inferred.
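
Both statistical approaches can be illustrated with a few lines of code. The sketch below makes several assumptions that are not part of the description above: the reference value sets in `REFERENCE_VALUES` are hypothetical, distinct value similarity is measured with Jaccard overlap, and distributions are compared with total variation distance. Other measures could be substituted.

```python
from collections import Counter

# Hypothetical reference value sets for registered data elements.
REFERENCE_VALUES = {
    "PersonGenderCode": {"male", "female"},
    "YesNoIndicator": {"yes", "no"},
}


def normalize(value):
    """Normalize a raw cell value for comparison."""
    return str(value).strip().lower()


def distinct_value_score(column, reference):
    """Jaccard overlap between a column's distinct values and a reference set."""
    distinct = {normalize(v) for v in column}
    union = distinct | reference
    return len(distinct & reference) / len(union) if union else 0.0


def best_distinct_value_match(column):
    """Return (data_element, score) for the best-scoring registered element."""
    scores = {
        name: distinct_value_score(column, ref)
        for name, ref in REFERENCE_VALUES.items()
    }
    return max(scores.items(), key=lambda item: item[1])


def distribution_distance(column, reference_freq):
    """Total variation distance between observed and reference frequencies.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    counts = Counter(normalize(v) for v in column)
    total = sum(counts.values())
    if total == 0:
        return 1.0  # an empty column matches nothing
    keys = set(counts) | set(reference_freq)
    return 0.5 * sum(
        abs(counts[k] / total - reference_freq.get(k, 0.0)) for k in keys
    )


if __name__ == "__main__":
    column = ["male", "female", "female", "Male "]
    print(best_distinct_value_match(column))
    print(distribution_distance(column, {"male": 0.5, "female": 0.5}))
```

A low distance or a high overlap score is only suggestive: a yes/no flag and a gender code can both be two-valued columns, so statistical evidence is typically combined with lexical or semantic evidence before a mapping is accepted.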


Vendors

The following vendors (listed in alphabetical order) provide metadata discovery and metadata mapping software and solutions:
* IBM
* Imperva


Research

* INDUS project at Iowa State University.
* Mercury - A Distributed Metadata Management and Data Discovery System developed at the Oak Ridge National Laboratory DAAC.
* National Digital Library of India.


See also

* Metadata
* Data mapping
* Data warehouse
* Semantic web
* Defense Discovery Metadata Specification



Further reading


* Massive Data Analysis Systems, by San Diego Supercomputer Center, June 1997
* IBM Whitepaper on Enterprise Metadata Discovery
* White Paper on Metadata Management, by Esquire Innovations
* Nag, Ruben; Guhathakurta, Rahul (31 December 2024). "Metadata Harvesting: Applications and Influence in Digital Publishing". Open Access Cases. 1 (4). eISSN 3067-0349. https://oacases.com/index.php/cases/article/view/15