Metadata discovery

In metadata, metadata discovery (also metadata harvesting) is the process of using automated tools to discover the semantics of a data element in data sets. This process usually ends with a set of mappings between the data source elements and a centralized metadata registry. Metadata discovery is also known as metadata scanning.


Data source formats for metadata discovery

Data sets may be in a variety of different forms, including:
# Relational databases
# NoSQL databases
# Spreadsheets
# XML files
# Web services
# Software source code such as Fortran, Jovial, COBOL, Assembler, RPG, PL/1, EasyTrieve, Java, C# or C++ classes, and thousands of other software languages
# Unstructured text documents such as Microsoft Word or PDF files


A taxonomy of metadata matching algorithms

There are three distinct categories of automated metadata discovery: lexical matching, semantic matching and statistical matching.


Lexical matching

# Exact match - where data element linkages are made based on the exact name of a column in a database, the name of an XML element or a label on a screen. For example, if a database column has the name "PersonBirthDate" and a data element in a metadata registry also has the name "PersonBirthDate", automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
# Synonym match - where the discovery tool is given not just a single name but a set of synonyms.
# Pattern match - where the tool is given a set of lexical patterns that it can match. For example, the tool may search for "*gender*" or "*sex*".
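
The three lexical strategies can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the REGISTRY contents, the synonym sets and the lexical_match function are hypothetical names invented for this example, and shell-style wildcards from the standard fnmatch module stand in for whatever pattern language a real tool would use.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: data element name -> synonyms and wildcard patterns.
# All entries are assumptions made for illustration.
REGISTRY = {
    "PersonBirthDate": {
        "synonyms": {"DateOfBirth", "BirthDate", "DOB"},
        "patterns": ["*birth*date*"],
    },
    "PersonGenderCode": {
        "synonyms": {"Gender", "Sex"},
        "patterns": ["*gender*", "*sex*"],
    },
}


def lexical_match(column_name):
    """Return (data_element, match_kind) for a source name, or None."""
    # 1. Exact match: the source name equals the registered name.
    if column_name in REGISTRY:
        return column_name, "exact"

    lowered = column_name.lower()

    # 2. Synonym match: case-insensitive lookup in each synonym set.
    for element, entry in REGISTRY.items():
        if lowered in {s.lower() for s in entry["synonyms"]}:
            return element, "synonym"

    # 3. Pattern match: shell-style wildcards against the lowercased name.
    for element, entry in REGISTRY.items():
        if any(fnmatchcase(lowered, p) for p in entry["patterns"]):
            return element, "pattern"

    return None
```

With this sketch, lexical_match("dob") resolves through the synonym set and lexical_match("cust_gender_cd") through the "*gender*" pattern. Broad patterns also produce false positives (a column named "essex_county" contains "sex"), so pattern hits are better treated as candidates for review than as confirmed mappings.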


Semantic matching

Semantic matching attempts to use semantics to associate target data with registered data elements.
# Semantic similarity - This algorithm relies on a database of conceptual nearness between words. For example, the WordNet system can rank how close words are conceptually to each other: the terms "Person", "Individual" and "Human" may be highly similar concepts.
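
To make "conceptual nearness" concrete, the sketch below scores two terms by the length of the shortest path between them in an "is-a" hierarchy, using 1 / (1 + path length). The tiny HYPERNYM table is a hand-made stand-in for a lexical database such as WordNet; its entries and the function names are assumptions for illustration only, and a real tool would query the actual database instead.

```python
# Toy hypernym ("is-a") taxonomy standing in for a lexical database such as
# WordNet. Every entry here is an illustrative assumption.
HYPERNYM = {
    "person": "organism",
    "individual": "person",
    "human": "person",
    "organism": "entity",
    "vehicle": "artifact",
    "artifact": "entity",
}


def ancestors(term):
    """Return the chain from term up to the root, term first."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_similarity(a, b):
    """Score 1 / (1 + shortest path length) through the taxonomy."""
    chain_a, chain_b = ancestors(a.lower()), ancestors(b.lower())
    depth_in_b = {node: i for i, node in enumerate(chain_b)}
    # The first node of chain_a that also lies on chain_b is the lowest
    # common ancestor, so the two depths add up to the shortest path.
    for i, node in enumerate(chain_a):
        if node in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[node])
    return 0.0  # no common ancestor
```

In this toy hierarchy "Individual" and "Human" score higher against each other than "Person" does against "Vehicle", which is the kind of ranking a discovery tool could use to propose candidate data elements.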


Statistical matching

Statistical matching uses statistics about the data in the data source itself to derive similarities with registered data elements.
# Distinct value analysis - By analyzing all the distinct values in a column, its similarity to a registered data element may be determined. For example, if a column has only two distinct values, 'male' and 'female', it could be mapped to 'PersonGenderCode'.
# Data distribution analysis - By analyzing the distribution of values within a single column and comparing this distribution with those of known data elements, a semantic linkage could be inferred.
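
Both statistical checks can be approximated with simple set and frequency comparisons. In the sketch below, distinct value analysis is modelled as Jaccard overlap between a column's distinct values and a registered value set, and distribution analysis as one minus the total variation distance between relative frequencies. These two measures are choices made for this example, not measures prescribed by the text, and the function names and reference profiles are hypothetical.

```python
from collections import Counter


def jaccard(values, reference_values):
    """Distinct value analysis: overlap of two distinct-value sets."""
    a, b = set(values), set(reference_values)
    return len(a & b) / len(a | b) if a | b else 0.0


def relative_frequencies(values):
    """Map each distinct value to its share of the column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def distribution_similarity(values, reference_freqs):
    """Data distribution analysis: 1 minus total variation distance."""
    if not values:
        return 0.0  # an empty column carries no distribution to compare
    freqs = relative_frequencies(values)
    keys = set(freqs) | set(reference_freqs)
    distance = 0.5 * sum(
        abs(freqs.get(k, 0.0) - reference_freqs.get(k, 0.0)) for k in keys
    )
    return 1.0 - distance
```

For example, a column holding only 'male' and 'female' has a Jaccard score of 1.0 against a registered value set {'male', 'female'}. If the registered profile assumes an even split, a column that is three-quarters 'male' scores 0.75 on distribution similarity. Thresholds for accepting a mapping would need to be tuned per data set.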


Vendors

The following vendors (listed in alphabetical order) provide metadata discovery and metadata mapping software and solutions:
* Atlan
* BigHand/Esquire Innovations
* IBM
* InfoLibrarian Corporation
* MindHARBOR Metadata Database application
* Octopai - cross-platform metadata discovery and management automation
* Revelytix
* Silver Creek Systems
* Stratio
* Sypherlink: Harvester
* Talend
* Unicorn Systems


Research

* INDUS project at Iowa State University
* Mercury - a distributed metadata management and data discovery system developed at the Oak Ridge National Laboratory DAAC


See also

* Metadata
* Data mapping
* Data warehouse
* Semantic web
* Defense Discovery Metadata Specification

