HOME

TheInfoList



OR:

Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for tsintended uses in operations, decision making and planning". Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When this is the case,
data governance Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate data govern ...
is used to form agreed upon definitions and standards for data quality. In such cases,
data cleansing Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the d ...
, including standardization, may be required in order to ensure data quality.


Definitions

Defining data quality is difficult due to the many contexts data are used in, as well as the varying perspectives among end users, producers, and custodians of data. From a consumer perspective, data quality is: * "data that are fit for use by data consumers" * data "meeting or exceeding consumer expectations" * data that "satisfies the requirements of its intended use" From a business perspective, data quality is: * data that are "'fit for use' in their intended operational, decision-making and other roles" or that exhibits "'conformance to standards' that have been set, so that fitness for use is achieved" * data that "are fit for their intended uses in operations, decision making and planning" * "the capability of data to satisfy the stated business, system, and technical requirements of an enterprise" From a standards-based perspective, data quality is: * the "degree to which a set of inherent characteristics (quality dimensions) of an object (data) fulfills requirements" * "the usefulness, accuracy, and correctness of data for its application" Arguably, in all these cases, "data quality" is a comparison of the actual state of a particular set of data to a desired state, with the desired state being typically referred to as "fit for use," "to specification," "meeting consumer expectations," "free of defect," or "meeting requirements." These expectations, specifications, and requirements are usually defined by one or more individuals or groups, standards organizations, laws and regulations, business policies, or software development policies.


Dimensions of data quality

Drilling down further, those expectations, specifications, and requirements are stated in terms of characteristics or dimensions of the data, such as: * accessibility or availability * accuracy or correctness * comparability * completeness or comprehensiveness * consistency, coherence, or clarity * credibility, reliability, or reputation *flexibility *plausibility * relevance, pertinence, or usefulness * timeliness or latency * uniqueness * validity or reasonableness A systematic scoping review of the literature suggests that data quality dimensions and methods with real world data are not consistent in the literature, and as a result quality assessments are challenging due to the complex and heterogeneous nature of these data.


History

Before the rise of the inexpensive
computer data storage Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a compute ...
, massive mainframe computers were used to maintain name and address data for delivery services. This was so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars in comparison to manual correction of customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations, as low-cost and powerful server technology became available. Companies with an emphasis on marketing often focused their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) avoiding false stock-out; 3) improving the understanding of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking and shipping parts across a large organization. For companies with significant research efforts, data quality can include developing
protocols Protocol may refer to: Sociology and politics * Protocol (politics), a formal agreement between nation states * Protocol (diplomacy), the etiquette of diplomacy and affairs of state * Etiquette, a code of personal behavior Science and technology ...
for research methods, reducing
measurement error Observational error (or measurement error) is the difference between a measured value of a quantity and its true value.Dodge, Y. (2003) ''The Oxford Dictionary of Statistical Terms'', OUP. In statistics, an error is not necessarily a "mistake ...
,
bounds checking In computer programming, bounds checking is any method of detecting whether a variable is within some bounds before it is used. It is usually used to ensure that a number fits into a given type (range checking), or that a variable being used as ...
of data, cross tabulation, modeling and outlier detection, verifying
data integrity Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle and is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The ter ...
, etc.


Overview

There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework, dubbed "Zero Defect Data" (Hansen, 1991) adapts the principles of statistical process control to data quality. Another framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in
semiotics Semiotics (also called semiotic studies) is the systematic study of sign processes ( semiosis) and meaning making. Semiosis is any activity, conduct, or process that involves signs, where a sign is defined as anything that communicates something ...
to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly theoretical approach analyzes the
ontological In metaphysics, ontology is the philosophical study of being, as well as related concepts such as existence, becoming, and reality. Ontology addresses questions like how entities are grouped into categories and which of these entities exi ...
nature of
information systems An information system (IS) is a formal, sociotechnical, organizational system designed to collect, process, store, and distribute information. From a sociotechnical perspective, information systems are composed by four components: task, people ...
to define data quality rigorously (Wand and Wang, 1996). A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. Nearly 200 such terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognize this as a similar problem to " ilities".
MIT The Massachusetts Institute of Technology (MIT) is a private land-grant research university in Cambridge, Massachusetts. Established in 1861, MIT has played a key role in the development of modern technology and science, and is one of the m ...
has an Information Quality (MITIQ) Program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ). This program grew out of the work done by Hansen on the "Zero Defect Data" framework (Hansen, 1991). In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from
data warehousing In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integr ...
and
business intelligence Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical p ...
to
customer relationship management Customer relationship management (CRM) is a process in which a business or other organization administers its interactions with customers, typically using data analysis to study big data, large amounts of information. CRM systems data collectio ...
and supply chain management. One industry study estimated the total cost to the U.S. economy of data quality problems at over U.S. $600 billion per annum (Eckerson, 2002). Incorrect data – which includes invalid and outdated information – can originate from different data sources – through data entry, or
data migration Data migration is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another. Additionally, the validation of migrated data for completeness and the decommis ...
and conversion projects. In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed. One reason contact data becomes stale very quickly in the average database – more than 45 million Americans change their address every year. In fact, the problem is such a concern that companies are beginning to set up a
data governance Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate data govern ...
team whose sole role in the corporation is to be responsible for data quality. In some organizations, this
data governance Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate data govern ...
function has been established as part of a larger Regulatory Compliance function - a recognition of the importance of Data/Information Quality to organizations. Problems with data quality don't only arise from ''incorrect'' data; ''inconsistent'' data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency. Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data. The market is going some way to providing data quality assurance. A number of vendors make tools for analyzing and repairing poor quality data ''in situ'', service providers can clean the data on a contract basis and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of tools for improving data, which may include some or all of the following: #
Data profiling Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to: # Find ou ...
- initially assessing the data to understand its current state, often including value distributions # Data standardization - a business rules engine that ensures that data conforms to standards # Geocoding - for name and address data. Corrects data to U.S. and Worldwide geographic standards # Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that "Bob" and "Bbo" may be the same individual. It might be able to manage "householding", or finding links between spouses at the same address, for example. Finally, it often can build a "best of breed" record, taking the best components from multiple data sources and building a single super-record. # Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules. # Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into enterprise applications to keep it clean. There are several well-known authors and self-styled experts, with Larry English perhaps the most popular
guru Guru ( sa, गुरु, IAST: ''guru;'' Pali'': garu'') is a Sanskrit term for a "mentor, guide, expert, or master" of certain knowledge or field. In pan-Indian traditions, a guru is more than a teacher: traditionally, the guru is a reverential ...
. In addition
IQ International - the International Association for Information and Data Quality
was established in 2004 to provide a focal point for professionals and researchers in this field.
ISO 8000 ISO 8000 is the global standard for ''Data Quality and Enterprise Master Data''. It describes the features and defines the requirements for standard exchange of Master Data among business partners. It establishes the concept of Portability as a ...
is an international standard for data quality.


Data quality assurance

Data quality assurance is the process of
data profiling Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to: # Find ou ...
to discover inconsistencies and other anomalies in the data, as well as performing
data cleansing Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the d ...
activities (e.g. removing
outliers In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...
, missing data interpolation) to improve the data quality. These activities can be undertaken as part of
data warehousing In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integr ...
or as part of the
database administration Database administration is the function of managing and maintaining database management systems (DBMS) software. Mainstream DBMS software such as Oracle, IBM Db2 and Microsoft SQL Server need ongoing management. As such, corporations that use DBMS ...
of an existing piece of
application software Application may refer to: Mathematics and computing * Application software, computer software designed to help the user to perform specific tasks ** Application layer, an abstraction layer that specifies protocols and interface methods used in a ...
.


Data quality control

Data quality control is the process of controlling the usage of data for an application or a process. This process is performed both before and after a Data Quality Assurance (QA) process, which consists of discovery of data inconsistency and correction. Before: * Restricts inputs After QA process the following statistics are gathered to guide the Quality Control (QC) process: * Severity of inconsistency * Incompleteness * Accuracy * Precision * Missing / Unknown The Data QC process uses the information from the QA process to decide to use the data for analysis or in an application or business process. General example: if a Data QC process finds that the data contains too many errors or inconsistencies, then it prevents that data from being used for its intended process which could cause disruption. Specific example: providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a QC process provides data usage protection.


Optimum use of data quality

Data Quality (DQ) is a niche area required for the integrity of the data management by covering gaps of data issues. This is one of the key functions that aid data governance by monitoring data to find exceptions undiscovered by current data management operations. Data Quality checks may be defined at attribute level to have full control on its remediation steps. DQ checks and business rules may easily overlap if an organization is not attentive of its DQ scope. Business teams should understand the DQ scope thoroughly in order to avoid overlap. Data quality checks are redundant if business logic covers the same functionality and fulfills the same purpose as DQ. The DQ scope of an organization should be defined in DQ strategy and well implemented. Some data quality checks may be translated into business rules after repeated instances of exceptions in the past. Below are a few areas of data flows that may need perennial DQ checks: Completeness and precision DQ checks on all data may be performed at the point of entry for each mandatory attribute from each source system. Few attribute values are created way after the initial creation of the transaction; in such cases, administering these checks becomes tricky and should be done immediately after the defined event of that attribute's source and the transaction's other core attribute conditions are met. All data having attributes referring to ''Reference Data'' in the organization may be validated against the set of well-defined valid values of Reference Data to discover new or discrepant values through the validity DQ check. Results may be used to update ''Reference Data'' administered under ''Master Data Management (MDM)''. All data sourced from a ''third party'' to organization's internal teams may undergo accuracy (DQ) check against the third party data. These DQ check results are valuable when administered on data that made multiple hops after the point of entry of that data but before that data becomes authorized or stored for enterprise intelligence. All data columns that refer to ''Master Data'' may be validated for its consistency check. A DQ check administered on the data at the point of entry discovers new data for the MDM process, but a DQ check administered after the point of entry discovers the failure (not exceptions) of consistency. As data transforms, multiple timestamps and the positions of that timestamps are captured and may be compared against each other and its leeway to validate its value, decay, operational significance against a defined SLA (service level agreement). This timeliness DQ check can be utilized to decrease data value decay rate and optimize the policies of data movement timeline. In an organization complex logic is usually segregated into simpler logic across multiple processes. Reasonableness DQ checks on such complex logic yielding to a logical result within a specific range of values or static interrelationships (aggregated business rules) may be validated to discover complicated but crucial business processes and outliers of the data, its drift from BAU (business as usual) expectations, and may provide possible exceptions eventually resulting into data issues. This check may be a simple generic aggregation rule engulfed by large chunk of data or it can be a complicated logic on a group of attributes of a transaction pertaining to the core business of the organization. This DQ check requires high degree of business knowledge and acumen. Discovery of reasonableness issues may aid for policy and strategy changes by either business or data governance or both. Conformity checks and integrity checks need not covered in all business needs, it's strictly under the database architecture's discretion. There are many places in the data movement where DQ checks may not be required. For instance, DQ check for completeness and precision on not–null columns is redundant for the data sourced from database. Similarly, data should be validated for its accuracy with respect to time when the data is stitched across disparate sources. However, that is a business rule and should not be in the DQ scope. Regretfully, from a software development perspective, DQ is often seen as a nonfunctional requirement. And as such, key data quality checks/processes are not factored into the final software solution. Within Healthcare, wearable technologies or Body Area Networks, generate large volumes of data. The level of detail required to ensure data quality is extremely high and is often underestimated. This is also true for the vast majority of
mHealth mHealth (also written as m-health or mhealth) is an abbreviation for mobile health, a term used for the practice of medicine and public health supported by mobile devices. The term is most commonly used in reference to using mobile communicatio ...
apps, EHRs and other health related software solutions. However, some open source tools exist that examine data quality. The primary reason for this, stems from the extra cost involved is added a higher degree of rigor within the software architecture.


Health data security and privacy

The use of mobile devices in health, or mHealth, creates new challenges to health data security and privacy, in ways that directly affect data quality. mHealth is an increasingly important strategy for delivery of health services in low- and middle-income countries. Mobile phones and tablets are used for collection, reporting, and analysis of data in near real time. However, these mobile devices are commonly used for personal activities, as well, leaving them more vulnerable to security risks that could lead to data breaches. Without proper security safeguards, this personal use could jeopardize the quality, security, and confidentiality of health data.


Data quality in public health

Data quality has become a major focus of public health programs in recent years, especially as demand for accountability increases. Work towards ambitious goals related to the fight against diseases such as AIDS, Tuberculosis, and Malaria must be predicated on strong Monitoring and Evaluation systems that produce quality data related to program implementation. These programs, and program auditors, increasingly seek tools to standardize and streamline the process of determining the quality of data, verify the quality of reported data, and assess the underlying data management and reporting systems for indicators. An example is WHO and MEASURE Evaluation's Data Quality Review Tool WHO, the Global Fund, GAVI, and MEASURE Evaluation have collaborated to produce a harmonized approach to data quality assurance across different diseases and programs.


Open data quality

There are a number of scientific works devoted to the analysis of the data quality in
open data Open data is data that is openly accessible, exploitable, editable and shared by anyone for any purpose. Open data is licensed under an open license. The goals of the open data movement are similar to those of other "open(-source)" movement ...
sources, such as
Wikipedia Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system. Wikipedia is the largest and most-read refer ...
,
Wikidata Wikidata is a collaboratively edited multilingual knowledge graph hosted by the Wikimedia Foundation. It is a common source of open data that Wikimedia projects such as Wikipedia, and anyone else, can use under the CC0 public domain license ...
,
DBpedia DBpedia (from "DB" for " database") is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semanti ...
and other. In the case of Wikipedia, quality analysis may relate to the whole article Modeling of quality there is carried out by means of various methods. Some of them use
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
algorithms, including Random Forest, Support Vector Machine, and others. Methods for assessing data quality in Wikidata, DBpedia and other LOD sources differ.


Professional associations

;IQ International—the International Association for Information and Data Quality :IQ International is a not-for-profit, vendor neutral, professional association formed in 2004, dedicated to building the information and data quality profession.


ECCMA (Electronic Commerce Code Management Association)

The Electronic Commerce Code Management Association (ECCMA) is a member-based, international not-for-profit association committed to improving data quality through the implementation of international standards. ECCMA is the current project leader for the development of ISO 8000 and ISO 22745, which are the international standards for data quality and the exchange of material and service master data, respectively. ECCMA provides a platform for collaboration amongst subject experts on data quality and data governance around the world to build and maintain global, open standard dictionaries that are used to unambiguously label information. The existence of these dictionaries of labels allows information to be passed from one computer system to another without losing meaning.


See also

* Data quality firewall *
Data validation In computer science, data validation is the process of ensuring data has undergone data cleansing to ensure they have data quality, that is, that they are both correct and useful. It uses routines, often called "validation rules", "validation cons ...
*
Record linkage Record linkage (also known as data matching, data linkage, entity resolution, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and d ...
* Information quality * Master data management *
Data governance Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate data govern ...
*
Database normalization Database normalization or database normalisation (see spelling differences) is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity ...
* Data visualization * Data Analysis *
Clinical data management Clinical data management (CDM) is a critical process in clinical research, which leads to generation of high-quality, reliable, and statistically sound data from clinical trials. Clinical data management ensures collection, integration and availabi ...


References


Further reading

* * Baamann, Katharina, "Data Quality Aspects of Revenue Assurance"
Article
* Eckerson, W. (2002) "Data Warehousing Special Report: Data quality and the bottom line"
Article
* Ivanov, K. (1972
"Quality-control of information: On the concept of accuracy of information in data banks and in management information systems"
The University of Stockholm and The Royal Institute of Technology. Doctoral dissertation. * Hansen, M. (1991) Zero Defect Data, MIT. Masters thesi

* Kahn, B., Strong, D., Wang, R. (2002) "Information Quality Benchmarks: Product and Service Performance," Communications of the ACM, April 2002. pp. 184–192
Article
* Price, R. and Shanks, G. (2004) A Semiotic Information Quality Framework, Proc. IFIP International Conference on Decision Support Systems (DSS2004): Decision Support in an Uncertain and Complex World, Prato
Article
* Redman, T. C. (2008) Data Driven: Profiting From Our Most Important Business Asset * Wand, Y. and Wang, R. (1996) "Anchoring Data Quality Dimensions in Ontological Foundations," Communications of the ACM, November 1996. pp. 86–95
Article
* Wang, R., Kon, H. & Madnick, S. (1993), Data Quality Requirements Analysis and Modelling, Ninth International Conference of Data Engineering, Vienna, Austria
Article
* Fournel Michel, Accroitre la qualité et la valeur des données de vos clients, éditions Publibook, 2007. . * Daniel F., Casati F., Palpanas T., Chayka O., Cappiello C. (2008) "Enabling Better Decisions through Quality-aware Reports", International Conference on Information Quality (ICIQ), MIT
Article
* Jack E. Olson (2003), "Data Quality: The Accuracy dimension", Morgan Kaufmann Publishers * Woodall P., Oberhofer M., and Borek A. (2014)
"A Classification of Data Quality Assessment and Improvement Methods"
International Journal of Information Quality 3 (4), 298–321. doi:10.1504/IJIQ.2014.068656, doi:10.1504/ijiq.2014.068656. * Woodall, P., Borek, A., and Parlikad, A. (2013), "Data Quality Assessment: The Hybrid Approach." ''Information & Management'' 50 (7), 369–382.


External links


Data quality course
from the Global Health Learning Center {{Authority control Information science