Data proliferation refers to the prodigious amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate and the usability problems that result from attempting to store and manage that data. While originally pertaining to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.
While digital storage has become cheaper, the associated costs, from raw power to maintenance and from metadata to search engines, have not kept up with the proliferation of data. Although the power required to maintain a unit of data has fallen, the cost of facilities which house the digital storage has tended to rise.
Data proliferation has been documented as a problem for the U.S. military since August 1971, in particular regarding the excessive documentation submitted during the acquisition of major weapon systems.
Efforts to mitigate data proliferation and the problems associated with it are ongoing.
Problems caused
The problem of data proliferation is affecting all areas of commerce as a result of the availability of relatively inexpensive data storage devices. This has made it very easy to dump data into secondary storage immediately after its window of usability has passed. This masks problems that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organizations.
Data proliferation is problematic for several reasons:
*Difficulty when trying to find and retrieve information. At Xerox, on average it takes employees more than one hour per week to find hard-copy documents, costing $2,152 a year to manage and store them. For businesses with more than 10 employees, this increases to almost two hours per week at $5,760 per year. In large networks of primary and secondary data storage, problems finding electronic data are analogous to problems finding hard-copy data.
*Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found promptly. In April 2005, the Ameritrade Holding Corporation told 200,000 current and past customers that a tape containing confidential information had been lost or destroyed in transit. In May of the same year, Time Warner Incorporated reported that 40 tapes containing personal data on 600,000 current and former employees had been lost en route to a storage facility. In March 2005, a Florida judge hearing a $2.7 billion lawsuit against Morgan Stanley issued an "adverse inference order" against the company for "willful and gross abuse of its discovery obligations." The judge cited Morgan Stanley for repeatedly finding misplaced tapes of e-mail messages long after the company had claimed that it had turned over all such tapes to the court.
*Increased manpower requirements to manage increasingly chaotic data storage resources.
*Slower networks and application performance due to excess traffic as users search and search again for the material they need.
*High cost in terms of the energy resources required to operate storage hardware. A 100 terabyte system will cost up to $35,040 a year to run, not counting cooling costs ("Power and storage: the hidden cost of ownership", Computer Technology Review, October 2003).
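The energy figure in the last item above works out to exactly $4 for each of the 8,760 hours in a year of continuous operation. The following minimal sketch shows how an annual running cost of this kind can be computed. The power draw and electricity price are hypothetical values chosen only for illustration and are not taken from the cited article.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_cost(power_kw, price_per_kwh):
    """Return the yearly electricity cost of a load that runs continuously."""
    return power_kw * HOURS_PER_YEAR * price_per_kwh


# Hypothetical inputs (assumptions, not from the source):
# a 40 kW continuous draw billed at $0.10 per kWh.
cost = annual_energy_cost(power_kw=40, price_per_kwh=0.10)
print(f"${cost:,.0f} per year")
```

Cooling is excluded here, as in the quoted figure; it would be added as a further continuous load.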
Proposed solutions
*Applications that better utilize modern technology
*Reductions in duplicate data (especially as caused by data movement)
*Improvement of metadata structures
*Improvement of file and storage transfer structures
*User education and discipline
*The implementation of Information Lifecycle Management solutions to eliminate low-value information as early as possible before putting the rest into actively managed long-term storage in which it can be quickly and cheaply accessed.
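As an illustration of the duplicate-reduction item in the list above, the following sketch groups items whose contents are byte-for-byte identical by comparing SHA-256 digests. It is one simple approach under stated assumptions, not a description of any particular product: the function name `find_duplicates` and the sample item names are hypothetical, and the contents are held in memory rather than read from storage.

```python
import hashlib
from collections import defaultdict


def find_duplicates(blobs):
    """Group names whose contents are byte-for-byte identical.

    `blobs` maps a (hypothetical) item name to its raw bytes.
    """
    by_digest = defaultdict(list)
    for name, content in blobs.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Keep only digests shared by more than one item.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data for illustration only.
sample = {
    "report_v1.txt": b"quarterly figures",
    "copy_of_report.txt": b"quarterly figures",
    "notes.txt": b"meeting notes",
}
duplicates = find_duplicates(sample)
print(duplicates)
```

Each group returned is a set of candidates for consolidation into a single stored copy. Deciding which copy to keep is a policy question that the sketch does not address.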
See also
*Backup
*Digital Asset Management
*Disk storage
*Document management system
*Hierarchical storage management
*Information Lifecycle Management
*Information repository
*Magnetic tape data storage
*Retention schedule