Mass digitization is a term used to describe "large-scale digitization projects of varying scopes." Such projects include efforts to digitize physical books on a mass scale in order to make knowledge openly and publicly accessible. They are made possible by selecting cultural objects, preparing them, scanning them, and constructing the necessary digital infrastructures, including digital libraries. These projects are often piloted by cultural institutions and private bodies; however, individuals may attempt a mass digitization effort as well. Mass digitization efforts occur continually: millions of files (books, photos, color swatches, etc.) are uploaded to large-scale public or private online archives every day. This practice of moving the physical to the digital on a mass scale changes the way we interact with knowledge. The history of mass digitization can be traced back as early as the mid-1800s, with the advent of microfilm, and technical infrastructures such as the internet, data farms, and computer data storage make these efforts technologically possible. This seemingly simple process of digitizing physical knowledge, or even products, has far-reaching implications.
History of Mass Digitization Initiatives
Fictional Considerations
Perhaps one of the most notable fictional treatments of mass digitization is the speculation in "The Library of Babel" by Jorge Luis Borges. In this story, Borges describes a vision of a library in which every possible permutation of books is available. Although Borges describes the preservation and archiving of all knowledge in a physical space (a library), his fictional vision has, in a digital sense, begun to take shape: vast numbers of online books are freely available to the public through internet archives and library databases. Accounts like this were quite common, and they convey the idea that "the dream and practice of mass digitization cultural works have been around for decades."
Non-fictional Considerations
Some of the earliest digitization programs started before the age of the internet and include the adoption of technologies such as microfilm in the 19th century. The technical affordances of microfilm made it a significant medium in efforts to preserve and extend library materials, as did its capacity for "graphically dramatizing questions of scale." Microfilm was also known as microphotography, developed in 1839, and its capabilities demonstrate (perhaps for the first time) the ability to store mass amounts of information, in this case photos, in a physically small space. Discussing the affordances of microfilm, one observer noted that "the whole archives of the nation might be packed away in a snuffbox." Such remarks demonstrate ''how'' the technical infrastructure of microfilm could be leveraged to archive and preserve on a mass scale.
Paul Otlet, a Belgian author often considered one of the founders of information science, "outlined the benefits of microfilm as a stable and long-term remediation format that could be used to extend the reach of literature" in his 1906 work ''Sur une forme nouvelle du livre : le livre microphotographique''. His claim proved right: the Library of Congress and other bodies used microfilm to "digitize" cultural objects such as manuscripts, books, images, and newspapers in the early 20th century.
Technical Infrastructures
Microfilm
Microfilm represents a shift in the infrastructure of data storage: an immense number of images could be stored in a physically small space and then enlarged for viewing with the help of a microfilm machine. Microfilm, in combination with the microfilm viewer, was leveraged to allow objects to be reproduced, preserved, and viewed on a mass scale. Notably, students needed the help of staff before using the machine, whereas accessing digital materials today is a swift process that one can carry out independently. More information on microfilm can be found in the discussion of non-fictional considerations earlier on this page.
Server Farms
Another large shift in the infrastructure of data storage was the advent of server farms. Websites rely on server farms for “scalability, reliability, and low-latency access to Internet content”. According to Burns, these technologies are essential when building a high-performance infrastructure for content delivery. The move from microfilm to complex server farms with their own schemas demonstrates how the infrastructural demands of mass digitization have grown over time. Server farms both facilitate mass digitization and are the place where its results physically reside. Without them, data could not be stored or accessed on the scale that mass digitization projects require. However, server farms do not act alone in storing data. Other storage infrastructures, such as hard drives on personal computers, also aid greatly. Encryption tools and services, in turn, protect and secure data in sensitive or internal-use mass digitization projects.
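To make the scalability and reliability point more concrete, the following is a minimal sketch, not a description of any real deployment, of how digitized objects might be spread across several servers while keeping each object on more than one machine. The server names, the `place_object` function, and the default replica count are all hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical server names; a real server farm may contain thousands of machines.
SERVERS = ["server-a", "server-b", "server-c", "server-d"]


def place_object(object_id: str, replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct servers for an object, based on a hash of its ID."""
    if not 1 <= replicas <= len(SERVERS):
        raise ValueError("replicas must be between 1 and the number of servers")
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    # The hash spreads objects evenly and always maps the same ID to the same place.
    start = int.from_bytes(digest[:8], "big") % len(SERVERS)
    # Take consecutive servers from the hashed position, wrapping around the list.
    return [SERVERS[(start + i) % len(SERVERS)] for i in range(replicas)]


if __name__ == "__main__":
    for scan in ["book-0001/page-001.tiff", "book-0001/page-002.tiff"]:
        print(scan, "->", place_object(scan))
```

Storing every object on two or more servers is one simple way to keep a collection reachable when a single machine fails; the sketch ignores what happens when servers are added or removed, which real systems have to handle.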
Databases
Databases are often seen as the "home" of a variety of mass digitization efforts. Databases such as Google Books allow one to view an entire collection of digitized objects. In the case of Google Books, the database allows a user to search, research, and preview the titles that the Google team has scanned and uploaded: an estimated 40 million, corresponding to roughly 30% of the estimated number of all books ever published. However, faults do exist within such databases; the hands of the person operating the scanner can accidentally be captured and posted in place of the page of the book itself. Errors such as these in public, and often permanent, databases call into question the accuracy of human efforts in mass digitization projects.
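As a rough illustration of what searching a collection of digitized objects involves, the sketch below builds a tiny in-memory catalogue with Python's standard `sqlite3` module. It is not how Google Books or any other real service is implemented; the table layout, the `needs_review` flag for faulty scans, the function names, and the sample titles are all assumptions made up for this example.

```python
import sqlite3


def build_catalogue() -> sqlite3.Connection:
    """Create an in-memory catalogue holding a few made-up page records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " page_number INTEGER NOT NULL,"
        " ocr_text TEXT NOT NULL,"
        " needs_review INTEGER NOT NULL DEFAULT 0)"  # 1 = scan flagged as faulty
    )
    sample_pages = [
        ("A History of Microfilm", 1, "the whole archives of the nation", 0),
        ("A History of Microfilm", 2, "", 1),  # e.g. a hand obscured the page
        ("Library Science Primer", 7, "archives and preservation practice", 0),
    ]
    conn.executemany(
        "INSERT INTO pages (title, page_number, ocr_text, needs_review)"
        " VALUES (?, ?, ?, ?)",
        sample_pages,
    )
    return conn


def search_pages(conn: sqlite3.Connection, term: str) -> list[tuple[str, int]]:
    """Return (title, page_number) for unflagged pages whose text contains `term`."""
    rows = conn.execute(
        "SELECT title, page_number FROM pages"
        " WHERE needs_review = 0 AND ocr_text LIKE ?"
        " ORDER BY title, page_number",
        (f"%{term}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    catalogue = build_catalogue()
    print(search_pages(catalogue, "archives"))
```

Keeping a review flag next to each page is one possible way to stop a faulty scan from surfacing in search results until someone has checked it. Note that `LIKE` treats `%` and `_` in the search term as wildcards, which a production system would need to escape.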
Other databases allow researchers from all over the world to upload or view data for scientific inquiry. In this case, raw data from scientific experiments, anonymized for participant privacy, is uploaded and stored on a mass scale. A prime example of such a research database is the Child Language Data Exchange System (CHILDES). This database houses raw data on language acquisition and includes videos, audio, transcripts, and de-identified participant information. Databases that store published research articles also exist, and include sites such as PubMed, ScienceDirect, JSTOR, and EBSCO.
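One way to picture the de-identification step mentioned above is the following minimal sketch. It is an assumption-laden illustration, not the procedure used by CHILDES or any other repository: the field names, the `DIRECT_IDENTIFIERS` set, and the `pseudonymize` and `deidentify` functions are hypothetical. It drops directly identifying fields and replaces the participant's name with a keyed hash, which is strictly speaking pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as directly identifying.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "address"}


def pseudonymize(participant_name: str, secret_key: bytes) -> str:
    """Derive a stable participant ID from a name using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, participant_name.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:12]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` without direct identifiers, plus a pseudonymous ID."""
    cleaned = {key: value for key, value in record.items()
               if key not in DIRECT_IDENTIFIERS}
    cleaned["participant_id"] = pseudonymize(record["name"], secret_key)
    return cleaned


if __name__ == "__main__":
    raw = {
        "name": "Example Child",          # made-up participant
        "date_of_birth": "2020-01-01",
        "age_months": 30,
        "transcript": "more juice please",
    }
    print(deidentify(raw, secret_key=b"keep-this-key-private"))
```

Because the same name and key always yield the same ID, recordings from one participant can still be linked across sessions without the name ever being uploaded. Free-text fields such as transcripts can still contain identifying details, which this sketch does not address.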
Databases, in conjunction with server farms and other web-based infrastructures, allow for crucial collaboration in the scientific realm. Here, mass digitization has expanded from the digitization of physical objects (such as books) to the digitization of interactions for scientific inquiry.
Implications
References
* Auerbach, J.; Gitelman, L. (2007-06-13). "Microfilm, Containment, and the Cold War". ''American Literary History''. 19 (3): 745–768. ISSN 0896-7148.
* Luther, Frederic. ''Microfilm: A History, 1839–1900''. Annapolis, MD: The National Microfilm Association, 1959.
* Goldschmidt, & Otlet, P. (1906). ''Sur une forme nouvelle du livre : le livre microphotographique''. Institut international de bibliographie.
* La Hood, Charles G. "Microfilm for the Library of Congress." ''College & Research Libraries'' 34.4 (1973): 291–294.
* Duncan, Virginia L., and Frances E. Parsons. "Use of Microfilm in an Industrial Research Library." ''Spec Libr'' 61.6 (1970): 288–290.