
Book scanning or book digitization (also: magazine scanning or magazine digitization) is the process of converting physical books and magazines into digital media such as images, electronic text, or electronic books (e-books) by using an image scanner.
Large-scale book scanning projects have made many books available online.
Digital books can be easily distributed, reproduced, and read on-screen. Common file formats are DjVu, Portable Document Format (PDF), and Tag Image File Format (TIFF). To convert the raw images, optical character recognition (OCR) is used to turn book pages into a digital text format such as ASCII, which reduces the file size and allows the text to be reformatted, searched, or processed by other applications.
Image scanners may be manual or automated. In an ordinary commercial image scanner, the book is placed on a flat glass plate (or platen), and a light and optical array moves across the book underneath the glass. In manual book scanners, the glass plate extends to the edge of the scanner, making it easier to line up the book's spine.
A problem with scanning bound books is that when a book that is not very thin is laid flat, the part of the page close to the spine (the gutter) is significantly curved, distorting the text in that part of the scan. One solution is to separate the book into separate pages by cutting or unbinding. A non-destructive method is to hold the book in a V-shaped holder and photograph it, rather than lay it flat and scan it. The curvature in the gutter is much less pronounced this way. Pages may be turned by hand or by automated paper transport devices. Transparent plastic or glass sheets are usually pressed against the page to flatten it.
After scanning, software adjusts the document images by aligning them, cropping them, and editing them, and then converts them to text and the final e-book form. Human proofreaders usually check the output for errors.
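The cropping step mentioned above can be illustrated with a short sketch. This is not the method of any particular scanning package: it assumes a grayscale page stored as a list of rows of 0 to 255 values, and the function name `autocrop` and its `threshold` parameter are hypothetical.

```python
def autocrop(page, threshold=200):
    """Crop a grayscale page to the bounding box of its dark (inked) pixels.

    `page` is a list of rows, each a list of 0-255 values (0 = black).
    Pixels darker than `threshold` count as content. The representation
    and the threshold value are assumptions made for illustration.
    """
    # Rows and columns that contain at least one dark pixel.
    rows = [r for r, row in enumerate(page) if any(v < threshold for v in row)]
    if not rows:
        return []  # blank page: nothing to keep
    cols = [c for c in range(len(page[0]))
            if any(row[c] < threshold for row in page)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in page[top:bottom + 1]]


# A 4 x 5 page with a small dark mark surrounded by white margin.
scan = [
    [255, 255, 255, 255, 255],
    [255,  30,  40, 255, 255],
    [255, 255,  20, 255, 255],
    [255, 255, 255, 255, 255],
]
cropped = autocrop(scan)
```

Real post-processing software would typically also deskew the image and tolerate specks of noise in the margins, which this sketch does not attempt.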
Scanning resolution for book digitization varies depending on the purpose and nature of the material. While a lower resolution is generally adequate for text conversion, archival institutions recommend higher resolutions for preservation and rare materials. The National Archives of Australia suggests 400 ppi for bound books and 600 ppi for rare or significant documents, while the Federal Agencies Digitization Guidelines Initiative (FADGI) recommends a minimum of 400 ppi for archival materials.
These higher resolutions ensure the capture of fine details and support long-term preservation efforts, while a tiered approach balances quality with practical constraints such as storage capacity and resource limitations. This strategy allows institutions to optimize digitization efforts, applying higher resolutions selectively to rare or significant materials while using standard resolutions for more common documents.
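The storage side of this trade-off can be made concrete with a little arithmetic. The sketch below is an illustration only: the 6 x 9 inch page and the 24-bit uncompressed color assumption are hypothetical, and `scan_size` is not part of any standard tool.

```python
def scan_size(width_in, height_in, ppi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one page image.

    bytes_per_pixel=3 assumes 24-bit color with no compression; real
    archival files (for example compressed TIFF) will differ.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Hypothetical 6 x 9 inch page at the two resolutions quoted above.
for ppi in (400, 600):
    w, h, size = scan_size(6, 9, ppi)
    print(f"{ppi} ppi: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

Because the pixel count grows with the square of the resolution, moving from 400 to 600 ppi multiplies the uncompressed size by 2.25, which is one reason to reserve the higher setting for rare or significant items.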
High-end scanners capable of thousands of pages per hour can cost thousands of dollars, but do-it-yourself (DIY), manual book scanners capable of 1,200 pages per hour have been built for US$300.
Commercial book scanners

Commercial book scanners differ from ordinary scanners: they usually consist of a high-quality digital camera with light sources on either side, mounted on a frame that gives a person or machine easy access to turn the pages of the book. Some models include V-shaped book cradles, which support the book's spine and also center the book automatically.
The advantage of this type of scanner is that it is very fast compared to overhead scanners.
Large-scale projects
Projects like Project Gutenberg (est. 1971), Million Book Project (est. circa 2001), Google Books (est. 2004), and the Open Content Alliance (est. 2005) scan books on a large scale.
One of the main challenges is the sheer volume of books that must be scanned. In 2010 the total number of works appearing as books in human history was estimated to be around 130 million. All of these would have to be scanned and made searchable online for the public to use them as a universal library. Currently, large organizations rely on three main approaches: outsourcing, scanning in-house using commercial book scanners, and scanning in-house using robotic scanning solutions.
With outsourcing, books are often shipped to low-cost providers in India or China for scanning. Alternatively, for reasons of convenience, safety, and improved technology, many organizations choose to scan in-house, using either overhead scanners, which are time-consuming, or digital camera-based scanning machines, which are substantially faster; the latter method is employed by the Internet Archive as well as Google.
Traditional methods have included cutting off the book's spine and scanning the pages in a
scanner with automatic page-feeding capability, with subsequent rebinding of the loose pages.
Once the page is scanned, the data is entered either manually or via OCR, another major cost of book scanning projects.
Due to copyright issues, most scanned books are those that are out of copyright; however, Google Books is known to scan books still protected under copyright unless the publisher specifically prohibits this.
Collaborative projects
There are many collaborative digitization projects throughout the United States. Two of the earliest projects were the Collaborative Digitization Project in Colorado and
NC ECHO – North Carolina Exploring Cultural Heritage Online, based at the
State Library of North Carolina.
These projects establish and publish best practices for digitization and work with regional partners to digitize cultural heritage materials. Additional criteria for best practices have more recently been established in the UK, Australia and the European Union.
Wisconsin Heritage Online is a collaborative digitization project modeled after the Colorado Collaborative Digitization Project. Wisconsin uses a wiki to build and distribute collaborative documentation. Georgia's collaborative digitization program, the Digital Library of Georgia, presents a seamless virtual library on the state's history and life, including more than a hundred digital collections from 60 institutions and 100 agencies of government. The Digital Library of Georgia is a GALILEO initiative based at the University of Georgia Libraries.
In the twentieth century, the
Hill Museum and Manuscript Library photographed books in Ethiopia that were subsequently destroyed amidst political violence in 1975. The library has since worked to photograph manuscripts in Middle Eastern countries.
In South Asia, the Nanakshahi trust is digitizing manuscripts of Gurmukhī script.
In Australia, there have been many collaborative projects between the National Library of Australia and universities to improve the repository infrastructure in which digitized information would be stored. These projects include the ARROW (Australian Research Repositories Online to the World) project and the APSR (Australian Partnership for Sustainable Repositories) project.
Destructive scanning methods
For book scanning on a low budget, the least expensive way to scan a book or magazine is to cut off the binding. This converts the book or magazine into a sheaf of separate sheets which can be loaded into a standard
automatic document feeder (ADF) and scanned using inexpensive and common scanning technology. The method is not suitable for rare or valuable books. There are two technical difficulties with this process: first the cutting, and second the scanning.
Unbinding
Unbinding by hand with suitable tools is more precise and less destructive than cutting pages. This technique has been successfully employed for tens of thousands of pages of archival original paper scanned for the Riazanov Library digital archive project from newspapers, magazines, and pamphlets, varying from 50 to 100 years old and more, and often composed of fragile, brittle paper. Although unbinding destroys the monetary value for some collectors (and for most sellers of this sort of material), in many cases it greatly assists preservation of the pages, making them more accessible to researchers
and less likely to be damaged when subsequently examined. A disadvantage is that unbound stacks of pages are "fluffed up", and therefore more exposed to oxygen in the air, which may in some cases speed deterioration. This can be addressed by putting weights on the pages after they are unbound and storing them in appropriate containers.
Hand unbinding preserves text that runs into the gutters of bindings and, most critically, allows easier and more complete high-quality scans to be made of two-page-wide material, such as center cartoons, graphic art, and photos in magazines. The digital archive of The Liberator (1918-1924) on Marxists Internet Archive demonstrates the quality of two-page-wide graphic art scans made possible by careful hand unbinding, then scanning.
Unbinding techniques vary with the binding technology, from simply removing a few staples, to unbending and removing nails, to meticulously grinding down layers of glue on the spine of a book to precisely the right point, followed by laborious removal of the string used to hold the book together.
With some newspapers (such as Labor Action, 1950-1952) there are columns in the center of facing pages that run across both pages. Chopping off part of the spine of a bound volume of such papers loses part of this text. Even the Greenwood Reprint of this publication failed to preserve the text content of those center columns, cutting off significant amounts of text there. Only when bound volumes of the original newspaper were meticulously unbound, and the opened pairs of center pages were scanned as a single page on a flatbed scanner, was the center column content made digitally available. Alternatively, one can present the two facing center pages as three scans: one of each individual page, and one of a page-sized area situated over the center of the two pages.
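The three-scan layout described above comes down to simple geometry. The following sketch assumes an opened spread exactly two pages wide with the fold in the middle; the function name `center_spread_regions` and the example dimensions are hypothetical.

```python
def center_spread_regions(page_width, page_height):
    """Return crop boxes (left, top, right, bottom) for a two-page spread.

    The spread is assumed to be 2 * page_width wide with the fold exactly
    in the middle. Units are arbitrary (pixels, millimeters, and so on).
    """
    left_page = (0, 0, page_width, page_height)
    right_page = (page_width, 0, 2 * page_width, page_height)
    # Page-sized window centered on the fold, so a column that straddles
    # the two pages is captured whole in a single image.
    half = page_width / 2
    center = (page_width - half, 0, page_width + half, page_height)
    return {"left": left_page, "right": right_page, "center": center}


regions = center_spread_regions(page_width=1000, page_height=1400)
```

The center window overlaps half of each page, so any column that crosses the fold appears unbroken in the third scan, while the first two scans keep the normal page-by-page reading order.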
Cutting
One way of cutting a stack of 500 to 1,000 pages in one pass is to use a guillotine paper cutter, a large steel table with a paper vise that screws down onto the stack and firmly secures it before cutting.
A large sharpened steel blade which moves straight down cuts the entire length of each sheet in one operation. A lever on the blade permits several hundred pounds of force to be applied to the blade for a quick one-pass cut.
A clean cut through a thick stack of paper cannot be made with a traditional inexpensive sickle-shaped hinged
paper cutter. These cutters are only intended for a few sheets, with up to ten sheets being the practical cutting limit. A large stack of paper applies torsional forces on the hinge, pulling the blade away from the cutting edge on the table. The cut becomes more inaccurate as the cut moves away from the hinge, and the force required to hold the blade against the cutting edge increases as the cut moves away from the hinge.
The guillotine cutting process dulls the blade over time, requiring that it be resharpened. Coated paper such as slick magazine paper dulls the blade more quickly than plain book paper, due to the kaolinite clay coating. Additionally, removing the binding of an entire hardcover book causes excessive wear due to cutting through the cover's stiff backing material. Instead, the outer cover can be removed so that only the interior pages need be cut.
An alternate method of unbinding books is to use a table saw. While this method is potentially dangerous and does not leave as smooth an edge as the guillotine paper cutter method, it is more readily available to the average person. The ideal method is to clamp the book between two thick boards using heavy machine screws to provide the clamping force. The entire wood and book package is fed through the table saw using the rip fence as a guide. A sharp fine carbide tooth blade is ideal for generating an acceptable cut. The quality of the cut depends on the blade, feed rate, type of paper, paper coating, and binding material.
Scanning

Once the paper is liberated from the spine, it can be scanned one sheet at a time using a
flatbed scanner or
automatic document feeder (ADF).
Pages with a decorative riffled edging or curving in an arc due to a non-flat binding can be difficult to scan using an ADF, as they are designed to scan pages of uniform shape and size, and variably sized or shaped pages can lead to improper scanning. The riffled edges or curved edge can be guillotined off to render the outer edges flat and smooth before the binding is cut.
The coated paper of magazines and bound textbooks can make them difficult for the rollers in an ADF to pick up and guide along the paper path. An ADF which uses a series of rollers and channels to flip sheets over may jam or misfeed when fed coated paper. Generally there are fewer problems when the paper path is as straight as possible, with few bends and curves. The clay can also rub off the paper over time and coat the sticky pickup rollers, causing them to grip the paper loosely. The ADF rollers may need periodic cleaning to prevent this slipping.
Magazines can pose a bulk-scanning challenge due to small nonuniform sheets of paper in the stack, such as magazine subscription cards and fold out pages. These need to be removed before the bulk scan begins, and are either scanned separately if they include worthwhile content, or are simply left out of the scan process.
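If sheet dimensions are known in advance (for example, measured by hand or taken from a quick preview pass, which is an assumption here), the odd-sized inserts can be flagged automatically before bulk scanning. The helper below is a hypothetical sketch, and the 5% tolerance is an arbitrary choice.

```python
from statistics import median


def find_odd_sheets(sheet_sizes, tolerance=0.05):
    """Return indexes of sheets whose size differs from the typical sheet.

    `sheet_sizes` is a list of (width, height) tuples in any one unit.
    A sheet is flagged when either dimension deviates from the median by
    more than `tolerance` (a fraction; 5% is an arbitrary assumption).
    """
    typical_w = median(w for w, _ in sheet_sizes)
    typical_h = median(h for _, h in sheet_sizes)
    return [
        i for i, (w, h) in enumerate(sheet_sizes)
        if abs(w - typical_w) > tolerance * typical_w
        or abs(h - typical_h) > tolerance * typical_h
    ]


# Hypothetical magazine: regular pages, one subscription card, one fold-out.
stack = [(8.0, 10.5), (8.0, 10.5), (4.0, 6.0), (8.0, 10.5), (16.0, 10.5)]
odd = find_odd_sheets(stack)
```

The flagged sheets can then be pulled from the stack and either scanned separately on a flatbed or left out, as described above.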
Robotic book scanners
A robotic or automated book scanner is a device that digitizes printed books by using robotic systems to turn pages and capture images of each page without the need for human hands to touch the book. The scanner consists of a mechanism to automatically turn pages, one or more cameras to photograph each page, and software to compile these images into a digital file. These scanners are used to digitize large quantities of books quickly. Some models allow for manual operation if a book is too delicate or complex for the robot to handle alone. The process is designed to be gentle on books, often using special cradles and glass plates to avoid damage during scanning.
Most high-end commercial robotic scanners use air and suction technology to turn and separate pages. These scanners utilize a vacuum or air suction to gently lift a page from the stack, while a puff of air is used to turn the page over, allowing the device to scan both sides efficiently. Some use newer approaches such as bionic fingers for turning pages. Some scanners take advantage of ultrasonic or photoelectric sensors to detect dual pages and prevent skipping of pages.
With reports of machines being able to scan up to 2,900 pages per hour, robotic book scanners are specifically designed for large-scale digitization projects.
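To give a sense of what such rates mean for a project, the sketch below compares the 2,900 pages per hour reported for robotic scanners with the 1,200 pages per hour quoted earlier for DIY manual scanners. The collection size (10,000 books of 300 pages each) is purely hypothetical, and the estimate ignores setup time, book changes, and rescans.

```python
import math


def scanning_hours(total_pages, pages_per_hour):
    """Hours of continuous scanning needed, ignoring setup and book changes."""
    return total_pages / pages_per_hour


# Hypothetical collection: 10,000 books averaging 300 pages each.
total_pages = 10_000 * 300
robotic = scanning_hours(total_pages, 2900)  # rate reported for robotic scanners
manual = scanning_hours(total_pages, 1200)   # rate quoted for DIY manual scanners
print(f"robotic: {math.ceil(robotic)} h, manual: {math.ceil(manual)} h")
```

Under these assumptions a single robotic scanner would need roughly 1,035 hours against 2,500 hours for a manual rig, before accounting for any operator or handling overhead.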
Google's patent 7508978 shows an infrared camera technology which allows detection of, and automatic adjustment for, the three-dimensional shape of the page (Maureen Clements, "The Secret Of Google's Book Scanning Machine Revealed", April 30, 2009).
Robotic book scanners that use air and suction technology rely on specialized systems to turn and separate pages without causing damage to fragile or rare books.
See also
* Digital library
* Institutional repository
* Optical character recognition
* Planetary scanner
* Europeana
External links
* Do It Yourself book scanner device forum
* Google Open Source Linear Book Scanner
* Stanford University video shows some book scanning
* University of Tokyo high speed scanner