Book scanning or book digitization (also: magazine scanning or magazine digitization) is the process of converting physical books and magazines into digital media such as images, electronic text, or electronic books (e-books) by using an image scanner. Large-scale book scanning projects have made many books available online.

Digital books can be easily distributed, reproduced, and read on-screen. Common file formats are DjVu, Portable Document Format (PDF), and Tagged Image File Format (TIFF). To convert the raw images, optical character recognition (OCR) is used to turn book pages into a digital text format such as ASCII or a similar format, which reduces the file size and allows the text to be reformatted, searched, or processed by other applications.

Image scanners may be manual or automated. In an ordinary commercial image scanner, the book is placed on a flat glass plate (or platen), and a light and optical array moves across the book underneath the glass. In manual book scanners, the glass plate extends to the edge of the scanner, making it easier to line up the book's spine.

A problem with scanning bound books is that when a book that is not very thin is laid flat, the part of the page close to the spine (the gutter) is significantly curved, distorting the text in that part of the scan. One solution is to separate the book into individual pages by cutting or unbinding. A non-destructive method is to hold the book in a V-shaped holder and photograph it, rather than lay it flat and scan it. The curvature in the gutter is much less pronounced this way. Pages may be turned by hand or by automated paper transport devices. Transparent plastic or glass sheets are usually pressed against the page to flatten it.

After scanning, software adjusts the page images by aligning, cropping, and picture-editing them, and converts them to text and final e-book form. Human proofreaders usually check the output for errors.

Scanning at () is adequate for conversion to digital text output, but for archival reproduction of rare, elaborate or illustrated books, much higher resolution is used. High-end scanners capable of thousands of pages per hour can cost thousands of dollars, but do-it-yourself (DIY) manual book scanners capable of 1,200 pages per hour have been built for US$300.
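The cropping step in the post-scan processing described above can be illustrated with a minimal sketch. It assumes the page image is already available as a grid of grayscale values (0 for black, 255 for white). The names `content_bounding_box` and `crop` and the `threshold` parameter are hypothetical, chosen for illustration, and real book-scanning software would typically also deskew and clean the image.

```python
from typing import List, Optional, Tuple

# (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def content_bounding_box(page: List[List[int]], threshold: int = 200) -> Optional[Box]:
    """Return the smallest box containing every pixel darker than `threshold`.

    Returns None when the page has no dark pixels (a blank page).
    The threshold value is an assumption, not a standard.
    """
    rows = [r for r, row in enumerate(page) if any(v < threshold for v in row)]
    if not rows:
        return None
    cols = [c for row in page for c, v in enumerate(row) if v < threshold]
    return rows[0], min(cols), rows[-1] + 1, max(cols) + 1


def crop(page: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the page down to the given box."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]
```

A caller would compute the box once per page and pass it to `crop`; a blank page yields `None` and can be skipped or kept uncropped.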


Commercial book scanners

Commercial book scanners are not like normal scanners; these book scanners are usually a high-quality digital camera with light sources on either side of the camera, mounted on a frame that gives a person or machine easy access to flip through the pages of the book. Some models involve V-shaped book cradles, which support the book's spine and also center the book automatically. The advantage of this type of scanner is that it is very fast compared to overhead scanners.


Large-scale projects

Projects like Project Gutenberg (est. 1971), Million Book Project (est. circa 2001), Google Books (est. 2004), and the Open Content Alliance (est. 2005) scan books on a large scale.

One of the main challenges is the sheer volume of books that must be scanned. In 2010 the total number of works appearing as books in human history was estimated to be around 130 million. All of these must be scanned and then made searchable online for the public to use as a universal library. Currently, large organizations rely on three main approaches: outsourcing, scanning in-house using commercial book scanners, and scanning in-house using robotic scanning solutions.

With outsourcing, books are often shipped to low-cost providers in India or China for scanning. Alternatively, for reasons of convenience, safety and improved technology, many organizations choose to scan in-house, using either overhead scanners, which are time-consuming, or digital camera-based scanning machines, which are substantially faster; the latter method is employed by the Internet Archive as well as Google. Traditional methods have included cutting off the book's spine and scanning the pages in a scanner with automatic page-feeding capability, with subsequent rebinding of the loose pages.

Once the page is scanned, the data is either entered manually or via OCR, another major cost of book scanning projects.

Due to copyright issues, most scanned books are those that are out of copyright; however, Google Books is known to scan books still protected under copyright unless the publisher specifically prohibits this.
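To give a sense of the volume challenge described above, the arithmetic can be sketched as follows. The figure of 130 million books comes from the text; the average of 300 pages per book, the 8-hour day and the 250 working days per year are assumptions made only for illustration, and the function names are hypothetical.

```python
def scanner_hours(books: int, pages_per_book: int, pages_per_hour: float) -> float:
    """Hours of continuous scanning needed for the whole collection."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return books * pages_per_book / pages_per_hour


def scanner_years(hours: float, hours_per_day: float = 8.0, days_per_year: int = 250) -> float:
    """Convert scanning hours into working years for a single scanner.

    The default working calendar is an assumption, not a figure from the text.
    """
    return hours / (hours_per_day * days_per_year)


# 130 million books (from the text) at an assumed 300 pages each,
# scanned at 1,200 pages per hour.
hours = scanner_hours(130_000_000, 300, 1200)  # 32,500,000 hours
years = scanner_years(hours)                   # 16,250 scanner-years
```

Under these assumptions a single scanner would need thousands of working years, which is why large projects spread the work across many machines and operators.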


Collaborative projects

There are many collaborative digitization projects throughout the United States. Two of the earliest projects were the Collaborative Digitization Project in Colorado and NC ECHO (North Carolina Exploring Cultural Heritage Online), based at the State Library of North Carolina. These projects establish and publish best practices for digitization and work with regional partners to digitize cultural heritage materials. Additional criteria for best practices have more recently been established in the UK, Australia and the European Union.

Wisconsin Heritage Online is a collaborative digitization project modeled after the Colorado Collaborative Digitization Project. Wisconsin uses a wiki to build and distribute collaborative documentation. Georgia's collaborative digitization program, the Digital Library of Georgia, presents a seamless virtual library on the state's history and life, including more than a hundred digital collections from 60 institutions and 100 agencies of government. The Digital Library of Georgia is a GALILEO initiative based at the University of Georgia Libraries.

In the twentieth century, the Hill Museum and Manuscript Library photographed books in Ethiopia that were subsequently destroyed amidst political violence in 1975. The library has since worked to photograph manuscripts in Middle Eastern countries.

In South Asia, the Nanakshahi trust is digitizing manuscripts in the Gurmukhī script.

In Australia, there have been many collaborative projects between the National Library of Australia and universities to improve the repository infrastructure in which digitized information would be stored. These projects include the ARROW (Australian Research Repositories Online to the World) project and the APSR (Australian Partnership for Sustainable Repository) project.


Destructive scanning methods

For book scanning on a low budget, the least expensive way to scan a book or magazine is to cut off the binding. This converts the book or magazine into a sheaf of separate sheets which can be loaded into a standard automatic document feeder.

... and less likely to be damaged when subsequently examined. A disadvantage is that unbound stacks of pages are "fluffed up", and therefore more exposed to oxygen in the air, which may in some cases speed deterioration. This can be addressed by putting weights on the pages after they are unbound and storing them in appropriate containers.

Hand unbinding preserves text that runs into the gutters of bindings and, most critically, allows easier and more complete high-quality scans of two-page-wide material, such as center cartoons, graphic art, and photos in magazines. The digital archive of The Liberator 1918-1924 on Marxists Internet Archive demonstrates the quality of two-page-wide graphic art scans made possible by careful hand unbinding followed by scanning.

Unbinding techniques vary with the binding technology, from simply removing a few staples, to unbending and removing nails, to meticulously grinding down layers of glue on the spine of a book to precisely the right point, followed by laborious removal of the string used to hold the book together.

With some newspapers (such as Labor Action 1950-1952) there are columns in the center of facing pages that run across both pages. Chopping off part of the spine of a bound volume of such papers will lose part of this text. Even the Greenwood Reprint of this publication failed to preserve the text of those center columns, cutting off significant amounts of text there. Only when bound volumes of the original newspaper were meticulously unbound, and the opened pairs of center pages were scanned as a single page on a flatbed scanner, was the center column content made digitally available. Alternatively, one can present the two facing center pages as three scans: one of each individual page, and one of a page-sized area situated over the center of the two pages.
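The three-scan presentation of a center spread described above can be sketched as a small geometry helper. This is only an illustration: it assumes the opened spread has already been captured as one image whose gutter lies at the horizontal midpoint, and the function name `center_spread_regions` and the box layout are hypothetical.

```python
from typing import Dict, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def center_spread_regions(spread_width: int, spread_height: int) -> Dict[str, Box]:
    """Return crop boxes for the left page, the right page, and a
    page-sized area centered over the gutter of an opened spread."""
    page_width = spread_width // 2
    # Left edge of the page-sized window centered on the gutter.
    center_left = (spread_width - page_width) // 2
    return {
        "left_page": (0, 0, page_width, spread_height),
        "right_page": (spread_width - page_width, 0, spread_width, spread_height),
        "center": (center_left, 0, center_left + page_width, spread_height),
    }
```

The center box overlaps half of each page, so text running across the gutter appears unbroken in that third image.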


Cutting

One way of cutting a stack of 500 to 1,000 pages in one pass is to use a guillotine paper cutter, a large steel table with a paper vise that screws down onto the stack and firmly secures it before cutting. A large sharpened steel blade, which moves straight down, cuts the entire length of each sheet in one operation. A lever on the blade permits several hundred pounds of force to be applied to the blade for a quick one-pass cut.

A clean cut through a thick stack of paper cannot be made with a traditional inexpensive sickle-shaped hinged paper cutter. These cutters are only intended for a few sheets, with up to ten sheets being the practical cutting limit. A large stack of paper applies torsional forces on the hinge, pulling the blade away from the cutting edge on the table. The cut becomes less accurate as it moves away from the hinge, and the force required to hold the blade against the cutting edge increases.

The guillotine cutting process dulls the blade over time, requiring that it be resharpened. Coated paper such as slick magazine paper dulls the blade more quickly than plain book paper, due to the kaolinite clay coating. Additionally, removing the binding of an entire hardcover book causes excessive wear due to cutting through the cover's stiff backing material. Instead, the outer cover can be removed and only the interior pages need be cut.

An alternative method of unbinding books is to use a table saw. While this method is potentially dangerous and does not leave as smooth an edge as the guillotine paper cutter method, it is more readily available to the average person. The ideal method is to clamp the book between two thick boards using heavy machine screws to provide the clamping force. The entire wood and book package is fed through the table saw using the rip fence as a guide. A sharp, fine-toothed carbide blade is ideal for generating an acceptable cut. The quality of the cut depends on the blade, feed rate, type of paper, paper coating, and binding material.


Scanning

Once the paper is liberated from the spine, it can be scanned one sheet at a time using a flatbed scanner or automatic document feeder.

Software-driven machines and robots have been developed to scan books without the need to unbind them, both to preserve the contents of the document and to create a digital image archive of its current state. This recent trend has been due in part to ever-improving imaging technologies that allow a high-quality digital archive image to be captured with little or no damage to a rare or fragile book in a reasonably short period of time.

The first fully automated book scanner was the DL (Digitizing Line) scanner, manufactured by 4DigitalBooks in Switzerland. The first known installation was at Stanford University in 2001. The scanner received a Dow Jones Runner-Up award in the Business Applications category in 2001.

In 2007 the company TREVENTUS presented an automated book scanner with a book opening angle for scanning of 60°, which is an improvement for the conservation of books during scanning. The company was awarded the European Union "ICT Grand Prize 2007" for its development of the ScanRobot. This technology was also used in a mass digitization project at the Bavarian State Library, where 8,900 books from the 16th century were digitized in 18 months using three of these V-shaped scanners.

Indus International, Inc., based in West Salem, Wisconsin, produces scanners which were bought by some US entities for services such as interlibrary loan.

Most high-end commercial robotic scanners use air and suction technology, while some use newer approaches such as bionic fingers for turning pages. Some scanners take advantage of ultrasonic or photoelectric sensors to detect dual pages and prevent skipping of pages. With reports of machines being able to scan up to 2,900 pages per hour, robotic book scanners are specifically designed for large-scale digitization projects.

Google's patent 7508978 shows an infrared camera technology which allows detection and automatic adjustment of the three-dimensional shape of the page. Researchers from the University of Tokyo have an experimental non-destructive book scanner that includes a 3D surface scanner to allow images of a curved page to be straightened in software. Thus the book or magazine can be scanned as quickly as the operator can flip through the pages, about 200 pages per minute.

There are techniques to minimise and to correct for distortion in the page gutter.
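One simple way to picture software correction of a curved page is arc-length flattening. The sketch below is not the method of any scanner or patent named above; it is a one-dimensional simplification that assumes a surface scan supplies the page height at evenly spaced positions across the page and that the height does not vary from the top to the bottom of the page. The function name `flattened_positions` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def flattened_positions(heights: Sequence[float], dx: float = 1.0) -> List[float]:
    """Map each sampled column of a curved page to its position on the flat page.

    `heights[i]` is the page surface height at horizontal position i * dx.
    The flat-page position of a column is the cumulative arc length along
    the height profile, so steep regions near the gutter are stretched
    back to their true width.
    """
    if not heights:
        return []
    positions = [0.0]
    for prev, cur in zip(heights, heights[1:]):
        # Length of the surface segment between two neighbouring samples.
        positions.append(positions[-1] + math.hypot(dx, cur - prev))
    return positions
```

Resampling the image columns at these positions would undo the horizontal compression of text near the gutter; a full implementation would also need to correct for perspective and for curvature that changes along the height of the page.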


See also

* Digital library
* Institutional repository
* Optical character recognition
* Planetary scanner
* Europeana


External links


* Do It Yourself book scanner device forum
* Google Open Source Linear Book Scanner
* Stanford University video shows some book scanning
* University of Tokyo high speed scanner