DjVu (pronounced like French "déjà vu") is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, indexed color images, and photographs. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

DjVu has been promoted as providing smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 kB, black-and-white technical papers compress to 15–40 kB, and ancient manuscripts compress to around 100 kB; a satisfactory JPEG image typically requires 500 kB. Like PDF, DjVu can contain an OCR text layer, making it easy to perform copy and paste and text search operations.

Free creators, manipulators, converters, Web browser plug-ins, and desktop viewers are available. DjVu is supported by a number of multi-format document viewers and e-book reader software on Linux (Okular, Evince, Zathura), Windows (Okular, SumatraPDF), and Android (Document Viewer, FBReader, EBookDroid, PocketBook).


History

The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, Paul G. Howard, Patrice Simard, and Yoshua Bengio at AT&T Labs from 1996 to 2001.

Prior to the standardization of PDF in 2008, DjVu had been considered superior because it was an open file format, in contrast to the proprietary nature of PDF at the time. The declared higher compression ratio (and thus smaller file size) and the claimed ease of converting large volumes of text into DjVu format were other arguments for DjVu's superiority over PDF in the technology landscape of 2004. In a 2004 talk on IT Conversations, independent technologist Brewster Kahle discussed the benefits of allowing easier access to DjVu files.

The DjVu library distributed as part of the open-source package ''DjVuLibre'' has become the reference implementation for the DjVu format. DjVuLibre has been maintained and updated by the original developers of DjVu since 2002. The DjVu file format specification has gone through a number of revisions, the most recent being from 2005.


Role in the software ecosystem

The primary usage of the DjVu format has been the electronic distribution of documents with a quality comparable to that of printed documents. As that niche is also the primary usage for PDF, it was inevitable that the two formats would become competitors. The two formats nevertheless approach the problem of delivering high-resolution documents in very different ways: PDF primarily encodes graphics and text as vectorised data, whereas DjVu primarily encodes them as pixmap images. This means PDF places the burden of rendering the document on the reader, whereas DjVu places that burden on the creator.

For a number of years, significantly overlapping with the period when DjVu was being developed, there were no PDF viewers for free operating systems; a particular stumbling block was the rendering of vectorised fonts, which are essential for combining small file size with high resolution in PDF. Since displaying DjVu was a simpler problem for which free software was available, there were suggestions that the free software movement should employ DjVu instead of PDF for distributing documentation; rendering for creating DjVu is in principle not much different from rendering for a device-specific printer driver, and DjVu can as a last resort be generated from scans of paper media. However, when FreeType 2.0 began to provide rendering of all major vectorised font formats in 2000, that specific advantage of DjVu began to erode.

In the 2000s, with the growth of the World Wide Web and before widespread adoption of broadband, DjVu was often adopted by digital libraries as their format of choice, thanks to its integration with software like Greenstone and the Internet Archive, browser plugins which allowed advanced online browsing, smaller file size for comparable quality of book scans and other image-heavy documents, and support for embedding and searching full text from OCR. Some features, such as the thumbnail previews, were later integrated into the Internet Archive's BookReader, and DjVu browsing was deprecated in its favour as, around 2015, some major browsers stopped supporting NPAPI and DjVu plugins with them. DjVu.js Viewer attempts to replace the missing plugins.


Technical overview


File structure

The DjVu file format is based on the Interchange File Format (IFF) and is composed of hierarchically organized chunks. The IFF structure is preceded by a 4-byte AT&T magic number. This is followed by a single FORM chunk with a secondary identifier of either DJVU or DJVM for a single-page or a multi-page document, respectively. All the chunks can be contained in a single file, in the case of so-called bundled documents, or they can be spread over several files: one file for every page plus some files with shared chunks.
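The header layout described above can be checked with a few lines of Python. This is a minimal sketch, not the reference implementation: it assumes the magic number is the four ASCII bytes `AT&T` and that the FORM chunk uses the usual IFF layout of a 4-byte identifier followed by a 4-byte big-endian length. The function name `read_djvu_header` is hypothetical.

```python
import struct

MAGIC = b"AT&T"  # assumption: the 4-byte magic number is these ASCII bytes

# Secondary identifiers named in the text; other values are reported as "unknown".
FORM_KINDS = {b"DJVU": "single-page", b"DJVM": "multi-page"}


def read_djvu_header(data: bytes) -> dict:
    """Parse the first 16 bytes of a DjVu file (hypothetical helper)."""
    if len(data) < 16:
        raise ValueError("need at least 16 bytes for magic + FORM header")
    if data[:4] != MAGIC:
        raise ValueError("missing AT&T magic number")
    # IFF chunk header: 4-byte chunk id, then 4-byte big-endian payload length.
    chunk_id, length = struct.unpack(">4sI", data[4:12])
    if chunk_id != b"FORM":
        raise ValueError("expected a FORM chunk after the magic number")
    form_type = data[12:16]
    return {
        "form_type": form_type.decode("ascii", errors="replace"),
        "kind": FORM_KINDS.get(form_type, "unknown"),
        "length": length,
    }


# Usage with a synthetic header (not a real document):
sample = MAGIC + b"FORM" + struct.pack(">I", 4) + b"DJVM"
header = read_djvu_header(sample)
print(header["kind"], header["length"])
```

To inspect a real file, pass in its first 16 bytes, for example `open(path, "rb").read(16)`.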


Chunk types


Compression

DjVu divides a single image into many different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bilevel image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2 (similar to JBIG2).

The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page it occurs. Optionally, these shapes may be mapped to UTF-8 codes (either by hand or potentially by a text recognition system) and stored in the DjVu file. If this mapping exists, it is possible to select and copy text.

Since JB2 (also called DjVuBitonal) is a variation on JBIG2, working on the same principles, both compression methods have the same problems when performing lossy compression. In 2013 it emerged that Xerox photocopiers and scanners using JBIG2 had been replacing digits with similar-looking ones, for example a 6 with an 8. A DjVu document has been spotted in the wild with character substitutions, such as an n with bleeding serifs turning into a u and an o with a spot inside turning into an e. Whether lossy compression has occurred is not stored in the file. Thus the DjView viewing application cannot warn the user that glyph substitutions might have occurred, either when opening a lossily compressed file or in the Information or Metadata dialogue boxes.
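The shape-sharing idea behind JB2 described above can be illustrated with a toy Python sketch. It is not the JB2 algorithm: it only deduplicates bitmaps that are exactly identical, whereas a real encoder matches nearly identical shapes (which is what makes the lossy substitutions possible) and then entropy-codes the result. The names `Bitmap` and `build_shape_dictionary` are hypothetical.

```python
from typing import Dict, List, Tuple

# A glyph bitmap as rows of '0'/'1' characters (toy representation, an assumption).
Bitmap = Tuple[str, ...]
Placement = Tuple[int, int, int]  # (shape index, x, y)


def build_shape_dictionary(
    glyphs: List[Tuple[Bitmap, Tuple[int, int]]]
) -> Tuple[List[Bitmap], List[Placement]]:
    """Store each distinct bitmap once and record where it appears on the page."""
    shapes: List[Bitmap] = []
    index_of: Dict[Bitmap, int] = {}
    placements: List[Placement] = []
    for bitmap, (x, y) in glyphs:
        if bitmap not in index_of:  # first occurrence: add to the dictionary
            index_of[bitmap] = len(shapes)
            shapes.append(bitmap)
        placements.append((index_of[bitmap], x, y))  # later ones: position only
    return shapes, placements


# Usage: two occurrences of the same "e" bitmap share one dictionary entry.
e = ("0110", "1111", "1000", "0111")
o = ("0110", "1001", "1001", "0110")
page = [(e, (0, 0)), (o, (5, 0)), (e, (10, 0))]
shapes, placements = build_shape_dictionary(page)
print(len(shapes), placements)
```

Replacing the exact-match lookup with a similarity threshold would shrink the dictionary further, at the cost of possibly merging glyphs that only look alike.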


Format licensing

DjVu is an open file format with patents. The file format specification is published, as well as source code for the reference library. The original authors distribute an open-source implementation named "''DjVuLibre''" under the GNU General Public License. The rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T Corporation, LizardTech, ''Celartem'' and ''Cuminas''. Celartem acquired LizardTech and Extensis.


Support

The selection of downloadable DjVu viewers is wider on Linux distributions than it is on Windows or Mac OS. Additionally, the format is rarely supported by proprietary scanning software.

In 2002, the DjVu file format was chosen by the Internet Archive as a format in which its ''Million Book Project'' provides scanned public-domain books online (along with TIFF and PDF). In February 2016, the Internet Archive announced that DjVu would no longer be used for new uploads, citing among other reasons the format's declining use and the difficulty of maintaining their Java applet-based viewer for the format.

Wikimedia Commons, a media repository used by Wikipedia among others, conditionally permits PDF and DjVu media files.


See also

* Comparison of e-book formats
* International Image Interoperability Framework (IIIF)
* JPEG 2000 Compound image file format (JPM)
* Mixed raster content (MRC)


External links


* A collection of DjVu documents (mostly unbundled)
* DjVuLibre site
* The site of DjVu.js Viewer, usable with the current Firefox and Chrome
* pdf2djvu
* Jakub Wilk's tools
* djvu.org (maintained by an anonymous webmaster)
* djvu.com ("DjVu Universe") (Caminova Corporation)
* Cuminas Corporation – Software Downloads
* DjVu decoder/encoder library
* An actual link to (2001) DjVu document