DjVu

File extension: .djvu, .djv
MIME type: image/vnd.djvu
Type code: DJVU
Developed by: AT&T Research
Type of format: Image file formats

DjVu (pronounced déjà vu) is a computer file format designed primarily to store scanned images, especially those containing text and line drawings. It features advanced technologies such as separation of the image into text and background/image layers, progressive loading, arithmetic coding, and lossy compression for bitonal images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

Progressive loading makes the format well suited to images served over the Internet. DjVu has been promoted as an alternative to PDF and outperforms PDF on most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 KB, black-and-white technical papers compress to 15–40 KB, and ancient manuscripts compress to around 100 KB; all of these are significantly smaller than the typical 500 KB required for a satisfactory JPEG image. These small file sizes have led to its widespread use in distributing math books on file sharing networks. Like PDF, DjVu can contain an OCRed text layer, making it easy to perform cut and paste operations.

The DjVu technology was originally developed by Yann Le Cun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Laboratories in 1996. DjVu is a free file format: both the file format specification and the source code of the reference library are published. The ownership rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T and LizardTech. The original authors maintain a GPL-licensed implementation named "DjVuLibre".

DjVu divides a single image into many different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100dpi); the mask image is a high-resolution bilevel image (e.g., 300dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2. The JB2 encoding method identifies nearly-identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page it occurs.
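The shape-reuse idea behind JB2 can be illustrated with a minimal sketch. It is not the JB2 algorithm itself: it assumes the glyphs have already been segmented, matches shapes only when they are exactly identical (JB2 also matches nearly identical shapes), and skips the compression of the bitmaps and positions. The names `encode_shapes` and `decode_shapes` and the tiny `E` and `T` bitmaps are hypothetical, chosen only for this illustration.

```python
from typing import Dict, List, Tuple

# A glyph bitmap: one string per row, '#' = ink, '.' = paper.
Bitmap = Tuple[str, ...]


def encode_shapes(glyphs: List[Tuple[Bitmap, int, int]]):
    """Store each distinct bitmap once and record (shape_id, x, y) per occurrence.

    Assumption: shapes are merged only on exact equality; a real encoder
    would also merge shapes that differ by a few pixels.
    """
    shape_ids: Dict[Bitmap, int] = {}
    dictionary: List[Bitmap] = []
    placements: List[Tuple[int, int, int]] = []
    for bitmap, x, y in glyphs:
        if bitmap not in shape_ids:
            shape_ids[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((shape_ids[bitmap], x, y))
    return dictionary, placements


def decode_shapes(dictionary, placements, width, height):
    """Rebuild the bilevel page by stamping each shape at its recorded position."""
    page = [["."] * width for _ in range(height)]
    for shape_id, x, y in placements:
        for dy, row in enumerate(dictionary[shape_id]):
            for dx, pixel in enumerate(row):
                if pixel == "#":
                    page[y + dy][x + dx] = "#"
    return ["".join(row) for row in page]


# Hypothetical 3x3 glyphs standing in for the letters "e" and "t".
E = ("###", "#..", "###")
T = ("###", ".#.", ".#.")
glyphs = [(E, 0, 0), (T, 4, 0), (E, 8, 0)]

dictionary, placements = encode_shapes(glyphs)
page = decode_shapes(dictionary, placements, width=11, height=3)
```

Here the page contains three glyphs but the dictionary holds only two bitmaps; the second "e" costs only one extra placement record.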

In 2002, the DjVu file format was chosen by the Internet Archive as the format in which its Million Book Project provides scanned public domain books online (along with TIFF and PDF).

The DjVu format will be used by the One Laptop per Child project to supply existing paper books in an eBook format. The advantages of DjVu are that it is highly compressed and does not require any font support. [1]

Comparison of the DjVu and PDF file formats

The primary difference between DjVu and PDF is that DjVu is a raster format, whereas PDF is primarily a vector format. This difference has several consequences:

  • The maximal resolution of a DjVu file must be specified at creation time. On the other hand, a vector image represented by a PDF file can usually be magnified at arbitrary resolution without loss of quality.
  • DjVu files render characters as images, without using fonts. PDF files usually render characters using fonts. Many PDF files do not embed the full representation of the necessary fonts, but simply specify their names and properties. The PDF viewer uses the exact same font if it is available. Otherwise it transforms an available font to compute an approximation of the desired font.
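The resolution limit of a raster format can be made concrete with a little arithmetic. The sketch below assumes a letter-size page (8.5 × 11 inches) scanned at 300 dpi; the page size and the helper names `pixel_size` and `effective_dpi` are assumptions made for this illustration, not part of either format.

```python
def pixel_size(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page scanned at a fixed resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(scan_dpi: float, zoom: float) -> float:
    """Resolution left when a raster page is magnified by `zoom`.

    The pixel count is fixed at creation time, so magnifying the page
    spreads the same pixels over a larger area.
    """
    return scan_dpi / zoom


# Assumed example: a letter-size page scanned at 300 dpi.
width_px, height_px = pixel_size(8.5, 11, 300)

# Magnified four times, the same scan offers only a quarter of the resolution.
zoomed_dpi = effective_dpi(300, 4)
```

Vector content in a PDF has no such fixed pixel grid, which is why it can be magnified without becoming jagged.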

The PDF format defines various means to store and render raster images. This capability is often used for representing scanned documents. Such PDF files suffer from the same fundamental limitations as raster formats. The size of these files depends heavily on the underlying compression scheme. Some PDF compression schemes can approach the performance of DjVu. In principle, DjVu compression could be adapted to represent raster images in PDF files. However, there is no momentum for creating such a combination because it pleases neither the proponents of DjVu nor those of PDF.

Both the DjVu and PDF formats define features that do not address the representation of the document appearance but aim at creating a document delivery platform. Both DjVu files and PDF files can be enriched with text, table of contents, hyperlinks, and metadata. The PDF format goes further by allowing sounds, interactive forms, and JavaScript programs. The DjVu format defines a protocol to transfer document pages on demand over the Internet. On the other hand, the DjVu format does not specify a way to certify the authenticity of a document or to define Digital Rights Management policies.


When to select what format (DjVu or PDF)

With PDF documents one can zoom in on vector-based content to an arbitrary depth or print them at an arbitrarily high resolution without introducing quality loss or jaggedness inherent to raster formats. On the other hand, if a PDF is simply used as a container for non-vector images (such as scans) those images will not gain anything. Another thing to keep in mind is that one can always convert a vector format into a raster format, usually with irrevocable data loss, but the other direction is very difficult.

PDF is most useful when the original source is an electronic document such as a Microsoft Word doc or TeX file. Such documents benefit most from the vector graphics technology that underlies PDF. DjVu files can be marginally smaller but only deliver a high resolution image, possibly enriched with the associated text.

DjVu works well for image files and has been optimized especially for scanned text and images. If one has a set of scanned pages from a book or article, DjVu is superior to PDF. However, PDF could be better if the scanned raster images can be transformed into high-quality vector graphics, for instance by applying Optical Character Recognition to the scanned image, identifying the fonts, and carefully proofreading the resulting file. This procedure is often undesirable or prohibitive in time or cost. Suitable fonts might not be available. One may want to preserve the original document exactly, including signatures, marginal comments, or other markings. Or perhaps one wants to reproduce the original handwriting or properties of the paper. For example, one might scan old scriptures and enter the accompanying text by hand. In such cases, DjVu is the better choice.

what is world wizzy?
  • World Wizzy is a static snapshot taken of Wikipedia in early 2007. It cannot be edited and is online for historic & educational purposes only.