How to Scan a Document | Best Practices for Document Scanning
In an ideal world, we could create a simple scanning requirements document that lays out the exact requirements for every conceivable document and situation with little regard for file size, future compatibility and functionality. Unfortunately, that’s not reality.
There are numerous factors to weigh when planning a scanning project, but it all starts with a single question:
How will you be using these images?
This is probably the most critical consideration. What is the main business driver behind this scanning initiative? Will these images be used to produce true-to-life reproductions, or is the purpose to create readable reproductions of the original record?
When an artist wants to scan a historical picture to enhance the colors and make a large print for a special event, a higher resolution and color mode are the artist's friends. This would result in a large file with a wealth of detail at the pixel level that can be manipulated and enhanced to suit the artist's needs. It would also facilitate the reproduction of a larger-than-life copy of the original. The file size is not a major concern in this case because the artist is only working with a handful of images.
However, if we took those same scanner settings and applied them across a collection of millions of documents and drawings, the result would be several hundred terabytes of data and file sizes so large that employees would spend most of their day waiting for images to load. In reality, whether a single pixel is the exact right shade of green or not is irrelevant to the intended use of the file. In these cases, what is important is the information on the document. Can it be read, understood and acted upon?
The following document will walk you through several considerations and the pros and cons of each. It is designed to help guide you to a better understanding of functions such as color modes, resolution, file formats and more so that once you answer the “How” question, you can plan the rest accordingly.
Glossary of Terms
Bear with us, we are about to get kind of “techy,” but the glossary below will help shed some light on the more technical aspects of this guide.
Bitonal: For the purposes of this document, bitonal simply refers to an image that was scanned and saved using only black and white, much like a photocopier.
Color Mode: Whether the image is captured in black and white, greyscale or full color.
Compression: The act of reducing the file size of a scanned image.
Document Generations: Refers to how many levels a document is away from the original file. For instance, if you were to print an email, then later scan the paper, the scanned image would be the 3rd generation.
Dots Per Inch (DPI): Scanned images are made up of tiny squares (see Pixels below). The number of dots/squares/pixels within one inch of the scanned image is referred to as Dots Per Inch (DPI) or Pixels Per Inch (PPI). In theory, the higher the DPI, the more information is captured during the scanning process.
ISO: The International Organization for Standardization is a global entity that develops and promotes proprietary, industrial and commercial standards.
Multi-Page File: One digital file that contains more than one image within it. For example, one PDF file that has 7 pages.
Pixel: If you zoom in close enough on a photograph or a scanned document you will notice individual squares that make up the text and images. These are called pixels.
Raster: Raster images are ones that were created by capturing an image through a scanner or camera. A raster image is a grid of tiny coordinates covering the entire image, and each one of those coordinates (or pixels) is assigned a color value.
Single Page File: One digital file that only contains a single image within it. For example, your digital camera produces one image file for every photograph you’ve taken.
Overview of Common Digital File Formats
Portable Document Format (PDF)
PDF is a common, flexible file format for storing and sharing images created digitally or using a document scanner. PDF presents documents in a manner independent of the software used to create the image, hardware, and operating systems. As an example, if you created a spreadsheet in Excel and saved it as a PDF, it could be viewed on any computer with a free PDF reader regardless of whether they have Excel installed. Each PDF file includes a complete description of the document, including the text, fonts, graphics, and other information needed to display it.
PDF files can support a variety of compression options, as illustrated in the “File Formats and Commonly Supported Compression” section of this document. PDF supports multiple pages per file, allowing you to contain all the pages related to a single record within a single digital file.
PDF/A is an ISO-standardized version of the typical PDF file format, specialized for use in archiving and long-term preservation. PDF/A prohibits features that were common in standard PDF files, such as font linking and encryption, because they are not well suited for long-term archiving. The ISO requirements for PDF/A viewing applications include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.
PDF/A files can support a variety of compression options, as illustrated in the “File Formats and Commonly Supported Compression” section of this document. PDF/A also supports multiple pages per file, allowing you to contain all the pages related to a single record within a single digital file.
Related Article: Using PDF/A for the Preservation of Your Scanned Documents
Tagged Image File Format (TIFF)
TIFF or TIF are common file formats for storing raster-based images (i.e. scanned documents, photographs, etc.). While TIFF can support multiple color modes, there have been compatibility issues with color and greyscale TIFF files in various viewing applications. Most commonly, TIFF files are used to store bitonal (black & white) scanned documents.
TIFF files can support a variety of compression options as illustrated in the “File Formats and Commonly Supported Compression” section of this document. TIFF also supports multiple pages per file, allowing you to contain all the pages related to a single record within a single digital file.
The JPEG format is typically used with photographs and paintings of realistic scenes with smooth variations of tone and color. JPEG is not well suited for line drawings and other textual or iconic graphics, where sharp contrasts between adjacent pixels are required. By nature, JPEG uses a lossy compression method and should not be used for business documents, drawings, technical documents, etc.
JPEG is also not well suited to files that will undergo multiple edits, as some image quality will usually be lost each time the image is decompressed and recompressed. JPEG does not support multiple pages per file, so each individual page is saved as its own individual file.
Lossy file compression works by discarding all the “unnecessary” bits and pieces of information in the original file to make it smaller when compressed. It can do this in a variety of methods, such as combining similarly colored pixels into a single color to reduce the amount of data stored.
Unlike lossy file compression, lossless compression reduces a file’s size without discarding any of the original data. Historically, lossless compression would result in a larger file size when compared to a file using lossy compression. However, with new compression algorithms, lossless compressed files are now manageable in size and preserve data better than ever.
An uncompressed image is one in which no file level or image level compression has been applied. While common in the digital photography world, uncompressed file formats are typically not used in archiving content from paper, film or microfilm. The resulting file size would create major roadblocks in terms of file storage and sharing of information. For instance, a standard letter size page saved in an uncompressed format can be 20x larger than the same page saved as a bitonal (black & white) image with a lossless compression method. Multiply that over a collection of millions of records and the technical requirements become unwieldy.
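To make the difference between the two approaches concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular scanner works: the pixel values are made up, `zlib` (DEFLATE) stands in for the lossless methods discussed in this guide, and the hypothetical `quantize` helper imitates the "combine similar shades" idea behind lossy compression.

```python
import zlib

# A made-up 8-bit greyscale scanline: near-white paper with a few dark text pixels.
row = bytes([255, 254, 253, 255, 12, 10, 11, 255, 252, 254] * 100)

# Lossless: zlib (DEFLATE) is used here only as a stand-in for LZW or CCITT Group 4.
packed = zlib.compress(row)
restored = zlib.decompress(packed)
lossless_ok = restored == row  # every original byte comes back unchanged


def quantize(data: bytes, step: int = 32) -> bytes:
    """Lossy step (hypothetical helper): merge similar shades into one value."""
    return bytes((value // step) * step for value in data)


quantized = quantize(row)
distinct_before = len(set(row))        # number of different shades in the original
distinct_after = len(set(quantized))   # number of shades left after merging

print("lossless round trip intact:", lossless_ok)
print("compressed size vs original:", len(packed), "vs", len(row))
print("shades before and after lossy step:", distinct_before, distinct_after)
```

The lossless round trip returns exactly the bytes that went in, while the quantized row can never be turned back into the original because the subtle shade differences are gone.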
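As a rough illustration of why uncompressed files become unwieldy, the sketch below estimates the raw pixel data for a letter size page. The function name and the choice of 300DPI are assumptions made for this example, and real files add a small amount of header data on top.

```python
import math


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate raw pixel data size in bytes, padding each row to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


# Letter size page (8.5 x 11 inches) at 300DPI in the three common color modes.
bitonal = uncompressed_bytes(8.5, 11, 300, 1)     # 1,052,700 bytes
greyscale = uncompressed_bytes(8.5, 11, 300, 8)   # 8,415,000 bytes
color = uncompressed_bytes(8.5, 11, 300, 24)      # 25,245,000 bytes

for name, size in [("bitonal", bitonal), ("greyscale", greyscale), ("color", color)]:
    print(f"{name}: {size / 1_000_000:.1f} MB before compression")
```

Even before any compression is applied, the color page carries roughly 24 times the data of the bitonal page, which is why color mode and compression choices matter so much across a large collection.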
File Formats and Commonly Supported Compression
The chart below outlines the compression methods typically used in document scanning applications. While the file formats listed below may support additional compression methods, those are uncommon in scanning applications and long-term archiving.
| Compression Method | Lossy/Lossless | Supported Color Depth | Supported File Types |
|---|---|---|---|
| Uncompressed | N/A | Up to 48-bit colour | TIFF |
| LZW | Lossless | Up to 8-bit greyscale | TIFF, PDF, PDF/A |
| CCITT Group 4 | Lossless | 1-bit bitonal | TIFF, PDF, PDF/A |
| JPEG | Lossy | Up to 24-bit colour | JPEG, PDF, PDF/A |
When assessing a records collection, one of the factors to consider is the generation level of the content. It should come as no surprise that the further removed a copy is from the original document, through photocopying, scanning, reprinting, etc., the more information will be lost or distorted. A common example of this is a paper record or drawing that has been frequently handled over time. The document accumulates dirt around the corners from normal handling and, once photocopied, those muddied areas become black and almost unreadable. The same is true, and often magnified, for legacy media such as microfilm or microfiche when they are duplicated. Duplicate copies often do not have the same density levels as the originals and frequently have scratches that have developed through normal use of the media.
That is not to say all duplicates/photocopies require special handling while scanning. Many copies turn out almost indistinguishable from the original records. If the contrast between documents varies greatly throughout a collection, you may want to consider capturing the content in greyscale mode to pick up more detail.
Typically, documents and drawings are scanned in one of three modes: bitonal (black & white), greyscale (8-bit), or color (24-bit).
Bitonal mode is used on documents and drawings with only minor deviations in contrast throughout the collection. It is well suited for documents where the colors on the page carry no importance (e.g. logos). Bitonal files result in much smaller digital images, which makes them ideal for long-term archiving and active sharing/retrieval across intranets or the internet.
Bitonal is not well suited for poor contrast documents (e.g. faded thermal print pages) or documents where the identification of colors is important (e.g. photographs, drawing mark-ups).
Greyscale is recommended for files that would suffer information loss if scanned in bitonal mode. Examples include poor contrast documents, photographs (where the color is not critical), documents with a wide range of variation, microfilm and microfiche.
Color mode will typically not return more detail than greyscale on poor contrast files and is only recommended when color has a significant relevance to the information on the document. Examples of this may include: drawings with colored markups, photographs, charts/graphs which rely on color, etc.
Color scanning results in a much larger digital image, which can make it difficult to share images due to slow transfer speeds and can create network storage problems. Another option to consider to reduce file size is the use of automatic color detection. This is a process by which documents that contain color are saved as color files and documents that do not are saved as bitonal files.
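The idea behind automatic color detection can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by any specific scanner: the function names, the `tolerance` and `min_fraction` thresholds, and the sample pixel values are all assumptions chosen for the example.

```python
def has_color(pixels, tolerance=12, min_fraction=0.001):
    """Return True when enough pixels differ noticeably between R, G and B.

    pixels is an iterable of (r, g, b) tuples with values from 0 to 255.
    A grey pixel has nearly equal channels, so a large spread suggests color.
    """
    total = 0
    colored = 0
    for r, g, b in pixels:
        total += 1
        if max(r, g, b) - min(r, g, b) > tolerance:
            colored += 1
    return total > 0 and colored / total >= min_fraction


def choose_mode(pixels):
    """Pick the color mode to save a page in, based on its pixel content."""
    return "color" if has_color(pixels) else "bitonal"


# Made-up pages: plain text, slightly yellowed paper, and a page with red markup.
plain_page = [(255, 255, 255)] * 990 + [(10, 10, 10)] * 10
yellowed_page = [(250, 248, 240)] * 1000
marked_up_page = plain_page + [(200, 20, 20)] * 5

print(choose_mode(plain_page))
print(choose_mode(yellowed_page))
print(choose_mode(marked_up_page))
```

The tolerance keeps slightly tinted or aged paper from being treated as color, while the minimum fraction keeps a few stray noisy pixels from forcing a large color file.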
Resolution, which is typically measured in Dots Per Inch (DPI), refers to the number of pixels captured per inch during scanning. For instance, a document scanned at 300DPI would have 300 pixels per inch. In theory, the greater the number of pixels, the more information is captured. We say “in theory” because scanning at a higher resolution cannot capture data that was never there to begin with. As an example, most laser printers print documents at 300DPI; in that case, scanning at 400DPI would not capture any more content than scanning at 300DPI. This is also true for legacy media like microfilm and microfiche, which have a realistic resolution of approximately 300DPI.
Another consideration to take into account with resolution is file size. As you increase the resolution, you increase the size of the digital image, which can lead to issues sharing and storing information.
When scanning large collections, 300DPI is more than sufficient for most applications. Adjusting the bit-depth (outlined above) is preferable to raising the resolution. A document scanned at 300DPI in greyscale mode will capture more data than the same document scanned at 400DPI in bitonal mode.
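The comparison between bit-depth and resolution can be checked with simple arithmetic. The short Python sketch below counts the raw bits captured per square inch; the function name is our own, and raw bit count is only a rough proxy for how much useful detail is recorded.

```python
def bits_per_square_inch(dpi, bit_depth):
    """Raw bits captured for one square inch of the page."""
    return dpi * dpi * bit_depth


greyscale_300 = bits_per_square_inch(300, 8)   # 720,000 bits
bitonal_400 = bits_per_square_inch(400, 1)     # 160,000 bits

print("300DPI greyscale:", greyscale_300)
print("400DPI bitonal:", bitonal_400)
print("ratio:", greyscale_300 / bitonal_400)
```

Under these assumptions, the 300DPI greyscale scan records 4.5 times as much raw data per square inch as the 400DPI bitonal scan.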
Related Article: What is the Best DPI for Scanning Documents?
The sample below shows a section from the same page, scanned at both 300DPI and 600DPI in greyscale mode. As you’ll see, there is virtually no noticeable difference between the two images because the original material was not generated at a resolution over 300DPI.
300DPI: 42MB file size
600DPI: 129MB file size
Resolution for Microfilm, Fiche and Slides
Microfilm formats and photographic slides have something in common: they have all been reduced in size from what they originally captured. For example, a microfilm image of a letter size page captured at a 24x reduction ratio is 24x smaller than the original letter size page. This means the dimensions of the page on film are no longer 8.5” x 11” but rather 9mm x 11.6mm.
So how does all this play into scanning resolution? When looking at scanning resolution for reduced size images, you need to evaluate what the DPI will be once the image is enlarged back to the original size. When a film scanner claims a scanning resolution of 2700DPI, it is not scaling the image back up to its original size. In the case of that letter size page on microfilm, the resulting image would still be 9mm x 11.6mm at 2700DPI. When scaled back to its original size, that image would be 8.5” x 11” at approximately 110DPI; at 4000DPI, the scaled resolution would only be approximately 165DPI.
When evaluating DPI requirements for reduced size media, it is important to state the desired resolution at the desired image size, not just the scanning resolution.
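The relationship described above is a single division: the resolution at original size is the scanner resolution divided by the reduction ratio. The following Python sketch, with function names of our own choosing, reproduces the figures for a 24x reduction and works backwards to the scanner resolution needed for a given target.

```python
def effective_dpi(scan_dpi, reduction_ratio):
    """Resolution of the image once it is scaled back to original size."""
    return scan_dpi / reduction_ratio


def required_scan_dpi(target_dpi, reduction_ratio):
    """Scanner resolution needed to reach target_dpi at original size."""
    return target_dpi * reduction_ratio


print(effective_dpi(2700, 24))      # 112.5, in line with the rough 110DPI figure
print(effective_dpi(4000, 24))      # about 166.7, in line with the rough 165DPI figure
print(required_scan_dpi(300, 24))   # 7200
```

Stated this way, a requirement such as "300DPI at original size" for 24x microfilm translates into a film scanning resolution of 7200DPI, which is why the desired size must be part of the specification.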
Optical Character Recognition (OCR)
OCR is the function that turns a scanned image into a fully searchable record. OCR does not replace a proper indexing/filing system; it only serves to enhance it. For example, you may have a legal file that contains over 200 pages. Flipping through the scanned images one page at a time trying to find the name “Thompson” would be extremely time-consuming, but typing the name “Thompson” into the search box and being instantly shown all the instances of the name within the document is a huge benefit. The results of OCR’ing are only as good as the original material. It’s not a perfect technology, and as such, should not be relied upon as your only method of finding scanned documents. By nature, some documents are not well suited for OCR. These can include:
- Drawings and maps
- Handwritten files
- Poor/low contrast documents
- Some microfilm and microfiche
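The “Thompson” search described above, and the reason OCR should not be your only retrieval method, can be illustrated with a small Python sketch. It assumes the OCR text for each page is already available as a list of strings; the function name and the sample page text are made up for this example, and no OCR engine is involved.

```python
def find_pages(ocr_pages, term):
    """Return 1-based page numbers whose OCR text contains term, ignoring case."""
    needle = term.casefold()
    return [
        number
        for number, text in enumerate(ocr_pages, start=1)
        if needle in text.casefold()
    ]


# Made-up OCR output for a four-page file. Page 4 contains a misread
# character ("0" instead of "o"), as can happen with poor originals.
pages = [
    "Re: Thompson v. Smith",
    "Exhibit A",
    "Signed by J. THOMPSON",
    "Statement of Mr. Th0mpson",
]

print(find_pages(pages, "Thompson"))
```

The search finds pages 1 and 3 instantly but misses page 4, where the name was misread. A proper indexing/filing system catches what the OCR text cannot.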
Related Articles:
- Should I Use PDF OCR on My Scanned Documents?
- How Does OCR Software Work?
- What Are Some Uses for OCR Software in Businesses
As you can see, there is no hard and fast rule for “the best scanner settings.” There are numerous variables to consider. However, in our experience, the clear majority of documents, drawings and photographs being scanned for their informational value are suitable to be captured as follows:
- Saved as PDF/A documents
- With the color mode adjusted based on the project needs.
5 Key Take Away Items
- Compression is not a bad thing! Lossless compression will not impact the quality of your scanned files and will result in a much smaller and much more useful file.
- Choice of color mode should be made based on the quality of the document and the importance of color.
- Higher resolution doesn’t always equate to a better image.
- File format plays a key role in the long-term archiving of your records.
- Storage space may be relatively inexpensive but your time is not. The larger the file size, the more time will be wasted searching through these records.