Data Validation Details
On this page, we take an in-depth look at the process for validating general image files that are not DNG and are not stored on write-once media.
The data validation protocol for general image files should be thought of as a set of steps. Some basic validation should be done regularly. For local online drives, it's helpful to perform parts of this process whenever time permits. Offsite drives should be validated at least once a year, and more often if practical, perhaps four times a year.
Of course, any time you suspect something might be wrong, it's advisable to check the drives out thoroughly. Don't let warning signs go by without investigating.
We want to start with a basic check on the general health of the data structure on the drive. Before you do this, you are advised to make sure you have good backups of any data on the drive. Sometimes disk utilities can uncover a problem that is hiding just below the surface, and in the process of repairing the problem, some files may further deteriorate. Of course, if you are following our best practices recommendations, you already have good backups.
On Mac, DiskWarrior is an excellent choice for basic maintenance of your drive. Run it periodically, or whenever you suspect there is a problem with a drive.
On PC, the Vista Disk Management tools do a good job of basic disk maintenance, as shown in Figure 1.
|Figure 1 shows the Vista Disk tools menu. Run error-checking and defragmentation periodically.|
You'll also want to check on the condition of the storage media. There are trillions of individual bits of storage on the drive, and each one should retain the capability of being read or written to. When you run a media check, the software will read and write to each bit of the drive, making sure it works as it should. It's normal to occasionally find sectors that aren't functioning properly, and the drive takes care of fencing them off so they don't get used in the future.
If additional bad sectors are found with each scan, the drive is likely on its way out and should be retired at the earliest possible opportunity.
Running a media check is a great way to take a backup drive out for some exercise – a process that should ideally be done four times a year. Unlike the volume and directory scan, the media scan touches every part of the drive and provides more thorough integrity information. Scanning the media will also do some invisible housekeeping, and record the results in the drive's onboard SMART firmware, described below.
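The rule of thumb above (retire a drive whose bad-sector count rises with every scan) can be sketched in a few lines of Python. This is a minimal illustration, not part of any disk utility: the function name `bad_sectors_growing`, the `min_scans` parameter, and the idea of keeping your own list of per-scan counts are all assumptions made for the example.

```python
from typing import Sequence


def bad_sectors_growing(counts: Sequence[int], min_scans: int = 3) -> bool:
    """Return True if the bad-sector count rose on each of the most recent scans.

    `counts` is a hypothetical record you keep yourself: one bad-sector
    total per media scan, oldest first.
    """
    if len(counts) < min_scans:
        # Not enough history to call it a trend.
        return False
    recent = counts[-min_scans:]
    # Every scan in the window must report more bad sectors than the one before.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    history = [0, 2, 5, 9]  # made-up numbers for illustration
    if bad_sectors_growing(history):
        print("Bad sectors are increasing with each scan; plan to retire this drive.")
```

A single bad sector that stays put is normal, so the check only fires when the count keeps climbing across consecutive scans.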
Computer and drive manufacturers have collaborated on a number of diagnostic tools that might be able to tell you that a drive is having trouble and is about to fail. The Self-Monitoring, Analysis, and Reporting Technology (SMART) status keeps track of some important statistics and can alert you to problems with the drive. It’s nowhere near foolproof, but certain factors are highly likely to indicate a dying drive. There are two levels of SMART reporting: pass/fail and raw value reporting. Your system software can tell you the pass/fail status, but you need add-on utilities if you want to check the raw values.
|Figure 2 shows the SMART status as reported by the operating system. It is listed as verified, since the pass/fail flag on the drive's logic board reported it as passing.|
The initial SMART implementations were very cautious and used a low threshold to determine problems with drives. That caused a lot of consumers to return drives, and manufacturers have since raised the threshold for what constitutes failing, which means that now problems can be missed. Figure 2 shows a report from Apple’s disk utility, indicating a drive has passed the SMART check. A second utility, SMART Reporter, looked at the same drive and saw a failing unit, as shown in Figure 3. The manufacturer recommended sending the drive in for replacement.
|Figure 3 The same drive as shown in Figure 2, when examined with SMART Reporter software, showed an excessive Reallocated Sector Count for a new drive. The manufacturer replaced it with no questions asked. Reallocated sector count (or almost any error with “sector” in the name) is one of the most important indicators of impending failure, as are nonzero read error rates and spin retry rates.|
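If your SMART utility lets you read or export the raw values, the indicators named in Figure 3 can be screened with a short script. The sketch below assumes you have already copied the raw values into a Python dictionary; the attribute labels shown are illustrative and vary between utilities, and `smart_warnings` is a hypothetical helper, not a function of any SMART tool.

```python
from typing import Dict, List

# Attributes the text singles out besides the "sector" family.
# These labels are assumptions; use whatever names your utility reports.
WATCHED_NONZERO = ("Raw_Read_Error_Rate", "Spin_Retry_Count")


def smart_warnings(raw_values: Dict[str, int]) -> List[str]:
    """List attributes whose nonzero raw value deserves a closer look."""
    warnings = []
    for name, value in sorted(raw_values.items()):
        if value <= 0:
            continue
        # Flag anything with "sector" in the name, plus the watched attributes.
        if "sector" in name.lower() or name in WATCHED_NONZERO:
            warnings.append(f"{name} = {value}")
    return warnings


if __name__ == "__main__":
    sample = {  # made-up readings for illustration
        "Reallocated_Sector_Ct": 12,
        "Spin_Retry_Count": 0,
        "Power_On_Hours": 4000,
    }
    for line in smart_warnings(sample):
        print("Check this drive:", line)
```

A flagged value is a prompt to investigate and to confirm your backups, not proof of failure; some drives report certain raw values in vendor-specific ways.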
If you’re on PC, you can get a similar benefit from SMART. The built-in utilities will do some monitoring, but the threshold for passing is pretty low, so a passing result does not prove anything. If, however, your OS alerts you to a SMART failure, you should take care of that drive immediately. You can add more thorough SMART reporting by using third-party utilities like Active Hard Disk Monitor from LSoft.
SMART reporting is generally available only with PATA and SATA drives directly connected to the computer’s motherboard (or connected through some eSATA ports). Connections made through FireWire or USB (Universal Serial Bus) don’t report the SMART status properly. If you like the idea of collecting this additional SMART information, your best option is to configure your external drives with eSATA connections. Check with the maker of your eSATA expansion card to see if it passes SMART data through.
As we move from the easiest to the hardest tasks in data validation, the next most automated procedure we can run is a check of the file structure of proprietary raw files. Adobe's free DNG Converter can act as data validation software for those raw file types it knows how to parse. (This is a pretty comprehensive list if you use the latest version.) When you send a set of raw files to the DNG Converter, it keeps a log of the conversion process. If it encounters an error parsing a file, it logs that error.
You can send thousands of raw images to the DNG converter with a single action. Even if you don't choose to use DNG in your collection, this can be a valuable indicator of the health of your raw files. You can check the log at the end of the process to make sure that DNG Converter was able to open all files, then erase the DNGs after conversion.
|Figure 4 This video shows how you can use the DNG Converter to check the file integrity of proprietary raw files.|
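As a cross-check on the conversion log, you can also confirm that every raw file actually produced a DNG before you erase the converted copies. The sketch below assumes the converter wrote each DNG with the same base name as its source file into a single output folder; the function name `unconverted_raws` and the list of raw extensions are placeholders for this example, so adjust them to your own cameras.

```python
from pathlib import Path
from typing import List, Tuple


def unconverted_raws(
    raw_dir,
    dng_dir,
    raw_extensions: Tuple[str, ...] = (".cr2", ".nef", ".arw"),  # assumed list
) -> List[Path]:
    """Return raw files in raw_dir that have no same-named .dng in dng_dir."""
    # Base names of every DNG in the output folder, compared case-insensitively.
    dng_stems = {
        p.stem.lower()
        for p in Path(dng_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".dng"
    }
    return [
        p
        for p in sorted(Path(raw_dir).iterdir())
        if p.is_file()
        and p.suffix.lower() in raw_extensions
        and p.stem.lower() not in dng_stems
    ]


if __name__ == "__main__":
    # Hypothetical folder names; substitute your own.
    for path in unconverted_raws("raw_archive", "dng_check"):
        print("No DNG produced for:", path.name)
```

Any file this reports is worth looking up in the conversion log. Note that two raw files sharing a base name (for example from different cameras) would mask each other in this simple comparison.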
Keep in mind that just because a file can be opened and converted by the DNG Converter does not mean that the image data inside the file is problem-free. It's possible that the file structure is intact, but the image data is corrupted. For full validation of the image data, you'll want to save raw files as DNGs.
At this point, we've gone through the tools that can be used to validate an archive with a good level of automation. From here on, the only way to be more certain is to do a visual inspection of the files. And, yes, that can take some time. It's also harder to accomplish than you might expect.
The gold standard of visual file validation is to open the file in Photoshop and look at the image and any layers that are contained in the file. After doing this for a handful of files, you'll probably be looking for a better method. Here are some tips.
|Figure 5 This video shows a visual inspection of a group of images in Adobe Bridge.|
Parse first, then inspect
Instead of opening each file one by one, you can send the images to an imaging program that will build thumbnails or previews of the files. Once that's done, you can skim through the files checking for something unexpected such as strange lines or incomplete jumbled images. It's possible to do this even for many thousands of files without taking too much time, but it's important to know what you're looking at.
What am I looking at?
There are two basic questions that you need to answer to know if your visual inspection is actually telling you something valuable. The first is whether you are looking at "fresh" data, and the second is how the preview or thumbnail is created.
Adobe Bridge, for instance, keeps several caches of the image data: a central cache, a cache in the folder with the image, and one that Camera Raw uses for quick display of raw or DNG files. If Bridge can find the file in the cache, it won't spend the time parsing the file, and will show you cached data instead. While this helps speed up your general workflow, it makes the visual inspection nearly worthless for images that are already in the Bridge cache. If you want to get a true look at the current file, you need to purge all three caches each time you parse files. The video embedded below shows how this works.
|Figure 6 This video shows how to work with the cache in Bridge when you conduct a visual inspection of image integrity.|
If you are going to rely on a visual inspection to determine file integrity, you'll also want to know how any thumbnail or preview is being generated. Catalog programs that work with raw files may be showing you embedded previews instead of parsing the raw files when they make previews. A preview may be uncorrupted while the raw data in the file is corrupted. You'll want to make sure you understand how the preview is created.
Additionally, many programs have a preference to skip rendering for very large files in the interest of speeding things up. A large TIFF that is shown by such a program may once again show you only an embedded preview, rather than an image parsed from the underlying data.
|Figure 7 Some programs will allow you to set a limit on what size images they will parse. This can speed up display of a thumbnail of the image, but also means the program is not showing you the image from the file, only the embedded thumbnail.|
Is it showing everything?
Finally, if you are going to submit a large number of images to a program for pre-parsing, you'll want to know that it is showing you everything, and is not simply skipping files it can't parse. Lightroom, for instance, will show you a warning if the files you try to import are not parsable. If you really want to know that everything is being shown, the best way is to use a catalog program that gives easy image counts that you can compare against a master record of what is supposed to be in the catalog. If the numbers don't match, you'll need to investigate why.
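The count comparison described above can be taken one step further by comparing names as well as totals, so that a mismatch immediately tells you which files to investigate. This is a minimal sketch: `compare_to_master` is a hypothetical helper, and it assumes you can export a list of file names from both your master record and your catalog program.

```python
from typing import Dict, Iterable, List, Union


def compare_to_master(
    master_names: Iterable[str], shown_names: Iterable[str]
) -> Dict[str, Union[int, List[str]]]:
    """Compare the files a program shows against the master record."""
    master = set(master_names)
    shown = set(shown_names)
    return {
        "expected": len(master),
        "shown": len(shown),
        # In the master record but not displayed: possibly skipped as unparsable.
        "missing": sorted(master - shown),
        # Displayed but not in the master record: the record may be out of date.
        "unexpected": sorted(shown - master),
    }


if __name__ == "__main__":
    # Made-up file names for illustration.
    master_list = ["img_001.tif", "img_002.tif", "img_003.tif"]
    catalog_list = ["img_001.tif", "img_003.tif"]
    report = compare_to_master(master_list, catalog_list)
    print(f"{report['shown']} of {report['expected']} files shown")
    print("Missing:", report["missing"])
```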
Let's take a look at a tool we also show on the validation page for optical media, ImageVerifier. To get the full protection out of ImageVerifier, you'll need to use it with write-once media, but it can also be of help with files that are still in progress. The program has some internal code that can check the validity of the file structure of JPEG files in particular, as well as TIFF and PSD files to a lesser extent. ImageVerifier does not run perfectly on these file types, but can help if you have a large archive you want to look through. Here's a video showing the program in action.
|Figure 8 This video shows ImageVerifier in action, checking an archive of raw, DNG, JPEG, TIFF and PSD files.|
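To give a feel for what a file-structure check involves, here is a deliberately simple example for JPEG files. It is not how ImageVerifier works, and it is far less thorough: it only confirms that a file begins with the JPEG start-of-image marker (bytes FF D8) and ends with the end-of-image marker (bytes FF D9), which catches truncated files but not damage in the middle. The function name `jpeg_markers_intact` is made up for this sketch.

```python
import os


def jpeg_markers_intact(path) -> bool:
    """Return True if the file starts with FF D8 and ends with FF D9."""
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, os.SEEK_END)
        if f.tell() < 4:
            # Too short to hold both markers.
            return False
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == b"\xff\xd8" and tail == b"\xff\xd9"


if __name__ == "__main__":
    import sys

    # Usage: python check_jpeg.py photo1.jpg photo2.jpg ...
    for name in sys.argv[1:]:
        status = "ok" if jpeg_markers_intact(name) else "SUSPECT"
        print(f"{status}: {name}")
```

Some programs append extra data after the end-of-image marker, so a file flagged here is a candidate for visual inspection rather than a confirmed loss.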