Archive

We use the term Archive for any files that have been put in their “permanent” homes and safely backed up in a secure, long-term way. By creating storage and workflow around the use of a dedicated Archive, you can increase security and reduce costs. This page outlines the features of an Archive and how to create a good one.

Introduction
Why the quotes on “permanent”?
The elements of an Archive

Catalog
Primary storage

Backup considerations
Directory structure
Should originals and derivatives be archived together?
Data validation
Migration
What about updating my Archive files?
Format obsolescence

Introduction

In creating a workflow, it’s very important to understand where the work is actually flowing to. If you don’t have a “permanent” home for the images at the end of your normal workflow, there is no good way to separate your current Working files from ones that are ready to be put away in a secure long-term way. And as we saw in the previous section, protecting Working files is both complicated and expensive (at least compared to protecting read-only Archive files).

The concept of an image file Archive does not mean that the image has been shuffled off to obscurity. Rather, it is about the nature of file storage for the large amount of data that is essentially read-only data. Whether they are JPEG, proprietary raw, DNG or motion imaging files, your camera original files are much safer and less costly to store when they are treated this way.

A properly designed Archive can give you excellent access to the images while also protecting them well.

Cheaper and easier to store

Because Archive files are unchanging, it’s a lot cheaper and easier to store them and back them up. The primary storage of your archive can be slower, less-expensive hard drives than the storage for Working files. And it’s easier to back them up by copying to an additional drive and moving it off-site. You can also make use of write-once media like DVD or Blu-ray disc for additional backup protection.

With Working files, we saw that we needed four copies of the files to get a high level of protection:

  • The primary copy
  • An on-line automatic backup of the files
  • An off-line manual backup of the files
  • A swapper of the off-line backup, to add off-site backup

Not only do we need the extra copy, we also have to do the swapping. That is not so bad if everything fits on one 2.5-inch pocket drive. But it is a lot harder to carry a five-drive RAID array to and from the studio each week.

Easier to validate

Archived files are also significantly easier to validate - to know for sure that the files have not changed in any way since they were put away. The simple and longstanding practice of checksumming archived files can tell whether anything has changed. This can let you easily confirm that nothing unwanted has happened to the files.

With Working files (except for DNG), it’s much harder to check on the integrity of the file. Working files, by their very nature, are in flux. So it’s very difficult for automatic data validation tools to know the difference between a desired change and an undesired one. Gathering all the unchanging files into an Archive lets you automate these data validation processes.

Why the quotes on “permanent”?

When you use the term permanent with respect to digital storage, there is always a caveat. Permanent means “permanent until the next migration”. That’s because there really is no such thing as permanent digital storage. Hard drives will wear out or become obsolete. Optical discs will also eventually fail or become obsolete. In fact, migration of storage devices should probably happen every few years, and almost certainly at intervals of no more than five years.

The file naming and folder structure of your storage are a different story. This is called the logical structure of the archive, and if it’s designed well, it should be able to last a long time before any migration.

You may want to migrate the file format, as new capabilities are added to file formats made specifically for media archiving.

The elements of an Archive


At its most basic level, the Archive is simply the “permanent” storage device for your files - the place the images flow to. Let’s outline the components of an Archive, and how they work together.

Catalog - We strongly suggest you have a comprehensive catalog of all the media files in your archive.

Primary storage - This is the storage device that holds the primary copy of the archived files.

Backup storage - Backups are the redundant copies of the primary data that are used to restore the archive in the event of loss of the primary copy of the data.

Directory structure - This is the folder and file structure of the Archive.

Data validation - As part of the preservation of your Archive, we suggest that you perform periodic validation to check for the completeness and file integrity of the Archive.

Migration - No digital storage solution should be considered permanent, so eventually you’ll want to migrate the storage medium, and possibly the file format or directory structure of the Archive.

Catalog

In order to know what’s in the archive, you should use some sort of catalog software to track the contents. For small collections, this might be a single Lightroom or Aperture catalog that includes both Working files and Archive files. As your collection grows, you might end up with a catalog for the Archive and a separate catalog for Working files. You might also want to use a dedicated catalog software application for the Archive.

Make the most of your media

Having a catalog allows you to find, organize and make the most of your media files with the least amount of effort. Over time, the number of files you’ll archive will continue to grow, and it can be increasingly hard to find them by looking in folders. Media that is organized with a catalog is not only easy to find, but it’s easy to group together in new and valuable ways. This can include finding all portrait images for use on a portfolio, or pulling images together from a multi-year project into a book or website.

Essential for restoration

A catalog is an essential tool for the restoration of your Archive, following some sort of failure of the primary storage. Without a comprehensive catalog, it’s impossible to know if the archive has been completely restored, or if there is important work left to be done.

Primary storage


The primary storage for your Archive should be a unified and well-organized home for the images, with appropriate speed, connectivity and capacity. It should also be no more complex than necessary. Ideally, your media archive should be “live and local” meaning that the drives are connected to your computer either directly or through some sort of network.

If all your images can fit onto one hard drive, then a simple single drive may be the best storage device for your files. As you add more images to your Archive, you might need to move to multiple drives, but it’s also possible that you may simply migrate to a new and larger single drive. You might also want to consider a drive-spanning storage solution like a Drobo or RAID.

The speed of the primary storage for Archive files is generally less important than the speed of Working file storage, since you’ll be accessing the Archive files less often and in smaller numbers.

Backup considerations

We advocate a 3-2-1 backup strategy for the Archive. There should be a minimum of three copies of each file, stored on two different media types, and one copy should be stored off-site. Optical disc or digital tape could be used for your second medium. For some photographers, a second medium is too difficult to implement, usually due to very large shooting volume. For these photographers, we suggest the third copy be a hard drive copy that is treated like write-once media. We’ve outlined a strategy for this in the Backup section.
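The 3-2-1 rule can be written down as a simple check. The following is a minimal Python sketch, not a tool we prescribe: the `ArchiveCopy` record and the `meets_3_2_1` function are hypothetical names invented for this illustration, and the media labels are whatever you choose to call your devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveCopy:
    label: str      # hypothetical name for this copy, e.g. "studio drive"
    medium: str     # e.g. "hard drive", "optical disc", "tape"
    offsite: bool   # True if the copy is stored away from the primary location


def meets_3_2_1(copies):
    """Return True for 3+ copies, on 2+ media types, with 1+ copy off-site."""
    media_types = {c.medium for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite
```

Note that this strict check would reject the all-hard-drive variation described above for very high shooting volumes, since it insists on two media types. If you use that variation, you would relax the media-type condition yourself.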

The directory structure of both the primary and backup copies of the Archive should be easy to correlate. This makes it easy to know that all files have been backed up properly, and it also makes restoration of the Archive much easier to accomplish.

Synchronized backup vs. additive backup

There are a couple of different strategies for copying the files from the primary storage to the backup storage. The simplest method is to use a synchronizing program that copies everything from the primary to the backup. Any time new files are added to the Archive, they can be backed up. While this is an easy way to create a backup, it has a very important vulnerability.

Synchronized backup

You want a backup because something bad may happen to the primary copy of the data. If the bad thing is a theft, or the total malfunction of the drive, then the synchronized backup does a good job of protection. But some of the hazards you are protecting against can destroy a synchronized backup. Virus infection, accidental deletion, volume errors or invisible file corruption are all very real hazards to your Archive. And a simple synchronized backup will often transfer these problems to your backup - overwriting a good backup file with an unwanted change as shown in figure x.

Figure x With a synchronized backup, any changes that are made to the primary copy of the file - good or bad - are passed along to the backup. This makes your backup much less safe.

With Working files, synchronized backups are generally unavoidable. By definition, the files are in a state of change and you want to preserve the changes. But with files that have been put into the Archive, a synchronized backup may not be necessary, depending on how you work.

Additive backup

An additive backup describes a backup scenario where files are added to the backup storage whenever they are added to the primary storage, but they are not synchronized thereafter. This arrangement can do a better job of protecting against problems such as virus or file corruption that might invisibly affect your primary storage. If you use a directory structure as outlined below, then using an additive backup system is pretty easy to implement. 

Figure x With an additive backup, files are placed in the archive and then copied to the backup, but they are generally not synchronized with any updates that are made to the primary copy of the files.
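To make the idea of an additive backup concrete, here is a minimal Python sketch. It assumes the primary and backup are ordinary folders with the same layout; the function name `additive_backup` is hypothetical and stands in for whatever backup tool you actually use. The key behavior is that a file already present in the backup is never overwritten or deleted, so a later change to the primary copy (wanted or unwanted) is not passed along.

```python
import shutil
from pathlib import Path


def additive_backup(primary, backup):
    """Copy files missing from the backup; never overwrite or delete."""
    primary, backup = Path(primary), Path(backup)
    copied = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = backup / rel
        if dst.exists():
            continue  # already backed up: leave the backup copy untouched
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        copied.append(rel.as_posix())
    return copied
```

Running it a second time copies only the folders and files added since the last run. A sketch like this does not verify the copies; checksumming, discussed in the Data validation section, covers that.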

Directory structure

The directory or folder structure of your Archive should be simple and easy to grow. It should be easy to tell which folders have been backed up, and it should be easy to reconstruct the archive in the event the primary storage device fails.

We suggest some sort of folder structure that creates a natural sequence as the images are sent to the Archive. This allows for the easy implementation of an additive backup system, where files can be copied from the Primary to the Backup in groups as new files are added. It’s also helpful for those using synchronized backup, since it creates a natural chronological structure. Here are some examples of a sequential archive directory structure.

This might be a date-based system, where the folders follow a year/month/project structure.

Your directory could also make use of a simple sequence number in the folder naming that can provide a chronological sequencing.

If you are using Optical Disc as a part of your backup strategy, you might want to consider size-limited “buckets” for storage. In this scenario, sequentially numbered folders are no larger than the capacity of the optical disc format you are using for backup.

Figure x Here are three examples of good directory structures for an archive. The first shows a date-based structure. The second one organizes by project, and puts a sequence number at the start of the folder names, and the third example shows size-limited “buckets” for storage.
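The date-based and size-limited “bucket” structures can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the function names, the `Bucket_0001` naming pattern and the four-digit numbering are hypothetical, and the capacity you pass in should be that of the disc format you actually use.

```python
from datetime import date


def dated_folder(shoot_date, project):
    """Build a year/month/project folder path."""
    return f"{shoot_date:%Y}/{shoot_date:%m}/{project}"


def assign_buckets(file_sizes, capacity_bytes, prefix="Bucket_"):
    """Assign files, in order, to numbered folders that stay within capacity.

    file_sizes is a list of (name, size_in_bytes) pairs.
    """
    buckets = {}
    number, used = 1, 0
    for name, size in file_sizes:
        if size > capacity_bytes:
            raise ValueError(f"{name} is larger than one bucket")
        if used + size > capacity_bytes:
            number += 1  # current bucket is full: start the next one
            used = 0
        buckets.setdefault(f"{prefix}{number:04d}", []).append(name)
        used += size
    return buckets
```

Because files are assigned in the order they arrive and earlier buckets are never reopened, each full bucket can be written to disc once and left alone.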

Should originals and derivatives be archived together?

One of the most fundamental decisions about Archive structure surrounds the handling of derivative files. Should they be stored in a folder with the original raw files, or should they be separated? The answer to this question will depend on the general structure of your Archive, and whether you place more emphasis on security or convenience.

The problem

The biggest issue with storing originals and derivatives together, for most people, is that the derivatives are often created long after the originals have been captured, and often long after the originals have been archived. If the derivatives are stored with the originals, then new and old work gets mixed together, which can make the proper backup of these files more difficult.

Synchronized backup for co-mingled derivatives

If you use a simple synchronized backup to protect your Archive files, then there is little problem with storing derivatives and originals together. When a new derivative file is placed in the archive, the synchronizing software will copy it to the backup. This makes the backup easy, but increases the risk that some unwanted change made to the primary copy of the archive will be propagated to the backup.

Figure x If your backups are managed by a simple synchronization, then there’s no real problem keeping originals and derivatives together, as this movie shows. But this movie also outlines how that synchronized backup can lead to other problems.

Finalize derivatives before archiving entire projects

Some photographers may work in a way that allows originals and derivatives to be archived together without the risk outlined above. A wedding photographer, for instance, may want to deliver a fully finished project to the client before archiving the files, and use an additive backup rather than a synchronized one.

This Project workflow has a few structural problems. The first is that it can take a long time between the shoot and the delivery of final finished images. During this time, the original files are stored as Working files, and that may make for a very large set of Working files. And keep in mind that Working files carry more exposure to loss than Archive files do.

Figure x This movie outlines an archive workflow where entire projects are archived as a unit.

Separate originals and derivatives

The cleanest way to archive your derivative files may be to simply split them apart from the camera originals.

This method presents several advantages.

  • It allows you to archive original files as soon as they have had the file-based changes made, regardless of how long it takes to create fully finalized derivative files. This can significantly reduce your Working storage needs.
  • It allows the use of an additive backup system, which can better protect your data.

Figure x This movie outlines an archive structure where camera originals and derivatives are separated.

Data validation

Once images have been placed in the Archive, you should undertake some level of data validation on a periodic basis. This can confirm that the storage is intact, and provide early warning of any problems with storage.

Media integrity

You’ll want to make sure that all the media that you are using is functioning properly. If all your files are stored on a small number of hard drives, this can be a simple process for both the primary and backup copies of the data. If you use a large number of drives, or a large number of optical discs or digital tape, this can be very daunting.

Checking media integrity starts with verifying that the media mounts and appears to function properly. A full verification would include a surface scan of a drive to make sure all the bits are readable. Of course, this only tells you that the bits can be read, but does not prove that they are being read correctly.

The use of checksums as outlined below can accomplish both the media scan and a validation that the data has been read correctly.

Checksums

A checksum is a verification key for a file. Once a checksum has been created, you can use it to confirm that not even one bit has been altered in a file. These are quite valuable when creating long-term storage, since they can tell you the status of a particular file, taking all factors into account. Media failure, virus, transfer error, human error - if any of these has adversely affected a file, the file will no longer produce a matching checksum.

DNG files are created with an embedded checksum, so the verification is really easy. Other file types can be checksummed with standalone applications that create a checksum for each file in a folder.
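For file types without an embedded checksum, the per-folder approach can be sketched as follows. This is a minimal Python illustration, not a description of any particular standalone application: the choice of SHA-256 is our assumption (the text above does not name an algorithm), and `build_manifest` and `changed_files` are hypothetical names. It does not read the checksum embedded in DNG files.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Record a checksum for every file under folder, keyed by relative path."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): file_checksum(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """Return recorded files that are missing or no longer match."""
    folder = Path(folder)
    problems = []
    for rel, expected in manifest.items():
        p = folder / rel
        if not p.is_file() or file_checksum(p) != expected:
            problems.append(rel)
    return problems
```

You would build the manifest when files enter the Archive, store it somewhere safe, and run `changed_files` on a schedule. An empty result means every recorded file still produces its original checksum. Because verification reads every byte of every file, it also exercises the storage media.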

Read more about Checksums in the Data Verification section

Read more about DNG in the Verification Section

Completeness

In order to know that your archive is in good health, you’ll want to periodically check that nothing has gone missing. The only way for most people to accomplish this with any certainty is by the use of a catalog program to remember what is supposed to be present. Most catalog software has the capability to check the original media and verify that all cataloged items are still in the place that the catalog thinks they should be.
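Catalog programs do this check internally, but the logic is simple enough to sketch. The Python below assumes, purely for illustration, that you can export the catalog's file list as paths relative to the Archive root; `completeness_report` is a hypothetical name, not a feature of any catalog application.

```python
from pathlib import Path


def completeness_report(cataloged_paths, archive_root):
    """Compare the catalog's list of relative paths with the files on disk."""
    root = Path(archive_root)
    on_disk = {
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    }
    cataloged = set(cataloged_paths)
    return {
        "missing": sorted(cataloged - on_disk),      # in catalog, not on disk
        "uncataloged": sorted(on_disk - cataloged),  # on disk, not in catalog
    }
```

The "missing" list is the one that matters for preservation. The "uncataloged" list is a useful side effect, since it points to files that were archived without ever being added to the catalog.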

Migration

The medium-term and long-term preservation of a digital archive will require some periodic migration. Most often, this will be a storage migration, transferring from one medium or device to another. It’s also possible that you’ll want to consider migrating file formats, as existing ones become obsolete, or as new ones present significant advantages.

Storage migration

All digital storage media will eventually fail. In order to preserve your digital stuff, you’ll need to transfer it to new media periodically. In many cases, this can be very simple. Simply copy the data from one place to another using a verified transfer, and the new copy can replace the old copy. If you follow our recommendations about lifecycle-based workflow, this can be pretty easy.

A smooth migration

A typical storage migration could look like this:

  1. Gather all files to be migrated.
  2. If you have a comprehensive catalog, you should check for completeness of the archive.
  3. If you are using some kind of checksumming, this is an excellent time to verify the checksums.
  4. Evaluate the storage capacity you’ll need for the new system.
  5. Purchase and configure the new storage. Make sure that low-level format and SMART data checking are part of the set-up process.
  6. Perform a validated transfer of the existing data to the new device.
  7. Mark the old storage device as having been transferred to the new storage, in order to avoid confusion.
  8. You may wish to check for completeness and checksums on the new device.
  9. Put the new device in service, and make sure it is performing as expected before repurposing the old storage devices.
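The validated transfer in step 6 can be sketched as a copy followed by a checksum comparison of source and destination. This is a minimal Python illustration under our own assumptions: SHA-256 as the checksum and the function name `verified_transfer` are choices made for the example, and dedicated transfer utilities do this more thoroughly.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_transfer(old_root, new_root):
    """Copy every file to the new storage; return files whose copies differ."""
    old_root, new_root = Path(old_root), Path(new_root)
    failures = []
    for src in sorted(old_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(old_root)
        dst = new_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if _sha256(src) != _sha256(dst):  # re-read both copies and compare
            failures.append(rel.as_posix())
    return failures
```

An empty list means every copy matched at the time of transfer. Reading a file back immediately after writing may be served from the operating system's cache rather than the new device, which is one reason the later check in step 8 is still worthwhile.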

Migration complications

Not all storage migrations will be so clean, however.

Migration can become complicated for a number of reasons:

  • If the Archive is not well-organized, migration can be a complex process, as you try to track what has been copied and to sort out duplication.
  • If your primary storage medium is optical disc or digital tape, the process of copying the files can take a long time.
  • If you wait to migrate until the media is failing, the process can become much more complicated, particularly if you don’t have a good hard-drive-based backup to copy from.
  • If you have not performed any periodic data validation processes, you may find some kind of media failure during the migration.

What about updating my Archive files?

Depending on the software you use, and the structure you have chosen for your Archive, you’ll have some choices to make about how to treat your archive files with respect to file updating. And this choice will include not only the primary copy of your Archive, but the backup copies as well. Here is an outline of those issues.

PIEware makes updates optional for camera-original files

If the work you do to optimize and organize your images is done with a parametric image editor like Lightroom or Aperture, you have the option of treating your files as “read-only” files. All the work you do with these applications (as well as NLE video software) is about reinterpretation of the files, rather than changing them. This allows you to build a media collection where you can archive, checksum, back up and verify the media, with a very high level of security. This is one of the principal advantages of this class of software.

If security of the media is your main concern, then it’s probably best to simply archive the files, and refrain from updating them. You can make all your changes with the PIEware, and make sure that you protect the catalog, since it contains all the organization and optimization work.

Some software requires you to update the files or sidecar files

Some software, however, can’t keep all the instructions in a catalog.

  • If you make an adjustment to a file with Photoshop, you will always need to alter the file or make an additional image file that has the changes.
  • Adobe Camera Raw, Capture One, Bibble and others all require that the instructions are saved in some kind of sidecar file, which must live inside the folder that contains the images.
  • Even Lightroom and Aperture allow you to update your image files with metadata changes made in the program (depending on the file type of the original).

Update the Primary and Backup copies on different schedules

If you use one of these pieces of software that requires archived files to be “touched” in order to save your work, or if you choose to push your Lightroom or Aperture metadata back to the original files, we suggest that you think carefully about how and when you update your backup files.

Your choice here is between maximizing protection for recent work, versus maximizing protection for disaster recovery.

Photoshop users

If you need to rework an Archived file with Photoshop, we suggest that you consider simply making a new file, rather than updating one that has already been archived. This is the heart of the whole concept of Archive. In general, we suggest that files are treated as Working files until you are reasonably certain you won’t want to rework the image. If you do need to rework after archiving, make a new file.

The exception to this would be people who are using a simple synchronized backup. For them, a reworked file could be updated in-place, and the sync would move those changes over to the backup copy.

Camera Raw, Capture One, Bibble users

If you use one of these pieces of software that must place the instructions inside the folder with the images, you’ll probably want to set up one copy of your backup to update when changes are made to the primary copy. Otherwise, it would be easy to lose your work by a simple failure of the Primary storage device. Ideally, you would also have an additional backup copy of the files that is either not updated at all, or is updated on a very different schedule.

Lightroom and Aperture Users

In general, we suggest that you spend your backup time preserving the catalog itself, rather than updating any copies of the files. If you do want to update the primary copy of the files, we suggest that it will generally be best to leave the backup copies alone, so that you have maximum disaster recovery protection.

So what are the considerations for updating files in the archive?

  • The most important of these is that any update of a file introduces the possibility of doing some kind of damage to the file, through computer error, human error or malware.
  • Updating a file invalidates the verification provided by its previously recorded checksum, with the exception of DNG files.
  • Therefore, we suggest that archive files are only updated rarely.

Format obsolescence

Sometimes it will be desirable to convert your digital files to a new format. The most important reason to perform a format migration is the gradual obsolescence of a particular format. It may become difficult or impossible to open a particular file format in the future due to intellectual property issues or general lack of support in current software.

Kodak PhotoCD

Kodak PhotoCD is a prime example of intellectual property issues rendering a format obsolete. The color encoding method on PhotoCD is proprietary, and therefore commercial software generally opens the files using plug-ins provided by Kodak. As the fortunes of the company have declined, the plug-ins have not been updated, creating real difficulty for people trying to open the files with current hardware and software. While legacy systems and open-source software can open the files, this has gone from a common capability to a more specialized one.

Many individuals and institutions used Kodak PhotoCD for its promise of long media life. Few people considered that PhotoCD might become unreadable because the company itself would not be able to support its own formats.

If you have a collection of Kodak PhotoCD, it’s time to convert all the files to TIFF format, and migrate to new media as well.

Proprietary camera raw and DNG

Some of the early digital camera raw formats, particularly those created in the 1990s, have already become difficult to read, mostly because of the intellectual property reasons outlined above. This should serve as a cautionary tale, but it’s important to keep it in perspective.

For proprietary camera raw formats created after the turn of the century, total format obsolescence is pretty unlikely within the next decade or so. The vast majority of digital cameras create files that are pretty similar. The Bayer-pattern data created by the camera sensor is saved in some variation of a TIFF/EP file.

While total obsolescence is unlikely, it’s possible that files from older cameras will become unsupported in current software. In order to open the files from any particular camera, the software needs to know the color response of the sensor. Test files from these cameras must be loaded into the software to tweak the color conversion from raw to rendered, in order to obtain pleasing results. At some point, it won’t make economic sense to spend all those engineering resources on files from cameras that are no longer being produced.

The openly documented DNG file format can help to maintain access to the raw data from obsolete camera files. The format includes color profiles that can be used to decode the data, so that new software may be able to open older files without having to test and characterize each camera model. DNG also standardizes a method for storing all the different parts of a digital image file, so that it can be easier for an application to find the bits it needs.

Other DNG benefits

In addition to forestalling obsolescence, DNG provides some very important advantages for digital image archiving. These include methods to store the instructions used to process images, as well as the checksum verification tool. Here are some examples of the data that may be stored in the file:

  • A color profile for decoding the file
  • A nearly unlimited amount of user-generated metadata
  • Processing instructions for multiple renderings in multiple programs
  • Adjusted version(s) of the image
  • Camera-specific color profiles
  • The original source image file itself

The capability of storing all this information in a single file, along with the image data, makes DNG an excellent choice for long-term storage of digital images.

Read more about DNG in the File Formats Section

Last Updated September 22, 2015