Key takeaways
- A solid database backup and recovery plan is essential to keeping your systems resilient.
- Database backup involves creating secure copies of critical data, while recovery ensures those backups can restore systems quickly after disasters.
- Testing your backup and recovery plan regularly is crucial to identify potential risks and bottlenecks.
- This article explains how to efficiently test your strategy, the key differences between backup types, and best practices to safeguard your infrastructure.
What is database backup and recovery?
In computer terminology, a “backup” is what it sounds like: a separate instance of a given file, system, or portion of infrastructure kept in reserve in case the primary instance meets an untimely end.
Recovery is the process of replacing the primary file or file system with a backup to get things up and running again. Thus, in the most reductive of terms, a backup and recovery plan is the collection of predetermined procedures that help IT/I&O teams prepare for and respond to disasters affecting computer systems.
There are a few names that recovery plans go by, such as disaster recovery, backup and disaster recovery, and failover and recovery systems. The core ideas remain the same.
Backup vs. recovery: What’s the difference?
It’s worth noting the distinction between the two core operations involved in such recovery plans. As mentioned above, “backup” refers only to the process of creating or preparing copies of the things you need. While it’s not unheard of for “backup” to be applied to hardware, i.e., spare and extra parts (the original sense of the word), that’s not normally how it’s used in the context of computers.
Instead, we almost always discuss backups in this context when talking about the non-physical portions of the infrastructure. There’s an important reason for this: while having backup hard drives, replacement servers, and redundant data centers does require some preparation, backing up files constitutes something quite different. Backing up important data is an ongoing process with its own considerations, complications, and challenges.
In many ways, diligence in these aspects directly determines the speed and effectiveness of the rest of the recovery plan.
If backup is the preparation step, where copies and redundancies are created and maintained, recovery is when they’re called up out of the reserves. There’s often more to it than just replacing lost or damaged files with their most recent backups, especially when done at scale, so it’s treated as distinct from backup.
On the scale of a single device, the line between backup and recovery might not mean a whole lot. But when you’re dealing with, say, an entire server farm, both backup and recovery demand quite a bit in time, labor, and resources. An individual desktop may take up around a terabyte of storage, which would have to be copied to an equivalent amount of storage and restored to the device (or its replacement), both of which require some time.
Multiply that across 2,000 endpoints, though, or across hundreds of server racks, and suddenly, the entire endeavor becomes a logistics puzzle as much as it is a maintenance concern.
Types of backups, recoveries, and disasters
Computers are complex doohickeys, so it comes as no surprise that their upkeep is similarly nuanced. We don’t have the room here for a full rundown of all the different details and how they work together. But we’ll cover some critical points in case you aren’t familiar with them already.
Backup types: approach, location, medium, etc.
If you’re asking about backup types, odds are the real question is “how are we backing things up?” In other words, we’re talking about “backup” as a verb rather than a noun (we’ll get back to the noun in a second).
Making copies of digital assets, especially at scale, is a resource-intensive and time-consuming process. The more frequent and exact the backups, the faster the recovery, but the more demanding the backup process becomes. Here’s how they break down:
Full backup
This is a complete copy of the file, directory, or whole system. It’s a video game checkpoint, but for your data, and you can roll back to it in the event of a disaster. Most backup procedures at least start with a full backup. It’s the most demanding kind of backup, but it makes for the fastest and easiest recovery.
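To make that concrete, here’s a minimal sketch in Python that treats a full backup as a complete copy of a data directory. The paths and naming scheme are hypothetical, and a real database would typically be backed up through its own dump or snapshot tooling rather than a raw file copy:

```python
import shutil
from datetime import datetime
from pathlib import Path

SOURCE = Path("/srv/data")          # hypothetical data directory
BACKUP_ROOT = Path("/mnt/backups")  # hypothetical backup volume

def full_backup() -> Path:
    """Copy the entire source tree into a fresh, timestamped directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = BACKUP_ROOT / f"full-{stamp}"
    shutil.copytree(SOURCE, dest)  # one complete, self-contained image
    return dest
```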
Incremental backup
This method takes occasional snapshots, appending them to the original full backup image. None of the individual images (the original full backup or any of the following incrementals) provide the most accurate reflection of the data to be recovered. That comes from taking the whole stack of them in the aggregate, kind of like how a series of still images add up to a video when strung together.
Incremental is much quicker to run and demands less storage space, as it only takes an image of anything that’s changed (newly added, edited, or removed items). Everything else is ignored since it’s already contained in the full backup. The trade-off is in recovery, where restoring files and data can be grueling.
Where restoring from a full backup is essentially just cloning the drive and booting it up, incremental doesn’t store the image as a discrete and self-contained package. This means that both a full restore and more targeted recoveries are more complex and labor-intensive—the former because you have to cobble everything back together, the latter because you’re sifting through tons of backup images to find what you need.
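Under the same toy file-copy assumptions, an incremental pass only touches what changed since the previous backup finished. This is a sketch, not production code:

```python
import shutil
from pathlib import Path

def incremental_backup(source: Path, dest: Path, last_backup_time: float) -> None:
    """Copy only files modified since the previous backup (full or incremental)."""
    for path in source.rglob("*"):
        if path.is_file() and path.stat().st_mtime > last_backup_time:
            target = dest / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 preserves modification times
```

Notice that this version catches additions and edits but not deletions, and restoring means replaying the full backup plus every incremental in order, which is exactly the complexity described above.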
Differential backup
Differential backup is a compromise between full and incremental. Full backups take significant time and disk space, with demand growing with every additional copy you keep. Incremental takes up fewer resources, but you’re basically stitching things back together in recovery, and you’ll take longer to bounce back from disaster.
Differential tries to find a middle ground between the frontloaded demand of full backups and the sluggish recovery process of incremental ones. It starts with a full backup as a baseline, the same as the other two. That’s followed up by frequent partial backups that only record what’s been changed, just like incremental.
But instead of keeping a lengthy history of each incremental snapshot, differential only keeps two backup images: the original and the “differential” (hence the name). Basically, differential backups keep a running tally of everything that has changed and consolidate all of those changes into a single backup image. Then, during the recovery process, the two images are reconciled before restoring to the destination device.
It’s not as fast a backup process as incremental, and the recovery isn’t as quick as a full backup’s, but it levels out the disparity between the two quite well.
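In sketch form, the only structural change from the incremental version is the reference point: every run compares against the original full backup’s timestamp, and each new differential replaces the last, so there are never more than two images to reconcile:

```python
import shutil
from pathlib import Path

def differential_backup(source: Path, diff_dir: Path, full_backup_time: float) -> None:
    """Record everything changed since the ORIGINAL full backup in one image."""
    if diff_dir.exists():
        shutil.rmtree(diff_dir)  # only one differential is kept; each run replaces it
    for path in source.rglob("*"):
        if path.is_file() and path.stat().st_mtime > full_backup_time:
            target = diff_dir / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
```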
Mirror/clone backup
Most discussions regarding backup and recovery, especially at scale, deal with the three backup modes listed above. There is at least one other method, though, and it deserves a quick mention.
With standard backups, regardless of implementation type, there will be some amount of versioning, and there’s often compression to minimize storage demands. Files removed from the source don’t usually result in their removal from existing backup images (though they’ll obviously be omitted from images moving forward).
A mirror backup instead creates a copy of existing files exactly as they are on the source drive. This may be the entire drive (applications, OS, and all), or it may be specific files and/or directories. But the mirror only keeps the most recent version, removing anything the source has removed. The mirror isn’t compressed into a single backup image, so it takes up more space but can be accessed immediately when needed, making it the fastest recovery method.
Mirrors and clones often get conflated, and to be fair, there is some overlap. For example, a lot of definitions of clone backups boil down to “mirror the entire drive onto another drive.”
Mirrors/clones demand the most in storage and processing power, so they’re expensive at scale. The key advantage of this kind of backup is minimizing the time between a disaster or disruption and getting back to work. In other words, this is an approach best suited for devices and data that you need at all times, even if the rest of the system takes a bit to get back up and running.
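Continuing the same hypothetical file-copy sketches, the defining behavior of a mirror is the cleanup pass at the end: whatever vanishes from the source vanishes from the backup too:

```python
import shutil
from pathlib import Path

def mirror(source: Path, dest: Path) -> None:
    """Keep dest as an exact, uncompressed replica of source (rsync --delete style)."""
    # Copy anything new, or newer than the mirror's copy.
    for path in source.rglob("*"):
        if path.is_file():
            target = dest / path.relative_to(source)
            if not target.exists() or target.stat().st_mtime < path.stat().st_mtime:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target)
    # Delete whatever the source has deleted; a mirror keeps no old versions.
    for target in dest.rglob("*"):
        if target.is_file() and not (source / target.relative_to(dest)).exists():
            target.unlink()
```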
Where’s my backup?
Ok, we’ve covered the most relevant types of backups and how they’re implemented. Now let’s talk about the “when” and “where.”
Backups can be local, or they can be remote. Local can mean on the same device, in the same room, or on the same premises, depending on circumstance. Remote means the opposite (again, varying by circumstance), with the source device and the backup storage connected via the internet.
Backups can also be done while the source is “online” or “offline.” In this context, we’re not necessarily talking about an internet connection. Instead, this refers to whether the devices and/or data are accessible and in use while the backup occurs. With online backups, you leave everything up and running. This limits interruptions to operations, but it taxes system resources and increases the risk of errors in the backup due to ongoing modifications.
Running backups while the device or system is offline avoids issues tied to people editing the data during the process, and it frees up full system resources for the backup. But it also means the system is offline and unavailable until the backup finishes. This can be a problem if you need uninterrupted operation and access.
Online and offline backups are sometimes called “hot” and “cold,” respectively. Sometimes, a “warm” backup can be performed, where the system remains online, but no modifications are accepted during the process to avoid issues.
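For a concrete taste of a hot backup, Python’s built-in sqlite3 module exposes SQLite’s online backup API, which copies the database in small batches while the source stays live. The filenames here are made up:

```python
import sqlite3

src = sqlite3.connect("production.db")  # hypothetical live database
dst = sqlite3.connect("hot-backup.db")

# Copy 100 pages at a time, sleeping between batches so concurrent
# writers aren't locked out for the duration of the backup.
src.backup(dst, pages=100, sleep=0.25)

src.close()
dst.close()
```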
Recovery types
Ok, that last section took up a big chunk, which will make this one look tiny in comparison. That’s due in large part to the fact that the recovery type is usually dependent on the backup type. So if that’s all you’re wondering, please refer to the section above.
That being said, what data has been lost, where the data was, how much was lost, and when the last backup was can all impact the recovery process.
So far, we’ve used “recovery” to refer to the process of restoring data from backups. But recovery can also mean data recovery from the source itself when something goes wrong. This can be reclaiming lost or corrupted data on a device due to software errors or hard drive failure, for example. These tend to be more isolated cases.
Additionally, recovery can be restoring individual files, entire directories or databases, or the whole nine yards (i.e., a full system restore).
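Sticking with the earlier toy differential sketch, a full restore amounts to laying down the full image and then overlaying the differential; a targeted recovery of a single file or directory follows the same overlay logic at a smaller scope:

```python
import shutil
from pathlib import Path

def restore(full_dir: Path, diff_dir: Path, dest: Path) -> None:
    """Rebuild a data directory from a full image plus one differential image."""
    shutil.copytree(full_dir, dest, dirs_exist_ok=True)
    if diff_dir.exists():
        # Anything changed since the full backup overwrites its older counterpart.
        shutil.copytree(diff_dir, dest, dirs_exist_ok=True)
```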
Disaster types
Murphy was right; there’s basically no limit to the number of ways a system can fail or go bust. And just when you think you’ve seen them all, they invent new avenues for reaching failure states.
Broadly speaking, though, disasters break down like this. The failure itself is either hardware, software, or something in-between (so, basically any data-bearing or data-manipulating portion of the IT asset). The source can be internal, external, or human in origin. And for human-caused disasters, they can be accidental or intentional. Below are some examples:
- Internal hardware failure—HDDs break down over time, motherboards short circuit, PSU malfunctions can fry components, etc.
- Internal software failure—glitches, corrupted memory, import/export and conversion errors, etc.
- External “disasters” (in the traditional sense)—fires, floods, storms, power surges/outages, that kind of thing.
- Human-centric causes—accidental data deletion, installation errors, software patches that brick apps/devices, cyber threats, etc.
Basically, anything that would be dangerous to human “wetware” could be equally hazardous to system infrastructure (especially since server racks are much less equipped to flee such impending doom through the fire escape).
Why do you need a database backup & recovery plan?
Data disasters are subject to the kind of unpredictable inevitability that rivals quantum uncertainty. It’s a foregone conclusion that, at some point, each asset in the system will suffer some kind of catastrophe, provided it’s in service long enough. But there’s no way to know what the disaster will be, or when it will happen.
Ultimately, every I&O crew has two choices: prepare for very real outcomes that cannot be fully anticipated, or do nothing and hope disaster doesn’t befall the system under their watch.
Society already sees the value of preparing for equivalent disasters in the physical realm. We use smoke detectors, silent alarms, backup generators, storm shelters, and that kind of thing. Choosing not to prepare some form of business continuity and disaster recovery (BCDR) plan is the digital equivalent of deciding against restocking the toilet paper and hoping you’ll remember to do it before you wind up with food poisoning.
Steps for testing your database backup & recovery plan
Working in tech, most of us are pretty familiar with the practice of testing things before pushing them to production. This is no different. And, like with website updates and software patches, some fundamental best practices apply above all else:
- Test first in an isolated environment so you don’t inadvertently change, add, or remove things you didn’t intend to.
- Wherever possible, don’t test with live data or devices.
- Make use of redundancies to avoid losing critical data if the test fails.
- Better yet, don’t test with sensitive or important data at all. Use something of nominal value, like a blog article draft from Steve, the Marketing Intern.
- Avoid running tests, making changes, or performing other maintenance without a way to roll things back, or during times when such a rollback might cause major issues.
As for the actual testing, you’re verifying at least three aspects of the backup/recovery process:
- The software/platform performing the backups and/or recovery
- The hardware resources that have to host and process everything
- The automation setups (if any) that will be initiating and overseeing all of this
And these are a few of the key questions you’re looking to answer:
- Does the backup or recovery require human intervention at any point in the process?
- Are resources and capacity sufficient to support a full-scale backup schedule, or a worst-case recovery response?
- Is all important data intact and accounted for on the other side of the process? (See the sketch after this list.)
- Are there any clear bottlenecks or potential risks that need addressing?
- Does the process meet the needs of the use case, e.g. does it hit appropriate targets for speed, resource demand, overhead, etc.?
- How long should we estimate for recovery time in the event of a disaster?
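As a rough sketch of how you might answer the integrity and timing questions, checksum every file before and after a test restore into a scratch directory, and time the run to feed your recovery-time estimate. The restore callable here could be the hypothetical restore() from the recovery section:

```python
import hashlib
import time
from pathlib import Path
from typing import Callable

def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def timed_restore_test(source: Path, scratch: Path,
                       run_restore: Callable[[Path], None]) -> float:
    """Restore into scratch, verify every file is intact, return elapsed seconds."""
    expected = checksum_tree(source)
    start = time.monotonic()
    run_restore(scratch)
    elapsed = time.monotonic() - start
    assert checksum_tree(scratch) == expected, "restored data does not match source"
    return elapsed
```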
Finally, like testing a new cleaning product on your pristine furniture, you’ll want to start with individual, smaller tests, and work upward from there. If feasible, test each aspect separately first, on a smaller scale, to make sure the process itself is working as intended. Then, begin testing with larger groups of assets, and begin running process steps in sequence, until you can safely run full, end-to-end tests at full scale.
Oh, and just like you’re going to be backing up regularly, you’ll want to have these “fire drills” on a recurring basis, too. The last thing you need is for an actual disaster to happen, only for your metaphorical fire extinguisher to be out of date and non-operational.