Delta copying
Do less, faster

Delta copying is a (much) faster way of replicating files
when their past versions already exist at destination.

The problem

When a file changes and we want to update its backup copy, the simplest thing to do, obviously, is just to copy the file over. Read it in full, write it in full. It's simple and it works.

It is also not terribly smart.

Chances are that the file changed only here and there and the backup copy is still largely in sync with the source. And yet here we are discarding it and re-copying anew.

We can do better.

We can keep unmodified parts of the backup copy and selectively update parts that actually changed. That's delta copying.
Source file
Backup copy
And as with any simple idea there's more than one way to slice it.

Option A

Any form of delta copying comes down to being able to detect which parts of a file changed between its two versions.

The most obvious option is to simply go through both files and compare them block by block.
Copy A
Copy B
The plus is that all we need is just two copies of a file. Feed two files in, get the differences out.

The minus is that both files need to be read and roughly at the same time. If files reside on the same local device, this may cause some performance issues. In fact, on non-NVMe devices it will make the whole process run at least 2x slower.
Copy A
Copy B
Secondly, if copying over the network, this will cause remote file to be transferred over, in full, just to decide if there are any parts worth updating. Even with a very fast network in place, this will still be the bottleneck of the whole process.

Option A*

The over-the-network copying is a very common need, so it's not surprising that a good solution exists.

It is, of course, the venerable rsync, an integral part of many IT toolboxes since its first release back in the mid-90s.

Rsync works by spawning a copy of itself on the remote machine, making it read remote file, compute block checksums and report them back.
rsync helper
block hashes
block hashes
block update
block update
Source machine
This significantly reduces the amount of data being sent over the network. If there's a checksum mismatch, the block is re-checked in a more expensive way and copied over if needed.

This works well and solves the issue of reading remote files in full but it comes with an obvious catch.

For rsync to work, it must be present on the remote machine. For many setups this is not an issue, but there is also a plenty of cases where this is a blocking requirement.

Option B

Bvckup 2 uses a different approach.

It remembers what a file was like on the last run and then detects changes by comparing file's current version against that.

This is done by splitting file into large-ish blocks, computing their hashes and then checking if these hashes changed since the last time we looked at the file. If they did, we copy respective blocks and save their new hashes:
read block
hash block
load hash from the last time
if (new-hash ≠ old-hash) then
    write block
    save new-hash

What is good

This delta copying method works for all remote storage that is accessible for reading/writing. Running a second copy of the software on the receiving end is not required.

The backup copy is not read at all.

We are keeping a per-file state, so we can use it to capture some additional details and enable things like resumable copying and abortive error recovery.

We are also storing detailed file hashes, which naturally enables file integrity verification and detection of bit rot.

The caveats

There are, of course, some drawbacks as well.

The source file is always read in full, so the delta copying benefits will be most pronounced if the program is running on the source machine.

Second is that destination files cannot change between the runs, because this will invalidate their saved state. Fortunately, this is largely a non-issue in our case, because backups don't generally change between the runs.

Finally, we need to store per-file state and this takes space. The exact numbers are given below, but on average the state takes up around 0.05% of the file size. That's 1/20th of a percent.

What good is a theory if not for its application?

Here's delta copying in action processing a 10 GB virtual disk of a freshly rebooted Windows virtual machine: Copied 19 MB or just 0.2% and completed in less than 4 minutes vs 19 minutes needed for a complete copy.

This is from a machine with two slower HDDs, which is why it's in minutes rather than seconds.
Here is another example. This time updating a 256 GB encrypted file system container with a day's worth of changes: Copied 788 MB or just 0.3% of the total and completed in about 11 minutes vs. 44 minutes that a full copy would've required: Note how the overall speed was around 390 MB/s even though the backup went over a gigabit link with a maximum throughput of 100 MB/s.

Both the source and destination here are the SSD drives, so the copying effectively ran at a reading speed of the source drive. Neither the network nor the backup device were a bottleneck.

Quick details

Block size

Files are read from the disk in large chunks as dictated by the IO module, and then split into 64KB blocks for delta processing.

Changes are detected at a block level, meaning that changing a single byte in a block marks the whole block as modified.

Block hashes

For each block 3 separate hashes are calculated and saved.

Blake2b is the principal hash with 16 bytes of output.

This is a newer generation cryptographic hash algorithm that was one of SHA-3 candidates. It is widely used and it is quite fast.

xxHash and Spooky are two weaker, non-cryptographic hashes that are however extremely fast, running almost at the speed of RAM itself. Each produces 8 bytes of the output.

Combined, this yields a total of 32 bytes of hashes per block or about 0.5% of the block size.
For additional details and performance numbers see this post.

Option C

"But wait... there is more."

There exists yet another way for detecting file changes and that's to rely on someone else to track changes for us.
Nearly all modern file systems employ a form journaling to keep a log of all major changes to the folders and files.

These records are kept to allow for automatic self-consistency checks and repair after computer crashes, sudden power-offs and other unfortunate events.

Windows NTFS is of this kind.
The journaling can also be configured to record changes to the file contents. This doesn't record what has been written, but it does record where the change was and when it occurred. And this happens to be exactly what we need for delta copying purposes. The main drawback is that this makes delta copying dependent on an external component.

Another trade-off is that tracking writes causes the journal to be updated much more frequently, grow in size faster and have a shorter time coverage. This has system-wide consequences, and it also increases chances of not having complete change log for a file that was last copied long time ago.

Excessive journal use with write tracking is also likely the reason why on Windows the tracking is off by default.

Last, but not least is that this is a technically complex feature and it's just plain hard to implement correctly with all i's dotted and t's crossed.
Nonetheless, when deployed with due thought and care, journal-based delta copying can be exceptionally effective.

The prototype is in the works and the initial revision should be making its way into Bvckup 2 builds shortly.
Give Bvckup 2 a try and see the difference delta copying can make for your setup.
Made by Pipemetrics in Switzerland

Blog / RSS
Follow Twitter
Miscellanea Press kit
Company Imprint

Legal Terms