Just some commands and things I seem to use often when backing up. It's a rather random collection, so don't expect a coherent blog post in this one; it's more of a scratchy notebook.

Analyse file and dir sizes to see what is taking up the most space

du -ch  --max-depth=1  | sort -h

Here we use the disk usage command, du, with the -c flag to give us a grand total, the -h flag to produce human-readable output, and the --max-depth=1 flag so we only look at dirs directly below our present working directory. You could also exclude files with globs, such as --exclude="**/venv/**" --exclude="**/static/**", if you wanted to know the sizes without certain dirs considered. The output of du is then piped to the sort command, again with a -h flag, so that it orders the human-readable sizes correctly (2M sorts after 900K, for example).

This is quite handy to check what should be excluded when making a backup of your home dir for example, e.g.

cd ~; sudo du -ch  --max-depth=1 --exclude=./Desktop --exclude=./Documents --exclude=./Pictures --exclude=./Downloads

shows that the ~/.cache dir is huge and should obviously be excluded. I have vagrant installed, and I can see that I should really exclude ~/.vagrant.d too. The .thunderbird dir was also huge, which prompted me to turn off local storage in the Thunderbird settings, except for mail from the last 30 days.

If your backup becomes strangely large after a while, you might want to rerun this command and just double check there's nothing huge that should be excluded in your rules now.
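
If you want to try the du pipeline somewhere harmless first, here is a minimal sketch that builds a scratch directory and runs it with and without an exclusion. The directory and file names are made up for illustration, and it assumes GNU du and sort.

```shell
# Scratch directory standing in for a home dir (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/Documents" "$demo/.cache"
head -c 10000 /dev/urandom > "$demo/Documents/notes.txt"
head -c 3000000 /dev/urandom > "$demo/.cache/blob.bin"

# Size of each top-level dir, smallest first, with a grand total.
all_sizes=$(cd "$demo" && du -ch --max-depth=1 | sort -h)
echo "$all_sizes"

# Same again, but leaving .cache out of the accounting.
without_cache=$(cd "$demo" && du -ch --max-depth=1 --exclude=./.cache | sort -h)
echo "$without_cache"
```

Comparing the two totals gives a rough idea of how much an exclusion rule would save.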

Manual backup and restoration

If you want to make a manual backup in a way that can easily be restored from a liveCD in the event of a major crash, the following command does the trick

cd /; tar -cvpjf /home/lee/sysbackup.tar.bz2 --exclude=/home/lee/sysbackup.tar.bz2 --exclude=/sysbackup.log --exclude='/var/cache/*' --exclude='/root/.cache/*' --exclude=/home --one-file-system / 1> sysbackup.log

This will create a tar.bz2 with permissions intact from everything under root, except the home dir, and dirs not on the same filesystem (this excludes /proc, /sys and other temp filesystems, and also things mounted, like USBs, under /mnt). It also excludes some caches that we don't need for a system restore.
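
To see how the exclusion patterns behave without touching a real system, here is a small sketch against a throwaway directory tree. Everything under the scratch directory is hypothetical, it assumes GNU tar, and it uses gzip (-z) rather than bzip2 (-j) only to keep the demo dependency-light.

```shell
# Fake "root filesystem" in a scratch directory (hypothetical contents).
demo=$(mktemp -d)
mkdir -p "$demo/root/etc" "$demo/root/var/cache" "$demo/root/home/lee"
echo "setting=1" > "$demo/root/etc/app.conf"
echo "junk" > "$demo/root/var/cache/junk.bin"
echo "private" > "$demo/root/home/lee/notes.txt"

# -C plays the role of "cd /"; the excludes come before the path to archive.
tar -cpzf "$demo/sysbackup.tar.gz" -C "$demo/root" \
    --exclude='./var/cache/*' --exclude='./home' --one-file-system .

# List what actually went in before trusting the archive.
contents=$(tar -tzf "$demo/sysbackup.tar.gz")
echo "$contents"
```

Note that excluding './var/cache/*' keeps the empty var/cache directory in the archive, while excluding './home' drops the directory itself.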

You can then restore the system (minus the home folder) quite quickly:

  1. Boot from the live USB.
  2. Use GParted to sort out the partitions if need be, and to format the Linux one.
  3. Mount the partition that you will restore to.
  4. Restore the image to it with sudo tar -xvpjf /path/to/sysbackup.tar.bz2 -C /media/whatever --numeric-owner (the -C is just short for change directory, so we make sure we restore to the correct place; use -z instead of -j if your archive is a .tar.gz).
  5. If your backup excluded /proc and other temp filesystems, recreate their empty directories under the restore target, e.g. sudo mkdir /media/whatever/proc /media/whatever/sys /media/whatever/mnt /media/whatever/media and so on. (Tip: in duplicity, exclude /proc/* instead of /proc, so that duplicity keeps the empty parent directory as a placeholder.)
  6. Don't recreate lost+found in the normal mkdir way; instead cd to the root of the restored filesystem and run mklost+found there.
  7. Recreate an empty /boot/efi if needed (it's just a mount point for the boot partition to mount into). The rest of /boot/ should have been backed up for a full restore.

If you have a duplicity backup instead, you can replace step 4 with

duplicity --numeric-owner file:///path/to/dupBackup /media/whatever

If you're using duplicity with s3 make sure you export the gpg passphrase and aws keys as environment variables first.
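
As a rough sketch, the exports might look like the following. The values are placeholders; PASSPHRASE is the variable duplicity reads for the GPG passphrase, and the two AWS_* variables are the usual ones for S3 credentials.

```shell
# Placeholder values only; substitute your own before running duplicity.
export PASSPHRASE='your-gpg-passphrase'
export AWS_ACCESS_KEY_ID='your-access-key-id'
export AWS_SECRET_ACCESS_KEY='your-secret-access-key'

# ... run the duplicity restore here ...
# Afterwards you may want to clear them again:
#   unset PASSPHRASE AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
```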

Splitting tar files to make them fit on a USB stick

You may want to split a large tar archive into chunks of less than 4 GB each, so that they fit onto a USB stick (these are often formatted as FAT32, which only accepts single files of up to 4 GB). This can be done with

split -d -b 3900m /initial/path/to/backup.tar.gz /path/to/of/backup.tar.gz.
  • -d: This option means that the chunk suffix will be numerical instead of alphabetical; the chunks are numbered sequentially, starting with 00 and increasing with each new split file.
  • -b: This option designates the size to split at; in this example I've made it 3900 MiB, so each chunk stays under the 4 GiB file size limit of a FAT32 partition.
  • /path/to/of/backup.tar.gz.: The path to where the split files will be output, and the prefix for each of them. In our example, the chunks will be in the directory /path/to/of/ and be named backup.tar.gz.00, backup.tar.gz.01, ....
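
To put the pieces back together you just cat the chunks in order. Here is a small sketch of the round trip, using a few megabytes of random data as a stand-in for a real archive and a scratch directory in place of the USB stick (both are assumptions for the demo).

```shell
demo=$(mktemp -d)
mkdir "$demo/usb"

# Stand-in for a real backup archive (about 2.5 MB of random bytes).
head -c 2500000 /dev/urandom > "$demo/backup.tar.gz"

# Split into 1 MiB chunks with numeric suffixes: .00, .01, .02
split -d -b 1M "$demo/backup.tar.gz" "$demo/usb/backup.tar.gz."
ls "$demo/usb"

# Rejoin: the shell expands the glob in sorted order, so the chunks line up.
cat "$demo"/usb/backup.tar.gz.* > "$demo/rejoined.tar.gz"

# Verify that the rejoined file is byte-for-byte identical to the original.
cmp "$demo/backup.tar.gz" "$demo/rejoined.tar.gz" && echo "chunks rejoin cleanly"
```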

A note on encrypted home dirs

If you have encrypted your home dir (with eCryptfs), you will see hidden files like /home/lee/.ecryptfs and /home/lee/.Private. These are just symlinks; the encrypted files are at /home/.ecryptfs/lee. You don't want to back those up. Rather, back up the unencrypted files while you are logged in, then re-encrypt in the future if you ever have to restore.

In theory though (if you were an anarchist), you could opt to back up the encrypted files instead:

 tar -cvpjf /media/"USB DISK"/homePriv.tar.bz2 /home/.ecryptfs

makes a tar of it to a USB stick (although I read this tar ball may not work, as tar has issues with the chaotic, unpatterned nature of encrypted files). In addition to the technical problems with doing this, there is no way to deselect parts of home dir that I don't want to back up here, such as .cache, Dropbox and so on.

If you did do this, then the way to recover a backup of this encrypted home is with sudo ecryptfs-recover-private from a live USB, and it is detailed here

Restoring GRUB

For the system to boot, you may need to restore grub. To do this from a live USB/CD, you will need to reconfigure it in a chroot (after you've restored the rest of the system):

First, bind the directories that grub needs access to to detect other operating systems, like so:

sudo -s
for f in dev dev/pts proc ; do mount --bind /$f /media/whatever/$f ; done

This loop just runs over the needed live USB dirs and binds each one to the corresponding dir under the mount point you restored to, e.g. one command in the series is mount --bind /dev /media/whatever/dev.
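
If you want to check what the loop will do before running it as root, a dry run that only prints the commands is a cheap sanity check. This is just a sketch; the target variable is introduced here for illustration and holds the same mount point as above.

```shell
target=/media/whatever   # the restore mount point used in this post

# Print each bind-mount command instead of executing it.
planned=$(for f in dev dev/pts proc; do
    echo "mount --bind /$f $target/$f"
done)
echo "$planned"
```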

With those bindings in place, we can use chroot, which effectively allows us to pretend that the root directory is not /, but rather /media/whatever

chroot /media/whatever

Now if you have a file /media/whatever/testfile.txt, it could be accessed with just cat /testfile.txt. With that, the usual grub config commands should go through on our restored system

dpkg-reconfigure grub-pc

You will get a menu asking you what drive(s) grub should be installed on. Choose whatever drive(s) the computer will be booting from, and configure grub as you normally would.

A note on exclusions for sys (non home dirs) backup

  • /proc: this is a virtual filesystem, specifically a "procfs". It doesn't contain real files but runtime system information (e.g. system memory, devices mounted, hardware configuration, etc). You should notice that all the files have size zero. It's used as an info centre for the kernel, and commands like lsmod are really just synonymous with cat /proc/modules. We really don't want to be attempting to back this up!
  • /dev: Everything in linux is a file, from cdroms to printers, and this dir is where the device files reside. Don't back this up!
  • /sys: A virtual RAM-based filesystem. Similar to /proc and, if I understand things correctly, in many ways its replacement (at least in some aspects). Don't back up.
  • /media, /mnt: CDROMs, USBs, external HDs. Definitely don't back up.
  • /run: Another tmpfs, which replaces /var/run (now a symlink to /run).
  • /lost+found: when the system isn't shut down properly, the file fragments that fsck recovers go to this linux purgatory (see here). Don't back up.
  • /tmp: as the name suggests, temporary files, destroyed on each reboot. Pointless to back up.
  • /boot/efi/: this is a mount point for the separate EFI partition (i.e. it's an empty folder where the partition gets mounted, so the contents are on the other partition). I don't back it up (although I guess you could...)

In order to save space, you will also want to exclude caches like:

  • /root/.cache/*
  • /var/cache/*

Note /var/run is a symlink to /run, and /var/lock is a symlink to /run/lock. Duplicity does not follow symlinks, so leaving them in your backup won't lead to duplicity accidentally backing up /run, and thus there's not much harm in doing so.

If you don't care about space/bandwidth, the above exclusions should be all you need to get things going again very easily. It will also allow you a very easy restore. For me it costs too much space and bandwidth to also backup dirs like

  • /var/lib/*
  • /bin/*
  • /lib/*
  • /lib32/*
  • /lib64/*
  • /opt/*
  • /sbin/*
  • /usr/*

so I exclude those too, and just take a snapshot list of currently installed packages, which I backup. In the event I need to restore my system, it won't be as easy as just directly restoring the backup with duplicity during a live USB session before rebooting, but it saves a huge amount of space and bandwidth. I'd instead need to use this packages list to reinstall everything from the repos, then restore config directories like /etc from my duplicity backup.
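
One way to take such a package snapshot on a Debian/Ubuntu system is with dpkg, sketched below. The output path is just an example, and the restore commands in the comments are the rough outline rather than a tested procedure.

```shell
# Assumes a dpkg-based distro; the output path is only an example.
pkg_list="${TMPDIR:-/tmp}/installed-packages.list"

if command -v dpkg >/dev/null 2>&1; then
    # One line per package, with its selection state (install, hold, ...).
    dpkg --get-selections > "$pkg_list"
    echo "saved $(wc -l < "$pkg_list") package entries to $pkg_list"
else
    echo "dpkg not found; use your distro's equivalent" >&2
fi

# To reinstall from the list later (as root), roughly:
#   dpkg --set-selections < installed-packages.list
#   apt-get dselect-upgrade
```

Make sure the list file lives somewhere that is included in your backup.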

Comparing files between backup sets

If you take a given backup set and list all the files contained in it with duplicity, then output to some file, e.g.

duplicity list-current-files --time "2014-01-21" s3://s3.amazonaws.com/<YOUR_BUCKET>/ >> list_of_files_20140121.txt

then the output may look like

Sat Mar 30 16:37:34 2015 bin/bash
Fri Aug  3 18:30:13 2014 bin/bunzip2

you could reissue the command with a different datetime, say 2014-01-05, at which point you have two file lists of the above format: "list_of_files_20140121.txt" and "list_of_files_20140105.txt". You can now use some diff magic to see new additions/deletions (this doesn't cover modifications of existing files, just brand new files or deletions):

First collapse all white spaces into single white spaces:

sed 's/\s\s*/ /g' list_of_files_20140121.txt > list_of_files_20140121.out1

Then use cut to use a white space as a delimiter and keep only column six or more to kill the date columns:

cut -d ' ' -f 6-  list_of_files_20140121.out1 >  list_of_files_20140121.out2

With these changes each file is reduced to a bare list of paths, so the sample lines above become

bin/bash
bin/bunzip2

and we can use diff to compare lines:

diff -y --suppress-common-lines  list_of_files_20140121.out2 list_of_files_20140105.out2 >> differences.txt

Not sure if this is the most elegant way of doing it, but it should show you new/deleted files in your backup between two given datetimes, should you need to know this information.
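
The whole sequence can be tried end to end on made-up data. The sketch below invents two tiny listings in the same format as above, strips the dates with a small helper function (paths_only is a name introduced here), and diffs the results. It also sorts the paths first, which is an extra step not in the commands above.

```shell
demo=$(mktemp -d)

# Made-up listings in the list-current-files format shown above.
printf '%s\n' 'Sat Mar 30 16:37:34 2015 bin/bash' \
              'Fri Aug  3 18:30:13 2014 bin/bunzip2' > "$demo/list_new.txt"
printf '%s\n' 'Sat Mar 30 16:37:34 2015 bin/bash' \
              'Fri Aug  3 18:30:13 2014 bin/bzip2' > "$demo/list_old.txt"

# Collapse runs of spaces, drop the five date fields, sort the paths.
paths_only() {
    sed 's/  */ /g' "$1" | cut -d ' ' -f 6- | sort
}
paths_only "$demo/list_new.txt" > "$demo/new.paths"
paths_only "$demo/list_old.txt" > "$demo/old.paths"

# diff exits non-zero when the lists differ, hence the "|| true".
changes=$(diff "$demo/new.paths" "$demo/old.paths" || true)
echo "$changes"
```

Lines starting with < are only in the first listing and lines starting with > are only in the second.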

Duplicity host mismatch error

Duplicity protects its backups by only letting the same hostname make incremental backups to an existing backup chain (unless the --allow-source-mismatch flag is given).

You can see the hostname that a backup set is associated with in its manifest file, so you may run into this error if you change your hostname. Duplicity wants to use the fully qualified domain name (FQDN); see hostname -f. If you hit this error, check which hostname is in the duplicity manifest, then modify your host in /etc/hosts and /etc/hostname to resolve it.


Keeping an archive on a server synced to a local dir:

rsync -avzi --progress -e='ssh -i /path/to/.ssh/id_rsa_key -p 22'  user@somehost.com:/path/to/src/backup/dir /path/to/dest/backup/dir

The flags here are -a for archive mode, which ensures permissions are kept, -v for verbose output, -z for compression, and -i to itemize changes. We use the -e flag to specify the ssh command, in which I specify a key and port.

Taking snapshots of a postgres db

In order to backup postgres dbs, I create a new cronjob under the "postgres" user's crontab. First switch to the postgres user, sudo su postgres, then edit the crontab, crontab -e, and add

# m h  dom mon dow   command
15 3 * * * /usr/bin/pg_dump pgdb_name -f /path/to/db.backup.sql

which will dump the content of the postgres db called "pgdb_name" to the "/path/to/db.backup.sql" SQL file at 03:15 each day. So long as the SQL output file is on your backup path, you will now have a database backup.


About Lee

I am a Theoretical Physics PhD graduate now working in the technology sector. I have strong mathematical skills and originally started in heavy-duty scientific computing, but now I work mostly with Python and the Django framework. I am available for hire now, so check out my resume and get in touch.