The thing about backups is that they can be a pain. Everyone knows how important they are, but very few people take the time to perform proper backups, even after they have felt the pain of losing important files.
In this article I am going to show you how to quickly set up your computer for simple, hassle-free, and transparent backups using only rsync and cron (or Anacron). The premise is simple: every night your computer will make an automatic mirror of all the files you wish to back up, and at chosen intervals these mirrors will be archived and kept for a specified period of time.
Before you get your hands dirty with the actual implementation you need to design your own backup policy. In section 3 I discuss what a backup policy should and should not be. I will then introduce the necessary background information on rsync and cron separately. Finally, I will put it all together, leaving you with a simple but effective backup regime.
This article and the backup procedure it presents are intended for anyone wishing to keep an effective backup of their important data. They are definitely not intended for large organisations or businesses with mission-critical data. I would imagine the ideal candidates would include: home users, home office/small office users, students/postgraduates, and researchers.
A common misconception among many people is that a good backup policy is as simple as making a regular copy of your data ("mirroring your data"), always overwriting the previous copy. This, although more effort than most might make, is almost as bad as doing nothing.
Consider, for example, if one of your files becomes corrupt over time. It takes you a week or two before you use it again. In that time, you have made two "backups". You open your file to find your data destroyed. "But", you think to yourself, "that's alright, I'll just turn to my backup". You open your backup to find the exact same corrupted file. You realise only too late how useless your backup policy was.
Most of us have hundreds, if not thousands, of important files in our home directories; address books, e-mails, letters, work related data, programs we have been working on, etc. Some of these files we might use every week, while others might not be looked at for months or even years.
A good backup policy is one which takes "snapshots" of your data and keeps them for a specified period of time. It is up to each individual to decide just how many snapshots to keep and at what intervals. Often this will be decided for you by storage limitations. Where possible, data that changes regularly will benefit from snapshots at shorter intervals, while data that rarely changes needs fewer snapshots. The following table demonstrates my own backup procedure:
Data | Change Freq. | Size | Daily Mirror | Weekly 1 | Weekly 2 | Weekly 3 | Monthly 1 | Monthly 2 | Monthly 3 | Monthly 4 | Monthly 6 | Monthly 12 | Space Required
---|---|---|---|---|---|---|---|---|---|---|---|---|---
E-mails | Daily | 100Mb | Y | Y | Y | Y | Y | N | Y | N | Y | Y | 500Mb
MySQL Data | Daily | 30Mb | Y | Y | N | N | Y | N | Y | N | Y | N | 70Mb
Website | Monthly | 900Mb | Y | Y | N | N | Y | N | N | Y | N | N | 3,200Mb
/etc | 2-3 Weeks | 28Mb | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | 200Mb
Thesis | Daily | 25Mb | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | 190Mb
Research Code | Rarely | 60Mb | Y | Y | N | N | Y | N | Y | N | Y | Y | 200Mb
Total space required: 4,360Mb
Each of the snapshots is compressed to reduce space. The largest data set in my policy is the website. This rarely changes, so I keep only a few snapshots. My system's /etc directory also changes rarely, but as it is only 28Mb I have chosen to keep all possible snapshots. You should now make a similar table and decide which data you want to back up and how often.
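As a quick sanity check on a table like the one above, the per-data-set figures can be added up in the shell. This is only a minimal sketch: the numbers are the "Space Required" column copied from the table (in Mb), and the variable names are illustrative.

```shell
#!/bin/sh
# Space required per data set, in Mb, copied from the policy table:
# e-mails, MySQL data, website, /etc, thesis, research code.
total=0
for mb in 500 70 3200 200 190 200; do
    total=$((total + mb))
done

echo "Total space required: ${total}Mb"
```

Replacing the list with your own figures gives the amount of storage your backup location needs to provide.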
The next consideration is where to store your backups. Again, this is your own choice. Some locations simply won't be available to you while others might just seem like overkill. The following list gives the various options listed from the best to the worst:
If you do not have access to a remote computer yourself, then consider joining up with a friend, each of you backing up to the other's computer. Security concerns can be addressed by encrypting the data before/during transfer and only placing the encrypted versions on the remote computer.
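One possible way to encrypt data before it leaves your machine is sketched below. It is not part of the procedure described in this article: it assumes OpenSSL 1.1.1 or later is installed (for the -pbkdf2 option), and every path, file name, and the pass-phrase itself are hypothetical.

```shell
#!/bin/sh
# Sketch: pack a directory into one encrypted file before sending it away.
# Assumes OpenSSL 1.1.1+; all names below are made up for the demonstration.
work=$(mktemp -d)
mkdir -p "$work/mail" "$work/restore"
echo "hello" > "$work/mail/inbox"
echo "correct horse battery staple" > "$work/passphrase"
chmod 600 "$work/passphrase"        # the pass-phrase file must stay private

# Archive and encrypt in one pipeline. Only mail.tar.gz.enc would be
# copied to the remote computer.
tar -C "$work" -czf - mail \
    | openssl enc -aes-256-cbc -pbkdf2 -salt \
        -pass "file:$work/passphrase" -out "$work/mail.tar.gz.enc"

# Restoring reverses the pipeline.
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass "file:$work/passphrase" -in "$work/mail.tar.gz.enc" \
    | tar -C "$work/restore" -xzf -
```

Note that an encrypted archive changes completely whenever its contents change, so rsync can no longer transfer just the differences; this approach trades bandwidth for privacy.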
rsync is a very fast and flexible file transfer utility. It uses its own "remote update" protocol to transfer just the differences between two sets of files. It can operate locally or across a network link using rsh, ssh or its own daemon. rsync is included with most standard Linux distributions by default, or it can be downloaded from its website (http://rsync.samba.org).
We are going to use rsync to mirror our files every night. rsync is the ideal choice as it will only transfer new files, the differences between existing files that have changed, and remove old files, minimising the bandwidth usage for dial-up/broadband customers.
The mirrors are easiest to implement when we take an entire directory and its sub-directories. Let's take the case where you are mirroring all your e-mail files from your home computer to your office computer. We would use rsync as follows:
rsync -r -e ssh --delete /home/username/mail username@mycomputer.mycompany.com:/backups/mail
where:
-r
- Instructs rsync to copy directories recursively.
-e ssh
- Tells rsync to use the ssh remote shell. More about this below.
--delete
- Instructs rsync to delete files on the receiving side which do not exist on the sending side.
/home/username/mail
- The directory we are mirroring.
username@mycomputer.mycompany.com:/backups/mail
- Log in as user username on mycomputer.mycompany.com and create/update the mirror in /backups/mail
This will create a mirror of /home/username/mail on mycomputer.mycompany.com under the directory /backups/mail/mail. This is what we want. If you wanted the reverse (backing up from mycomputer.mycompany.com to your home computer) you would simply switch the source and destination:
rsync -r -e ssh --delete username@mycomputer.mycompany.com:/home/username/mail /backups/mail
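The reason the mirror lands in /backups/mail/mail is that the source path has no trailing slash. The following local demonstration is a sketch only: it uses throw-away directories instead of a remote host, all paths are hypothetical, and it skips itself if rsync is not installed.

```shell
#!/bin/sh
# Local demonstration of how the source path decides where the mirror lands.
demo=$(mktemp -d)
mkdir -p "$demo/home/mail" "$demo/backups/a" "$demo/backups/b"
echo "message" > "$demo/home/mail/inbox"

if command -v rsync >/dev/null 2>&1; then
    # No trailing slash: the directory itself is copied -> backups/a/mail/inbox
    rsync -r --delete "$demo/home/mail" "$demo/backups/a"

    # Trailing slash: only the contents are copied -> backups/b/inbox
    rsync -r --delete "$demo/home/mail/" "$demo/backups/b"

    # With --delete, a file removed from the source disappears from the
    # mirror on the next run.
    rm "$demo/home/mail/inbox"
    rsync -r --delete "$demo/home/mail" "$demo/backups/a"

    find "$demo/backups" | sort
else
    echo "rsync is not installed; skipping the demonstration" >&2
fi
```

The last step is also a reminder of why a mirror alone is not a backup: a file deleted by mistake is gone from the mirror after the next run.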
I recommend that you use the ssh protocol to ensure the secrecy of your data while it is being transferred. If you are performing this backup on a closed network, feel free to use the older rsh protocol or rsync's own daemon. Using networked backups creates one more problem: we want this to be automatic, with no user interaction, but rsh and ssh generally require a password to be entered. We will overcome this by using public/private keys without pass-phrases.
This article is not intended as a tutorial on ssh so I will only provide a brief instruction on setting up private/public key authentication using ssh. Please refer to the ssh documentation for a more thorough discussion.
The following two commands will set up password-less authentication from your computer to mycomputer.mycompany.com:
$ ssh-keygen -b 4096 -t rsa -f /home/username/.ssh/id_rsa
(do not enter a pass-phrase - leave it blank)
$ scp /home/username/.ssh/id_rsa.pub username@mycomputer.mycompany.com:/home/username/.ssh/authorized_keys
Usually any problems encountered are down to the permissions of the various key files. Use ssh in verbose mode (ssh -v) and check the ssh daemon logs on both machines (usually /var/log/secure).
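Be aware that copying the public key directly onto authorized_keys replaces any keys already stored there. A more careful alternative, sketched below, appends the key and sets the permissions ssh expects. The function name is illustrative and the demonstration uses dummy key lines in a scratch directory; on a real system the function would run on the remote machine.

```shell
#!/bin/sh
# Sketch: add a public key to an authorized_keys file without overwriting
# the keys that are already there. install_pubkey is an illustrative name.
install_pubkey() {
    pubkey_file=$1
    auth_file=$2

    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"

    # Append only if this exact key line is not already present.
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# Demonstration in a scratch directory with dummy key lines.
scratch=$(mktemp -d)
mkdir -p "$scratch/ssh"
echo "ssh-rsa AAAAexamplekey username@home" > "$scratch/id_rsa.pub"
echo "ssh-rsa AAAAotherkey someone@else" > "$scratch/ssh/authorized_keys"

install_pubkey "$scratch/id_rsa.pub" "$scratch/ssh/authorized_keys"
install_pubkey "$scratch/id_rsa.pub" "$scratch/ssh/authorized_keys"   # no duplicate
cat "$scratch/ssh/authorized_keys"
```

OpenSSH also ships a helper, ssh-copy-id, that does a similar job over the network, for example: ssh-copy-id -i /home/username/.ssh/id_rsa.pub username@mycomputer.mycompany.com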
In using this method it is important for you to be aware of the security concerns that arise. The ssh-keygen command produced two files:
/home/username/.ssh/id_rsa: the private key
/home/username/.ssh/id_rsa.pub: the public key
The private key should have the permissions -rw------- (i.e., only readable by the owner). This file is the equivalent of having a text file containing your login password to your account at mycomputer.mycompany.com; anyone who gets their hands on this file will be able to log into that account without knowing your password. However, any potential hacker must first gain access to your home computer in order to get at this file.
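The permissions of the private key are easy to check from a script. The sketch below assumes the GNU version of stat found on Linux; the function name is illustrative and the demonstration runs on a dummy file rather than a real key.

```shell
#!/bin/sh
# Sketch: warn if a private key is accessible by anyone but its owner.
# Assumes GNU coreutils stat; check_private_key is an illustrative name.
check_private_key() {
    key=$1
    mode=$(stat -c '%a' "$key") || return 2
    case "$mode" in
        600|400)
            echo "OK: $key has mode $mode"
            ;;
        *)
            echo "WARNING: $key has mode $mode; run: chmod 600 $key" >&2
            return 1
            ;;
    esac
}

# Demonstration on a dummy file standing in for id_rsa.
keydir=$(mktemp -d)
touch "$keydir/id_rsa"
chmod 644 "$keydir/id_rsa"
check_private_key "$keydir/id_rsa" || chmod 600 "$keydir/id_rsa"
check_private_key "$keydir/id_rsa"
```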
If you use this method you should also consider the following security measures:
cron is an integral part of most Linux distributions. It is used to execute commands at specific times according to a schedule you set. We will use it to set up a nightly mirror of all the files we wish to back up, and to create the snapshots at the intervals we determined in section 3.
Each user on a Linux system has their own cron table ("crontab") which contains the schedule of commands. This can be listed using 'crontab -l', removed with 'crontab -r' and edited with 'crontab -e'. Let's add the daily mirror command so that it occurs at 2am every day by placing the following in our crontab:
00 02 * * * rsync -r -e ssh --delete /home/username/mail username@mycomputer.mycompany.com:/backups/mail
where the five fields (00 02 * * *) are (respectively):
Field | Allowed Values
---|---
minute | 0-59
hour | 0-23
day of month | 1-31
month | 1-12
day of week | 0-7*
(* 0 or 7 is Sunday)
So, in our case, we will mirror the contents of /home/username/mail at 02:00 on every day of every month. We can place similar entries for all the other directories we wish to mirror. Alternatively, we could create a script containing all the entries and use cron to execute that script.
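Such a script might look like the sketch below. The directory list and the DRY_RUN switch are assumptions added for illustration; by default the script only prints the rsync commands, so it can be tried without touching the network.

```shell
#!/bin/sh
# Sketch of a nightly mirror script for cron to run, replacing one crontab
# entry per directory. DIRS and DRY_RUN are illustrative assumptions.
REMOTE="username@mycomputer.mycompany.com"
DIRS="mail thesis"
DRY_RUN=${DRY_RUN:-1}       # 1 = only print the commands, 0 = run them

mirror_all() {
    rc=0
    for dir in $DIRS; do
        set -- rsync -r -e ssh --delete "/home/username/$dir" "$REMOTE:/backups/$dir"
        if [ "$DRY_RUN" -eq 1 ]; then
            echo "$@"
        else
            "$@" || rc=1    # carry on with the other directories, remember the failure
        fi
    done
    return "$rc"
}

mirror_all
```

Saved under a name of your choosing (say /home/username/bin/mirror.sh, a hypothetical path) and made executable, it would need a single crontab entry such as: 00 02 * * * DRY_RUN=0 /home/username/bin/mirror.sh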
There are two useful environment variables you can also set when editing the crontab to override the defaults:
SHELL=/bin/sh
MAILTO=username
MAILTO is important because error messages are only sent by e-mail, so this is how you will be notified if your backups are failing. Refer to the crontab man page for more information and examples.
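Because cron mails whatever output a job produces, it can help to keep successful runs silent so that a message only arrives when something has gone wrong. One way to do this is sketched below; the function name is illustrative and the demonstration uses harmless stand-in commands rather than a real backup.

```shell
#!/bin/sh
# Sketch: run a command silently and print its output only if it fails.
# run_quiet is an illustrative name.
run_quiet() {
    if output=$("$@" 2>&1); then
        return 0
    else
        rc=$?
        echo "FAILED (exit $rc): $*"
        echo "$output"
        return "$rc"
    fi
}

# Demonstration: one command that succeeds and one that fails.
ok_report=$(run_quiet true)
bad_report=$(run_quiet ls /nonexistent-directory-for-demo)
bad_status=$?

echo "on success: '${ok_report}'"
echo "on failure: ${bad_report}"
```

In a backup script the rsync command would take the place of the stand-ins, for example: run_quiet rsync -r -e ssh --delete /home/username/mail username@mycomputer.mycompany.com:/backups/mail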
Now that we have the basics of rsync and cron, all we have left to do is to put them together to create our backup policy. Let's continue with the example where your home computer is sending its daily mirror to your office computer. Your office computer will now be responsible for the remainder of the backup policy: the snapshots at the predefined intervals. We will use another crontab on the office machine to accomplish this and I will demonstrate using the schedule for my thesis from section 3.
The method is quite simple. For example, every Sunday we will move the 2 week old snapshot to the 3 week old, the 1 week old to the 2 week old, and archive the mirror to the 1 week old; on the first of each month the 3 week old snapshot becomes the 1 month old snapshot. So, depending on the day of the week, the 1 week old snapshot could be as young as a few hours or as old as 1 week.
My schedule requires snapshots that are 1, 2, and 3 weeks old and 1, 2, 3, 4, and 6 months old. We will work from the oldest down (as otherwise we would only be propagating the new snapshot):
# Back up thesis files with snapshots of 6, 4, 3, 2, 1 months and 3, 2, 1 weeks
# Order 4m->6m, 3m->4m, 2m->3m, 1m->2m, 3w->1m, 2w->3w, 1w->2w, mirror->1w
# At 3am on the 1st of Jan,Mar,May,Jul,Sep,Nov copy the 4m to the 6m
00 03 1 1,3,5,7,9,11 * cp -f /backups/thesis/backup/4month.tar.gz /backups/thesis/backup/6month.tar.gz
# At 3.02am on the 1st of every month copy the 3m to the 4m (and continue for other months)
02 03 1 * * cp -f /backups/thesis/backup/3month.tar.gz /backups/thesis/backup/4month.tar.gz
04 03 1 * * cp -f /backups/thesis/backup/2month.tar.gz /backups/thesis/backup/3month.tar.gz
06 03 1 * * cp -f /backups/thesis/backup/1month.tar.gz /backups/thesis/backup/2month.tar.gz
08 03 1 * * cp -f /backups/thesis/backup/3week.tar.gz /backups/thesis/backup/1month.tar.gz
# And then every Sunday take care of the weekly snapshots and the archiving of the mirror
10 03 * * 0 cp -f /backups/thesis/backup/2week.tar.gz /backups/thesis/backup/3week.tar.gz
12 03 * * 0 cp -f /backups/thesis/backup/1week.tar.gz /backups/thesis/backup/2week.tar.gz
14 03 * * 0 rm -f /backups/thesis/backup/1week.tar.gz
16 03 * * 0 tar zcf /backups/thesis/backup/1week.tar.gz /backups/thesis/thesis/*
And that, my friends, is your automatic, hassle-free, and effective backup system.
A few points on the above:
1week.tar.gz to 2week.tar.gz, 3week.tar.gz, etc) to prevent unnecessary error messages
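The weekly part of the rotation can also be written as a single script, which makes the ordering explicit and avoids errors while the older archives do not exist yet. This is a sketch rather than the crontab shown above: it uses mv instead of cp, skips archives that are missing, and archives the mirror with tar -C; the function names and demonstration paths are assumptions.

```shell
#!/bin/sh
# Sketch of the weekly rotation as one script. Missing archives are
# skipped, so the first few runs produce no "No such file" errors.

# Move $1 to $2, but only if $1 exists.
shift_snapshot() {
    if [ -f "$1" ]; then
        mv -f "$1" "$2"
    fi
}

# Usage: rotate_weekly BACKUP_DIR MIRROR_DIR
rotate_weekly() {
    backup_dir=$1
    mirror_dir=$2
    # Oldest first; otherwise the newest archive would overwrite every slot.
    shift_snapshot "$backup_dir/2week.tar.gz" "$backup_dir/3week.tar.gz"
    shift_snapshot "$backup_dir/1week.tar.gz" "$backup_dir/2week.tar.gz"
    tar -czf "$backup_dir/1week.tar.gz" -C "$mirror_dir" .
}

# Demonstration: three "weeks" of changes in a scratch directory.
demo=$(mktemp -d)
mkdir -p "$demo/backup" "$demo/thesis"
for week in 1 2 3; do
    echo "draft $week" > "$demo/thesis/chapter.txt"
    rotate_weekly "$demo/backup" "$demo/thesis"
done
ls "$demo/backup"
```

After the third run the oldest draft sits in 3week.tar.gz and the newest in 1week.tar.gz. The monthly steps could be added in the same way, again working from the oldest snapshot down.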
Anacron is a periodic command scheduler similar to some uses of cron, but it does not assume that the system is running continuously. It can therefore be used for our backup policy on systems that don't run 24 hours a day. Just like rsync and cron, Anacron is now part of most standard Linux distributions.
Every time Anacron is run, it reads a configuration file that specifies the jobs Anacron controls, and their periods in days. If a job wasn't executed in the last n days, where n is the period of that job, Anacron executes it. The configuration file is usually /etc/anacrontab.
For the daily mirroring we could add a line to this configuration file such as:
1 20 mirror rsync -r -e ssh --delete /home/username/thesis username@mycomputer.mycompany.com:/backups/thesis
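where the fields mean:
1
- The period, in days, between runs of the job.
20
- The delay, in minutes, that Anacron waits before running the job.
mirror
- A unique identifier for the job.
rsync...
- The command to execute.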
And similarly on the backup machine we would place the following in the Anacron configuration file:
# Back up thesis files with snapshots of 6,4,3,2,1 months and 3,2,1 weeks
# Order 4m->6m, 3m->4m, 2m->3m, 1m->2m, 3w->1m, 2w->3w, 1w->2w, mirror->1w
# Every 60 days (2 months)
60 20 bup1 cp -f /backups/thesis/backup/4month.tar.gz /backups/thesis/backup/6month.tar.gz
# Every 30 days (1 month)
30 22 bup2 cp -f /backups/thesis/backup/3month.tar.gz /backups/thesis/backup/4month.tar.gz
30 24 bup3 cp -f /backups/thesis/backup/2month.tar.gz /backups/thesis/backup/3month.tar.gz
30 26 bup4 cp -f /backups/thesis/backup/1month.tar.gz /backups/thesis/backup/2month.tar.gz
30 28 bup5 cp -f /backups/thesis/backup/3week.tar.gz /backups/thesis/backup/1month.tar.gz
# And every 7 days
7 30 bup6 cp -f /backups/thesis/backup/2week.tar.gz /backups/thesis/backup/3week.tar.gz
7 32 bup7 cp -f /backups/thesis/backup/1week.tar.gz /backups/thesis/backup/2week.tar.gz
7 34 bup8 rm -f /backups/thesis/backup/1week.tar.gz
7 36 bup9 tar zcf /backups/thesis/backup/1week.tar.gz /backups/thesis/thesis/*
A few notes on this:
For a more professional backup solution:
Get advance notification before your hard disk fails:
Barry O'Donovan graduated from the National University of Ireland, Galway with a B.Sc. (Hons) in computer science and mathematics. He is currently completing a Ph.D. in computer science with the Information Hiding Laboratory, University College Dublin, Ireland in the area of audio watermarking.
Barry has been using Linux since 1997 and his current flavor of choice is Fedora Core. He is a member of the Irish Linux Users Group. Whenever he's not doing his Ph.D. he can usually be found supporting his finances by doing some work for Open Hosting, in the pub with friends or running in the local park.