Tar, rsync and other backup tools

From docwiki

Motivation

To keep a copy of your data in a different place you do not always need complicated backup software. Often the small tools that come with any Linux distribution are simple to use and do the job. Here you will see a few examples of how to use them.

The Problem of Backup

When you have not dealt with backup before, your first thought will be: I need a copy of my data. But it is not that simple. Besides a copy of your data you will also want the metadata: when the file was created, special files (like symbolic links) and the permissions, i.e. who is allowed to access what. This is especially true if you want a backup of e.g. the file server of your company. Just imagine that you lose your data and need to restore from backup, only to find out that now all 500 people in your company can access all files from everyone else. For sure some of them will not be happy about this.

Or, if you want a backup of your system and want to restore it, the restored system will not work if all file permissions are wrong and all special files like device files and symbolic links are missing.

Furthermore, you often want more than one version of your files. Anna from the Accounting department messed up her file and overwrote it with some garbage last month but only found out today. So it does not help much here if you only have a backup of the files from last week.

You also want your backup to be in a different place than your normal computer. E.g. if someone steals your computer then they will also steal the external drive that is next to it. The same goes for fire and similar disasters.

tar

tar (short for Tape ARchive) was originally used to write data to magnetic tapes. While it can still be used for that, today it is mostly used as a file archiver similar to ZIP. Tar by itself bundles many files and directories into one file without compression. You can then compress this file with a tool like gzip or bzip2, which can only compress but not bundle files into an archive. Modern tar versions support this compression directly, so you can create a file that is both tarred and compressed in one step. Usually that would be e.g. a .tar.gz or .tar.bz2 file. Often .tgz is used instead of .tar.gz. In fact the files can have any extension. Using the ones mentioned here is just a convention.

Here are a few examples of using tar:

$ tar cfvz dip.tgz Diplomarbeit/ 
$ tar cfvz bilder.tgz  *.png *.jpg 

This creates an archive file named dip.tgz with all the content of the Diplomarbeit/ folder (and sub folders) and an archive bilder.tgz with all .png and .jpg files from the current directory.

The option c says: create. The f means: put it in a file instead of tape. The v is for verbose - tell us what you are doing and the z means: compress it with gzip.

$ tar tfvz bilder.tgz

This would test the file by listing the contents of the archive.

$ mkdir test
$ cd test
$ tar xfvz dip.tgz

The above would extract the files into the current directory. When you extract files it is always a good idea to do this in its own directory so that you do not overwrite other things. Some archives (like the dip.tgz) are built so that they extract into their own subdirectory but others (e.g. the bilder.tgz) will mess up your current directory with a lot of files.
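The advice above can be tried out with a throwaway archive. The following is only a sketch: the temporary directory and the file names a.png and b.jpg are made up for illustration, and the -C option (change into a directory before extracting) is used in place of the separate cd step.

```shell
# Build a small throwaway archive so the example is self-contained.
work=$(mktemp -d)
cd "$work"
touch a.png b.jpg
tar czf bilder.tgz *.png *.jpg

# Look before you extract: does the archive unpack into its own subdirectory?
tar tzf bilder.tgz

# Extract into a separate directory; -C changes into it before unpacking.
mkdir test
tar xzf bilder.tgz -C test
ls test
```

Listing the archive first shows whether the names start with a common directory. If they do not, extracting into a fresh directory keeps the current one clean.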

Incremental Backups

When you have a lot of files to back up, you may not have enough space to make a full copy every day, and it would also take a long time. So you might choose to make incremental backups during the week and a full backup only on the weekend. You can tell tar to take only files newer than a given date or newer than some other file. GNU tar also offers a way to keep a file that tracks the state of the backed-up files, which handles incremental backups more reliably. Beware when relying on timestamps: if you restore e.g. an old folder of pictures into your directory, the restored files keep their old timestamps and thus may not end up in a date-based incremental backup of that directory.
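A minimal sketch of the state-file approach, assuming GNU tar (the --listed-incremental option is a GNU extension). The directory data and the file names full.tgz, incr1.tgz and state.snar are made up for this example.

```shell
work=$(mktemp -d)
cd "$work"
mkdir data
echo one > data/a.txt

# Full backup: state.snar does not exist yet, so everything is archived
# and tar records the state of the files in state.snar.
tar --create --gzip --file=full.tgz --listed-incremental=state.snar data

# Something changes during the week.
sleep 1
echo two > data/b.txt

# Incremental backup: only what changed since state.snar was last updated.
tar --create --gzip --file=incr1.tgz --listed-incremental=state.snar data
tar tzf incr1.tgz
```

To restore, you would extract the full archive first and then the incremental ones in the order they were created. For the simpler date-based variant, GNU tar has the --newer option, with the timestamp caveat described above.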

You can also tell tar to take the list of files that should be in the backup from another file. You would first create the list with some other tool and then tell tar: take this list and back up everything in it.
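The two steps could look roughly like this. It is a sketch that assumes GNU tar and GNU touch; the Pictures directory, its files and the name filelist.txt are invented for the example.

```shell
work=$(mktemp -d)
cd "$work"
mkdir Pictures
touch Pictures/new.jpg "Pictures/with space.jpg"
touch -d '10 days ago' Pictures/old.jpg

# Step 1: another tool writes the list. Here find picks the files
# that were modified within the last 3 days.
find Pictures/ -type f -mtime -3 > filelist.txt

# Step 2: tar reads the names to archive from that file
# (-T is the short form of --files-from).
tar czf recent.tgz -T filelist.txt
tar tzf recent.tgz
```

A list with one name per line breaks on file names that contain a newline. For such cases find can write NUL-separated names with -print0, which GNU tar reads when --null is given together with -T.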

find

A list of filenames can easily be created with the find tool. This is useful for a lot of things, not just creating lists for tar. Here are a few examples of find:

$ find .   # lists everything below your current directory
$ find /etc  # lists everything in /etc (where you have access)
$ find Doc/ -size +5M  # lists all files bigger than 5MB
$ find Pictures/ -mtime -3 # lists all files in Pictures modified within the last 3 days
$ find Pictures/ -mtime +365  # lists all files in Pictures older than a year
$ find Pictures/ -name \*.jpg   # only searches for jpg files in Pictures

You can even have find run commands on the files it finds, or pipe its output to xargs, which gives you a lot of control over which commands to run on those files.
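Both variants can be sketched as follows. The Pictures directory and its files are made up for the example, and the -print0 and -0 options assume the GNU (or a compatible) version of find and xargs.

```shell
work=$(mktemp -d)
cd "$work"
mkdir Pictures
touch Pictures/a.jpg "Pictures/b c.jpg" Pictures/notes.txt

# -exec runs a command on the matches. {} stands for the file names and
# the closing + passes many names to a single ls call.
find Pictures/ -name '*.jpg' -exec ls -l {} +

# The same idea with xargs. -print0 and -0 separate the names with a
# NUL byte, so names containing spaces survive the pipe.
find Pictures/ -name '*.jpg' -print0 | xargs -0 du -ch

# -n 1 runs the command once per file name.
count=$(find Pictures/ -name '*.jpg' -print0 | xargs -0 -n 1 echo | wc -l)
echo "$count jpg files"
```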

rsync

rsync is an extremely useful tool for backup. rsync (Remote Synchronization) allows you to keep a copy of your files in a remote place and to update the remote copy incrementally. Of course, rsync can also do this on the local computer in a different directory. E.g.:

$ rsync -Hxa --delete . /mnt/backup

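This would keep a copy of the current directory and everything below it (notice the dot) in the directory /mnt/backup (where e.g. you have an external USB drive mounted). The -a option (archive) preserves permissions, timestamps and symlinks, -H preserves hardlinks and -x keeps rsync on one filesystem. There are many more options that control how the copy is made. With the --delete option you tell rsync to delete files from the destination if they no longer exist in the source.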

$ rsync -a Documents/ anna@192.168.19.21:/var/www/Documents-anna/

The above example would use ssh to keep a remote directory synchronized with your local directory.
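Because --delete removes files, it can help to try the command locally first. The following is a sketch with made-up directories (Documents and backup inside a temporary directory) that uses the -n (--dry-run) option to preview the transfer.

```shell
work=$(mktemp -d)
cd "$work"
mkdir -p Documents backup
echo hello > Documents/letter.txt

# -n (--dry-run) only reports what would be transferred; nothing is written.
rsync -a -n -v --delete Documents/ backup/

# The trailing slash on the source means: copy the contents of Documents
# into backup/, not the directory Documents itself.
rsync -a --delete Documents/ backup/
ls backup

# After a file is removed from the source, --delete removes it from the copy too.
rm Documents/letter.txt
echo new > Documents/report.txt
rsync -a --delete Documents/ backup/
ls backup
```

Without the trailing slash on the source, rsync would create backup/Documents and copy the files there.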

symlinks, hardlinks, bind mounts, sparse files and snapshots

If you are diving a bit deeper into backups then you will come across terms like symlinks, hardlinks, etc. and the question of how backup software handles them. Here is a short overview of what they are and why you would care:

Symlinks

A symlink (short for symbolic link) is a special file that points to another file. The target can be anywhere. It can even be that the target does not exist anymore and the symlink points to a non-existing file (a so-called dangling symlink). Symlinks are useful as a kind of bookmark to other places in the filesystem.

$ cd Desktop/
$ ln -s /var/www/mywebsite/files/ webfiles
$ ls -l webfiles
lrwxrwxrwx 1 anna staff 8 Apr  3 11:46 webfiles -> /var/www/mywebsite/files

So the above would create a link in your Desktop directory that points to files that you might have on your webserver. The ln -s creates that link.

Good backup software has options to specify how to deal with symlinks: whether you want to follow them and back up the data behind them, or want your backup to contain only the links themselves.
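With tar, for instance, the two choices look roughly like this. The sketch uses made-up directories inside a temporary directory; the -h option (--dereference) is the one that makes tar follow symlinks.

```shell
work=$(mktemp -d)
cd "$work"
mkdir -p target Desktop
echo data > target/file.txt
ln -s "$work/target" Desktop/webfiles

# Default: the archive stores only the link itself.
tar cf links.tar Desktop
tar tvf links.tar

# With -h (--dereference) tar follows the link and stores the files behind it.
tar chf followed.tar Desktop
tar tvf followed.tar
```

The first listing shows webfiles as a link with its target, the second shows it as a directory that contains file.txt.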

Hardlinks

Hardlinks are also links to other files but they are different: They are additional directory entries to the same file. A file on unix is defined by the filesystem (where it is mounted) and the internal inode number. You can view the inode number with the -i option of the ls command.

With the ln command you can create an additional directory entry for the same file (inode). The permissions are stored in the inode and the inode knows where the data of that file is. Hardlinks can only point to files on the same filesystem. Each file keeps track of how many links point to it. Only when the last link is deleted, then the data will also be deleted. The second (or third, ..) directory entry is not distinguishable from the first one. E.g:

$ touch bla.txt
$ ln bla.txt bli.txt
$ ls -li *.txt
1966640 -rw-r--r-- 2 mond mond 0 Apr  3 11:53 bla.txt
1966640 -rw-r--r-- 2 mond mond 0 Apr  3 11:53 bli.txt
$ rm bla.txt 
$ ls -li *.txt
1966640 -rw-r--r-- 1 mond mond 0 Apr  3 11:53 bli.txt

The above creates an empty file with the touch program (0 bytes long) with the name bla.txt. It then creates the hardlink with the name bli.txt. An ls with the -i option shows us that both have the same inode number 1966640 and thus are one and the same file. It also shows the number of 2 as the count of links. So you know that this file has an additional hardlink to it. If we delete the first file bla.txt the bli.txt is still there but the link count is reduced to 1.

Good backup software has options that tell it how to deal with hardlinks: when you have 3 links to a 100GB file and you back it up, you do not want it to use up 300GB after the restore.
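As a sketch of what that means in practice, the following creates a small file with a second hardlink and sends it through GNU tar, which stores the second name as a link and not as a second copy. The directory and file names are made up for the example.

```shell
work=$(mktemp -d)
cd "$work"
mkdir data restore
head -c 1048576 /dev/zero > data/big.bin   # 1 MiB test file
ln data/big.bin data/big-link.bin

# du counts the shared data only once, although two names point to it.
du -sh data

# In the listing, one of the two names shows up as a "link to" the other.
tar cf data.tar data
tar tvf data.tar

# After the restore both names share one inode again.
tar xf data.tar -C restore
ls -li restore/data
```

With rsync, the -H option shown in the rsync section serves the same purpose.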

Sparse files

In some cases you can have files that look huge but actually contain mostly empty blocks (all zeros) which the filesystem does not store on disk. This could e.g. be database files. When you back them up they might take a lot of space (or not, if the backup is compressed). But when you restore them, the zero blocks would actually be written and the files could take up much more space than originally. Good backup software can deal with sparse files.
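The effect can be observed with a small experiment. This is a sketch that assumes GNU coreutils (truncate) and GNU tar, whose -S option (--sparse) handles such files; the file names are invented and the sizes reported by du depend on the filesystem.

```shell
work=$(mktemp -d)
cd "$work"

# Create a 10 MiB file that consists only of a hole (no data is written).
truncate -s 10M sparse.img

ls -l sparse.img     # apparent size of the file
du -k sparse.img     # space actually allocated on disk

# -S (--sparse) makes tar record the holes instead of storing the zeros.
tar cSf sparse.tar sparse.img
ls -l sparse.tar

# On extraction the holes are recreated where the filesystem supports it.
mkdir restore
tar xf sparse.tar -C restore
du -k restore/sparse.img
```

rsync has a comparable -S (--sparse) option.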

Snapshots

The newer Linux filesystem called btrfs (pronounced "butter F S" or "b tree F S") offers filesystem snapshots. ZFS, which has been ported from OpenSolaris, supports them as well. Snapshots make it possible to keep older versions of files frozen within the filesystem. This is useful if you accidentally delete or overwrite a file. Of course you still need a full backup in case of disaster.

Exercises

  • play around with the options of tar and create, test and extract an archive.
  • play around with find to create listings of files.
  • use ln -s to create some useful symlinks.