$ whoami

Ben RobertsI currently work as a systems administrator for GSA Capital, managing Unix, Linux and Windows systems and providing IT support. I rely heavily on automation and open source technologies in my daily activities.

I previously worked as a Network Projects engineer for Atos in the Major Projects division. My role included design and implementation of LAN and WAN systems in Data Centre environments. My last project involved working on a major internal infrastructure overhaul spanning five sites for which I produced all the physical layer design and implementation.

Before then, I worked for Netcraft in Bath, while taking a year out from my degree studies. My roles included developing and running the SSL Server Survey, reviewing Automated Vulnerability Scan results, and performing occasional penetration tests against web applications for financial institutions.

I am a former graudate of ECS (University of Southampton), earning a First in Computer Science with Distributed Systems and Networks to the Masters level.

In my spare time, I am a member of the Sabayon Linux developers group, where I help maintain the Sabayon Community Repositories, help look after the project infrastructure, maintain the puppet-sabayon module, and dabble with package maintenance.

You can contact me via me@benroberts.net.

Recent Posts

Managing volume usage in Bacula

Overview

I’ve been a user of Bacula for several years now, managing a large deployment for work. In this case, by large I mean multiple petabytes of data tracked by the catalog across tens of thousands of volume files. The past few years have seen several incremental improvements that now mean for the most part the setup hums along, requiring very little maintenance.

The three typical tasks that require my attention are:

  • Regular test restores (a backup is only as good as your restore process and ability to execute it)
  • Resizing pool limits as backup sizes change over time
  • Bringing additional storage arrays online when growth outstrips hardware capacity every couple of years or so.

Scaling a system up to these levels requires a little bit more thought on the design up front. I’ve been meaning to write up some of the things I’ve learned along the way and this will probably turn into a series of posts. This first one is about volume and pool management.

Executive summary

  • Don’t try to micromanage Bacula’s volume usage unless you have a compelling reason to.
    • Don’t try and force volumes to have particular names based on the contents. Use the catalog to tell you which volumes your data resides in. It saves you time/energy, and maximises storage efficiency. Let the computer deal do the grunt work.
  • Create pools based on Volume retention only for maximum storage efficiency. Subdivide only based on storage medium and location. E.g. Disk-30Day-ServerA
  • Set Maximum Volume Bytes to set a fixed upper limit on the size of volumes
  • Avoid using any other volume size limits, such as Use Volume Once, Maximum Volume Jobs or Maximum Use Duration which will reduce your overall capacity
  • Set Maximum Pool Volumes on each pool to stop any one pool from consuming all volumes
  • The sum total of Maximum Pool Volumes should not exceed the total number of volumes available.
  • Enable Recycling (always) and Auto Prune (where possible)
  • Use the Scratch Pool and Recycle Pool directives to free yourself from manual rebalancing tasks
  • Consider pre-creating your volumes, and monitoring free capacity based on the number of volumes remaining in the Scratch Pool

Volumes

How many and how big?

This one is a balancing act with a few factors to consider. Two important behaviours to note are that:

  • Bacula treats volumes as append-only until they are full. Once the volumes are full, the retention period begins counting down, and space is only ever reclaimed by recycling an entire volume once the retention period has been reached.
  • Bacula tries to hold off deleting your data for as long as possible. It will always prefer to create a new volume rather than recycle and overwrite an existing one. You’ll need to to impose limits to stop Bacula growing forever and running out of disk space. It will then consume all disk space and sit there, only ever freeing up space right before it’s about to be overwritten.

Capacity planning is easiest when disk usage is fairly stable so you can see trends and predict when you’re getting close to the limits. If you run low on disk space, having volumes that are too large will make things hard to manage. Data will be kept around longer than you want, and space will be freed up later and in big chunks which will cause peaks and troughs in your usage graphs.

You should be aiming to fill volumes fairly quickly after they’re first started so there isn’t a long delay before the retention interval starts counting down. Keeping the volumes small enough that they’re recycled frequently and on a regular basis will help keep the disk usage more consistent.

This would suggest using volumes of a fixed size, no bigger than your daily backup load, and expiring them on a regular basis. This has the side effect of meaning your capacity planning is based around the number of available volumes rather than the amount of free disk space in the filesystem.

On the flip side, having volumes that are too small can cause catalog performance issues. When the current volume fills, and Bacula needs to choose the next one, having lots of volumes means there are more to consider. All the time there are completely unused volumes this is a cheap operation, but once all volumes have been used finding the best one to recycle becomes a lot more expensive because Bacula has to consider the age of all jobs on each volumes and work out if there are any volumes that have expired and can be reused. With large numbers of volumes (a couple of thousand), this can become a blocking operation, at least under MySQL, and cause the backup system to slow down unnecessarily.

Because of the second behaviour listed above, Bacula will fill up all volumes very quickly and therefore will run in the constant state that nearly all volumes are used all the time.

I suggest you want to keep the total number of volumes in the range 200-1000 to start off with. That way you can grow 2-10x in size before having to worry about catalog performance becoming a real issue.

I think it’s fairly well understood that storage requirements and hence backup requirements only ever increase over time. My users at least have an uncany ability to fill all disk space given to them and are very reluctant to spend any time and energy cleaning up after themselves. As such, while planning our backups systems we need to make sure to leave plenty of room for growth over time. I find it best to plan for growth in two ways over the lifetime of the system.

  • Spec out the initial requirements so that day 1 load is never more than 50% of the total capacity. Demand for additional storage can happen very suddenly, but purchasing new hardware can take a while to action. By having plenty of spare capacity up front, this means you can double in size without having to spend any time, effort or money on the problem.This means, aim to be using only 50% of your total volumes from the beginning.
  • It’s not that efficient to vastly overspec right from the beginning though; hardware costs always go down over time so it’ll be cheaper to buy the extra capacity you don’t need right now later on. Therefore the second/third expansions should be possible just by throwing more hardware at the problem without having to re-architect the entire solution. This is typically by adding another batch of hard drives to the storage daemon.This means, aim to start with no more than 500 volumes, so you can grow 2-4x by adding more hardware without hitting performance issues.

In terms of how many volumes you will have, this is a function of total storage capacity and volume size. Certain operations can slow down when the filesystem gets close to being full, so it’s also worth leaving a margin of free space so that even when Bacula is at maximum usage there’s still a little bit of free space. Thus, the generalised formula is:

 Total Volumes = (Capacity - Margin) / Volume Size
 

Worked example

Taking a worked example using some nice round numbers, lets say:

  • you have 10TB of capacity available, and we’re going to leave a generous 10% margin.
  • the full backups are 1.5TB in total, and the rate of churn is 60GB/day.
  • you want to keep backups on disk for 30 days.

Let’s start with the assumption that since we use 60GB per day, the volume size could be 20GB, which gives you 450 volumes in total, nicely in the middle of the desired range. You can grow about 4x more again without having to change the design.

You’ll use 3 volumes per day, but every day 3 volumes will expire and be recycled ready for use the next day. Over the course of 30 days, 90 volumes will be used to hold the daily backups.

You won’t want to delete the only copy of the full backups without having taken another one, so you’ll have to budget for having two copies of the full backup on disk at any one time which is 3TB, 150 volumes.

The peak volume usage will be 150+90=240. This is just over half of the total initial volumes so you’ll have room to grow within the original design/hardware spec.

All good!

Pools

Group volumes by retention period and media type/location

Some people like to group their volumes into lots of different pools so they “know where their data is”. In practice this increases manual administration effort and reduces storage efficiency (because with more pools you have more part-filled volumes, some of which might sit idle for some time before being filled). Unless you have a compelling reason to divide the backups into different pools, the best thing is to have as few pools as possible and let Bacula put the data in the next available volume. The catalog keeps track of which backups are in which volumes and the query bconsole command can be used to tell you which volumes a job was written to, or which jobs are on a particular volume.

The admin overhead comes into play again when you run low on disk space. The more pools you have, and the more fragmented the volumes are across pools the more work it is to rebalance the volumes and limits to get more capacity in the pool you need it.

There are some restrictions though, and you can’t just use one pool for everything. If you have multiple Storage Daemons, you’ll need a set of pools for each, likewise you’ll need separate pools for each media type. If you have a single server and single media type, you can get away with just one pool here.

The second reason to divide pools is by retention time. You don’t want short-lived backup jobs being kept on disk longer than necessary because they shared a volume with some long-lived backups, as this will just waste disk space.

My recommendation is to create pools based on these three factors, using names of the format ${Media}-${RetentionPeriod}-${Server}. Some examples:

  • Disk-30Day-ServerA
  • Disk-30Day-ServerB
  • Disk-60Day-ServerB
  • Tape-7Year-LibraryA

Scratch Pool, pre-created volumes

The scratch pool is another excellent labour-saving feature. Rather than pre-assign all your volumes to the different backup pools, you can place all your volumes into the scratch pool, and configure your backup pools to take a new volume from the scratch pool when needed, and return expired volumes back to the scratch pool when recycled. This means your backup pools only ever contain full (or a single part-used) volumes. This is again useful when running low on disk space, since volumes no longer need to be manually moved and relabelled to make space elsewhere.

Rather than let Bacula create volumes on demand, I prefer to pre-create all my volumes when first bringing a storage array online. Since I’ve already worked out what the maximum number of volumes will be, this is easily done with a for loop and the label command. See below for an example.

Monitoring free space

This makes capacity monitoring very easy. The number of volumes in your scratch pool is the amount of free space in your backup system which you can grow into. When this gets low, you know it’s time to reduce your backup load (by reducing the amount of data being backed up, or reducing the retention periods), or increase your hardware capacity.

A simple query.sql bconsole query command can show you the breakdown per media type, and it’s trivial to hook this into nagios or similar monitoring system to alert you when you need to take action.

# 28
:Get the type and count of free volumes in the Scratch pool
SELECT Media.MediaType, COUNT(*) as Count
 FROM bacula.Media LEFT JOIN bacula.Pool ON Media.PoolID=Pool.PoolId
 WHERE Pool.Name="Scratch"
 GROUP BY Media.MediaType;

This will show output like:

+--------------+-------+
| MediaType    | Count |
+--------------+-------+
| File-ServerA | 194   |
| File-ServerB | 324   |
| File-ServerC | 719   |
| LTO6         | 37    |
+--------------+-------+

Putting it all together

That was all very wordy, but the good news is it all boils down to a very small amount of configuration.

Bacula-dir.conf

For example, creating the scratch pool, and one backup pool using the details from earlier:

Pool {
 # Magic name, volumes are taken from this pool
 # when another pool fills up
 Name = Scratch
 Pool Type = Backup
 # The following attributes are applied to the volume when first
 # labelled, and don't update automatically as the volume
 # moves between pools.

 # Allow this volume to be recycled
 Recycle = yes

 # When the volume is recycled, return it automatically to
 # this pool
 Recycle Pool = Scratch
}

Pool {
 Name = Disk-30Day-ServerA
 Pool Type = Backup
 Storage = ServerA-sd 

 # Allow volumes to be automatically reused once expired
 Recycle = yes
 # Take new volumes from the Scratch pool
 Recycle Pool = Scratch
 # Automatically expire old volumes
 Auto Prune = yes
 # Keep data for 30 days
 Volume Retention = 30 days
 # Limit the size and number of volumes
 Maximum Volume Bytes = 20G
 Maximum Volumes = 90

 # Disable Labelling media if pre-creating volumes
 Label Media = no

 # Catch any volumes automatically labelled into this pool
 Recycle = yes
 Recycle Pool = Scratch
}

Pre-creating your volumes

To add the 450 volumes to the ServerA-sd storage daemon, with names like ServerA-file-0001, run something like the following:

for i in $(seq -f "%04g" 1 450); do 
  echo "label pool=Scratch storage=ServerA-sd volume=ServerA-file-${i}" | /opt/bacula/bin/bconsole ;
done

Scaling further

As mentioned earlier, having thousands of volumes can cause performance issues when space runs low and Bacula has to hunt for volumes. It will look for new volumes independently for each blocked job once every 5 minutes. If several jobs block at the same time, not only does the director waste a lot of time scanning the catalog for volumes that probably aren’t going to become free any time soon, but the console also locks up for minutes at a time making it very hard to resolve.

One way to deal with this (which is also described in Bacula’s Best Practices for Disk Backup whitepaper is to disable auto prune and manually prune the volumes once a day instead.

Pool {
# ...
Auto Prune = no
}

Schedule {
 Name = PruningSchedule
 Run = daily at 06:30
}

job {
 Name = "PruneExpiredVolumes"
 Type = Admin
 Messages = Standard

 # Allow this job to run at the same time as any other jobs
 # which will be necessary to unblock them if the system is out
 # of volumes at the time this job is scheduled to run to avoid
 # deadlock
 Allow Mixed Priority = yes
 Priority = 3

 RunScript {
  # Run this console command when the job runs
  Console = "prune expired volume yes"
  RunsOnClient = no
  RunsWhen = Before
 }
 
 # Dummy values, required by the config parser and must exist,
 # but are not used
 Pool = CatalogPool
 Client = DefaultCatalog
 FileSet = CatalogFileSet
 Schedule = PruningSchedule
}
  1. Puppet custom type validation woes Leave a reply
  2. puppet-sabayon Leave a reply
  3. Going Paperless 1 Reply
  4. Removing stale facts from PuppetDB 4 Replies
  5. Setting up hiera-eyaml-gpg 1 Reply
  6. ZFS on Sabayon Leave a reply
  7. Using puppet on Sabayon Linux Leave a reply
  8. Puppetenvsh Mcollective Agent 2 Replies