Managing volume usage in Bacula

Overview

I’ve been a user of Bacula for several years now, managing a large deployment for work. In this case, by large I mean multiple petabytes of data tracked by the catalog across tens of thousands of volume files. The past few years have seen several incremental improvements that now mean for the most part the setup hums along, requiring very little maintenance.

The three typical tasks that require my attention are:

  • Regular test restores (a backup is only as good as your restore process and ability to execute it)
  • Resizing pool limits as backup sizes change over time
  • Bringing additional storage arrays online when growth outstrips hardware capacity every couple of years or so.

Scaling a system up to these levels requires a little bit more thought on the design up front. I’ve been meaning to write up some of the things I’ve learned along the way and this will probably turn into a series of posts. This first one is about volume and pool management.

Executive summary

  • Don’t try to micromanage Bacula’s volume usage unless you have a compelling reason to.
    • Don’t try to force volumes to have particular names based on their contents. Use the catalog to tell you which volumes your data resides in. It saves you time and energy, and maximises storage efficiency. Let the computer do the grunt work.
  • Create pools based on Volume retention only for maximum storage efficiency. Subdivide only based on storage medium and location. E.g. Disk-30Day-ServerA
  • Set Maximum Volume Bytes to set a fixed upper limit on the size of volumes
  • Avoid using any other volume size limits, such as Use Volume Once, Maximum Volume Jobs or Maximum Use Duration which will reduce your overall capacity
  • Set Maximum Volumes on each pool to stop any one pool from consuming all volumes
  • The sum of Maximum Volumes across all pools should not exceed the total number of volumes available.
  • Enable Recycling (always) and Auto Prune (where possible)
  • Use the Scratch Pool and Recycle Pool directives to free yourself from manual rebalancing tasks
  • Consider pre-creating your volumes, and monitoring free capacity based on the number of volumes remaining in the Scratch Pool

Volumes

How many and how big?

This one is a balancing act with a few factors to consider. Two important behaviours to note are that:

  • Bacula treats volumes as append-only until they are full. Once the volumes are full, the retention period begins counting down, and space is only ever reclaimed by recycling an entire volume once the retention period has been reached.
  • Bacula tries to hold off deleting your data for as long as possible. It will always prefer to create a new volume rather than recycle and overwrite an existing one, so you’ll need to impose limits to stop Bacula growing forever and running out of disk space. Even with limits in place, it will consume all the space you allow it and sit there, only ever freeing up space right before it’s about to be overwritten.

Capacity planning is easiest when disk usage is fairly stable so you can see trends and predict when you’re getting close to the limits. If you run low on disk space, having volumes that are too large will make things hard to manage. Data will be kept around longer than you want, and space will be freed up later and in big chunks which will cause peaks and troughs in your usage graphs.

You should be aiming to fill volumes fairly quickly after they’re first started so there isn’t a long delay before the retention interval starts counting down. Keeping the volumes small enough that they’re recycled frequently and on a regular basis will help keep the disk usage more consistent.

This would suggest using volumes of a fixed size, no bigger than your daily backup load, and expiring them on a regular basis. This has the side effect of meaning your capacity planning is based around the number of available volumes rather than the amount of free disk space in the filesystem.

On the flip side, having volumes that are too small can cause catalog performance issues. When the current volume fills and Bacula needs to choose the next one, having lots of volumes means there are more to consider. As long as there are completely unused volumes this is a cheap operation, but once all volumes have been used, finding the best one to recycle becomes a lot more expensive, because Bacula has to consider the age of all jobs on each volume and work out whether any volumes have expired and can be reused. With large numbers of volumes (a couple of thousand), this can become a blocking operation, at least under MySQL, and cause the backup system to slow down unnecessarily.

Because of the second behaviour listed above, Bacula will fill up all volumes very quickly and then run in a steady state where nearly all volumes are in use all the time.

I suggest keeping the total number of volumes in the range 200–1000 to start off with. That way you can grow 2–10x in size before having to worry about catalog performance becoming a real issue.

I think it’s fairly well understood that storage requirements, and hence backup requirements, only ever increase over time. My users at least have an uncanny ability to fill all disk space given to them and are very reluctant to spend any time and energy cleaning up after themselves. As such, while planning our backup systems we need to make sure to leave plenty of room for growth over time. I find it best to plan for growth in two ways over the lifetime of the system.

  • Spec out the initial requirements so that day 1 load is never more than 50% of the total capacity. Demand for additional storage can happen very suddenly, but purchasing new hardware can take a while to action. Having plenty of spare capacity up front means you can double in size without having to spend any time, effort or money on the problem. In other words, aim to be using only 50% of your total volumes from the beginning.
  • It’s not that efficient to vastly overspec right from the beginning though; hardware costs always go down over time, so it’ll be cheaper to buy the extra capacity you don’t need right now later on. Therefore the second and third expansions should be possible just by throwing more hardware at the problem, without having to re-architect the entire solution. This is typically done by adding another batch of hard drives to the storage daemon. In other words, aim to start with no more than 500 volumes, so you can grow 2–4x by adding more hardware without hitting performance issues.

In terms of how many volumes you will have, this is a function of total storage capacity and volume size. Certain operations can slow down when the filesystem gets close to being full, so it’s also worth leaving a margin of free space so that even when Bacula is at maximum usage there’s still a little bit of free space. Thus, the generalised formula is:

 Total Volumes = (Capacity - Margin) / Volume Size
 

Worked example

Taking a worked example using some nice round numbers, let’s say:

  • you have 10TB of capacity available, and you’re going to leave a generous 10% margin.
  • the full backups are 1.5TB in total, and the rate of churn is 60GB/day.
  • you want to keep backups on disk for 30 days.

Let’s start with the assumption that since you use 60GB per day, the volume size could be 20GB, which gives you 450 volumes in total, nicely in the middle of the desired range. You can grow about 4x again without having to change the design.

You’ll use 3 volumes per day, but every day 3 volumes will expire and be recycled ready for use the next day. Over the course of 30 days, 90 volumes will be used to hold the daily backups.

You won’t want to delete the only copy of the full backups without having taken another one, so you’ll have to budget for having two copies of the full backup on disk at any one time which is 3TB, 150 volumes.

The peak volume usage will be 150+90=240. This is just over half of the total initial volumes so you’ll have room to grow within the original design/hardware spec.
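The arithmetic above can be double-checked with a few lines of Python. This is just a sketch of the worked example; the variable names are mine, and decimal units are assumed (1TB = 1000GB):

```python
import math

# Inputs from the worked example
capacity_gb = 10 * 1000       # 10TB of raw capacity
margin = 0.10                 # 10% free-space margin
volume_size_gb = 20           # fixed via Maximum Volume Bytes
full_backup_gb = 1.5 * 1000   # total size of the full backups
churn_gb_per_day = 60         # daily incremental load
retention_days = 30

# Total Volumes = (Capacity - Margin) / Volume Size
total_volumes = int(capacity_gb * (1 - margin) // volume_size_gb)

# Daily usage and steady-state consumption of the incrementals
volumes_per_day = math.ceil(churn_gb_per_day / volume_size_gb)
incremental_volumes = volumes_per_day * retention_days

# Budget for two copies of the full backup on disk at once
full_volumes = math.ceil(2 * full_backup_gb / volume_size_gb)

peak_usage = full_volumes + incremental_volumes
print(total_volumes, volumes_per_day, peak_usage)  # 450 3 240
```

Peak usage of 240 out of 450 volumes is just over half, matching the 50% headroom target from earlier.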

All good!

Pools

Group volumes by retention period and media type/location

Some people like to group their volumes into lots of different pools so they “know where their data is”. In practice this increases manual administration effort and reduces storage efficiency (because with more pools you have more part-filled volumes, some of which might sit idle for some time before being filled). Unless you have a compelling reason to divide the backups into different pools, the best thing is to have as few pools as possible and let Bacula put the data in the next available volume. The catalog keeps track of which backups are in which volumes, and the query command in bconsole can tell you which volumes a job was written to, or which jobs are on a particular volume.

The admin overhead comes into play again when you run low on disk space. The more pools you have, and the more fragmented the volumes are across pools the more work it is to rebalance the volumes and limits to get more capacity in the pool you need it.

There are some restrictions though, and you can’t just use one pool for everything. If you have multiple Storage Daemons, you’ll need a set of pools for each, likewise you’ll need separate pools for each media type. If you have a single server and single media type, you can get away with just one pool here.

The second reason to divide pools is by retention time. You don’t want short-lived backup jobs being kept on disk longer than necessary because they shared a volume with some long-lived backups, as this will just waste disk space.

My recommendation is to create pools based on these three factors, using names of the format ${Media}-${RetentionPeriod}-${Server}. Some examples:

  • Disk-30Day-ServerA
  • Disk-30Day-ServerB
  • Disk-60Day-ServerB
  • Tape-7Year-LibraryA

Scratch Pool, pre-created volumes

The scratch pool is another excellent labour-saving feature. Rather than pre-assign all your volumes to the different backup pools, you can place all your volumes into the scratch pool, and configure your backup pools to take a new volume from the scratch pool when needed, and return expired volumes back to the scratch pool when recycled. This means your backup pools only ever contain full (or a single part-used) volumes. This is again useful when running low on disk space, since volumes no longer need to be manually moved and relabelled to make space elsewhere.

Rather than let Bacula create volumes on demand, I prefer to pre-create all my volumes when first bringing a storage array online. Since I’ve already worked out what the maximum number of volumes will be, this is easily done with a for loop and the label command. See below for an example.

Monitoring free space

This makes capacity monitoring very easy. The number of volumes in your scratch pool is the amount of free space in your backup system which you can grow into. When this gets low, you know it’s time to reduce your backup load (by reducing the amount of data being backed up, or reducing the retention periods), or increase your hardware capacity.

A simple entry in bconsole’s query.sql can show you the breakdown per media type, and it’s trivial to hook this into Nagios or a similar monitoring system to alert you when you need to take action.

# 28
:Get the type and count of free volumes in the Scratch pool
SELECT Media.MediaType, COUNT(*) AS Count
 FROM bacula.Media LEFT JOIN bacula.Pool ON Media.PoolId=Pool.PoolId
 WHERE Pool.Name='Scratch'
 GROUP BY Media.MediaType;

This will show output like:

+--------------+-------+
| MediaType    | Count |
+--------------+-------+
| File-ServerA | 194   |
| File-ServerB | 324   |
| File-ServerC | 719   |
| LTO6         | 37    |
+--------------+-------+
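To give a flavour of the alerting side, here’s a minimal sketch of the check logic. The function name and thresholds are illustrative only; in practice the counts would come from running the query above through bconsole:

```python
def check_scratch(counts, warn=100, crit=50):
    """Return a Nagios-style (exit_code, message) pair based on the
    media type with the fewest free volumes left in the Scratch pool."""
    media, lowest = min(counts.items(), key=lambda kv: kv[1])
    if lowest <= crit:
        return 2, f"CRITICAL: only {lowest} free {media} volumes left"
    if lowest <= warn:
        return 1, f"WARNING: {lowest} free {media} volumes left"
    return 0, "OK: plenty of free volumes"

# Using the counts from the example output above
code, message = check_scratch(
    {"File-ServerA": 194, "File-ServerB": 324, "File-ServerC": 719, "LTO6": 37}
)
print(code, message)  # 2 CRITICAL: only 37 free LTO6 volumes left
```

You’d likely want separate thresholds per media type (37 free LTO6 tapes may be perfectly healthy), but the shape of the check stays the same.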

Putting it all together

That was all very wordy, but the good news is it all boils down to a very small amount of configuration.

bacula-dir.conf

For example, creating the scratch pool, and one backup pool using the details from earlier:

Pool {
 # Magic name, volumes are taken from this pool
 # when another pool fills up
 Name = Scratch
 Pool Type = Backup
 # The following attributes are applied to the volume when first
 # labelled, and don't update automatically as the volume
 # moves between pools.

 # Allow this volume to be recycled
 Recycle = yes

 # When the volume is recycled, return it automatically to
 # this pool
 Recycle Pool = Scratch
}

Pool {
 Name = Disk-30Day-ServerA
 Pool Type = Backup
 Storage = ServerA-sd

 # Allow volumes to be automatically reused once expired
 Recycle = yes
 # Automatically expire old volumes
 Auto Prune = yes
 # Keep data for 30 days
 Volume Retention = 30 days
 # Limit the size and number of volumes
 Maximum Volume Bytes = 20G
 Maximum Volumes = 90

 # Take new volumes from the Scratch pool when needed
 Scratch Pool = Scratch
 # Return volumes to the Scratch pool when recycled (this also
 # catches any volumes automatically labelled into this pool)
 Recycle Pool = Scratch

 # Disable labelling media if pre-creating volumes
 Label Media = no
}

Pre-creating your volumes

To add the 450 volumes to the ServerA-sd storage daemon, with names like ServerA-file-0001, run something like the following:

for i in $(seq -f "%04g" 1 450); do
  echo "label pool=Scratch storage=ServerA-sd volume=ServerA-file-${i}" | /opt/bacula/bin/bconsole
done

Scaling further

As mentioned earlier, having thousands of volumes can cause performance issues when space runs low and Bacula has to hunt for volumes. It will look for new volumes independently for each blocked job once every 5 minutes. If several jobs block at the same time, not only does the director waste a lot of time scanning the catalog for volumes that probably aren’t going to become free any time soon, but the console also locks up for minutes at a time making it very hard to resolve.

One way to deal with this (which is also described in Bacula’s Best Practices for Disk Backup whitepaper) is to disable auto prune and manually prune the volumes once a day instead.

Pool {
 # ...
 Auto Prune = no
}

Schedule {
 Name = PruningSchedule
 Run = daily at 06:30
}

Job {
 Name = "PruneExpiredVolumes"
 Type = Admin
 Messages = Standard

 # Allow this job to run at the same time as any other jobs.
 # This is necessary to avoid deadlock: if the system is out of
 # volumes when this job is scheduled, it must still be able to
 # run so it can unblock them.
 Allow Mixed Priority = yes
 Priority = 3

 RunScript {
  # Run this console command when the job runs
  Console = "prune expired volume yes"
  RunsOnClient = no
  RunsWhen = Before
 }
 
 # Dummy values; these are required by the config parser and
 # must exist, but are not used
 Pool = CatalogPool
 Client = DefaultCatalog
 FileSet = CatalogFileSet
 Schedule = PruningSchedule
}

Puppet custom type validation woes

Since I’ve just lost a full day to troubleshooting this issue, I’m documenting it in case it hits anyone else. In at least puppet versions 4.7.0 and earlier, global type validation cannot be used to ensure the presence of parameters without breaking puppet resource.

Simplified example type:

Puppet::Type.newtype(:entropy_mask) do
  @doc = "Mask packages in Entropy"

  ensurable

  newparam(:name) do
    desc "Unique name for this mask"
  end

  newproperty(:package) do
    desc "Name of the package being masked"
  end

  validate do
    # This will break for `puppet resource`
    raise(ArgumentError, "Package is required") if self[:package].nil?
  end
end

This works fine for validating that the `package` parameter is provided in a puppet manifest, but not when puppet resource interrogates the state of the existing system, due to the way the object is constructed.

Puppet calls `provider.instances` to obtain a list of the resources on the system managed by that provider. In the example above, the provider was a child of ParsedFile and took care of parsing the contents of /etc/entropy/packages/package.mask, splitting the lines into the various properties, including `package`.

Puppet then tries to create an instance of the Type/resource for each object retrieved by the provider. It does so by instantiating an instance of the type, but passing in the namevar and provider only. It then attempts to iterate through all the provider properties and set them on the resource one by one. The problem is that validation happens on the call to new() and so the required properties have not yet been set.

Here’s the code from lib/puppet/type.rb from puppet 4.7.0, with irrelevant bits stripped out and comments added by me:

def self.instances
  # Iterate through all providers for this type
  providers_by_source.collect do |provider|
    # Iterate through all instances managed by this provider
    provider.instances.collect do |instance|
      # Instantiate the resource using just the namevar and provider
      result = new(:name => instance.name, :provider => instance)
      # Oops, type.validate() got called here, but not all properties
      # have been set yet

      # Now iterate through all properties on the provider and set
      # them on the resource
      properties.each { |name| result.newattr(name) }

      # And add this to the list of resources to return
      result
    end
  end.flatten.compact
end

Of course, once the problem is understood, finding out that someone else already discovered this 2 years ago becomes much easier. Here’s the upstream bug report: https://tickets.puppetlabs.com/browse/PUP-3732

puppet-sabayon

I’ve just uploaded my first puppet module to the forge, optiz0r-sabayon, which improves support for the Sabayon Linux distribution in puppet.

This does the following things:

  • Overrides the operatingsystem fact for Sabayon hosts
  • Adds a provider for the Entropy package manager, and sets this as the default for Sabayon
  • Adds a shim service provider that marks systemd as the default for Sabayon hosts
  • Adds an enman_repo type which manages installation of Sabayon Community Repositories onto a Sabayon host
  • Adds entropy_mask and entropy_unmask types, which manage package masks and unmasks

I’ll add more features as and when I need them. In the meantime, pull requests welcome!

Going Paperless

Too much paper!

My house is full of paperwork. Bank statements, invoices, letters about services. There’s far too much of it, and I’ve never been good at throwing it away in case I need it later on. But physically filing lots of paper requires lots of boxes to be organised, which takes up lots of space and time. Neither of which I have a lot of. So the end result is I have a stack of “loosely-chronologically filed” paperwork on my desk which has been mounting up for a couple of years. And then on the occasion I do need something, I can’t find it, because it’s somewhere in the pile along with everything else. Something needs to change.

Piles of paper

Piles of paper, credit to Shehan Peruma

Ideally, what I want is a simple system where I only have to touch a piece of paper once, and then it magically ends up in the cloud, sorted into folders based on the type or source, searchable by keywords. Then the physical paper can go straight through the shredder and into the bin. Maybe my house will be a bit tidier as a result!

This is not as simple as it sounds though. Firstly, I need a scanner that’s quick and easy to use. I don’t want to have to faff around with software, or manually upload the results anywhere; the more effort required, the less likely I am to stick to the process. Next up, the result of a scan is usually a JPEG file (or sometimes a PDF) containing just the image. To make it really useful, there needs to be some OCR action going on. Then the document needs to be stored somewhere useful so I can manually sort the files into different folders (ideally in as simple a way as possible, such as bulk drag-and-drop while being able to see an image preview).

I do have an always-on local Linux server, and am happy to run whatever software on there is needed to make this work.

The Scanner

I picked up an ION Docuscan, a small, standalone desktop scanner which doesn’t need a PC to operate and writes images directly to an SD card. It’s fairly primitive and doesn’t offer much control over scan settings such as DPI or colour depth, but it’s simple to use and produces acceptable results. Using it is as simple as feeding the sheet into the front and pressing a button. The motor pulls the paper through and shows the end result on a tiny LCD display. If anything goes wrong, e.g. the drum doesn’t grab the paper, or it moves during the scan, I can delete the result using the buttons on the scanner before anything else happens. It doesn’t like anything bigger than A4, but that’s OK for me.

The EyeFi SD Card

It may be a few years old, but I love that this thing exists. It’s a 4GB SD card that contains a small processor with wifi connectivity, and it can be told to upload the files written to it to somewhere else as soon as they are saved. Now we don’t have to spend hundreds or thousands of pounds on a network-attached scanner, or have a giant clunky multifunction printer to get scans stored on the network. Awesome.

The card needs some initial setup, which annoyingly can only be done using Windows or Mac OS X. Ordinarily I don’t run either of these, so I needed to set up a Windows VM to get started. I used the IE testing VM provided by Microsoft inside VirtualBox. I passed the SD card through to the Windows VM (while my laptop has an SD card slot, VirtualBox couldn’t pass that through, so I needed to use the USB SD card reader that shipped with the EyeFi card) and installed the software included on the SD card. It needs to be set to upload pictures to “this computer”, which is an oversimplification in the GUI: it just instructs the card to upload to something on the local network, not necessarily that particular computer. It seems to do some kind of service discovery or network probe to find the actual machine to upload to.

eyefi-setup

EyeFi setup to upload pictures to the local network

We also need to grab the UploadKey from a config file that’s written to the Windows machine, which is what’s used for access control later on, so only this card is allowed to upload images. This is saved to C:\Users\IEUser\AppData\Roaming\Eye-Fi\Settings.xml and is fairly obviously near the top of that file.

Software

This is where things get a little interesting: there’s no one-size-fits-all solution here, and it took me quite a few hours to get something working. I’m documenting the process for two reasons: firstly, it may be useful to someone later on; secondly, that person is likely to be me!

Eyefiserver2

We need to run a copy of eyefiserver2 on the local Linux machine for the card to upload to. This listens on a fixed high-numbered port, and using some kind of service discovery magic, the EyeFi card will find it and send the scanned images to it. The eyefiserver will store the images on a local path and run a script for us after each image is saved.

I installed the copy from silmano’s overlay which works fine but only ships with an openrc initscript. I added the following files for systemd support:

# cat /etc/systemd/system/eyefiserver.service
[Unit]
Description=EyeFi Server

[Service]
Type=simple
User=ben
Group=users
PIDFile=/var/run/eyefiserver/eyefiserver.pid
ExecStart=/usr/bin/eyefiserver --conf=/etc/eyefiserver.conf --log=/var/log/eyefiserver/eyefiserver.log

[Install]
WantedBy=multi-user.target

# cat /etc/tmpfiles.d/eyefiserver.conf
D /var/run/eyefiserver 0755 ben users
D /var/log/eyefiserver 0755 ben users

We also need to configure eyefiserver with our EyeFi card’s upload key, where to save the files, and the script to run:

# cat /etc/eyefiserver.conf
[EyeFiServer]
upload_key=YOUR_UPLOAD_KEY_HERE
loglevel=INFO
upload_dir=/media/pictures/eyefi/unprocessed/
use_date_from_file=no
# Commented out for now, we'll come back to this later
#complete_execute=/media/pictures/eyefi/ocr

I’m choosing to save the scanned images to an unprocessed directory and the script will put the post-processed files somewhere more useful later on. At this point we can start the service up:

systemd-tmpfiles --create
systemctl enable --now eyefiserver
systemctl status eyefiserver

Then test that it’s working properly by scanning an image and tailing the /var/log/eyefiserver/eyefiserver.log logfile. You should see some messages in the log and a JPG file appear in the unprocessed directory.

Google Drive client setup

Google is my feudal lord and so Drive is my preferred cloud storage provider. There seem to be a great many Linux clients with varying levels of support, but none that I could find both did what I needed and was packaged for Sabayon. drive seems to be one of the better clients, and after 30 minutes trying and failing to write an ebuild for it, I took the lazy route and installed it to /opt:

# export GOPATH=/opt/drive
# go get -u github.com/odeke-em/drive/cmd/drive
$ /opt/drive/bin/drive version
drive version: 0.3.4

This needs a directory that will act as a mirror of your Google Drive to work, and needs to be setup with an OAuth token for access. Running the following command will give you a URL to paste into a browser. Accept the authorisation prompt and copy the key back into the drive prompt to get started. All actions then need to be done from within the ~/gdrive directory. So I symlinked the eyefi directory into the gdrive directory and plan to be selective about what I push.

$ /opt/drive/bin/drive init ~/gdrive

$ cd ~/gdrive
$ /opt/drive/bin/drive ls
$ ln -s /media/pictures/eyefi ./

The OCR and upload script

So far everything has been about using ready-made products, but this is the bit where we have to get our hands dirty and do some of our own scripting. I’ve put together the following to glue the other components together. It does the following:

  • Creates (if not already existing) a temporary working directory and a dated directory to store the processed results in
  • Rewrites the JPG image from the unprocessed directory to the output directory to work around the scanner producing malformed JPG files
  • Rewrites the EXIF dates to the current date (since the scanner doesn’t have an accurate clock and starts up at 2010-01-01 00:00:00 every time it’s powered on)
  • Converts the JPG into a PDF document
  • Uses pdfsandwich to do the OCR and produce the final PDF we’re after
    •  Passing calls to gs through a wrapper script which noops some of the operations, for the reasons described below
  • Moves the PDF into the right place and cleans up all the temporary files
  • Pushes the OCR’d PDF to Google Drive

pdfsandwich post-processes the results by passing it through ghostscript to do things like resize the page back to A4. Due to an incompatibility, this broke the ability to search the document since it was inserting additional whitespace between every character of text (so instead of searching for “hello”, you’d have to search for “h e l l o”, which is no good to me). I found the intermediate files to be perfectly usable so wrote a horrifically quick and dirty wrapper script around ghostscript which noops some of the operations. Unfortunately pdfsandwich doesn’t expose options to disable some of the post-processing, but does let you specify the gs binary, which meant I could point it at a shell script without mucking around with the locations of system binaries or custom PATHs. I hope this will be fixed upstream at some point so I can remove the hack.

The OCR script:

#!/bin/bash

DATE=$(date +"%Y-%m-%d")
EXIFDATE=$(date +"%Y:%m:%d-%H:%M:%S")

IMAGE=${1##*/}
UNPROCESSED_PATH=${1%/*}
IMAGE_PATH=/media/pictures/eyefi/${DATE}

WORK_PATH=/media/pictures/eyefi/.tmp
UNSORTED_PATH=/media/pictures/eyefi/unsorted
PDF_PATH="${IMAGE_PATH}/OCR"
PDF=${IMAGE%.*}.pdf
OCR_PDF=${PDF%.pdf}_ocr.pdf

# Create the output dirs if they don't already exist
mkdir -p "${PDF_PATH}" "${WORK_PATH}" "${UNSORTED_PATH}"

# Scanner malforms the JPG files
/usr/bin/mogrify -write "${IMAGE_PATH}/${IMAGE}" -set comment 'Extraneous bytes removed' "${UNPROCESSED_PATH}/${IMAGE}" 2>/dev/null

# Update the exif timestamp to today's date, since the scanner isn't accurate
jhead -ts"${EXIFDATE}" "${IMAGE_PATH}/${IMAGE}"

# Convert to PDF
/usr/bin/convert "${IMAGE_PATH}/${IMAGE}" "${WORK_PATH}/${PDF}"

# OCR it
/usr/bin/pdfsandwich "${WORK_PATH}/${PDF}" -o "${WORK_PATH}/${OCR_PDF}" -rgb -gs /media/pictures/eyefi/fake-gs

# Push the resulting OCR'd document up to Google Drive
mv "${WORK_PATH}/${OCR_PDF}" "${UNSORTED_PATH}/${PDF}"
pushd ~/gdrive
/opt/drive/bin/drive push -no-prompt "eyefi/unsorted/${PDF}"
popd

# Move the unsorted PDF to the final local location
mv "${UNSORTED_PATH}/${PDF}" "${PDF_PATH}/${PDF}"

# And tidy up
rm "${UNPROCESSED_PATH}/${IMAGE}" "${WORK_PATH}/${PDF}"

And the ghostscript wrapper script (don’t look too closely, it’s nasty but does the job for now):

#!/bin/bash
if [ "$7" == "-dPDFFitPage" ]; then
 echo "Faking call to gs to preserve working search"
 cp -- "${@:(-1):1}" "${@:(-2):1}"
elif [ "${5%=*}" == "-sOutputFile" ]; then
 OUT=${5#-sOutputFile=}
 echo "Faking call to gs to preserve working search, copying $6 to $OUT"
 cp -- "$6" "$OUT"
else
 exec /usr/bin/gs "$@"
fi

Dependencies

The script requires the following tools:

  • imagemagick (provides the convert and mogrify tools)
  • jhead (does the EXIF manipulation)
  • exact-image (provides the hocr2pdf tool)
  • pdfsandwich (produces the OCR’d PDF we want)

Managing the files in Google Drive

Now within about 2 minutes, all the freshly OCR’d documents from the scanner appear in my Drive at eyefi/unsorted, and I can drag and drop them in bulk to the folders on the left.

Document in Drive waiting to be sorted (and no, it didn't recognise my handwriting!)

Document in Drive waiting to be sorted (and no, it didn’t recognise my handwriting!)

I can also search Drive for any of the text in the PDFs and it will show the search results. The accuracy of the OCR seems to be pretty good here, as the results contain what I expect to see.

Conclusion

My objective has been met; I can run a batch of paper through the scanner every couple of weeks, and quickly file away the resulting electronic documents safe in the knowledge that I could probably find them again. No longer do I need to worry about box files and shelf space. Although I do now have a new problem…

Credit to Paul Silcock

Credit to Paul Silcock

Removing stale facts from PuppetDB

PuppetBoard and PuppetExplorer are both excellent tools but can be slowed down significantly if there are a very large number of facts in PuppetDB. I recently had an issue with some legacy facts tracking stats about mounted filesystems causing a significant amount of bloat, and this is how I cleaned them up.

The problem

A long time ago, someone decided it would be useful to have some extra fact data recording which filesystems were mounted, the types and how much space was being used on each. These were recorded such as:

fstarget_/home=/dev/mapper/system-home
fstype_/home=ext4
fsused_/home=4096

It turned out that none of these ever got used for anything useful, but not before we had amassed 1900 unique filesystems being tracked across the estate. With three facts each, that accounted for almost 6000 useless facts.

Too many facts!

The PuppetDB visualisation tools both have a page that lists all the unique facts, retrieved from the PuppetDB API using the /fact-names endpoint. Having several thousand records to retrieve and render caused each tool to delay page loads by around 30 seconds, and typing into the realtime filter box could take minutes to update, one character appearing at a time.

Removing the facts

Modifying the code to stop the facts being present on the machines is the easy part. Since /fact-names reports the unique fact names across all nodes, to make them disappear completely we must make sure all nodes check in with an updated fact list that omits the removed facts.

How you do this depends on your setup. Perhaps you have the puppet agent running on a regular schedule; maybe you have mcollective or another orchestration tool running on all your nodes; failing any of those, a mass-SSH run will do.

So we update all the nodes and refresh PuppetExplorer… and it’s still slow. Damn, missed something.

Don’t forget the deactivated nodes!

If we take a closer look at the documentation for the /fact-names endpoint, we see the line:

This will return an alphabetical list of all known fact names, including those which are known only for deactivated nodes.

Ah ha! The facts are still present in PuppetDB for all the deactivated nodes, but since they're not active we can't do a puppet run on them to update their fact lists. We're going to have to remove them from the database entirely.

Purging old nodes from PuppetDB

By default, PuppetDB never removes deactivated nodes, which means their facts hang around forever. You can change this by enabling node-purge-ttl in PuppetDB's database.ini. As a one-off tidy-up, I set node-purge-ttl = 1d and restarted PuppetDB. Tailing the logs, I saw PuppetDB run a garbage collection on startup, and all of my deactivated nodes were purged immediately.
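For reference, the change amounts to a couple of lines in database.ini (the section and setting names are per the PuppetDB configuration docs; remove the setting again, or pick a longer TTL, once the tidy-up is done):

```ini
[database]
# Purge nodes that have been deactivated for more than one day
node-purge-ttl = 1d
```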

Success!

Now… to deal with the thousand-odd entries from the built-in network facts…

Setting up hiera-eyaml-gpg

It’s inevitable at some point while writing puppet manifests that you’ll need to manage some sensitive configuration; be that a database password, an SSH deploy key, etc. One way to deal with this is to lock down your puppet code so that only trusted developers can see the contents. Another approach is to encrypt the secrets within the puppet code, which is where hiera-eyaml comes in. Hiera-eyaml provides a pluggable backend for your hiera data that can contain secrets encrypted through different means. By default hiera-eyaml uses a symmetric passphrase to protect the secrets, but hiera-eyaml-gpg adds a GPG backend allowing secrets to be protected using asymmetric keys.

The puppetmaster of course will need access to the secrets in order to provide them to end machines (whether using passwords or GPG keys), so hiera-eyaml does nothing to help secure the master itself. You should work on the basis that the secrets are effectively plaintext on the puppetmaster and protect it appropriately. If someone does compromise your puppetmaster, you’ll have bigger problems than someone being able to read the hiera secrets. However hiera-eyaml does protect the secrets while they reside outside of the puppetmaster, for example on workstations and in version control systems.

I prefer the GPG backend because it means developers can have passphrase-protected keys on workstations and use the gpg-agent to securely access the key. It means workstation machines don’t need the same rigorous protection as the puppetmasters to keep the secrets secure.

Installing the hiera backends

On Sabayon, use my community repo and install using entropy:

equo install dev-ruby/hiera-eyaml dev-ruby/hiera-eyaml-gpg -a

Or install using portage from my overlay:

emerge dev-ruby/hiera-eyaml dev-ruby/hiera-eyaml-gpg -av

On RedHat-type systems, use fpm to build RPMs from the ruby gems and install them natively. Note we're building with sudo so that fpm picks up the right GEM_PATH and doesn't build packages that install into the builder's home directory.

cd ~/rpmbuild
sudo -E fpm -s gem -t rpm -n hiera-eyaml -a noarch --version 2.0.2 --iteration 1 -p RPMS/noarch/hiera-eyaml-VERSION-ITERATION.ARCH.rpm hiera-eyaml
sudo -E fpm -s gem -t rpm -n hiera-eyaml-gpg -a noarch --version 0.4 --iteration 1 -p RPMS/noarch/hiera-eyaml-gpg-VERSION-ITERATION.ARCH.rpm -d hiera-eyaml hiera-eyaml-gpg
sudo rpm -Uvh RPMS/noarch/hiera-eyaml{,-gpg}*.rpm

I’m assuming here that the ~/rpmbuild environment is already set up for rpmbuild to use. If not, you’ll need to do this first.

Creating your puppetmaster keys

On each of your puppetmasters, you’ll need to create a PGP keypair that the master can use to decrypt the secrets.

First up, create a directory to contain the keyrings:

sudo mkdir /etc/puppet/keyrings

Now generate the keypair and export the public part (be sure not to set a passphrase here):

sudo gpg --homedir /etc/puppet/keyrings --gen-key
sudo gpg --homedir /etc/puppet/keyrings/ --export -o /tmp/puppetmaster.pub

Copy the puppetmaster.pub to your local workstation using something like scp so you can encrypt data using it later.

You can reuse the same key on all puppetmasters by copying the keyrings directory around, but it would be better to repeat this process to generate a unique key on each of your puppetmasters. This means you can later revoke a single master’s key without having to re-key every machine.

Creating your personal keys

If you don’t already have a personal GPG keypair, create one for yourself now (be sure to set a strong passphrase on it):

gpg --gen-key

Next up, import all of the puppetmaster public keys into your keyring:

gpg --import puppetmaster.pub

We need to list and sign each of the puppetmaster keys with your personal key:

gpg --list-keys
/home/ben/.gnupg/pubring.gpg
----------------------------
pub 4096R/CDADE567 2013-05-04
uid [ultimate] Ben Roberts <ben@example.com>
sub 2048R/FDF62278 2013-05-04

pub 4096R/CBF58456 2013-05-04
uid [ full ] master1.example.com <hostmaster@example.com>
sub 2048R/234E54BF 2013-05-04

pub 4096R/427659C4 2014-11-22
uid [ full ] master2.example.com <hostmaster@example.com>
sub 4096R/C645C3FB 2014-11-22

gpg --sign-key CBF58456
gpg --sign-key 427659C4

Setting up hiera

Now we need to configure hiera to use the eyaml backend with the GPG encryption method. Configure your hiera.yaml to contain the following:

---
:backends:
  - eyaml

:hierarchy:
  - common

:eyaml:
  :datadir: /etc/puppet/environments/%{::environment}/data
  :extension: 'yaml'
  :encrypt_method: gpg
  :gpg_gnupghome: /etc/puppet/keyrings

Next up, we need to tell eyaml which keys to encrypt new data with. This is a simple text file, hiera-eyaml-gpg.recipients, that goes inside your hiera data directory and contains the names of the PGP keys we generated earlier. You’ll need to include all of the puppetmasters, so catalogs can be compiled, plus all the developers who need to be able to change the secrets.

master1.example.com
master2.example.com
me@benroberts.net

Make sure the updated hiera.yaml and hiera-eyaml-gpg.recipients files are available on all puppet masters and reload/restart the master to pick up the changes.

Your first secret

Now we’re all set, it’s time for the fun bit. We’ll use the eyaml tool to edit the common.yaml to add a new secret value.

eyaml edit data/common.yaml

Add a new value into the file, wrapping the plaintext secret in DEC::GPG[ and ]!. This will tell eyaml that the secret is in decrypted form, and should be encrypted with the GPG backend.

---
host::root_password: "DEC::GPG[superseekrit]!"

Now save and exit the file, then re-open it, and you’ll see something like this instead:

---
host::root_password: "ENC[GPG,hQEMA4LkLtcnPlS+AQf/SabmYb9US3HTv8B1Bxx3CN9Tw29Lt3WcC4OeOnq1a5xzlhP5dolMcSV/qPqo4j3hq+z2D1e+POZSd+3cH4lD6wRr3IWjJkyHyGmibVlIUPv2Y7CNMOXPcGJaAEFCKTpTEKlS87zDied19b9jS6yoCDVtGgLlUF32Et66P6pimVelWSb4REnv3rRVR7goCLmlaFk30/UqeJfwmwNxPPsO+Ne8SreA0dfukkkyZ3JnSTmbtXlGJfMPLA7bjW8+Jexb/0c6WJiEDXCxuncvzkBeMz6+cuKjZ6SHLIxiQtZUDrxAvkpiId6cWM49nYpbxdZVvzfoyiQkDtK7uw/hF92wxYUCDAOVcE7kxjXDjgEP/2XcJQnRSuagdOUPMZMW4RkC3pNXRV8IcoLWQVDP08YuICCdL5iVaNbU66fU034UyJmHRyZREU+NiTUvxj92gkuNSG4jqMiDEdehNTnkCmij9qSjiZGaHHcIx6OwfYanLsWm5b0R+HBRCg1EXqwjmeUqi3sFCu6qlRPaDLc77xRCxJdvGRHZ04JUnyjYS/leRxdVo2FEzJVHAW/Psm2wa+wkTcuW6g2Uv65WzANxaNBcP+vWAlErMHbxmkFiRvYHPBxbS6L/w5+Umh+5LLrx6M/op3iQWAialqNd8NKFYKkVqb/Y7Tmfaj6W0XV+JiEkwoYY0SMD6wTtQwH6OPk99VfDUPiU7uQ+i8Q8doK8J8OH7sQTj/ye1Rq0e6dF7xGhvhm7YOa3UMSx/V33eZAr4EQ/n+bMVZxDfZ6Qmi5wVw9oZ9KO826zkUy1K/4QrxjQZfz0YZTzDIrc8lGcHXuroIbiUemPbgkX6GEiXInha5tt7chTiiyjFgfCtSOcekeQ4VAMcBb66LUp2M8D3k4Aqp3j+wK7KesDTaoTF1gN4FyVsXuest6YB6v67Zv+Wox30z+AG97RIzHZlWqioPxtB98QAbg5pT2a5brnRuD2/6rllO4dCRE1lMO1Sh8v5ZiV824rxVMo4z+NzybSB2kDN4DoeubDUCzExeJXM9MRqpz7hQEMA5Fhm/f79iLoAQf9EZ8XH2jNgHY8K4oJ/TKapivcEqZm5a/35eWzFigBHKaBwag05q2M5imtFbI4Ez7ugFrwSdeUFeQHW16Mt9Jka7KfAmo9CuxYuOcc5/3T6qjzwf1nQtRiX/9LMxAQWz5vQRYXbIPhPzMif6JfUxGfT5fg4oNBsDc2mIo6K7gxUg1EDhqznVpnclVuv4LrTieZgq2FPue95IM1SGsFFHak5y3f+sbQUl8xvVQohq+hyXhsxGmMASkt6ZPIQE1v3u35FUA8ovKQg5cIOdt5sYp1EV7tDL6kPieaVF0Ba20v01MY0dsHFxuGmeAIHWJxukxXDB8bPOQBoW45TRTZ0u0GOdJHAaGi6dMJwzVz2Dt/IQZlhjG3Yh0VPkUgQ78bsHKYuL7k0CDpDr3vb4mT0PljNEot7wDb4pBUL/3KtumvmDRxJ20TA4sFaNI=]"

Success!

Encrypt all the things

Running eyaml edit again will decrypt the file and open it in your text editor. Saving and quitting the editor will re-encrypt the file again. While editing the file, the wrapper string will change to DEC::GPG(N) where N is a unique number within the file. This is used for eyaml to keep track of which values have been edited so you don’t see all the encrypted blobs change every time the file is edited, and only the actually changed values will show up in version control diffs. Neat.

Adding a new developer/puppetmaster

If in future you need to deploy a new puppetmaster or a new developer joins the team there are three things you need to do.

  1. Import the new user’s key into your keyring and sign it
  2. Update data/hiera-eyaml-gpg.recipients with the name of the new keys
  3. Run eyaml recrypt <filename> for each file that contains secrets so they are re-encrypted with the new keys.
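Step 3 is easy to script. Assuming your encrypted files live under data/ with a .yaml extension (a sketch; requires eyaml on your PATH):

```shell
# Re-encrypt every hiera data file so the newly added recipients can decrypt it.
# Assumes encrypted files live under data/ with a .yaml extension.
for f in data/*.yaml; do
    [ -e "$f" ] || continue    # no matches: the glob stays literal, skip it
    eyaml recrypt "$f"
done
```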

Preventing information leaks

Secrets are now protected on disk and in version control, which is good, but those aren’t the only places secrets can leak. Recall when you do a puppet agent --test --noop run to see what puppet would change? That shows diffs of files, secrets and all. Oops. And it’s logged to syslog by default. Oops. And if you’re running with stored configs/PuppetDB, the reports including the diffs will be stored there as well. Oops.

To prevent this, when managing any file resource in puppet that could contain secrets, set the show_diff option to hide the diffs. This has the slight downside that a --noop run no longer shows you what would change in the file, but that’s better than having a password available in plaintext somewhere.
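As a sketch (the resource and hiera key are hypothetical, but show_diff is a standard parameter of puppet's file type):

```puppet
# Hypothetical file resource carrying a secret pulled from hiera;
# show_diff => false keeps the contents out of logs and reports.
file { '/etc/app/secret.conf':
  ensure    => file,
  mode      => '0600',
  content   => hiera('host::root_password'),
  show_diff => false,
}
```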

Also remember that just because the current version of the hiera data is protected doesn’t mean the history is. If you start using hiera-eyaml to protect existing secrets, remember that previous revisions may contain the plaintext versions. Bringing in hiera-eyaml is a good excuse to change passwords too.

Problems?

If you see a message like this, then your gpg-agent or pinentry programs may not be working properly:

[gpg] !!! Warning: General exception decrypting GPG file
[hiera-eyaml-core] !!! Bad passphrase

Make sure you have a pinentry application installed (e.g. pinentry-curses), and that your gpg-agent is running:

eval $(gpg-agent --daemon)

Then try again.

Using puppet on Sabayon Linux

I like puppet and I like Sabayon, but out of the box they don’t play nicely together. Sabayon is a Gentoo derivative and looks to puppet like a Gentoo system, which causes it to use the Gentoo providers for package and service resources. Unlike a stock Gentoo install, Sabayon hosts use systemd and a binary package manager (entropy). While entropy is compatible with portage, I don’t want to compile things from source on my Sabayon boxes.

I’ve put together a puppet module (https://github.com/optiz0r/puppet-sabayon/) which adds support for Sabayon by doing the following things:

  • Overriding the operatingsystem fact to distinguish between Gentoo and Sabayon (osfamily still describes it as Gentoo)
  • Adding an init fact which reports whether systemd is running
  • Subclassing the Systemd service provider to make it the default when the init fact reports systemd is in use
  • Adding a has_entropy fact that checks whether equo is available
  • Adding an Entropy package provider that’s the default for systems where has_entropy is true (and when operatingsystem is Sabayon, but that’s mostly a hack to override the portage default)

To use it, just add the module to your environment, and pluginsync will do all the hard work.

Known issue: the entropy provider doesn’t behave well when it tries to install a package with a new license not already included in license.accept. Equo will present a prompt to accept the license, which fails since there’s no stdin to read the answer from. This causes the prompt to be reprinted in a tight loop while puppet captures the output and fills up /tmp. I’ve filed a bug report to change the equo behaviour. Once this issue is resolved I’ll package up the module properly and submit it to the forge, but in the meantime it’s there if people find it useful.

Puppetenvsh Mcollective Agent

There is no shortage of different ways to set up Puppet and to manage how code is deployed. Like many people, I’m using git to store my puppet code. Perhaps a little less normally, I have multiple puppetmasters. For me these solve two problems: resilience in case one master needs to be taken offline, and geographic diversity, which means I can target puppet runs at the nearest master and save a bit of time during puppet runs. This does however raise a different problem: how to keep the masters in sync so that each serves the same content. My answer to this is puppetenvsh, an MCollective agent which is triggered via a git pre-receive hook and updates all the puppet environments on all masters concurrently.

Continue reading