TL;DR Version: We screwed up by not really understanding how the packaging system we use works internally.

I currently work on a project where we have to go full stack: from planning, coding and testing to packaging, shipping and doing some legal paperwork. One of the areas that gives us the most headaches (besides the legal hassle, obviously) is packaging (we’re only dealing with RPM for now). Here I’ll describe a default RPM behavior that bit us recently, and how we later used it to our advantage.

RPM packages can run scripts (scriptlets) at several points: before and after installation, and before and after removal - namely:

  • %pre: kicks in before installing the package - this stage is commonly used to prepare the environment before the install, such as creating directories and setting permissions.
  • %post: runs after the package has been installed, typically used to run configuration scripts, like creating the administrator credentials for a database or setting up the service to start at boot.
  • %preun: happens before a package removal (or upgrade), used to stop dependent services and kill the application being uninstalled.
  • %postun: the last step of a removal (or, again, an upgrade), sometimes used for some leftover cleanup like purging empty directories.

I won’t go too deep into the workings of RPM as that is outside the scope of this post, but if you’re interested check out Maximum RPM: it’s the most complete RPM guide you’ll ever get. As a bare minimum, take a quick look here and here.

So, what happens behind the scenes during a package upgrade? It’s natural (at least for me and my team) to think that the old package version is removed and then the new version is installed - meaning that the sections described previously would be called in the following order: %preun, %postun, %pre, %post. Well, it turns out this is not what happens, and this is why we were bitten. Twice.

Bite #1: Our Services Were All Stopped After An Upgrade

Following the “natural” chain of thought, we’d included instructions in our package’s %preun section to stop the services we set up, so that they wouldn’t be running during the upgrade and risk breaking everything.

This definitely didn’t break anything, and since we provide a front-end for upgrading our packages which triggers a restart after everything is done, we didn’t spot the issue right away. Only when testing a command-line upgrade did we notice that our services were left completely stopped.

The fix was simple: we made the %preun section skip the commands that stop our services when the package is being upgraded, and everything went better than expected.

Bite #2: Files Generated Dynamically Were Being Removed

For internal reasons, one of the packages we ship switched from including some files statically in the RPM package (listed in the %files section) to generating them automatically during the install process.

With this change, after an upgrade from our old, more declarative package to the new dynamic one, the generated files ended up being removed (we had confirmed they were definitely being created during the setup).

The rather-ugly-but-simple fix was to create an intermediary package whose static files had names different from the auto-generated ones.

How An Upgrade REALLY Works

So, how come those fixes worked and - more importantly - why the heck were these issues happening in the first place?

Quoting from this developerWorks article, what happens during a package upgrade is the following:

  1. Run the %pre section of the RPM being installed.
  2. Install the files that the RPM provides.
  3. Run the %post section of the RPM.
  4. Run the %preun of the old package.
  5. Delete any old files not overwritten by the newer version. (This step deletes files that the new package does not require.)
  6. Run the %postun hook of the old package.
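To make that order easier to keep in mind, here is a small sketch that just prints it. It doesn’t call rpm at all; upgrade_order and the mypkg names are made up for illustration. The $1 values in parentheses are what each scriptlet typically receives during a plain upgrade (the number of instances of the package that will remain installed once the operation finishes).

```shell
#!/bin/sh
# Print, in order, what rpm does when upgrading from an old package
# version to a new one, along with the $1 each scriptlet typically sees.
# upgrade_order is a hypothetical helper, not part of rpm.
upgrade_order() {
    old="$1"
    new="$2"
    printf '%s\n' \
        "%pre of $new (\$1=2)" \
        "install files of $new" \
        "%post of $new (\$1=2)" \
        "%preun of $old (\$1=1)" \
        "remove files of $old not provided by $new" \
        "%postun of $old (\$1=1)"
}

upgrade_order mypkg-1.0 mypkg-1.1
```

Note how both scriptlets of the old package run last, when the new files are already in place.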

What this means is that the removal of the old package happens after the installation of the new one. If you think about it, it kind of makes sense, since we don’t want to touch the configuration files the user has changed (or rather, we don’t want to force the user to reconfigure her service after simple updates). If it were up to me I wouldn’t do it like this, but anyway.

Let’s understand each issue and why the fix worked:

  1. The issue with services being stopped (Bite #1) is fairly simple to understand once you read about the upgrade order above: the old package’s %preun and %postun run after the new version is already installed, so you must not stop services in those sections during an update/upgrade.
  2. Issue #2 is the interesting one: since the files we were creating were listed in the %files section of the old package but not in the new one, step 5 of the upgrade flow removes them. This, in my opinion, is a bug: the mtime of the files that are candidates for removal should be checked to make sure no scriptlet touched them.
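For Bite #1, the usual way to tell a removal from an upgrade is the argument rpm passes to the scriptlet: in %preun and %postun, $1 is 0 when the package is really being erased and 1 (or more) when it is being upgraded. Below is a minimal sketch of such a guard. In a spec file the body of the function would sit directly under %preun; it is wrapped in a function here only so it can be tried outside of rpm. The service name myservice is hypothetical, and the echo stands in for whatever command actually stops your service.

```shell
#!/bin/sh
# Sketch of a %preun body that stops the service only on a real removal.
# $1 = number of instances of the package left after this operation:
#   0 -> package is being erased, 1 or more -> package is being upgraded.
preun_scriptlet() {
    remaining="$1"
    if [ "$remaining" -eq 0 ]; then
        # Real removal: stop the service before its files go away.
        # "myservice" is a placeholder; replace echo with the real stop command.
        echo "stop myservice"
    else
        # Upgrade: the new version is already installed, leave it running.
        echo "upgrade: leaving myservice running"
    fi
}
```

Calling preun_scriptlet 0 takes the removal branch, while preun_scriptlet 1 takes the upgrade branch and leaves the service alone.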

The Tasty Bite

Enough about issues; let’s take a look at a situation where this upgrade behavior actually helped us solve a production bug:

In our first release some of our code depended on filesystem data to work, and we didn’t cache that data for use throughout the service lifetime. This caused a bug during an upgrade because the newer package version would replace some files which the old, still-running service depended on - so, after an upgrade through our UI, the application would crash since the filesystem data was different from what it expected (yeah, I’m talking about templates - shame on us).

Such an issue could be solved by simply releasing a new minor version of the product. However, the bug would still show up if the user upgraded straight from the original version, without going through the minor update, so this wasn’t perfectly safe.

By the time we caught this problem we were already aware of the real RPM upgrade flow, so we used that to work around the issue for any version. In the %pre section of the new package we backed up the old files which triggered the crash when changed, and kept them at their original location after the upgrade finished. When the application restarts, it detects the presence of the old files and finally replaces them with the new ones: now the service is running with updated code that is aware of the filesystem changes, and nothing really breaks.
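I’m not reproducing our actual scriptlets here, but the backup half of the trick could look roughly like the sketch below. Everything named in it is an assumption for illustration: the two directories, the pre_scriptlet wrapper, and the choice of copying the whole directory. In %pre, $1 is 1 on a fresh install and 2 (or more) on an upgrade, so the copy only happens when there is an old version whose files are about to be overwritten. Putting the old files back and swapping them on restart is left out.

```shell
#!/bin/sh
# Hypothetical locations; adjust to wherever your package keeps such files.
TEMPLATE_DIR="${TEMPLATE_DIR:-/usr/share/myapp/templates}"
BACKUP_DIR="${BACKUP_DIR:-/var/lib/myapp/templates.old}"

# Sketch of a %pre body for the NEW package: on upgrade, copy the files the
# running (old) service depends on before rpm overwrites them.
# $1 = 1 on first install, 2 or more on upgrade.
pre_scriptlet() {
    if [ "$1" -ge 2 ] && [ -d "$TEMPLATE_DIR" ]; then
        # Start from a clean backup directory so stale copies don't linger.
        rm -rf "$BACKUP_DIR"
        mkdir -p "$BACKUP_DIR"
        # "dir/." copies the directory contents, preserving modes and times.
        cp -Rp "$TEMPLATE_DIR/." "$BACKUP_DIR/"
    fi
}
```

Because %pre of the new package is the very first step of the upgrade, this is the only place where the old files are still guaranteed to be untouched.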

This is what one would call a really ugly hack in the wild, and I’d have to agree. However, the code was already in production, and letting the application crash in the user’s face is even worse than a dirty hack.