Build time optimization with pip and virtualenv

Using pip and virtualenv to manage your Python dependencies is a no-brainer for development. Ensuring that all developers use consistent versions, simplifying development environment bootstrapping, and quickly testing new dependency versions in an isolated environment are all huge wins. What about in a production environment or with continuous integration?

Pip presents certain difficulties when managing changing dependencies under the CI model of development. Different branches may have different dependencies, and building a new virtualenv in that context is not a trivial affair: it can easily take 10 minutes or more to download sources and compile distributions with binary dependencies. Adding an extra 10 minutes to the cycle time between creating a pull request and having the tests run against the branch was not acceptable for Rover.

A production environment poses additional challenges. How do you quickly roll out new environment changes to multiple instances? What about failed environment builds on particular machines? How do you avoid rebuilding an environment when no requirements have changed?

We wanted the benefits of pip and virtualenv in production at Rover, and as a result we've solved a number of these problems.

Speeding up source downloads

Pip's default configuration is to download a new copy of the source every time a package needs to be built. Excluding the compilation of binary packages (such as PIL), this is the most time-consuming portion of building a new virtualenv.

So how do you avoid it? Thankfully, pip provides a solution: you can specify a directory to be used as a cache of downloaded sources.

Create a file at:

~/.pip/pip.conf

The contents should be something like:

[global]
download-cache=/path/to/your/download-cache

You can use any location for your cache. I personally like nesting it under my .pip directory:

download-cache=~/.pip/download-cache
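The same setup can be scripted for build machines. The following is a minimal sketch, assuming a pip release that still honors the download-cache option described above; write_pip_conf is a hypothetical helper name, and it overwrites any existing pip.conf in the target directory.

```shell
# write_pip_conf is a hypothetical helper, not part of pip.
# It creates the cache directory and writes a pip.conf that points at it.
write_pip_conf() {
  local pip_dir="$1"                          # e.g. "$HOME/.pip"
  local cache_dir="$pip_dir/download-cache"   # nested under the .pip directory
  mkdir -p "$cache_dir"
  # Overwrites any pip.conf already present in $pip_dir.
  printf '[global]\ndownload-cache=%s\n' "$cache_dir" > "$pip_dir/pip.conf"
}
```

Calling write_pip_conf "$HOME/.pip" would reproduce the configuration shown above with an absolute cache path.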

Speeding up binary compilation

So now you're only downloading the source distribution for each package version once and reusing it each time you need it.

That's much better. Still, every time you build a virtualenv you will need to spend time rebuilding packages such as PIL, psycopg2, and mysqldb, which have binary components and don't change very often.

Pip doesn't provide a great built-in solution to this problem. However, pip-accel does a great job of managing this issue. Simply put, it provides a thin wrapper around pip commands that caches the binary output of your builds, resulting in massive speedups to your deployment process.
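Because pip-accel takes the same install arguments as pip, swapping it in can be a one-line change. Here is a rough sketch, assuming pip-accel has already been installed into the active environment; install_requirements is a hypothetical helper name, and the fallback to plain pip is an illustrative choice rather than something pip-accel does itself.

```shell
# install_requirements is a hypothetical helper.
# It prefers pip-accel when it is on PATH and falls back to plain pip otherwise.
install_requirements() {
  local requirements_file="$1"
  if command -v pip-accel >/dev/null 2>&1; then
    # Same arguments as "pip install", but binary builds are cached.
    pip-accel install -r "$requirements_file"
  else
    pip install -r "$requirements_file"
  fi
}
```

Usage would look like install_requirements requirements.txt.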

Build one virtualenv for each requirements iteration

At this point, we had already improved our build time with pip and virtualenv by 80% or so. However, that still wasn't good enough for us. We were still building a new virtualenv for every test suite run on CI and every new deployment build, whether or not the requirements had changed!

If the requirements haven't changed, then neither has the virtualenv. So how do you reuse virtualenvs but ensure new ones are made whenever the requirements change? Hashing was made for this. Simply hash the contents of the requirements file and look for an existing environment matching that requirements version. The exact process will vary based on your system, but here is the gist of what we did:

Hash the contents of the requirements file
Look in a known location for a directory matching that hash
If it exists, use it; if it doesn't, create it with virtualenv
If you created it, install the requirements, being sure to check that the installation succeeded

We use shell scripts to manage our deployment steps. Our virtualenv creation for CI looks roughly like this:

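#!/bin/bash
# Exit immediately if any command fails.
set -e

# Remove the partially built virtualenv and exit with the failing status.
function handle_dependency_errors() {
  local venv_path="$1"
  local pip_error_code="$2"
  echo "Pip failed to build new virtual environment" >&2
  rm -rf "$venv_path"
  exit "$pip_error_code"
}

# Name the virtualenv after the MD5 hash of the requirements file.
venv_path="/var/cache/venvs/$(md5sum requirements.txt | awk '{ print $1 }')"

if [ ! -d "$venv_path" ]; then
  # Single quotes defer expansion of $? until the trap actually fires.
  trap 'handle_dependency_errors "$venv_path" "$?"' EXIT
  virtualenv "$venv_path"
  "$venv_path/bin/pip" install pip-accel
  "$venv_path/bin/pip-accel" install -r requirements.txt
  # The build succeeded, so clear the cleanup handler.
  trap - EXIT
fi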
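To make the cache key easier to reason about, the hash lookup can be pulled into its own function. This is only a sketch: venv_path_for is a hypothetical name, and the default root of /var/cache/venvs simply mirrors the script above.

```shell
# venv_path_for is a hypothetical helper.
# It prints the virtualenv directory that corresponds to a requirements file.
venv_path_for() {
  local requirements_file="$1"
  local venv_root="${2:-/var/cache/venvs}"
  local digest
  # Refuse to compute a path for a file that does not exist.
  [ -f "$requirements_file" ] || return 1
  # md5sum prints "<hash>  <filename>"; keep only the hash.
  digest="$(md5sum "$requirements_file" | awk '{ print $1 }')"
  echo "$venv_root/$digest"
}
```

Two branches whose requirements files are byte-for-byte identical resolve to the same directory, while any change, even to a comment or whitespace, produces a new one.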

Shipping the virtualenv

If you're running a limited number of servers, it can be feasible to simply build the virtualenv on the boxes. However, as your system grows in complexity, it can be advantageous to have a single deploy machine that handles the entire build process and then simply copies the virtualenv to the production machines.

Doing this saved us significant build time, but it's not without consequences. For deployment, we had to build our virtualenv with --relocatable. This causes several issues, the most obvious of which is the loss of activate.
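A copy step along these lines might look like the sketch below. It assumes a virtualenv release that still supports --relocatable, rsync over SSH, and a parent directory that already exists on each target; ship_virtualenv and the host names are hypothetical.

```shell
# ship_virtualenv is a hypothetical helper.
# Usage: ship_virtualenv <venv_path> <host> [<host> ...]
ship_virtualenv() {
  local venv_path="$1"
  shift
  # Mark the already-built environment as relocatable before copying it.
  virtualenv --relocatable "$venv_path"
  local host
  for host in "$@"; do
    # Trailing slashes sync the directory contents to the same path remotely.
    rsync -a --delete "$venv_path/" "$host:$venv_path/"
  done
}
```

For example, ship_virtualenv "$venv_path" app1.example.com app2.example.com. Without activate, commands on the target machines are run by calling the environment's binaries directly, such as "$venv_path/bin/python".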

More subtly, you'll need to be on point with your configuration management. Any package that depends on system libraries (such as openssl or libjpeg) requires that those libraries be installed on each machine, and at the same absolute path as on the build machine. Depending on your needs, it may be desirable to eat the build time hit and build the virtualenv on each machine to avoid these potential complications.
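One way to catch a missing system library early is to ask the dynamic linker about every compiled extension in the copied environment. The sketch below assumes a Linux target with ldd available; missing_shared_libs is a hypothetical helper, and it only detects libraries that fail to resolve, not ones that resolve to a different version.

```shell
# missing_shared_libs is a hypothetical helper.
# It prints "<extension>: <library>" for each shared library ldd cannot resolve.
missing_shared_libs() {
  local venv_path="$1"
  local so_file
  find "$venv_path" -name '*.so' | while IFS= read -r so_file; do
    # ldd reports unresolved dependencies as "libfoo.so.1 => not found".
    ldd "$so_file" 2>/dev/null |
      awk -v f="$so_file" '/not found/ { print f ": " $1 }'
  done
}
```

Running missing_shared_libs "$venv_path" on a production machine after the copy and failing the deploy if it prints anything is one possible guard.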

Further Improvements

Hopefully this has given you some good ideas for how to build Python dependencies quickly and efficiently. These changes made a massive improvement to our build time, but there is still more we can do. While rolling out these changes, we discovered that one of our dependencies had suddenly been deleted from PyPI. This wouldn't have been an issue if we were running a local PyPI mirror. Doing so would improve our resiliency to changes outside our control and would also allow us to host internal packages.
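If you do run a mirror, pointing pip at it is a small change. The sketch below uses pip's --index-url option; install_from_mirror is a hypothetical helper, and the mirror URL in the usage line is a placeholder rather than a real service.

```shell
# install_from_mirror is a hypothetical helper.
# It installs requirements from the given package index instead of the default.
install_from_mirror() {
  local mirror_url="$1"
  local requirements_file="$2"
  pip install --index-url "$mirror_url" -r "$requirements_file"
}
```

For example, install_from_mirror "https://pypi.internal.example/simple/" requirements.txt.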