In the last couple of years I’ve become very interested in the interactions and collaboration between development and operations teams, the ‘last mile’ of delivering working software into production, and keeping that software healthy and sustainable in production. I’ve had some satisfying experiences working in teams that have bridged part of the divide between development and ops.
Conveniently in the last few months the ‘DevOps’ movement has arrived and a lot of very smart and interesting people have been sharing their ideas. DevOps resonates incredibly loudly with me – bringing focus to both the people and incentive problems that can hinder collaboration between development and ops, with some interesting technical problems around faster delivery and the necessary investment in automation and configuration management.
I find the DevOps landscape very complex to visualise – many of the pieces are interdependent. To get some sense of the breadth, I drew the mind map below. It’s a big mix of different levels of abstraction, and later I’ll try to draw out some themes.
(click through for a full-size image)
I’m sure I’ve missed some major areas of concern, so if you can be bothered looking at the image and it prompts a thought – please do make a comment.
Recently I met a team who were adopting automated testing. Developers would write some automated unit tests for their application code and run them in the IDE before marking their work as complete. Testers would then write test scripts for the completed software and manually execute them, recording the results. An automation tester followed behind, writing automated functional tests from a growing backlog of completed manual tests.
Regularly during the day an automated build would run on the CI server – the server would dutifully report the number of unit tests that had failed. Rarely would all the unit tests pass – the team would mark a build as ‘stable’ if a separate smoke suite of automated functional tests passed. If the team was lucky enough to have a stable build at the end of the day, it was deployed to a test environment and the automated functional tests would execute overnight. Any failed functional tests would be raised as ‘bug cards’ on the card wall and reprioritised.
Lots of test automation, and build servers, but was this team practicing continuous integration?
The impact of this cycle was that developers had no confidence to make any significant changes to the codebase to improve quality. They were discouraged from working beyond the minimum required to complete their task. As the team approached a release deadline, there were fewer ‘new feature’ cards on the wall but a growing number of ‘bug cards’. The pressure mounted to fix the bugs as quickly as possible, and all sense of sustainable pace was gone. After the release the schedule required the team to start on the next batch of functionality, but a large number of bugs still hung over from the last release. Clearly ‘agile’ doesn’t work.
Stop the line
Zero tolerance is required. The automated build must be kept ‘green’ – if it’s failed, then the highest priority for the team is to make it pass again. Never report ‘% passing tests’ – only green or red.
Ask the team to agree never to commit new changes to source control on top of a broken build – and to stay disciplined despite pressure. If the build cannot be fixed quickly, team members should know to back the offending change out and fix it locally before committing again.
Make sure there is a well-known process for running a local ‘pre-commit’ build – it should always be a script that is in source control alongside the source code. This way everyone shares the same script, and if you need to add steps to the script (e.g. duplication checks) then everyone shares the same file. Using the IDE to perform a build (e.g. Visual Studio) is not acceptable.
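Such a script can be very simple. Here is a minimal sketch in Ruby – the step names and rake commands are illustrative assumptions, so substitute the compiler, test runner, and extra checks your own project actually uses:

```ruby
#!/usr/bin/env ruby
# precommit.rb – a minimal pre-commit build sketch. The steps below are
# hypothetical; replace them with your project's real build commands.

STEPS = [
  ['clean',      'rake clean'],
  ['compile',    'rake build'],
  ['unit tests', 'rake test']
  # Add further checks here (e.g. ['duplication', 'rake cpd']) and the
  # whole team picks them up on their next update.
]

# Runs each step in order, stopping at the first failure. The runner is
# injectable so the logic can be exercised without shelling out.
def run_steps(steps, runner: ->(cmd) { system(cmd) })
  steps.each do |name, cmd|
    puts "== #{name}: #{cmd}"
    unless runner.call(cmd)
      puts "BUILD FAILED at step '#{name}' - do not commit"
      return false
    end
  end
  puts 'BUILD OK - safe to commit'
  true
end

# To wire it up as the actual pre-commit entry point:
#   exit(run_steps(STEPS) ? 0 : 1)
```

Because the script lives in source control next to the code, ‘run the pre-commit build’ means exactly the same thing on every machine.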
Ensure everyone can see the build status – set up a build status ‘radiator’ (e.g. greenscreen or bigvisiblewall) on a spare PC and monitor, or acquire a build light, or some other form of highly visible status. Put it in a prominent position, so when a senior executive asks what it is you can explain – they love process controls.
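If your CI server exposes a status API, a radiator needs very little code. A sketch in Ruby, assuming Hudson’s JSON API – where a job’s ‘color’ is ‘blue’ when passing and ‘red’ when failing – with a made-up URL and polling interval to adapt to your own server:

```ruby
require 'json'
require 'net/http'

# Map Hudson's job 'color' to a binary status - never a percentage.
# ('blue_anime' / 'red_anime' mean a build is currently running.)
def status_from_color(color)
  color.to_s.start_with?('blue') ? :green : :red
end

# Fetch a job's current status from Hudson's remote JSON API.
def fetch_status(job_url)
  body = Net::HTTP.get(URI("#{job_url}/api/json"))
  status_from_color(JSON.parse(body)['color'])
end

# Radiator loop (sketch) - repaint a full-screen display on each poll:
# loop do
#   status = fetch_status('http://ci.example.local/job/myapp')
#   puts(status == :green ? 'GREEN' : 'RED')
#   sleep 30
# end
```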
A few years ago Sam Newman wrote a great article about the ‘build fix flag’ pattern. At every customer since then I’ve introduced the ‘build manifesto’ – printed on a poster on the wall for everyone to understand.
- Try not to break the build
- Run a pre-commit build locally before checking in
- Commit regularly
- NO COMMIT on a broken build (red light)
- IF the build breaks:
  - See who is fixing it (look for the <broken build token>)
  - If no-one is fixing it, see who triggered the build and tell them
  - If they aren’t around, start fixing it yourself
  - Take the <build token> to show you are fixing the build
- DON’T commit and walk out the door
I like ‘manifesto’ – it sounds suitably radical. Some teams who’ve been working in chaos for a long time will eye the new ‘agile guy’ with suspicion for being a nutter – I’m happy to reinforce that impression. When we start delivering software more regularly and predictably with higher quality than ever before, it doesn’t matter what you thought when I first arrived.
It’s hard to get people to care about the build. It’s especially hard in large teams, where you can’t raise your voice and be heard. Clearly a good solution is NOT to have a large team, but sometimes my hands are tied. A few years ago on a large team we found that people were either ignoring build failures (most common) or sending emails to update the team on the build status. It was time-consuming, and often no-one was sure who was fixing a broken build – regularly everybody just assumed someone else was doing it. We introduced an IRC server and wrote some Ruby scripts (‘buildbot’) to post build status to a channel. The team all installed an IRC client and joined the channel, and when the build was broken it was easy to see at a glance who was working on a fix.
We also had an ‘svnbot’ which posted source control commits to the room. This had a really nice side effect – folk were cajoled into writing meaningful commit messages! It provided another focus for the team in understanding the continuous integration of work on a single code line.
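The original bot scripts aren’t reproduced here, but the core of such a bot is small. A sketch in Ruby – the server, channel, nick, and message formats are all illustrative assumptions:

```ruby
require 'socket'

# Message formats are assumptions - shape them however suits your team.
def build_message(status, committer)
  if status == :green
    "BUILD GREEN - last commit by #{committer}"
  else
    "BUILD RED - broken by #{committer}, who is fixing it?"
  end
end

# An svnbot-style one-liner for each commit.
def commit_message(author, revision, log)
  "r#{revision} by #{author}: #{log}"
end

# Posting to IRC takes only a handful of protocol lines (RFC 1459).
# A robust bot would wait for the server's welcome reply before joining.
def post_to_irc(host, channel, text, port: 6667)
  sock = TCPSocket.new(host, port)
  sock.puts 'NICK buildbot'
  sock.puts 'USER buildbot 0 * :build status bot'
  sock.puts "JOIN #{channel}"
  sock.puts "PRIVMSG #{channel} :#{text}"
  sock.puts 'QUIT'
  sock.close
end

# e.g. post_to_irc('irc.example.local', '#build', build_message(:red, 'alice'))
```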
Since then I’ve been involved in replicating this approach at multiple sites – it’s always quite useful, at least to start with. If you are using Hudson there is an excellent Jabber plugin that supports multi-user chat; it has worked well with the ejabberd and Openfire Jabber servers.
Ultimately however we still have a people problem – if the team has agreed to adopt continuous integration and the build manifesto then you may have to spend some time being ‘build cop’ until the team takes care of it themselves.
Production Ready Increments
This much should surely be obvious in the year 2010: your source control system is not a place to back up your files. If you are concerned about losing work in progress due to an act of god, consider that perhaps you should be checking in more often! If you have less than two hours of work in progress, then you don’t have a lot to lose.
Only commit to source control working code that could be shipped to production at any moment. All production features should work. All the time.
Sounds impossible – regular commits and no breaking changes? This conundrum focusses the team on breaking large tasks up into a series of small changes, each of which can be committed separately. Separate refactoring from adding new features – commit each separately. Use feature toggles to allow new partially-complete features to be disabled.
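A feature toggle needs almost no machinery. A minimal sketch in Ruby – the toggle names and the idea of holding the flags in a single map (in practice, a config file per environment) are illustrative assumptions:

```ruby
# Feature toggles: incomplete features are committed to the single code
# line but shipped switched off, so every commit stays production-ready.
class Toggles
  def initialize(flags)
    @flags = flags
  end

  def enabled?(name)
    @flags.fetch(name, false) # unknown features default to off
  end
end

# Hypothetical flags - in practice loaded from per-environment config.
TOGGLES = Toggles.new(
  'new_checkout_flow' => false, # half-finished: in the codebase, disabled
  'express_search'    => true
)

def render_checkout(toggles)
  toggles.enabled?('new_checkout_flow') ? 'new checkout page' : 'old checkout page'
end
```

Flipping ‘new_checkout_flow’ to true in one place enables the feature – there is no long-lived branch to merge.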
Doesn’t this take longer? In my experience I do not believe so – it focusses the team on making more careful changes and increasing overall quality. It enables a regular release cycle without having to rush to complete work in progress. It enables teams to work together on a single code line – which has enormous benefits.
I’m very excited about companies that are adopting continuous deployment – where the path to production is significantly automated and deployments can be pushed several times a day. The discipline required by teams to achieve this must be enormous – you can be certain they do not report ‘percentage failed tests’.