The majority of organisations I’ve worked with deliver new system functionality as development projects. These are funded with capex, and have a start and an end. Even projects that are ‘agile’ are still expected to finish at some date in the future, then once the system has been delivered it will undergo ‘handover’ to ‘BAU’. The project team usually moves on to new projects, developing remarkable cases of mass-amnesia along the way.
Projects deliver exactly what they promise. Project teams have little incentive to invest in the long term operation and maintenance of the systems that they create. I’m not saying that the team doesn’t care or are intentionally acting irresponsibly, but when delivery pressure is applied the first things to be dropped from the project schedule will be the cross-functional concerns that make the system reliable, monitorable, deployable, and maintainable ongoing.
The project effect:
the project team do not have to live with the long term results of their own architectural and design decisions.
BAU support/maintenance teams are generally under-resourced, have extremely limited opportunity for handover from project teams, and have to support many different systems. This usually leads to less than ideal development practices and deteriorating quality over time.
the project team never have to be involved in problem analysis for production outages. They’re never forced to put the right kind of monitoring and logging in place to find root causes.
the project team only do a limited number of releases to production, so have little incentive to invest in reliable automation or production-like test environments.
Therefore – I believe that many projects are the source of ‘instant legacy’, and a major cause of the development and operations divide.
What’s the alternative? Form long-lived teams around applications/products, or sets of features. A team works from a prioritised backlog of work that contains a mix of larger initiatives, minor enhancements, or BAU-style bug fixes and maintenance. Second-level support should be handled by people in the product team. Everyone in the team should work with common process and a clear understanding of technical design and business vision.
This approach is not easy – it introduces new challenges particularly around balancing priorities and budgeting. I’ve observed that the benefits in terms of long term system health definitely outweigh the drawbacks. Like everything – hire good people who care, and give them the right incentives, good things will happen.
In the last couple of years I’ve become very interested in the interactions and collaboration between development and operations teams, the ‘last mile’ of delivering working software into production, and keeping that software healthy and sustainable in production. I’ve had some satisfying experiences working in teams that have bridged part of the divide between development and ops.
Conveniently in the last few months the ‘DevOps’ movement has arrived and a lot of very smart and interesting people have been sharing their ideas. DevOps resonates incredibly loudly with me – bringing focus to both the people and incentive problems that can hinder collaboration between development and ops, with some interesting technical problems around faster delivery and the necessary investment in automation and configuration management.
I find the DevOps landscape very complex to visualise – many of the pieces are interdependant. To get some sense of the breadth, I drew the mind map below. It’s a big mix of different levels of abstraction, and later I’ll try to draw out some themes.
(click through for a full-size image)
I’m sure I’ve missed some major areas of concern, so if you can be bothered looking at the image and it prompts a thought – please do make a comment.
John Carney wrote this short post about complexity in the architecture at his workplace.
@johncarneyau: You know your architecture is too complex when your arrows cross over
A little later someone else chimed in:
@tvars: @johncarneyau surely you need an ESB?!?
This was a cruel joke (despite the lack of emoticon) but it did get me thinking – this is a fundamental problem in the ongoing fight against the inappropriate adoption of ESBs. The level of complexity shown on the board in John’s photo can be daunting. When ESB advocates (or vendors) sell bus integration, they can make that diagram look so nicely clean and ordered – with nice square lines that never intersect. This appeals to the obsessive compulsive pointy haired boss types. The Enterprise Service Bus will guarantee to make your whiteboard diagram 42.4% less complex.
The reality is that the architecture on the whiteboard is relatively simple and consistent compared to most ESB architectures, and an order of magnitude more productive.
I met a team in the recent past who were adopting automated testing. Developers would write some automated unit tests for their application code, and run them in the IDE before marking their work as complete. Testers would then write down the testing scripts for the completed software, and then manually execute those tests, recording results. An automation tester followed behind, writing automated functional tests from a growing backlog of completed manual tests.
Regularly during the day an automated build would run on the CI server – the server would dutifully report the number of unit tests that had failed. Rarely would all the unit tests pass – the team would mark a ‘stable’ build if a separate smoke suite of automated functional tests would pass. If the team was lucky enough to have a stable build at the end of the day it was deployed to a test environment, and the automated functional tests would execute overnight. Any failed functional tests would be raised as ‘bug cards’ on the card wall and reprioritised.
Lots of test automation, and build servers, but was this team practicing continuous integration?
The impact of this cycle meant that developers had no confidence to make any significant changes to the codebase to improve quality. They were discouraged from working beyond the minimum required to complete their task. As the team approached a release deadline, there were fewer ‘new feature’ cards on the wall, but a growing number of ‘bug cards’. The pressure started to mount to fix the bugs as quickly as possible. All sense of sustainable pace is now gone. After the release the team schedule requires it to start on the next batch of functionality, but there’s still quite a large number of bugs hanging over from last release. Clearly ‘agile’ doesn’t work.
Stop the line
Zero tolerance is required. The automated build must be kept ‘green’ – if it’s failed, then the highest priority for the team is to make it pass again. Never report ‘% passing tests’ – only green or red.
Ask the team to agree to never commit new changes to source control on top of a broken build – and stay disciplined despite pressure. If the build cannot be fixed quickly, then team members should know to back out a change quickly and fix it locally before committing again.
Make sure there is a well-known process for running a local ‘pre-commit’ build – it should always be a script that is in source control alongside the source code. This way everyone shares the same script, and if you need to add steps to the script (e.g. duplication checks) then everyone shares the same file. Using the IDE to perform a build (e.g. Visual Studio) is not acceptable.
Ensure everyone can see the build status – set up a build status ‘radiator’ (e.g. greenscreen or bigvisiblewall) on a spare PC and monitor, or acquire a build light, or some other form of highly visible status. Put it in a prominent position, so when a senior executive asks what it is you can explain – they love process controls.
A few years ago Sam Newman wrote a great article about the ‘build fix flag‘ pattern. At every customer since then I’ve introduced the ‘build manifesto’ – printed on a poster on the wall for everyone to understand.
Try not to break the build
Run a pre-commit build locally before checking in
Commit regularly NO COMMIT on a broken build (red light)
IF the build breaks
See who is fixing it (look for the <broken build token>)
If no-one is fixing it, look who triggered the build, tell them
If they aren’t around, start fixing it yourself
Take the <build token> to show you are fixing the build DON’T commit and walk out the door
I like ‘manifesto’ – it sounds suitably radical. Some teams who’ve been working in chaos for a long time will eye the new ‘agile guy’ with suspicion for being a nutter – I’m happy to reinforce that impression. When we start delivering software more regularly and predictably with higher quality than ever before, it doesn’t matter what you thought when I first arrived.
It’s hard to get people to care about the build. It’s especially hard in large teams, where you can’t raise your voice and be heard. Clearly a good solution is NOT to have a large team, but sometimes my hands are tied. A few years ago on a large team we found that people were either ignoring the build failure (most common), or sending emails to update the team on the build status. It was time-consuming, often no-one was sure who was fixing the build if it broke – and regularly everybody just assumed someone else was doing it. We introduced an IRC server and wrote some ruby scripts (‘buildbot’) to post build status to a channel. The team all installed an IRC client and joined the channel, and when the build was broken it was easy to see who was working on a fix at a glance.
We also had an ‘svnbot’ which posted source control commits to the room. This had a really nice side effect – folk were cajoled into writing meaningful commit messages! It provided another focus for the team in understanding the continuous integration of work on a single code line.
Since then I’ve been involved in replicating this approach at multiple sites – it’s always quite useful, at least to start with. If you are using Hudson then there is an excellent Jabber plugin that supports multi-user chat, this has worked well with the ejabberd or openfire jabber servers.
Ultimately however we still have a people problem – if the team has agreed to adopt continuous integration and the build manifesto then you may have to spend some time being ‘build cop’ until the team takes care of it themselves.
Production Ready Increments
This much should surely be obvious in the year 2010: your source control system is not a place to backup your files. If you are concerned about losing work in progress due to an act of god, consider that perhaps you should be checking in more often! If you have less than two hours of work in progress, then you don’t have a lot to lose.
Only commit to source control working code that could be shipped to production at any moment. All production features should work. All the time.
Sounds impossible – regular commits and no breaking changes? This conundrum focusses the team on breaking large tasks up into a series of small changes, each of which can be committed separately. Separate refactoring from adding new features – commit each separately. Use feature toggles to allow new partially-complete features to be disabled.
Doesn’t this take longer? In my experience I do not believe so – it focusses the team on making more careful changes and increasing overall quality. It enables a regular release cycle without having to rush to complete work in progress. It enables teams to work together on a single code line – which has enormous benefits.
I’m very excited about companies that are adopting continuous deployment – where the path to production is significantly automated and deployments can be pushed several times a day. The discipline required by teams to achieve this must be enormous – you can be certain they do not report ‘percentage failed tests’.
I once worked with an architect who was responsible for the technical direction of a major web project. When I arrived on the project it was clear that the intended architecture was not universally understood. There was some architectural documentation and wiki pages but they didn’t convey the information well and were already out of date. Team members were spending a bit too much time working in isolation, and not enough time sharing information. We really just needed to concentrate on telling and re-telling the ‘tale’ of the architecture around a whiteboard.
My favourite moment would come after the architect and I had spent a long time discussing the architecture and various options, sharing experiences. The board would be an unrecognisable scrawl of squiggles, smudges, and illegible text. To my astonishment the architect would grab other members of the team, drag them to the incomprehensible whiteboard and shout “SEE???!!!!”. I’d roll on the ground laughing.
It is of course obvious, but in this usage the whiteboard (or sketch paper and pen) is just a tool of communication within the conversation. A prop. The diagram left behind is completely meaningless unless you were part of that conversation.
I do a lot of ‘project-onboarding’ for new team members. I don’t tend to do this by handing them a ream of documentation. I find that handing someone a complete picture is extremely confusing. Instead we have a conversation and build up that sketch on the whiteboard. Building up that sketch incrementally – even though the resulting picture is incomprehensible – is much more effective at conveying the story of the project.
I think ‘Ruthless’ presents the right intent when automating repetitive tasks – it’s more than just aggressive or compulsive. My previous post talked about deployment repeatability primarily, but our goal is to improve the consistency of all processes that are repeated in the process of creating and maintaining system.
A sequence of actions performed manually multiple times is a surefire recipe for disaster. My tip:
perform a task manually once – you’ll usually be exploring ‘how’ to do the task.
perform the task a second time – write down in a text file the steps you took. If working at a command line (why aren’t you?) then go back through shell history and capture the exact commands and parameters in the text file. Check the text file into source control if you like.
IF you perform the task a third time, then stop and use the text file to write a small script, and check it into source control. Delete the text file.
I recommend against upfront automation – always do something manually a couple of times, so you understand the failure points and consequences. Attempts at upfront automation seem always to lead to frustration.
Mark taught me an important technique – start by building a script that asks for input e.g. release numbers, branch names. Have the script default to an ‘interactive mode’ by confirming each step that it’s going to perform with a prompt “[execute]/skip/quit”. ‘skip’ is important – it allows me to skip steps which is really useful during development where you need to make quick changes and recover the process from a known point.
Encourage the team to use the script in this interactive mode, and fix anything that goes wrong. Eventually you can switch the default to the fully automated mode, but leave the interactive mode in place so that the script can be debugged.
This approach is great because it allows you to incrementally introduce automation, and to carefully watch the steps involved and introduce checks and verifications for anything that goes wrong.
A couple of times I’ve observed a team that has captured the steps to perform something in a wiki page – for example creating a new release support branch in source control, including updating a file signifying the major and minor version to be built. The wiki page will contain detailed steps, and include the lines to be copy/pasted into the terminal, with “<insert release number here>” annotations.
This just kills me – and I’ve seen these wiki pages over a page long. Put those commands in a script and check it into your source control!
It shouldn’t have to be said – check the automation scripts you write into source control. Share the love – make sure all of your team members use the same process to perform routine tasks, and that they can contribute fixes and improvements.
Pick a language that is good for this type of automation – usually an interpreted scripting language of some sort. Use something that might appeal to your operations group – you want them to share ownership of at least some of the automation, and be able to debug and submit patches. I personally encourage the use of Ruby as it has great library support (e.g. highline for interactive mode) and has some great specific build/deploy scripting tools like rake and capistrano.
There’s some simple rules to follow to reduce the unexpected – particularly in build and deployment as part of a Continuous Integration process. If something works, I expect it to work again next time, and will put in place something to make sure it happens exactly the same next time. If something fails instead of just fixing it, I want to put something in place to make sure it never happens again. Simple application of these rules can bring calm and order.
There should be no manual steps required to deploy an application to a target environment (test or production). You should not for example have to unpack a zip file, change the contents of files x and y, and restart service X. If deployment instructions include the word ‘click’ then something is wrong. Every manual step introduces a chance for variation, and removes an opportunity to add an automated check.
Some customers claim to have an automated deployment process – when we dig deeper we find that the instructions to run the automated deployment process run to dozens of steps. Deployments are done into different environments by different people – each of which interpret the manual steps differently, and use different workarounds and additional steps where the process is not well defined or fails regularly.
What do we need to implement true repeatability of deployment?
don’t fix problems ‘in situ’. When a deployment to a test environment fails, do not fix it in place. Investigate the problem, then add something to the deployment process to ensure it cannot happen. This might be a ‘pre flight check’ that makes assertions about the target environment, or a post-deployment verification test that will provide fast feedback that something went wrong. Sometimes this means changing the behaviour of other groups like IT operations or release management to remove this ‘quick just patch it’ approach.
externalise environment-specific configuration. Deploy the *exact* same artefacts in your test and production environments. Anything that is specific to “system test” should be sourced outside of the artefact – from config files, environment vars etc. I have a lot to say here which I’ll save for a dedicated post.
make test environments as close to production as possible. The closer test environments are to production, the less likely there is to have a ‘whoops’ on the production date. Audit this regularly – OS version, service packs, app server versions, database names, directory locations, load balancer configs. This will minimise the number of items you need to place in environment-specific configuration.
automate the deployment of *everything*. Including e.g. apache configs, load balancer config, firewall settings, database upgrade scripts. Everything should come from a known configuration coming out of source control. I’m very keen to learn how to use tools like puppet and chef to assist here.
use exactly the same deployment process from dev to production. Too many times we develop deployment automation that is only used in the test environments, and the production deployment is done by humans following an invisible set of instructions.
share responsibility for building, maintaining, and testing deployment scripts between development and operations. Ensure that changes to scripts are checked back in to source control (easiest way is to embed them in the deployment artefact built by your CI server). Give your ops team commit access to source control.
release everything every time. Don’t cherry pick a set of components to deploy. In every release try deploying all components together – including components that haven’t been changed. Two benefits I’ve realised – eliminate the risk of forgetting a dependant change, and confidence that a rarely-changed component CAN be deployed. If you feel it is risky to deploy a component unnecessarily, then you really need to address those risks. Don’t cop out with the ‘let sleeping dogs lie’ approach. That dog will bite you badly when you come to build and deploy it in a year’s time.
These are just a few of the things I’ve learned help to make deployments boring. Boring *should* be the goal, although you get a heck of a lot less champagne.
Sudden thought – perhaps there is a hidden incentive here that’s driving behaviour? – app deployments that happen like clockwork every two weeks without raising a sweat are boring for some folk – and there is no opportunity to be a hero. I feel a little queasy at this thought…
I thought by 2010 that this would be a standard doctrine, but it’s not (at least with the customer teams I coach). Commit regularly – minimum once per hour. Every minute past one hour should make you very uncomfortable. The hair on the back of your neck should start to stand up at 1.5 hours. A facial tic should begin at 2 hours. At 3 hours a reflex action should kick in to revert local changes and start over in a more incremental way.
Effective continuous integration relies on continuous commits from developers – I commit often, others update (get latest) often, we remain in a perpetual state of integration. Thanks to collective code ownership and a high shared coding standard, I’ll start building on top of (or refactoring) code that you’re committing – while you’re still working on a feature. This is incredibly healthy, and helps us deliver code that is expressive and free from duplication. If we’re accidentally working in the same area, we’ll find out in an hour instead of in two days when the train wreck is unavoidable.
Work in small hops. Red – Green – Refactor – can I commit? If I can’t commit, why not? Make your next priority to get the code back to a state where you can commit.
Deferring commits is like playing ‘chicken’ with the rest of your team.
A common practice in SCM is to create multiple branches (code lines) from a stable baseline, allow teams to work in isolation on these feature branches until they meet some quality gate. The feature branch can then be merged into the baseline to form a release. I find this approach abhorrent in almost all cases. My three main objections are:
1. Multiple active code lines force a conservative approach to design improvement (refactoring)
While there is more than one active code line most teams will defer any widespread design improvement, as any widespread change will be difficult to merge. This means that emergent design and refactoring do not occur, and the software will build further inconsistency and duplication. This effect must not be underestimated – effectively it’s another source of fear, preventing the teams from moving forward.
2. Deferring integration of code lines usually leads to high risk late in delivery
The longer an isolated code line lives, the more pain and risk incurred when merging. This risk can be largely mitigated if the teams are disciplined in regularly merging changes into the feature branches from baseline. However most teams I’ve observed aren’t very disciplined in this regard, and this risk becomes a real issue.
3. Multiple active code lines works against collective code ownership
Teams working in isolation on a separate code line share their work with other teams as late as possible. This leads to code ownership problems, and inconsistency. The code introduced by an isolated team is often quite clearly different to the rest of the codebase, and is disowned by other developers working on other branches.
Other issues with multiple code lines:
complexity can cause significant errors that may not be caught by automated or manual testing, risking production stability.
it is very difficult to consistently spread good technical practices (automated testing, coding standard)
it works against the CI principle of production-ready increments – isolated branches are often used as excuses to leave the software in a broken state for some period of time, instead of working out how to implement a major change incrementally.
But what if I’m working on a feature that isn’t going to be ready in time for the next release? Firstly, are there any smaller increments that we can release to production and get benefit earlier? If not, then we need to release partial work into production, without it changing the current behaviour of the production system until the feature is complete and can be activated. This involves the introduction of ‘feature toggles’ – configuration that disables the new feature implementation in production until it is ready.
This doesn’t have to be runtime configuration – simple switches introduced to environment-specific config files will usually be enough. There is a cost in introducing this conditional behaviour, but in my opinion this is far outweighed by the enablement of single code line and regular metronomic releases.
The approach is also more challenging when altering the behaviour of an existing feature – sometimes requiring significant refactoring to introduce the switch. Sometimes we need to introduce a whole abstraction to be able to switch implementations – this is an enabler for significant ‘architectural refactorings’. This is referred to by Paul Hammant as Branch by Abstraction – and is a very powerful technique.
The prevailing attitude in software development still seems to be that if something is difficult or expensive (or even just not much fun), we try to do it as few times as possible. This usually means deferring it until as late as possible.
merging and integrating the work of multiple people
merging and integrating the work of multiple teams
execution of tests
testing the integration of components
deploying into a production like environment
deploying into production
Most of these things are difficult and expensive, and the temptation is to make more rapid progress early in a project or release by deferring these things until late. Unfortunately these things are also very difficult to predict – in complexity and effort. This means that we often find that we have a significant amount of complexity late in a project or release, just as the pressure on the team starts to rise to deliver. This inevitably leads to delays being announced very late in the delivery of a project or release, or the team to abandon quality.
I don’t want one day to be significantly harder or anxiety-inducing than any other. I don’t like the ‘deployment day’ being a time which people dread, or merging the work of multiple teams to be an unloved task which is risky and error-prone.
My goal with CI is to do these ‘hard’ tasks as often as possible, to invent ways to make these things easier, and to keep this up until the ‘hard’ tasks are painless and risk-free.