DevOps – Ten tips for developers

Last weekend I attended the ThoughtWorks Australia ‘Team Hug’ – a weekend away for all Australian TWers to get together and share ideas and have some fun. On the Saturday we run a conference program. This year we had so much interest (and quite a few international visitors) that sessions were kept short and punchy at 25 minutes.

My presentation was titled “DevOps – Ten tips for loveless developers”. It’s a collection of ideas for ensuring developers do their best to build a DevOps culture of collaboration. The ideas seem very obvious at a glance, but I find that while we all do some of these things some of the time, we don’t do all of them all of the time.

The slides don’t make any sense on their own, so here’s a rough outline. I apologise for any agile jargon.

1. Understand incentives

Understand the organisation context that your operations team is operating within, and the incentives that organisation is providing. Understand that acting to prevent change is a perfectly sane response to incentives that reward production stability. Understanding these incentives can remove some frustration and help you focus on overcoming the barriers.

2. Engage ops early and often

Ensure that the operations team is involved early in new development initiatives, with ops represented in the first project workshops. Ensure ops are given the full business vision for a programme of change – and that they have input into solution and direction equally with the development team members. Provide plenty of opportunity for ops feedback into the agile planning process. Involve ops in regular design sessions and technical showcases.

3. One team

Be relentless in the message that ops and dev are one logical team. Ensure that team has a chance to work together and celebrate together – lunch, drinks etc. Ensure ops have invites to project standups, showcases, retrospectives. Go out of your way to help your team members – for example finding quick ways in development to reduce manual workload for sysadmins.

4. Favour face to face communication

Obvious really. Never lose your dignity by resorting to email carbon-copy escalation wars. Many ops and support teams are monitored and measured using ticketing systems – most of the time we just have to accept this. Ask if you can speak face to face or on the phone, and offer to raise a ticket reflecting the conversation.

5. Ops is an end user (as well as a team member)

Imagine that the operator who gets up at 2am to fix your faulty system is a homicidal maniac who knows where you live. What can you do to ensure that the system is easy to manage, monitor and troubleshoot in production? I’ve written before about cross-functional stories. Ensure they are in your project scope, and champion those stories so they are appropriately prioritised.

6. Share responsibility

Developers should wear pagers and suffer the consequences of their own crap software (there I’ve said it). I’m not necessarily suggesting handing the keys over to production and letting the developers fix things, however they should be given the same information and feedback that the operator gets when the system falls over. I’ve seen this drive improvements to logging and visibility for ops.

There’s lots more to shared responsibility – many development teams I’ve worked with see nothing of the applications they build beyond QA. Good development teams (especially product teams not project teams) will watch production metrics regularly to inform their technical direction.

7. Don’t place orders

Share problems, issues, and proposed designs with your operations team, and ask for help in solving the problems – even if you *think* you already know the answers. Don’t just place an order for servers, networks, and software. Work together to use all of your combined talent and experience.

8. Meet commitments

You are not the first developer to make promises about how much better things will be. Organisations are littered with broken promises, as operational concerns are the first items of scope cut. Don’t over-promise, instead build trust by working together continuously and delivering small incremental improvements.

9. Don’t abuse your friendship

When things are going well and you’ve built a good team relationship with ops, you might find yourself given extra privileges. You might be able to make a phone call and get a problem resolved, instead of entering a ticket queue. You might even be able to have a login for monitoring access to production systems. Be very careful about abusing this privilege, as you might end up losing your new friendship much more quickly than it took to build.

10. Educate yourself

Take some time to learn some skills, for example a working knowledge of unix is essential for most developers. There’s no excuse at this stage not to have a deeper understanding of your target platform.

Finally, some basic politeness still lost on some of my developer friends… say please, thank you, and when you’ve screwed up say sorry!

DevOps and the Iteration Showcase

Look down. Look up again. You’re on the agile team your team could be like.

It’s the end of the iteration, and there’s a showcase this afternoon (sprint demo if you prefer) demonstrating all the new functionality the team has built in the last two weeks. In the room are members of the project team, the product owner, and various stakeholders and interested parties from the marketing and customer service teams who use the product every day. Everyone’s very excited about the new features, and provides some great feedback on the spot.

This sounds great! But something is missing. Where are the ‘ops’ features?

Very few agile projects I’ve been on will demonstrate the ‘cross-functional’* or ops features they’ve completed in the same showcase, but they SHOULD. Features like monitoring, failover testing, deployment automation, performance improvements – these are all very important to our business. If you’ve truly got a DevOps culture, then they should be showcased and celebrated alongside the new whizzy UI features.

How do we make these achievements relevant to a wider audience? Start by describing the work in a different way – talk about the work that’s being done in terms of its benefit to our business.

A technical story would look like:

Enable monitoring of JVM heap allocation.

To make it more understandable to the business, highlight the business benefit in this way:

In order to reduce the risk of an outage as site traffic grows
The operations team need to
Monitor the JVM heap memory allocation

By putting the business benefit up front (and always present) this should help make the story more interesting to showcase.

The regular showcase presentation is also an opportunity to report to the stakeholder group on the current state of the system in production. This can take the form of presenting some selected metrics plotted over time. For a website you might include metrics on site traffic, response times, performance and stability over time. The presentation should support the prioritisation of appropriate cross-functional work to improve those metrics over time.

Getting to the point where cross-functional work is celebrated by a wider stakeholder group requires some creativity and effort. When it works I’ve observed it makes the conversations around proper prioritisation and collaboration on DevOps work so much easier.

* I’ve taken to using the term ‘cross-functional requirements’ (thanks to Sarah) to describe requirements that are cross-cutting and not-directly-functional – for example performance, availability, volume, maintainability. I think the term NFR has become a weasel-word, treated as ‘someone else’s problem’ rather than an important priority. It might just be a word game, but I think it’s useful.

Projects are evil and must be destroyed

The majority of organisations I’ve worked with deliver new system functionality as development projects. These are funded with capex, and have a start and an end. Even projects that are ‘agile’ are still expected to finish at some date in the future, then once the system has been delivered it will undergo ‘handover’ to ‘BAU’. The project team usually moves on to new projects, developing remarkable cases of mass-amnesia along the way.

Projects deliver exactly what they promise. Project teams have little incentive to invest in the long term operation and maintenance of the systems that they create. I’m not saying that the team doesn’t care or is intentionally acting irresponsibly, but when delivery pressure is applied the first things to be dropped from the project schedule will be the cross-functional concerns that make the system reliable, monitorable, deployable, and maintainable in the long run.

The project effect:

  • the project team do not have to live with the long term results of their own architectural and design decisions.
  • BAU support/maintenance teams are generally under-resourced, have extremely limited opportunity for handover from project teams, and have to support many different systems. This usually leads to less than ideal development practices and deteriorating quality over time.
  • the project team never have to be involved in problem analysis for production outages. They’re never forced to put the right kind of monitoring and logging in place to find root causes.
  • the project team only do a limited number of releases to production, so have little incentive to invest in reliable automation or production-like test environments.

Therefore – I believe that many projects are the source of ‘instant legacy’, and a major cause of the development and operations divide.

What’s the alternative? Form long-lived teams around applications/products, or sets of features. A team works from a prioritised backlog of work that contains a mix of larger initiatives, minor enhancements, or BAU-style bug fixes and maintenance. Second-level support should be handled by people in the product team. Everyone in the team should work with common process and a clear understanding of technical design and business vision.

This approach is not easy – it introduces new challenges particularly around balancing priorities and budgeting. I’ve observed that the benefits in terms of long term system health definitely outweigh the drawbacks. Like everything – hire good people who care, and give them the right incentives, good things will happen.

DevOps Mind Map

In the last couple of years I’ve become very interested in the interactions and collaboration between development and operations teams, the ‘last mile’ of delivering working software into production, and keeping that software healthy and sustainable in production. I’ve had some satisfying experiences working in teams that have bridged part of the divide between development and ops.

Conveniently in the last few months the ‘DevOps’ movement has arrived and a lot of very smart and interesting people have been sharing their ideas. DevOps resonates incredibly loudly with me – bringing focus to both the people and incentive problems that can hinder collaboration between development and ops, with some interesting technical problems around faster delivery and the necessary investment in automation and configuration management.

I find the DevOps landscape very complex to visualise – many of the pieces are interdependent. To get some sense of the breadth, I drew the mind map below. It’s a big mix of different levels of abstraction, and later I’ll try to draw out some themes.

(click through for a full-size image)

I’m sure I’ve missed some major areas of concern, so if you can be bothered looking at the image and it prompts a thought – please do make a comment.

Enterprise Service Bust

John Carney wrote this short post about complexity in the architecture at his workplace.

@johncarneyau: You know your architecture is too complex when your arrows cross over

A little later someone else chimed in:

@tvars: @johncarneyau surely you need an ESB?!?

This was a cruel joke (despite the lack of emoticon) but it did get me thinking – this is a fundamental problem in the ongoing fight against the inappropriate adoption of ESBs. The level of complexity shown on the board in John’s photo can be daunting. When ESB advocates (or vendors) sell bus integration, they can make that diagram look so nicely clean and ordered – with nice square lines that never intersect.  This appeals to the obsessive compulsive pointy haired boss types.  The Enterprise Service Bus will guarantee to make your whiteboard diagram 42.4% less complex.

The reality is that the architecture on the whiteboard is relatively simple and consistent compared to most ESB architectures, and an order of magnitude more productive.

Build Manifesto

Image by WELS.net (Creative Commons)

I met a team in the recent past who were adopting automated testing. Developers would write some automated unit tests for their application code, and run them in the IDE before marking their work as complete. Testers would then write down the testing scripts for the completed software, and then manually execute those tests, recording results. An automation tester followed behind, writing automated functional tests from a growing backlog of completed manual tests.

Regularly during the day an automated build would run on the CI server – the server would dutifully report the number of unit tests that had failed. Rarely would all the unit tests pass – the team would mark a build as ‘stable’ if a separate smoke suite of automated functional tests passed. If the team was lucky enough to have a stable build at the end of the day it was deployed to a test environment, and the automated functional tests would execute overnight. Any failed functional tests would be raised as ‘bug cards’ on the card wall and reprioritised.

Lots of test automation, and build servers, but was this team practicing continuous integration?

The impact of this cycle meant that developers had no confidence to make any significant changes to the codebase to improve quality. They were discouraged from working beyond the minimum required to complete their task. As the team approached a release deadline, there were fewer ‘new feature’ cards on the wall, but a growing number of ‘bug cards’. The pressure started to mount to fix the bugs as quickly as possible. All sense of sustainable pace is now gone. After the release the team schedule requires it to start on the next batch of functionality, but there’s still quite a large number of bugs hanging over from last release. Clearly ‘agile’ doesn’t work.

Stop the line

Zero tolerance is required. The automated build must be kept ‘green’ – if it’s failed, then the highest priority for the team is to make it pass again. Never report ‘% passing tests’ – only green or red.

Ask the team to agree to never commit new changes to source control on top of a broken build – and stay disciplined despite pressure. If the build cannot be fixed quickly, then team members should know to back out a change quickly and fix it locally before committing again.

Make sure there is a well-known process for running a local ‘pre-commit’ build – it should always be a script that is in source control alongside the source code. This way everyone shares the same script, and if you need to add steps to the script (e.g. duplication checks) then everyone shares the same file. Using the IDE to perform a build (e.g. Visual Studio) is not acceptable.
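As a rough illustration of such a shared script, here is a minimal Python sketch. The step names and the commands in STEPS are hypothetical (including the tools/check_duplication.py path), so substitute whatever your project really runs. The only idea it shows is that every step runs in order and the first failure turns the whole build red.

```python
import subprocess
import sys

# Hypothetical steps: replace with your project's real compile/test/check commands.
STEPS = [
    ("unit tests", [sys.executable, "-m", "unittest", "discover"]),
    ("duplication check", [sys.executable, "tools/check_duplication.py"]),
]


def run_build(steps, runner=subprocess.call):
    """Return True (green) only if every step exits with status 0."""
    for name, command in steps:
        print(f"==> {name}")
        if runner(command) != 0:
            # Stop at the first failure: there is no 'mostly passing' build.
            print(f"RED: '{name}' failed, do not commit")
            return False
    print("GREEN: safe to commit")
    return True
```

Ending the file with something like sys.exit(0 if run_build(STEPS) else 1) would let the team run it from the command line before every check-in. The runner parameter exists only so the logic can be exercised without launching real processes.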

Ensure everyone can see the build status – set up a build status ‘radiator’ (e.g. greenscreen or bigvisiblewall) on a spare PC and monitor, or acquire a build light, or some other form of highly visible status. Put it in a prominent position, so when a senior executive asks what it is you can explain – they love process controls.

A few years ago Sam Newman wrote a great article about the ‘build fix flag’ pattern. At every customer since then I’ve introduced the ‘build manifesto’ – printed on a poster on the wall for everyone to understand.

Try not to break the build
Run a pre-commit build locally before checking in
Update regularly
Commit regularly
NO COMMIT on a broken build (red light)

IF the build breaks
See who is fixing it (look for the <broken build token>)
If no-one is fixing it, look who triggered the build, tell them
If they aren’t around, start fixing it yourself
Take the <build token> to show you are fixing the build
DON’T commit and walk out the door

I like ‘manifesto’ – it sounds suitably radical.  Some teams who’ve been working in chaos for a long time will eye the new ‘agile guy’ with suspicion for being a nutter – I’m happy to reinforce that impression.  When we start delivering software more regularly and predictably with higher quality than ever before, it doesn’t matter what you thought when I first arrived.

BuildBot

It’s hard to get people to care about the build. It’s especially hard in large teams, where you can’t raise your voice and be heard.  Clearly a good solution is NOT to have a large team, but sometimes my hands are tied. A few years ago on a large team we found that people were either ignoring the build failure (most common), or sending emails to update the team on the build status.  It was time-consuming, often no-one was sure who was fixing the build if it broke – and regularly everybody just assumed someone else was doing it.  We introduced an IRC server and wrote some ruby scripts (‘buildbot’) to post build status to a channel. The team all installed an IRC client and joined the channel, and when the build was broken it was easy to see who was working on a fix at a glance.

We also had an ‘svnbot’ which posted source control commits to the room. This had a really nice side effect – folk were cajoled into writing meaningful commit messages! It provided another focus for the team in understanding the continuous integration of work on a single code line.

Since then I’ve been involved in replicating this approach at multiple sites – it’s always quite useful, at least to start with. If you are using Hudson then there is an excellent Jabber plugin that supports multi-user chat. This has worked well with the ejabberd or openfire jabber servers.

Ultimately however we still have a people problem – if the team has agreed to adopt continuous integration and the build manifesto then you may have to spend some time being ‘build cop’ until the team takes care of it themselves.

Production Ready Increments

This much should surely be obvious in the year 2010: your source control system is not a place to back up your files. If you are concerned about losing work in progress due to an act of god, consider that perhaps you should be checking in more often! If you have less than two hours of work in progress, then you don’t have a lot to lose.

Only commit to source control working code that could be shipped to production at any moment. All production features should work. All the time.

Sounds impossible – regular commits and no breaking changes? This conundrum focusses the team on breaking large tasks up into a series of small changes, each of which can be committed separately. Separate refactoring from adding new features – commit each separately. Use feature toggles to allow new partially-complete features to be disabled.
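For anyone who hasn’t met feature toggles, a minimal sketch in Python might look like the following. The toggle names and the checkout function are invented for illustration, and a real team would normally load the toggle values from configuration rather than hard-code them.

```python
# Hypothetical toggle names and values, hard-coded here only to keep the sketch short.
TOGGLES = {
    "new_checkout": False,  # partially complete: committed but switched off
    "saved_baskets": True,  # finished and released
}


def is_enabled(feature, toggles=TOGGLES):
    """Unknown features count as off, so unfinished code is never exposed by accident."""
    return toggles.get(feature, False)


def checkout(basket, toggles=TOGGLES):
    """Route to the new code path only when its toggle is switched on."""
    if is_enabled("new_checkout", toggles):
        return f"new checkout flow for {len(basket)} items"
    return f"existing checkout flow for {len(basket)} items"
```

The half-built feature lives on the main code line and ships with every release, but production behaviour does not change until someone flips the toggle.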

Doesn’t this take longer? In my experience, no – it focusses the team on making more careful changes and increasing overall quality. It enables a regular release cycle without having to rush to complete work in progress. It enables teams to work together on a single code line – which has enormous benefits.

I’m very excited about companies that are adopting continuous deployment – where the path to production is significantly automated and deployments can be pushed several times a day. The discipline required by teams to achieve this must be enormous – you can be certain they do not report ‘percentage failed tests’.

Whiteboard architecture – SEE???!!!

I once worked with an architect who was responsible for the technical direction of a major web project. When I arrived on the project it was clear that the intended architecture was not universally understood. There was some architectural documentation and wiki pages but they didn’t convey the information well and were already out of date. Team members were spending a bit too much time working in isolation, and not enough time sharing information. We really just needed to concentrate on telling and re-telling the ‘tale’ of the architecture around a whiteboard.

Image by Jeff Youngstrom (Creative Commons)

My favourite moment would come after the architect and I had spent a long time discussing the architecture and various options, sharing experiences. The board would be an unrecognisable scrawl of squiggles, smudges, and illegible text. To my astonishment the architect would grab other members of the team, drag them to the incomprehensible whiteboard and shout “SEE???!!!!”. I’d roll on the ground laughing.

It is of course obvious, but in this usage the whiteboard (or sketch paper and pen) is just a tool of communication within the conversation. A prop. The diagram left behind is completely meaningless unless you were part of that conversation.

I do a lot of ‘project-onboarding’ for new team members. I don’t tend to do this by handing them a ream of documentation. I find that handing someone a complete picture is extremely confusing. Instead we have a conversation and build up that sketch on the whiteboard. Building up that sketch incrementally – even though the resulting picture is incomprehensible – is much more effective at conveying the story of the project.

Continuous Integration – Ruthless Automation

I think ‘Ruthless’ presents the right intent when automating repetitive tasks – it’s more than just aggressive or compulsive. My previous post talked about deployment repeatability primarily, but our goal is to improve the consistency of all processes that are repeated in the process of creating and maintaining a system.

A sequence of actions performed manually multiple times is a surefire recipe for disaster. My tip:

  • perform a task manually once – you’ll usually be exploring ‘how’ to do the task.
  • perform the task a second time – write down in a text file the steps you took. If working at a command line (why aren’t you?) then go back through shell history and capture the exact commands and parameters in the text file. Check the text file into source control if you like.
  • IF you perform the task a third time, then stop and use the text file to write a small script, and check it into source control. Delete the text file.

I recommend against upfront automation – always do something manually a couple of times, so you understand the failure points and consequences. Attempts at upfront automation seem always to lead to frustration.

Interactive Mode

Mark taught me an important technique – start by building a script that asks for input e.g. release numbers, branch names. Have the script default to an ‘interactive mode’ by confirming each step that it’s going to perform with a prompt “[execute]/skip/quit”. ‘skip’ is important – it allows me to skip steps which is really useful during development where you need to make quick changes and recover the process from a known point.

Encourage the team to use the script in this interactive mode, and fix anything that goes wrong. Eventually you can switch the default to the fully automated mode, but leave the interactive mode in place so that the script can be debugged.

This approach is great because it allows you to incrementally introduce automation, and to carefully watch the steps involved and introduce checks and verifications for anything that goes wrong.
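Here is one way the interactive mode could be sketched. I’d normally reach for Ruby, but this version is in Python and uses nothing beyond the standard library. The function names are my own invention, and the steps are whatever (description, action) pairs your script needs.

```python
def confirm(description, ask=input):
    """Prompt with execute as the default; return 'execute', 'skip' or 'quit'."""
    answers = {"": "execute", "e": "execute", "s": "skip", "q": "quit"}
    while True:
        reply = ask(f"{description} [execute]/skip/quit: ").strip().lower()
        # Only the first letter matters; anything unrecognised asks again.
        if reply[:1] in answers:
            return answers[reply[:1]]


def run_steps(steps, interactive=True, ask=input):
    """Run (description, action) pairs in order; return the descriptions that ran."""
    done = []
    for description, action in steps:
        choice = confirm(description, ask) if interactive else "execute"
        if choice == "quit":
            break
        if choice == "skip":
            continue
        action()
        done.append(description)
    return done
```

Pressing enter accepts the default and executes the step. Passing interactive=False gives the fully automated mode without changing any of the steps, which is what makes the eventual switch of default painless.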

Wiki Scripting

A couple of times I’ve observed a team that has captured the steps to perform something in a wiki page – for example creating a new release support branch in source control, including updating a file signifying the major and minor version to be built. The wiki page will contain detailed steps, and include the lines to be copy/pasted into the terminal, with “<insert release number here>” annotations.

This just kills me – and I’ve seen these wiki pages over a page long. Put those commands in a script and check it into your source control!
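To make the point concrete, the “<insert release number here>” placeholder simply becomes a parameter. The sketch below is in Python and assumes git, a release-MAJOR.MINOR branch naming convention and a two-line version file format, none of which come from a real project. Copy the actual commands from your own wiki page.

```python
import re


def release_branch_commands(release):
    """Return the commands that cut a release branch for a MAJOR.MINOR release."""
    if not re.fullmatch(r"\d+\.\d+", release):
        raise ValueError(f"expected MAJOR.MINOR, got {release!r}")
    branch = f"release-{release}"  # hypothetical naming convention
    return [
        ["git", "checkout", "-b", branch],
        # The version file would be rewritten here, before the commit.
        ["git", "commit", "--all", "--message", f"Set version to {release}"],
        ["git", "push", "origin", branch],
    ]


def version_file_contents(release):
    """Hypothetical version file format: one line each for major and minor."""
    major, minor = release.split(".")
    return f"major={major}\nminor={minor}\n"
```

Each command list can be handed to subprocess.check_call, and a mistyped release number is rejected before anything touches source control, which a wiki page can never do.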

Source Control

It shouldn’t have to be said – check the automation scripts you write into source control. Share the love – make sure all of your team members use the same process to perform routine tasks, and that they can contribute fixes and improvements.

Languages

Pick a language that is good for this type of automation – usually an interpreted scripting language of some sort. Use something that might appeal to your operations group – you want them to share ownership of at least some of the automation, and be able to debug and submit patches. I personally encourage the use of Ruby as it has great library support (e.g. highline for interactive mode) and has some great specific build/deploy scripting tools like rake and capistrano.

Continuous Integration – Repeatability

There are some simple rules to follow to reduce the unexpected – particularly in build and deployment as part of a Continuous Integration process. If something works, I expect it to work again next time, and will put in place something to make sure it happens exactly the same next time. If something fails, instead of just fixing it I want to put something in place to make sure it never happens again. Simple application of these rules can bring calm and order.

There should be no manual steps required to deploy an application to a target environment (test or production). You should not for example have to unpack a zip file, change the contents of files x and y, and restart service X. If deployment instructions include the word ‘click’ then something is wrong. Every manual step introduces a chance for variation, and removes an opportunity to add an automated check.

Some customers claim to have an automated deployment process – when we dig deeper we find that the instructions to run the automated deployment process run to dozens of steps. Deployments are done into different environments by different people – each of whom interprets the manual steps differently, and uses different workarounds and additional steps where the process is not well defined or fails regularly.

What do we need to implement true repeatability of deployment?

  • don’t fix problems ‘in situ’. When a deployment to a test environment fails, do not fix it in place. Investigate the problem, then add something to the deployment process to ensure it cannot happen. This might be a ‘pre flight check’ that makes assertions about the target environment, or a post-deployment verification test that will provide fast feedback that something went wrong. Sometimes this means changing the behaviour of other groups like IT operations or release management to remove this ‘quick just patch it’ approach.
  • externalise environment-specific configuration. Deploy the *exact* same artefacts in your test and production environments. Anything that is specific to “system test” should be sourced outside of the artefact – from config files, environment vars etc. I have a lot to say here which I’ll save for a dedicated post.
  • make test environments as close to production as possible. The closer test environments are to production, the less likely you are to have a ‘whoops’ on the production date. Audit this regularly – OS version, service packs, app server versions, database names, directory locations, load balancer configs. This will minimise the number of items you need to place in environment-specific configuration.
  • automate the deployment of *everything*. Including e.g. apache configs, load balancer config, firewall settings, database upgrade scripts. Everything should come from a known configuration coming out of source control. I’m very keen to learn how to use tools like puppet and chef to assist here.
  • use exactly the same deployment process from dev to production. Too many times we develop deployment automation that is only used in the test environments, and the production deployment is done by humans following an invisible set of instructions.
  • share responsibility for building, maintaining, and testing deployment scripts between development and operations. Ensure that changes to scripts are checked back in to source control (easiest way is to embed them in the deployment artefact built by your CI server). Give your ops team commit access to source control.
  • release everything every time. Don’t cherry pick a set of components to deploy. In every release try deploying all components together – including components that haven’t been changed. Two benefits I’ve realised – eliminate the risk of forgetting a dependent change, and confidence that a rarely-changed component CAN be deployed. If you feel it is risky to deploy a component unnecessarily, then you really need to address those risks. Don’t cop out with the ‘let sleeping dogs lie’ approach. That dog will bite you badly when you come to build and deploy it in a year’s time.
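Two of the points above, the ‘pre flight check’ and externalised configuration, fit together naturally. Here is a minimal Python sketch, assuming the environment-specific settings arrive as environment variables. The names APP_DB_URL and APP_LOG_DIR are made up for the example.

```python
import os

# Hypothetical setting names: the point is that they live outside the artefact.
REQUIRED_SETTINGS = ["APP_DB_URL", "APP_LOG_DIR"]


def preflight(env=os.environ, dir_exists=os.path.isdir):
    """Return a list of problems with the target environment; empty means go."""
    problems = [f"{name} is not set" for name in REQUIRED_SETTINGS if not env.get(name)]
    log_dir = env.get("APP_LOG_DIR")
    if log_dir and not dir_exists(log_dir):
        problems.append(f"log directory {log_dir} does not exist")
    return problems
```

The deployment script would call preflight() first and refuse to continue if the list is not empty. Each time a deployment fails for a new reason, the fix is another assertion in this function rather than a patch on the server.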

These are just a few of the things I’ve learned help to make deployments boring. Boring *should* be the goal, although you get a heck of a lot less champagne.

Sudden thought – perhaps there is a hidden incentive here that’s driving behaviour? – app deployments that happen like clockwork every two weeks without raising a sweat are boring for some folk – and there is no opportunity to be a hero.  I feel a little queasy at this thought…

Continuous Integration – Commit Frequently

I thought by 2010 that this would be a standard doctrine, but it’s not (at least with the customer teams I coach). Commit regularly – minimum once per hour. Every minute past one hour should make you very uncomfortable. The hair on the back of your neck should start to stand up at 1.5 hours. A facial tic should begin at 2 hours. At 3 hours a reflex action should kick in to revert local changes and start over in a more incremental way.

Effective continuous integration relies on continuous commits from developers – I commit often, others update (get latest) often, we remain in a perpetual state of integration. Thanks to collective code ownership and a high shared coding standard, I’ll start building on top of (or refactoring) code that you’re committing – while you’re still working on a feature. This is incredibly healthy, and helps us deliver code that is expressive and free from duplication. If we’re accidentally working in the same area, we’ll find out in an hour instead of in two days when the train wreck is unavoidable.

Work in small hops. Red – Green – Refactor – can I commit? If I can’t commit, why not? Make your next priority to get the code back to a state where you can commit.

Deferring commits is like playing ‘chicken’ with the rest of your team.
