Tuesday, July 5, 2016

Developers need to care about resiliency

Disclaimer: I'm the technical lead for Lightbend ConductR - a tool that focuses on managing distributed systems with a key goal of resiliency.

I've been doing a reasonable amount of travelling over the past few years. Overall I enjoy it; I don't think that I'm away that much that it has become painful - may be 3-6 international flights per year.

One of the things you hear about when travelling is missing an international connection. I've been fortunate in that this has happened just once; a couple of weeks ago in fact.

The airline was British Airways (BA), and they did a really good job of trying to make up time given that flights out of Heathrow were causing delays across Europe. Thus my flight from Berlin TXL to London LHR was about two hours late. I missed my Sydney SYD flight from LHR. The BA staff did a great job of putting me up in a hotel overnight and getting me on to the next available flight. Honestly, from a staff perspective, BA were fantastic in fact.

What was frustrating though was that it took about two hours to arrange the accomodation and flight booking. This was also considering that I didn't have to queue for long and that a staff member attended to me in a reasonable time frame. The problem was the BA computer system.

Apparently BA have some new system. There were IT staff walking around helping the front-of-house staff get into the system and deal with its incapacity to handle any load. The IT staff were frustrated. The front-of-house staff were frustrated. I was frustrated (although not as frustrated as a Business Class passenger in front of me who felt that his ticket meant that BA should treat him like royalty!).

BA's computer system had an amazing effect on all concerned, except most likely the people that wrote it. In my opinion the original developers should be there supporting the front-of-house staff. They should feel the pain that they have inflicted.

I'm sure that you have similar stories to share, where computer systems have failed you miserably. Computer systems will of course fail, that's natural, but it is the fact that their degree of failure is considered acceptable that is the problem. Computer systems should not fail to the extent that they do. Your airline reservation system, online banking site, or whatever it is, it should be more reliable that it probably has been.

The problem is that developers do not understand that building-in resilience to their software is more important than most other things. As my colleague, Jonas Bonér has stated many times, "without resiliency nothing else matters". He's so right. Why is it then that developers just don't get this?

My answer to that is that many developers just see what they do as a job, and they don't really care about what they do. Putting that aside though, creating and then indeed managing distributed systems, a key requirement for resiliency, is harder than not; not hard, but harder and developers are lazy (btw: in case you don't realise this, I'm a developer!).

We need systems that are resilient. We therefore need developers to care about resiliency. The more that developers care about resiliency, the more tools and technologies we'll see appearing in support of it. I strongly feel that it all starts with the developer though.

I imagine a world where, given the inevitability of missing flight connections, I can wait in a queue for no longer than 10 minutes, be handled within another 10 minutes and then sleep off the tiredness and inconvenience of waiting another 24 hours for my next flight. The developer just needs to start caring in order for this to happen. Make he or she responsible for managing their system in production and they'll start caring, I guarantee it.

Here's a language/tool agnostic starting point for you if you are a developer that cares enough to have read this far: The Reactive Manifesto.

Thanks!

No comments: