Paris - USI 2011 Session

Building Resilience in Web Development and Operations

Resilience engineering can be defined as the "ability of a system to adjust its functioning before, during, or after changes and disturbances, so that it can sustain operations even after a major mishap or in the presence of continuous stress." (Erik Hollnagel) 

Building a resilient system means spending time (and money) to anticipate, monitor, respond to, and learn from failure. Learning from failures can be surprisingly tricky. For one, failures offer only a tiny view into what makes systems and organizations resilient. You have to look closely not just at how things went wrong, but also at how things go right under normal circumstances. Human error can serve as a learning tool only when it is given the right perspective and context. Post-mortem meetings gone wrong don't just prevent learning; they can severely harm an organization's ability to learn and improve in the future.

I'm going to talk about how a growing web application (Etsy.com) can aim to be resilient in this sense. Resilience here applies not only to the software and infrastructure, but also to the development and operations staff who build and maintain them through growth and change.

I'll talk about anticipating, monitoring, responding to, and learning from failures and outages in this context. We'll touch on the shift in perspective from spending effort on preventing failure to developing your ability to respond to the failures that will inevitably occur.

Throughout the talk, I’ll give plenty of technical details to illustrate these approaches.