Fault tolerance: why and how we test it in production

  • 40 min

This talk explains why we test how our system behaves under component failures directly in production rather than in a test environment, and how we implement this approach.

The drills are meant to answer questions such as:
  • Can the system continue operating despite failures?
  • Will our backup mechanisms actually work as expected?
  • Will we lose any data?
  • How will the system recover?
  • Will human intervention be required, and if so, is the team prepared for it?

We will discuss how we organise this process, how it evolves, what goals we set, and how we analyse the results of such resilience drills.
