Break me if you can! How to Test the Ability of Large-Scale, Distributed Software Systems to Cope with Failures?
40 min
We run functional tests, measure performance, and write unit tests. However, these activities may not be enough for large-scale, heavily loaded distributed systems where errors are costly.
What will happen to your distributed system when a network fault partitions it?
Will your system respond correctly to the failure of cluster nodes?
Are you sure that your database does not lose data?
Have you ever thought about the reliability and security of your system?
In this talk, I will share how we drew on the experience of Amazon, Netflix, and Twitter to build our own framework for testing a system's ability to cope with failures. Using the testing of Sberbank's new microservice architecture as an example, we will analyse various scenarios for testing the system's response to failures and talk about the technologies we use.