Programmatic Chaos with AWS Fault Injection Simulator – Continued

This post will be a continuation of our post introducing the AWS Fault Injection Simulator.

The idea was to run an experiment and remediate our findings but as it turned out, the post was already too long with a simple setup so I split it in two parts.

I’d recommend you to check the first part to better understand the context of this entry, but the “tl;dr” is that we set up an experiment with FIS that would target for termination all EC2 instances of an application managed by Elastic Beanstalk. The beanstalk configuration has an autoscaling group with a minimum of 1 instance, which meant terminating it incurred on an outage.

The Remediation

On the application side, we need to make sure our environment runs on a minimum of 2 instances.

       

This is a good reminded that even though you make use of managed services, you’re still in charge of the behavior of it. Managed services (regardless of being compute, databases, containers, etc) will do all the heavy lifting but it will only operate in whichever way you tell it to. Our first FIS experiment showed that the application setup wasn’t resilient enough to failure. Whilst, beanstalk made sure to spin up a new instance to replace the terminated one, there was still a minute or two of downtime.

The New Experiment

Now that we’re running more instances, I’m also going to update the experiment template. On its current form, it would still target all instances because it was just based on tags which are shared by all ec2 resources managed by beanstalk.

The action can remain as is, that is a terminate-ec2 type. The change will be at the target level. Here, we need to update it in such a way that it targets a subset of the instances and FIS provides you with two options to do so.

  1. Count: Fixed number of resources that will be targeted by the matching criteria
  2. Percentage: Percentage of affected instances. NB: FIS will round down the resulting number of targets in cases where you have an odd number of resources.

I want to test how my application behaves if I lose half my fleet, so I’ll set it up with a Percent mode at 50%. In this particular case, this is the equivalent to choosing Count with a value of 1.

After running this new experiment, we can test our application and see that there are no perceived changes to it. However, upon closer inspection to our resources, we’ll learn a few things

  1. Our EC2 fleet downside to 1 (which means our action ran as intended)
  2. Beanstalk is showing a Degraded state because 1 of the instances stopped sending data. If you remember, our application state was Unknown when the entire fleet disappeared.

   

We now have a new configuration to withstand certain types of failure and an experiment we can run on a regular basis to make sure our application configuration is up to it.

There are many more types of actions you can perform with FIS that we can explore in future entries.

Leave a Reply

Your email address will not be published. Required fields are marked *