Programmatic Chaos with AWS Fault Injection Simulator

Chaos Engineering has been around for a while, after being popularized by Netflix during their migration to the cloud. However, despite their best efforts to open source their tooling, a proper secure and reliable set up was complicated enough most people.

Fast forward to the AWS announcement of a limited preview new managed chaos engineering service called AWS Fault Injection Simulator at re:Invent 2020. After a couple of months of limited access, the service is now GA (us-east-1 only at the time of this post) and today’s post is about getting started with it.

The Setup

There are a number of actions the service can perform (stop/terminate instances, throttle APIs, etc) against a number of different targets (EC2, ECS, RDS with more to come). For this entry, we’ll keep it simple and just focus on terminating a production EC2 instance experiment. In this particular case, I’ll be using the sample NodeJS application managed by Elastic Beanstalk.

The Application

As mentioned before, I’m just using the sample NodeJS application that Elastic Beanstalk offers you to quickly get started. However, I wanted to some of the configuration choices that I made to my environment.

The first bit of configuration (and one to pay attention to) is around the high availability for my environment. You’ll notice that while it is load balanced and scale up to 4 instances, the minimum has actually been set at 1.

You can also see the resources the service created for us, which in this case is one EC2 instance to which I’ve applied a resource tag at the application level. This tag is of the form chaos:ready, which it is descriptive enough for me to understand what instances I want FIS to target during its experiments. You could choose whatever value of the key value pair tag or just not have one altogether.

Finally, here’s what the sample application looks like and it also serves as a one to see how our environment is running.

Experiment Time

From the FIS homepage, you’ll see your option is to create a new experiment template so go ahead and hit that button.

Disclaimer: FIS will execute whatever actions you define against your resources. The service doesn’t produce fake metrics or wizardy to simulate how a potential disruption affects your system. The service will indeed, terminate your instances if that’s the action you have chosen. You will be provided with a number of warning signs along the way but it’s better to be safe than sorry.

Think of the template as the definition for your experiments, the place in which you can specify actions, targets and alarms on top of the usual name, role (the role requires a trust relationship on ‘fis.amazonaws.com’) and tags that we’re used to from other AWS services.  As previously mentioned, today’s experiment will only perform a terminate instance action.

When creating our action, we’re asked to provide a name for it as well as an action type from a predefined list. Once you’ve selected your action type, the Target dropdown will appear with an already prepopulated value created for you. The last option is something called “Start after“, what this means is in cases were a template has multiple actions, you might choose to run them in parallel or in sequence. Right now, it can be ignored given we’re going for the one action.

Now, let’s edit the target FIS created for us. I’ll start by updating the name for something a bit more descriptive, the Resource Type can stay as is because we’re indeed targeting EC2 instances. Now comes the fun part and arguably the area in which you need to focus the most which is how are we going to target these resources.

We see the selected method by default is using a resource ID. For our particular example, it might look like it’s enough and it indeed could be for a one off execution. It is true we’re only running one EC2 instance but we need to save the template with a fixed ID, so that means we’re not really in a position to reuse the template given that if we succeed and actually terminate the instance that particular ID will be lost.

So let’s use tags and filters and as soon as we select that method, a couple of “resource” options will appear. The first one is tags, and as you can imagine it will only run against resources with the specified tags. This will be the place in which I’ll use that chaos:ready tag from before.

The second option is called filters and I highly recommend you to follow the documentation link as this is the area where targets become truly powerful. For the sake of simplicity (this post is already too long) but not to leave you hanging, I’ll create one that targets only EC2 instances that are in a running state.

The Stop Condition section will provide you with the necessary safe guard to stop the experiment if a certain criteria is met. It is an optional value and I won’t be using it now but I’d suggest to always have one for serious experiments.

Go ahead finish the creation of the template. The service will make sure you’re sure about it with with a nice warning sign.

I’m now ready to start the template, which will in return create an experiment instance. The start process comes with the same warning as the creation one and it should run successfully.

Now that the experiment has finished, let’s have a look at the chaos it caused.

My beanstalk URL now returns an error, which means the underlying EC2 instance has been successfully terminated.

We can confirm our suspicions by looking at the health of our environment as well at the specific time in which it happened by looking at the metrics.

Beanstalk will automatically spin up a new instance and your environment will be back to healthy in a minute or two but it is a good reminded that even if you’re using a managed service, the service can only do what you tell it to do. In our case, because our minimum configuration was one instance, terminating it meant a complete disruption of our application.

In our follow up post with a way of mitigating that but still being able to run chaos experiments on our environments. Check it out here : https://tutorialsforcloud.com/2021/03/25/programmatic-chaos-with-aws-fault-injection-simulator-continued/

Leave a Reply

Your email address will not be published. Required fields are marked *