Security Chaos Engineering: Products and Tools


As we discussed in a previous post, chaos engineering is a relatively new idea in the security domain. But there are products and tools that security teams can use to implement it if they so choose.

As the first to apply chaos engineering to software development, Netflix built many of its own tools. But given the results organizations such as Netflix have had with the practice, general-purpose tools have started to emerge, allowing more organizations to add chaos engineering to their cloud development arsenal. Such products typically provide a general-purpose framework for developing experiments, along with utilities for deploying them and reporting the results. Libraries of specific, focused experiments work within these general-purpose frameworks.

Currently, there aren’t many security-specific experiments in the libraries included in these products, but we expect that to change. There is a growing community of security professionals who are both advocating security chaos engineering and developing tests through open source and other community initiatives. And as general-purpose chaos engineering tools mature, their experiment libraries will include more security-specific experiments. Today, however, security managers should be prepared to design and build their own experiments using the tools these general-purpose platforms provide.

Here’s a basic overview of some of the tools available:

  • ChAP: Netflix developed the Chaos Automation Platform, or ChAP, to automate experiments due to the rapid change inherent to its production systems, taking the notion of continuous experimentation quite literally. ChAP interrogates the deployment pipeline for a specific service. It then launches both experiment and control clusters of that service and routes a small amount of traffic to each. A specified scenario is applied to the experimental group, and the results of the experiment are reported to the service owner. ChAP will automatically end an automated experiment if it exceeds a predefined error budget. Netflix integrated ChAP with Spinnaker, its CI/CD system, allowing the engineers to run experiments often and continuously. Netflix says it has identified and prevented resiliency-threat regressions since deploying ChAP in this fashion.

  • The Chaos Toolkit: The Chaos Toolkit is an open-source project that provides an extensible toolkit for experiments that developers can adapt to specific use cases. The Chaos Toolkit allows developers to create their own “probes” (for observing system state as part of an experiment) and “actions” (for affecting the system while conducting an experiment). Developers can write and package a Python function in a module that can be called from the Chaos Toolkit, execute an arbitrary executable, or invoke an HTTP endpoint.

  • Gremlin: Gremlin provides a general-purpose, commercial product that the company positions as “failure as a service.” Gremlin supports chaos engineering experimentation and reporting through its library of failure testing modes, including resource exhaustion (CPU, memory, I/O, or disk bottlenecks), bad behavior (dying processes, time drifts, instance reboots), and unreliable networks. The Chaos Toolkit team recently announced that the toolkit supports Gremlin.

  • Verica: A recent startup, Verica was founded by Aaron Rinehart and Casey Rosenthal. Rosenthal ran the chaos engineering team at Netflix; he also co-authored the O'Reilly book on Chaos Engineering and the manifesto at PrinciplesofChaos.org. Verica provides an enterprise platform for what it calls “Continuous Verification,” which is based on chaos engineering. Verica based its platform on Netflix's ChAP and, according to Rinehart, it includes security-specific chaos experiments. Verica’s platform sits on-prem or within the customer's cloud.

  • ChaoSlingr: The first security-specific tool to appear was a relatively simple but interesting open source project, ChaoSlingr. (The project is now archived, but can provide some value in showing how to create and run experiments in a low-cost way.) Created by a team Rinehart led, ChaoSlingr is a security experiment and reporting framework. Anyone can write their own experiments. Written in Python, the framework consists of four AWS Lambda functions, as follows:

    • Generatr, which identifies the object to inject the failure on and calls Slingr

    • Slingr, which injects the failure

    • Trackr, which logs details about the experiment as it occurs

    • Experiment description, which provides documentation on the experiment along with applicable input and output parameters for Lambda functions
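
As a rough illustration of how the Generatr, Slingr, and Trackr roles described above fit together, here is a minimal Python sketch. It is not the project's actual code: the SecurityGroup class, the function signatures, and the open-port scenario are all hypothetical, and everything runs in memory rather than against AWS. The opt-in flag is likewise an assumption, included to show one way of limiting an experiment's blast radius.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SecurityGroup:
    """Hypothetical in-memory stand-in for a cloud firewall rule set."""
    name: str
    open_ports: set = field(default_factory=set)
    opted_in: bool = False  # assumption: only opted-in targets are eligible


def trackr(log, event, **details):
    """Trackr role: record what happened and when."""
    entry = {"time": datetime.now(timezone.utc).isoformat(), "event": event}
    entry.update(details)
    log.append(entry)


def slingr(group, port, log):
    """Slingr role: inject the failure (here, an unauthorized open port)."""
    group.open_ports.add(port)
    trackr(log, "port_opened", group=group.name, port=port)


def generatr(groups, candidate_ports, log, rng=random):
    """Generatr role: pick an eligible target, then call Slingr."""
    targets = [g for g in groups if g.opted_in]
    if not targets:
        trackr(log, "no_target")
        return None
    group = rng.choice(targets)
    port = rng.choice(candidate_ports)
    slingr(group, port, log)
    return group, port


if __name__ == "__main__":
    experiment_log = []
    fleet = [SecurityGroup("web", {443}, opted_in=True),
             SecurityGroup("db", {5432})]
    generatr(fleet, [23, 3389], experiment_log)
    for entry in experiment_log:
        print(entry)
```

A real experiment would pair the injected change with a hypothesis, for example that the unauthorized port is detected and closed within a set time, and would use the log to check whether that hypothesis held.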

Beyond these general-purpose chaos engineering tools, system-specific utilities for performing fault experiments are starting to emerge. For example, Istio, the open-source service mesh technology, includes the ability to perform chaos experiments in a microservices environment, without the need to change the applications. More specifically, Istio allows testers to: 

  • Throw a 503 Service Unavailable error: System managers can easily configure Istio to return 503 errors to service requests, testing how robust distributed applications are in the face of unavailable services. 

  • Inject service response delays: Istio allows managers to inject variable-length network delays at different points in the system without changing any code. Random response delays can be a difficult problem to deal with in a complex microservices environment. Using this feature with Jaeger tracing, managers can spot problems proactively and increase system resiliency. 

  • Retry services a random number of times: When applications retry service requests after an unsuccessful attempt, they typically follow a predetermined pattern. With Istio, managers can dynamically change the number of retry attempts, another useful tool in distributed debugging and tracing.
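
To make these options more concrete, the sketch below builds an Istio VirtualService manifest as a Python dictionary and prints it as JSON, which could then be applied to a cluster. The field names (fault.abort.httpStatus, fault.delay.fixedDelay, percentage.value, retries.attempts) follow Istio's VirtualService schema; the function name, the "ratings" host, the default values, and the apiVersion string are assumptions that would need adjusting for a given environment and Istio release.

```python
import json


def fault_virtual_service(host, abort_percent=0.0, delay_percent=0.0,
                          fixed_delay="5s", retry_attempts=None):
    """Build a VirtualService dict with fault injection or a retry policy.

    Hypothetical helper for illustration; it only constructs the manifest
    and does not talk to a cluster.
    """
    for pct in (abort_percent, delay_percent):
        if not 0.0 <= pct <= 100.0:
            raise ValueError("percentages must be between 0 and 100")

    route = {"route": [{"destination": {"host": host}}]}

    fault = {}
    if abort_percent > 0:
        # Return 503 Service Unavailable for this share of requests.
        fault["abort"] = {"httpStatus": 503,
                          "percentage": {"value": abort_percent}}
    if delay_percent > 0:
        # Hold this share of requests for fixed_delay before forwarding.
        fault["delay"] = {"fixedDelay": fixed_delay,
                          "percentage": {"value": delay_percent}}

    if fault and retry_attempts is not None:
        # Assumption based on Istio's documentation: retries are not applied
        # on a route that has faults enabled, so the sketch keeps them apart.
        raise ValueError("configure faults and retries on separate routes")
    if fault:
        route["fault"] = fault
    if retry_attempts is not None:
        route["retries"] = {"attempts": retry_attempts,
                            "perTryTimeout": "2s",
                            "retryOn": "5xx"}

    return {
        "apiVersion": "networking.istio.io/v1alpha3",  # may differ by release
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-chaos"},
        "spec": {"hosts": [host], "http": [route]},
    }


if __name__ == "__main__":
    # Abort 10% of requests to a hypothetical "ratings" service with a 503.
    print(json.dumps(fault_virtual_service("ratings", abort_percent=10.0),
                     indent=2))
```

Starting with a small percentage and raising it gradually keeps the impact of an experiment limited while the team watches traces and error rates.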

As chaos engineering becomes more standard practice in DevOps environments, other general-purpose tools are likely to emerge or become part of CI/CD platforms. Security teams can take advantage of these as they see fit.


Disclosure: Rain Capital has an investment in Tetrate, which provides products and services based on Istio and Envoy. As of this writing, Rain Capital has no investments in the other companies mentioned in this post.

Jamie Lewis