One of the key challenges of building a robust, scalable, enterprise-class storage system is validating it under duress and with failing system components. This includes, but is not limited to: failed networks, failed or failing disks, arbitrary delays in the network or IO path, network partitions, and unresponsive systems.
The Apache Ozone fault injection framework is designed to validate Ozone under heavy stress and with failed or failing system components. Specifically, it enables injecting different types of failures into an Ozone cluster and validating system behavior in the presence of those failures. The framework is generic and extensible enough to allow new classes of failures to be added over time, along with a suite of automated test cases that validate system behavior against each newly defined failure class.
Although we have designed this fault injection framework for Ozone, it is generic enough to be used for validating any other distributed and scalable system.
This framework is designed to simulate failure of a variety of system components, specifically:
Randomly injecting failures in the hope of catching race conditions and data corruption is not always fruitful, and analyzing the results of random injection requires manual work to rule out false alarms. We considered the Namazu framework but decided not to use it for similar reasons. We also noticed that the timing of failure injection plays a critical role in the outcome. While existing fault injection frameworks offer random error injection, they lack the ability to control the timing and placement of an injected error relative to test case execution.
Given these shortcomings, we developed an Ozone fault injection framework that supports random error injection as well as precise, targeted error injection. This allows us to create targeted test cases that inject and control failures within well-defined time windows while Ozone is serving a given request. Such a test case should also have a well-defined outcome that the test can validate without manual analysis. The framework does not require any code changes to the system under test; it simulates failures directly at the file system layer or the network layer.
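To make the idea of a targeted failure window concrete, here is a minimal Python sketch. It is not the framework's actual API: `FaultController`, `fault_window`, and `read_chunk` are hypothetical names invented for illustration, and the "system under test" is a toy function. The point is only that the fault is active for a bounded window and that the test asserts a deterministic outcome before, during, and after that window.

```python
from contextlib import contextmanager


class FaultController:
    """Hypothetical stand-in for a client of the fault injection service."""

    def __init__(self):
        self.active = {}  # path -> name of the fault currently injected

    def inject(self, path, fault):
        self.active[path] = fault

    def reset(self, path):
        self.active.pop(path, None)


@contextmanager
def fault_window(controller, path, fault):
    """Keep `fault` active on `path` only while the with-block runs."""
    controller.inject(path, fault)
    try:
        yield
    finally:
        # Always clear the fault, even if the test body raises.
        controller.reset(path)


def read_chunk(controller, path):
    """Toy system under test: fails while an EIO fault is active on `path`."""
    if controller.active.get(path) == "EIO":
        raise OSError("injected I/O error")
    return b"chunk-data"


def targeted_read_test():
    """Record the outcome before, during, and after the fault window."""
    controller = FaultController()
    path = "/data/chunks/1"
    outcomes = [read_chunk(controller, path)]  # before the window
    with fault_window(controller, path, "EIO"):
        try:
            read_chunk(controller, path)
            outcomes.append("unexpected success")
        except OSError:
            outcomes.append("failed as expected")
    outcomes.append(read_chunk(controller, path))  # after the window
    return outcomes
```

Because the window is explicit, the expected result of each step is known in advance, which is what removes the need for manual analysis.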
The fault injection service runs on every node that hosts one or more system components under test. It provides REST APIs to inject and reset various types of failures, and it has one or more plugin extensions, each of which injects a different type of failure.
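The plugin structure described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the service's real interface: the class names, the `disk-delay` fault type, the `delay_ms` parameter, and the mapping of POST to inject and DELETE to reset are all hypothetical. The HTTP layer is omitted; `handle` only mimics how a REST handler might dispatch to a plugin.

```python
class FaultPlugin:
    """Hypothetical base class for one type of injectable failure."""

    name = "base"

    def inject(self, target, **params):
        raise NotImplementedError

    def reset(self, target):
        raise NotImplementedError


class DiskDelayPlugin(FaultPlugin):
    """Records a per-path delay; a real plugin would apply it to the IO path."""

    name = "disk-delay"

    def __init__(self):
        self.delays = {}  # target path -> delay in milliseconds

    def inject(self, target, delay_ms=0):
        self.delays[target] = delay_ms

    def reset(self, target):
        self.delays.pop(target, None)


class FaultInjectionService:
    """Dispatches inject/reset requests to the plugin for a fault type."""

    def __init__(self):
        self.plugins = {}

    def register(self, plugin):
        self.plugins[plugin.name] = plugin

    def handle(self, method, fault_type, target, params=None):
        """Return (status_code, body) the way a REST handler might."""
        plugin = self.plugins.get(fault_type)
        if plugin is None:
            return 404, {"error": f"unknown fault type: {fault_type}"}
        if method == "POST":  # assumed verb for "inject"
            plugin.inject(target, **(params or {}))
            return 200, {"status": "injected", "target": target}
        if method == "DELETE":  # assumed verb for "reset"
            plugin.reset(target)
            return 200, {"status": "reset", "target": target}
        return 405, {"error": f"unsupported method: {method}"}
```

Adding a new class of failure then amounts to registering another `FaultPlugin` subclass, which is one way to get the extensibility the service aims for.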
One key part of the fault injection service is a very lightweight passthrough FUSE file system that Ozone uses to store all its persistent data and metadata. The service provides APIs to control how and when this file system misbehaves, including injecting delays as well as failures on the read/write access path. The APIs are generic enough that we can target both Ozone data and metadata for failures, corruption, or delays.
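The passthrough idea can be illustrated without FUSE at all. The sketch below is a plain Python wrapper around a real directory, not the framework's file system: `FaultyPassthrough`, its rule format, and the `demo` helper are hypothetical. Every operation is forwarded to the underlying directory unless a rule matching the operation and path prefix says to sleep first or to fail with a given errno.

```python
import errno
import os
import tempfile
import time


class FaultyPassthrough:
    """Forwards reads and writes to `root`, applying injected faults first."""

    def __init__(self, root):
        self.root = root
        # (operation, path prefix) -> ("delay", seconds) or ("error", errno)
        self.rules = {}

    def set_fault(self, op, prefix, kind, value):
        self.rules[(op, prefix)] = (kind, value)

    def clear_faults(self):
        self.rules.clear()

    def _apply(self, op, rel_path):
        for (rule_op, prefix), (kind, value) in self.rules.items():
            if rule_op == op and rel_path.startswith(prefix):
                if kind == "delay":
                    time.sleep(value)
                elif kind == "error":
                    raise OSError(value, os.strerror(value))

    def write(self, rel_path, data):
        self._apply("write", rel_path)
        full = os.path.join(self.root, rel_path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def read(self, rel_path):
        self._apply("read", rel_path)
        with open(os.path.join(self.root, rel_path), "rb") as f:
            return f.read()


def demo():
    """Write a file, fail reads under chunks/ with EIO, then clear the fault."""
    with tempfile.TemporaryDirectory() as root:
        fs = FaultyPassthrough(root)
        fs.write("chunks/1.block", b"payload")
        fs.set_fault("read", "chunks/", "error", errno.EIO)
        try:
            fs.read("chunks/1.block")
            failed_with_eio = False
        except OSError as exc:
            failed_with_eio = exc.errno == errno.EIO
        fs.clear_faults()
        return failed_with_eio, fs.read("chunks/1.block")
```

Matching on a path prefix is what would let a test target data directories and metadata directories separately; the application itself sees only ordinary file system errors or slow calls.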
Another key part of the service is the ability to filter network packets and return failures or introduce delays in the network. This filter can also be used to create network partitions. This is done with a custom netfilter module that can use libnetfilter_queue.
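The filtering logic can be sketched independently of the kernel plumbing. The following Python model is an assumption-laden illustration and does not use libnetfilter_queue: `Rule`, `PacketFilter`, and the host names in the usage note are hypothetical. It shows how first-match rules can yield a drop, a delay, or a two-way partition between groups of hosts.

```python
from dataclasses import dataclass

ACCEPT, DROP = "ACCEPT", "DROP"


@dataclass(frozen=True)
class Rule:
    src: str            # source host, or "*" for any
    dst: str            # destination host, or "*" for any
    action: str         # ACCEPT or DROP
    delay_ms: int = 0   # delay to apply before an accepted packet is released


class PacketFilter:
    """First matching rule wins; unmatched packets are accepted undelayed."""

    def __init__(self):
        self.rules = []

    def add_rule(self, rule):
        self.rules.append(rule)

    def partition(self, group_a, group_b):
        """Drop traffic in both directions between the two host groups."""
        for a in group_a:
            for b in group_b:
                self.rules.append(Rule(a, b, DROP))
                self.rules.append(Rule(b, a, DROP))

    def verdict(self, src, dst):
        """Return (action, delay_ms) for a packet from src to dst."""
        for rule in self.rules:
            if rule.src in ("*", src) and rule.dst in ("*", dst):
                return rule.action, rule.delay_ms
        return ACCEPT, 0
```

For example, calling `partition(["om1"], ["dn1", "dn2"])` isolates a hypothetical host `om1` from two others, while `add_rule(Rule("dn1", "dn2", ACCEPT, delay_ms=200))` slows traffic between the remaining pair without dropping it.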
Initially, we plan to use this framework to inject failures in system components such as the file system or the network. Over time, we can do more intrusive white-box testing by enabling and disabling various join points and delay points within the Ozone code. We could then provide APIs to enable or disable a crash or delay behavior for a specific action.
The figure below depicts the overall setup required to test Ozone.
| Issue | Summary |
|---|---|
| HDDS-3064 | Get Key is hung when READ delay is injected in chunk file path |
| HDDS-3136 | Retry timeout is large while writing key |
| HDDS-3163 | Write Key is hung when write delay is injected in datanode dir |
| HDDS-3214 | Unhealthy datanodes repeatedly participate in pipeline creation |
Kraken is a unified fault injection framework developed by Cloudera for resiliency testing. It provides a framework that is agnostic to programming language, cloud, and deployment type for wrapping existing fault injection implementations, and it gives users a simple, unified interface to consume them. It is a hosted fault injection framework, which reduces setup and installation effort to nearly zero.
Ozone fault injection is now integrated with the Kraken framework to inject errors at the system level and enhance its capability to validate system robustness.
Kraken users do not need to perform any complicated setup or installation. A single command or an HTTP POST request can set up the fault agent on the machines where the system under test is running. After this step, users can immediately start resiliency testing through the GUI, or automate tests using the simple APIs of the Kraken client SDK. With the Kraken integration, the faults of the Ozone fault injection framework can be consumed through those same APIs. Kraken also provides many built-in fault implementations covering targeted resources: CPU, memory, disk, network, and process.
With Kraken’s BYOF (Bring Your Own Fault) principle, any other fault injection implementation can be integrated with Kraken easily. And with Kraken’s unified interface (using SDKs auto-generated from the Swagger JSON of the fault services), all of these faults can be used in resiliency tests with simple and uniform code.
The nodes of the system or application under test do not need to open an SSH port for communication from test automation or fault injection triggers. The Kraken fault agent installed on the SUT nodes registers itself with Kraken’s nodes service and listens for incoming messages on a RabbitMQ queue.
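The register-then-listen pattern can be sketched in a few lines of Python. This is only a model of the flow, with a standard-library `queue.Queue` standing in for RabbitMQ; `NodeRegistry`, `FaultAgent`, and the message fields `action` and `fault` are hypothetical and do not reflect Kraken's real message format. The key property it shows is that the agent pulls work from a queue, so nothing needs to connect inbound to the node.

```python
import queue


class NodeRegistry:
    """Hypothetical stand-in for the nodes service plus the message broker."""

    def __init__(self):
        self.queues = {}  # hostname -> queue of pending fault messages

    def register(self, hostname):
        """Create and return the queue the agent on `hostname` will consume."""
        q = queue.Queue()
        self.queues[hostname] = q
        return q

    def send(self, hostname, message):
        self.queues[hostname].put(message)


class FaultAgent:
    """Registers itself on startup, then consumes messages from its queue."""

    def __init__(self, hostname, registry):
        self.hostname = hostname
        self.inbox = registry.register(hostname)
        self.applied = []  # (action, fault) pairs, in the order processed

    def drain(self):
        """Process every message currently queued for this agent."""
        while True:
            try:
                msg = self.inbox.get_nowait()
            except queue.Empty:
                return
            # A real agent would invoke a fault implementation here.
            self.applied.append((msg["action"], msg["fault"]))
```

A test driver would call `registry.send("dn1", {"action": "inject", "fault": "disk-delay"})` and never contact the node directly.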
The client layer of Kraken currently consists of the auto-generated SDKs, the Kraken client SDK (a wrapper library over the auto-generated SDKs), and the GUI. Kraken’s roadmap includes building a Random Disruptor (similar to ChaosMonkey, but with Kraken’s advantages) in this layer using its built-in fault implementations. The Random Disruptor would be a client, configured with a fault injection randomness algorithm and policies, that consumes Kraken’s fault injection services.
The faults Kraken currently supports, including the disk failures from the Apache Ozone fault injection framework, are: