Unverified commit e9b38175 authored by Yue Yang, committed by GitHub

chore: remove legacy website (#1497)

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Showing with 0 additions and 1733 deletions
root = true
[*]
indent_style = space
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
# Dependencies
/node_modules
# Production
/build
# Generated files
.docusaurus
.cache-loader
# Misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local
npm-debug.log*
yarn-debug.log*
yarn-error.log*
{
"semi": false,
"singleQuote": true,
"printWidth": 120
}
<!-- markdownlint-disable-file MD033 -->
<!-- markdownlint-disable-file MD041 -->
<p align="center">
<img src="../static/logo.svg" width="256" alt="Chaos Mesh Logo" />
</p>
<h1 align="center">Website</h1>
<p align="center">
Built using <a href="https://v2.docusaurus.io/" target="_blank">Docusaurus 2</a>, a modern static website generator.
</p>
## How to develop
```sh
yarn # install deps
yarn start
```
This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
## Build
```sh
yarn build
```
This command generates static content into the `build` directory, which can be served by any static content hosting service.
## New version
```sh
yarn docusaurus docs:version x.x.x
```
The docs are split into two parts: the **latest docs (in `docs/`)** and the **versioned docs (in `versioned_docs/`)**. When a version is released, the current latest docs are copied into `versioned_docs/` (by running the command above).
## How to contribute
Most of the time you only need to modify the content in `docs/`, but if you want a change to also apply to already released versions, update the same files in the `versioned_docs/` directory as well.
## License
Same as Chaos Mesh
module.exports = {
presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
};
---
slug: /chaos_mesh_your_chaos_engineering_solution
title: Chaos Mesh - Your Chaos Engineering Solution for System Resiliency on Kubernetes
author: Cwen Yin
author_title: Maintainer of Chaos Mesh
author_url: https://github.com/cwen0
author_image_url: https://avatars1.githubusercontent.com/u/22956341?v=4
image: /img/chaos-engineering.png
tags: [Chaos Mesh, Chaos Engineering, Kubernetes]
---
![Chaos Engineering](/img/chaos-engineering.png)
## Why Chaos Mesh?
In the world of distributed computing, faults can happen to your clusters unpredictably, any time, anywhere. Traditionally we have unit tests and integration tests that guarantee a system is production ready, but these cover just the tip of the iceberg as clusters scale, complexities mount, and data volumes grow to petabyte levels. To better identify system vulnerabilities and improve resilience, Netflix invented [Chaos Monkey](https://netflix.github.io/chaosmonkey/), which injects various types of faults into the infrastructure and business systems. This is how Chaos Engineering originated.
<!--truncate-->
At [PingCAP](https://chaos-mesh.org/), we are facing the same problem while building [TiDB](https://github.com/pingcap/tidb), an open source distributed NewSQL database. Being fault tolerant, or resilient, matters especially to us, because the most important asset for any database user, the data itself, is at stake. To ensure resilience, we started [practicing Chaos Engineering](https://pingcap.com/blog/chaos-practice-in-tidb/) internally in our testing framework from a very early stage. However, as TiDB grew, so did the testing requirements. We realized that we needed a universal chaos testing platform, not just for TiDB, but also for other distributed systems.
Therefore, we present to you Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates chaos experiments on Kubernetes environments. It's an open source project available at [https://github.com/chaos-mesh/chaos-mesh](https://github.com/chaos-mesh/chaos-mesh).
In the following sections, I will share with you what Chaos Mesh is, how we design and implement it, and finally I will show you how you can use it in your environment.
## What can Chaos Mesh do?
Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.
Here is an example of how we use Chaos Mesh to locate a TiDB system bug. In this example, we simulate Pod downtime with our distributed storage engine ([TiKV](https://pingcap.com/docs/stable/architecture/#tikv-server)) and observe changes in queries per second (QPS). Normally, if one TiKV node is down, the QPS may experience a transient jitter before it returns to its pre-failure level. This is how we guarantee high availability.
![Chaos Mesh discovers downtime recovery exceptions in TiKV](/img/chaos-mesh-discovers-downtime-recovery-exceptions-in-tikv.png)
<div class="caption-center"> Chaos Mesh discovers downtime recovery exceptions in TiKV </div>
As you can see from the dashboard:
* During the first two downtimes, the QPS returns to normal after about 1 minute.
* After the third downtime, however, the QPS takes much longer to recover—about 9 minutes. Such a long downtime is unexpected, and it would definitely impact online services.
After some diagnosis, we found the TiDB cluster version under test (V3.0.1) had some tricky issues when handling TiKV downtimes. We resolved these issues in later versions.
But Chaos Mesh can do a lot more than just simulate downtime. It also includes these fault injection methods:
- **pod-kill:** Simulates Kubernetes Pods being killed
- **pod-failure:** Simulates Kubernetes Pods being continuously unavailable
- **network-delay:** Simulates network delay
- **network-loss:** Simulates network packet loss
- **network-duplication:** Simulates network packet duplication
- **network-corrupt:** Simulates network packet corruption
- **network-partition:** Simulates network partition
- **I/O delay:** Simulates file system I/O delay
- **I/O errno:** Simulates file system I/O errors
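To give a feel for how these faults are declared, here is a minimal sketch of a `network-delay` experiment. The selector label and timing values are illustrative, not taken from a real deployment:

```yml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-sketch
  namespace: chaos-testing
spec:
  action: delay # one of the network fault actions listed above
  mode: one # act on one Pod that matches the selector
  selector:
    labelSelectors:
      "app": "my-app" # illustrative label
  delay:
    latency: "50ms"
    correlation: "100"
    jitter: "0ms"
  duration: "30s"
  scheduler:
    cron: "@every 2m"
```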
## Design principles
We designed Chaos Mesh with three principles in mind: easy to use, scalable, and designed for Kubernetes.
### Easy to use
To be easy to use, Chaos Mesh must:
* Require no special dependencies, so that it can be deployed directly on Kubernetes clusters, including [Minikube](https://github.com/kubernetes/minikube).
* Require no modification to the deployment logic of the system under test (SUT), so that chaos experiments can be performed in a production environment.
* Easily orchestrate fault injection behaviors in chaos experiments, and easily view experiment status and results. You should also be able to quickly roll back injected failures.
* Hide underlying implementation details so that users can focus on orchestrating the chaos experiments.
### Scalable
Chaos Mesh should be scalable, so that we can "plug" new requirements into it conveniently without reinventing the wheel. Specifically, Chaos Mesh must:
* Leverage existing implementations so that fault injection methods can be easily scaled.
* Easily integrate with other testing frameworks.
### Designed for Kubernetes
In the container world, Kubernetes is the absolute leader. Its adoption has grown far beyond everybody's expectations, and it has won the war of container orchestration. In essence, Kubernetes is an operating system for the cloud.
TiDB is a cloud-native distributed database. Our internal automated testing platform was built on Kubernetes from the beginning. We have hundreds of TiDB clusters running on Kubernetes every day for various experiments, including extensive chaos testing to simulate all kinds of failures or issues in a production environment. To support these chaos experiments, the combination of chaos and Kubernetes became a natural choice and principle for our implementation.
## CustomResourceDefinitions design
Chaos Mesh uses [CustomResourceDefinitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) (CRD) to define chaos objects. In the Kubernetes realm, CRD is a mature solution for implementing custom resources, with abundant implementation cases and toolsets available. Using CRD makes Chaos Mesh naturally integrate with the Kubernetes ecosystem.
Instead of defining all types of fault injections in a unified CRD object, we allow flexible and separate CRD objects for different types of fault injection. If we add a fault injection method that conforms to an existing CRD object, we scale directly based on this object; if it is a completely new method, we create a new CRD object for it. With this design, chaos object definitions and the logic implementation are extracted from the top level, which makes the code structure clearer. This approach also reduces the degree of coupling and the probability of errors. In addition, Kubernetes' [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) is a great wrapper for implementing controllers. This saves us a lot of time because we don't have to repeatedly implement the same set of controllers for each CRD project.
Chaos Mesh implements the PodChaos, NetworkChaos, and IOChaos objects. The names clearly identify the corresponding fault injection types.
For example, Pod crashing is a very common problem in a Kubernetes environment. Many native resource objects automatically handle such errors with typical actions such as creating a new Pod. But can our application really deal with such errors? What if the Pod won't start?
With well-defined actions such as `pod-kill`, PodChaos can help us pinpoint these kinds of issues more effectively. The PodChaos object uses the following code:
```yml
spec:
action: pod-kill
mode: one
selector:
namespaces:
- tidb-cluster-demo
labelSelectors:
"app.kubernetes.io/component": "tikv"
scheduler:
cron: "@every 2m"
```
This code does the following:
* The `action` attribute defines the specific error type to be injected. In this case, `pod-kill` kills Pods randomly.
* The `selector` attribute limits the scope of the chaos experiment. In this case, the scope is the TiKV Pods of the TiDB cluster in the `tidb-cluster-demo` namespace.
* The `scheduler` attribute defines the interval for each chaos fault action.
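The same pattern extends to the other attributes. For instance, assuming the `fixed-percent` mode and its `value` field behave as described in the Chaos Mesh documentation (this sketch is illustrative), the experiment could target half of the matching Pods instead of a single one:

```yml
spec:
  action: pod-kill
  mode: fixed-percent # assumption: selects a percentage of the matching Pods
  value: "50" # assumption: the percentage used by the fixed-percent mode
  selector:
    namespaces:
      - tidb-cluster-demo
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  scheduler:
    cron: "@every 2m"
```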
For more details on CRD objects such as NetworkChaos and IOChaos, see the [Chaos-mesh documentation](https://github.com/chaos-mesh/chaos-mesh).
## How does Chaos Mesh work?
With the CRD design settled, let's look at the big picture of how Chaos Mesh works. The following major components are involved:
- **controller-manager**
Acts as the platform's "brain." It manages the life cycle of CRD objects and schedules chaos experiments. It has object controllers for scheduling CRD object instances, and the [admission-webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) controller dynamically injects sidecar containers into Pods.
- **chaos-daemon**
Runs as a privileged DaemonSet that can operate on the node's network devices and cgroups; a minimal sketch of what such a DaemonSet looks like follows this component list.
- **sidecar**
Runs as a special type of container that is dynamically injected into the target Pod by the admission-webhooks. For example, the `chaosfs` sidecar container runs a fuse-daemon to hijack the I/O operation of the application container.
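As a rough illustration of what "privileged DaemonSet" means here, the sketch below shows the general shape of such a manifest. It is not the actual manifest from the Chaos Mesh Helm chart; the image reference, labels, and host permissions are illustrative:

```yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: chaos-daemon # illustrative; the real manifest lives in the Helm chart
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: chaos-daemon
  template:
    metadata:
      labels:
        app.kubernetes.io/component: chaos-daemon
    spec:
      hostNetwork: true # needed to manipulate the node's network devices
      hostPID: true # needed to reach processes and cgroups on the node
      containers:
        - name: chaos-daemon
          image: pingcap/chaos-daemon:latest # illustrative image reference
          securityContext:
            privileged: true # grants the elevated access described above
```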
![Chaos Mesh workflow](/img/chaos-mesh-workflow.png)
<div class="caption-center"> Chaos Mesh workflow </div>
Here is how these components streamline a chaos experiment:
1. Using a YAML file or the Kubernetes client, the user creates or updates chaos objects on the Kubernetes API server.
2. Chaos Mesh uses the API server to watch the chaos objects and manages the lifecycle of chaos experiments through creating, updating, or deleting events. In this process, controller-manager, chaos-daemon, and sidecar containers work together to inject errors.
3. When admission-webhooks receives a Pod creation request, it dynamically modifies the Pod object to be created; for example, it injects the sidecar container into the Pod.
## Running chaos
The above sections introduce how we design Chaos Mesh and how it works. Now let's get down to business and show you how to use Chaos Mesh. Note that the chaos testing time may vary depending on the complexity of the application to be tested and the test scheduling rules defined in the CRD.
### Preparing the environment
Chaos Mesh runs on Kubernetes v1.12 or later. Helm, a Kubernetes package management tool, deploys and manages Chaos Mesh. Before you run Chaos Mesh, make sure that Helm is properly installed in the Kubernetes cluster. To set up the environment, do the following:
1. Make sure you have a Kubernetes cluster. If you do, skip to step 2; otherwise, start one locally using the script provided by Chaos Mesh:
```bash
# install kind
curl -Lo ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.6.1/kind-$(uname)-amd64
chmod +x ./kind
mv ./kind /some-dir-in-your-PATH/kind
# get script
git clone https://github.com/chaos-mesh/chaos-mesh
cd chaos-mesh
# start cluster
hack/kind-cluster-build.sh
```
**Note:** Starting Kubernetes clusters locally affects network-related fault injections.
2. If the Kubernetes cluster is ready, use [Helm](https://helm.sh/) and [Kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) to deploy Chaos Mesh:
```bash
git clone https://github.com/chaos-mesh/chaos-mesh.git
cd chaos-mesh
# create CRD resource
kubectl apply -f manifests/
# install chaos-mesh
helm install helm/chaos-mesh --name=chaos-mesh --namespace=chaos-testing
```
Wait until all components are installed, and check the installation status using:
```bash
# check chaos-mesh status
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh
```
If the installation is successful, you can see all pods up and running. Now, time to play.
You can run Chaos Mesh using a YAML definition or a Kubernetes API.
### Running chaos using a YAML file
You can define your own chaos experiments through the YAML file method, which provides a fast, convenient way to conduct chaos experiments after you deploy the application. To run chaos using a YAML file, follow the steps below:
**Note:** For illustration purposes, we use TiDB as our system under test. You can use a target system of your choice, and modify the YAML file accordingly.
1. Deploy a TiDB cluster named `chaos-demo-1`. You can use [TiDB Operator](https://github.com/pingcap/tidb-operator) to deploy TiDB.
2. Create the YAML file named `kill-tikv.yaml` and add the following content:
```yml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-kill-chaos-demo
namespace: chaos-testing
spec:
action: pod-kill
mode: one
selector:
namespaces:
- chaos-demo-1
labelSelectors:
"app.kubernetes.io/component": "tikv"
scheduler:
cron: "@every 1m"
```
3. Save the file.
4. To start chaos, `kubectl apply -f kill-tikv.yaml`.
The following chaos experiment simulates the TiKV Pods being frequently killed in the `chaos-demo-1` cluster:
![Chaos experiment running](/img/chaos-experiment-running.gif)
<div class="caption-center"> Chaos experiment running </div>
We use a sysbench program to monitor the real-time QPS changes in the TiDB cluster. When errors are injected into the cluster, the QPS shows a drastic jitter, which means a specific TiKV Pod has been deleted and Kubernetes has re-created a new TiKV Pod.
For more YAML file examples, see <https://github.com/chaos-mesh/chaos-mesh/tree/master/examples>.
### Running chaos using the Kubernetes API
Chaos Mesh uses CRD to define chaos objects, so you can manipulate CRD objects directly through the Kubernetes API. This way, it is very convenient to apply Chaos Mesh to your own applications with customized test scenarios and automated chaos experiments.
In the [test-infra](https://github.com/pingcap/tipocket/tree/35206e8483b66f9728b7b14823a10b3e4114e0e3/test-infra) project, we simulate potential errors in [etcd](https://github.com/pingcap/tipocket/blob/35206e8483b66f9728b7b14823a10b3e4114e0e3/test-infra/tests/etcd/nemesis_test.go) clusters on Kubernetes, including nodes restarting, network failure, and file system failure.
The following is a Chaos Mesh sample script using the Kubernetes API:
```go
import (
	"context"

	chaosv1alpha1 "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	...
	// Define a NetworkChaos object; its Spec mirrors the fields of the YAML definition.
	delay := &chaosv1alpha1.NetworkChaos{
		Spec: chaosv1alpha1.NetworkChaosSpec{...},
	}
	// Create a client for the Kubernetes API server (conf and scheme come from the elided setup)
	// and manage the chaos object through it.
	k8sClient, _ := client.New(conf, client.Options{Scheme: scheme.Scheme})
	k8sClient.Create(context.TODO(), delay)
	k8sClient.Delete(context.TODO(), delay)
}
```
## What does the future hold?
In this article, we introduced you to Chaos Mesh, our open source cloud-native Chaos Engineering platform. There are still many pieces in progress, with more details to unveil regarding the design, use cases, and development. Stay tuned.
Open sourcing is just a starting point. In addition to the infrastructure-level chaos experiments introduced in previous sections, we are in the process of supporting a wider range of finer-grained fault types, such as:
* Injecting errors at the system call and kernel levels with the assistance of eBPF and other tools
* Injecting specific error types into the application function and statement levels by integrating [failpoint](https://github.com/pingcap/failpoint), which will cover scenarios that are otherwise impossible with conventional injection methods
Moving forward, we will continuously improve the Chaos Mesh Dashboard, so that users can easily see if and how their online businesses are impacted by fault injections. In addition, our roadmap includes an easy-to-use fault orchestration interface. We're planning other cool features, such as Chaos Mesh Verifier and Chaos Mesh Cloud.
If any of these sound interesting to you, join us in building a world class Chaos Engineering platform. May our applications dance in chaos on Kubernetes!
If you find a bug or think something is missing, feel free to file an [issue](https://github.com/chaos-mesh/chaos-mesh/issues), open a PR, or join us on the #sig-chaos-mesh channel in the [TiDB Community](https://chaos-mesh.org/tidbslack) slack workspace.
GitHub: [https://github.com/chaos-mesh/chaos-mesh](https://github.com/chaos-mesh/chaos-mesh)
---
slug: /run_your_first_chaos_experiment
title: Run Your First Chaos Experiment in 10 Minutes
author: Cwen Yin
author_title: Maintainer of Chaos Mesh
author_url: https://github.com/cwen0
author_image_url: https://avatars1.githubusercontent.com/u/22956341?v=4
image: /img/run-first-chaos-experiment-in-ten-minutes.jpg
tags: [Chaos Mesh, Chaos Engineering, Kubernetes]
---
![Run your first chaos experiment in 10 minutes](/img/run-first-chaos-experiment-in-ten-minutes.jpg)
Chaos Engineering is a way to test a production software system's robustness by simulating unusual or disruptive conditions. For many people, however, the transition from learning Chaos Engineering to practicing it on their own systems is daunting. It sounds like one of those big ideas that require a fully-equipped team to plan ahead. Well, it doesn't have to be. To get started with chaos experimenting, you may be just one suitable platform away.
<!--truncate-->
[Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh) is an **easy-to-use**, open-source, cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments. This 10-minute tutorial will help you quickly get started with Chaos Engineering and run your first chaos experiment with Chaos Mesh.
For more information about Chaos Mesh, refer to our [previous article](https://pingcap.com/blog/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes/) or the [chaos-mesh project](https://github.com/chaos-mesh/chaos-mesh) on GitHub.
## A preview of our little experiment
Chaos experiments are similar to experiments we do in a science class. It's perfectly fine to simulate turbulent situations in a controlled environment. In our case here, we will be simulating network chaos on a small web application called [web-show](https://github.com/chaos-mesh/web-show). To visualize the chaos effect, web-show records the latency from its Pod to the kube-controller Pod (under the `kube-system` namespace) every 10 seconds.
The following clip shows the process of installing Chaos Mesh, deploying web-show, and creating the chaos experiment within a few commands:
![The whole process of the chaos experiment](/img/whole-process-of-chaos-experiment.gif)
<div class="caption-center"> The whole process of the chaos experiment </div>
Now it's your turn! It's time to get your hands dirty.
## Let's get started!
For our simple experiment, we use Kubernetes in Docker ([Kind](https://kind.sigs.k8s.io/)) for Kubernetes development. Feel free to use [Minikube](https://minikube.sigs.k8s.io/) or any existing Kubernetes cluster to follow along.
### Prepare the environment
Before moving forward, make sure you have [Git](https://git-scm.com/) and [Docker](https://www.docker.com/) installed on your local computer, with Docker up and running. For macOS, it's recommended to allocate at least 6 CPU cores to Docker. For details, see [Docker configuration for Mac](https://docs.docker.com/docker-for-mac/#advanced).
1. Get Chaos Mesh:
```bash
git clone https://github.com/chaos-mesh/chaos-mesh.git
cd chaos-mesh/
```
2. Install Chaos Mesh with the `install.sh` script:
```bash
./install.sh --local kind
```
`install.sh` is an automated shell script that checks your environment, installs Kind, launches a Kubernetes cluster locally, and deploys Chaos Mesh. To see the detailed description of `install.sh`, run it with the `--help` option.
> **Note:**
>
> If your local computer cannot pull images from `docker.io` or `gcr.io`, use the local gcr.io mirror and execute `./install.sh --local kind --docker-mirror` instead.
3. Set the system environment variable:
```bash
source ~/.bash_profile
```
> **Note:**
>
> * Depending on your network, these steps might take a few minutes.
> * If you see an error message like this:
>
> ```bash
> ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: command "docker exec --privileged kind-control-plane cat /kind/version" failed with error: exit status 1
> ```
>
> increase the available resources for Docker on your local computer and execute the following command:
>
> ```bash
> ./install.sh --local kind --force-local-kube
> ```
When the process completes, you will see a message indicating that Chaos Mesh is successfully installed.
### Deploy the application
The next step is to deploy the application for testing. In our case here, we choose web-show because it allows us to directly observe the effect of network chaos. You can also deploy your own application for testing.
1. Deploy web-show with the `deploy.sh` script:
```bash
# Make sure you are in the Chaos Mesh directory
cd examples/web-show &&
./deploy.sh
```
> **Note:**
>
> If your local computer cannot pull images from `docker.io`, use the local `gcr.io` mirror and execute `./deploy.sh --docker-mirror` instead.
2. Access the web-show application. From your web browser, go to `http://localhost:8081`.
### Create the chaos experiment
Now that everything is ready, it's time to run your chaos experiment!
Chaos Mesh uses [CustomResourceDefinitions](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/) (CRD) to define chaos experiments. CRD objects are designed separately based on different experiment scenarios, which greatly simplifies the definition of CRD objects. Currently, CRD objects that have been implemented in Chaos Mesh include PodChaos, NetworkChaos, IOChaos, TimeChaos, and KernelChaos. Later, we'll support more fault injection types.
In this experiment, we are using [NetworkChaos](https://github.com/chaos-mesh/chaos-mesh/blob/master/examples/web-show/network-delay.yaml) for the chaos experiment. The NetworkChaos configuration file, written in YAML, is shown below:
```yml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay-example
spec:
action: delay
mode: one
selector:
namespaces:
- default
labelSelectors:
"app": "web-show"
delay:
latency: "10ms"
correlation: "100"
jitter: "0ms"
duration: "30s"
scheduler:
cron: "@every 60s"
```
For detailed descriptions of NetworkChaos actions, see [Chaos Mesh wiki](https://github.com/chaos-mesh/chaos-mesh/wiki/Network-Chaos). Here, we just rephrase the configuration as:
* target: `web-show`
* mission: inject a `10ms` network delay every `60s`
* attack duration: `30s` each time
To start NetworkChaos, do the following:
1. Run `network-delay.yaml`:
```bash
# Make sure you are in the chaos-mesh/examples/web-show directory
kubectl apply -f network-delay.yaml
```
2. Access the web-show application. In your web browser, go to `http://localhost:8081`.
From the line graph, you can tell that there is a 10 ms network delay every 60 seconds.
![Using Chaos Mesh to insert delays in web-show](/img/using-chaos-mesh-to-insert-delays-in-web-show.png)
<div class="caption-center"> Using Chaos Mesh to insert delays in web-show </div>
Congratulations! You just stirred up a little bit of chaos. If you are intrigued and want to try out more chaos experiments with Chaos Mesh, check out [examples/web-show](https://github.com/chaos-mesh/chaos-mesh/tree/master/examples/web-show).
### Delete the chaos experiment
Once you're finished testing, terminate the chaos experiment.
1. Delete `network-delay.yaml`:
```bash
# Make sure you are in the chaos-mesh/examples/web-show directory
kubectl delete -f network-delay.yaml
```
2. Access the web-show application. From your web browser, go to `http://localhost:8081`.
From the line graph, you can see the network latency level is back to normal.
![Network latency level is back to normal](/img/network-latency-level-is-back-to-normal.png)
<div class="caption-center"> Network latency level is back to normal </div>
### Delete Kubernetes clusters
After you're done with the chaos experiment, execute the following command to delete the Kubernetes clusters:
```bash
kind delete cluster --name=kind
```
> **Note:**
>
> If you encounter the `kind: command not found` error, execute `source ~/.bash_profile` command first and then delete the Kubernetes clusters.
## Cool! What's next?
Congratulations on your first successful journey into Chaos Engineering. How does it feel? Chaos Engineering is easy, right? But perhaps Chaos Mesh still isn't that easy to use: command-line operation is inconvenient, writing YAML files manually is a bit tedious, and checking the experiment results is somewhat clumsy. Don't worry, Chaos Dashboard is on its way! Running chaos experiments on the web sure does sound exciting! If you'd like to help us build testing standards for cloud platforms or make Chaos Mesh better, we'd love to hear from you!
If you find a bug or think something is missing, feel free to file an issue, open a pull request (PR), or join us on the #project-chaos-mesh channel in the [CNCF slack workspace](https://join.slack.com/t/cloud-native/shared_invite/zt-fyy3b8up-qHeDNVqbz1j8HDY6g1cY4w).
GitHub: [https://github.com/chaos-mesh/chaos-mesh](https://github.com/chaos-mesh/chaos-mesh)
---
slug: /simulating-clock-skew-in-k8s-without-affecting-other-containers-on-node
title: Simulating Clock Skew in K8s Without Affecting Other Containers on the Node
author: Cwen Yin
author_title: Maintainer of Chaos Mesh
author_url: https://github.com/cwen0
author_image_url: https://avatars1.githubusercontent.com/u/22956341?v=4
image: /img/clock-sync-chaos-engineering-k8s.jpg
tags: [Chaos Mesh, Chaos Engineering, Kubernetes, Distributed System]
---
![Clock synchronization in distributed system](/img/clock-sync-chaos-engineering-k8s.jpg)
[Chaos Mesh™](https://github.com/chaos-mesh/chaos-mesh), an easy-to-use, open-source, cloud-native chaos engineering platform for Kubernetes (K8s), has a new feature, TimeChaos, which simulates the [clock skew](https://en.wikipedia.org/wiki/Clock_skew#On_a_network) phenomenon. Usually, when we modify clocks in a container, we want a [minimized blast radius](https://learning.oreilly.com/library/view/chaos-engineering/9781491988459/ch07.html), and we don't want the change to affect the other containers on the node. In reality, however, implementing this can be harder than you think. How does Chaos Mesh solve this problem?
<!--truncate-->
In this post, I'll describe how we hacked through different approaches of clock skew and how TimeChaos in Chaos Mesh enables time to swing freely in containers.
## Simulating clock skew without affecting other containers on the node
Clock skew refers to the time difference between clocks on nodes within a network. It might cause reliability problems in a distributed system, and it's a concern for designers and developers of complex distributed systems. For example, in a distributed SQL database, it's vital to maintain a synchronized local clock across nodes to achieve a consistent global snapshot and ensure the ACID properties for transactions.
Currently, there are well-recognized [solutions to synchronize clocks](https://pingcap.com/blog/Time-in-Distributed-Systems/), but without proper testing, you can never be sure that your implementation is solid.
Then how can we test global snapshot consistency in a distributed system? The answer is obvious: we can simulate clock skew to test whether distributed systems can keep a consistent global snapshot under abnormal clock conditions. Some testing tools support simulating clock skew in containers, but they have an impact on physical nodes.
[TimeChaos](https://github.com/chaos-mesh/chaos-mesh/wiki/Time-Chaos) is a tool that **simulates clock skew in containers to test how it impacts your application without affecting the whole node**. This way, we can precisely identify the potential consequences of clock skew and take measures accordingly.
## Various approaches for simulating clock skew we've explored
Reviewing the existing choices, we know clearly that they cannot be applied to Chaos Mesh, which runs on Kubernetes. Two common ways of simulating clock skew--changing the node clock directly and using the Jepsen framework--change the time for all processes on the node. These are not acceptable solutions for us. In a Kubernetes container, if we inject a clock skew error that affects the entire node, other containers on the same node will be disturbed. Such a clumsy approach is not tolerable.
Then how are we supposed to tackle this problem? Well, the first thing that comes into our mind is finding solutions in the kernel using [Berkeley Packet Filter](https://en.wikipedia.org/wiki/Berkeley_Packet_Filter) (BPF).
### `LD_PRELOAD`
`LD_PRELOAD` is a Linux environment variable that lets you define which dynamic link library is loaded before the program execution.
This variable has two advantages:
* We can call our own functions without being aware of the source code.
* We can inject code into other programs to achieve specific purposes.
For applications written in languages that call the time functions in glibc, such as Rust and C, using `LD_PRELOAD` is enough to simulate clock skew. But things are trickier for Golang: Go obtains the time function address directly through the virtual Dynamic Shared Object ([vDSO](http://man7.org/linux/man-pages/man7/vdso.7.html)), a mechanism that speeds up system calls, so we can't simply use `LD_PRELOAD` to intercept the glibc interface. Therefore, `LD_PRELOAD` is not our solution.
### Use BPF to modify the return value of `clock_gettime` system call
We also tried to filter the task [process identification number](http://www.linfo.org/pid.html) (PID) with BPF. This way, we could simulate clock skew on a specified process and modify the return value of the `clock_gettime` system call.
This seemed like a good idea, but we also encountered a problem: in most cases, vDSO speeds up `clock_gettime`, so `clock_gettime` doesn't actually make a system call. This approach didn't work, either. Oops.
Thankfully, we determined that if the system kernel version is 4.18 or later, and if we use the [HPET](https://www.kernel.org/doc/html/latest/timers/hpet.html) clock, `clock_gettime()` gets time by making normal system calls instead of vDSO. We implemented [a version of clock skew](https://github.com/chaos-mesh/bpfki) using this approach, and it works fine for Rust and C. As for Golang, the program can get the time right, but if we perform `sleep` during the clock skew injection, the sleep operation is very likely to be blocked. Even after the injection is canceled, the system cannot recover. Thus, we have to give up this approach, too.
## TimeChaos, our final hack
From the previous section, we know that programs usually get the system time by calling `clock_gettime`. In our case, `clock_gettime` uses vDSO to speed up the calling process, so we cannot use `LD_PRELOAD` to hack the `clock_gettime` system calls.
We figured out the cause; then what's the solution? Start from vDSO. If we can redirect the `clock_gettime` function in vDSO to an address we define, we can solve the problem.
Easier said than done. To achieve this goal, we must tackle the following problems:
* Know the user-mode address used by vDSO
* Know vDSO's kernel-mode address, if we want to modify the `clock_gettime` function in vDSO from kernel mode
* Know how to modify vDSO data
First, we need to peek inside vDSO. We can see the vDSO memory address in `/proc/pid/maps`.
```
$ cat /proc/pid/maps
...
7ffe53143000-7ffe53145000 r-xp 00000000 00:00 0 [vdso]
```
The last line is the vDSO information. The permission of this memory region is `r-xp`: readable and executable, but not writable. That means user mode cannot modify this memory. We can use [ptrace](http://man7.org/linux/man-pages/man2/ptrace.2.html) to bypass this restriction.
Next, we use `gdb dump memory` to export the vDSO and use `objdump` to see what's inside. Here is what we get:
```
(gdb) dump memory vdso.so 0x00007ffe53143000 0x00007ffe53145000
$ objdump -T vdso.so
vdso.so: file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
ffffffffff700600 w DF .text 0000000000000545 LINUX_2.6 clock_gettime
```
We can see that the whole vDSO is like a `.so` file, so we can parse it using the executable and linkable format (ELF). With this information, a basic workflow for implementing TimeChaos starts to take shape:
![TimeChaos workflow](/img/timechaos-workflow.jpg)
<div class="caption-center"> TimeChaos workflow </div>
The chart above is the process of **TimeChaos**, an implementation of clock skew in Chaos Mesh.
1. Use ptrace to attach to the process with the specified PID, which stops the process.
2. Use ptrace to create a new mapping in the virtual address space of the target process, and use [`process_vm_writev`](https://linux.die.net/man/2/process_vm_writev) to write the `fake_clock_gettime` function we defined into that memory space.
3. Use `process_vm_writev` to write the specified parameters into `fake_clock_gettime`. These parameters are the time we would like to inject, such as two hours backward or two days forward.
4. Use ptrace to modify the `clock_gettime` function in vDSO and redirect to the `fake_clock_gettime` function.
5. Use ptrace to detach from the process.
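From the user's point of view, all of this machinery is driven by a TimeChaos definition. The following is a minimal sketch with illustrative values, using the same fields as the test configuration shown later in this post:

```yml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-sketch
spec:
  mode: one
  selector:
    labelSelectors:
      "app": "my-app" # illustrative label
  timeOffset:
    sec: -7200 # shift the clock two hours backward
  clockIds:
    - CLOCK_REALTIME
  duration: "30s"
  scheduler:
    cron: "@every 5m"
```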
If you are interested in the details, see the [Chaos Mesh GitHub repository](https://github.com/chaos-mesh/chaos-mesh/blob/master/pkg/time/time_linux.go).
## Simulating clock skew on a distributed SQL database
Statistics speak volumes. Here we're going to try TimeChaos on [TiDB](https://pingcap.com/docs/stable/overview/), an open source, [NewSQL](https://en.wikipedia.org/wiki/NewSQL), distributed SQL database that supports [Hybrid Transactional/Analytical Processing](https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing) (HTAP) workloads, to see if the chaos testing can really work.
TiDB uses a centralized service Timestamp Oracle (TSO) to obtain the globally consistent version number, and to ensure that the transaction version number increases monotonically. The TSO service is managed by the Placement Driver (PD) component. Therefore, we choose a random PD node and inject TimeChaos regularly, each time setting its clock 600 seconds backward. Let's see if TiDB can meet the challenge.
To better perform the testing, we use [bank](https://github.com/cwen0/bank) as the workload, which simulates the financial transfers in a banking system. It's often used to verify the correctness of database transactions.
This is our test configuration:
```yml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
name: time-skew-example
namespace: tidb-demo
spec:
mode: one
selector:
labelSelectors:
"app.kubernetes.io/component": "pd"
timeOffset:
sec: -600
clockIds:
- CLOCK_REALTIME
duration: "10s"
scheduler:
cron: "@every 1m"
```
During this test, Chaos Mesh injects TimeChaos into a chosen PD Pod every 1 minute, with each injection lasting 10 seconds. Within that duration, the time acquired by PD will have a 600-second offset from the actual time. For further details, see [Chaos Mesh Wiki](https://github.com/chaos-mesh/chaos-mesh/wiki/Time-Chaos).
Let's create a TimeChaos experiment using the `kubectl apply` command:
```bash
kubectl apply -f pd-time.yaml
```
Now, we can retrieve the PD log by the following command:
```bash
kubectl logs -n tidb-demo tidb-app-pd-0 | grep "system time jump backward"
```
Here's the log:
```
[2020/03/24 09:06:23.164 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585041383060109693]
[2020/03/24 09:16:32.260 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585041992160476622]
[2020/03/24 09:20:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042231960027622]
[2020/03/24 09:23:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042411960079655]
[2020/03/24 09:25:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042531963640321]
[2020/03/24 09:28:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042711960148191]
[2020/03/24 09:33:32.063 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043011960517655]
[2020/03/24 09:34:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043071959942937]
[2020/03/24 09:35:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043131978582964]
[2020/03/24 09:36:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043191960687755]
[2020/03/24 09:38:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043311959970737]
[2020/03/24 09:41:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043491959970502]
[2020/03/24 09:45:32.061 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043731961304629]
...
```
From the log above, we see that every now and then, PD detects that the system time rolls back. This means:
* TimeChaos successfully simulates clock skew.
* PD can deal with the clock skew situation.
That's encouraging. But does TimeChaos affect services other than PD? We can check it out in the Chaos Dashboard:
![Chaos Dashboard](/img/chaos-dashboard.jpg)
<div class="caption-center"> Chaos Dashboard </div>
It's clear from the monitor that TimeChaos was injected every 1 minute and each injection lasted 10 seconds. What's more, TiDB was not affected by the injection. The bank program ran normally, and performance was not affected.
## Try out Chaos Mesh
As a cloud-native chaos engineering platform, Chaos Mesh features all-around [fault injection methods for complex systems on Kubernetes](https://pingcap.com/blog/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes/), covering faults in Pods, the network, the file system, and even the kernel.
Wanna have some hands-on experience in chaos engineering? Welcome to [Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh). This [10-minute tutorial](https://pingcap.com/blog/run-first-chaos-experiment-in-ten-minutes/) will help you quickly get started with chaos engineering and run your first chaos experiment with Chaos Mesh.
---
slug: /chaos-mesh-join-cncf-sandbox-project
title: Chaos Mesh® Joins CNCF as a Sandbox Project
author: Chaos Mesh Authors
author_title: Maintainer of Chaos Mesh
author_url: https://github.com/chaos-mesh
author_image_url: https://avatars1.githubusercontent.com/u/59082378?v=4
image: /img/chaos-mesh-cncf.png
tags: [Chaos Mesh, Chaos Engineering, Kubernetes, CNCF, Cloud Native]
---
![Chaos Mesh Join CNCF as Sandbox Project](/img/chaos-mesh-cncf.png)
We’re thrilled to announce that [Chaos Mesh®](https://github.com/chaos-mesh/chaos-mesh) is now officially accepted as a CNCF Sandbox project. As maintainers of Chaos Mesh, we’d like to thank all the contributors and adopters. This would not be possible without your trust, support, and contributions.
<!--truncate-->
Chaos Mesh is a powerful Chaos Engineering platform that orchestrates chaos experiments on Kubernetes environments. By covering comprehensive fault injection methods in Pod, network, file system, and even the kernel, we aim at providing a neutral, universal Chaos Engineering platform that enables cloud-native applications to be as resilient as they should be.
![Architecture](/img/chaos-mesh.svg)
Within only 7 months since it was open-sourced on December 31st, 2019, Chaos Mesh has already received 2000 GitHub stars, with 44 contributors from multiple organizations. As a young project, the adoption in production has been the key recognition and motivation that pushes us forward constantly. Here is a list of our adopters so far:
* [PingCAP](http://www.pingcap.com)
* [Xpeng Motor](https://en.xiaopeng.com/)
* [NetEase Fuxi Lab](https://www.crunchbase.com/organization/netease-fuxi-lab)
* [JuiceFS](http://juicefs.com/?hl=en)
* [Dailymotion](https://www.dailymotion.com/)
* [Meituan-Dianping](https://about.meituan.com/en)
* [Celo](https://celo.org/)
Being a CNCF Sandbox project marks a major step forward for the project. It means that Chaos Mesh has become part of the great vendor-neutral cloud-native community. With the guidance and help from CNCF, Chaos Mesh will strive to develop a community with transparent, meritocracy-based governance for open communication and open collaboration, while driving the project forward, towards our ultimate goal of establishing the Chaos Engineering standards on Cloud.
Currently, Chaos Mesh is in active development for 1.0 GA. Going forward, we will be focusing on the following aspects:
* Lowering the bar of chaos engineering by improving Chaos Dashboard
* Extending chaos injection to application layers
* Completing the full chaos engineering loop with status checking, reporting, scenario defining, and more
If you are interested in the project, check out our [website](https://chaos-mesh.org/), join our [Slack](https://cloud-native.slack.com/archives/C018JJ686BS) discussions, or attend our [monthly meeting](https://docs.google.com/document/d/1H8IfmhIJiJ1ltg-XLjqR_P_RaMHUGrl1CzvHnKM_9Sc/edit) to know more. Or better yet, become part of us.
---
slug: /building_automated_testing_framework
title: Building an Automated Testing Framework Based on Chaos Mesh® and Argo
author: Ben Ye, Chengwen Yin
author_title: Maintainer of Chaos Mesh
author_url: https://github.com/chaos-mesh/chaos-mesh/blob/master/MAINTAINERS.md
author_image_url: https://avatars1.githubusercontent.com/u/59082378?v=4
image: /img/automated_testing_framework.png
tags: [Chaos Mesh, Chaos Engineering, Test Automation]
---
![TiPocket - Automated Testing Framework](/img/automated_testing_framework.png)
[Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh)® is an open-source chaos engineering platform for Kubernetes. Although it provides rich capabilities to simulate abnormal system conditions, it still only solves a fraction of the Chaos Engineering puzzle. Besides fault injection, a full chaos engineering application consists of hypothesizing around defined steady states, running experiments in production, validating the system via test cases, and automating the testing.
This article describes how we use [TiPocket](https://github.com/pingcap/tipocket), an automated testing framework to build a full Chaos Engineering testing loop for TiDB, our distributed database.
<!--truncate-->
## Why do we need TiPocket?
Before we can put a distributed system like [TiDB](https://github.com/pingcap/tidb) into production, we have to ensure that it is robust enough for day-to-day use. For this reason, several years ago we introduced Chaos Engineering into our testing framework. In our testing framework, we:
1. Observe the normal metrics and develop our testing hypothesis.
2. Inject a list of failures into TiDB.
3. Run various test cases to verify TiDB in fault scenarios.
4. Monitor and collect test results for analysis and diagnosis.
This sounds like a solid process, and we’ve used it for years. However, as TiDB evolves, the testing scale multiplies. We have multiple fault scenarios, against which dozens of test cases run in the Kubernetes testing cluster. Even with Chaos Mesh helping to inject failures, the remaining work can still be demanding—not to mention the challenge of automating the pipeline to make the testing scalable and efficient.
This is why we built TiPocket, a fully-automated testing framework based on Kubernetes and Chaos Mesh. Currently, we mainly use it to test TiDB clusters. However, because of TiPocket’s Kubernetes-friendly design and extensible interface, you can use Kubernetes’ create and delete logic to easily support other applications.
## How does it work
Based on the above requirements, we need an automatic workflow that:
- [Injects chaos](#injecting-chaos---chaos-mesh)
- [Verifies the impact of that chaos](#verifying-chaos-impacts-test-cases)
- [Automates the chaos pipeline](#automating-the-chaos-pipeline---argo)
- [Visualizes the results](#visualizing-the-results-loki)
## Injecting chaos - Chaos Mesh
Fault injection is the core of chaos testing. In a distributed database, faults can happen anytime, anywhere—from node crashes, network partitions, and file system failures, to kernel panics. This is where Chaos Mesh comes in.
Currently, TiPocket supports the following types of fault injection:
- **Network**: Simulates network partitions, random packet loss, disorder, duplication, or delay of links.
- **Time skew**: Simulates clock skew of the container to be tested.
- **Kill**: Kills the specified pod, either randomly in a cluster or within a component (TiDB, TiKV, or Placement Driver (PD)).
- **I/O**: Injects I/O delays in TiDB’s storage engine, TiKV, to identify I/O related issues.
With fault injection handled, we need to think about verification. How do we make sure TiDB can survive these faults?
## Verifying chaos impacts: test cases
To validate how TiDB withstands chaos, we implemented dozens of test cases in TiPocket, combined with a variety of inspection tools. To give you an overview of how TiPocket verifies TiDB in the event of failures, consider the following test cases. These cases focus on SQL execution, transaction consistency, and transaction isolation.
### Fuzz testing: SQLsmith
[SQLsmith](https://github.com/pingcap/tipocket/tree/master/pkg/go-sqlsmith) is a tool that generates random SQL queries. TiPocket creates a TiDB cluster and a MySQL instance. The random SQL statements generated by SQLsmith are executed on both TiDB and MySQL while various faults are injected into the TiDB cluster. In the end, the execution results are compared. If we detect inconsistencies, there are potential issues with our system.
### Transaction consistency testing: Bank and Porcupine
[Bank](https://github.com/pingcap/tipocket/tree/master/cmd/bank) is a classical test case that simulates the transfer process in a banking system. Under snapshot isolation, all transfers must ensure that the total balance across all accounts stays consistent at every moment, even in the face of system failures. If there are inconsistencies in the total amount, there are potential issues with our system.
[Porcupine](https://github.com/anishathalye/porcupine) is a linearizability checker in Go built to test the correctness of distributed systems. It takes a sequential specification as executable Go code, along with a concurrent history, and it determines whether the history is linearizable with respect to the sequential specification. In TiPocket, we use the [Porcupine](https://github.com/pingcap/tipocket/tree/master/pkg/check/porcupine) checker in multiple test cases to check whether TiDB meets the linearizability constraint.
### Transaction isolation testing: Elle
[Elle](https://github.com/jepsen-io/elle) is an inspection tool that verifies a database’s transaction isolation level. TiPocket integrates [go-elle](https://github.com/pingcap/tipocket/tree/master/pkg/elle), the Go implementation of the Elle inspection tool, to verify TiDB’s isolation level.
These are just a few of the test cases TiPocket uses to verify TiDB’s accuracy and stability. For more test cases and verification methods, see our [source code](https://github.com/pingcap/tipocket).
## Automating the chaos pipeline - Argo
Now that we have Chaos Mesh to inject faults, a TiDB cluster to test, and ways to validate TiDB, how can we automate the chaos testing pipeline? Two options come to mind: we could implement the scheduling functionality in TiPocket, or hand over the job to existing open-source tools. To keep TiPocket dedicated to the testing part of our workflow, we chose the open-source tools approach. This, plus our all-in-K8s design, led us directly to [Argo](https://github.com/argoproj/argo).
Argo is a workflow engine designed for Kubernetes. It has been open source for a long time and has received widespread attention and adoption.
Argo has abstracted several custom resource definitions (CRDs) for workflows. The most important ones include Workflow Template, Workflow, and Cron Workflow. Here is how Argo fits in TiPocket:
- **Workflow Template** is a template defined in advance for each test task. Parameters can be passed in when the test is running.
- **Workflow** schedules multiple workflow templates in different orders, which form the tasks to be executed. Argo also lets you add conditions, loops, and directed acyclic graphs (DAGs) in the pipeline.
- **Cron Workflow** lets you schedule a workflow like a cron job. It is perfectly suitable for scenarios where you want to run test tasks for a long time.
The sample workflow for our predefined bank test is shown below:
```yml
spec:
entrypoint: call-tipocket-bank
arguments:
parameters:
- name: ns
value: tipocket-bank
- name: nemesis
value: random_kill,kill_pd_leader_5min,partition_one,subcritical_skews,big_skews,shuffle-leader-scheduler,shuffle-region-scheduler,random-merge-scheduler
templates:
- name: call-tipocket-bank
steps:
- - name: call-wait-cluster
templateRef:
name: wait-cluster
template: wait-cluster
- - name: call-tipocket-bank
templateRef:
name: tipocket-bank
template: tipocket-bank
```
In this example, we use the workflow template and nemesis parameters to define the specific failure to inject. You can reuse the template to define multiple workflows that suit different test cases. This allows you to add more customized failure injections in the flow.
Besides [TiPocket's](https://github.com/pingcap/tipocket/tree/master/argo/workflow) sample workflows and templates, the design also allows you to add your own failure injection flows. Handling complicated logic using codable workflows makes Argo developer-friendly and an ideal choice for our scenarios.
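For example, here is a minimal sketch of how a Cron Workflow could invoke the predefined `tipocket-bank` template on a nightly schedule. The name and schedule are illustrative; TiPocket's real workflows live in the repository linked above:

```yml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: tipocket-bank-nightly # illustrative name
spec:
  schedule: "0 2 * * *" # run the bank suite every night at 02:00
  workflowSpec:
    entrypoint: call-tipocket-bank
    templates:
      - name: call-tipocket-bank
        steps:
          - - name: call-tipocket-bank
              templateRef:
                name: tipocket-bank # the predefined workflow template
                template: tipocket-bank
```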
Now, our chaos experiment is running automatically. But what if our results do not meet our expectations? How do we locate the problem? TiDB saves a variety of monitoring information, which makes log collection essential for enabling observability in TiPocket.
## Visualizing the results: Loki
In cloud-native systems, observability is very important. Generally speaking, you can achieve observability through **metrics**, **logging**, and **tracing**. TiPocket’s main test cases evaluate TiDB clusters, so metrics and logs are our default sources for locating issues.
On Kubernetes, Prometheus is the de-facto standard for metrics. However, there is no common way for log collection. Solutions such as [Elasticsearch](https://en.wikipedia.org/wiki/Elasticsearch), [Fluent Bit](https://fluentbit.io/), and [Kibana](https://www.elastic.co/kibana) perform well, but they may cause system resource contention and high maintenance costs. We decided to use [Loki](https://github.com/grafana/loki), the Prometheus-like log aggregation system from [Grafana](https://grafana.com/).
Prometheus processes TiDB's monitoring information. Prometheus and Loki have a similar labeling system, so we can easily combine Prometheus' monitoring indicators with the corresponding Pod logs and query them with a similar language. Grafana also supports Loki dashboards, which means we can use Grafana to display monitoring indicators and logs at the same time. Grafana is already the built-in monitoring component in TiDB, so Loki can reuse it.
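As an illustration, wiring Loki into Grafana only takes a small datasource provisioning file. This is a sketch, assuming Loki is reachable inside the cluster at `http://loki:3100`:

```yml
# e.g. grafana/provisioning/datasources/loki.yml (illustrative path)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```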
## Putting them all together - TiPocket
Now, everything is ready. Here is a simplified diagram of TiPocket:
![TiPocket Architecture](/img/tipocket-architecture.png)
As you can see, the Argo workflow manages all chaos experiments and test cases. Generally, a complete test cycle involves the following steps:
1. Argo creates a Cron Workflow, which defines the cluster to be tested, the faults to inject, the test case, and the duration of the task. If necessary, the Cron Workflow also lets you view case logs in real-time.
![Argo Workflow](/img/argo-workflow.png)
2. At a specified time, the Cron Workflow is triggered and a separate TiPocket thread is started in the workflow. TiPocket sends TiDB-Operator the definition of the cluster to test. In turn, TiDB-Operator creates the target TiDB cluster. Meanwhile, Loki collects the related logs.
3. Chaos Mesh injects faults into the cluster.
4. Using the test cases mentioned above, the user validates the health of the system. Any test case failure leads to workflow failure in Argo, which triggers Alertmanager to send the result to the specified Slack channel. If the test cases complete normally, the cluster is cleared, and Argo stands by until the next test.
![Alert in Slack](/img/alert_message.png)
This is the complete TiPocket workflow.
## Join us
[Chaos Mesh](https://github.com/pingcap/chaos-mesh) and [TiPocket](https://github.com/pingcap/tipocket) are both in active iteration. We have donated Chaos Mesh to [CNCF](https://github.com/cncf/toc/pull/367), and we look forward to more community members joining us in building a complete Chaos Engineering ecosystem. If this sounds interesting to you, check out our [website](https://chaos-mesh.org/), or join #chaos-mesh in [Slack](https://cloud-native.slack.com/archives/C018JJ686BS).
---
id: develop_a_new_chaos
title: Develop a New Chaos
sidebar_label: Develop a New Chaos
---
After [preparing the development environment](setup_env.md), let's develop a new type of chaos, HelloWorldChaos, which only prints a "Hello World!" message to the log. Generally, to add a new chaos type for Chaos Mesh, you need to take the following steps:
1. [Add the chaos object in controller](#add-the-chaos-object-in-controller)
2. [Register the CRD](#register-the-crd)
3. [Implement the schema type](#implement-the-schema-type)
4. [Make the Docker image](#make-the-docker-image)
5. [Run chaos](#run-chaos)
## Add the chaos object in controller
In Chaos Mesh, all chaos types are managed by the controller manager. To add a new chaos type, you start by adding the corresponding reconciler type in the controller, as instructed in the following steps:
1. Add the HelloWorldChaos object in the controller manager [main.go](https://github.com/chaos-mesh/chaos-mesh/blob/master/cmd/controller-manager/main.go#L104).
You will notice existing chaos types such as PodChaos, NetworkChaos and IOChaos. Add the new type below them:
```go
if err = (&controllers.HelloWorldChaosReconciler{
Client: mgr.GetClient(),
Log: ctrl.Log.WithName("controllers").WithName("HelloWorldChaos"),
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "unable to create controller", "controller", "HelloWorldChaos")
os.Exit(1)
}
```
2. Under [controllers](https://github.com/chaos-mesh/chaos-mesh/tree/master/controllers), create a `helloworldchaos_controller.go` file and edit it as below:
```go
package controllers

import (
	"github.com/go-logr/logr"

	chaosmeshv1alpha1 "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// HelloWorldChaosReconciler reconciles a HelloWorldChaos object
type HelloWorldChaosReconciler struct {
	client.Client
	Log logr.Logger
}

// +kubebuilder:rbac:groups=chaos-mesh.org,resources=helloworldchaos,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=chaos-mesh.org,resources=helloworldchaos/status,verbs=get;update;patch

func (r *HelloWorldChaosReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	logger := r.Log.WithValues("reconciler", "helloworldchaos")

	// The main logic of HelloWorldChaos: print the log "Hello World!" and return nothing.
	logger.Info("Hello World!")

	return ctrl.Result{}, nil
}

func (r *HelloWorldChaosReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Exports the HelloWorldChaos object, which represents the YAML schema content the user applies.
	return ctrl.NewControllerManagedBy(mgr).
		For(&chaosmeshv1alpha1.HelloWorldChaos{}).
		Complete(r)
}
```
> **Note:**
>
> The comment `// +kubebuilder:rbac:groups=chaos-mesh.org...` is an access control marker that determines which accounts can access this reconciler. To make it accessible to the dashboard and chaos-controller-manager, you need to modify [controller-manager-rbac.yaml](https://github.com/chaos-mesh/chaos-mesh/blob/master/helm/chaos-mesh/templates/controller-manager-rbac.yaml) accordingly:
```yaml
- apiGroups: ["chaos-mesh.org"]
  resources:
    - podchaos
    - networkchaos
    - iochaos
    - helloworldchaos # Add this line to every rule for the chaos-mesh.org group
  verbs: ["*"]
```
## Register the CRD
The HelloWorldChaos object is a custom resource object in Kubernetes. This means you need to register the corresponding CRD in the Kubernetes API. To do this, modify [kustomization.yaml](https://github.com/chaos-mesh/chaos-mesh/blob/master/config/crd/kustomization.yaml) by adding the corresponding line as shown below:
```yaml
resources:
- bases/chaos-mesh.org_podchaos.yaml
- bases/chaos-mesh.org_networkchaos.yaml
- bases/chaos-mesh.org_iochaos.yaml
- bases/chaos-mesh.org_helloworldchaos.yaml # this is the new line
```
## Implement the schema type
To implement the schema type for the new chaos object, add `helloworldchaos_types.go` in the [api directory](https://github.com/chaos-mesh/chaos-mesh/tree/master/api/v1alpha1) and modify it as below:
```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:object:root=true

// HelloWorldChaos is the Schema for the helloworldchaos API
type HelloWorldChaos struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

// +kubebuilder:object:root=true

// HelloWorldChaosList contains a list of HelloWorldChaos
type HelloWorldChaosList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []HelloWorldChaos `json:"items"`
}

func init() {
	SchemeBuilder.Register(&HelloWorldChaos{}, &HelloWorldChaosList{})
}
```
With this file added, the HelloWorldChaos schema type is defined and can be referenced by the following YAML lines:
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HelloWorldChaos
metadata:
  name: <name-of-this-resource>
  namespace: <ns-of-this-resource>
```
## Make the Docker image
With the object successfully added, you can build the Docker image and push it to your registry:
```bash
make
make docker-push
```
> **Note:**
>
> The default `DOCKER_REGISTRY` is `localhost:5000`, which is preset in `hack/kind-cluster-build.sh`. You can overwrite it to any registry to which you have access permission.
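For example, assuming the build scripts read `DOCKER_REGISTRY` from the environment (as `hack/kind-cluster-build.sh` does), overriding the registry might look like the following sketch:
```bash
# Hypothetical example: point the build and push at your own registry.
# Assumes the Makefile honors the DOCKER_REGISTRY environment variable.
export DOCKER_REGISTRY=my-registry.example.com:5000
make
make docker-push
```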
## Run chaos
You are almost there. In this step, you will pull the image and apply it for testing.
Before you pull any image for Chaos Mesh (using `helm install` or `helm upgrade`), modify [values.yaml](https://github.com/chaos-mesh/chaos-mesh/blob/master/helm/chaos-mesh/values.yaml) of the Helm chart to replace the default images with the ones you just pushed to your local registry.
In this case, the chart uses `pingcap/chaos-mesh:latest` as the default image, so you need to replace the registry prefix with `localhost:5000`, as shown below:
```yaml
clusterScoped: true

# Also see clusterScoped and controllerManager.serviceAccount
rbac:
  create: true

controllerManager:
  serviceAccount: chaos-controller-manager
  ...
  image: localhost:5000/pingcap/chaos-mesh:latest
  ...

chaosDaemon:
  image: localhost:5000/pingcap/chaos-daemon:latest
  ...

dashboard:
  image: localhost:5000/pingcap/chaos-dashboard:latest
  ...
```
Now take the following steps to run chaos:
1. Get the related custom resource type for Chaos Mesh:
```bash
kubectl apply -f manifests/
kubectl get crd podchaos.chaos-mesh.org
```
2. Install Chaos Mesh:
```bash
helm install helm/chaos-mesh --name=chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh
```
The arguments `--set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock` are used to support network chaos on kind.
3. Create `chaos.yaml` in any location with the lines below:
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HelloWorldChaos
metadata:
  name: hello-world
  namespace: chaos-testing
```
4. Apply the chaos:
```bash
kubectl apply -f /path/to/chaos.yaml
kubectl get HelloWorldChaos -n chaos-testing
```
Now you should be able to check the `Hello World!` result in the log:
```bash
kubectl logs chaos-controller-manager-{pod-post-fix} -n chaos-testing
```
> **Note:**
>
> `{pod-post-fix}` is a random string generated by Kubernetes. You can check it by executing `kubectl get po -n chaos-testing`.
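If you prefer not to look up the suffix by hand, the following sketch fetches the log in one step. It assumes the controller-manager pod carries the label `app.kubernetes.io/component=controller-manager`; verify the actual labels with `kubectl get pods --show-labels -n chaos-testing` first.
```bash
# Hypothetical one-liner: resolve the controller-manager pod name by label,
# then search its log for the message printed by HelloWorldChaos.
POD=$(kubectl get pods -n chaos-testing \
  -l app.kubernetes.io/component=controller-manager \
  -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$POD" -n chaos-testing | grep "Hello World!"
```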
## Next steps
Congratulations! You have successfully added a new chaos type to Chaos Mesh. Let us know if you run into any issues during the process. If you would like to make other types of contributions, refer to Add facilities to chaos daemon (WIP).
---
id: development_overview
title: Development Guide
sidebar_label: Development Overview
---
This guide prepares you for the development of Chaos Mesh from scratch. Before you get started, it is recommended to get familiar with the project through the following materials:
- [README](https://github.com/chaos-mesh/chaos-mesh/blob/master/README.md)
The development flow starts from [Set up your development environment](setup_env.md). After this, you can choose any of the following procedures to contribute:
- [Develop a New Chaos Type](dev_hello_world.md)
- Add facilities to chaos daemon
---
id: set_up_the_development_environment
title: Set up the development environment
sidebar_label: Set up the development environment
---
This document walks you through the environment setup process for Chaos Mesh development.
## Prerequisites
- [golang](https://golang.org/dl/) version >= v1.13
- [docker](https://www.docker.com/)
- [gcc](https://gcc.gnu.org/)
- [helm](https://helm.sh/) version >= v2.8.2
- [kind](https://github.com/kubernetes-sigs/kind)
- [yarn](https://yarnpkg.com/lang/en/) and [nodejs](https://nodejs.org/en/) (for Chaos Dashboard)
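You can quickly confirm that the tools are available and meet the version requirements, for example:
```bash
# Print the versions of the prerequisite tools.
# With helm v2, add --client to avoid querying the tiller server.
go version
docker version
gcc --version
helm version
kind version
node --version
yarn --version
```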
## Prepare the toolchain
Make sure you have the above prerequisites met. Now follow the steps below to prepare the toolchain for compiling Chaos Mesh:
1. Clone the Chaos Mesh repo to your local machine.
```bash
git clone https://github.com/chaos-mesh/chaos-mesh.git
cd chaos-mesh
```
2. Install the Kubernetes API development framework - [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) and [kustomize](https://github.com/kubernetes-sigs/kustomize).
```bash
make ensure-all
```
3. Make sure [Docker](https://docs.docker.com/install/) is installed and running on your local machine.
4. Make sure `${GOPATH}/bin` is in your `PATH`.
```bash
echo 'export PATH=$(go env GOPATH)/bin:${PATH}' >> ~/.bash_profile
```
```bash
source ~/.bash_profile
```
> **Note:**
>
> If your yarn is newly installed, you might need to restart the terminal to make it available.
Now you can test the toolchain by running:
```bash
make
```
If there is no error in the output, the compiling toolchain is successfully configured.
## Prepare the deployment environment
With the toolchain ready, you still need a local Kubernetes cluster as the deployment environment. Because kind is already installed, you can now set up the Kubernetes cluster directly:
```bash
hack/kind-cluster-build.sh
```
The above script creates a Kubernetes cluster using kind. When you no longer need this cluster, you can run the following command to delete it:
```bash
kind delete cluster --name=kind
```
## Next step
Congratulations! You are now all set up for Chaos Mesh development. Try the following tasks:
- [Develop a New Chaos Type](dev_hello_world.md)
- Add facilities to chaos daemon
---
id: faqs
title: FAQs
sidebar_label: FAQs
---
## Question
### Q: If I do not have Kubernetes clusters deployed, can I use Chaos Mesh to create chaos experiments?
No, you cannot use Chaos Mesh in this case. However, you can still run chaos experiments from the command line. Refer to [Command Line Usages of Chaos](https://github.com/pingcap/tipocket/blob/master/doc/command_line_chaos.md) for details.
### Q: I have deployed Chaos Mesh and created PodChaos experiments successfully, but I still failed in creating NetworkChaos/TimeChaos Experiment. The log is shown below:
```
2020-06-18T01:05:26.207Z ERROR controllers.TimeChaos failed to apply chaos on all pods {"reconciler": "timechaos", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.xx.xx:xxxxx: connect: connection refused\""}
```
You can try using the parameter: `hostNetwork`, as shown below:
```
# vim helm/chaos-mesh/values.yaml, change hostNetwork from false to true
hostNetwork: true
```
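After editing `values.yaml`, the change only takes effect once Chaos Mesh is redeployed. A minimal sketch with helm 3 (adjust the chart path and flags to match how you originally installed Chaos Mesh):
```bash
# Redeploy Chaos Mesh so that hostNetwork: true takes effect.
helm upgrade chaos-mesh helm/chaos-mesh --namespace=chaos-testing
```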
### Q: I just saw `ERROR: failed to get cluster internal kubeconfig: command "docker exec --privileged kind-control-plane cat /etc/kubernetes/admin.conf" failed with error: exit status 1` when installing Chaos Mesh with kind. How to fix it?
You can try the following command to fix it:
```
kind delete cluster
```
then deploy again.
## Debug
### Q: Experiment not working after chaos is applied
You can debug as described below:
Execute `kubectl describe` to check the specified chaos experiment resource.
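For example (the resource kind, name, and namespace below are placeholders; use those of your own experiment):
```bash
# Inspect the status of a chaos experiment resource.
kubectl describe networkchaos my-network-delay -n chaos-testing
```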
- If there are `NextStart` and `NextRecover` fields under `spec`, the chaos will be triggered after the `NextStart` time is reached.
- If there are no `NextStart` and `NextRecover` fields in `spec`, run the following command to get the controller-manager's log and check whether there are errors in it.
```bash
kubectl logs -n chaos-testing chaos-controller-manager-xxxxx (replace this with the name of the controller-manager) | grep "ERROR"
```
If the error message is `no pod is selected`, run the following command to show the pod labels and check whether the selector matches them.
```bash
kubectl get pods -n yourNamespace --show-labels
```
If the above steps cannot solve the problem, or you encounter other related errors in the controller's log, [file an issue](https://github.com/chaos-mesh/chaos-mesh/issues) or message us in the #sig-chaos-mesh channel in the [TiDB Community](https://chaos-mesh.org/tidbslack) Slack workspace.
## IOChaos
### Q: Running chaosfs sidecar container failed, and log shows `pid file found, ensure docker is not running or delete /tmp/fuse/pid`
The chaosfs sidecar container keeps restarting, and you might see the following logs in the current sidecar container:
```
2020-01-19T06:30:56.629Z INFO chaos-daemon Init hookfs
2020-01-19T06:30:56.630Z ERROR chaos-daemon failed to create pid file {"error": "pid file found, ensure docker is not running or delete /tmp/fuse/pid"}
github.com/go-logr/zapr.(*zapLogger).Error
```
* **Cause**: Chaos Mesh uses FUSE to inject I/O failures. The injection fails if you specify an existing directory as the source path for chaos. This often happens when you try to reuse a persistent volume (PV) with the `Retain` reclaim policy to request a PersistentVolumeClaim (PVC) resource.
* **Solution**: In this case, use the following command to change the reclaim policy to `Delete`:
```bash
kubectl patch pv <your-pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
```
---
id: get_started_on_kind
title: Get started on kind
---
import PickVersion from '@site/src/components/PickVersion'
This document describes how to deploy Chaos Mesh in Kubernetes on your laptop (Linux or macOS) using kind.
## Prerequisites
Before deployment, make sure [Docker](https://docs.docker.com/install/) is installed and running on your local machine.
## Install Chaos Mesh
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- --local kind
</PickVersion>
`install.sh` is an automated shell script that helps you install dependencies such as `kubectl`, `helm`, `kind`, and `kubernetes`, and then deploy Chaos Mesh itself.
After executing the above command, verify that Chaos Mesh is installed correctly.
You also can use [helm](https://helm.sh/) to [install Chaos Mesh manually](installation.md#install-by-helm).
### Verify your installation
Verify that Chaos Mesh is running:
```bash
kubectl get pod -n chaos-testing
```
Expected output:
```bash
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6d6d95cd94-kl8gs 1/1 Running 0 3m40s
chaos-daemon-5shkv 1/1 Running 0 3m40s
chaos-daemon-jpqhd 1/1 Running 0 3m40s
chaos-daemon-n6mfq 1/1 Running 0 3m40s
chaos-dashboard-d998856f6-vgrjs 1/1 Running 0 3m40s
```
## Uninstallation
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- --template | kubectl delete -f -
</PickVersion>
In addition, you can also uninstall Chaos Mesh by deleting the namespace directly.
```bash
kubectl delete ns chaos-testing
```
## Clean kind cluster
```bash
kind delete cluster --name=kind
```
---
id: get_started_on_minikube
title: Get started on Minikube
---
import PickVersion from '@site/src/components/PickVersion'
This document describes how to deploy Chaos Mesh in Kubernetes on your laptop (Linux or macOS) using Minikube.
## Prerequisites
Before deployment, make sure [Minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/) is installed on your local machine.
## Step 1: Set up the Kubernetes environment
Perform the following steps to set up the local Kubernetes environment:
1. Start a Kubernetes cluster:
```bash
minikube start --kubernetes-version v1.15.0 --cpus 4 --memory "8192mb"
```
> **Note:**
>
> It is recommended to allocate enough RAM (more than 8192 MiB) to the virtual machine (VM) using the `--cpus` and `--memory` flags.
2. Install helm:
```bash
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
helm init
```
3. Check whether the helm tiller pod is running:
```bash
kubectl -n kube-system get pods -l app=helm
```
## Step 2: Install Chaos Mesh
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
</PickVersion>
The above command installs all the CRDs, required service account configuration, and all components.
Before you start running a chaos experiment, verify if Chaos Mesh is installed correctly.
You also can use [helm](https://helm.sh/) to [install Chaos Mesh manually](installation.md#install-by-helm).
### Verify your installation
Verify that Chaos Mesh is running:
```bash
kubectl get pod -n chaos-testing
```
Expected output:
```bash
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6d6d95cd94-kl8gs 1/1 Running 0 3m40s
chaos-daemon-5shkv 1/1 Running 0 3m40s
chaos-daemon-jpqhd 1/1 Running 0 3m40s
chaos-daemon-n6mfq 1/1 Running 0 3m40s
chaos-dashboard-d998856f6-vgrjs 1/1 Running 0 3m40s
```
## Uninstallation
You can uninstall Chaos Mesh by deleting the namespace.
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- --template | kubectl delete -f -
</PickVersion>
## Limitations
There are some known restrictions for Chaos Operator deployed in the Minikube cluster:
- `netem chaos` is only supported for Minikube clusters >= version 1.6.
In Minikube, the default virtual machine driver's image does not contain the `sch_netem` kernel module in earlier versions. You can use `none` driver (if your host is Linux with the `sch_netem` kernel module loaded) to try these chaos actions using Minikube or [build an image with sch_netem by yourself](https://minikube.sigs.k8s.io/docs/contrib/building/iso/).
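For example, on a Linux host you can check whether the kernel module is available before choosing the `none` driver:
```bash
# Show the module if it is already loaded; otherwise try to load it.
lsmod | grep sch_netem || sudo modprobe sch_netem
```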
---
id: installation
title: Installation
---
import PickVersion from '@site/src/components/PickVersion'
This document describes how to install Chaos Mesh to perform chaos experiments against your application in Kubernetes.
If you want to try Chaos Mesh on your laptop (Linux or macOS), you can refer to the following two documents:
- [Get started on kind](get_started_on_kind.md)
- [Get started on minikube](get_started_on_minikube.md)
## Prerequisites
Before deploying Chaos Mesh, make sure the following items have been installed:
- Kubernetes version >= 1.12
- [RBAC](https://kubernetes.io/docs/admin/authorization/rbac) enabled (optional)
## Install Chaos Mesh
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
</PickVersion>
The above command installs all the CRDs, required service account configuration, and all components.
Before you start running a chaos experiment, verify if Chaos Mesh is installed correctly.
If you are using k3s or k3d, please also specify the `--k3s` flag:
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- --k3s
</PickVersion>
### Verify your installation
Verify that Chaos Mesh is running (for the usage of *kubectl*, refer to the [documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands)):
```bash
kubectl get pod -n chaos-testing
```
Expected output:
```bash
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6d6d95cd94-kl8gs 1/1 Running 0 3m40s
chaos-daemon-5shkv 1/1 Running 0 3m40s
chaos-daemon-jpqhd 1/1 Running 0 3m40s
chaos-daemon-n6mfq 1/1 Running 0 3m40s
chaos-dashboard-d998856f6-vgrjs 1/1 Running 0 3m40s
```
## Uninstallation
You can uninstall Chaos Mesh by deleting the namespace.
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- --template | kubectl delete -f -
</PickVersion>
## Install by helm
You also can install Chaos Mesh by [helm](https://helm.sh).
Before you start installing, make sure that helm v2 or helm v3 is installed correctly.
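You can check which helm version is installed, for example:
```bash
# helm v3 prints a single version string; helm v2 also reports the tiller (server) version.
helm version
```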
### Step 1: Add Chaos Mesh repository to Helm repos
```bash
helm repo add chaos-mesh https://charts.chaos-mesh.org
```
After adding the repository successfully, you can search for available versions with the following command:
```bash
helm search repo chaos-mesh
```
### Step 2: Create custom resource type
To use Chaos Mesh, you must create the related custom resource type first.
<PickVersion className="language-bash">
curl -sSL https://mirrors.chaos-mesh.org/latest/crd.yaml | kubectl apply -f -
</PickVersion>
### Step 3: Install Chaos Mesh
> **Note:**
>
> Currently, Chaos Dashboard is not installed by default. If you want to try it out, add `--set dashboard.create=true` to the helm commands below. Refer to [Configuration](https://github.com/chaos-mesh/chaos-mesh/tree/master/helm/chaos-mesh#configuration) for more information.
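For instance, combined with the helm 3 command used later in this section, enabling the dashboard might look like this:
```bash
# Hypothetical example: install Chaos Mesh with Chaos Dashboard enabled (helm 3).
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --set dashboard.create=true
```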
Depending on your environment, there are two methods of installing Chaos Mesh:
- Install in Docker environment
1. Create namespace `chaos-testing`:
```bash
kubectl create ns chaos-testing
```
2. Install Chaos Mesh using helm:
- For helm 2.X
```bash
helm install chaos-mesh/chaos-mesh --name=chaos-mesh --namespace=chaos-testing
```
- For helm 3.X
```bash
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing
```
3. Check whether Chaos Mesh pods are installed:
```bash
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh
```
Expected output:
```bash
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6d6d95cd94-kl8gs 1/1 Running 0 3m40s
chaos-daemon-5shkv 1/1 Running 0 3m40s
chaos-daemon-jpqhd 1/1 Running 0 3m40s
chaos-daemon-n6mfq 1/1 Running 0 3m40s
chaos-dashboard-d998856f6-vgrjs 1/1 Running 0 3m40s
```
- Install in containerd environment (kind)
1. Create namespace `chaos-testing`:
```bash
kubectl create ns chaos-testing
```
2. Install Chaos Mesh using helm:
- For helm 2.X
```bash
helm install helm/chaos-mesh --name=chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock
```
- For helm 3.X
```bash
helm install chaos-mesh helm/chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock
```
3. Check whether Chaos Mesh pods are installed:
```bash
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh
```
Expected output:
```bash
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6d6d95cd94-kl8gs 1/1 Running 0 3m40s
chaos-daemon-5shkv 1/1 Running 0 3m40s
chaos-daemon-jpqhd 1/1 Running 0 3m40s
chaos-daemon-n6mfq 1/1 Running 0 3m40s
chaos-dashboard-d998856f6-vgrjs 1/1 Running 0 3m40s
```
- Install in containerd environment (k3s)
1. Create namespace `chaos-testing`:
```bash
kubectl create ns chaos-testing
```
2. Install Chaos Mesh using helm:
- For helm 2.X
```bash
helm install chaos-mesh/chaos-mesh --name=chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock
```
- For helm 3.X
```bash
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock
```
3. Check whether Chaos Mesh pods are installed:
```bash
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh
```
Expected output:
```bash
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6d6d95cd94-kl8gs 1/1 Running 0 3m40s
chaos-daemon-5shkv 1/1 Running 0 3m40s
chaos-daemon-jpqhd 1/1 Running 0 3m40s
chaos-daemon-n6mfq 1/1 Running 0 3m40s
chaos-dashboard-d998856f6-vgrjs 1/1 Running 0 3m40s
```
After executing the above commands, you should be able to see the output indicating that all Chaos Mesh pods are up and running. Otherwise, check the current environment according to the prompt message or create an [issue](https://github.com/chaos-mesh/chaos-mesh/issues) for help.
---
slug: /
id: overview
title: Chaos Mesh
sidebar_label: Overview
---
Welcome to the Chaos Mesh documentation!
Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. At the current stage, it has the following components:
- **Chaos Operator**: the core component for chaos orchestration. Fully open-sourced.
- **Chaos Dashboard**: a Web UI for managing, designing, and monitoring chaos experiments; under development.
Chaos Mesh is a versatile chaos engineering solution that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.
## Architecture
![chaos-mesh](/img/chaos-mesh.svg)
---
id: v0.8.0
title: Chaos Mesh v0.8.0 Release Notes
sidebar_label: v0.8.0
---
Chaos Mesh v0.8.0 provides the ability to orchestrate chaos experiments in Kubernetes environments, with support for comprehensive types of failure simulation, including Pod failures, container failures, network failures, file system failures, system time failures, and kernel failures. Helm installation is also supported so that users can quickly deploy Chaos Mesh for chaos experiments. Chaos Mesh uses YAML to define chaos experiments and provides a rich range of preset chaos test samples so that users can quickly try Chaos Mesh.
## New Features and Enhancements
- Add `PodChaos` to simulate the failure on Pods and Containers, including Pods and Containers being killed, Pods being continuously unavailable
- Add `NetworkChaos` to simulate network failures, including delay, packet duplication, packet loss, partition, etc
- Add `TimeChaos` to simulate failures on the system clock, such as clock skew
- Add `IOChaos` to simulate failures on the file system, including file system I/O delay, and file system I/O errors
- Add `KernelChaos` to simulate kernel failures
- Add `StressChaos` to simulate CPU burn and Memory burn
- Support rich selectors to specify the scope of the chaos experiment
- Support rich schedulers, including using cron to schedule chaos experiments
- Support pausing a chaos experiment provisionally
- Support defining chaos experiments using YAML file
- Support ValidatingAdmissionWebhook for verifying the chaos object
- Support cert-manager for certificate management
- Support deploying Chaos Mesh using Helm
- Support saving metrics using Prometheus
- Support recording information of chaos experiment in Kubernetes events
- Support the complete e2e testing framework
---
id: v0.9.0
title: Chaos Mesh v0.9.0 Release Notes
sidebar_label: v0.9.0
---
Chaos Mesh v0.9.0 mainly introduces the Chaos Dashboard component, which is the web UI for users to manage and monitor chaos experiments. In this version, NetworkChaos has been refactored to support setting multiple network attack rules on the same Pod at the same time, and a one-click installation script has been added to help users quickly get started with Chaos Mesh.
## New Features & Enhancements
- Introduce Chaos Dashboard component
- Support creating/updating/deleting/pausing PodChaos, NetworkChaos, StressChaos, TimeChaos, IoChaos, KernelChaos through the web interface [#481](https://github.com/pingcap/chaos-mesh/pull/481)
- Support directly uploading YAML files through the interface to create chaos experiments [#665](https://github.com/chaos-mesh/chaos-mesh/pull/665)
- Support obtaining specific fault injection event details through the interface [#628](https://github.com/pingcap/chaos-mesh/pull/628)
- Support direct reuse of archived chaos configurations [#783](https://github.com/pingcap/chaos-mesh/pull/783)
- Support forcibly cleaning up chaos experiments by setting annotations on the chaos experiment object [#415](https://github.com/pingcap/chaos-mesh/pull/415) [#478](https://github.com/pingcap/chaos-mesh/pull/478)
- Support the use of the `install.sh` script to quickly install Chaos Mesh [#466](https://github.com/pingcap/chaos-mesh/pull/466) [#506](https://github.com/pingcap/chaos-mesh/pull/506) [#511](https://github.com/pingcap/chaos-mesh/pull/511)
- Add a new sidecar configuration template to simplify the IoChaos configuration file [#502](https://github.com/pingcap/chaos-mesh/pull/502)
- Support setting protected namespaces [#661](https://github.com/pingcap/chaos-mesh/pull/661)
- Support injecting StressChaos into a specified container in the Pod [#759](https://github.com/pingcap/chaos-mesh/pull/759) [#794](https://github.com/pingcap/chaos-mesh/pull/794)
- Refactor NetworkChaos to support setting multiple network attack rules on the same Pod [#788](https://github.com/pingcap/chaos-mesh/pull/788)
## Major Bug Fixes
- Fix burn-memory not taking effect [#776](https://github.com/pingcap/chaos-mesh/pull/776)
- Fix the failure to restore NetworkChaos [#788](https://github.com/pingcap/chaos-mesh/pull/788)