Conversation

@maoling
Member

@maoling maoling commented Nov 7, 2019

  • Byteman is a powerful tool for injecting faults at runtime, especially in distributed systems
  • In the future, we can also introduce it into our unit test infra, just as Hadoop and Cassandra did. PR-123 is a good starting point.
  • More details in ZOOKEEPER-3601

@anmolnar
Contributor

anmolnar commented Dec 5, 2019

@maoling I like the idea of introducing a fault injection framework for ZooKeeper, but is that it? I mean, do we only need to add documentation and we're good to go?
What's the long term plan? Do we need to pick up and merge #123 too?

@maoling
Member Author

maoling commented Dec 26, 2019

@anmolnar

  • This PR is an outside one that gives users examples of how to inject faults into zk servers with Byteman (BM) at runtime
  • PR-123 is a great patch that injects faults to verify the fix for ZOOKEEPER-2549. Without Byteman, writing a unit test for that bug would be a headache. If @eribeiro isn't free these days, I or someone else can pick it up.
  • In the future, we will implement our own chaos monkey to inject random faults (e.g. network, disk) to verify the correctness of zk's functions. Look at these two great references: New Functional Testing in etcd, Testing distributed systems in Go
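
As a rough illustration of the chaos monkey idea in the last bullet, here is a minimal sketch that plans one random fault per round against a random server. Every name in it (`FaultType`, `PlannedFault`, `ChaosPlanner`) is hypothetical and not part of ZooKeeper or any existing tool. It only produces a plan; mapping each entry to a real network or disk fault is left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical fault categories, loosely following the "network, disk" examples.
enum FaultType { NETWORK_DELAY, NETWORK_PARTITION, DISK_FULL, PROCESS_KILL }

// One planned fault: which round, which server (1-based id), which fault.
final class PlannedFault {
    final int round;
    final int serverId;
    final FaultType type;

    PlannedFault(int round, int serverId, FaultType type) {
        this.round = round;
        this.serverId = serverId;
        this.type = type;
    }

    @Override
    public String toString() {
        return "round=" + round + " server=" + serverId + " fault=" + type;
    }
}

final class ChaosPlanner {
    // A fixed seed makes a failing run reproducible, which matters when a
    // random fault sequence uncovers a bug that has to be replayed.
    static List<PlannedFault> plan(long seed, int servers, int rounds) {
        if (servers <= 0 || rounds < 0) {
            throw new IllegalArgumentException("servers must be > 0 and rounds >= 0");
        }
        Random rnd = new Random(seed);
        FaultType[] types = FaultType.values();
        List<PlannedFault> plan = new ArrayList<>();
        for (int round = 0; round < rounds; round++) {
            int serverId = rnd.nextInt(servers) + 1;          // pick a victim
            FaultType type = types[rnd.nextInt(types.length)]; // pick a fault
            plan.add(new PlannedFault(round, serverId, type));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Example: a 3-server ensemble, 5 rounds of faults.
        for (PlannedFault fault : plan(42L, 3, 5)) {
            System.out.println(fault);
        }
    }
}
```

Keeping the planner separate from the code that actually injects faults means the same plan could be replayed with different injection mechanisms.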

Contributor

@anmolnar anmolnar left a comment

+1 Alright.

@anmolnar
Contributor

retest maven build

@functioner

I was wondering if the fault injection (Byteman or Chaos Monkey or whatever) functionality is available now? This functionality may be used in ZOOKEEPER-4074.

@maoling
Member Author

maoling commented Mar 9, 2021

  • @functioner the functionality is still working (glad to see you used Byteman to reproduce the issue in ZOOKEEPER-4203)
  • @anmolnar It seems that this patch is very useful for users and developers. Do you have a plan to merge it?

Sharing some other chaos engineering approaches I currently use for ZooKeeper:

blockade:

I use it for injecting network latency and network partitions (very useful, I love it). I have written lots of specific scripts that help me create the custom network partition patterns I want.

chaos-mesh:

I use it for general chaos work. It has a good web UI and is integrated with k8s. But AFAIK, I cannot customize network partitions with it.

jepsen:

A framework for distributed systems verification, with fault injection (mainly used for consistency checks). I used it for the linearizable read consistency check in ZOOKEEPER-3600.

Byteman:

It's what this patch wants to introduce. Byteman is a tool which makes it easy to trace, monitor and test the behaviour of Java applications and JDK runtime code. It injects Java code into your application methods or into Java runtime methods without the need for you to recompile, repackage or even redeploy your application. Injection can be performed at JVM startup or after startup while the application is still running. Injected code can access any of your data and call any application methods, including private ones. You can inject code almost anywhere you want and there is no need to prepare the original source code in advance. You can even remove injected code and install different changes while the application continues to execute.
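
To make that description a bit more concrete, here is a minimal sketch of a target method together with, in its comment, a Byteman rule that would make it fail. The class `TxnLogWriter`, its `append` method, the script name and the jar path are all hypothetical and are not ZooKeeper code; only the RULE/CLASS/METHOD/AT ENTRY/IF/DO/ENDRULE layout and the `-javaagent` script option follow Byteman's usual form.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical target class (not part of ZooKeeper), used only to show
 * where a Byteman rule could hook in.
 *
 * A rule script such as the following (saved as, say, fail-append.btm)
 * would make every call to append() throw, without recompiling this class:
 *
 *   RULE fail txn log append
 *   CLASS TxnLogWriter
 *   METHOD append
 *   AT ENTRY
 *   IF true
 *   DO throw new java.io.IOException("injected by Byteman")
 *   ENDRULE
 *
 * Loading the script at JVM startup (the jar path is an assumption):
 *   java -javaagent:/path/to/byteman.jar=script:fail-append.btm ...
 */
class TxnLogWriter {
    private final List<String> entries = new ArrayList<>();

    // Declares IOException, so a rule that throws IOException here is legal.
    void append(String entry) throws IOException {
        if (entry == null) {
            throw new IOException("null entry");
        }
        entries.add(entry);
    }

    int size() {
        return entries.size();
    }
}

class TxnLogWriterDemo {
    // Returns true if the append succeeded, false if it threw IOException.
    // With the rule above loaded, this would return false for every entry.
    static boolean tryAppend(TxnLogWriter writer, String entry) {
        try {
            writer.append(entry);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        TxnLogWriter writer = new TxnLogWriter();
        System.out.println("append ok: " + tryAppend(writer, "create /node"));
    }
}
```

Run without the agent, the demo takes the normal path; the point of the rule is that the failure path in `tryAppend` can be exercised without touching the source.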

chaosblade:

ChaosBlade is an injection tool that follows the principles of chaos engineering and chaos experimental models. It helps enterprises improve the fault tolerance of distributed systems and ensure business continuity while they move to the cloud or to cloud native systems. The scenarios include:

  • Basic resources: CPU, memory, network, disk, process and other experimental scenarios
  • Java applications: databases, caches, messages, the JVM itself, microservices, etc. You can also specify any class method to inject various complex experimental scenarios

Stress for Linux:

A tool which imposes a configurable amount of load on the system where ZooKeeper runs.

namazu-swarm:

Namazu Swarm is a part of Namazu, a programmable fuzzy scheduler for testing distributed systems. AFAIK, this project has not been maintained for a long time and it's not easy to install it.

Other resources from other projects:

[etcd]:

https://coreos.com/blog/new-functional-testing-in-etcd.html
https://coreos.com/blog/testing-distributed-systems-in-go.html

[Kafka]:

https://cwiki.apache.org/confluence/display/KAFKA/Fault+Injection

[Apache Ozone]:

https://blog.cloudera.com/apache-ozone-fault-injection-framework/

@maoling
Member Author

maoling commented May 27, 2021

I will merge this PR on 05-29 if there are no other concerns this week. Cc @anmolnar

@asfgit asfgit closed this in 5c10229 May 29, 2021
@maoling maoling deleted the ZOOKEEPER-3601 branch May 29, 2021 12:53