Improve graceful shutdown of RegionServers #508
Comments
Can we just use readiness probes to take the pod out of service?
There need to be at least two shutdown modes:
Findings (in progress):
During testing it was discovered that region servers already transfer regions when shutting down. This behavior is implemented in the 2.4 and 2.6 versions. To clarify:
Another idea: since this is the default behavior anyway, maybe in cases like rolling cluster restarts, the user would benefit more from disabling the region mover altogether during that period.
This will be discussed next week.
I believe this is not making the 24.11 release anymore. We should then remove it from https://github.com/orgs/stackabletech/projects/42. If it does end up going in last minute, the following will need doing again:
Relevant docs: https://hbase.apache.org/book.html#decommission
Relevant script: graceful_stop.sh
Relevant class: org.apache.hadoop.hbase.util.RegionMover, with relevant function
In #400 we implemented a graceful shutdown for all HBase components which is similar to ./bin/hbase-daemon.sh stop <service>. While this works in general, it has downsides, such as regions being offline for some time, resulting in (short) outages.
Instead we should try to call or mimic graceful_stop.sh. The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify the region deployed in the new location before it moves the next region, and so on until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will at this point notice the RegionServer is gone, but all regions will have already been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.
Acceptance criteria
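For illustration only (this is not one of the acceptance criteria): the drain order described above (move one region, verify it, repeat until the server is empty, then stop) can be sketched as a small in-memory simulation. This is a rough sketch, not how graceful_stop.sh or RegionMover are implemented. The `Server` class, the `drain_and_stop` function, and the "least-loaded destination" rule are all hypothetical names and assumptions made up for this example, and nothing here talks to a real HBase cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for a RegionServer; not an HBase type."""
    name: str
    regions: set = field(default_factory=set)
    running: bool = True


def drain_and_stop(target, others):
    """Move regions off `target` one at a time, verify each move, then stop it."""
    if not others:
        raise ValueError("need at least one other server to receive regions")
    moved = []
    # sorted() copies the set, so mutating target.regions in the loop is safe.
    for region in sorted(target.regions):
        # Assumption: pick the least-loaded server; HBase may choose differently.
        dest = min(others, key=lambda s: len(s.regions))
        dest.regions.add(region)
        target.regions.discard(region)
        # Verify the region is deployed in its new location before the next move.
        if region not in dest.regions:
            raise RuntimeError(f"region {region} not confirmed on {dest.name}")
        moved.append((region, dest.name))
    # Only stop the server once it carries zero regions.
    if target.regions:
        raise RuntimeError(f"{target.name} still carries regions")
    target.running = False
    return moved
```

Calling `drain_and_stop(rs1, [rs2, rs3])` returns the list of `(region, destination)` moves in the order they happened and leaves `rs1` stopped with no regions. The point of the one-at-a-time loop is that at most one region is in transit at any moment, which is what keeps region churn low.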