Improve graceful shutdown of RegionSevers #508

sbernauer opened this issue Jun 11, 2024 · 5 comments

@sbernauer

sbernauer commented Jun 11, 2024

Relevant docs: https://hbase.apache.org/book.html#decommission
Relevant script: graceful_stop.sh
Relevant class: org.apache.hadoop.hbase.util.RegionMover, with relevant function

In #400 we implemented a graceful shutdown for all HBase components, similar to ./bin/hbase-daemon.sh stop <service>. While this works in general, it has downsides, such as regions being offline for some time, resulting in (short) outages.

Instead we should try to call or mimic graceful_stop.sh. The graceful_stop.sh script moves the regions off the decommissioned RegionServer one at a time to minimize region churn. It verifies that each region is deployed in its new location before it moves the next one, and so on until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will then notice that the RegionServer is gone, but all regions will already have been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.
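To make the sequence above concrete, here is a minimal sketch of the drain loop against a toy in-memory model. It is not the HBase API and not what graceful_stop.sh actually executes: `Cluster`, `move`, `is_deployed`, and `drain` are hypothetical names, and the least-loaded target selection is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy in-memory model (hypothetical): RegionServer name -> set of regions."""
    servers: dict[str, set[str]] = field(default_factory=dict)

    def move(self, region: str, source: str, target: str) -> None:
        # Stand-in for a real region move; here it is a plain set update.
        self.servers[source].remove(region)
        self.servers[target].add(region)

    def is_deployed(self, region: str, server: str) -> bool:
        return region in self.servers.get(server, set())


def drain(cluster: Cluster, leaving: str) -> list[tuple[str, str]]:
    """Move regions off `leaving` one at a time, verifying each move."""
    targets = sorted(s for s in cluster.servers if s != leaving)
    if not targets:
        raise RuntimeError("no other RegionServer can take the regions")
    moves = []
    for region in sorted(cluster.servers[leaving]):
        # Assumption: pick the least-loaded remaining server as the target.
        target = min(targets, key=lambda s: len(cluster.servers[s]))
        cluster.move(region, leaving, target)
        # Only continue with the next region once this one is confirmed.
        if not cluster.is_deployed(region, target):
            raise RuntimeError(f"{region} not deployed on {target}")
        moves.append((region, target))
    return moves
```

Only once `drain` returns and the server carries zero regions would the stop signal be sent, which is the property that avoids WAL splitting.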

Acceptance criteria

@lfrancke lfrancke changed the title Imporove graceful shutdown of RegionSevers Improve graceful shutdown of RegionSevers Jun 14, 2024
@NickLarsenNZ

NickLarsenNZ commented Sep 10, 2024

Must: The docs say "Disable the Load Balancer before Decommissioning a node". We found a solution to this by either doing so or making sure we (or our customers) are not using LBs

Can we just use readiness probes to take the pod out of service?

@sbernauer sbernauer removed their assignment Sep 23, 2024
@razvan razvan self-assigned this Sep 23, 2024
@razvan razvan moved this from Next to Refinement: In Progress in Stackable Engineering Sep 23, 2024
@razvan

razvan commented Sep 23, 2024

There need to be at least two shutdown modes:

  • a slow one where regions are moved to other servers because the service is decommissioned permanently. This mode possibly generates a lot of traffic inside the cluster.
  • a fast one for servers that are decommissioned temporarily, e.g. for security or version updates. The region balancer should probably be stopped for the entire duration.
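For the fast mode, pausing the balancer for the whole maintenance window could look roughly like the sketch below. `set_balancer` is a hypothetical hook, not a real HBase client call; the only assumption is that it switches the balancer and returns the previous state, so that the earlier setting can be restored instead of being forced back on.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def balancer_paused(set_balancer: Callable[[bool], bool]) -> Iterator[None]:
    """Disable the balancer inside the block, then restore its earlier state.

    `set_balancer(enabled)` is a hypothetical hook that switches the balancer
    and returns the state it had before the call.
    """
    previous = set_balancer(False)
    try:
        yield
    finally:
        # Restore whatever was configured before, even if the block failed.
        set_balancer(previous)
```

Wrapping a rolling restart in `with balancer_paused(...)` keeps the balancer off for the entire time and restores the previous setting even when a restart step raises.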

Findings (in progress):

  • hbase/bin/hbase-daemon.sh can start/stop/restart etc. and already handles termination signals better than our home-grown solution.
    • uses jstack to do a thread dump in case shutdown takes longer than 20 mins. jstack is not in our images.
  • The graceful_stop.sh script requires the hostname or ssh commands, which are currently not available in the HBase images.
    • It always moves regions to a different server.
    • It can turn off region balancing before shutdown and turn it on again once the server is stopped.
    • We can get rid of the ssh requirement by passing localhost as the name of the region server, but the script still needs hostname to find out the actual region server name.
    • Assumes the HBase servers have been started with hbase-daemon.sh which writes PID files for every process.
  • An additional way to decommission a region server is the decommission_regionserver shell command which also can move regions (async) but doesn't actually stop anything. A mechanism to wait for the regions to be moved is needed in this case.
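Since decommission_regionserver moves regions asynchronously, the missing wait mechanism could be a simple poll with a timeout. This is only a sketch: `region_count` is a hypothetical callable that returns how many regions the server still hosts (how to obtain that number from HBase is left open), and the default timeout and poll interval are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_drained(
    region_count: Callable[[], int],
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the server hosts zero regions or the timeout expires.

    Returns True if the server was drained in time, False on timeout.
    `region_count` is a hypothetical hook supplied by the caller.
    """
    deadline = clock() + timeout_s
    while True:
        if region_count() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting. On a `False` result the caller has to decide whether to stop the server anyway or to abort the shutdown.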

@razvan

razvan commented Oct 29, 2024

During testing it was discovered that region servers already transfer regions when shutting down. This behavior is implemented in the 2.4 and 2.6 versions.

To clarify:

  • What is the benefit of invoking the region mover explicitly before shutdown?
  • Are regions in "transition" available for querying?
  • How long can a region move take in the worst case, and how does this impact HBase clients?

Another idea: since this is the default behavior anyway, users might benefit more from disabling the region mover altogether in cases like rolling cluster restarts.

@NickLarsenNZ

This will be discussed next week

@NickLarsenNZ

I believe this is not making the 24.11 release anymore.

We should then remove it from https://github.com/orgs/stackabletech/projects/42.

If it does end up going in last minute, the following will need doing again:
