HDDS-6663. remove container reference from scm if its state is DELETED #3360
JacksonYao287 wants to merge 13 commits into apache:master
Conversation
|
@sodonnel @ChenSammi PTAL, thanks!
|
@JacksonYao287 what is the motivation behind this change? Maintaining a record of all deleted containers in SCM has the following benefits:
I do not see problems with keeping deleted container entries. Even if we had to track 10 billion containers over the system's lifetime (which I suspect might be pushing the limits of what RocksDB can handle), that would correspond to about 5 GB * 10 billion = 50 exabytes of data deleted from the cluster during its lifetime, which is a massive amount of churn.

If the concern is SCM memory usage, since SCM tracks all containers in memory, a better approach might be to track deleted containers in RocksDB only. This is actually a somewhat major change in the way HDDS functions, so if we want to go through with it, I think we would need to answer some more questions:
|
|
Thanks @errose28 for the review!
Can you please explain this in more detail? Do you mean that if we have to track a massive number of containers, we should hit RocksDB directly for every put or get of container state, rather than maintaining a state map in RAM?
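To make the "RocksDB only" idea concrete, here is a minimal sketch of one possible shape: when SCM rebuilds its in-memory container map at startup, records in the DELETED state are skipped but left in the table. All class and method names below are hypothetical stand-ins, not SCM's actual container manager API.

```java
// Hypothetical sketch: skip DELETED containers when building the in-memory map,
// but keep their RocksDB records for auditing. Names below are illustrative only.
import java.util.HashMap;
import java.util.Map;

enum ContainerState { OPEN, CLOSING, CLOSED, DELETING, DELETED }

class ContainerRecord {
  final long containerId;
  final ContainerState state;

  ContainerRecord(long containerId, ContainerState state) {
    this.containerId = containerId;
    this.state = state;
  }
}

class InMemoryContainerMapLoader {
  /** Build the in-memory map from persisted records, leaving DELETED entries on disk only. */
  static Map<Long, ContainerRecord> load(Iterable<ContainerRecord> persistedRecords) {
    Map<Long, ContainerRecord> inMemory = new HashMap<>();
    for (ContainerRecord record : persistedRecords) {
      if (record.state == ContainerState.DELETED) {
        continue; // stays in RocksDB for debugging, never occupies SCM heap
      }
      inMemory.put(record.containerId, record);
    }
    return inMemory;
  }
}
```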
|
I think it is reasonable to remove containers that are empty. There isn't a good reason to keep them beyond debugging, and if a DN comes back up with such a container on it, we should probably handle it as an unknown container and remove it. I know the default Ozone behaviour is to ignore unknown containers and not remove them. I understand why (we don't want to accidentally delete data), but I feel that over time the number of empty containers in a cluster will grow. Also, if the cluster is out of safemode, it seems overly pessimistic to keep unknown containers.

It may seem unrealistic that we will end up with a lot of empty containers, but consider Hive workloads, for example. They can have many phases and stage the temp data between jobs on the cluster. This will fill containers, and then they will be purged. A job could potentially stage hundreds of GB in a large multi-table join.

Related to this, and a problem I foresee, is that a cluster running these Hive-type workloads may end up filling containers partly with temp data and partly with real data. The containers get closed, the temp data gets removed, and we end up with lots of small containers that are never totally empty but may have much less than 5 GB of data in them. Then SCM memory will get used up not just by empty container references, but by far more containers than we expected (the small container problem). We ultimately need to solve that small container problem by merging containers, but there is at least one major obstacle: we cannot easily update OM with the new containerID. One suggestion was to create a mapping in SCM. Another idea may be to find the keys associated with a container in Recon (a reverse index of containerID -> key) and then update OM.

For this PR, perhaps removing the deleted container reference in SCM could be tied to the same config that lets us delete or not delete unknown containers, so it either logs a warning or deletes the SCM reference depending on the value?
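A rough sketch of what tying the cleanup to such a config could look like; the config key, enum, and class below are illustrative placeholders rather than Ozone's real configuration API.

```java
// Illustrative only: gate the cleanup behind a config switch, in the spirit of
// the existing unknown-container handling. The key name and enum are placeholders.
import java.util.Properties;
import java.util.function.LongConsumer;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DeletedContainerCleanup {
  private static final Logger LOG =
      LoggerFactory.getLogger(DeletedContainerCleanup.class);

  enum Action { WARN, DELETE }

  private final Action action;

  DeletedContainerCleanup(Properties conf) {
    // Placeholder key; a real deployment would read this from ozone-site.xml.
    this.action = Action.valueOf(
        conf.getProperty("scm.deleted.container.action", "WARN"));
  }

  /** Called once a container reaches the DELETED state with no replicas left. */
  void onContainerDeleted(long containerId, LongConsumer removeFromScm) {
    if (action == Action.DELETE) {
      LOG.info("Removing reference for empty, deleted container {} from SCM", containerId);
      removeFromScm.accept(containerId);
    } else {
      LOG.warn("Container {} is deleted and empty; keeping its SCM reference", containerId);
    }
  }
}
```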
|
Thanks @sodonnel for the comments, I agree with your opinion.
This is a very good point.
Since Recon may hold stale information, creating a mapping in SCM may be better.
Yeah, agreed.
|
Another question comes to my mind. Suppose we have container c1 with three replicas r1, r2, r3 on datanodes d1, d2, d3, respectively. If d3 dies, r3 is lost, so ReplicationManager will notice this and copy r2 to datanode d4 as r4. After a while, some block deletion commands go to d1, d2 and d4 to delete some blocks in r1, r2 and r4, and the delete transactions are removed from the deleteBlockLog. If d3 resurrects, r3 comes back with more blocks than r1, r2 and r4, since d3 never received those deletion commands. When d3 sends a container report to SCM, the stats-update logic in the container report handler will set the container stats to match r3, since r3 has a bigger keyCount and usedBytes. Since those zombie blocks will never be deleted from r3 (no deletion command for them will ever be received), the keyCount and usedBytes will never reach zero, and thus the container replicas on the datanodes and the reference in SCM will never be removed.
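A simplified sketch of the stats-update behaviour described in this scenario, as an illustration rather than the actual ContainerReportHandler code: if replica reports are folded in by taking the maximum, a resurrected r3 pushes the container's keyCount and usedBytes back up.

```java
// Simplified illustration of the behaviour described above; not Ozone code.
class ContainerStats {
  long keyCount;
  long usedBytes;

  /** Fold one replica report into the container's stats by taking the maximum. */
  void updateFromReplicaReport(long replicaKeyCount, long replicaUsedBytes) {
    keyCount = Math.max(keyCount, replicaKeyCount);
    usedBytes = Math.max(usedBytes, replicaUsedBytes);
  }

  boolean isEmpty() {
    return keyCount == 0;
  }
}
```

With r1, r2 and r4 reporting zero blocks but r3 reporting its stale, undeleted blocks, isEmpty() never returns true, so the container is never cleaned up.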
|
@sodonnel I have updated this patch, please take a look!
@errose28 |
|
IMO this change looks good. Perhaps we should add one more test that confirms the reference is not deleted if the config flag is false, just to catch any future changes which may break this. I also believe the best thing is to have this "delete empty container references" behaviour on by default, as in the normal case there is no real reason to keep an empty reference around. We should probably log that it is being removed, however, so there is some trace or indication of why it disappeared. @errose28 what do you think? You had some differing opinions on this before?
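A self-contained sketch of the extra test suggested here, written against a tiny stub rather than the real SCM test harness; the stub class and flag are stand-ins, not actual Ozone test APIs.

```java
// Sketch of the suggested test: with the cleanup flag disabled, a container
// that reaches DELETED keeps its SCM reference. The stub below is not SCM.
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.HashSet;
import java.util.Set;
import org.junit.jupiter.api.Test;

class TestDeletedContainerCleanupFlag {

  /** Minimal stand-in for the part of SCM that owns container references. */
  static class StubContainerManager {
    private final boolean cleanupEnabled;
    private final Set<Long> containers = new HashSet<>();

    StubContainerManager(boolean cleanupEnabled) {
      this.cleanupEnabled = cleanupEnabled;
    }

    void addContainer(long id) {
      containers.add(id);
    }

    /** Invoked when the container transitions to DELETED. */
    void onContainerDeleted(long id) {
      if (cleanupEnabled) {
        containers.remove(id);
      }
    }

    boolean containerExists(long id) {
      return containers.contains(id);
    }
  }

  @Test
  void referenceKeptWhenCleanupDisabled() {
    StubContainerManager scm = new StubContainerManager(false);
    scm.addContainer(1L);
    scm.onContainerDeleted(1L);
    assertTrue(scm.containerExists(1L), "reference must survive when the flag is off");
  }

  @Test
  void referenceRemovedWhenCleanupEnabled() {
    StubContainerManager scm = new StubContainerManager(true);
    scm.addContainer(1L);
    scm.onContainerDeleted(1L);
    assertFalse(scm.containerExists(1L), "reference is cleaned up when the flag is on");
  }
}
```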
```java
// of the container from scm. this happens immediately, thus DELETED
// is a transient state and hardly be captured.
actions.put(CLEANUP, info ->
    removeContainer(info.containerID().getProtobuf()));
```
I am not sure if this is logged anywhere else, but could we ensure a log message is written at INFO level saying container xxx has been deleted as it is empty?
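One possible shape of that INFO-level message, sketched with SLF4J (which Ozone uses for logging); the class and method here are illustrative, not the code in this patch.

```java
// Minimal sketch of the requested INFO log before the reference is dropped.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DeletedContainerLogger {
  private static final Logger LOG =
      LoggerFactory.getLogger(DeletedContainerLogger.class);

  /** Log before removing the reference so there is a trace of why it disappeared. */
  void logRemoval(long containerId) {
    LOG.info("Container {} is empty and in DELETED state; removing its reference from SCM",
        containerId);
  }
}
```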
|
Hi @JacksonYao287 @sodonnel I still have some concerns about the way this change is implemented.

No clear problem statement
Is the motive to reduce memory usage or RocksDB usage in SCM? If it is memory usage, then we can leave the deleted containers in RocksDB only and not load them into memory.

Changing the config key fragments the container delete path
Containers deleted before this change is applied remain in SCM forever. Similarly, if the config is turned off and later turned back on, containers deleted in the meantime will not be cleaned up when the config is restored. I guess this is more of a debugging inconvenience than an error, but it will make examining existing clusters confusing nonetheless.

No integration with unknown container handling
Confusion about container delete vs. block delete
This is not really a comment on the code but a response to some of the discussion on this PR. The second half of this comment from Stephen and this comment from Jackson are definitely valid points, but they concern block deletion, not container deletion. The changes proposed here will not solve those problems. A change in this area will only affect SCM memory usage or RocksDB usage, with potential debugging tradeoffs. IMO the best thing to do is:
|
|
1. No clear problem statement
2. Changing the config key fragments the container delete path
3. No integration with unknown container handling

Thanks @errose28 for the concerns. Generally, I agree with Stephen. Since this patch is not urgent, we can keep discussing it until we reach a consensus.
|
I can kind of see this both ways. If we take it to one extreme: if we want to ensure stuff is kept around for debugging, why are we letting the empty replicas get deleted? On a real cluster blocks will get deleted all the time, and we will end up with empty containers. Some bug (which I think we already had) could erroneously delete replicas it should not, but it is also not scalable to keep all empty replicas forever. They will grow slowly but surely to a large number over months and years and cause problems in reporting and report processing.

So if we have deleted the replicas, what use is a ContainerInfo reference in SCM? Even if we have it, we cannot do anything with it; there is no replica for it. All it tells us is that a container once existed with that ID and is empty. It is bad that it will use up memory slowly over time, and Replication Manager will also need to check it on each pass. We can keep it in the RocksDB table, unloaded on SCM startup, but (I think) we will still need to read out the record and decide whether or not to load it into memory. Even with millions of such references the overhead should be small; it will only slow down SCM startup by a tiny amount, but it is still a small overhead.

I'm in favor of just removing them, as I see little value in keeping them. We should certainly log that they have been removed. Of course the logs will roll off, but at least that is some evidence they were removed for a reason.
|
@errose28 does this make sense to you, or do you have any other points?
|
OK, we can continue with the proposal to remove the deleted containers from RocksDB. Does this description line up with what is being proposed?
The catch here is that a container will get removed from SCM if all replicas in the current replica set are empty, but SCM cannot know about all replicas that may have a copy of the container. As @JacksonYao287 pointed out in this comment, an old replica whose container has been deleted could resurface with blocks that were already deleted and cleared elsewhere. If we are only deleting unknown empty containers, then this container would not get deleted. However, deleting unknown non-empty containers seems dangerous. Removal of these containers might have to wait until we have orphan block cleanup on Recon.
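A sketch of the "delete unknown replicas only when empty" policy being weighed here, written as a stand-alone decision function rather than SCM's actual report-handling code; class and method names are hypothetical.

```java
// Illustrative decision function for unknown-container replicas; not SCM code.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class UnknownReplicaPolicy {
  private static final Logger LOG =
      LoggerFactory.getLogger(UnknownReplicaPolicy.class);

  enum Decision { DELETE_REPLICA, KEEP_AND_WARN }

  /**
   * A datanode reports a replica for a container SCM no longer tracks.
   * Empty replicas are safe to remove; non-empty ones may hold resurrected
   * blocks, so they are kept and flagged for investigation.
   */
  Decision handleUnknownReplica(long containerId, long replicaKeyCount) {
    if (replicaKeyCount == 0) {
      LOG.info("Unknown container {} reported with no blocks; deleting the replica",
          containerId);
      return Decision.DELETE_REPLICA;
    }
    LOG.warn("Unknown container {} reported with {} blocks; keeping it for manual inspection",
        containerId, replicaKeyCount);
    return Decision.KEEP_AND_WARN;
  }
}
```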
|
My plan is to complete this work in two steps. Second, I will create a new patch to solve this problem. My idea here is that after a container is closed, only the deleteBlockLog can shrink the container size, not
|
Hi @JacksonYao287 thanks for continuing the work on this. I do still have some concerns with this proposal though.
This doesn't address my original point that these two configs conflict for some values. I also do not think that config keys should be used to fragment core internal operations like deletion. Ozone only needs one delete path and we should be confident that the one we choose works. If we are not confident in this delete container proposal and are trying to hide it behind a config then that is a sign we should not go forward with the change.
This assumes the issue shows up in dev first, which may not be the case. There are so many variables in an Ozone deployment that it is possible the dev environments work fine, but the perfect storm of issues does not happen until some production deployment. Having debugged multiple dev and prod Ozone deployments, I can confirm this does happen for corner-case bugs. This is why I have been advocating for a more cautious approach.
We should use block count to determine if the container is empty, as used bytes is not exact like you mentioned. Block count is currently used by SCM to determine if a container is empty. There are some invariants that I believe make it impossible to solve this problem without using Recon:
Given these properties, I believe we cannot know the definite size of the container once blocks start getting deleted. We can only know the current size of each replica.

For example, say a block is deleted from all known container replicas and SCM's "definite" block count decreases by 1. SCM removes the entry from the block log. Now maybe OM missed the ack for that delete, so it resends it. SCM would repeat the same steps, incorrectly decrementing the "definite" block count again. It has no permanent record of which blocks were already deleted. Since old datanodes can be resurrected, they are also not a source of truth for which block deletions have already been issued by SCM.

In summary, I feel that my proposal has the following advantages:
The disadvantage of my proposal is that, for the block deletion corner case discussed earlier, containers with those blocks will remain in the system. However, this is true even today with the current container delete flow, since all replicas must have a zero block count for deletion to happen. The orphan block problem is a block problem, not a container problem. IMO this means it should not be solved as an extension of container deletion by deleting non-empty containers. I think orphan block cleanup with Recon is the correct answer to this, even if it may not be implemented in the near future.
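To make the double-decrement scenario described above concrete, here is a toy illustration (not Ozone code, and not a proposal): a plain counter is corrupted when the same delete ack is replayed, whereas an idempotent record of deleted block IDs is not; that per-block record is exactly the permanent state SCM does not keep.

```java
// Toy illustration of the bookkeeping problem described above; not Ozone code.
import java.util.HashSet;
import java.util.Set;

class BlockCountBookkeeping {
  public static void main(String[] args) {
    // Counter-based tracking: replaying the same delete ack corrupts the count.
    long definiteBlockCount = 10;
    definiteBlockCount--;           // delete of block 42 acknowledged
    definiteBlockCount--;           // the same ack replayed after an OM retry
    System.out.println("counter says " + definiteBlockCount + " blocks (wrong, should be 9)");

    // Idempotent tracking by block ID: the replay has no effect.
    Set<Long> deletedBlocks = new HashSet<>();
    deletedBlocks.add(42L);         // first ack
    deletedBlocks.add(42L);         // replayed ack is a no-op
    System.out.println("set says " + (10 - deletedBlocks.size()) + " blocks (correct)");
  }
}
```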
|
/pending |
|
@JacksonYao287 Are you working on this PR? |
@swamirishi, sorry, I am now busy with some internal stuff and do not have enough time to continue this work. I think it is better to close this patch for now and reopen it later. Also, anyone interested in this issue can take over the work after it is closed.
What changes were proposed in this pull request?
When the state of a container in SCM is DELETING and all the replicas of this container have been deleted from the corresponding datanodes, a CLEANUP event is fired in SCM and the state of the container is changed to DELETED. When the state of a container is DELETED, its reference should be removed from SCM.
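A compressed sketch of the flow described above, with simplified stand-ins for the real state manager and container manager classes (the enum and class names are illustrative only).

```java
// Sketch of the described flow: the CLEANUP transition also drops the container
// reference from SCM, so DELETED is only a transient state. Not the real classes.
import java.util.EnumMap;
import java.util.Map;
import java.util.function.LongConsumer;

class ContainerCleanupFlow {
  enum Event { CLEANUP }

  private final Map<Event, LongConsumer> actions = new EnumMap<>(Event.class);

  ContainerCleanupFlow(LongConsumer removeContainerFromScm) {
    // Mirrors the pattern in the quoted diff: firing CLEANUP removes the reference.
    actions.put(Event.CLEANUP, removeContainerFromScm);
  }

  void fire(Event event, long containerId) {
    actions.get(event).accept(containerId);
  }
}
```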
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6663
How was this patch tested?
UT