-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race #1820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Change-Id: Ieca93325d5c001c033e1d2d18b273f2c694da037
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jojochuang Synchronizing unRegister may fix the race condition, but I'm not sure using a single shared instance is OK in the first place. Imagine the following steps:
RDBStoreis created with newRDBMetricsinstance- Another
RDBStoreis created, reusesRDBMetricsinstance - One of the
RDBStoresis closed, sharedRDBMetricsis unregistered - Other
RDBStoreis still used, but can it report events to the metrics system?
Shouldn't each RDBStore register its RDBMetrics instance with a unique name?
|
Great question.... IMO we want a single RDBMetric instance. With the current architecture we will see thousands of containers per DataNode. Having metrics for each and every container... that seems too excessive to me. Plus, the metrics are kept only during the lifetime of a container. I feel like we want a singleton RDBMetric for the duration of the process. @vivekratnavel thoughts? you wrote the original code. |
I also think this was the intention of introducing the shared instance. In this case I think it should be reference counted, and the last one should clean up. |
|
Thanks @jojochuang and @adoroszlai for the patch and review! @avijayanhwx will also have a good context on this issue. If I remember correctly, we discussed some of the potential issues while making the RDBMetrics a shared instance, but I think we overlooked the example scenario that @adoroszlai has stated in his comment above. Please correct me if I am wrong. @adoroszlai You are right about our intention for the shared Metrics instance. Before sharing the instance, we were getting the wrong metrics since each store was keeping its own count. I think it's time to think about all the scenarios thoroughly and redesign RDBMetrics if needed. |
|
Let's merge this to make synchronized access consistent. Then we can improve the shared instance with reference counting. |
…ing-upgrade * upstream/master: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ... Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/client/ScmClient.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto hadoop-hdds/interface-client/src/main/proto/hdds.proto hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/ContainerOperationClient.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
…ing-upgrade-merge-candidate * upstream/master: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ... Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/client/ScmClient.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto hadoop-hdds/interface-client/src/main/proto/hdds.proto hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/ContainerOperationClient.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconNodeManager.java
* HDDS-3698-nonrolling-upgrade: (29 commits) HDDS-4741. Modularize upgrade test (apache#1928) HDDS-4864. Add acceptance tests to certify Ozone with boto3 python client. (apache#1976) HDDS-4791. StateContext.getReports may return list with size larger t… (apache#1892) HDDS-4867. Ozone admin datanode list should report dead and stale nodes (apache#1966) HDDS-4858. Useless Maven cache cleanup (apache#1956) HDDS-4769. Simplify insert operation of ContainerAttribute (apache#1865) HDDS-4847. Fix typo in name of IdentityService (apache#1941) HDDS-4869. Bump jackson version number (apache#1963) HDDS-4871. Fix intellij runConfigurations for datanode (apache#1968) HDDS-4870. Bump jetty version (apache#1964) HDDS-4722. Creating RDBStore fails due to RDBMetrics instance race (apache#1820) HDDS-4138. Improve crc efficiency by using Java.util.zip.CRC when available (apache#1950) HDDS-4816. Add UsageInfoSubcommand to get Datanode usage information. (apache#1919) HDDS-4754. Make scm heartbeat rpc retry interval configurable (apache#1942) HDDS-4832. Show Datanode OperationalState in Recon (apache#1937) HDDS-4653. Support TDE for MPU Keys on Encrypted Buckets (apache#1766) HDDS-4853. libexec/entrypoint.sh might copy from wrong path (apache#1951) HDDS-4857. Format ReplicationType.java which indentation are confusion (apache#1952) HDDS-4850. Intermittent failure in ozonesecure due to unable to allocate block (apache#1948) HDDS-4808. Add Genesis benchmark for various CRC implementations (apache#1910) ...
…pache#1820) (cherry picked from commit a795ab0)
What changes were proposed in this pull request?
Make RDBMetrics#unRegister() synchronized. Accessing the instance object inside RDMBetrics#create() is synchronized, and therefore RDBMetrics#unRegister() should be made synchronized to protect the instance object.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4722
How was this patch tested?
Manually tested.