HDDS-3878. Make OMHA serviceID optional if one (but only one) is defined in the config #1149
Codecov Report
Coverage diff (master vs #1149):

| | master | #1149 | +/- |
|------------|--------|--------|--------|
| Coverage   | 70.56% | 73.30% | +2.74% |
| Complexity | 9427   | 9988   | +561   |
| Files      | 965    | 972    | +7     |
| Lines      | 49063  | 49589  | +526   |
| Branches   | 4803   | 4872   | +69    |
| Hits       | 34620  | 36351  | +1731  |
| Misses     | 12137  | 10915  | -1222  |
| Partials   | 2306   | 2323   | +17    |
Continue to review full report at Codecov.
adoroszlai
left a comment
Thanks @elek for working on this. The usability improvement is nice.
I think we need an additional change in BaseFreonGenerator#createOmClient to make ombg and omkg work with the same setup (a single service ID defined in config). Currently these fail with Retrying connect to server: 0.0.0.0/0.0.0.0:9862.
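(Not part of the patch, but as a rough illustration of the fallback being asked for here: before creating the OM client, the tool could resolve the single configured service id from the client configuration. A minimal sketch, assuming the standard `ozone.om.service.ids` key and plain Hadoop `Configuration` access; the helper class and method names are made up for illustration.)

```java
import java.util.Collection;
import java.util.Optional;

import org.apache.hadoop.conf.Configuration;

/**
 * Hypothetical helper: if exactly one OM service id is configured,
 * use it as the default; otherwise the caller must be explicit.
 */
public final class SingleServiceIdResolver {

  // The usual OM HA key listing configured service ids; key name
  // assumed here for illustration.
  private static final String OM_SERVICE_IDS_KEY = "ozone.om.service.ids";

  private SingleServiceIdResolver() {
  }

  public static Optional<String> resolve(Configuration conf) {
    Collection<String> serviceIds =
        conf.getTrimmedStringCollection(OM_SERVICE_IDS_KEY);
    if (serviceIds.size() == 1) {
      // One and only one service id: safe to pick it as the default.
      return Optional.of(serviceIds.iterator().next());
    }
    // Zero or multiple service ids: require an explicit choice.
    return Optional.empty();
  }
}
```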
I think this is a deliberate choice not to pick from config, even if it is there. Hive, for example, stores the table path in its metastore, so having a complete path is better. Just an example scenario.
Yeah this was a conscious choice. I feel usage of unqualified paths in HDFS can be ambiguous and error-prone when multiple clusters/federation are involved.
...
If the user has both remote and local clusters (or multiple federated clusters) it means that we have two service ids. This shortcut works only if you have one and only one cluster. I think it's very important to provide better usability when we have one and only one cluster. I agree that if we have two clusters, it should be required to choose.
We want to support client-less config of HA eventually (using DNS for example). In that case we cannot look at the client side configs to know how many clusters/services there are. This should be an inconvenience only for manually typed o3fs paths. For programmatic usage it shouldn't be such a big deal, right? I feel it was a mistake in HDFS to not require fully qualified paths for HA. It made the adoption of viewfs/federation more difficult.
Can you please share more information about this plan? Possible implementations? I need more information to understand the possible problem, as with the current behavior it seems to be possible to make the configuration optional.
Yes, one risk is when people code up their apps or scripts to use implicit paths. Then we add a second cluster and now their paths that used to work previously don't work anymore. Or suppose we want to change the defaultFS from o3fs to viewfs. Falling back to defaultFS is an anti-pattern IMO.
As I understood, with defaultFs it's always possible to use implicit defaults. Do you see any risk in using the default cluster if only one is defined? Are you referring here to defaultFS set to o3fs://bucket.vol.serviceid1?
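(For readers following along, this is the implicit-default behavior being debated: an unqualified path is resolved against fs.defaultFS. A rough sketch; the authority value is just an example, and actually running it needs the o3fs filesystem implementation on the classpath and a reachable cluster.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Example value only: a fully qualified o3fs authority that
    // includes the bucket, the volume and the OM serviceId.
    conf.set("fs.defaultFS", "o3fs://bucket.vol.serviceid1");

    FileSystem fs = FileSystem.get(conf);
    // An unqualified path is resolved against fs.defaultFS, so
    // "/warehouse/table1" really means
    // "o3fs://bucket.vol.serviceid1/warehouse/table1".
    Path implicit = new Path("/warehouse/table1");
    System.out.println(fs.makeQualified(implicit));
  }
}
```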
Agree on this point: this will break all existing tables which were created earlier assuming the default serviceID (when there was only one value). Those tables cannot be accessed anymore, as the hive metastore has the path without the serviceID.
So iiuc the main convenience argument is that we can type shorter paths without the serviceId. With ofs, buckets and volumes now become first class directories.
Thanks for explaining it. I understand why you think implicit defaultFs can be dangerous. On the other hand (like many other dangerous features) it provides additional flexibility, as you can migrate to a new defaultFs without changing all your apps. It can be a helpful feature even if the same feature introduces some additional risks in some use cases. But we couldn't change this (Hadoop-level) behavior anyway. You mentioned two concerns:
Fortunately this patch is independent of these problems. This patch modifies the behavior only if one (and only one) serviceId is configured. Can you please explain where you see any risk w.r.t. this patch?
I agree with @elek that the scope of this change is very limited to the case where the client has only one om service id configured. So accidentally accessing the wrong OM cluster when multiple om service ids are defined is not a major issue here. There are a few cases where specifying the om service id explicitly is helpful even when only one om service id is configured. For example, the client uses different configurations to access two clusters (each with only one service id), e.g. DR (distcp). If we don't require an explicit om service id in the uri, the only way to differentiate them is the configuration file locations.
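(To make the DR/distcp point concrete: with the implicit fallback, which cluster "the" service id resolves to is decided purely by which configuration file the client happens to load. A small sketch under the same assumptions as above; the key name and the service id values are illustrative only.)

```java
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;

public class DrConfigExample {
  public static void main(String[] args) {
    // Two clusters, each with a single (but different) service id.
    // In practice these would come from two different ozone-site.xml
    // files loaded by the client.
    Configuration primary = new Configuration(false);
    primary.set("ozone.om.service.ids", "om-primary");

    Configuration dr = new Configuration(false);
    dr.set("ozone.om.service.ids", "om-dr");

    // With the implicit fallback, the "default" cluster is decided
    // only by which configuration is on the classpath:
    System.out.println(pickSingleServiceId(primary)); // om-primary
    System.out.println(pickSingleServiceId(dr));      // om-dr
  }

  private static String pickSingleServiceId(Configuration conf) {
    Collection<String> ids =
        conf.getTrimmedStringCollection("ozone.om.service.ids");
    return ids.size() == 1 ? ids.iterator().next() : null;
  }
}
```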
I agree with @arp7 that the URI stored in applications like Hive should be the fully qualified URI, else the applications assume that the rest of the information is provided by core-site.xml etc. This might lead to issues when we switch from o3fs to ofs, or to viewfs or something similar.
In case of distcp we have two different clusters. This patch makes the serviceId optional only if you have only one configured. serviceId is required if it's not clear which one is used (e.g. you have two clusters configured).
Thanks for the answer @mukul1987. I think this question is more related to the Hadoop Compatible File System. As we can use unqualified paths (a Hadoop feature) I couldn't see any way to disable this. This patch modifies the behavior when somebody uses an unqualified path. If you can convince users to use qualified paths, it won't change anything for them.
No, it doesn't. It's exactly the same as before, but it doesn't apply to ofs/o3fs, only to the Java client / CLI.
Exactly. But only if one (and only one) serviceId is defined. In case of having multiple serviceIds, you should always be explicit (unless a default/internal serviceId is configured).
Sure. We should document all the cases (including this special case). I have a big patch where I created documentation for OM HA: #1269. I will improve that page, but only once we have an agreement here and it's merged.
Are you OK with this approach?
These files are removed. I think this docker-compose is for testing the ratis log purge use case.
Yes, and it's named test_disabled.sh because it didn't work. As I wrote in the description, I suggest creating a simplified (but working) test and fixing the purge use case and the restart use case in a separate jira, following the pattern introduced by Attila for the upgrade test.
The ozone-ha acceptance tests were turned off a long time ago (they included some life-cycle tests to start/stop services). In this patch I simplified the HA cluster and added a simple smoketest.
Later we can restore the lifecycle tests (start/stop), but I would prefer to use a generic approach for all the clusters.
But I am open to doing anything with the tests (I can revert the changes, but in that case the change won't be covered), just to get this patch committed after 37 days and unblock the release.
Just let me know what you suggest.
I am okay with it, if there is a cleaner approach and it can be added in the new Jira.
cc @hanishakoneru, any comments on removing this?
I think the tests which include restarts / failures can be separated and handled similarly to the upgrade test. We can keep a simple om-ha (or ozone-ha) test which tests the functionality and collect all the lifecycle related tests (restart, kill dn/om services) in a separate place.
We need more tests there, not only for HA but for normal servers, IMHO.
Using the internal service id when multiple service ids are defined is not handled in this? Other than a few minor comments, overall LGTM.
Thanks for the review @bharatviswa504, I updated the patch. See my question about the tests, I am open to doing anything, just let me know what you suggest. @arp7 Do you have any concept-level concerns?
@elek I don't see any commits, but the comments are marked as resolved.
bq. Using internal service id when multiple is not handled in this?
My understanding is that the internal service id is only used internally by OM and Recon.
And there is also a Jira out for review to use it by S3G as well.
I think we reached consensus that "convenience" should not be a priority when multiple service ids are added, even though there may have been only one before. The user must specify the service id explicitly in that case.
Based on my understanding we agreed to require an explicit serviceId in ofs/o3fs paths, while API calls are used from other servers and CLI tools (both are related to admin use-cases and not related to user activities). The original argument against making it optional: if it's optional in the ofs/o3fs url, the url might need modification (made explicit from implicit) when additional HA services are added, which makes it harder for Hive users to add a secondary Ozone HA cluster. It seemed to be a particular use case, but I accepted this argument. Overall it's a decision between two forces:

1. keep ofs/o3fs paths always fully qualified (explicit serviceId), which is safer when more clusters are added later;
2. make the paths more convenient by allowing an implicit default.

@arp7 and @bharatviswa504 argued that 1 is more important, and I accepted it. But as far as I understood we agreed that we make this improvement on the API level (Java client / CLI).
bharatviswa504
left a comment
+1 LGTM.
Usage of internal.service.id will be fixed as part of HDDS-4096
hanishakoneru
left a comment
@elek sorry for jumping into the discussion late. I am working on fixing the robot tests for HA functionality (the old tests covering the restarts and failovers).
Instead of removing the old ozone-ha files altogether, can you create another dir for the new docker-compose - ozone-ha-basic or something? Or rename the old compose dir to something else.
As for changing the life-cycle tests to start/stop services, I am open to discussing a better approach. But I guess we can keep that for a different thread.
Thanks @bharatviswa504 for bringing it to my notice.
P.S.: Adding the comment here so that it's not missed.
@hanishakoneru Sure, I also thought about separating the test changes and the serviceId changes (I created a separate jira for the test changes and considered posting them there). But this change can be tested only by fixing the tests. Keeping the old directories and starting a new one -- as you suggested -- seems to be a good approach. I restored the existing om-ha tests, and the new ones are added as well.
Summary: got +1 from @bharatviswa504, and @arp7 also confirmed offline that he is fine with it (if o3fs/ofs are not changed). Hanisha's suggestion is also applied (and I pinged her offline). I will commit this soon. Thanks for the (very) long conversation and the patience, everybody (@xiaoyuyao, @adoroszlai ...). It seemed to be a long and hard issue, but I am happy with the result. As remote work becomes the norm, I think we should move more and more formal and informal conversations to the pull request threads (or mailing list threads). In this case it took time, but I am glad that we found a consensus.
Proposed changes are done, and Hanisha was happy with them (pinged offline).
* master: (28 commits)
  * HDDS-4037. Incorrect container numberOfKeys and usedBytes in SCM after key deletion (apache#1295)
  * HDDS-3232. Include the byteman scripts in the distribution tar file (apache#1309)
  * HDDS-4095. Byteman script to debug HCFS performance (apache#1311)
  * HDDS-4057. Failed acceptance test missing from bundle (apache#1283)
  * HDDS-4040. [OFS] BasicRootedOzoneFileSystem to support batchDelete (apache#1286)
  * HDDS-4061. Pending delete blocks are not always included in #BLOCKCOUNT metadata (apache#1288)
  * HDDS-4067. Implement toString for OMTransactionInfo (apache#1300)
  * HDDS-3878. Make OMHA serviceID optional if one (but only one) is defined in the config (apache#1149)
  * HDDS-3833. Use Pipeline choose policy to choose pipeline from exist pipeline list (apache#1096)
  * HDDS-3979. Make bufferSize configurable for stream copy (apache#1212)
  * HDDS-4048. Show more information while SCM version info mismatch (apache#1278)
  * HDDS-4078. Use HDDS InterfaceAudience/Stability annotations (apache#1302)
  * HDDS-4034. Add Unit Test for HadoopNestedDirGenerator. (apache#1266)
  * HDDS-4076. Translate CSI.md into Chinese (apache#1299)
  * HDDS-4046. Extensible subcommands for CLI applications (apache#1276)
  * HDDS-4051. Remove whitelist/blacklist terminology from Ozone (apache#1306)
  * HDDS-4055. Cleanup GitHub workflow (apache#1282)
  * HDDS-4042. Update documentation for the GA release (apache#1269)
  * HDDS-4066. Add core-site.xml to intellij configuration (apache#1292)
  * HDDS-4073. Remove leftover robot.robot (apache#1297)
  * ...
HDDS-3878. Make OMHA serviceID optional if one (but only one) is defined in the config (apache#1149)
What changes were proposed in this pull request?
om.serviceId is required in case of OM HA in all the client parameters, even if there is only one om.serviceId defined and it could be chosen automatically.
My goal is:
Use the om.serviceId from the config if one (and only one) is defined there.
It also makes it easier to run the same tests with/without HA.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3878?filter=12349091
How was this patch tested?
The ozone-ha acceptance tests were turned off a long time ago (they included some life-cycle tests to start/stop services). In this patch I simplified the HA cluster and added a simple smoketest. Later we can restore the lifecycle tests (start/stop), but I would prefer to use a generic approach for all the clusters.