HDFS plugin: allow webhdfs scheme #94299
Conversation
Pinging @elastic/es-data-management (Team:Data Management)
jbaiera left a comment
Hiya! Thanks for the PR submission. I've been giving it a cursory review as my schedule has permitted. I've left some surface level comments to start and will be digging a bit deeper soon.
Any time we add HDFS functionality that requires permission changes we need to be pretty careful. The Hadoop project has historically taken liberties on how it modifies the JVM it is running on. This review might be a bit slow, sorry in advance!
}
uri = URI.create(uriSetting);
- if ("hdfs".equalsIgnoreCase(uri.getScheme()) == false) {
+ if (uri.getScheme() == null || allowedSchemes.contains(uri.getScheme().toLowerCase(Locale.ROOT)) == false) {
Let's just check the URI scheme without lower casing. Is there a case where it makes sense to accept WEBHdfs as the scheme when most would put it as lowercase to begin with?
I've removed the toLowerCase call now, but I want to highlight that equalsIgnoreCase was used for the comparison before my change.
So, I guess that technically this could be a breaking change for someone, however unlikely. Normally, it will just throw an exception if you try to use uppercased HDFS as a scheme:
org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.HDFS.impl=null: No AbstractFileSystem configured for scheme: HDFS
However, technically, you can avoid that error by setting conf.fs.AbstractFileSystem.HDFS.impl to org.apache.hadoop.hdfs.DistributedFileSystem, or maybe even setting it to a different, custom implementation class.
So, that's why I just proposed the toLowerCase call initially - I wanted to avoid breaking some edge-case setup. Would this need to be a deprecation first?
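To illustrate the behavioral difference being discussed, here is a minimal sketch of the old and new checks (the class and method names are made up; the allowed-scheme set mirrors the PR's intent):

```java
import java.net.URI;
import java.util.Set;

class SchemeValidationSketch {
    // Old behavior: only "hdfs", compared case-insensitively, so "HDFS" passed validation
    // (and then usually failed later inside Hadoop unless fs.AbstractFileSystem.HDFS.impl was set).
    static boolean allowedBefore(URI uri) {
        return "hdfs".equalsIgnoreCase(uri.getScheme());
    }

    // New behavior: a case-sensitive membership check, so an uppercased scheme such as "HDFS"
    // is now rejected up front - the edge case discussed above.
    static final Set<String> ALLOWED_SCHEMES = Set.of("hdfs", "webhdfs", "swebhdfs");

    static boolean allowedAfter(URI uri) {
        String scheme = uri.getScheme();
        return scheme != null && ALLOWED_SCHEMES.contains(scheme);
    }
}
```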
);
}
- if (Strings.hasLength(uri.getPath()) && uri.getPath().equals("/") == false) {
+ if (uri.getScheme().equals("hdfs") && Strings.hasLength(uri.getPath()) && uri.getPath().equals("/") == false) {
I don't think this test is specific to hdfs? If the uri is specified as webhdfs://hostname:9870/path/to/repo I think we would want to intercept and advise on that too.
Good point, that's a bit of wishful thinking on my part. You see, I have a separate PR for Hadoop, apache/hadoop#5447, that would utilize the base path in webhdfs-scheme URIs. I was actually testing this PR combined with my Hadoop PR at one point.
For some context: webhdfs uses regular HTTP(S) requests to interact with the filesystem. The requests are made to the base scheme URI plus a hardcoded prefix, /webhdfs/v1. However, there are some webhdfs proxies, most notably Apache Knox, that have a different API path. It's also easy to set up a custom proxy, since it's just HTTP.
For Apache Knox, the default path is /gateway/<gateway-name>/webhdfs/v1. Somewhat ironically, Apache Hadoop's own client does not currently support this custom path, unlike most third-party clients. In the PR apache/hadoop#5447 I linked above, I proposed adding support for it by utilizing the path in the connection URI. So, while it's not the same as the filesystem base path, there would be a use case where a user would want to provide a separate path in the connection URI, for example webhdfs://hostname:9870/gateway/myname/webhdfs/v1/. It would only be the path to the API, distinct from the path option.
Anyway, I can make this validation apply to the (s)webhdfs schemes for now if you'd prefer - please let me know. We could then perhaps revisit it if Hadoop accepts the change.
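To make the two layouts concrete, here is a small sketch of how a webhdfs file-status request would look against a stock namenode versus a Knox-style gateway (hostnames, ports, and the gateway name are made up):

```java
public class WebhdfsUrlLayouts {
    public static void main(String[] args) {
        // Stock webhdfs: requests go to the base URI plus the hardcoded /webhdfs/v1 prefix.
        String plain = "http://namenode:9870/webhdfs/v1/path/to/repo?op=GETFILESTATUS";

        // Apache Knox-style gateway: the same API lives under /gateway/<gateway-name>/webhdfs/v1,
        // the prefix the linked Hadoop PR proposes to take from the connection URI's path.
        String gateway = "https://knox-host:8443/gateway/myname/webhdfs/v1/path/to/repo?op=GETFILESTATUS";

        System.out.println(plain);
        System.out.println(gateway);
    }
}
```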
    new AuthPermission("doAs")
};
SWEBHDFS_PERMISSIONS = new Permission[] {
    new AuthPermission("doAs"),
    new RuntimePermission("setFactory"),
This might be a bit in the weeds as a request, but the above registry of permissions details why they are needed. Can we add a quick blurb for each permission about why they are included?
Sure, I've added comments to that section. For some more details:
- The `doAs` permission was required (IIRC) due to the call here. This permission was already required for the HDFS plugin in Elasticsearch, but only for a Kerberized setup. Now it's also required for webhdfs, even without Kerberos.
- `setFactory` is required for the call here. It is needed only for the `swebhdfs` scheme, as `webhdfs` doesn't use an SSL connection.
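Roughly, the resulting permission registry looks like the sketch below; the class and field names are approximations of the change, and the comments restate the reasoning above rather than quoting the source:

```java
import java.security.Permission;
import javax.security.auth.AuthPermission;

class WebhdfsPermissionsSketch {
    // webhdfs performs a doAs call even without Kerberos (per the discussion above),
    // so it needs AuthPermission("doAs") just like the Kerberized hdfs path did.
    static final Permission[] WEBHDFS_PERMISSIONS = new Permission[] { new AuthPermission("doAs") };

    // swebhdfs additionally installs an SSL socket factory for its HTTPS connections,
    // which the JDK guards with RuntimePermission("setFactory").
    static final Permission[] SWEBHDFS_PERMISSIONS = new Permission[] {
        new AuthPermission("doAs"),
        new RuntimePermission("setFactory") };
}
```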
Thanks for taking a look @jbaiera! I've answered the initial feedback.
No worries, take your time! Thanks for being thorough.
jbaiera left a comment
Sorry again for the long delay on getting this review up. I've left a couple of comments here, mostly small nits/questions. There is one comment about the permissions that I think we'll want to get some additional approval on regarding setFactory, but otherwise I think things are looking good so far. I'll enable testing as well.
systemProperty 'tests.rest.suite', 'hdfs_repository_' + hadoopVer + '/10_basic'
}
// HA fixture is unsupported. Don't run them.
tasks.named("yamlRestTestSecure" + hadoopVer).configure {
Small nit: can we move this down a little to rectify the code comment above it? The HA tests are done by the javaRestTest* tasks, and while we may justifiably want to disable all the yamlRestTest* tasks here, the comment placement is inaccurate.
This build file is really complicated - I normally wouldn't bring this kind of thing up otherwise.
tasks.named("yamlRestTestWeb" + hadoopVer).configure {
    dependsOn "hdfs" + hadoopVer + "Fixture"
    // The normal test runner only runs the standard hdfs rest tests
    systemProperty 'tests.rest.suite', 'hdfs_repository_' + hadoopVer
Should this be the webhdfs rest tests instead of the hdfs ones?
// org.apache.hadoop.hdfs.web.SSLConnectionConfigurator
// This is used by swebhdfs connections
permission java.lang.RuntimePermission "setFactory";
So here's the deal with this permission: frustratingly, Java uses this permission to guard against setting an SSL socket factory both on a connection instance and as the global default for all https connections. In fact, it's used to guard against setting all sorts of weird global state in the running JVM.
I'll ask around with our more security-minded folks and either link them here for their blessing on this, or collect feedback and share it here. I imagine that using webhdfs without SSL support is not exactly an attractive prospect. There is some prior art in the code base that might support this usage, but I need to double check.
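For reference, a minimal sketch of the two JDK calls guarded by checkSetFactory() - this is only an illustration of the permission's scope, not plugin code, and the URL is a placeholder:

```java
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;

class SetFactorySketch {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();

        // Per-connection socket factory: HttpsURLConnection.setSSLSocketFactory calls
        // SecurityManager.checkSetFactory() before applying it.
        HttpsURLConnection conn = (HttpsURLConnection) new URL("https://namenode:9871/").openConnection();
        conn.setSSLSocketFactory(ctx.getSocketFactory());

        // Global default for every subsequent https connection in the JVM:
        // also guarded by the same checkSetFactory() check.
        HttpsURLConnection.setDefaultSSLSocketFactory(ctx.getSocketFactory());
    }
}
```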
While not ideal, the grant here seems reasonable (given the usage).
I'm less concerned with the grant of the setFactory permission than I would be with other already-granted permissions. And we already "trust" several other plugins and their dependencies with this permission.
@elasticmachine ok to test
@elasticmachine update branch
}

def getHttpPortForVersion(hadoopVersion) {
    return 10004 - (2 * hadoopVersion)
I tried running the tests locally and the fixture does not start due to an already used port. I think these port assignment functions need to be updated to use an offset of 3 instead of 2 now.
Since the conversation has come to a bit of a halt I'm going to close this PR for now. If you would like to continue development on it, please let us know and we can happily reopen it!
The HDFS plugin for snapshots was restricted to the `hdfs://` scheme. However, the client library it uses also supports the `webhdfs://` and `swebhdfs://` schemes. This PR enables use of the `webhdfs` and `swebhdfs` schemes for HDFS snapshot repositories.

It only required allowing the `webhdfs` and `swebhdfs` schemes in URI validation, and adding a couple of permissions: `doAs` is required for (s)webhdfs, and the runtime `setFactory` permission is required for swebhdfs.

All other changes are related to tests - I've added `plugins:repository-hdfs:yamlRestTestWeb2` and `plugins:repository-hdfs:yamlRestTestWeb3` gradle tasks that use miniHDFS for integration tests of snapshots using the `webhdfs` protocol. I've also tested it manually against a Hadoop proxy with SSL enabled (the `swebhdfs` protocol).

This is related to #24455, though I didn't test a Kerberized webhdfs setup. I think that might be unsupported in the Hadoop client in the first place: https://issues.apache.org/jira/browse/HDFS-15765

The original issue was closed due to low demand, but seeing that it only requires loosening the validation and a couple of permissions, this should be well worth it. In our case, we prefer to use `webhdfs` because it's much easier to secure - it's just an HTTP REST API.
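For reference, a hedged sketch of how a webhdfs-backed repository could be registered once this change is in, using the low-level Java REST client; the hosts, ports, repository name, and paths are placeholders, and the only repository settings assumed are the plugin's existing `uri` and `path`:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class RegisterWebhdfsRepository {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Same repository type and settings as for hdfs://, only the URI scheme differs.
            Request request = new Request("PUT", "/_snapshot/my_webhdfs_repository");
            request.setJsonEntity("""
                {
                  "type": "hdfs",
                  "settings": {
                    "uri": "webhdfs://namenode.example.com:9870",
                    "path": "elasticsearch/snapshots"
                  }
                }""");
            client.performRequest(request);
        }
    }
}
```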