-
Notifications
You must be signed in to change notification settings - Fork 859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why doesn't SparkFetcher use REST client to fetch eventlogs? #199
Comments
It seems like a good plan to me! We can always structure things so that the WebHDFS stuff lives in a legacy fetcher for those using older versions of Spark. I should have a Dr. Elephant instance to test on soon so I'll take deeper look in the next couple days. |
JFYI we've been using Sparklint (a tool for profiling resource usage of Spark jobs) which fetches history logs via REST API for about a month now, and it seems to do fine, so a big 👍 if Dr. E goes that way as well. Also, the potential benefit of such approach is that Dr. E would be able to fetch the logs of any user (without fiddling with HDFS ACL), whereas with the current version it is limited to the logs of the user which it is currently running under, c.f. Event Logging:
|
Thanks, @superbobry. We also favor REST API based fetcher because of the same permission and Kerberos issue. However, in general, getting data using REST API can be slower than reading it from HDFS. Therefore, we wanted to know from Dr Elephant team that whether or not they have any concerns about the REST API based eventlog fetcher. Since it's working fine for you, that gives us more confidence. Thank you. |
Dr. E is still defaulting to compile against 1.4, which as you observed, doesn't have an API to fetch event logs. So #162 was done with that minimum in mind. If @shankar37, @akshayrai, and team are fine raising the minimum Spark version to >= 1.5, then I'm fine with the fetcher going to REST entirely as the primary method of getting info. |
I assume that if the new version of the Spark fetcher were designed so that the old one could stick around as a legacy tool then it would be fine, right? There could be 2 Spark fetchers kind of like how there are 2 MapReduce fetchers: one that uses the REST API and one that uses HDFS. |
I agree with @paulbramsen, we can have both, REST client and LOG client, to fetch eventlogs. And maybe use REST client as default for spark versions 1.5 and above? |
@shkhrgpt as of right now the only way to make a fetcher the "default" fetcher is if it exists uncommented in |
@paulbramsen Yeah, the old one could co-exist with the new one. It would be a simple config block just as you described. I should point out that after consulting with the Dr. E team I ended up removing the old fetcher, and I failed to update the documentation in #162, which makes it sounds like both the old and new are usable (sorry!). Nonetheless, if you look at the commits in https://github.com/linkedin/dr-elephant/pull/162/commits you'll see that it's a fairly clean path before and after I removed the old fetcher. So backtracking to a3dde3a and then doing some renaming and re-applying the subsequent commits in some form, we can get both to co-exist. Of course I defer to @shankar37, @nntnag17, and @akshayrai, on what to do. |
@rayortigas I thought that @paulbramsen meant complete REST API based fetcher as the new one, and the current fetcher (REST + WebHDFS) as the old one. Am I right @paulbramsen ? In that case, I don't think we need to backtrack to a3dde3a. The believe that the new fetcher should add a REST client to fetch eventlogs. @paulbramsen Am I right here? |
@shkhrgpt yes you are right. That is what I meant. No need to backtrack. |
Since the existing and the proposed fetchers will be the same in most of their data fetching, both will use REST to fetch most of the data (ApplicationInfo, StageData, ExecutorData, and JobData). They will be only different in terms of fetching eventlogs. Does it make sense to call them two "different" fetchers? I was thinking that maybe just keep one fetcher (SparkFetcher), and add an additional parameter to specify whether or not use REST client to fetch eventlogs. So the config for the fetcher will look like the following:
The default value of this parameter should be false because the WebHDFS will work for all Spark versions but REST will for Spark 1.5 and above. @paulbramsen @rayortigas What you think? |
@shkhrgpt sounds good to me. |
@shkhrgpt this way the user could misconfigure Dr. E to fetch eventlogs via REST API even if running only on Spark 1.5. What do you think about an alternative strategy: the fetcher would inspect SPARK_VERSION and use REST API for >1.5 and WebHDFS for ≤1.5? Update: this would only work if Dr. E is compiled with the appropriate version of Spark. |
This commits allows Dr. Elephant to fetch Spark logs without universal read access to eventLog.dir on HDFS. SparkFetcher would atomatically use SparkRestClient instead of SparkLogClient for Spark >1.5. Closes linkedin#199
This commit allows Dr. Elephant to fetch Spark logs without universal read access to eventLog.dir on HDFS. SparkFetcher would use SparkRestClient instead of SparkLogClient if configured as <params> <use_rest_for_eventlogs>true</use_rest_for_eventlogs> </params> The default behaviour is to fetch the logs via SparkLogClient/WebHDFS. Closes linkedin#199
@superbobry I think that's an interesting approach too. However, at this stage, the problem with making a decision based on Spark version forces the fetcher to use either WebHDFS or REST. Maybe someone does not want to use REST even if Spark version >= 1.5. Especially in the beginning when the REST fetcher is not well tested compared to WebHDFS fetcher. Maybe down the road when we feel confident about the REST fetcher, we can think of using Spark version. What do you think? Btw, thanks for submitting the PR for this issue. |
This commit allows Dr. Elephant to fetch Spark logs without universal read access to eventLog.dir on HDFS. SparkFetcher would use SparkRestClient instead of SparkLogClient if configured as <params> <use_rest_for_eventlogs>true</use_rest_for_eventlogs> </params> The default behaviour is to fetch the logs via SparkLogClient/WebHDFS. Closes linkedin#199
@shkhrgpt I completely agree. The PR implements precisely your version. |
SparkFetcher is using REST client to fetch ApplicationInfo, StageData, ExecutorData, and JobData. However, it's using WebHDFS based LOG client to fetch eventlogs. Spark (1.5.0 and above) history server provides the following REST APIs to fetch eventlogs:
I was wondering why we did not choose to use REST API to fetch eventlogs? Was this option ever considered? If yes, then what are the drawbacks with this approach. If no, then what could be potential problems?
@rayortigas: Since you wrote this new Spark fetcher, can you please provide some information on this issue?
@paulbramsen , @shankar37 , @akshayrai @babak-altiscale: What do you guys think about it?
Thank you.
The text was updated successfully, but these errors were encountered: