Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why doesn't SparkFetcher use REST client to fetch eventlogs? #199

Closed
shkhrgpt opened this issue Jan 31, 2017 · 15 comments · Fixed by #225
Closed

Why doesn't SparkFetcher use REST client to fetch eventlogs? #199

shkhrgpt opened this issue Jan 31, 2017 · 15 comments · Fixed by #225

Comments

@shkhrgpt
Copy link
Contributor

SparkFetcher is using REST client to fetch ApplicationInfo, StageData, ExecutorData, and JobData. However, it's using WebHDFS based LOG client to fetch eventlogs. Spark (1.5.0 and above) history server provides the following REST APIs to fetch eventlogs:

monitoring and instrumentation spark 2 1 0 documentation

I was wondering why we did not choose to use REST API to fetch eventlogs? Was this option ever considered? If yes, then what are the drawbacks with this approach. If no, then what could be potential problems?

@rayortigas: Since you wrote this new Spark fetcher, can you please provide some information on this issue?
@paulbramsen , @shankar37 , @akshayrai @babak-altiscale: What do you guys think about it?

Thank you.

@paulbramsen
Copy link
Contributor

It seems like a good plan to me! We can always structure things so that the WebHDFS stuff lives in a legacy fetcher for those using older versions of Spark. I should have a Dr. Elephant instance to test on soon so I'll take deeper look in the next couple days.

@superbobry
Copy link
Contributor

JFYI we've been using Sparklint (a tool for profiling resource usage of Spark jobs) which fetches history logs via REST API for about a month now, and it seems to do fine, so a big 👍 if Dr. E goes that way as well. Also, the potential benefit of such approach is that Dr. E would be able to fetch the logs of any user (without fiddling with HDFS ACL), whereas with the current version it is limited to the logs of the user which it is currently running under, c.f. Event Logging:

The event log files will be created by Spark with permissions such that only the user and group have read and write access.

@shkhrgpt
Copy link
Contributor Author

Thanks, @superbobry. We also favor REST API based fetcher because of the same permission and Kerberos issue. However, in general, getting data using REST API can be slower than reading it from HDFS. Therefore, we wanted to know from Dr Elephant team that whether or not they have any concerns about the REST API based eventlog fetcher.

Since it's working fine for you, that gives us more confidence.

Thank you.

@rayortigas
Copy link
Contributor

Dr. E is still defaulting to compile against 1.4, which as you observed, doesn't have an API to fetch event logs. So #162 was done with that minimum in mind.

If @shankar37, @akshayrai, and team are fine raising the minimum Spark version to >= 1.5, then I'm fine with the fetcher going to REST entirely as the primary method of getting info.

@paulbramsen
Copy link
Contributor

I assume that if the new version of the Spark fetcher were designed so that the old one could stick around as a legacy tool then it would be fine, right? There could be 2 Spark fetchers kind of like how there are 2 MapReduce fetchers: one that uses the REST API and one that uses HDFS.

@shkhrgpt
Copy link
Contributor Author

shkhrgpt commented Feb 3, 2017

I agree with @paulbramsen, we can have both, REST client and LOG client, to fetch eventlogs. And maybe use REST client as default for spark versions 1.5 and above?

@paulbramsen
Copy link
Contributor

paulbramsen commented Feb 3, 2017

@shkhrgpt as of right now the only way to make a fetcher the "default" fetcher is if it exists uncommented in FetcherConf.xml. There's not a mechanism to alter default by version. But I think that having both fetchers in FetcherConf.xml with one commented out and a description of each of their purposes would be sufficient.

@rayortigas
Copy link
Contributor

@paulbramsen Yeah, the old one could co-exist with the new one. It would be a simple config block just as you described.

I should point out that after consulting with the Dr. E team I ended up removing the old fetcher, and I failed to update the documentation in #162, which makes it sounds like both the old and new are usable (sorry!). Nonetheless, if you look at the commits in https://github.com/linkedin/dr-elephant/pull/162/commits you'll see that it's a fairly clean path before and after I removed the old fetcher. So backtracking to a3dde3a and then doing some renaming and re-applying the subsequent commits in some form, we can get both to co-exist.

Of course I defer to @shankar37, @nntnag17, and @akshayrai, on what to do.

@shkhrgpt
Copy link
Contributor Author

shkhrgpt commented Feb 3, 2017

@rayortigas I thought that @paulbramsen meant complete REST API based fetcher as the new one, and the current fetcher (REST + WebHDFS) as the old one. Am I right @paulbramsen ? In that case, I don't think we need to backtrack to a3dde3a. The believe that the new fetcher should add a REST client to fetch eventlogs. @paulbramsen Am I right here?

@paulbramsen
Copy link
Contributor

@shkhrgpt yes you are right. That is what I meant. No need to backtrack.

@shkhrgpt
Copy link
Contributor Author

shkhrgpt commented Feb 3, 2017

Since the existing and the proposed fetchers will be the same in most of their data fetching, both will use REST to fetch most of the data (ApplicationInfo, StageData, ExecutorData, and JobData). They will be only different in terms of fetching eventlogs. Does it make sense to call them two "different" fetchers?

I was thinking that maybe just keep one fetcher (SparkFetcher), and add an additional parameter to specify whether or not use REST client to fetch eventlogs. So the config for the fetcher will look like the following:

<fetcher>
<applicationtype>spark</applicationtype>
<classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
<params>
<should_use_rest_to_fetch_eventlogs>false</should_use_rest_to_fetch_eventlogs>
</params>
</fetcher>

The default value of this parameter should be false because the WebHDFS will work for all Spark versions but REST will for Spark 1.5 and above.

@paulbramsen @rayortigas What you think?

@paulbramsen
Copy link
Contributor

@shkhrgpt sounds good to me.

@superbobry
Copy link
Contributor

superbobry commented Mar 21, 2017

@shkhrgpt this way the user could misconfigure Dr. E to fetch eventlogs via REST API even if running only on Spark 1.5. What do you think about an alternative strategy: the fetcher would inspect SPARK_VERSION and use REST API for >1.5 and WebHDFS for ≤1.5?

Update: this would only work if Dr. E is compiled with the appropriate version of Spark.

superbobry pushed a commit to criteo-forks/dr-elephant that referenced this issue Mar 21, 2017
This commits allows Dr. Elephant to fetch Spark logs without universal
read access to eventLog.dir on HDFS. SparkFetcher would atomatically
use SparkRestClient instead of SparkLogClient for Spark >1.5.

Closes linkedin#199
superbobry pushed a commit to criteo-forks/dr-elephant that referenced this issue Mar 21, 2017
This commit allows Dr. Elephant to fetch Spark logs without universal
read access to eventLog.dir on HDFS. SparkFetcher would use SparkRestClient
instead of SparkLogClient if configured as

   <params>
     <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
   </params>

The default behaviour is to fetch the logs via SparkLogClient/WebHDFS.

Closes linkedin#199
@shkhrgpt
Copy link
Contributor Author

@superbobry I think that's an interesting approach too. However, at this stage, the problem with making a decision based on Spark version forces the fetcher to use either WebHDFS or REST. Maybe someone does not want to use REST even if Spark version >= 1.5. Especially in the beginning when the REST fetcher is not well tested compared to WebHDFS fetcher. Maybe down the road when we feel confident about the REST fetcher, we can think of using Spark version. What do you think?

Btw, thanks for submitting the PR for this issue.

superbobry pushed a commit to criteo-forks/dr-elephant that referenced this issue Mar 21, 2017
This commit allows Dr. Elephant to fetch Spark logs without universal
read access to eventLog.dir on HDFS. SparkFetcher would use SparkRestClient
instead of SparkLogClient if configured as

    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    </params>

The default behaviour is to fetch the logs via SparkLogClient/WebHDFS.

Closes linkedin#199
@superbobry
Copy link
Contributor

@shkhrgpt I completely agree. The PR implements precisely your version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants