Use old Spark fetcher as a fallback. #224
Conversation
I have an overall design question here. Why not have a separate filesystem fetcher for Spark, rather than having it as a fallback from the REST fetcher?
I had the same question. IIRC, from discussing with @shankar37, it sounded like it was desirable to fall back to the filesystem fetcher if the REST fetcher encountered issues with the API endpoint. It's unclear whether restoring the filesystem fetcher should be permanent. In any case, this PR provides a bridge to whatever gets decided next.
I think there should be a flag to determine whether or not to use the fallback fetcher. In addition, I think there should also be a configuration option to just use the FS fetcher, if someone wants that. What do you think?
YES! I absolutely request having both of these fetchers, with a configuration option. Let me explain why: I recently wanted to update our Dr. Elephant and realized we cannot use the new SparkFetcher, which requires the Spark History Server. At Altiscale we decided to let users run the Spark History Server themselves rather than us running it for the whole cluster, due to the increasing memory usage of the Spark History Server (I don't know if it still lacks a retention policy or whether that was resolved, but this was an issue with some of our clusters that used to run a lot of Spark jobs). Therefore, we cannot use the new SparkFetcher until we decide to run the History Server for the whole cluster as a daemon. Having both fetchers is very useful for other reasons (like performance) too. If the previous code was not clean, we can certainly polish it and then put it back in place.
Also, I just looked at the code and want to ask the same question that @shkhrgpt asked earlier in this thread: why not have a separate filesystem fetcher?
@babak-altiscale I haven't completely verified this works, but the intent here was also to make SparkFSFetcher usable via explicit configuration if that was desired. Note, though, that as things stand, SparkFSFetcher will only work with the old heuristics. As for @shkhrgpt requesting a config flag specifically for the SparkFetcher (REST), indicating whether it should fall back to the SparkFSFetcher (or not at all): assuming I've understood that correctly, I'm fine with that too.
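A rough sketch of how such a flag could be read from the fetcher's params (the key name and default here are assumptions for illustration, not necessarily what this PR would use):

```scala
import scala.util.Try

object FallbackFlagSketch {
  // paramMap would typically come from FetcherConfigurationData.getParamMap in Dr. Elephant.
  def shouldUseFsFallback(paramMap: java.util.Map[String, String]): Boolean =
    Option(paramMap.get("use_fs_fetcher_fallback"))   // assumed key name
      .flatMap(v => Try(v.toBoolean).toOption)
      .getOrElse(false)                               // assumed default: REST only, no fallback

  // e.g. shouldUseFsFallback(java.util.Collections.singletonMap("use_fs_fetcher_fallback", "true")) == true
}
```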
@rayortigas Is there a strong reason why the SparkFSFetcher can't be made to work with the new heuristics and new metrics aggregator? My intention was to bring the FS fetcher back as a standalone fetcher and also as a combo fetcher, so to speak, that uses both the REST and FS fetchers in that order. But I don't want any of this to change which heuristics we have or how we calculate metrics aggregation. That would lead to a lot of confusion and also a dependency we are trying to avoid. We are proceeding in a direction where the fetchers and heuristics are completely decoupled from each other except through the HadoopApplicationData interface.
Also, it's not clear to me where in the code or config you make the switch between old and new heuristics. I see the old heuristics code is back, but I am not sure how those are invoked only when the legacy fetcher is called.
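For illustration, a simplified sketch of the decoupling @shankar37 describes (stand-in types; the real HadoopApplicationData interface in Dr. Elephant has more methods):

```scala
// Fetchers produce application data, heuristics consume it, and neither
// depends on the other's concrete classes -- only on the shared data interface.
trait AppData {                   // simplified stand-in for HadoopApplicationData
  def appId: String
}

trait Fetcher[T <: AppData] {     // e.g. a REST-based or FS-based Spark fetcher
  def fetchData(appId: String): T
}

trait Heuristic[T <: AppData] {   // e.g. a Spark configuration heuristic
  def apply(data: T): String      // simplified "result" type
}
```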
@shankar37 I only focused on getting the SparkFetcher to fall back to the SparkFSFetcher, since that is what you and I had agreed upon; when that fallback from REST to FS happens, we still end up using the new heuristics and metrics aggregator, since the SparkFetcher calls a converter to wrap whatever it got from SparkFSFetcher. If, additionally, you want the SparkFSFetcher to also be usable standalone with the new heuristics and metrics aggregator, then that is extra work that I did not scope. I suspect it is not too much more work, but I've run out of cycles at the moment to make that happen. Feel free to evolve the PR, though; I don't want to hold things up.
re: legacy heuristics, those would be invoked if you configured SparkFSFetcher and old heuristics like before (just with the package names changed to legacydata/legacyheuristics). |
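A rough sketch of the fallback flow described in this comment (placeholder types and names; the PR's actual classes and converter differ):

```scala
import scala.util.{Failure, Success, Try}

// Placeholder types standing in for the real Dr. Elephant data classes.
final case class RestData(appId: String)      // data as the new heuristics expect it
final case class LegacyFsData(appId: String)  // data produced by the legacy FS fetcher

class FallbackFetcherSketch(
    fetchViaRest: String => RestData,
    fetchViaFs: String => LegacyFsData,
    convert: LegacyFsData => RestData) {

  def fetchData(appId: String): RestData =
    Try(fetchViaRest(appId)) match {
      case Success(data) => data
      case Failure(_)    => convert(fetchViaFs(appId))  // fall back, then wrap the legacy data
    }
}

object FallbackFetcherSketchDemo extends App {
  val fetcher = new FallbackFetcherSketch(
    fetchViaRest = _ => sys.error("history server unavailable"),  // simulate a REST failure
    fetchViaFs   = id => LegacyFsData(id),
    convert      = legacy => RestData(legacy.appId))
  println(fetcher.fetchData("application_1234_0001"))  // falls back: RestData(application_1234_0001)
}
```

Because the converted result has the same shape as the REST result, the new heuristics and metrics aggregator run unchanged regardless of which path produced the data.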
Got it. I am not expecting that to happen in this PR, but I misunderstood your comment as saying that was done already. Now the change makes more sense. Why did you bring back the old heuristics and old metrics aggregator files, then? They are not needed for this change, right?
They are not needed; sorry for the confusion. I brought in the old heuristics and metrics aggregator in case somebody wanted to use them again, or use them for debugging the SparkFSFetcher. We could always re-deprecate the old heuristics and metrics aggregator when we were more confident about the path going forward.
OK, I am confident we don't need the old metrics aggregator. Regarding the heuristics, we can discuss at the Dr. Elephant meetup what the community feels. @akshayrai and @nntnag17 are in Sunnyvale this week. Can you work with them to test this on the test machine?
OK, I just sent them an email.
I agree with @shankar37; I don't think there is any need for legacy heuristics. If we decide to remove the legacy heuristics, then we should also remove the legacy SparkApplicationData class. We should just have one SparkApplicationData. @shankar37: I am still wondering what the main motivation is for using the fallback-based design, rather than having two separate fetchers?
@shkhrgpt The motivation is based on my assessment that neither fetcher is good enough on its own. The REST fetcher is fast, but the history server seems to have reliability issues. The FS fetcher is reliable, but it downloads too much data and is a strain on Dr. Elephant. I wanted to create a fetcher that tries REST when possible and FS when REST is not available.
Hmm... I agree that the SparkFetcher is not very reliable. I am still not sure about the fallback approach. If the REST fetcher fails to fetch data for an application, I believe it may continue to fail for all the following applications. That means that once the fetcher goes into fallback mode, it may always be in fallback mode. Moreover, for every new application, the fetcher will first try REST, even though it has been failing for a while. So if the Spark History Server is already in a bad state (OOM), this will make things worse. What do you think?
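To make this concern concrete, here is a hypothetical sketch (not part of this PR) of a guard that would stop retrying REST for every application once the history server looks unhealthy; the threshold and reset policy are assumptions for illustration:

```scala
import scala.util.{Failure, Success, Try}

// Stand-in result type; the real fetchers return SparkApplicationData.
final case class AppData(appId: String, source: String)

class GuardedFallbackSketch(tryRest: String => AppData,
                            tryFs: String => AppData,
                            maxConsecutiveFailures: Int = 3) {
  private var consecutiveRestFailures = 0

  def fetch(appId: String): AppData =
    if (consecutiveRestFailures >= maxConsecutiveFailures) {
      tryFs(appId)                    // stop hammering an unhealthy history server
    } else Try(tryRest(appId)) match {
      case Success(data) =>
        consecutiveRestFailures = 0   // REST recovered; reset the counter
        data
      case Failure(_) =>
        consecutiveRestFailures += 1
        tryFs(appId)                  // per-application fallback, as in the PR
    }
}
```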
Maybe create a separate PR for this change? I don't think you need to combine it with this big PR.
```diff
@@ -143,6 +144,7 @@ object SparkRestClient {
   }

   val objectMapper = new ObjectMapper() with ScalaObjectMapper
+  objectMapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
```
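For context, a minimal standalone illustration of what this setting does (hypothetical model class, assuming jackson-module-scala is on the classpath): with FAIL_ON_UNKNOWN_PROPERTIES disabled, JSON fields the model does not declare are skipped instead of failing deserialization, which keeps the client tolerant of fields added by newer Spark REST API versions.

```scala
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class AppInfo(id: String, name: String)  // hypothetical subset of a REST payload

object TolerantMapperExample extends App {
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

  // "attempts" is not declared on AppInfo; with the feature disabled it is
  // ignored instead of raising UnrecognizedPropertyException.
  val json = """{"id": "app-123", "name": "example", "attempts": []}"""
  println(mapper.readValue(json, classOf[AppInfo]))  // AppInfo(app-123,example)
}
```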
I think this configuration already exists in the latest SparkRestClient. Check this:
Maybe you just need to update your local repo, and only commit the unit test?
Thanks for catching that, @shkhrgpt! I'll remove the conflicting line when I do a final rebase and cleanup.
Force-pushed from 4989bfc to cf435dc.
…tcher (#232) Remove backup for Rest Fetcher and make Legacy FSFetcher as top level fetcher. Change the default fetcher in the config.
I am closing this in favor of #232. Thanks a lot, @rayortigas and all those involved.
As requested by @shankar37, this PR restores the legacy Spark fetcher for use as a fallback for the new Spark fetcher introduced in #162.
The rough order of operations to achieve this commit was:
This should also fix #206.
Haven't tested on our cluster yet, but will report back.