-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone #25047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
76 commits
Select commit
Hold shift + click to select a range
8984026
support gpu-aware resource scheduling in standalone
Ngone51 a9160e4
sync resources info on worker node
Ngone51 4787498
fix unit tests
Ngone51 6bee4ac
fix indentation
Ngone51 30991b0
close lockFileChannel after release lock
Ngone51 73027b3
revert createAppDesc()
Ngone51 03c0921
release resources on worker interrupt
Ngone51 3a54e16
reuse parseResourceRequirements()
Ngone51 d47c8db
fix scalastyle
Ngone51 0bf0d7e
only release resources for driver submitted in client mode
Ngone51 09c13af
release lock on execption
Ngone51 863b220
avoid NPE on listernerBus in SparkContext.stop()
Ngone51 acfed50
log isolated resources for worker/driver
Ngone51 d06985d
set _resources to empty when acquireResources failed
Ngone51 ff35213
do not prepare resources file for driver/executor when its resources …
Ngone51 83222cf
user ResourceInformation instead of Seq[String]
Ngone51 169be97
make it configurable for spark-resources/
Ngone51 13e9f93
remove variable hostResourceToAssignedAddress
Ngone51 33ec65b
use ResourceRequirement across task/executor/driver
Ngone51 e89bfe2
track assignments with pid
Ngone51 0153024
document resource config for worker
Ngone51 68c2cc9
use kill -0 instead of jps
Ngone51 e19a9b6
resolve conflict
Ngone51 ffa3663
fix python tests
Ngone51 8bed3c0
conflict -> conflicting
Ngone51 50e6a6c
SPARK_RESOURCES_DIRECTORY -> SPARK_RESOURCES_COORDINATE_DIR
Ngone51 9d532b3
reword configuration doc
Ngone51 a2899cc
use SPARK_REGEX/LOCAL_CLUSTER_REGEX
Ngone51 fa28d7a
remove unneed comment in ApplicationDescription
Ngone51 7a75713
move _resources into ExecutorDesc's constructor
Ngone51 105abea
notify -> recover
Ngone51 c1ca57e
fix bug in resourcesCanBeReleased
Ngone51 dc82637
add description for acquireResources()
Ngone51 26571ff
make acquireResources() more readable
Ngone51 c2fa13e
use case classes in WorkerSchedulerStateResponse
Ngone51 5cb2cd4
fix scalastyle
Ngone51 95111b0
use Executor/DriverResponse in MasterSuite
Ngone51 aec8cd5
also try to find SPARK_HOME under testing
Ngone51 e69c973
fix error in grammar
Ngone51 8bb7f18
abstract resource info showing logic into a separate method
Ngone51 d7c058d
avoid changing _resources
Ngone51 a29bede
showResourceInfo -> logResourceInfo
Ngone51 fa14f88
move WorkerInfo.releaseResources() into removeDriver()&removeExecutor()
Ngone51 7300813
rename old canLaunchExecutor() to canLaunchExecutorForApp()
Ngone51 e9ec52e
remove driver/execToResourcesUsed
Ngone51 8408e73
add desc for prepareResourceFile()
Ngone51 8aab740
prepareResourceFile -> prepareResourcesFile
Ngone51 a26e112
make prepareResourcesFile() return Option[File]
Ngone51 83313f0
try finally with releaseLock()
Ngone51 159929d
more comments for acquireResources()
Ngone51 8af658a
update doc
Ngone51 51818db
add StandaloneResourceUtils
Ngone51 f7612b4
check the format of process name
Ngone51 9f15819
remove ReleaseResources && releaseResourcesIfPossible()
Ngone51 1208f38
update comment above driver acquireResources()
Ngone51 7016aa9
remove unused try-catch
Ngone51 84bebae
logs for releaseResources()
Ngone51 0b67150
logWarning -> logError
Ngone51 c2d7132
pass None resourcesFile for LocalSparkCluster
Ngone51 2055582
add config spark.resources.coordinate.enable
Ngone51 c2a9855
update doc
Ngone51 4946b01
make ResourceAllocation public and add @@Evolving
Ngone51 782eb0f
simplify releaseResources()
Ngone51 dca8f8b
update comments for acquireResources() & releaseResources()
Ngone51 756a818
(Executor/Driver)Response -> Worker(Executor/DriverState)Response
Ngone51 250d9a8
remove resourcesCanBeReleased()
Ngone51 a97c91f
remove pid in WorkerInfo & RegisterWorker
Ngone51 adc74ae
improve doc
Ngone51 f121f84
fix scalastyle
Ngone51 b46b243
merge master
Ngone51 aa2f63d
make coordinate default to be true
Ngone51 ade35fc
add test for coordinate is off
Ngone51 2bb50da
array of ResourceAllocation
Ngone51 16523bd
update comment for test
Ngone51 33fcc95
revert doc for ResourceAllocation
Ngone51 15a9897
log warn when driver/executor require more resource than any workers …
Ngone51 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems a big odd to me in local mode since all workers would get the same file. The intent was really to have the cluster admin pass them in different resource per worker but that goes with my general comment. If we want to keep this way then perhaps we just need to make sure to document it.
If the cluster admin does basically split the resources themselves between the Workers, then we have no need for the acquireResources and locking, so I definitely think we should put in a config to turn that off and we can document the different ways it can be setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the same resources file for different workers (whether local cluster or real cluster) really doesn't make sense if resources file is intent to have the cluster admin config with different resources. We can just pass None and only use discovery script for LocalSparkCluster since it is only used for test purpose.
And for a real cluster, how about this:
When user configs a resources file(even a discovery script configured concurrently), we just acquire resources from it and do not go through acquireResources() any more with the assuming that user has already configured different resources across workers. If not, then, we use discovery script and calls acquireResources() to make sure we get different resources compare to others.
And we don't introduce a new configuration here, but just document more specific and rely on those file existence to decide which way we wanna go to.
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could do that, but then again a cluster admin could write a discovery script to make sure different workers get different resources. Then they don't have to manually create the resourcesFile. I also think there are some weird cases like you mention where you have both resources file and discovery script that wouldn't be obvious to the user what happens. one resource they separated but then another the discovery script didn't. I realize these are corner cases but with a config it would be obvious exactly what is going to happen.
If we don't do the config then I think we should leave it as is and just document that the resourcesfile/discovery script for Workers/Driver in Standalone mode needs to have all the node resources or they need to configure a different resources Dir for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether it is easy for admin to do this within a script. But I agree that it would be better to have a separate config option if we do expect a admin would do this.
So, let me add it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of forgot about this, we should probably document the fact you can't use separate resource file or discovery script so only way to do this properly is with the coordination. I added an item to https://issues.apache.org/jira/browse/SPARK-27492 to make sure to document.