Make logging level and number of job submission threads configurable #32
Conversation
Also, I made a commit that makes the number of threads in the thread pool used for creating the job submission futures configurable:

```scala
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(extractionJobThreads))
val futures = for (job <- jobs) yield future {
  job.run()
}
```

Here's the comment used for the new setting:

```scala
/**
 * Number of threads to use in the ExecutionContext while calling DistExtractionJob.run() on multiple
 * extraction jobs in parallel.
 *
 * Note that these threads on the driver node do not perform any heavy work except for executing
 * DistExtractionJob.run(), which submits the respective Spark job to the Spark master and waits
 * for the job to finish.
 *
 * By default it is set to Integer.MAX_VALUE so that all extraction jobs are submitted to the Spark master
 * simultaneously, which uses the configured scheduling mechanism to execute the jobs on the cluster.
 */
```

@jimkont I remember that a while back you had some doubts about what implications the number of threads has on parallel multi-language extraction. I hope the above comment clarifies it.
So, if I understand correctly, on a multi-language extraction Spark will …
@jimkont Yes. To be a bit pedantic, Spark will get a list of "jobs", each consisting of "stages" (e.g. map, reduce, etc.), which in turn contain N tasks distributed in parallel. For the sake of simplicity, let's assume a job consists of a lot of tasks. Depending on the scheduler used, Spark will schedule the jobs across the cluster according to the number of available worker slots. By default the FIFO scheduler is used: a job is taken from a FIFO queue and executed, and if task slots are still free while it is running, the next job's tasks are fetched, and so on. Otherwise, we may use the FAIR scheduler, which tries to give all jobs equal priority. And as you said, it's best to benchmark this. It's probably a good idea to merge this if you're okay with the code, create an issue about the default, and change the default after benchmarking.
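For reference, switching to the FAIR scheduler mentioned above is a one-line Spark configuration change; a minimal sketch (`spark.scheduler.mode` is the standard Spark property, the app name is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// FIFO is the default; FAIR gives concurrently submitted jobs roughly equal
// shares of the cluster's task slots.
val conf = new SparkConf()
  .setAppName("dbpedia-distributed-extraction")
  .set("spark.scheduler.mode", "FAIR")

val sc = new SparkContext(conf)
```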
…ling threads. The program was not stopping even after the SparkContext had been stopped; the ExecutorService needs to be shut down too.
… the default Level.WARN. The problem was that getValue() returns null directly instead of passing the value through the closure.
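As an illustration of a configurable logging level with a WARN fallback (a sketch, not the commit's actual code; `configuredLevel` is a hypothetical helper):

```scala
import org.apache.log4j.{Level, Logger}

// Parse the configured level name, falling back to WARN (the default
// mentioned above) when the value is missing or unrecognized.
def configuredLevel(levelName: String): Level =
  Level.toLevel(levelName, Level.WARN)

// e.g. keep Spark's own logging at the configured level
Logger.getLogger("org.apache.spark").setLevel(configuredLevel("WARN"))
```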
…the maps have no such value. Example: often, in a stage, the first TaskEnd event may tell us that the task has failed; in that case doing this.stageIdToTasksComplete(stageId) would cause a NoSuchElementException. Similarly for stageIdToTasksFailed.
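A minimal sketch of the defensive lookup this fix describes, assuming mutable count maps keyed by stage ID with the names mentioned above:

```scala
import scala.collection.mutable

// Counters keyed by stage ID; a stage may report a failed task before any
// task has completed, so neither map is guaranteed to have an entry yet.
val stageIdToTasksComplete = mutable.Map[Int, Int]()
val stageIdToTasksFailed = mutable.Map[Int, Int]()

def onTaskEnd(stageId: Int, succeeded: Boolean): Unit = {
  if (succeeded)
    // getOrElse avoids the NoSuchElementException that a bare apply() throws
    stageIdToTasksComplete(stageId) = stageIdToTasksComplete.getOrElse(stageId, 0) + 1
  else
    stageIdToTasksFailed(stageId) = stageIdToTasksFailed.getOrElse(stageId, 0) + 1
}
```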
Make logging level and number of job submission threads configurable
@jimkont @sangv please review these small changes and merge the PR to nildev. Until now most of Spark's logging was disabled; I'm making it configurable, which makes setting up clusters and troubleshooting easier.
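For illustration only, reading the two new settings with the defaults discussed in this PR might look roughly like this; the property names below are placeholders, not the framework's actual configuration keys:

```scala
import java.util.Properties

// Hypothetical sketch: resolve the logging level and the job submission
// thread count, falling back to the defaults mentioned in this PR.
def loadSettings(props: Properties): (String, Int) = {
  val loggingLevel = Option(props.getProperty("logging-level")).getOrElse("WARN")
  val extractionJobThreads =
    Option(props.getProperty("extraction-job-threads")).map(_.toInt).getOrElse(Integer.MAX_VALUE)
  (loggingLevel, extractionJobThreads)
}
```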