Make logging level and number of job submission threads configurable #32
Conversation
Also, I made a commit that makes the number of threads in the thread pool used for creating the job submission futures configurable:

```scala
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(extractionJobThreads))
val futures = for (job <- jobs) yield future {
  job.run()
}
```

Here's the comment used for the new setting:

```scala
/**
 * Number of threads to use in the ExecutionContext while calling DistExtractionJob.run() on multiple
 * extraction jobs in parallel.
 *
 * Note that these threads on the driver node do not perform any heavy work except for executing
 * DistExtractionJob.run(), which submits the respective Spark job to the Spark master and waits
 * for the job to finish.
 *
 * By default it is set to Integer.MAX_VALUE so that all extraction jobs are submitted to the Spark master
 * simultaneously, which uses the configured scheduling mechanism to execute the jobs on the cluster.
 */
```

@jimkont I remember that a while back you had some doubts about what implications the number of threads has on parallel multi-language extraction. I hope the above comment clarifies it.
So, if I understand correctly, on a multi-language extraction Spark will …
@jimkont Yes. To be a bit pedantic, Spark will get a list of "jobs", each consisting of "stages" (e.g. map, reduce, etc.), which in turn contain N tasks distributed in parallel. For the sake of simplicity, let's assume a job consists of a lot of tasks. Depending on the scheduler used, Spark will schedule the jobs across the cluster according to the number of available worker slots. By default the FIFO scheduler is used: a job is taken from a FIFO queue and executed, and if task slots are still free while it is running, the next job's tasks are fetched, and so on. Otherwise, we may use the FAIR scheduler, which tries to give all jobs equal priority. And as you said, it's best to benchmark this. It's probably a good idea to merge this if you're okay with the code, create an issue about the default, and change the default after benchmarking.
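For reference, switching to the FAIR scheduler mentioned above is a one-line Spark configuration change; a minimal sketch (`spark.scheduler.mode` is the standard Spark property, the app name is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// FIFO is the default; FAIR gives concurrently submitted jobs roughly equal
// shares of the cluster's task slots.
val conf = new SparkConf()
  .setAppName("dbpedia-distributed-extraction")
  .set("spark.scheduler.mode", "FAIR")

val sc = new SparkContext(conf)
```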
…ling threads. The program was not stopping even after the SparkContext had been stopped; the ExecutorService needs to be shut down too.
… the default Level.WARN. The problem was that getValue() returns null directly instead of passing the value through the closure.
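As an illustration of a configurable logging level with a WARN fallback (a sketch, not the commit's actual code; `configuredLevel` is a hypothetical helper):

```scala
import org.apache.log4j.{Level, Logger}

// Parse the configured level name, falling back to WARN (the default
// mentioned above) when the value is missing or unrecognized.
def configuredLevel(levelName: String): Level =
  Level.toLevel(levelName, Level.WARN)

// e.g. keep Spark's own logging at the configured level
Logger.getLogger("org.apache.spark").setLevel(configuredLevel("WARN"))
```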
…the maps have no such value. Example: often, in a stage, the first TaskEnd event may tell us that the task has failed; in that case doing this.stageIdToTasksComplete(stageId) would cause a NoSuchElementException. Similarly for stageIdToTasksFailed.
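A minimal sketch of the defensive lookup this fix describes, assuming mutable count maps keyed by stage ID with the names mentioned above:

```scala
import scala.collection.mutable

// Counters keyed by stage ID; a stage may report a failed task before any
// task has completed, so neither map is guaranteed to have an entry yet.
val stageIdToTasksComplete = mutable.Map[Int, Int]()
val stageIdToTasksFailed = mutable.Map[Int, Int]()

def onTaskEnd(stageId: Int, succeeded: Boolean): Unit = {
  if (succeeded)
    // getOrElse avoids the NoSuchElementException that a bare apply() throws
    stageIdToTasksComplete(stageId) = stageIdToTasksComplete.getOrElse(stageId, 0) + 1
  else
    stageIdToTasksFailed(stageId) = stageIdToTasksFailed.getOrElse(stageId, 0) + 1
}
```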
Make logging level and number of job submission threads configurable
@jimkont @sangv please review these small changes and merge the PR to nildev. Until now most of Spark's logging was disabled; I'm making it configurable, which makes setting up clusters and troubleshooting easier.
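For illustration only, reading the two new settings with the defaults discussed in this PR might look roughly like this; the property names below are placeholders, not the framework's actual configuration keys:

```scala
import java.util.Properties

// Hypothetical sketch: resolve the logging level and the job submission
// thread count, falling back to the defaults mentioned in this PR.
def loadSettings(props: Properties): (String, Int) = {
  val loggingLevel = Option(props.getProperty("logging-level")).getOrElse("WARN")
  val extractionJobThreads =
    Option(props.getProperty("extraction-job-threads")).map(_.toInt).getOrElse(Integer.MAX_VALUE)
  (loggingLevel, extractionJobThreads)
}
```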