
Conversation

@nilesh-c (Member)

I am removing all synchronization from the OutputFormat code for the following reasons:

  • In any case we will need to use Apache Spark in a single-threaded manner, with 1 thread per worker (and 1 worker per core, so that we can still use all the cores), because Hadoop's bz2 InputStream, CBzip2InputStream, is not thread-safe. There is a JIRA issue for that which has been fixed and merged to trunk, but at least until Hadoop's next release this stays as it is.
  • We are already guaranteeing single-threaded execution, so needless synchronization will simply cause problems (see the sketch below).
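To make the point concrete, here is a minimal sketch, not the framework's actual class (CompositeWriter and its methods are hypothetical stand-ins): under a guaranteed one-thread-per-worker model, the synchronized guard is pure overhead and can be dropped.

import scala.collection.mutable

// Hypothetical stand-in for a composite writer; not the framework's code.
class CompositeWriter {
  private val writers = mutable.HashMap.empty[String, java.io.StringWriter]

  // Before: every call paid for acquiring the object's monitor.
  def writeLocked(dataset: String, line: String): Unit = synchronized {
    writers.getOrElseUpdate(dataset, new java.io.StringWriter).write(line)
  }

  // After: with exactly one thread per worker JVM touching this object,
  // plain unsynchronized access is safe and cheaper.
  def write(dataset: String, line: String): Unit =
    writers.getOrElseUpdate(dataset, new java.io.StringWriter).write(line)
}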

Member

Would creating the map with all the different formats when you declare recordWriters, and making it immutable, solve the problem? Or does the problem arise when another class uses DBpediaCompositeOutputFormat.write()?

@nilesh-c (Member, Author)

Currently we don't have multiple threads accessing these methods, so we should be safe in any case.

But let's say Hadoop releases its next version with a thread-safe bz2 decompression stream that gives users the choice of running multiple threads per worker (JVM): in that case we would need to synchronize access to the whole write() and close() methods, not just the Map.

And you might ask: why not create the whole map of RecordWriters right when the class is initialized? Doing it lazily prevents us from creating too many RecordWriter instances. The map of RecordWriters is initialized for each input split (there can be hundreds of them, maybe even a few thousand), so eager creation would also build RecordWriters for unneeded datasets (those with no Quads coming in): extra instances, additional GC.
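Here is a minimal sketch of the lazy scheme described above, assuming hypothetical Dataset and Quad stand-in types and a createWriterFor helper; the framework's real types and wiring will differ.

import scala.collection.mutable
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}

// Stand-in types for illustration; the framework has its own Dataset/Quad.
case class Dataset(name: String)
case class Quad(line: String)

class LazyCompositeRecordWriter(context: TaskAttemptContext)
    extends RecordWriter[Dataset, Quad] {

  // Writers are created on first use: a dataset that never receives a
  // Quad in this input split never allocates a RecordWriter at all.
  private val recordWriters =
    mutable.HashMap.empty[Dataset, RecordWriter[Dataset, Quad]]

  override def write(dataset: Dataset, quad: Quad): Unit =
    recordWriters
      .getOrElseUpdate(dataset, createWriterFor(dataset))
      .write(dataset, quad)

  override def close(ctx: TaskAttemptContext): Unit =
    recordWriters.values.foreach(_.close(ctx))

  // Left unimplemented in this sketch; the real code would open the
  // per-dataset output file, wrap it in the configured formatter and
  // emit formatter.header.
  private def createWriterFor(dataset: Dataset): RecordWriter[Dataset, Quad] = ???
}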

@nilesh-c (Member, Author)

But @jimkont, I have a question here regarding the last paragraph of my previous comment. Initializing ALL record writers does add GC overhead, but it lets us write empty splits containing only the formatter.header and formatter.footer, which look like:

# started 2014-06-12T04:44:23Z
# completed 2014-06-12T04:44:23Z

This is how those files look in the original framework. Currently the dataset directories remain empty in this case, instead of containing files (output splits) that look like the above.
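For contrast, here is a sketch of the eager alternative in question, reusing the hypothetical names from the sketch above: every configured dataset gets a writer up front, so even a split that receives no Quads still produces a file holding just the header and footer.

// Eager variant: one writer per configured dataset from the start. Each
// writer emits formatter.header on creation and formatter.footer on close,
// so an empty split still yields the two-line file shown above, at the
// cost of extra instances and GC work on every input split.
class EagerCompositeRecordWriter(datasets: Seq[Dataset], context: TaskAttemptContext)
    extends RecordWriter[Dataset, Quad] {

  private val recordWriters: Map[Dataset, RecordWriter[Dataset, Quad]] =
    datasets.map(d => d -> createWriterFor(d)).toMap

  override def write(dataset: Dataset, quad: Quad): Unit =
    recordWriters(dataset).write(dataset, quad)

  override def close(ctx: TaskAttemptContext): Unit =
    recordWriters.values.foreach(_.close(ctx))

  private def createWriterFor(dataset: Dataset): RecordWriter[Dataset, Quad] = ???
}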

@nilesh-c (Member, Author)

Do you think I should go for it? Or I could decide after benchmarking both approaches.

@nilesh-c (Member, Author)

Update: I just set the number of cores a worker should use to 1, directly in the code. This means no more bz2 decompression thread issues. @sangv, this should solve users' headaches regarding threading.

…problems"

This reverts commit 87f1743.
@nilesh-c (Member, Author)

I mistakenly thought that "spark.cores.max" means the max. number of cores per worker. It's actually the max. cores to use in the whole cluster. :-(
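A minimal sketch of the distinction, assuming a Spark standalone deployment (the app name is hypothetical): "spark.cores.max" is an application-wide, cluster-wide cap, while per-worker cores are fixed when the worker process is launched.

import org.apache.spark.SparkConf

object CoresConfigSketch {
  // "spark.cores.max" caps the total cores the application may use across
  // the entire cluster; it does NOT mean cores per worker.
  val conf = new SparkConf()
    .setAppName("dbpedia-extraction") // hypothetical app name
    .set("spark.cores.max", "1")      // 1 core in total, cluster-wide

  // To cap cores per worker in standalone mode, the worker itself must be
  // started with SPARK_WORKER_CORES set (e.g. in conf/spark-env.sh); it is
  // not an application-level SparkConf setting.
}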

sangv added a commit that referenced this pull request on Jul 16, 2014:
Remove all synchronization from OutputFormats

@sangv merged commit 24b45cd into nildev2 on Jul 16, 2014.
@nilesh-c deleted the output-nothreads branch on Jul 26, 2014.