
Conversation

@nilesh-c
Member

The existing code works with Files. Hadoop works with Paths and FileSystems, which provide an abstraction over files and file systems: Hadoop encapsulates the details of distributed versus local storage and lets us work with Paths and InputFormats. This sums up most of what is in the commits.
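As a rough illustration of the File-to-Path move (this is not code from these commits, and the paths and setup are made up; the Hadoop classes and methods are the real API), the Path-based style looks like this:

```java
// Sketch: requires hadoop-common on the classpath. FileSystem resolves the
// right implementation (local, HDFS, ...) from the URI scheme, so the same
// code can read file:// and hdfs:// inputs. The path below is hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("hdfs://namenode:8020/dumps/enwiki-pages.xml");

        // Instead of new File(...): ask the Path for its FileSystem.
        FileSystem fs = input.getFileSystem(conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(input)))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Swapping the URI for a `file://` one runs the same code against the local file system, which is the point of the abstraction.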

Other than that, I added the XmlInputFormat. I tested with run-extraction-test and everything works. DistRedirectsTest passes. I added some docs too.
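To give an idea of what an XmlInputFormat does (this is an illustrative re-implementation in plain Java, not the actual class): it treats everything between a start tag and its matching end tag as one record, so each Wikipedia `<page>` element becomes a single input record.

```java
// Minimal sketch of tag-delimited record extraction, the core idea behind
// an XmlInputFormat. The real class works on byte streams and split
// boundaries; this version just scans a String.
import java.util.ArrayList;
import java.util.List;

public class TagRecordScanner {
    // Return every substring spanning startTag..endTag, inclusive.
    static List<String> scan(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) break;
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) break; // truncated record at end of input: stop
            end += endTag.length();
            records.add(input.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String dump = "<mediawiki><page>A</page><page>B</page></mediawiki>";
        System.out.println(scan(dump, "<page>", "</page>"));
    }
}
```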

Please note that I have made the commits in a specific sequence. If you read from commit 4b9745f to commit d08417f, it will be easier to follow what's going on.

Also, it's worth mentioning that you'll need to clone https://github.com/nilesh-c/extraction-framework , switch to the fromxml branch and run mvn install before building this repo on this branch. I've made a PR to extraction-framework with the changes in fromxml.

@nilesh-c
Member Author

Update:

> Also, it's worth a mention that you'll need to clone https://github.com/nilesh-c/extraction-framework , switch to the fromxml branch and run mvn install before building this repo on this branch. I've made a PR to extraction-framework with the changes in fromxml.

fromxml was merged to extraction-framework master, and the branch deleted. @sangv you can use either repository now while testing.

@nilesh-c
Member Author

@sangv - looks like I misread the page from the Definitive Guide. Further down it says that Hadoop ultimately takes max(splitsize, blocksize) as the maximum split size. You're right, the maximum split size is indeed equal to the block size unless otherwise specified.

I found the real reason my jobs were failing: spark.kryoserializer.buffer.mb is set to 50 MB (it's even lower by default). When writing the outputs via SparkUtils.toLocalIterator, the memory Spark takes equals the memory needed for a single partition. Here a partition is 64 MB by default and Kryo's buffer is 50 MB, hence Kryo complains of buffer overflows.

I'll fix this and make spark.kryoserializer.buffer.mb 100 MB by default; it's configurable via spark.config.properties. I'll also remove the default mapred.max.split.size = 10 MB setting from DistConfig.
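For reference, a sketch of what the corresponding line in spark.config.properties might look like (the key is the one named above; the exact file layout is an assumption):

```properties
# Kryo serialization buffer, in MB. Must be at least as large as the
# biggest partition serialized in one go (64 MB with default splits).
spark.kryoserializer.buffer.mb=100
```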

@sangv
Contributor

sangv commented Jun 13, 2014

Cool. Thanks. I was going to discourage you from defaulting to 10 MB because it is too small and would lead to bookkeeping overheads for large files. In some testing I am doing, I see that it splits into 35 MB blocks, so there are other considerations in how it determines the split size.


@nilesh-c
Member Author

Yes, Hadoop does a lot of work to find the optimal split sizes, and it often depends on the InputFormat's implementation of getSplits, etc. On a related note, that, along with on-the-fly reading of compressed files, was my main reason for looking into wikihadoop's code.
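For reference, to the best of my understanding the rule FileInputFormat applies per file (in its computeSplitSize helper) is max(minSize, min(maxSize, blockSize)). A minimal standalone sketch of that formula:

```java
// Sketch of Hadoop's FileInputFormat split-size rule. The formula is the
// real one; the surrounding setup and the sizes below are just examples.
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20; // 64 MB HDFS block
        // With no explicit maximum, maxSize is effectively unbounded,
        // so the split size equals the block size:
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE));
        // A mapred.max.split.size of 10 MB caps the split below the block:
        System.out.println(computeSplitSize(blockSize, 1, 10L << 20));
    }
}
```

This also shows why individual InputFormats can still end up with other sizes: getSplits is free to override or refine this per-file calculation.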


This is the method that is probably being called if the others result in compile-time errors, but I'm not sure. I don't quite understand what you're trying to do or where the compile errors occur.

nilesh-c added 4 commits June 17, 2014 21:28
PR #236 simplifies the FileLike generic type. We don't need this implicit anymore.
Fixes the thread-safety problem in DistDisambiguations and DistRedirects. Make required changes in the Test.
@nilesh-c
Member Author

Added final touches and resolved most of the issues we talked about above. Sent PR #24; we need to discuss and make a couple more commits, and then milestone2 should be almost perfect.

sangv added a commit that referenced this pull request Jun 20, 2014
Many code changes to make everything work with Path and XmlInputFormat. Completes milestone 2.
@sangv sangv merged commit dec9f9f into nildev2 Jun 20, 2014
@nilesh-c nilesh-c deleted the milestone2 branch July 26, 2014 23:26