
Conversation

@nilesh-c
Member

The existing code works with Files. Hadoop works with Paths and FileSystems, which provide an abstraction over files and file systems: Hadoop encapsulates the details of distributed versus local storage and lets us work with Paths and InputFormats. This sums up most of what is in the commits.
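As a rough illustration of the File-to-Path move (this is not code from these commits, and the paths and setup are made up; the Hadoop classes and methods are the real API), the Path-based style looks like this:

```java
// Sketch: requires hadoop-common on the classpath. FileSystem resolves the
// right implementation (local, HDFS, ...) from the URI scheme, so the same
// code can read file:// and hdfs:// inputs. The path below is hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("hdfs://namenode:8020/dumps/enwiki-pages.xml");

        // Instead of new File(...): ask the Path for its FileSystem.
        FileSystem fs = input.getFileSystem(conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(input)))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Swapping the URI for a `file://` one runs the same code against the local file system, which is the point of the abstraction.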

Other than that, I added the XmlInputFormat. I tested with run-extraction-test and everything works. DistRedirectsTest passes. I added some docs too.
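To give an idea of what an XmlInputFormat does (this is an illustrative re-implementation in plain Java, not the actual class): it treats everything between a start tag and its matching end tag as one record, so each Wikipedia `<page>` element becomes a single input record.

```java
// Minimal sketch of tag-delimited record extraction, the core idea behind
// an XmlInputFormat. The real class works on byte streams and split
// boundaries; this version just scans a String.
import java.util.ArrayList;
import java.util.List;

public class TagRecordScanner {
    // Return every substring spanning startTag..endTag, inclusive.
    static List<String> scan(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) break;
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) break; // truncated record at end of input: stop
            end += endTag.length();
            records.add(input.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String dump = "<mediawiki><page>A</page><page>B</page></mediawiki>";
        System.out.println(scan(dump, "<page>", "</page>"));
    }
}
```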

Please note that I have made the commits in a specific sequence. If you read from commit 4b9745f to commit d08417f, it will be easier to follow what's going on.

Also, it's worth mentioning that you'll need to clone https://github.com/nilesh-c/extraction-framework , switch to the fromxml branch and run mvn install before building this repo on this branch. I've made a PR to extraction-framework with the changes in fromxml.

@nilesh-c
Member Author

Update:

> Also, it's worth a mention that you'll need to clone https://github.com/nilesh-c/extraction-framework , switch to the fromxml branch and run mvn install before building this repo on this branch. I've made a PR to extraction-framework with the changes in fromxml.

fromxml was merged to extraction-framework master, and the branch deleted. @sangv you can use either repository now while testing.

@nilesh-c
Member Author

@sangv - looks like I misread the page from the Definitive Guide. Further down it says that Hadoop ultimately takes max(splitsize, blocksize) as the maximum split size. You're right, the maximum split size is indeed equal to the block size unless otherwise specified.

I found the real reason my jobs were failing: spark.kryoserializer.buffer.mb is set to 50 MB (it's even lower by default). When writing the outputs via SparkUtils.toLocalIterator, the memory Spark takes equals the memory needed for a single partition. Here a partition is 64 MB by default and Kryo's buffer is 50 MB, hence Kryo complains of buffer overflows.

I'll fix this and make spark.kryoserializer.buffer.mb 100 MB by default; it's configurable via spark.config.properties. I'll also remove the default mapred.max.split.size = 10 MB setting from DistConfig.
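For reference, a sketch of what the corresponding line in spark.config.properties might look like (the key is the one named above; the exact file layout is an assumption):

```properties
# Kryo serialization buffer, in MB. Must be at least as large as the
# biggest partition serialized in one go (64 MB with default splits).
spark.kryoserializer.buffer.mb=100
```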

@sangv
Contributor

sangv commented Jun 13, 2014

Cool. Thanks. I was going to discourage you from defaulting to 10 MB because it is too small and would lead to bookkeeping overheads for large files. In some testing I am doing, I see that it splits into 35 MB blocks, so there are other considerations in how it determines the split size.


@nilesh-c
Member Author

Yes, Hadoop does a lot of work to find the optimal split sizes, and it often depends on the InputFormat's implementation of getSplits, etc. On a related note, that, along with on-the-fly reading of compressed files, was my main reason for looking into wikihadoop's code.
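For reference, to the best of my understanding the rule FileInputFormat applies per file (in its computeSplitSize helper) is max(minSize, min(maxSize, blockSize)). A minimal standalone sketch of that formula:

```java
// Sketch of Hadoop's FileInputFormat split-size rule. The formula is the
// real one; the surrounding setup and the sizes below are just examples.
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20; // 64 MB HDFS block
        // With no explicit maximum, maxSize is effectively unbounded,
        // so the split size equals the block size:
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE));
        // A mapred.max.split.size of 10 MB caps the split below the block:
        System.out.println(computeSplitSize(blockSize, 1, 10L << 20));
    }
}
```

This also shows why individual InputFormats can still end up with other sizes: getSplits is free to override or refine this per-file calculation.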


This is the method that is probably being called if the others result in compile-time errors, but I'm not sure. I don't quite understand what you're trying to do or where the compile errors occur.

nilesh-c added 4 commits June 17, 2014 21:28
PR #236 simplifies the FileLike generic type. We don't need this implicit anymore.
Fixes the thread-safety problem in DistDisambiguations and DistRedirects. Make required changes in the Test.
@nilesh-c
Member Author

Added final touches and resolved most of the issues we talked about above. Sent PR #24; we need to discuss and make a couple more commits, and then milestone2 should be almost perfect.

sangv added a commit that referenced this pull request Jun 20, 2014
Many code changes to make everything work with Path and XmlInputFormat. Completes milestone 2.
@sangv sangv merged commit dec9f9f into nildev2 Jun 20, 2014
@nilesh-c nilesh-c deleted the milestone2 branch July 26, 2014 23:26