Many code changes to make everything work with Path and XmlInputFormat. Completes milestone 2. #22
Also, Mahout's XmlInputFormat is added.
Update:
fromxml was merged to extraction-framework master, and the branch deleted. @sangv you can use either repository now while testing.
@sangv - looks like I misread the page from the Definitive Guide. Further down it says that Hadoop ultimately takes max(splitsize, blocksize) as the maximum split size. You're right, the maximum split size is indeed equal to the blocksize unless otherwise specified. I found the real reason why my jobs were failing: I'll fix this and make
Cool. Thanks. I was going to discourage you from defaulting to 10MB because -- Sang On Fri, Jun 13, 2014 at 2:32 PM, Nilesh Chakraborty <
Yes, Hadoop does a lot of stuff to find the optimal split sizes, and it often depends upon the InputFormat's implementation of |
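For reference, the split-size rule discussed in this thread can be sketched in a few lines. This is only an illustration, assuming the formula used by FileInputFormat.computeSplitSize in the new MapReduce API, max(minSize, min(maxSize, blockSize)). The class name SplitSizeSketch and the 64 MB block size are hypothetical, and nothing here is taken from the commits in this PR.

```java
public class SplitSizeSketch {
    // Assumed to mirror the rule in Hadoop's FileInputFormat.computeSplitSize:
    // the block size, capped by maxSize and floored by minSize.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb; // hypothetical HDFS block size

        // No explicit limits (min = 1, max = Long.MAX_VALUE): split size equals block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / mb); // 64

        // Maximum split size capped at 10 MB: splits shrink below the block size.
        System.out.println(computeSplitSize(blockSize, 1L, 10 * mb) / mb); // 10

        // Minimum split size raised to 128 MB: splits grow beyond the block size.
        System.out.println(computeSplitSize(blockSize, 128 * mb, Long.MAX_VALUE) / mb); // 128
    }
}
```

With no limits configured, the split size comes out equal to the block size, which matches the point made above. An InputFormat that overrides getSplits can of course behave differently.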
This is the method that is probably being called if the others result in compile-time errors. But I'm not sure. I don't quite understand what you're trying to do and where the compiler errors occur.
PR #236 simplifies the FileLike generic type. We don't need this implicit anymore.
Fixes the thread-safety problem in DistDisambiguations and DistRedirects. Make required changes in the Test.
Added final touches, resolved most of the issues we talked about above. Sent a PR @ #24 - need to discuss and make a couple more commits and milestone2 should be almost perfect.
Many code changes to make everything work with Path and XmlInputFormat. Completes milestone 2.
The existing code works with Files. Hadoop works with Paths and a FileSystem, which provide an abstraction over files and file systems. That is to say, Hadoop encapsulates the details of distributed or local files and file systems and lets us work with Paths and InputFormats. This sums up most of the changes in the commits.
Other than that, I added the XmlInputFormat. I tested with run-extraction-test and everything works. DistRedirectsTest passes. I added some docs too.
Please note that I have taken care to make the commits in a specific sequence. If you make your way from commit 4b9745f to commit d08417f it'll be easier to understand what's going on.
Also, it's worth a mention that you'll need to clone https://github.com/nilesh-c/extraction-framework, switch to the fromxml branch and run mvn install before building this repo on this branch. I've made a PR to extraction-framework with the changes in fromxml.