GitHub - tedxia/nativetask: Hadoop task level native runtime

Introduction

NativeTask is a high-performance C++ API and runtime for Hadoop MapReduce. It is called NativeTask because it is a native computing unit that focuses only on data processing, which is exactly what a Task does in the Hadoop MapReduce context. In other words, NativeTask is not responsible for resource management, job scheduling, or fault tolerance. Those are still managed by the original Hadoop components, unchanged. But the actual data processing and computation, which consume most of the cluster's resources, are delegated to this highly efficient data processing unit.

NativeTask is designed to be very fast, with a native C++ API, so that more efficient data analysis applications can be built on it, such as the LLVM-based query execution engine mentioned in Google's Tenzing. This is in fact the main objective of NativeTask: to provide an efficient native Hadoop framework on which much more efficient data analysis tools can be built:

Data warehousing tools that use the state-of-the-art query execution techniques found in parallel DBMSs, such as compression, vectorization, and dynamic compilation. These techniques are easier to implement in native code, and most existing implementations are written in C/C++, for example Vectorwise and Vertica.
High-performance data mining and machine learning libraries. Most of these algorithms are CPU intensive and involve a lot of numerical computation, or have already been implemented in native languages. A native runtime allows better performance and makes it easier to port these algorithms to Hadoop.

From the user's perspective, NativeTask is a lot like Hadoop Pipes: using the header files and dynamic libraries provided in the NativeTask library, you compile your application or class library to a dynamic library rather than an executable program (because NativeTask uses JNI), then use a Submitter tool to submit your job to the Hadoop cluster, as streaming or pipes do. For more information, please read the design document and the examples in src/main/native/examples.
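As a rough illustration of what "compile your class library to a dynamic library" can look like, the sketch below defines a mapper class and exports C-linkage factory functions that a host process could look up after loading the library. This is not NativeTask's actual API: `Mapper`, `UpperCaseMapper`, `CreateMapper`, and `DestroyMapper` are hypothetical names chosen for this example, and the real headers and registration mechanism are the ones described in the design document and in src/main/native/examples.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical key/value output collector, for illustration only.
using KeyValueList = std::vector<std::pair<std::string, std::string>>;

// Hypothetical mapper interface; NativeTask's real base classes may differ.
class Mapper {
public:
  virtual ~Mapper() = default;
  virtual void map(const std::string& key, const std::string& value,
                   KeyValueList& out) = 0;
};

// Example user class: emits each record with its value upper-cased.
class UpperCaseMapper : public Mapper {
public:
  void map(const std::string& key, const std::string& value,
           KeyValueList& out) override {
    std::string upper(value);
    for (char& c : upper) {
      c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    out.emplace_back(key, upper);
  }
};

// C-linkage entry points keep the symbol names unmangled, so a host that
// loads the shared library can find them by name. Names are hypothetical.
extern "C" Mapper* CreateMapper() { return new UpperCaseMapper(); }
extern "C" void DestroyMapper(Mapper* mapper) { delete mapper; }
```

With GCC, a file like this (assumed here to be called `myapp.cpp`) can be built as a shared library rather than an executable with `g++ -shared -fPIC -o libmyapp.so myapp.cpp`.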

Features

High performance, making your Hadoop cluster more cost effective;
C++ API, so users can develop native applications or apply more aggressive optimizations that are not available or convenient in Java, such as SSE/AVX instructions, LLVM, GPU computing, and coprocessors;
Support for no-sort jobs: by removing the sort, the shuffle stage barrier can be eliminated, yielding better data processing throughput;
Support for a foldl-style API, which is much faster for aggregation queries;
Binary-based MapReduce API, with no serialization/deserialization overhead;
Compatible with Hadoop 0.20-0.23 (requires the task-delegation patch).
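To make the foldl-style and no-sort items above more concrete, here is a minimal, self-contained sketch of the general idea. It does not use NativeTask's API: `Record`, `reduceGrouped`, and `foldByKey` are hypothetical names for this example. The first function collects every value for a key before reducing, as an iterator-style reduce does. The second folds each record into a per-key accumulator as it arrives, so it needs neither sorted input nor a buffered list of values.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type used only for this illustration.
struct Record {
  std::string key;
  uint64_t value;
};

// Iterator-style reduce: gather all values per key, then reduce each group.
std::unordered_map<std::string, uint64_t> reduceGrouped(
    const std::vector<Record>& records) {
  std::unordered_map<std::string, std::vector<uint64_t>> groups;
  for (const auto& r : records) {
    groups[r.key].push_back(r.value);
  }
  std::unordered_map<std::string, uint64_t> result;
  for (const auto& g : groups) {
    uint64_t sum = 0;
    for (uint64_t v : g.second) {
      sum += v;
    }
    result[g.first] = sum;
  }
  return result;
}

// Fold-style aggregation: update one accumulator per key as records arrive.
// Input order does not matter, so no sort step is required.
template <typename Acc, typename Fn>
std::unordered_map<std::string, Acc> foldByKey(
    const std::vector<Record>& records, Acc init, Fn fn) {
  std::unordered_map<std::string, Acc> accumulators;
  for (const auto& r : records) {
    auto it = accumulators.find(r.key);
    if (it == accumulators.end()) {
      it = accumulators.emplace(r.key, init).first;
    }
    it->second = fn(it->second, r.value);
  }
  return accumulators;
}
```

For a sum, calling `foldByKey(records, uint64_t{0}, [](uint64_t acc, uint64_t v) { return acc + v; })` produces the same totals as `reduceGrouped(records)` while keeping only one accumulator per key in memory.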

Notice

This project is currently at a very early stage and is not well documented. If you are familiar with Hadoop MapReduce, you can dig into the source code. For more information, please read the design document.

You can also find some discussion in the Hadoop JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-2841

