TeraSort-on-Cloud

TeraSort on Hadoop/Spark, Shared-Memory: Parallel external sort

To run the shared-memory sort:

python shared_memory_sort.py number_of_processes input_file_name output_file_name -b block_size -k key -t temp_path

e.g. python shared_memory_sort.py 3 input output -b 55M -k "line[0:10]" -t /mnt

Args	Value
0	number of processes
1	input file
2	output file
-b	block size, e.g. 50M or 16G
-k	sort key, given as an expression on each line, e.g. "line[0:10]"
-t	temporary file path
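
The -b and -k values above can be illustrated with a small sketch. This is not taken from shared_memory_sort.py; the helper names parse_block_size and key_from_slice are hypothetical, and treating K/M/G as powers of 1024 is an assumption.

```python
import re

# Assumption: suffixes are binary multiples (1M = 1024 * 1024 bytes).
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_block_size(text):
    """Convert a size string such as '55M' or '16G' into a byte count."""
    match = re.fullmatch(r"(\d+)([KMGT]?)B?", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized block size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def key_from_slice(start, stop):
    """Return a key function equivalent to the expression line[start:stop]."""
    return lambda line: line[start:stop]


if __name__ == "__main__":
    print(parse_block_size("55M"))                        # 57671680
    print(key_from_slice(0, 10)("0123456789 payload"))    # 0123456789
```

With gensort data, the first 10 bytes of each record are the sort key, which is why the example command uses "line[0:10]".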

We assume each line (record) is 100 bytes. Generate your data with the gensort tool from Sort Benchmark. Each line must be no longer than 64KB; modify the code if you need longer lines.
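
For readers new to external sorting, here is a minimal single-process sketch of the general technique: sort blocks that fit in memory into temporary run files, then merge the runs. It is not the repository's implementation and it does not parallelize anything. The function name external_sort and its parameters are hypothetical, and it assumes every line ends with a newline, as gensort -a output does.

```python
import heapq
import os
import tempfile


def external_sort(input_path, output_path, block_bytes, key, temp_dir=None):
    """Sort the lines of input_path into output_path using bounded memory."""
    run_paths = []
    with open(input_path, "rb") as src:
        while True:
            # readlines(hint) stops once roughly block_bytes have been read.
            block = src.readlines(block_bytes)
            if not block:
                break
            block.sort(key=key)
            # Write each sorted block to its own temporary run file.
            fd, path = tempfile.mkstemp(dir=temp_dir, suffix=".run")
            with os.fdopen(fd, "wb") as run:
                run.writelines(block)
            run_paths.append(path)

    # K-way merge of the sorted runs; only one line per run is held in memory.
    runs = [open(path, "rb") for path in run_paths]
    try:
        with open(output_path, "wb") as dst:
            dst.writelines(heapq.merge(*runs, key=key))
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)
```

A call such as external_sort("input", "output", 55 * 1024 * 1024, lambda line: line[0:10], temp_dir="/mnt") mirrors the example command above. This sketch opens every run file at once, so a very small block size on a very large input could exceed the operating system's open-file limit.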

To generate data:

./gensort -a num_of_records filename

e.g. for 1TB: ./gensort -a 10000000000 testfile
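
The record count follows from the 100-byte record size: divide the target size in bytes by 100. A small sketch of that arithmetic, where the helper name records_for_size is hypothetical and 1TB is taken as 10^12 bytes:

```python
RECORD_BYTES = 100  # record length assumed throughout this README


def records_for_size(total_bytes):
    """Return how many 100-byte records fit in total_bytes."""
    return total_bytes // RECORD_BYTES


if __name__ == "__main__":
    print(records_for_size(10**12))  # 10000000000, the value used for 1TB
    print(records_for_size(10**9))   # 10000000 records for 1GB
```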

To verify sorted data:

./valsort outputFileName
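
For a quick order check without valsort, the following sketch scans a file and confirms that keys never decrease. It is only an illustration and not a substitute for valsort; the function name is_sorted is hypothetical, and the default key assumes the first 10 bytes of each line are the sort key.

```python
def is_sorted(path, key=lambda line: line[0:10]):
    """Return True if the keys of consecutive lines never decrease."""
    previous = None
    with open(path, "rb") as f:
        for line in f:
            current = key(line)
            if previous is not None and current < previous:
                return False
            previous = current
    return True
```

The file is read in binary mode so keys are compared byte by byte.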

For more information: Please read.

Please note that we have removed some configuration files and scripts from the public version.
