Skip to content
Ondřej Moravčík edited this page Jun 12, 2015 · 16 revisions

There are several ways how to configure ruby-spark and spark. Configuration can be changed before creating context. After that config is read-only.

By environment variable

SPARK_RUBY_SERIALIZER="oj" ruby-spark shell

By property file

Content of property file:

# This is just a comment
spark.ruby.serializer              oj
spark.ruby.serializer.batch_size   4096
spark.ruby.executor.options        -W1
# For shell
ruby-spark shell --properties-file conf.conf
# In ruby
Spark.config.from_file(FILE_PATH)

Global property file

The ~/.ruby-spark.conf contains default configuration which is always loaded. File is automatically created during the first run. There is for example path to target folder.

Direct

This muts be done before starting.

Spark.config do
  set_app_name 'RubySpark'
  set_master 'local[*]'
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 100
end

Spark.config.set('spark.ruby.serializer.batch_size', 100)

During data uploading

sc.parallelize(1..10, 3, serializer)

Ruby configuration in Spark

Key Default value Description
spark.ruby.serializer marshal Default serializer

marshal: ruby's default (slowest but can serialize everything)
oj: faster than marshal but doesn't work on jruby
message_pack: fastest but cannot serialize large numbers and some objects
spark.ruby.serializer.batch_size 1024 Number of items which will be serialized and send as one item.
If size should be calculated automatically use: auto
spark.ruby.serializer.compress false Compress serilized bytes
spark.ruby.worker.type process Type of workers.

process: new workers are created by fork function
thread: worker is represented as thread
spark.ruby.executor.command %s Command template for ruby script execution. Can be useful if you are using some ruby version manager.

Rbenv:
bash --norc -i -c "export HOME=/home/user; cd; source .bashrc; %s"

Template must contain '%s' which represent origin ruby command.
spark.ruby.executor.options Ruby options for scripts.
spark.ruby.executor.env.[VARIABLE] Environment variables for ruby scripts.
Clone this wiki locally