Customize Spark to run partitions on a subset of workers. Although Spark can define preferred locations for partitions, there is no way to force partitions onto a specific set of workers.
This repo is an attempt to add that capability.
Suppose there is a large Spark cluster with a mixture of machine types, and the UDFs being run have certain expectations about the machine they execute on (e.g. GPU vs. no GPU). Being able to run on particular machines is critical, and it allows a single Spark pipeline to transition between machine types.
- Create a new RDD that allows manual setting of preferredLocations (see the sketch after this list)
- Create the notion of a FixedLocation, which prevents a task from running on a node that is not preferred
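A minimal sketch of the first idea, assuming a wrapper RDD that carries a caller-supplied host list per partition (the class and parameter names here are illustrative, not the repo's actual code):

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Wraps a parent RDD and pins each partition to the supplied hosts by
// overriding getPreferredLocations.
class PreferredLocationsRDD[T: ClassTag](
    prev: RDD[T],
    locations: Seq[Seq[String]]) // one host list per partition index
  extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    locations(split.index)
}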
To help with bookkeeping, all workers are assigned a SPARK_TAGS environment variable.
A shell script (worker-info.sh) is executed at runtime to associate tags with Worker IP addresses.
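For illustration only (this is not how worker-info.sh does it), the same bookkeeping can be pictured as a small Spark job that reads SPARK_TAGS on each executor and pairs it with that executor's IP address; collectWorkerTags below is a hypothetical helper, and it assumes enough probe tasks to touch every worker:

import org.apache.spark.SparkContext

// Hypothetical helper: run small probe tasks across the cluster and report
// each worker's IP address together with its SPARK_TAGS value.
def collectWorkerTags(sc: SparkContext, probes: Int = 100): Map[String, String] =
  sc.parallelize(1 to probes, probes)
    .mapPartitions { _ =>
      val ip   = java.net.InetAddress.getLocalHost.getHostAddress
      val tags = sys.env.getOrElse("SPARK_TAGS", "")
      Iterator(ip -> tags)
    }
    .collect()
    .toMap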
Two Spark 2.4.4 source files were modified; *.ORIG copies of the originals are included for reference.
Add a new location subtype, FixedHostTaskLocation.
This class is created when a preferred location string begins with '*'.
When a task set has a FixedHostTaskLocation, do not allow locality to exceed NODE_LOCAL.
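As a hedged sketch of what the TaskLocation change could look like (the real diff lives in the modified Spark sources; everything beyond the FixedHostTaskLocation name and the '*' prefix is assumed):

// New subtype added alongside Spark's existing HostTaskLocation.
private[spark] case class FixedHostTaskLocation(override val host: String)
    extends TaskLocation {
  // A leading '*' marks a host that tasks must not leave.
  override def toString: String = "*" + host
}

// TaskLocation.apply(str) would then pick the subtype from the prefix,
// e.g. (assumed):
//   if (str.startsWith("*")) FixedHostTaskLocation(str.stripPrefix("*"))
//   else HostTaskLocation(str)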
Example usage (lines marked // NEW use the added API):

import org.apache.spark.sql.functions.udf
import spark.implicits._ // needed outside spark-shell for toDF and the 'column syntax
val cpu_only = udf((v: String) => v)
val gpu_only = udf((v: String) => v)
val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")
import example.extend_spark_api._ // NEW
val cpu = sc.getWorkersFromTag("CPU") // NEW
val gpu = sc.getWorkersFromTag("GPU") // NEW
dataset
  .repartitionTo(cpu) // NEW
  .withColumn("from_cpu", cpu_only('text))
  .repartitionTo(gpu) // NEW
  .withColumn("from_gpu", gpu_only('from_cpu))
  .show

A docker-compose file is provided to simulate a heterogeneous cluster.
docker-compose up
sbt test