
Sparpy: A Spark entrypoint for Python

Changelog

v0.5.5

  • Added --proxy option to set a proxy for accessing Python package repositories.

v0.5.4

  • Added plugin-env section to the configuration file, making it possible to set environment variables for the plugin download process.
  • Added --plugin-env option (and its associated environment variable SPARPY_PLUGIN_ENVVARS) to set environment variables for the plugin download process. This can be necessary in some cases when using conda environments.
  • Added environment variable SPARPY_CONFIG for the --config option.
  • Added environment variable SPARPY_DEBUG for the --debug option.

v0.5.3

  • Fix isparpy.

v0.5.2

  • Fixed all packages being ignored when the exclude-packages list is empty.

v0.5.1

  • Fix Python package regex.
  • Fix download script.

v0.5.0

  • Added --exclude-python-packages option to exclude Python packages.
  • Improved parsing of plugin names.
  • Added --exclude-packages option to exclude Spark packages.

v0.4.5

  • Fixed isparpy entrypoint; it now allows the --class parameter.
  • Allow setting constraints files.

v0.4.4

  • Don't set default values for master and deploy_mode.

v0.4.3

  • Fix sparpy-submit entrypoint.
  • Fix --property-file option.
  • Fix --class option.

v0.4.2

  • Environment variables can now be used for most options.

v0.4.1

  • Support setting pip options as configuration using --conf sparpy.config-key=value, so that sparpy-submit can be used in EMR-on-EKS images.
  • Allow --class, so that sparpy-submit can be used in EMR-on-EKS images.
  • Allow --property-file, so that sparpy-submit can be used in EMR-on-EKS images.

v0.4.0

  • Added --pre option to allow pre-release packages.
  • Added --env option to set environment variables for the Spark process.
  • Added spark-env config section to set environment variables for the Spark process.
  • Write pip output when it fails.
  • Fixed problems with interactive sparpy.
  • Fixed no-self option in config file.
  • Allow plugins that don't use click. They must be callable with a single argument of type Sequence[str], which receives the plugin's command-line arguments.
  • Added --version option to print the sparpy version.
  • Fixed error when a plugin requires a package that is already installed but whose version does not satisfy the requirement.
  • Sparpy no longer prints an error traceback when a subprocess fails.

v0.3.0

  • Enable --force-download option.
  • Added --find-links option to use a directory as a package repository.
  • Added --no-index option to avoid using external package repositories.
  • Added --queue option to set the YARN queue.
  • Ensure the driver's Python executable is the same Python as sparpy's.
  • Added new entry point sparpy-download to download packages to a specific directory.
  • Added new entry point isparpy to start an interactive session.

v0.2.1

  • Force the PySpark Python executable to be the same as sparpy's.
  • Fix unrecognized options.
  • Fix default configuration file names.

v0.2.0

  • Added configuration file option.
  • Added --debug option.

How to build a Sparpy plugin

In your package's setup.py, an entry point should be configured for Sparpy:

setup(
    name='yourpackage',
    ...

    entry_points={
        ...
        'sparpy.cli_plugins': [
            'my_command_1=yourpackage.module:command_1',
            'my_command_2=yourpackage.module:command_2',
        ]
    }
)
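
As noted in the v0.4.0 changelog, a plugin command does not have to use click: any callable that accepts a single Sequence[str] argument also works. Below is a minimal sketch of what yourpackage/module.py could contain; the command names, options, and bodies are illustrative, not part of Sparpy's API:

# yourpackage/module.py -- illustrative sketch
from typing import Sequence

import click


@click.command()
@click.option('--myparam', type=int, default=1)
def command_1(myparam):
    """A click-based plugin command."""
    print(f'command_1 called with myparam={myparam}')


def command_2(args: Sequence[str]) -> None:
    """A plain callable plugin (supported since v0.4.0); it receives
    the plugin's command-line arguments as a single Sequence[str]."""
    print(f'command_2 called with args={list(args)}')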

Note

Avoid using PySpark as a requirement, so the package is not downloaded from PyPI.
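
One way to follow this advice, sketched here as a suggestion rather than a prescribed layout, is to keep pyspark out of install_requires and expose it only as an optional extra for local development. The 'click' dependency and the 'dev' extra name are illustrative:

from setuptools import setup

setup(
    name='yourpackage',
    # Runtime requirements only; pyspark is intentionally omitted so that
    # sparpy does not download it from PyPI on the edge node.
    install_requires=[
        'click',
    ],
    extras_require={
        # Hypothetical extra for developing/testing outside the cluster,
        # where a locally installed pyspark is acceptable.
        'dev': ['pyspark'],
    },
)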

Install

It must be installed on a Spark edge node.

$ pip install sparpy[base]

How to use

Using default Spark submit parameters:

$ sparpy --plugin "mypackage>=0.1" my_plugin_command --myparam 1
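
The options documented in the changelog can be combined in a single invocation. The flags below are all documented above; the concrete values are illustrative:

$ sparpy --config ~/.sparpyrc \
    --plugin "mypackage>=0.1" \
    --env MY_ENV_VAR=value \
    --queue my_queue \
    my_plugin_command --myparam 1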

Configuration files

sparpy and sparpy-submit accept the --config parameter, which allows setting a configuration file. If it is not set, sparpy tries to use the configuration file $HOME/.sparpyrc. If that does not exist, it tries /etc/sparpy.conf.

Format:

[spark]

master=yarn
deploy-mode=client

queue=my_queue

spark-executable=/path/to/my-spark-submit
conf=
    spark.conf.1=value1
    spark.conf.2=value2

packages=
    maven:package_1:0.1.1
    maven:package_2:0.6.1

repositories=
    https://my-maven-repository-1.com/mvn
    https://my-maven-repository-2.com/mvn

reqs_paths=
    /path/to/dir/with/python/packages_1
    /path/to/dir/with/python/packages_2

[spark-env]

MY_ENV_VAR=value

[plugins]

extra-index-urls=
    https://my-pypi-repository-1.com/simple
    https://my-pypi-repository-2.com/simple

cache-dir=/path/to/cache/dir

plugins=
    my-package1
    my-package2==0.1.2

requirements-files=
    /path/to/requirement-1.txt
    /path/to/requirement-2.txt

find-links=
    /path/to/directory/with/packages_1
    /path/to/directory/with/packages_2

download-dir-prefix=my_prefix_

no-index=false
no-self=false
force-download=true

[plugin-env]

MY_ENV_VAR=value

[interactive]

pyspark-executable=/path/to/pyspark
python-interactive-driver=/path/to/interactive/driver
