align databricks-iris template to work with kedro-databricks (#227)
* improve readme

Signed-off-by: Jens Peder Meldgaard <[email protected]>

* add base dependencies for spark and pandas

Signed-off-by: Jens Peder Meldgaard <[email protected]>

* fix dbfs file paths and remove use of MemoryDataset

Signed-off-by: Jens Peder Meldgaard <[email protected]>

* add option to specify nodes in run

Signed-off-by: Jens Peder Meldgaard <[email protected]>

* not doing any transcoding

Signed-off-by: Jens Peder Meldgaard <[email protected]>

* now also handles the None case

Signed-off-by: Jens Peder Meldgaard <[email protected]>

---------

Signed-off-by: Jens Peder Meldgaard <[email protected]>
JenspederM committed Jul 19, 2024
1 parent fb599e7 commit a0fbc12
Showing 4 changed files with 47 additions and 18 deletions.
27 changes: 27 additions & 0 deletions databricks-iris/{{ cookiecutter.repo_name }}/README.md
@@ -6,6 +6,33 @@ This is your new Kedro project, which was generated using `kedro {{ cookiecutter.kedro_version }}`

Take a look at the [Kedro documentation](https://docs.kedro.org) to get started.

## Getting Started

To create a project based on this starter, ensure you have installed Kedro into a virtual environment. Then use the following command:

```sh
pip install kedro
kedro new --starter=databricks-iris
```

After the project is created, navigate to the project directory:

```sh
cd <my-project-name> # change directory
```

Install the required dependencies:

```sh
pip install -r requirements.txt
```

Now you can run the project:

```sh
kedro run
```

## Rules and guidelines

In order to get the best out of the template:
29 changes: 13 additions & 16 deletions databricks-iris/{{ cookiecutter.repo_name }}/conf/base/catalog.yml
@@ -44,7 +44,7 @@

 example_iris_data:
   type: spark.SparkDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/01_raw/iris.csv
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/01_raw/iris.csv
   file_format: csv
   load_args:
     header: True
@@ -56,48 +56,45 @@ example_iris_data:
 # for all SparkDatasets.
 X_train@pyspark:
   type: spark.SparkDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/X_train.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/X_train.parquet
   save_args:
     mode: overwrite

 X_train@pandas:
   type: pandas.ParquetDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/X_train.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/X_train.parquet

 X_test@pyspark:
   type: spark.SparkDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/X_test.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/X_test.parquet
   save_args:
     mode: overwrite

 X_test@pandas:
   type: pandas.ParquetDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/X_test.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/X_test.parquet

 y_train@pyspark:
   type: spark.SparkDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/y_train.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/y_train.parquet
   save_args:
     mode: overwrite

 y_train@pandas:
   type: pandas.ParquetDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/y_train.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/y_train.parquet

 y_test@pyspark:
   type: spark.SparkDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/y_test.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/y_test.parquet
   save_args:
     mode: overwrite

 y_test@pandas:
   type: pandas.ParquetDataset
-  filepath: /dbfs/FileStore/iris-databricks/data/02_intermediate/y_test.parquet
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/02_intermediate/y_test.parquet

+y_pred:
+  type: pandas.ParquetDataset
+  filepath: /dbfs/FileStore/{{ cookiecutter.python_package }}/data/03_primary/y_pred.parquet

-# This is an example how to use `MemoryDataset` with Spark objects that aren't `DataFrame`'s.
-# In particular, the `assign` copy mode ensures that the `MemoryDataset` will be assigned
-# the Spark object itself, not a deepcopy version of it, since deepcopy doesn't work with
-# Spark object generally.
-example_classifier:
-  type: MemoryDataset
-  copy_mode: assign
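The `@pyspark` / `@pandas` suffixes above use Kedro's dataset transcoding: both entries point at the same parquet file under DBFS, so one node can write it as a Spark DataFrame and a downstream node can read it back as a pandas DataFrame. Below is a minimal sketch of how such entries are typically wired into a pipeline; the function bodies, the `species` target column, and the node names are illustrative assumptions, not the starter's exact code.

```python
import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline


def split_data(data):
    # 'data' arrives as a Spark DataFrame via the spark.SparkDataset entry.
    train, test = data.randomSplit([0.8, 0.2], seed=42)
    X_train = train.drop("species")   # assumed target column name
    X_test = test.drop("species")
    y_train = train.select("species")
    y_test = test.select("species")
    # Returned Spark DataFrames are written through the *@pyspark entries.
    return X_train, X_test, y_train, y_test


def make_predictions(X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: pd.DataFrame) -> pd.DataFrame:
    # The same parquet files are re-read as pandas DataFrames via the *@pandas entries.
    # Illustrative baseline only: predict the most frequent training label for every test row.
    most_common = y_train.iloc[:, 0].mode().iloc[0]
    return pd.DataFrame({"prediction": [most_common] * len(X_test)}, index=X_test.index)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                split_data,
                inputs="example_iris_data",
                outputs=["X_train@pyspark", "X_test@pyspark", "y_train@pyspark", "y_test@pyspark"],
                name="split",
            ),
            node(
                make_predictions,
                inputs=["X_train@pandas", "X_test@pandas", "y_train@pandas"],
                outputs="y_pred",
                name="make_predictions",
            ),
        ]
    )
```

Because `X_train@pyspark` and `X_train@pandas` share a filepath, Kedro treats them as the same dataset for lineage purposes while letting each node choose its load/save backend.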
2 changes: 1 addition & 1 deletion databricks-iris/{{ cookiecutter.repo_name }}/requirements.txt
@@ -2,7 +2,7 @@ ipython>=8.10
 jupyterlab>=3.0
 notebook
 kedro~={{ cookiecutter.kedro_version }}
-kedro-datasets[spark.SparkDataset, pandas.ParquetDataset]>=1.0
+kedro-datasets[spark, pandas, spark.SparkDataset, pandas.ParquetDataset]>=1.0
 kedro-telemetry>=0.3.1
 numpy~=1.21
 pytest-cov~=3.0
7 changes: 6 additions & 1 deletion databricks-iris/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/databricks_run.py
@@ -10,19 +10,24 @@ def main():
     parser.add_argument("--env", dest="env", type=str)
     parser.add_argument("--conf-source", dest="conf_source", type=str)
     parser.add_argument("--package-name", dest="package_name", type=str)
+    parser.add_argument("--nodes", dest="nodes", type=str)

     args = parser.parse_args()
     env = args.env
     conf_source = args.conf_source
     package_name = args.package_name
+    nodes = [node.strip() for node in args.nodes.split(",")]

     # https://kb.databricks.com/notebooks/cmd-c-on-object-id-p0.html
     logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)
     logging.getLogger("py4j.py4j.clientserver").setLevel(logging.ERROR)

     configure_project(package_name)
     with KedroSession.create(env=env, conf_source=conf_source) as session:
-        session.run()
+        if not nodes:
+            session.run()
+        else:
+            session.run(node_names=nodes)


 if __name__ == "__main__":
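For reference, a complete version of this run script after the change could look like the sketch below. The explicit guard for a missing `--nodes` argument follows the commit message ("now also handles the None case") and is an assumption rather than a verbatim copy of the committed file; the file name and paths in the example invocation are placeholders.

```python
import argparse
import logging

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

# Example invocation (all values are placeholders):
#   python databricks_run.py --env databricks \
#       --conf-source /dbfs/FileStore/<package_name>/conf \
#       --package-name <package_name> --nodes "split,make_predictions"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", dest="env", type=str)
    parser.add_argument("--conf-source", dest="conf_source", type=str)
    parser.add_argument("--package-name", dest="package_name", type=str)
    parser.add_argument("--nodes", dest="nodes", type=str)

    args = parser.parse_args()

    # If --nodes is omitted, args.nodes is None; only split the string when a value exists.
    nodes = [node.strip() for node in args.nodes.split(",")] if args.nodes else []

    # Silence noisy py4j logging on Databricks clusters.
    # https://kb.databricks.com/notebooks/cmd-c-on-object-id-p0.html
    logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)
    logging.getLogger("py4j.py4j.clientserver").setLevel(logging.ERROR)

    configure_project(args.package_name)
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        if nodes:
            # Run only the requested nodes.
            session.run(node_names=nodes)
        else:
            # No node filter given: run the default pipeline end to end.
            session.run()


if __name__ == "__main__":
    main()
```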
