Merged
4 changes: 3 additions & 1 deletion MANIFEST.in
@@ -10,11 +10,12 @@ exclude *.ini
recursive-include brainglobe_workflows *.py
recursive-include brainglobe_workflows/configs *.json
recursive-include benchmarks *.py
recursive-exclude benchmarks/results *
include asv.conf.json

recursive-exclude * __pycache__
recursive-exclude * *.py[co]
recursive-exclude benchmarks/results *
recursive-exclude benchmarks/html *

global-include *.pxd

@@ -24,3 +25,4 @@ prune resources

prune .github
prune .tox
prune .asv
35 changes: 29 additions & 6 deletions README.md
@@ -68,12 +68,35 @@ See our [blog post](https://brainglobe.info/blog/version1/cellfinder-core-and-pl

## Developer documentation

This repository also includes workflow scripts that are benchmarked to support code development.
These benchmarks are run regularly to ensure performance is stable, as the tools are developed and extended.

- Developers can install these benchmarks locally via `pip install .[dev]`. By executing `asv run`, the benchmarks will run with default parameters on a small dataset that is downloaded from [GIN](https://gin.g-node.org/G-Node/info/wiki). See [the asv docs](https://asv.readthedocs.io/en/v0.6.1/using.html#running-benchmarks) for further details on how to run benchmarks.
- Developers can also run these benchmarks on data they have stored locally, by specifying the relevant paths in an input (JSON) file.
- We also maintain an internal runner that benchmarks the workflows on a large exemplar dataset, of the scale we expect users to handle. The results of these benchmarks are made publicly available.
This repository also includes code to benchmark typical workflows.
These benchmarks are meant to be run regularly, to ensure performance is stable as the tools are developed and extended.

There are three main ways in which these benchmarks can be useful to developers:
1. Developers can run the available benchmarks locally on a small test dataset.

To do so:
- Install the developer version of the package:
```
pip install .[dev]
```
This is mostly for convenience: the `[dev]` specification includes `asv` as a dependency, but an environment with only `asv` installed would be sufficient to run the benchmarks. This is because `asv` creates its own virtual environment for the benchmarks, building and installing the relevant version of the `brainglobe-workflows` package in it. By default, the version at the tip of the currently checked-out branch is installed.
- Run the benchmarks:
```
asv run
```
This will run the locally defined benchmarks with the default parameters defined at `brainglobe_workflows/configs/cellfinder.json`, on a small dataset downloaded from [GIN](https://gin.g-node.org/G-Node/info/wiki). See the [asv docs](https://asv.readthedocs.io/en/v0.6.1/using.html#running-benchmarks) for further guidance on how to run benchmarks.
1. Developers can also run these benchmarks on data they have stored locally.

To do so:
- Define a config file for the workflow to benchmark. You can use the default one at `brainglobe_workflows/configs/cellfinder.json` for reference.
- Ensure your config file includes an `input_data_dir` field pointing to the data of interest.
- Edit the names of the signal and background directories if required. By default, they are assumed to be in `signal` and `background` subdirectories under `input_data_dir`. However, these defaults can be overridden with the `signal_subdir` and `background_subdir` fields.
- Run the benchmarks, passing the path to your config file in the `CELLFINDER_CONFIG_PATH` environment variable (this is the variable read by the benchmarks). On Unix systems:
```
CELLFINDER_CONFIG_PATH=/path/to/your/config/file asv run
```
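As an illustration, the relevant fields in such a config file might look roughly as follows. The paths and subdirectory names here are hypothetical, and the remaining fields (elided below) should be copied from the default config at `brainglobe_workflows/configs/cellfinder.json`:

```
{
    "input_data_dir": "/home/user/data/brain_sample",
    "signal_subdir": "ch00_signal",
    "background_subdir": "ch01_background",
    ...
}
```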

1. We also plan to run the benchmarks on an internal runner, using a larger dataset of the scale we expect users to be handling. The results of these benchmarks will be made publicly available.

Contributions to BrainGlobe are more than welcome.
Please see the [developer guide](https://brainglobe.info/developers/index.html).
16 changes: 8 additions & 8 deletions asv.conf.json
@@ -4,15 +4,15 @@
"version": 1,

// The name of the project being benchmarked
"project": "brainglobe_workflows",
"project": "../brainglobe-workflows",

// The project's homepage
"project_url": "https://github.com/brainglobe/brainglobe-workflows",

// The URL or local path of the source code repository for the
// project being benchmarked
// "repo": ".",
"repo": "https://github.com/brainglobe/brainglobe-workflows",
"repo": ".",
// "repo": "https://github.com/brainglobe/brainglobe-workflows.git",

// The Python project's subdirectory in your repo. If missing or
// the empty string, the project is assumed to be located at the root
@@ -40,14 +40,14 @@

// List of branches to benchmark. If not provided, defaults to "master"
// (for git) or "default" (for mercurial).
"branches": ["smg/tests-refactor"], // for git
"branches": ["HEAD"], // for git
// "branches": ["default"], // for mercurial

// The DVCS being used. If not set, it will be automatically
// determined from "repo" by looking at the protocol in the URL
// (if remote), or by looking for special directories, such as
// ".git" (if local).
"dvcs": "git",
// "dvcs": "git",

// The tool to use to create environments. May be "conda",
// "virtualenv", "mamba" (above 3.8)
@@ -147,19 +147,19 @@

// The directory (relative to the current directory) that benchmarks are
// stored in. If not provided, defaults to "benchmarks"
"benchmark_dir": "brainglobe_benchmarks",
"benchmark_dir": "benchmarks",

// The directory (relative to the current directory) to cache the Python
// environments in. If not provided, defaults to "env"
"env_dir": ".asv/env",

// The directory (relative to the current directory) that raw benchmark
// results are stored in. If not provided, defaults to "results".
"results_dir": "brainglobe_benchmarks/results",
"results_dir": "benchmarks/results",

// The directory (relative to the current directory) that the html tree
// should be written to. If not provided, defaults to "html".
"html_dir": "brainglobe_benchmarks/html",
"html_dir": "benchmarks/html",

// The number of characters to retain in the commit hashes.
// "hash_length": 8,
136 changes: 91 additions & 45 deletions benchmarks/cellfinder_core.py
@@ -1,11 +1,12 @@
import json
import os
import shutil
from pathlib import Path

import pooch
from brainglobe_utils.IO.cells import save_cells
from cellfinder.core.main import main as cellfinder_run
from cellfinder.core.tools.IO import read_with_dask
from cellfinder.core.tools.prep import prep_models

from brainglobe_workflows.cellfinder_core.cellfinder_core import (
CellfinderConfig,
@@ -17,7 +18,7 @@
from brainglobe_workflows.utils import DEFAULT_JSON_CONFIG_PATH_CELLFINDER


class TimeBenchmarkPrepGIN:
class TimeBenchmark:
"""

A base class for timing benchmarks for the cellfinder workflow.
@@ -78,21 +79,25 @@ class TimeBenchmarkPrepGIN:
sample_time = 0.01 # default: 10 ms = 0.01 s;
min_run_count = 2 # default:2

# Custom attributes
input_config_path = str(DEFAULT_JSON_CONFIG_PATH_CELLFINDER)
# Input config file
# use environment variable CELLFINDER_CONFIG_PATH if set, otherwise use default
input_config_path = os.getenv(
"CELLFINDER_CONFIG_PATH",
default=str(DEFAULT_JSON_CONFIG_PATH_CELLFINDER),
)

def setup_cache(
self,
):
def setup_cache(self):
"""
Download the input data from the GIN repository to the local
directory specified in the default_config.json
directory specified in the default_config.json.

Notes
-----
The `setup_cache` method only performs the computations once
per benchmark round and then caches the result to disk [1]_. It cannot
be parametrised [2]_.
be parametrised [2]_. Therefore, if we sweep across different input
JSON files, we need to ensure all data for all configs is made
available with this setup function.


[1] https://asv.readthedocs.io/en/latest/writing_benchmarks.html#setup-and-teardown-functions
@@ -103,24 +108,22 @@ def setup_cache(
assert Path(self.input_config_path).exists()

# Instantiate a CellfinderConfig from the input json file
# (assumes config is json serializable)
# (fetches data from GIN if required)
with open(self.input_config_path) as cfg:
config_dict = json.load(cfg)
config = CellfinderConfig(**config_dict)

# Download data with pooch
_ = pooch.retrieve(
url=config.data_url,
known_hash=config.data_hash,
path=config._install_path,
progressbar=True,
processor=pooch.Unzip(extract_dir=config.data_dir_relative),
)

# Check paths to input data should now exist in config
# Check that paths to input data now exist in config
assert Path(config._signal_dir_path).exists()
assert Path(config._background_dir_path).exists()

# Ensure cellfinder model is downloaded to default path
_ = prep_models(
model_weights_path=config.model_weights,
install_path=None, # Use default,
model_name=config.model,
)

def setup(self):
"""
Run the cellfinder workflow setup steps.
@@ -129,12 +132,7 @@ def setup(self):
"""

# Run setup
cfg = setup_cellfinder_workflow(
[
"--config",
self.input_config_path,
]
)
cfg = setup_cellfinder_workflow(self.input_config_path)

# Save configuration as attribute
self.cfg = cfg
@@ -149,7 +147,7 @@ def teardown(self):
shutil.rmtree(Path(self.cfg._output_path).resolve())


class TimeFullWorkflow(TimeBenchmarkPrepGIN):
class TimeFullWorkflow(TimeBenchmark):
"""
Time the full cellfinder workflow.

@@ -158,69 +156,117 @@ class TimeFullWorkflow(TimeBenchmarkPrepGIN):

Parameters
----------
TimeBenchmarkPrepGIN : _type_
TimeBenchmark : _type_
A base class for timing benchmarks for the cellfinder workflow.
"""

def time_workflow_from_cellfinder_run(self):
def time_workflow(self):
run_workflow_from_cellfinder_run(self.cfg)


class TimeReadInputDask(TimeBenchmarkPrepGIN):
class TimeReadInputDask(TimeBenchmark):
"""
Time reading the input data with dask

Parameters
----------
TimeBenchmarkPrepGIN : _type_
TimeBenchmark : _type_
A base class for timing benchmarks for the cellfinder workflow.
"""

def time_read_signal_with_dask(self):
read_with_dask(self.cfg._signal_dir_path)
read_with_dask(str(self.cfg._signal_dir_path))

def time_read_background_with_dask(self):
read_with_dask(self.cfg._background_dir_path)
read_with_dask(str(self.cfg._background_dir_path))


class TimeDetectCells(TimeBenchmarkPrepGIN):
class TimeDetectAndClassifyCells(TimeBenchmark):
"""
Time the cell detection and classification pipeline (`cellfinder_run`)

Parameters
----------
TimeBenchmarkPrepGIN : _type_
TimeBenchmark : _type_
A base class for timing benchmarks for the cellfinder workflow.
"""

# extend basic setup function
def setup(self):
# basic setup
TimeBenchmarkPrepGIN.setup(self)
TimeBenchmark.setup(self)

# add input data as arrays to config
self.signal_array = read_with_dask(self.cfg._signal_dir_path)
self.background_array = read_with_dask(self.cfg._background_dir_path)
# add input data as arrays to the config
self.signal_array = read_with_dask(str(self.cfg._signal_dir_path))
self.background_array = read_with_dask(
str(self.cfg._background_dir_path)
)

def time_cellfinder_run(self):
cellfinder_run(
self.signal_array, self.background_array, self.cfg.voxel_sizes
self.signal_array,
self.background_array,
self.cfg.voxel_sizes,
self.cfg.start_plane,
self.cfg.end_plane,
self.cfg.trained_model,
self.cfg.model_weights,
self.cfg.model,
self.cfg.batch_size,
self.cfg.n_free_cpus,
self.cfg.network_voxel_sizes,
self.cfg.soma_diameter,
self.cfg.ball_xy_size,
self.cfg.ball_z_size,
self.cfg.ball_overlap_fraction,
self.cfg.log_sigma_size,
self.cfg.n_sds_above_mean_thresh,
self.cfg.soma_spread_factor,
self.cfg.max_cluster_size,
self.cfg.cube_width,
self.cfg.cube_height,
self.cfg.cube_depth,
self.cfg.network_depth,
)


class TimeSaveCells(TimeBenchmarkPrepGIN):
class TimeSaveCells(TimeBenchmark):
# extend basic setup function
def setup(self):
# basic setup
TimeBenchmarkPrepGIN.setup(self)
TimeBenchmark.setup(self)

# add input data as arrays to config
self.signal_array = read_with_dask(self.cfg._signal_dir_path)
self.background_array = read_with_dask(self.cfg._background_dir_path)
self.signal_array = read_with_dask(str(self.cfg._signal_dir_path))
self.background_array = read_with_dask(
str(self.cfg._background_dir_path)
)

# detect cells
self.detected_cells = cellfinder_run(
self.signal_array, self.background_array, self.cfg.voxel_sizes
self.signal_array,
self.background_array,
self.cfg.voxel_sizes,
self.cfg.start_plane,
self.cfg.end_plane,
self.cfg.trained_model,
self.cfg.model_weights,
self.cfg.model,
self.cfg.batch_size,
self.cfg.n_free_cpus,
self.cfg.network_voxel_sizes,
self.cfg.soma_diameter,
self.cfg.ball_xy_size,
self.cfg.ball_z_size,
self.cfg.ball_overlap_fraction,
self.cfg.log_sigma_size,
self.cfg.n_sds_above_mean_thresh,
self.cfg.soma_spread_factor,
self.cfg.max_cluster_size,
self.cfg.cube_width,
self.cfg.cube_height,
self.cfg.cube_depth,
self.cfg.network_depth,
)

def time_save_cells(self):