Codecov Report
Attention:

Additional details and impacted files

@@           Coverage Diff           @@
##            main     #15    +/-   ##
======================================
  Coverage   0.00%   0.00%
======================================
  Files          1       2      +1
  Lines          5      98     +93
======================================
- Misses         5      98     +93

☔ View full report in Codecov by Sentry.
alessandrofelder left a comment
Thanks @sfmig - especially for the really detailed and good PR description!
This is a good start. I think the ability to programmatically change the input data is a key requirement that is currently missing, though, so I've requested changes (happy to move that to a separate issue if you prefer! In that case, ask me for a re-review and link to the issue, and I will approve). I've commented on how I suggest doing this (I could be wrong!).
I plan to implement the other variants in future PRs (but happy to discuss this).
Let's discuss this at the next developer meeting (I've put it on the agenda).
Should I use a dataclass decorator around the Workflow class, and just have setup_parameters and setup_input_data inside __init__? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control over what is initialised and when. This may be useful when trying to reuse these functions in asv.
Would it make sense to split it into a configuration dataclass, and the setup_workflow function (see my comments around the INPUT_DATA_CACHE_DIR variable)? Maybe I'm too biased by how I approached this for brainreg?
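One possible shape for the split discussed above, as a sketch: a plain configuration dataclass that holds only parameters, plus a separate setup function that does the side-effectful work. The names `CellfinderConfig` and `setup_workflow`, and the field values, are illustrative assumptions, not the code in this PR.

```python
from dataclasses import dataclass, field
from pathlib import Path


# Hypothetical split: CellfinderConfig / setup_workflow are illustrative names.
@dataclass
class CellfinderConfig:
    """Plain data holder: parameters only, no I/O or side effects."""

    input_data_dir: Path = (
        Path.home() / ".cellfinder_benchmarks" / "cellfinder_test_data"
    )
    voxel_sizes: list = field(default_factory=lambda: [5, 2, 2])  # microns
    batch_size: int = 32


def setup_workflow(cfg: CellfinderConfig) -> CellfinderConfig:
    """Side-effectful setup (cache dirs, downloads), kept out of the config."""
    cfg.input_data_dir.mkdir(parents=True, exist_ok=True)
    # ...download the test data here if it is not already cached...
    return cfg
```

Keeping the dataclass free of I/O makes it easy to construct many configs cheaply in asv parameterised benchmarks, while the setup function maps naturally onto asv's `setup` hook.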
I assume we are not interested in timing the download from GIN, right?
I agree.
Some functions are not part of this basic workflow (e.g. loading the output xml file, containing the result of the analysis). Should we make them part of a simpler workflow (like a data loading workflow) or should we benchmark them individually (maybe following the structure of the Python modules as in the current asv work)?
In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.
As you suggest, we may want to benchmark them individually in the future 🤔 maybe open an issue and not benchmark them for now?
# Local cached directories
CELLFINDER_CACHE_DIR = Path.home() / ".cellfinder_benchmarks"
INPUT_DATA_CACHE_DIR = CELLFINDER_CACHE_DIR / "cellfinder_test_data"
I think we need a way to configure the value of INPUT_DATA_CACHE_DIR, so we can run the same workflow on different data.
I think the best way is to check whether an environment variable exists and, if it doesn't, fall back to this data as the default, fetched via pooch?
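A sketch of that env-var-with-pooch-fallback idea. The environment variable name and the GIN URL are placeholders I've invented for illustration; only the `pooch.retrieve` call pattern is real pooch API.

```python
import os
from pathlib import Path

# Assumed names: both the env var and the URL below are placeholders.
DEFAULT_CACHE_DIR = Path.home() / ".cellfinder_benchmarks" / "cellfinder_test_data"
ENV_VAR = "CELLFINDER_BENCHMARKS_INPUT_DATA_DIR"


def resolve_input_data_dir() -> Path:
    """Prefer a user-supplied directory; otherwise fetch the default via pooch."""
    env_value = os.environ.get(ENV_VAR)
    if env_value is not None:
        return Path(env_value)

    import pooch  # third-party; only needed on the fallback path

    pooch.retrieve(
        url="https://gin.g-node.org/<org>/<repo>/raw/master/test_data.zip",  # placeholder
        known_hash=None,  # a real SHA256 hash should be pinned here
        path=DEFAULT_CACHE_DIR,
        processor=pooch.Unzip(),  # unpack the archive after download
    )
    return DEFAULT_CACHE_DIR
```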
For an idea, see
self.voxel_sizes = [5, 2, 2]  # microns
self.start_plane = 0
self.end_plane = -1
self.trained_model = None  # if None, it will use a default model
self.model_weights = None
self.model = "resnet50_tv"
self.batch_size = 32
self.n_free_cpus = 2
self.network_voxel_sizes = [5, 1, 1]
self.soma_diameter = 16
self.ball_xy_size = 6
self.ball_z_size = 15
self.ball_overlap_fraction = 0.6
self.log_sigma_size = 0.2
self.n_sds_above_mean_thresh = 10
self.soma_spread_factor = 1.4
self.max_cluster_size = 100000
self.cube_width = 50
self.cube_height = 50
self.cube_depth = 20
self.network_depth = "50"
Maybe we want to be able to set (some of) these as well, to run the workflow with different parameters?
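One lightweight way to make these settable per run is to move them into a frozen dataclass and override selected values with `dataclasses.replace`. This is a sketch with only a few of the fields above; `DetectionParams` is a name I've made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DetectionParams:
    """Illustrative subset of the parameters above, with the same defaults."""

    soma_diameter: int = 16
    ball_xy_size: int = 6
    ball_z_size: int = 15
    batch_size: int = 32


base = DetectionParams()                     # defaults for most runs
big_somas = replace(base, soma_diameter=24)  # one-off variant for a sweep
```

Because the dataclass is frozen, each variant is an independent immutable value, which avoids one benchmark run accidentally mutating the parameters of the next.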
Parameters
----------
cfg : Workflow
It's interesting that you call the argument of type Workflow "cfg", no? Should Workflow really be called CellfinderConfig?
…signal and background files to None by default in dataclass (they need to be predefined).
Force-pushed bece87e to 8e13570
Thanks for the feedback @alessandrofelder! I think it looks much better now 🤩 I followed the structure from your example to:
For the config definition, I followed your approach: I check if there is an environment variable pointing to a config JSON file; otherwise I use the default dictionary (a dictionary can also be passed as an input). For retrieving the data, I check whether the data exists locally; otherwise I fetch it from GIN. I also added some smoke tests for the workflow, which in retrospect was maybe a bit of a rabbit hole... I moved the tests to a different PR (still draft), as I am having some odd issues with pytest and checking the logs.
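The resolution order described above (explicit dict, then env-var JSON file, then defaults) could look roughly like this. The env var name `CELLFINDER_CONFIG_PATH` and the `WorkflowConfig` fields are assumptions for illustration.

```python
import json
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class WorkflowConfig:
    """Minimal stand-in for the real config; field names are illustrative."""

    input_data_dir: str = ""
    voxel_sizes: tuple = (5, 2, 2)


def load_config(config_dict: Optional[dict] = None) -> WorkflowConfig:
    """Resolve the config: explicit dict > env-var JSON file > defaults.

    CELLFINDER_CONFIG_PATH is an assumed env var name.
    """
    if config_dict is None:
        config_path = os.environ.get("CELLFINDER_CONFIG_PATH")
        if config_path is not None:
            # Read the user's JSON config file into a dict.
            config_dict = json.loads(Path(config_path).read_text())
        else:
            config_dict = {}  # fall back to the dataclass defaults
    return WorkflowConfig(**config_dict)
```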
alessandrofelder left a comment
Awesome.
I've made a minor suggestion to narrow the scope of the default_dict variable for you to consider.
)

else:
    config = CellfinderConfig(**default_config_dict)
I wonder whether default_dict should only be initialised here, inside the else block, instead of as a global variable? It's good practice to keep the scope of variables small and define them near where they are actually used. (Also, if we keep it as the module-level constant it is now, it should be called DEFAULT_DICT according to PEP 8.)
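The suggested refactor, sketched with placeholder values (the real default dict and `CellfinderConfig` fields differ):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellfinderConfig:
    """Stand-in for the real config class; the field is illustrative."""

    input_data_dir: str = ""


def make_config(input_config: Optional[dict] = None) -> CellfinderConfig:
    if input_config is not None:
        config = CellfinderConfig(**input_config)
    else:
        # Defined only where it is used, keeping its scope small;
        # the values here are placeholders.
        default_config_dict = {"input_data_dir": "~/.cellfinder_benchmarks"}
        config = CellfinderConfig(**default_config_dict)
    return config
```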
assert input_config_path.exists()

# read into dict (assuming config is a json dict?)
# TODO: add error handling here?
This is a good point: we might want to move from configuration dataclasses to pydantic or attrs, to get validation implicitly. Open an issue?
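Until such a move, even a plain dataclass can validate itself in `__post_init__`. A stdlib-only sketch (fields are illustrative); pydantic or attrs would generate equivalent checks from the type annotations:

```python
from dataclasses import dataclass


@dataclass
class CellfinderConfig:
    """Illustrative fields only; the real config has many more."""

    batch_size: int = 32
    voxel_sizes: tuple = (5, 2, 2)

    def __post_init__(self):
        # Explicit checks that pydantic/attrs would derive from annotations.
        if not isinstance(self.batch_size, int) or self.batch_size <= 0:
            raise ValueError(
                f"batch_size must be a positive int, got {self.batch_size!r}"
            )
        if len(self.voxel_sizes) != 3:
            raise ValueError("voxel_sizes must have 3 entries (z, y, x)")
```

This catches a malformed JSON config at construction time rather than deep inside the workflow.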
…ult config from input config in workflow setup.
This PR went through quite a bit of discussion, with enough changes to merit a separate PR; the current solution is now at #23. Closing this one for now!
This PR is a first step towards defining the cellfinder-core workflows (#10).

Workflow variants

From the README, the cellfinder workflow can be defined in two ways:

- using the `main()` function from `cellfinder_core.main` (see snippet here)

Within each approach, we can also choose to read in the data using Dask, or using `tifffile.imread`. That would make 4 different workflow variants. In this PR, I focus on the first variant (using the `main()` function and reading the data with Dask); I plan to implement the other variants in future PRs (but happy to discuss this).

Workflow class

I defined a `Workflow` class with:

- workflow parameters (such as `voxel_size`), and
- `setup_` methods that wrap the steps required to process the data, but that we don't plan to benchmark (such as defining processing parameters or downloading the test data).

I did this mostly thinking about how we could reuse these workflows for asv, but the ideal structure for reusing this in benchmarking needs more exploration (likely in its own PR). I was assuming we will want to time the whole workflow (i.e., `workflow_from_cellfinder_run`), and also each of the main steps separately (here, `read_with_dask`, `cellfinder_run`, `save_cells`).

Note: initially I was also thinking about including some other methods in this class (not `setup_` methods) that would define wrappers around several API functions. For example, I had an extra method `read_signal_and_background_with_dask`, for reading the signal and background data combined, thinking that we may want to time them in combination. But I am not sure that split is worth it... so I just kept it simple and used the `read_with_dask` API function instead.

Questions and feedback

- Should I use a dataclass decorator around the `Workflow` class, and just have `setup_parameters` and `setup_input_data` inside `__init__`? It would definitely look cleaner, but for now I decided to leave it more explicit, so that we have more control over what is initialised and when. This may be useful when trying to reuse these functions in asv.
- Some functions are not part of this basic workflow (e.g. loading the output xml file). Should we benchmark them individually (maybe following the structure of the Python modules as in the current asv work)?
- In the future, we may want to use asv to time these functions for a range of values for certain preprocessing parameters.
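A minimal skeleton of the class shape described above. Attribute and method names follow the PR description; all bodies are placeholders, and the commented calls stand in for the real API functions.

```python
from pathlib import Path


class Workflow:
    """Skeleton only: names follow the PR description, bodies are placeholders."""

    def setup_parameters(self):
        # Not benchmarked: define processing parameters.
        self.voxel_sizes = [5, 2, 2]  # microns
        self.batch_size = 32

    def setup_input_data(self):
        # Not benchmarked: locate (or download) the test data.
        self.input_data_dir = Path.home() / ".cellfinder_benchmarks"

    def workflow_from_cellfinder_run(self):
        # The part to benchmark, end to end; each commented step
        # could also be timed separately:
        #   signal = read_with_dask(...)
        #   detected_cells = cellfinder_run(signal, ...)
        #   save_cells(detected_cells, ...)
        pass


wf = Workflow()
wf.setup_parameters()
wf.setup_input_data()
```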