Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure the analysis specification and remove sharedResources #144

Open
anthonysena opened this issue Jul 1, 2024 · 3 comments
Open

Comments

@anthonysena
Copy link
Collaborator

anthonysena commented Jul 1, 2024

The v0.x structure of a Strategus analysis specification is of the form

{
   "sharedResources":[....],
   "moduleSpecifications": [...]
}

The original thinking here is that we might need to refer to different "shared resources" which has mainly been the cohort definitions (with subsets) and negative control outcomes. Under the hood, Strategus will take the sharedResources section of the analysis specification and the selected module specifications and pass that to the module for execution.

@chrisknoll and others have proposed removing this section since it is a bit of a "catch all" that lacks structure. I'd propose that we simply store the elements of the cohort definitions, subsets and negative control outcomes with the CohortGeneratorModule specifications. We can then pass the CohortGeneratorModule specifications & the specific module's specifications into the job context for the module so that, if needed, these elements can be accessed.

Full analysis specification example: https://github.com/ohdsi-studies/FluoroquinoloneAorticAneurysm/blob/master/inst/analysisSpecification.json

@anthonysena anthonysena added this to the v1.0.0 milestone Jul 1, 2024
@chrisknoll
Copy link
Contributor

Thanks, @anthonysena for putting this issue out there for discussion.

My feeling on the 'shared resources' element is that it's a bit of a redundant moniker on those elements of the Strategus JSON, where you could just have cohort definitions, negative control specs etc, just declared at the top-level of the strategus where all modules know that they exist. One complication/confusion we have today is that some people think about putting the design specification inside the module node of Strategus, while other things were found in 'shared'. My thought about structuring this would be:

root
| - CohortDefinitions
| - SubgroupDefinitions
| - CharacterizationAnalysis
| - IncidenceAnalysis
| - CohortMethodAnalysis
| - PredictionAnalysis
| - Modules

So under this structure, it's pretty clear what the elements of the Strategus json. In this way, all the design elements have a 'home'. The 'Modules' one is where we currently specify which modules are in play for the given analsyis (that's where you find the version and names of the modules), however, I think we're moving away from the 'versioned modules', correct? Meaning: when bulding a Strategus spec, you will be using the Strategus API based on a given version of the library (managed in either a Docker container or renv.lock file) which will produce the JSON payload. then when you exeucute, you just use the proper runtime context for this specification (again, either run within a docker or initilized from a renv.lock).

With the runtime context established, the process of executing the strategus analysis would be:

  • Generate Cohorts
  • If CharacterzationAnalsyis specified, run CharacterizationAnalysis
  • ... repeat above check for each other type of analysis we want to run. maybe this list can be specified as an imput paramater to the execution call (ie: execute(design, options) where options lets you control which analysis to run.

I also think we discussed separating the 'publish results' to a separate function, eithr as a separate function of Strategus, or depend on a completely external process for handing strategus output.

@anthonysena
Copy link
Collaborator Author

My feeling on the 'shared resources' element is that it's a bit of a redundant moniker on those elements of the Strategus JSON, where you could just have cohort definitions, negative control specs etc, just declared at the top-level of the strategus where all modules know that they exist.

I think we agree that the notion of sharedResources should be removed in favor of named elements as you have proposed. Where I am still a bit uneasy is the promotion of these elements (CohortDefinitions for example) to the top-level of the specification. Part of my uneasiness stems from the fact that we (you and I) presume cohort definitions mean Circe definitions. Additionally, the SubgroupDefinitions and NegativeControlOutcomeCohortDefinitions are settings/specifications that the CohortGenerator works with to construct the cohorts. So promoting these beyond the scope of CohortGenerator feels wrong (even though that's what we've done in sharedResources).

Perhaps the full analysis specification should be made available to each module so that it may be interrogated? If another module (i.e. CohortDiagnostics) needs info about the cohorts, it can interrogate the specification for the CohortGenerator module?

One complication/confusion we have today is that some people think about putting the design specification inside the module node of Strategus, while other things were found in 'shared'. My thought about structuring this would be:

For reference, here is how the design specification is nested inside of the module node as you described (in case anyone else reading this is wondering):

  "moduleSpecifications": [
    {
      "module": "CohortGeneratorModule",
      "version": "0.1.0",
      "remoteRepo": "github.com",
      "remoteUsername": "ohdsi",
      "settings": {
        "incremental": true,
        "generateStats": true
      },
      "attr_class": ["CohortGeneratorModuleSpecifications", "ModuleSpecifications"]
    }

Moving these to a top level of the design document would make this clearer to your point. I then presume you'd have only 1 specification for a module in a given analysis specification in this case? I think this is fair since a module can have its settings handle multiple analyses.

The 'Modules' one is where we currently specify which modules are in play for the given analysis (that's where you find the version and names of the modules), however, I think we're moving away from the 'versioned modules', correct?

That's right - the version of the "module" will be dictated by the version of Strategus that is declared in the renv.lock file and the corresponding module's package version.

With the runtime context established, the process of executing the strategus analysis would be:

Generate Cohorts
If CharacterzationAnalsyis specified, run CharacterizationAnalysis
... repeat above check for each other type of analysis we want to run. maybe this list can be specified as an input parameter to the execution call (ie: execute(design, options) where options lets you control which analysis to run.

Generate Cohorts is a special case - we have to do that one up front and so agreed that it comes first. Also agreed about how we might loop through the additional elements of the analysis specification. Fully agreed about the function signature execute(design, options) where options lets you control which analysis to run. - this has been a pain in the v0.x line since when we want to execute part of an analysis, it required that you create another analysis specification which is not ideal.

I also think we discussed separating the 'publish results' to a separate function, either as a separate function of Strategus, or depend on a completely external process for handing strategus output.

I think this is out of scope for this topic (analysis specification changes) but yes, I'd expect that results management is different from the analysis specification and execution. This difference will result in different functions of Strategus for these areas of concern.

@anthonysena
Copy link
Collaborator Author

Adding to this issue based on #51 (comment), we'd like to integrate the renv.lock file contents into the analysis specification to ensure that the R packages used to create the specifications are the same as the ones used to execute the analysis. I think we need to elaborate on details in #148 before we decide on adding this to the analysis specification but wanted to make sure the idea was put forward here to track it.

@anthonysena anthonysena changed the title Remove sharedResources from analysis specification Restructure the analysis specification and remove sharedResources Jul 18, 2024
@anthonysena anthonysena modified the milestones: v1.0.0, Backlog Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants