
Implementation of CohortIncidence module. #147

Merged

Conversation

chrisknoll
Contributor

This PR uses ResultModelManager to create the results schema and upload results.

Fixed param typo in ResultDataModel.R
@azimov
Collaborator


My feeling is that this should be in the package for several reasons:

  1. If a change is introduced in the model for any reason, that requires a release of both packages. Updating just one will break interoperability.
  2. Users will always need to work with their data; even if they don't use Strategus, they may wish to use the shiny apps or other report generation tools. Now they will have a dependency they didn't previously need.
  3. The current release of CohortIncidence just introduced a breaking change for shiny apps and any other apps by not implementing migrations. Now, currently existing projects will no longer be upgradable without custom hacks for what is a relatively small change to the data model.
  4. It's inconsistent with other packages and utilities
  5. It places a burden on the maintainer of Strategus to maintain 100% version consistency with CohortIncidence

Contributor Author

chrisknoll commented Jul 17, 2024


Thanks for the questions and comments, @azimov. I'll preface this by saying I gave a lot of consideration to what goes into the package and what is part of the module. The deciding factor was: if it was a Strategus concern, it went into the module; if it was a package concern, it went into the package.

> If a change is introduced in the model for any reason, that requires a release of both packages. Updating just one will break interoperability.

Not necessarily: if the new version of the package produces the same output from the same input, then I don't think you need a new version of Strategus (just reference the new version in renv.lock). If the analytics change, either in inputs or outputs, I think it would make sense that a different version of Strategus would be necessary to represent the changes in dependencies. I would suggest two things on this front: a release of Strategus should come with a published renv.lock file that records which versions of package dependencies have been tested with that version of Strategus, and within a single release of Strategus you may have multiple updates to underlying packages.
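For example, pinning and restoring the tested versions could look like this (a minimal sketch; the version number is hypothetical, and renv would normally manage the lock file for you):

```r
# Record the CohortIncidence version that was tested with this release of
# Strategus in the study's renv.lock (version number is hypothetical).
renv::record("[email protected]")

# Sites executing the study restore the exact pinned versions.
renv::restore(lockfile = "renv.lock")
```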

> Users will always need to work with their data; even if they don't use Strategus, they may wish to use the shiny apps or other report generation tools. Now they will have a dependency they didn't previously need.

While this is true, you can't use OhdsiShinyModules to view results from the underlying packages at the package level (i.e., take the output of the CI package and use OhdsiShinyModules to view the results). The reason is that OhdsiShinyModules has adopted conventions that are specified by Strategus. Two examples: if you want the list of databases, you go to the database meta table (which is created by Strategus), and if you want the cohort definition names, you go to cg_table_prefixcohort_definition, a table created by CohortGenerator, which Strategus requires to be part of every Strategus analysis. See the following:
https://github.com/search?q=repo%3AOHDSI%2FOhdsiShinyModules%20prefixcohort_definition&type=code
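To make the convention concrete, a results viewer ends up with queries roughly like this (a sketch; it assumes the Strategus-created database_meta_data table and CohortGenerator's default cg_ prefix, and the column names are assumptions):

```r
library(DatabaseConnector)

# Placeholder connection to the results database.
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "localhost/results",
  user = "user",
  password = "secret"
)
connection <- connect(connectionDetails)

# The list of databases comes from the table Strategus creates ...
databases <- querySql(connection, "SELECT database_id, cdm_source_abbreviation FROM results.database_meta_data;")
# ... and cohort names come from the table CohortGenerator creates.
cohorts <- querySql(connection, "SELECT cohort_definition_id, cohort_name FROM results.cg_cohort_definition;")

disconnect(connection)
```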

Based on these decisions, I would suggest that OhdsiShinyModules becomes something like StrategusShinyModules, or that we embed the shiny modules into the release of Strategus so that the execution of the analysis, the persistence of results, and the report viewer are all bound to the same version of the software. I fully understand that the original intent of OhdsiShinyModules was to view results of the individual HADES packages, but, in my estimation, this principle has been abandoned in favor of Strategus-specific concerns.

> The current release of CohortIncidence just introduced a breaking change for shiny apps and any other apps by not implementing migrations. Now, currently existing projects will no longer be upgradable without custom hacks for what is a relatively small change to the data model.

I thought about this, but I decided on the approach for two reasons: 1) you won't be changing the version of the analysis packages within a single project, and 2) while I would have preferred a migration approach, the most expedient way was to use the functionality in RMM to create a schema based on the results model spec. I'd like to move to a pure-migration approach to managing database schemas (as we do in WebAPI), but given that the schema shouldn't change within a study, I decided the simplest approach would be to use RMM.
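For reference, the schema-creation path through RMM looks roughly like this (a sketch; the spec file path, argument names, and SQL parameter name are assumptions about how the module wires this up):

```r
# connectionDetails created earlier via DatabaseConnector::createConnectionDetails().

# Results model specification shipped with the module (path is hypothetical).
specPath <- system.file("csv", "resultsDataModelSpecification.csv", package = "CohortIncidence")

# Generate DDL from the CSV specification and execute it against the results schema.
sql <- ResultModelManager::generateSqlSchema(csvFilepath = specPath)
connection <- DatabaseConnector::connect(connectionDetails)
DatabaseConnector::renderTranslateExecuteSql(connection, sql, database_schema = "results")
DatabaseConnector::disconnect(connection)
```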

> It's inconsistent with other packages and utilities

There isn't exactly a published standard on this, and the module layer provides that structure. However, it is consistent with the EvidenceSynthesis package, which also uses RMM and is contained in Strategus.

This PR follows the Strategus conventions through the use of the common R6 classes for module implementation, and in this way it is consistent with the other packages. The module layer allows the underlying packages to be independent in how they function most effectively (in their own way) while being consistent in execution across all modules.

There are a number of ways that packages are inconsistent: CI handles JSON serialization/deserialization differently, but that is because serialization via ParallelLogger is broken. Not every package defines inputs via R6 classes. I think there are going to be differences in approach across many of the HADES packages, and while I can understand the desire to keep things consistent, it may be more trouble than it's worth to pin down every possible detail.

> It places a burden on the maintainer of Strategus to maintain 100% version consistency with CohortIncidence

I'm not sure this is the case, as @anthonysena wasn't involved in any of the implementation of this PR for incorporating CI into Strategus. I do expect that we'll have different collaborators as the 'responsible party' for maintaining the underlying modules (who may or may not be the underlying package maintainer), and I think this will be coordinated through the Strategus team.

Thanks again for your thoughts and comments. Let me know if I misunderstood anything, and I will be glad to provide more detail.

@chrisknoll marked this pull request as draft July 17, 2024 14:48
@chrisknoll
Contributor Author

@anthonysena, I've converted this to draft as there's one additional thing I'd like to add (and this is a good illustration of some of the comments I made in response to @azimov's feedback):

In OhdsiShinyModules, the Characterization/CohortIncidence view has some functionality that needs to navigate to the outcomes that belong to specific targets of interest. In other words, the CI reporting UX would like to be able to select outcomes that exist for a given T. This isn't something that is captured in the analysis results, but it is something that is required by Strategus. So, I will add a new table to the results model specification in Strategus that doesn't exist in the underlying package, which will store the T-O-TAR combinations that were specified in the analysis. The module implementation will add this result to the output as part of the execute() method.
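As a rough sketch of what that execute() step would add (the table name, column names, and resultsFolder variable here are hypothetical, not the final specification):

```r
# Hypothetical: derive the target-outcome-TAR combinations from the analysis
# specification and write them alongside the package's own result CSVs.
targetOutcomes <- data.frame(
  target_cohort_id  = c(101, 101, 102),
  outcome_cohort_id = c(201, 202, 201),
  tar_id            = c(1, 1, 2)
)
readr::write_csv(targetOutcomes, file.path(resultsFolder, "ci_target_outcomes.csv"))
```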

@chrisknoll
Contributor Author

Hmm, after I merged the upstream branch into this branch, tests failed in something related to Characterization. @anthonysena, do you have any idea where that came from?

@chrisknoll
Contributor Author

chrisknoll commented Jul 17, 2024

On the topic of module implementation philosophy:

I spoke for an hour with @jreps and @pbr6cornell about the implementation details and what is a Strategus, OhdsiShinyModules, or package concern, and while we do want to reach a consensus on philosophy and approach, we're not quite there yet. To that end, I've decided that the implementation for this PR will stand, but I have committed to revisiting the code after it has been in place for a time, so we can discuss the merits and problems of the approach and address any issues that arise. Moving the implementation from the module into the package would not result in any API changes, so there should not be a backwards-compatibility challenge in making that change in the future.

@anthonysena
Collaborator

> Hmm, after I merged the upstream branch into this branch, tests failed in something related to Characterization. @anthonysena, do you have any idea where that came from?

Yes, the most recent release of FeatureExtraction broke Characterization as described here: OHDSI/Characterization#54. Once we patch Characterization, the unit tests should run without issue.

Given this, we can move forward with merging this work into the other feature branch if you feel it is ready. I did a quick review and it is fine for now, so we can wrap up the initial module work.

@chrisknoll
Contributor Author

chrisknoll commented Jul 19, 2024

@anthonysena, I added a new table to the results model for CI that isn't part of the output of the package but is requested to support reporting. This is a good example of where the module comes in and bridges functionality between the package and the reporting. So this PR is ready to go.

@chrisknoll marked this pull request as ready for review July 19, 2024 17:00
@chrisknoll
Contributor Author

chrisknoll commented Jul 22, 2024

@anthonysena, I added a commit to this PR that touches DatabaseMetaData.R:

I had to set runCheckAndFixCommands = TRUE in uploadResults, because that seems to be the only way to get the database_id int value properly converted to a varchar. Without it, the number comes up as '####.0' instead of an integer ####. I was running into an issue where the database_id in one table was #.0 and in another table was ###, which caused the join to fail. This was working when the database_id was part of a PK column (it performs the int -> as.character conversion if it is a primary key column), but something changed with that in my local env, so now I force these tables to undergo the check-and-fix.

Ideally, we'd just want to convert all columns to the correct underlying type before upload (which is one of the three things that checkAndFixCommands does), so maybe we can have RMM allow just the column conversion to be done separately from the other two operations (one checks for dupes, and I forget what the third is, but those two ops would be very expensive on very large data).
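For context, the workaround in the module call looks roughly like this (a sketch; aside from runCheckAndFixCommands, the argument names are assumptions about uploadResults' signature):

```r
# Upload the module's CSV results, forcing the check-and-fix step so that
# database_id is coerced to character before insert (the workaround above).
ResultModelManager::uploadResults(
  connectionDetails = resultsConnectionDetails,
  schema = "results",
  resultsFolder = resultsFolder,
  specifications = resultsDataModelSpec,
  runCheckAndFixCommands = TRUE
)
```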

@anthonysena
Collaborator

> @anthonysena, I added a commit to this PR that touches DatabaseMetaData.R:

> I had to set runCheckAndFixCommands = TRUE in uploadResults, because that seems to be the only way to get the database_id int value properly converted to a varchar. Without it, the number comes up as '####.0' instead of an integer ####. I was running into an issue where the database_id in one table was #.0 and in another table was ###, which caused the join to fail. This was working when the database_id was part of a PK column (it performs the int -> as.character conversion if it is a primary key column), but something changed with that in my local env, so now I force these tables to undergo the check-and-fix.

> Ideally, we'd just want to convert all columns to the correct underlying type before upload (which is one of the three things that checkAndFixCommands does), so maybe we can have RMM allow just the column conversion to be done separately from the other two operations (one checks for dupes, and I forget what the third is, but those two ops would be very expensive on very large data).

So database_id is a varchar, as declared in https://github.com/OHDSI/Strategus/blob/remove-deps-add-module-interface/inst/csv/databaseMetaDataRdms.csv#L13. Perhaps you need to cast this in your module when you set the database_id? It might be helpful to see how you are testing this out, as the unit tests were passing earlier without this change to DatabaseMetaData.R.

@chrisknoll
Contributor Author

chrisknoll commented Jul 22, 2024

It's all handled by uploadResults:
https://github.com/OHDSI/ResultModelManager/blob/d939010caf5fdc4e898c6e6d1c68d39dc86dc77f/R/DataModel.R#L535

I do not know why it started working differently; I'm not sure if it was due to a merge from upstream or a new version install of RMM, but the solution was to set that flag in uploadResults.

@chrisknoll
Contributor Author

chrisknoll commented Jul 23, 2024

Let me push up a change where I don't use that flag but instead append the databaseId via as.character(). However, there will need to be a corresponding change on the DatabaseModule.R side where you do the same (write database_id as a character value).

Edit:

I see the problem with this now: the value is written to the CSV correctly (there is no .0 in the CSV). However, we lose control over what happens in ResultModelManager::uploadResults(), since it reads the CSVs and works on them based on the results model spec. By default, RMM does not convert the CSV columns into the correct data type (unless you use the runCheckAndFixCommands flag), so there's nothing we can do from outside RMM. Hence, the change was necessary.

I think the long-term solution is to ask RMM to modify its CSV reading to use the data types specified in the results model. By default, read_csv() tries to guess the column types, and database_id values look like numbers.
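Something along these lines is what I have in mind (a simplified sketch using readr directly; a real fix would derive the column types for every table from the results model spec):

```r
# Explicitly declare database_id as character so read_csv() does not guess it
# as a number; RMM would build this cols() spec from the results model.
results <- readr::read_csv(
  "database_meta_data.csv",
  col_types = readr::cols(database_id = readr::col_character())
)
```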

@anthonysena
Collaborator

I did a test to see if there was a difference in behavior when uploading results to SQLite vs. PostgreSQL, based on the notes made by @chrisknoll above. The answer is that there is a difference: when I uploaded the database_meta_data into SQLite, the value was formatted as database_id = 85642205.0, while in PostgreSQL the value is database_id = 85642205.

For now, to keep things consistent, I'll remove the runCheckAndFixCommands = TRUE after I merge this into the other feature branch.

@anthonysena merged commit 00be102 into remove-deps-add-module-interface Jul 26, 2024
2 of 8 checks passed
@anthonysena deleted the module-interface-cohortincidence branch July 26, 2024 18:58
anthonysena added a commit that referenced this pull request Jul 31, 2024
* Implementation of CohortIncidence module. (#147)
* Adjustments from testing

---------

Co-authored-by: Chris Knoll <[email protected]>
anthonysena added a commit that referenced this pull request Oct 7, 2024
* Implementation of CohortIncidence module. (#147)
* Adjustments from testing

---------

Co-authored-by: Chris Knoll <[email protected]>