Add features from post-coordinated concepts (value_as_concept_id) #262
Conversation
Wow! This is good stuff. I am looking forward to testing this.
@schuemie thanks for your contribution and the explanation in the PR! It looks good to me. However, I can't test it locally on my Mac because it doesn't download the latest SqlRender from CRAN.
That's probably also why the R check on macOS fails.
Yes, I'm surprised the macOS binary isn't yet available on all CRAN servers. You can already install 1.18.0 on your Mac from GitHub, but it should be available from CRAN any day now.
Yes, that does work locally. But I think it's better to wait with merging the PR until the binary is available; otherwise we'll probably get an error when making a new release and pushing it to CRAN.
@@ -8,7 +8,8 @@ CREATE TABLE #cov_ref (
 	covariate_id BIGINT,
 	covariate_name VARCHAR(512),
 	analysis_id INT,
-	concept_id INT
+	concept_id INT,
+	value_as_concept_id INT
Making a note for @jreps - once this feature is introduced, this set of changes will add a new column value_as_concept_id to the covariate reference table. We'll need to look at the results data model where this is stored for Characterization.
Thank you @schuemie for this PR - this is a great addition! I left a question for you on the ID collisions just to see how we might document this for users and potentially add in some unit tests.
I'm also going to link this PR to #247, since we'd want to test this particular feature on all available RDBMS platforms given that it uses the new HASH approach. I presume this was tested with SqlRender v1.18, so perhaps this isn't necessary for this PR, but we should do it for the v3.6 release.
SELECT covariate_id,
	MIN(@domain_concept_id) AS @domain_concept_id,
	MIN(value_as_concept_id) AS value_as_concept_id
INTO #no_collisions
FROM #temp_features
GROUP BY covariate_id;
Although collisions in covariate IDs are unlikely, I programmed defensively so that if a collision occurs, only one covariate is selected.
Just wanted to ask about this point, and I believe this is the relevant SQL that handles the case where there are collisions. If a collision occurs, do we need to notify the user or emit some warning about it? I'm just thinking that if I am building a model and want to use all available measurement/observation + value concepts, but some are excluded on the basis of a collision of the hash values, would we want to alert the user to this? It feels like this might introduce bias into the feature selection process for these covariates.
Just thinking that we (more specifically, I) could add a unit test to make sure this is handled in an expected way.
I'm not opposed to generating some warning when a collision occurs, but it seems that this would require a major change in the way FeatureExtraction works. Right now, FeatureExtraction's Java code generates SQL that is executed on the server. I don't see how we can get a warning out of the server environment to the user. Thoughts?
I suppose we'd have to create a temp table that includes the collisions and export it into the object model for the covariateData. To your point, that's a larger change, so perhaps it's not worth implementing this idea at this point.
I realized I could use the J&J network to see how many post-coordinated concepts are currently in use. Across 10 databases (Anthony: those used in ASD-002), I found a total of 181,976 unique concept_id - value_as_concept_id - table combinations, covering a total of 14.5 billion records. I was surprised by this magnitude, and this also underscores the importance of supporting these features.
Applying the original hashing algorithm to this set, I found 750 collisions, or 0.4% (actually 0.8% if you include both duplicates), which I thought was a bit high.
I therefore asked ChatGPT to come up with a better hashing algorithm, which I implemented. This new algorithm had 27 collisions, or 0.01%.
As you'll see in the code, I now also added a collisions column to the covariateRef table that counts the number of collisions detected. The R function now throws a warning if this column has values > 0. Note that this only detects collisions within a single data pull, for example for a single cohort, so it is extremely unlikely to occur. I added a unit test to make sure it works.
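Roughly, as a sketch (not the exact SQL in the PR; the #collision_counts table name below is an assumption, while #temp_features, @domain_concept_id, and value_as_concept_id come from the diff above), the per-covariate collision count can be derived by counting how many distinct concept pairs hash to the same covariate ID:
-- Hypothetical sketch: a covariate_id with more than one distinct concept pair
-- behind it represents a hash collision.
SELECT covariate_id,
	COUNT(*) - 1 AS collisions
INTO #collision_counts
FROM (
	SELECT DISTINCT covariate_id, @domain_concept_id, value_as_concept_id
	FROM #temp_features
	) distinct_pairs
GROUP BY covariate_id;
This count can then be carried onto the covariate reference table, so the R side can warn whenever it is greater than zero, as described above.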
Thanks Martijn for making these changes - everything looks good from my review. Before I merge this in, I'd like to enable all DB tests for FeatureExtraction so we can verify that this new hashing approach works across DBs. I plan to work on this today and will let you know when it is ready to merge into this feature branch.
Sounds good!
FYI: DatabaseConnector unit tests are failing on Linux because one of the database drivers is causing a null pointer exception. I've already spent days trying to figure out which one. Just so you know, if you see odd behavior on Linux, it is probably that.
I merged in the recent changes to ensure that all OHDSI test DB servers are used for unit tests, and this branch passed all tests. I'll merge in these changes now.
This PR adds the construction of features from post-coordinated concepts in the measurement and observation tables, as discussed in issue #67.
As discussed, documented, and evaluated in this file, the covariate builder uses a simple hashing function implemented in SQL to compute covariate IDs from the two concept IDs and analysis ID. Although collisions in covariate IDs are unlikely, I programmed defensively so that if a collision occurs, only one covariate is selected. Note that, for the SQL hashing to function on Oracle and Snowflake, the latest version of SqlRender (1.18.0) is required, as specified in the DESCRIPTION file.
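As a rough illustration of the general idea only (a hypothetical sketch: the constants, the modulus, the WHERE clause, and the @cdm_database_schema / @analysis_id parameter names are assumptions following common SqlRender conventions, not the actual hashing function in this PR), the covariate ID is derived in SQL from the two concept IDs and the analysis ID along these lines, keeping the analysis ID in the final three digits as is the usual FeatureExtraction covariate ID convention:
-- Hypothetical sketch of hash-based covariate IDs for post-coordinated measurement concepts.
-- The arithmetic is illustrative only; the actual hashing function in the PR differs.
SELECT measurement_concept_id,
	value_as_concept_id,
	((CAST(measurement_concept_id AS BIGINT) * 1000003 +
	  CAST(value_as_concept_id AS BIGINT) * 10007) % CAST(10000000000 AS BIGINT)) * 1000
	  + @analysis_id AS covariate_id
FROM @cdm_database_schema.measurement
WHERE value_as_concept_id IS NOT NULL
	AND value_as_concept_id != 0;
Because the hash maps many possible concept pairs into a bounded ID space, two different pairs can occasionally produce the same covariate ID, which is why, if a collision occurs, only one covariate is selected.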
The current implementation adds a column to the covariateRef table in the CovariateData object documenting the value_as_concept_id. This has value NA for all other covariates constructed by FeatureExtraction. Note that this required changes in the Java code, and we therefore have a new JAR file. I added several unit tests, including these features in test-query-no-fail.R so we know they don't cause errors on the various database platforms, and simple tests on Eunomia to see if the computations are correct. More unit tests could be added, but I would ask that others implement these.
I have taken the liberty of adding these new features to the default set of covariates. I know this will therefore basically affect every HADES study moving forward, but it seems that if other features based on measurements (measurements exist (yes/no) and measurement in normal range (below/within/above)) are part of the default set, then these new ones should be as well.
Related to this last comment: I observed that the measurement feature for normal range in the short term was not part of the default set, even though the long-term one was, as were other short-term measurement features. This must be an error (probably by me), so I changed it to be included in the default set as well.