Add features from post-coordinated concepts (value_as_concept_id) #262
Conversation
Wow! This is good stuff. I am looking forward to testing this.
@schuemie thanks for your contribution and the explanation in the PR! It looks good to me. However, I can't test it locally on my Mac because it doesn't download the latest SqlRender from CRAN.
That's probably also why the R check on macOS fails.
Yes, I'm surprised the macOS binary isn't yet available on all CRAN servers. You can already install 1.18.0 on your Mac from GitHub, but it should be available from CRAN any day now.
Yes, that does work locally. But I think it's better to wait with merging the PR until the binary is available; otherwise we'll probably get an error when making a new release and pushing it to CRAN.
@@ -8,7 +8,8 @@ CREATE TABLE #cov_ref (
 	covariate_id BIGINT,
 	covariate_name VARCHAR(512),
 	analysis_id INT,
-	concept_id INT
+	concept_id INT,
+	value_as_concept_id INT
Making a note for @jreps - once this feature is introduced, this set of changes will add a new column value_as_concept_id to the covariate reference table. We'll need to look at the results data model where this is stored for Characterization.
Thank you @schuemie for this PR - this is a great addition! I left a question for you on the ID collisions just to see how we might document this for users and potentially add in some unit tests.
I'm also going to link this PR to #247, since we'd want to test this particular feature on all available RDBMS platforms given that it uses the new HASH approach. I presume this was tested with SqlRender v1.18, so perhaps this isn't necessary for this PR, but we should do it for the v3.6 release.
SELECT covariate_id,
	MIN(@domain_concept_id) AS @domain_concept_id,
	MIN(value_as_concept_id) AS value_as_concept_id
INTO #no_collisions
FROM #temp_features
GROUP BY covariate_id;
Although collisions in covariate IDs are unlikely, I programmed defensively so that if a collision occurs, only one covariate is selected.
Just wanted to ask about this point, and I believe this is the relevant SQL that handles the case where there are collisions. If a collision occurs, do we need to notify the user or emit some warning about it? I'm just thinking that if I am building a model and want to use all available measurement/observation + value concepts, but some are excluded on the basis of a collision of the hash values, would we want to alert the user to this? It feels like this might introduce bias into the feature selection process for these covariates.
Just thinking that we (more specifically, I) could add a unit test to make sure this is handled in an expected way.
I'm not opposed to generating some warning when a collision occurs, but it seems that this would require a major change in the way FeatureExtraction works. Right now, FeatureExtraction's Java code generates SQL that is executed on the server. I don't see how we can get a warning out of the server environment to the user. Thoughts?
I suppose we'd have to create a temp table that includes the collisions and export it into the object model for the covariateData. To your point, that's a larger change, so perhaps it's not worth implementing this idea at this point.
I realized I could use the J&J network to see how many post-coordinated concepts are currently in use. Across 10 databases (Anthony: those used in ASD-002), I found a total of 181,976 unique concept_id - value_as_concept_id - table combinations, covering a total of 14.5 billion records. I was surprised by this magnitude, and this also underscores the importance of supporting these features.
Applying the original hashing algorithm to this set, I found 750 collisions, or 0.4% (actually 0.8% if you include both duplicates), which I thought was a bit high.
I therefore asked ChatGPT to come up with a better hashing algorithm, which I implemented. This new algorithm had 27 collisions, or 0.01%.
As you'll see in the code, I now also added a collisions column to the covariateRef table that counts the number of collisions detected. The R function now throws a warning if this column has values > 0. Note that this only detects collisions within a single data pull, for example for a single cohort, so it is extremely unlikely to occur. I added a unit test to make sure it works.
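Roughly, as a sketch (not the exact SQL in the PR; the #collision_counts table name below is an assumption, while #temp_features, @domain_concept_id, and value_as_concept_id come from the diff above), the per-covariate collision count can be derived by counting how many distinct concept pairs hash to the same covariate ID:
-- Hypothetical sketch: a covariate_id with more than one distinct concept pair
-- behind it represents a hash collision.
SELECT covariate_id,
	COUNT(*) - 1 AS collisions
INTO #collision_counts
FROM (
	SELECT DISTINCT covariate_id, @domain_concept_id, value_as_concept_id
	FROM #temp_features
	) distinct_pairs
GROUP BY covariate_id;
This count can then be carried onto the covariate reference table, so the R side can warn whenever it is greater than zero, as described above.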
Thanks Martijn for making these changes - everything looks good from my review. Before I merge this in, I'd like to enable all DB tests for FeatureExtraction so we can verify that this new hashing approach works across DBs. I plan to work on this today and will let you know when it is ready to merge into this feature branch.
Sounds good!
FYI: DatabaseConnector unit tests are failing on Linux because one of the database drivers is causing a null pointer exception. I've already spent days trying to figure out which one. Just so you know, if you see odd behavior on Linux, it is probably that.
I merged in the recent changes to ensure that all OHDSI test DB servers are used for unit tests, and this branch passed all tests. I'll merge in these changes now.
This PR adds the construction of features from post-coordinated concepts in the measurement and observation tables, as discussed in issue #67.
As discussed, documented, and evaluated in this file, the covariate builder uses a simple hashing function implemented in SQL to compute covariate IDs from the two concept IDs and analysis ID. Although collisions in covariate IDs are unlikely, I programmed defensively so that if a collision occurs, only one covariate is selected. Note that, for the SQL hashing to function on Oracle and Snowflake, the latest version of SqlRender (1.18.0) is required, as specified in the DESCRIPTION file.
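As a rough illustration of the general idea only (a hypothetical sketch: the constants, the modulus, the WHERE clause, and the @cdm_database_schema / @analysis_id parameter names are assumptions following common SqlRender conventions, not the actual hashing function in this PR), the covariate ID is derived in SQL from the two concept IDs and the analysis ID along these lines, keeping the analysis ID in the final three digits as is the usual FeatureExtraction covariate ID convention:
-- Hypothetical sketch of hash-based covariate IDs for post-coordinated measurement concepts.
-- The arithmetic is illustrative only; the actual hashing function in the PR differs.
SELECT measurement_concept_id,
	value_as_concept_id,
	((CAST(measurement_concept_id AS BIGINT) * 1000003 +
	  CAST(value_as_concept_id AS BIGINT) * 10007) % CAST(10000000000 AS BIGINT)) * 1000
	  + @analysis_id AS covariate_id
FROM @cdm_database_schema.measurement
WHERE value_as_concept_id IS NOT NULL
	AND value_as_concept_id != 0;
Because the hash maps many possible concept pairs into a bounded ID space, two different pairs can occasionally produce the same covariate ID, which is why, if a collision occurs, only one covariate is selected.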
The current implementation adds a column to the covariateRef table in the CovariateData object documenting the value_as_concept_id. This has value NA for all other covariates constructed by FeatureExtraction. Note that this required changes in the Java code, and we therefore have a new JAR file. I added several unit tests, including these features in test-query-no-fail.R so we know they don't cause errors on the various database platforms, and simple tests on Eunomia to see if the computations are correct. More unit tests could be added, but I would ask that others implement these.
I have taken the liberty of adding these new features to the default set of covariates. I know this will therefore basically affect every HADES study moving forward, but it seems that if other features based on measurements (measurements exist (yes/no) and measurement in normal range (below/within/above)) are part of the default set, then these new ones should be as well.
Related to this last comment: I observed that the measurement feature for normal range in the short term was not part of the default set, even though the long-term one was, as were other short-term measurement features. This must be an error (probably by me), so I changed it to be included in the default set as well.