Ingest sequencing accession IDs #387

Merged
davereinhart merged 7 commits into master from ingest-seq-accession-ids on Oct 7, 2024

@davereinhart davereinhart commented Aug 7, 2024

The clinical data module is being updated to parse sequencing accessioning data for upload into receiving.clinical, and the clinical ETL is being updated to ingest that data into the warehouse tables. The data is sourced from tracking sheets maintained on GitHub for the Seattle Flu Study.

Running this ETL on receiving.clinical records whose document contains `gisaid_accession` or `genbank_accession` triggers custom processing for this type of data. After a record is matched to an existing sample, a minimal `consensus_genome` and `genomic_sequence` record is generated for each COVID-19, RSV-A, RSV-B, Influenza A, and Influenza B sequence record.
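
As a rough illustration of the routing rule described above, the check for accession keys might look like the sketch below. The function names `has_sequencing_accession` and `route` and the returned labels are hypothetical and are not taken from the ETL code; the sketch also assumes that key presence alone (not a non-empty value) selects the custom path.

```python
from typing import Any, Mapping

# Keys named in the PR description; their presence selects custom processing.
ACCESSION_KEYS = ("gisaid_accession", "genbank_accession")


def has_sequencing_accession(document: Mapping[str, Any]) -> bool:
    """Return True when the document carries at least one accession key.

    Assumption: only key presence is checked, not whether the value is empty.
    """
    return any(key in document for key in ACCESSION_KEYS)


def route(document: Mapping[str, Any]) -> str:
    """Pick a processing path for one receiving.clinical document (hypothetical labels)."""
    if has_sequencing_accession(document):
        return "sequencing_accession"
    return "standard"
```

A document with neither key would fall through to the standard clinical processing path.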

Adding a function to the clinical ETL module to parse sequencing accession ID tracking sheets that contain NWGC IDs, BBI-assigned strain names, and GISAID and GenBank accessions. This function parses, filters, and formats the data for ingestion into ID3C.
It calculates a sequence identifier by hashing the `strain_name` and appending the pathogen code (RSVA, RSVB, or HCOV19); this value is used as the identifier in the warehouse.genomic_sequence table.
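
The identifier scheme in this commit message could be sketched as follows. The commit does not state which hash function, digest length, or separator is used, so SHA-256, the full hex digest, and the hyphen below are assumptions, and `sequence_identifier` is a hypothetical name.

```python
import hashlib

# Pathogen codes listed in the commit message.
PATHOGEN_CODES = {"RSVA", "RSVB", "HCOV19"}


def sequence_identifier(strain_name: str, pathogen_code: str) -> str:
    """Hash the strain name and append the pathogen code.

    Assumptions: SHA-256, full hex digest, "-" as the separator.
    """
    if pathogen_code not in PATHOGEN_CODES:
        raise ValueError(f"unknown pathogen code: {pathogen_code!r}")
    digest = hashlib.sha256(strain_name.encode("utf-8")).hexdigest()
    return f"{digest}-{pathogen_code}"
```

Because the hash is deterministic, re-running the parser on the same tracking sheet yields the same identifier for a given strain name and pathogen, which keeps repeated ingestion from creating new identifiers.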
…eceiving table

The clinical ETL is being updated to ingest sequencing accessioning data from receiving.clinical. The data is sourced from tracking sheets maintained on GitHub for the Seattle Flu Study.

Running this ETL on receiving.clinical records whose document contains `gisaid_accession` or `genbank_accession` triggers custom processing for this type of data. After a record is matched to an existing sample, a minimal `consensus_genome` and `genomic_sequence` record is generated for each COVID-19, RSV-A, and RSV-B sequence record.
@davereinhart davereinhart requested a review from a team as a code owner August 7, 2024 15:32
@davereinhart davereinhart marked this pull request as draft August 7, 2024 15:33
@davereinhart davereinhart force-pushed the ingest-seq-accession-ids branch from 25c298e to 3929359 Compare September 10, 2024 23:33
Updating the ETL to process sequencing accession identifier data from SFS from the clinical receiving table. This data is processed through the clinical ETL because its format differs from that of previous sequencing data and it is likely to be needed only once, at project close.
The upsert_genome function was performing only inserts. Adding an ON CONFLICT clause to perform a "non-updating" update and return the existing record id.
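
The "non-updating" update pattern mentioned above can be sketched with an in-memory SQLite database. This is not the actual `upsert_genome` implementation: the table layout, the unique constraint on `(sample_id, organism_id)`, and the helper `make_demo_db` are assumptions made for illustration, and the sketch needs SQLite 3.35 or newer for `RETURNING`. The same idea applies in PostgreSQL, where `ON CONFLICT ... DO NOTHING` returns no row on a conflict, so a no-op `DO UPDATE` is used to get the existing id back from `RETURNING`.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway table with a hypothetical unique constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE consensus_genome (
            consensus_genome_id INTEGER PRIMARY KEY,
            sample_id INTEGER NOT NULL,
            organism_id INTEGER NOT NULL,
            UNIQUE (sample_id, organism_id)
        )
        """
    )
    return conn


def upsert_genome(conn: sqlite3.Connection, sample_id: int, organism_id: int) -> int:
    """Insert a row, or return the id of the existing row on conflict.

    The DO UPDATE assigns a column its own incoming value, so nothing
    changes, but the conflicting row becomes visible to RETURNING.
    """
    row = conn.execute(
        """
        INSERT INTO consensus_genome (sample_id, organism_id)
        VALUES (?, ?)
        ON CONFLICT (sample_id, organism_id)
        DO UPDATE SET sample_id = excluded.sample_id
        RETURNING consensus_genome_id
        """,
        (sample_id, organism_id),
    ).fetchone()
    return row[0]
```

Calling `upsert_genome` twice with the same sample and organism returns the same id and leaves a single row in the table, which is what makes the ETL safe to re-run on records it has already processed.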
@davereinhart davereinhart force-pushed the ingest-seq-accession-ids branch from 3929359 to e306ec1 Compare September 16, 2024 15:58
@davereinhart davereinhart marked this pull request as ready for review September 16, 2024 17:22
@davereinhart davereinhart merged commit 932e855 into master Oct 7, 2024
4 checks passed
@davereinhart davereinhart deleted the ingest-seq-accession-ids branch October 7, 2024 17:45
2 participants