RFC79: Incremental Upload of Data Entries (#48)
* Add clinical_attribute_meta records to the seed mini
  To make the dataset look like real data in the database.
* Implement sample attribute rewriting flag
* Add --overwrite-existing for the rest of test cases
  Apparently, the flag does not change anything, but we add it anyway as the tests are for "incremental" data upload.
* Test that mutations stay after updating the sample attributes
* Add overwrite-existing support for mutations data
* Fix --overwrite-existing flag description for importer of profile data
* Add loader command to update case list with sample ids
  Adding to the _all case list and to case lists specified with command arguments is supported.
* Add option to remove sample ids from the remaining case lists
  That is, from case lists that are neither the _all case list nor specified with the --add-to-case-lists option.
* Make removing sample ids from not mentioned case lists a default behaviour
* Make update case list command read case list files
* Fix test clinical data headers
* Test incremental patient upload
* Add flag to reload patient clinical attributes
* Add TODO comment to remove MIXED_ATTRIBUTES data type with a reference to the ticket
* WIP: adapt py script to incremental upload
* Fix java.sql.SQLException: Generated keys not requested
* Clean alteration_driver_annotation during mutations inc. upload
* Fix validator and importer py scripts for inc. upload
* Add test/demo data for incremental loading of study_es_0 study
* Rename and move incremental tests to incementalTest folder
* Update TODO comment on how to deal with multiple sample files
* Move study_es_0_inc to the new test data folder
* Fix removing patient attributes on samples inc. upload
* Change study_es_0_inc to contain more diverse data
  We changed them to work for the demo. Mutation numbers did not change on demo.
* Specify that data_directory is for incremental data
* Disambiguate clinical data constants names
  It was easy to confuse the code related to sample and clinical_sample (attributes) and to patient and clinical_patient (attributes).
* Remove unnecessary TODO comments
* Remove MSK copyright mistakenly copy-pasted
* Fix comment of UpdateCaseListsSampleIds.run() method
* Make --overwrite-existing flag description more generic
  This flag is for the command that uploads molecular profile data.
* Add TODO comments for possible reuse of the code
* Update case lists for multiple clinical sample files
  Potentially for different studies.
* Extract and reuse common logic to read and validate case lists
* Fix TestIntegrationTest
  - change location of the files
  - make sure assertions could work on the seed mini db
  - get rid of absent cbioportal dependencies
* Revert RESOURCE_DEFINITION_DICTIONARY initialisation to empty set
* Minor improvements. Apply PR feedback
* Make tests fail the build. Propagate exit status of tests correctly
* Write Validation complete only in case of successful validation
* Add python tests for incremental/full data import
* Add unit test for incremental data validation
* Test rough order of importer commands. Remove sorting in the script to guarantee that
* Extract smaller functions from the big one in py script
  Make process_data_directory(...) smaller.
* Refactor tab delim. data importer
  - Calculate number of lines in the file in the loader
  - Remove unused imports and fields
  - Reuse constructors
  - Reuse common parsing logic in tab delimiter importer
  - Show full stacktrace, which helps in finding where tests errored out
* Implement incremental upload of mRNA data
* Add RPPA test
* Add normal sample to test data to test skipping
* Add rows with more columns than in header to skip
* Skip rows that don't have enough sample columns
* Test for invalid entrez id
* Extract common code from inc. tab. delim. tests
* Implement incremental upload of cna data via tab. delim. loader
* Blank out values for genes not mentioned in the file
* Remove unused code
* Throw unsupported operation exception for GENESET_SCORE incremental upload
* Add generic assay data incremental upload test
* Fix integration tests
* Make tab. delimiter data uploader transactional
* Check for illegal state in tab delim. data update
  It's dangerous as we would further mess up the data in the row.
* Wire incremental tab delim. data upload to cli commands
* Expand README with section on how to run incremental upload
* Address TODOs in tab delim. importer
* Add more data types to incremental data upload folder
* Remove obsolete TODO comment
* Reuse genetic_profile record if it exists in db already
  Do it for all data types, not only MAF.
* Test incremental upload of tab delim. data types from umbrella script
  - Split big tab. delim test into multiple tests based on data type.
  - Use ImportProfileData instead of ImportTabDelimData for testing.
  - We cover more logic with such tests.
  - This is a more stable interface. ImportTabDelimData can be refactored.
* Move counting lines of file inside generic assay patient level data uploader
* Give error that generic assay patient level data is not supported
* Clean sample_cna_event regardless of whether it has alteration_driver_annotation rows or not
* Fix cbioportalImport script execution
  args variable was not declared.
* Remove unneeded spring context initialisation that caused different errors to occur
* Make error message more informative when gene panel is not found
  Do not throw NPE, but NSEE with error message that mentions panel id.
* Add more genes to the mini seed to load study_es_0
* Make study_es_0_inc data pass validation
* Document in README how to load study_es_0 study
* Implement incremental upload for timeline data
* Implement incremental upload of CNA DISCRETE long data
* Add data type sanity check for tsv uploaded
* Move storing/dedup logic of genetic alteration values to importer
* Move all inc. upload logic for tab delim. data types to GeneticAlterationImporter
* Add CNA DISCRETE LONG to study_es_0_inc test dataset
* Remove unused code
* Make validation pass for CNA long and study_es_0_inc data
* Implement incremental upload for gene panel matrix
  The uploader was working in incremental manner already. I had to add tests for those only. I had to implement incremental upload for gene panel matrix from different data (CNA, Mutations) uploaders though.
* Make validation of study_es_0_inc data pass
* Implement incremental upload of structural variants data
  I removed DaoGeneticProfileSamples.addGeneticProfileSamples(geneticProfileId, orderedSampleList); as it does not seem to be needed. It does not make any sense to store samples in genetic_profile_samples if you don't use genetic_alteration table at all.
* Implement incremental upload of CNA segmented data
* Make it explicit that timeline uploader supports bulk mode only
* Fix number of columns in SV tsv data file
* Update paragraph on inc. upload in README
* Rename validation method to better describe its purpose
  To really validate entrez id, we need to look it up.
* Fix cleaning alteration_driver_annotation table for specific sample
* DRY tab separated value string parsing
* Reuse FileUtil.isInfoLine(String line) throughout the code
* Extract ensuring header and row match to tsv utility class
* Simplify delete sql. Rely on cascade delete instead.
* Generalise overwrite-existing flag description to make it more accurate
* Rename updateMode to isIncrementalUpdateMode flag
* Improve description of overwrite-existing flag for gene panel profile map
* Implement more optimal way to update sample profile
* Optimize code by always using batch upsert for sample profile
* Recognise that SEG importer always uses bulkLoad
* Organise bulk mode flushing for SEG importer
* Ignore case for bulkLoad load mode option as everywhere in the code
* add comma to README
* improve order comments for INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES
* Add join by GENETIC_PROFILE_ID column for sample_cna_event and alteration_driver_annotation tables
* Check for inconsistency in sample ids and values while reading genetic alterations
* Make method name to initialise transaction clearer
* Remove TODOs that were done
* Rename isInfoLine util. method to isDataLine
  I got feedback that "info line" sounds like the header metadata lines starting with #.
* Simplify code by using inheritance instead of composition
* Optimize removing genetic alterations by removing them for the whole genetic profile at once
  One sql statement instead of N.
* Access inherited variables with this. instead of super.
  The confusion that triggered the change: The use of super. indicates that the subclass also declares one with the same name, but you are trying to not set that somehow?
* Remove unused code from DaoSampleList.addSampleList()
* Remove extra semicolons at the end of java statements
* Rename upsertSampleProfiles to upsertSampleToProfileMapping method in DaoSampleProfile
* Use java 8 way to convert typed list to array in GeneticAlterationIncrementalImporter
* Improve doc comments for TsvUtil.isDataLine(String line)
* Rename method to updateCaseLists and document it better
* Remove DEFINED_CANCER_TYPES global variable
* Add docstring to sample attribute remove methods
  Make it explicit that the function will delete any matching records "if they exist".
* Add docstring to method to update fraction genome altered clinical attribute
  Specify that sampleIds is optional and can be set to null.
* Make DAO constant that holds SQL private
  Increase encapsulation.
* Stop doing rows math, it's just a status!
* Adopt C style of incrementing jdbc parameters
* Improve wording in error message
* Remove unused method of genetic alteration importer
* Extract db communicating methods out of the constructor
  Introduce initialise() method.
* Improve time complexity from N^2 to N
* Use american english for method names

---------

Co-authored-by: pieterlukasse <[email protected]>
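
The case list update behaviour described in this commit message (add the uploaded sample ids to the _all case list and to any case list named with --add-to-case-lists, and by default remove them from the remaining case lists) can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not the loader's actual code: the function name `update_case_lists`, its parameters, the in-memory dict of sets, and the `<study_id>_all` naming of the _all list are all hypothetical choices made for the example.

```python
def update_case_lists(case_lists, study_id, sample_ids, add_to_case_lists=()):
    """Return a new mapping of case-list id -> set of sample ids.

    Hypothetical helper: names and data shapes are assumptions for illustration.
    """
    # Lists that should receive the uploaded samples: the _all list
    # (assumed to be named "<study_id>_all") plus the explicitly named ones.
    targets = {f"{study_id}_all", *add_to_case_lists}
    incoming = set(sample_ids)

    updated = {}
    for list_id, members in case_lists.items():
        if list_id in targets:
            # Mentioned lists gain the uploaded sample ids.
            updated[list_id] = set(members) | incoming
        else:
            # Lists that were not mentioned lose them (the default behaviour).
            updated[list_id] = set(members) - incoming
    return updated


lists = {
    "study_es_0_all": {"S1"},
    "study_es_0_cna": {"S1", "S2"},
    "study_es_0_sequenced": {"S2"},
}
result = update_case_lists(
    lists, "study_es_0", ["S2", "S3"], add_to_case_lists=["study_es_0_sequenced"]
)
```

The sketch returns a new mapping instead of mutating its input and only touches case lists that already exist; whether the real command creates missing lists is not stated in the commit message.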