Check if by param exists when doing sample-based annotation

Plus rename TSV to .tsv
vib-singlecell-nf · Sep 17, 2020 · edcda97
1 parent de1f896
commit edcda97
Showing 10 changed files with 56 additions and 39 deletions.
diff --git a/docs/features.rst b/docs/features.rst
@@ -162,16 +162,16 @@ For both methods, here are the mandatory params to set:
 
 If ``aio`` used, the following additional params are required:
 
-- ``cellMetaDataFilePath`` is a file path pointing to a single TSV file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
+- ``cellMetaDataFilePath`` is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
 - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **can** have unique values; if it's not the case, it's important that the combination of the values from the ``indexColumnName`` and the ``sampleColumnName`` are unique. 
 - ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section.
 
 If ``obo`` is used, the following params are required:
 
 - ``cellMetaDataFilePath``
 
-  - In multi-sample mode, is a file path containing a glob pattern. The target file paths should each point to a TSV file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
-  - In single-sample mode, is a file path pointing to a single TSV file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
+  - In multi-sample mode, is a file path containing a glob pattern. The target file paths should each point to a .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
+  - In single-sample mode, is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
   - **Note**: the file name(s) of ``cellMetaDataFilePath`` is/are required to contain the sample ID(s).
 
 - ``sampleSuffixWithExtension`` is the suffix used to extract the sample ID from the file name(s) of ``cellMetaDataFilePath``. The suffix should be the part after the sample name in the file path.
@@ -200,7 +200,7 @@ The profile ``utils_sample_annotate`` should be added when generating the main c
 
 Then, the following parameters should be updated to use the module feature:
 
-- ``metaDataFilePath`` is a TSV file (with header) with at least 2 columns where the first column needs to match the sample IDs. Any other columns will be added as annotation in the final loom i.e.: all the cells related to their sample will get annotated with their given annotations.
+- ``metaDataFilePath`` is a .tsv file (with header) with at least 2 columns where the first column needs to match the sample IDs. Any other columns will be added as annotation in the final loom i.e.: all the cells related to their sample will get annotated with their given annotations.
 
 .. list-table:: Sample-based Metadata Table
     :widths: 40 40 20
@@ -287,7 +287,7 @@ If ``external`` used, the following additional params are required:
 
 - ``filters`` is a List of Maps where each Map is required to have the following parameters:
 
-  - ``cellMetaDataFilePath`` is a file path pointing to a single TSV file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.
+  - ``cellMetaDataFilePath`` is a file path pointing to a single .tsv file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.
   - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **must** have unique values. 
   - `optional` ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section.
   - `optional` ``filterColumnName`` is the column name from ``cellMetaDataFilePath`` which will be used to filter out cells.
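
The uniqueness requirement on ``indexColumnName`` (alone, or in combination with ``sampleColumnName``) described in these hunks can be checked before launching the pipeline. The following is a minimal sketch using only the Python standard library; the function name `find_duplicate_keys` and the example column names are hypothetical and are not part of VSN.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(tsv_handle, index_column, sample_column=None):
    """Return the keys that occur more than once in a .tsv file with header.

    The key is the value of index_column, or the pair
    (index_column, sample_column) when sample_column is given.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if sample_column is None:
            key = row[index_column]
        else:
            key = (row[index_column], row[sample_column])
        counts[key] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical metadata: barcodes repeat across samples,
# but each (barcode, sample) pair occurs only once.
example = io.StringIO(
    "barcode\tsample_id\tcell_type\n"
    "AAAC\tsample1\tneuron\n"
    "AAAC\tsample2\tglia\n"
    "TTTG\tsample1\tneuron\n"
)
print(find_duplicate_keys(example, "barcode", "sample_id"))
```

An empty result means the chosen key is unique; calling the function without `sample_column` on the same data would report the repeated barcode instead.
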

diff --git a/docs/pipelines.rst b/docs/pipelines.rst
@@ -374,7 +374,7 @@ Contrary to the aforementioned pipelines, these are not end-to-end. They are used
 **cell_annotate**
 -----------------
 
-Runs the ``cell_annotate`` workflow which will perform a cell-based annotation of the data using a set of provided TSV metadata files.
+Runs the ``cell_annotate`` workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files.
 We show a use case here below with 10x Genomics data where it will annotate different samples using the ``obo`` method. For more information
 about this cell-based annotation feature please visit the `Cell-based metadata annotation`_ section.
 
@@ -426,7 +426,7 @@ Now we can run it with the following command:
 **cell_annotate_filter**
 ------------------------
 
-Runs the ``cell_annotate_filter`` workflow which will perform a cell-based annotation of the data using a set of provided TSV metadata files followed by a cell-based filtering.
+Runs the ``cell_annotate_filter`` workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files followed by a cell-based filtering.
 We show a use case here below with 10x Genomics data where it will annotate different samples using the ``obo`` method. For more information
 about this cell-based annotation feature please visit the `Cell-based metadata annotation`_ section and the `Cell-based metadata filtering`_ section.
 

diff --git a/src/scenic b/src/scenic
diff --git a/src/utils/README.md b/src/utils/README.md
@@ -20,7 +20,7 @@ params {
 ```
 Then, the following parameters should be updated to use the module feature:
 
-- `cellMetaDataFilePath` is a TSV file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
+- `cellMetaDataFilePath` is a .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
 - `indexColumnName` is the column name from `cellMetaDataFilePath` containing the cell IDs information.
 - `sampleColumnName` is the column name from `cellMetaDataFilePath` containing the sample ID/name information.
 - `annotationColumnNames` is an array of column names from `cellMetaDataFilePath` containing different annotation metadata to add.
@@ -42,7 +42,7 @@ params {
 ```
 Then, the following parameters should be updated to use the module feature:
 
-- `metaDataFilePath` is a TSV file (with header) with at least 2 columns where the first column needs to match the sample IDs. Any other columns will be added as annotation in the final loom i.e.: all the cells related to their sample will get annotated with their given annotations.
+- `metaDataFilePath` is a .tsv file (with header) with at least 2 columns where the first column needs to match the sample IDs. Any other columns will be added as annotation in the final loom i.e.: all the cells related to their sample will get annotated with their given annotations.
 
 | id  | chemistry | ... |
 | ------------- | ------------- | ------------- |

diff --git a/src/utils/bin/sc_h5ad_annotate_by_cell_metadata.py b/src/utils/bin/sc_h5ad_annotate_by_cell_metadata.py
@@ -18,7 +18,7 @@
 parser.add_argument(
     "cell_meta_data_file_path",
     type=argparse.FileType('r'),
-    help='The file path to meta data (TSV with header) for each cell where values from a column could be used to annotate the cells.'
+    help='The file path to metadata (.tsv with header) for each cell where values from a column could be used to annotate the cells.'
 )
 
 parser.add_argument(
@@ -49,7 +49,7 @@
     '-s', '--sample-column-name',
     type=str,
     dest="sample_column_name",
-    help="The column name containing the sample ID for each cell entry in the cell meta data."
+    help="The column name containing the sample ID for each cell entry in the cell metadata."
 )
 
 parser.add_argument(

diff --git a/src/utils/bin/sc_h5ad_annotate_by_sample_metadata.py b/src/utils/bin/sc_h5ad_annotate_by_sample_metadata.py
@@ -42,7 +42,7 @@
     type=str,
     action="store",
     dest="metadata_file_path",
-    help="Path to the meta data. It expects a tabular separated file (.tsv) with header and a required 'id' column."
+    help="Path to the metadata. It expects a tabular separated file (.tsv) with header and a required 'id' column."
 )
 
 parser.add_argument(
@@ -99,7 +99,7 @@
     raise Exception("VSN ERROR: Missing sample_id column in the obs slot of the AnnData of the given h5ad.")
 
 if args.sample_column_name is None:
-    raise Exception("VSN ERROR: sampleColumnName is missing in the sample_annotate config.")
+    raise Exception("VSN ERROR: Missing --sample-column-name argument (sampleColumnName param in sample_annotate config)")
 
 metadata = pd.read_csv(
     filepath_or_buffer=args.metadata_file_path,
@@ -109,9 +109,9 @@
 sample_info = metadata[metadata[args.sample_column_name] == SAMPLE_NAME]
 
 if len(sample_info) == 0:
-    raise Exception(f"VSN ERROR: The meta data TSV file does not contain sample ID '{SAMPLE_NAME}'.")
+    raise Exception(f"VSN ERROR: The metadata .tsv file does not contain sample ID '{SAMPLE_NAME}'.")
 elif args.method == "sample" and len(sample_info) > 1:
-    raise Exception(f"VSN ERROR: The meta data TSV file contains duplicate entries with the sample ID '{SAMPLE_NAME}'. Fix your metadata or use the 'sample+' method.")
+    raise Exception(f"VSN ERROR: The metadata .tsv file contains duplicate entries with the sample ID '{SAMPLE_NAME}'. Fix your metadata or use the 'sample+' method.")
 
 if args.method == "sample":
     for (column_name, column_data) in sample_info.iteritems():
@@ -134,7 +134,7 @@
     # Update the obs slot of the AnnData
     adata.obs = new_obs
 else:
-    raise Exception("VSN ERROR: This meta data type {} is not implemented".format(args.type))
+    raise Exception(f"VSN ERROR: Unrecognized method {args.method}.")
 
 if args.annotation_column_names is not None and len(args.annotation_column_names) > 0:
     adata.obs = adata.obs[args.annotation_column_names]
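
The validation order in the hunks above (missing column name, no matching row, duplicate rows under the `sample` method, unrecognized method) can be sketched without pandas or AnnData. This is an illustration only: `select_sample_rows` is a hypothetical helper, the error messages are paraphrased, and accepting `sample+` as the only other method is an assumption based on the error text in the diff.

```python
import csv
import io


def select_sample_rows(tsv_handle, sample_column_name, sample_name, method):
    """Return the metadata rows for one sample, mirroring the checks above."""
    if sample_column_name is None:
        raise ValueError("Missing sample column name.")
    # Keep only the rows whose sample column matches the requested sample.
    rows = [
        row
        for row in csv.DictReader(tsv_handle, delimiter="\t")
        if row[sample_column_name] == sample_name
    ]
    if len(rows) == 0:
        raise ValueError(f"No entry for sample ID '{sample_name}'.")
    if method == "sample" and len(rows) > 1:
        raise ValueError(f"Duplicate entries for sample ID '{sample_name}'.")
    # Assumption: 'sample' and 'sample+' are the only supported methods.
    if method not in ("sample", "sample+"):
        raise ValueError(f"Unrecognized method {method}.")
    return rows


# Hypothetical sample-based metadata with a duplicated sample ID.
metadata = "id\tchemistry\ns1\tv2\ns2\tv3\ns2\tv3\n"
print(select_sample_rows(io.StringIO(metadata), "id", "s1", "sample"))
```

With this data, requesting `s2` under the `sample` method raises an error, while `sample+` returns both rows.
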

diff --git a/src/utils/bin/sc_h5ad_prepare_obs_filter.py b/src/utils/bin/sc_h5ad_prepare_obs_filter.py
@@ -41,7 +41,7 @@ def str_to_bool(s):
     dest="method",
     choices=['internal', 'external'],
     default='internal',
-    help="The method to prepare the filters. Internal means, the input is expected to be a .h5ad otherwise it expects a .tsv."
+    help="The method to prepare the filters. Internal means, the input is expected to be a .h5ad otherwise it expects a .tsv file."
 )
 
 parser.add_argument(
@@ -55,14 +55,14 @@ def str_to_bool(s):
     '-s', '--sample-column-name',
     type=str,
     dest="sample_column_name",
-    help="The column name containing the sample ID for each row in the cell meta data."
+    help="The column name containing the sample ID for each row in the cell metadata."
 )
 
 parser.add_argument(
     '-x', '--index-column-name',
     type=str,
     dest="index_column_name",
-    help="The column name containing the index (unique identifier) for each row in the cell meta data."
+    help="The column name containing the index (unique identifier) for each row in the cell metadata."
 )
 
 parser.add_argument(

diff --git a/src/utils/bin/sc_h5ad_update.py b/src/utils/bin/sc_h5ad_update.py
@@ -28,7 +28,7 @@
     type=argparse.FileType('r'),
     dest="x_pca",
     required=False,
-    help='The path to the (compressed) TSV file containing the new PCA embeddings.'
+    help='The path to the (compressed) .tsv file containing the new PCA embeddings.'
 )
 parser.add_argument(
     '-r', "--empty-x",

diff --git a/src/utils/bin/sra_to_metadata.py b/src/utils/bin/sra_to_metadata.py
@@ -10,7 +10,7 @@
 
 parser = argparse.ArgumentParser(
     description='''
-Convert a SRA ID to a meta data TSV file with the following information
+Convert a SRA ID to a metadata .tsv file with the following information
 - experiment_accession, e.g.: SRX4084637
 - experiment_title, e.g.: GSM3142622: w1118_1d_WholeBrain_Unstranded_RNA-seq; Drosophila melanogaster; RNA-Seq
 - experiment_desc, e.g.: GSM3142622: w1118_1d_WholeBrain_Unstranded_RNA-seq; Drosophila melanogaster; RNA-Seq
@@ -67,7 +67,7 @@
     "-o", "--output",
     type=argparse.FileType('w'),
     required=True,
-    help='The TSV file path that will store the metadata for the given SRA Project ID.'
+    help='The .tsv file path that will store the metadata for the given SRA Project ID.'
 )
 
 args = parser.parse_args()
@@ -116,7 +116,7 @@
     axis=1
 )
 
-# Filter the meta data based on the given filters (if provided)
+# Filter the metadata based on the given filters (if provided)
 if args.sample_filters is not None:
     # Convert * (if not preceded by .) to .*
     def replace_bash_asterisk_wildcard(glob):

diff --git a/src/utils/processes/h5adAnnotate.nf b/src/utils/processes/h5adAnnotate.nf
@@ -102,29 +102,46 @@ process SC__ANNOTATE_BY_SAMPLE_METADATA {
 
         // method / type param
         methodAsArgument = ''
-        methodAsArgument = processParams.by.containsKey('method') ? processParams.by.method : ''
-        // make it backward compatible (see sample_annotate_v1.config)
-        methodAsArgument = processParams.containsKey('type') ? processParams.type : methodAsArgument
-
-        // metadata file path param
+        if(processParams.containsKey("by")) {
+            methodAsArgument = processParams.by.containsKey('method') ? processParams.by.method : ''
+        } else {
+            // make it backward compatible (see sample_annotate_old_v1.config)
+            methodAsArgument = processParams.containsKey('type') ? processParams.type : methodAsArgument
+        }
+
+        // metadataFilePath param
         metadataFilePathAsArgument = getMetadataFilePath(processParams)
 
-        compIndexColumnNamesFromAdataAsArguments = processParams.by.containsKey('compIndexColumnNames') ?
-            processParams.by.compIndexColumnNames.collect { key, value -> return key }.collect({ '--adata-comp-index-column-name ' + ' ' + it }).join(' ') :
-            ''
-        compIndexColumnNamesFromMetadataAsArguments = processParams.by.containsKey('compIndexColumnNames') ?
-            processParams.by.compIndexColumnNames.collect { key, value -> return value }.collect({ '--metadata-comp-index-column-name ' + ' ' + it }).join(' ') :
-            ''
-        annotationColumnNamesAsArguments = processParams.by.containsKey('annotationColumnNames') ?
-            processParams.by.annotationColumnNames.collect({ '--annotation-column-name' + ' ' + it }).join(' ') :
-            ''
+        compIndexColumnNamesFromAdataAsArguments = ''
+        compIndexColumnNamesFromMetadataAsArguments = ''
+        annotationColumnNamesAsArguments = ''
+        if(processParams.containsKey("by")) {
+            compIndexColumnNamesFromAdataAsArguments = processParams.by.containsKey('compIndexColumnNames') ?
+                processParams.by.compIndexColumnNames.collect { key, value -> return key }.collect({ '--adata-comp-index-column-name ' + ' ' + it }).join(' ') :
+                ''
+            compIndexColumnNamesFromMetadataAsArguments = processParams.by.containsKey('compIndexColumnNames') ?
+                processParams.by.compIndexColumnNames.collect { key, value -> return value }.collect({ '--metadata-comp-index-column-name ' + ' ' + it }).join(' ') :
+                ''
+            annotationColumnNamesAsArguments = processParams.by.containsKey('annotationColumnNames') ?
+                processParams.by.annotationColumnNames.collect({ '--annotation-column-name' + ' ' + it }).join(' ') :
+                ''
+        }
+
+        // sampleColumnName param
+        sampleColumnName = ''
+        if(processParams.containsKey("by")) {
+            sampleColumnName = processParams.by.sampleColumnName
+        } else {
+            // make it backward compatible (see sample_annotate_old_v1.config)
+            sampleColumnName = processParams.sampleColumnName
+        }
 
         """
         ${binDir}/sc_h5ad_annotate_by_sample_metadata.py \
             --sample-id ${sampleId} \
             ${methodAsArgument != '' ? '--method ' + methodAsArgument : '' } \
             ${metadataFilePathAsArgument != '' ? '--metadata-file-path ' + metadataFilePathAsArgument : '' } \
-            ${processParams.by.containsKey("sampleColumnName") ? '--sample-column-name ' + processParams.by.sampleColumnName : '' } \
+            ${'--sample-column-name ' + sampleColumnName} \
             ${compIndexColumnNamesFromAdataAsArguments} \
             ${compIndexColumnNamesFromMetadataAsArguments} \
             ${annotationColumnNamesAsArguments} \
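
The fallback introduced in this hunk, which reads `method` and `sampleColumnName` from the `by` map when it exists and from the top-level `type` and `sampleColumnName` keys otherwise, can be restated in Python for readability. This is a rough sketch of the Groovy logic, not code from the repository; `resolve_sample_annotate_args` is a hypothetical name, and a plain dict stands in for `processParams`.

```python
def resolve_sample_annotate_args(process_params):
    """Resolve (method, sample_column_name) from a params mapping.

    Newer configs nest the settings under 'by'; older configs keep
    'type' and 'sampleColumnName' at the top level.
    """
    if "by" in process_params:
        by = process_params["by"]
        method = by.get("method", "")
        sample_column_name = by.get("sampleColumnName")
    else:
        # Backward-compatible layout.
        method = process_params.get("type", "")
        sample_column_name = process_params.get("sampleColumnName")
    return method, sample_column_name


new_style = {"by": {"method": "sample", "sampleColumnName": "id"}}
old_style = {"type": "sample", "sampleColumnName": "id"}
print(resolve_sample_annotate_args(new_style))
print(resolve_sample_annotate_args(old_style))
```

Both layouts resolve to the same pair here, which is the behaviour the backward-compatibility branch is meant to preserve.
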
+1 −1		bin/arboreto_with_multiprocessing.py
+2 −2		bin/aucell_from_folder.py
+1 −1		bin/convert_multi_runs_features_to_regulons.py
+2 −2		bin/merge_motif_track_loom.py
+1 −1		bin/save_multi_runs_to_loom.py
+3 −3		bin/utils.py
+1 −0		conf/append.config