4. Command line arguments

AllenZPGu edited this page Nov 10, 2023 · 4 revisions

--mode or -m (required=False)

Whether TALLSorts is run in testing mode or training mode. Accepts the following arguments:

  • test (default): Supply a counts matrix to classify
  • train: Supply a counts matrix and true classifications to generate a TALLSorts classification model, which can then be used in test mode

--samples or -s (required=True)

Path to samples (rows) x genes (columns) csv file representing a raw counts matrix.

  • If mode=test, this will be the testing matrix.
  • If mode=train, this will be the matrix on which the classifiers are trained.
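As a rough illustration of the layout described above, the following sketch writes a small samples (rows) x genes (columns) matrix using only the Python standard library. The sample names and counts are made up, and the blank top-left header cell is an assumption about how the row labels are stored rather than something this page specifies.

```python
import csv
import io

# Hypothetical sample IDs and raw counts, for illustration only.
genes = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]
counts = {
    "sample_A": [120, 0, 845],
    "sample_B": [98, 3, 910],
}

def write_counts_matrix(handle, genes, counts):
    """Write a samples (rows) x genes (columns) raw counts matrix as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    # Header row: blank corner cell (assumed), then one column per gene label.
    writer.writerow([""] + genes)
    # One row per sample: sample ID followed by its raw counts.
    for sample, row in counts.items():
        writer.writerow([sample] + row)

buffer = io.StringIO()
write_counts_matrix(buffer, genes, counts)
matrix_csv = buffer.getvalue()
print(matrix_csv)
```

To produce a file instead of a string, pass a handle from open("counts.csv", "w", newline="") in place of the StringIO buffer.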

--genelabels or --gl (required=False)

If mode=test and you are using the default TALLSorts model, this specifies whether the gene labels in your samples matrix are Ensembl IDs or gene symbols. Accepts the following arguments:

  • id (default): Ensembl ID
  • symbol: Gene symbol

Ignored if mode=train.

--destination or -d (required=False)

Defaults to current working directory.

  • If mode=test, this is the directory where you want the testing report to be saved.
  • If mode=train, this is the directory where the trained classifier model object file custom.pkl.gz will be stored.

--model-path or --mp (required=False)

  • If mode=test, this is the path of the classifier model object file with extension .pkl.gz. Defaults to the TALLSorts default model stored at <root>/models/tallsorts/tallsorts_default_model.pkl.gz
  • If mode=train, this argument will be ignored.

The following arguments are used exclusively when mode=train.

--sample-sheet or --ss (required=True)

Path to CSV file with samples (rows) x subtypes (columns), representing true classifications of each sample. Cells are 1 if the sample belongs to the subtype, or 0 if not.
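One way to build such a 0/1 table from a plain list of labels is sketched below. The sample and subtype names are placeholders, the blank header corner cell is an assumption, and the sketch assigns each sample to exactly one subtype; whether a sample should also be marked 1 for a parent subtype is not covered here.

```python
import csv
import io

# Hypothetical sample -> subtype assignments; all names are placeholders.
labels = {"sample_A": "subtype_1", "sample_B": "subtype_2", "sample_C": "subtype_1"}

def one_hot_sample_sheet(labels):
    """Return (subtypes, rows), where each row is [sample, 0/1 per subtype]."""
    subtypes = sorted(set(labels.values()))
    rows = [
        [sample] + [1 if labels[sample] == subtype else 0 for subtype in subtypes]
        for sample in labels
    ]
    return subtypes, rows

subtypes, rows = one_hot_sample_sheet(labels)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([""] + subtypes)  # header: blank corner cell (assumed), then subtypes
writer.writerows(rows)
sample_sheet_csv = buffer.getvalue()
print(sample_sheet_csv)
```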

--filter or -f (required=False)

If provided, input genes will be filtered by the same method used when generating the default TALLSorts model. Please refer to our publication's Supplementary Information for a description of our method. Leave this flag out if your counts matrix is already pre-filtered.

Important: If you intend to use this flag, you must have pyensembl installed as per Step 5 of our Installation guide, and the gene labels in your samples matrix must be in Ensembl ID form.

--hierarchy (required=True)

Path to CSV file that describes the hierarchical relationships of the subtypes. Its rows are of the form <subtype label>,<parent label>. If a subtype has no parent, leave <parent label> blank. Subtype labels should correspond to column headings in the sample-sheet CSV file.
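The sketch below writes a hierarchy file in the <subtype label>,<parent label> form and checks that every label also appears as a sample-sheet column. The subtype names are invented, the file is written without a header row (an assumption), and check_hierarchy is a hypothetical helper, not part of TALLSorts.

```python
import csv
import io

# Hypothetical hierarchy: subtype -> parent, with "" meaning "no parent".
hierarchy = {"subtype_1": "", "subtype_1a": "subtype_1", "subtype_2": ""}

# Column headings of the matching sample sheet (also hypothetical).
sheet_columns = ["subtype_1", "subtype_1a", "subtype_2"]

def check_hierarchy(hierarchy, sheet_columns):
    """Return the hierarchy labels that do not match any sample-sheet column."""
    known = set(sheet_columns)
    problems = []
    for child, parent in hierarchy.items():
        if child not in known:
            problems.append(child)
        if parent and parent not in known:
            problems.append(parent)
    return problems

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for child, parent in hierarchy.items():
    writer.writerow([child, parent])  # blank second field when there is no parent
hierarchy_csv = buffer.getvalue()

print(hierarchy_csv)
print("unmatched labels:", check_hierarchy(hierarchy, sheet_columns))
```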

--training-params or --tp (required=False)

Path to CSV file that provides arguments to the sklearn LogisticRegression class on which TALLSorts runs. Its rows are subtype labels, and its columns are parameters of the LogisticRegression class.

Omit this argument, or leave cells blank, to use the following default values: random_state=0, max_iter=10000, tol=0.0001, penalty='l1', solver='saga', C=0.2, class_weight='balanced'

See sklearn's docs for more info.
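To make the blank-cell behaviour concrete, here is a minimal sketch of how a per-subtype parameter table could be combined with the defaults listed above. It is not TALLSorts's own parsing code: resolve_params, the example CSV contents, and the blank header corner cell are all assumptions for illustration.

```python
import csv
import io

# Default values as listed on this page.
DEFAULTS = {
    "random_state": 0, "max_iter": 10000, "tol": 0.0001,
    "penalty": "l1", "solver": "saga", "C": 0.2, "class_weight": "balanced",
}

# Hypothetical training-params CSV: rows are subtypes, columns are parameters.
params_csv = """,C,max_iter
subtype_1,0.5,
subtype_2,,5000
"""

def resolve_params(csv_text, defaults):
    """Map each subtype to a full parameter dict, using defaults for blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    label_field = reader.fieldnames[0]  # first column holds the subtype label
    resolved = {}
    for row in reader:
        params = dict(defaults)
        for name, raw in row.items():
            if name == label_field or not raw:
                continue  # blank cell: keep the default value
            # Cast to the type of the default where one exists, else keep the string.
            caster = type(defaults[name]) if name in defaults else str
            params[name] = caster(raw)
        resolved[row[label_field]] = params
    return resolved

resolved = resolve_params(params_csv, DEFAULTS)
for subtype, params in resolved.items():
    print(subtype, params)
```

Each resulting dict has the shape of keyword arguments for sklearn's LogisticRegression, so in this sketch subtype_1 overrides only C and subtype_2 overrides only max_iter.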

--training-cores or --tc (required=False)

Number of cores to use when training TALLSorts in parallel. Uses joblib's parallel_backend. Defaults to 1.
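The arguments on this page can be combined into a single argument list. The sketch below only assembles and checks the list; it does not run anything. build_args is a hypothetical helper, the file names are placeholders, and the list still needs to be prefixed with however TALLSorts is invoked in your installation.

```python
def build_args(mode, samples, sample_sheet=None, hierarchy=None,
               destination=None, cores=None, apply_filter=False):
    """Assemble a TALLSorts argument list, enforcing the train-mode requirements."""
    if mode not in ("test", "train"):
        raise ValueError("mode must be 'test' or 'train'")
    args = ["--mode", mode, "--samples", samples]
    if destination:
        args += ["--destination", destination]
    if mode == "train":
        # --sample-sheet and --hierarchy are both required when training.
        if not (sample_sheet and hierarchy):
            raise ValueError("train mode requires --sample-sheet and --hierarchy")
        args += ["--sample-sheet", sample_sheet, "--hierarchy", hierarchy]
        if apply_filter:
            args.append("--filter")  # flag only, takes no value
        if cores:
            args += ["--training-cores", str(cores)]
    return args

train_args = build_args(
    "train", "counts.csv",
    sample_sheet="sample_sheet.csv", hierarchy="hierarchy.csv",
    destination="out", cores=4,
)
test_args = build_args("test", "counts.csv", destination="out")
print(" ".join(train_args))
print(" ".join(test_args))
```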