
3. Training a custom model

AllenZPGu edited this page Nov 8, 2023 · 3 revisions

Given your own counts matrix, TALLSorts can train a custom model with custom subtype labels. The steps below walk through an example.

  1. Navigate to the tests/training_example folder.

  2. Take note of the following files:

  • train_counts.csv: a counts matrix of samples (rows) and gene IDs (columns). Gene labels can be in any arbitrary format as long as you intend to test on the same gene labels.
  • sample_sheet.csv: CSV file with samples (rows) and subtype labels (columns), representing the true classification of each sample. A cell is 1 if the sample belongs to the subtype, or 0 if not. Note that subtype labels must match the counts matrix.
  • hierarchy.csv: CSV file that describes the hierarchical relationships of the subtypes. Its rows are of the form <subtype label>,<parent label>. If a subtype has no parent, leave <parent label> blank. Subtype labels should correspond to column headings in the sample-sheet CSV file. In this specific case, Subtype E can be further subclassified into either belonging to Subtype F or not.
  • training_params.csv: CSV file that provides arguments to the sklearn LogisticRegression class on which TALLSorts runs. Its rows are subtype labels, and its columns are parameters of the LogisticRegression class. Blank cells fall back to the defaults random_state=0, max_iter=10000, tol=0.0001, penalty='l1', solver='saga', C=0.2, class_weight='balanced'.
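The blank-cell rule for training_params.csv can be illustrated with a short Python sketch. This is not TALLSorts' own parsing code. The column header name `subtype`, the example rows, and the helper names `parse_cell` and `resolve_params` are assumptions made for illustration; only the default values come from the description above.

```python
import csv
import io

# Defaults listed above for blank cells in training_params.csv.
DEFAULTS = {
    "random_state": 0,
    "max_iter": 10000,
    "tol": 0.0001,
    "penalty": "l1",
    "solver": "saga",
    "C": 0.2,
    "class_weight": "balanced",
}


def parse_cell(text):
    """Convert a CSV cell to int or float where possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def resolve_params(csv_text):
    """Return {subtype label: keyword arguments}, filling blank cells from DEFAULTS."""
    resolved = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    label_col = reader.fieldnames[0]  # assumption: first column holds the subtype label
    for row in reader:
        kwargs = dict(DEFAULTS)
        for name, cell in row.items():
            if name != label_col and cell is not None and cell.strip():
                kwargs[name] = parse_cell(cell.strip())
        resolved[row[label_col]] = kwargs
    return resolved


# Hypothetical file contents: Subtype A overrides C, Subtype B leaves every cell blank.
example = "subtype,C,max_iter\nSubtype A,0.5,\nSubtype B,,\n"
params = resolve_params(example)
print(params["Subtype A"]["C"], params["Subtype B"]["C"])
```

Each resolved dictionary has the shape that could be passed as keyword arguments to sklearn's LogisticRegression, for example `LogisticRegression(**params["Subtype A"])`, if scikit-learn is installed.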
  3. Train the model by running the following command. This may take several minutes, depending on the size of your counts matrix.
TALLSorts -m train -s train_counts.csv --ss sample_sheet.csv --hierarchy hierarchy.csv --tp training_params.csv --tc 4 -d trained_model
  • -m train: Runs TALLSorts in training mode
  • -s, --ss, --hierarchy, --tp: Paths to the CSV files described in Step 2.
  • -d trained_model: creates a folder trained_model in which to store the trained model object file
  • --tc 4: uses 4 cores during parallelisation. Defaults to 1 if omitted; adjust the number as appropriate.
  4. You can find the trained model object custom.pkl.gz under the trained_model directory. You can test that this model object works by running a test on the training counts:
TALLSorts -s train_counts.csv -d train_test_output --mp trained_model/custom.pkl.gz

Note the --mp trained_model/custom.pkl.gz argument, which gives the path to the custom model. You can find the results under the train_test_output directory.
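Before training on your own data, it can help to check that the input files agree with each other. The following Python sketch is not part of TALLSorts. It assumes that the first column of both the counts matrix and the sample sheet holds the sample ID, that the two files should list the same samples, and that hierarchy.csv has no header row. The function name `check_inputs` and the toy data are hypothetical.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(csv_text)))


def check_inputs(counts_text, sheet_text, hierarchy_text):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    counts = read_rows(counts_text)
    sheet = read_rows(sheet_text)
    subtypes = sheet[0][1:]  # header row minus the sample-ID column

    # Assumption: both files should describe the same set of samples.
    count_samples = {row[0] for row in counts[1:]}
    sheet_samples = {row[0] for row in sheet[1:]}
    if count_samples != sheet_samples:
        problems.append(f"sample IDs differ: {sorted(count_samples ^ sheet_samples)}")

    # Sample-sheet cells must be 1 (belongs to the subtype) or 0 (does not).
    for row in sheet[1:]:
        bad = [cell for cell in row[1:] if cell not in ("0", "1")]
        if bad:
            problems.append(f"{row[0]}: cells must be 0 or 1, got {bad}")

    # Hierarchy rows are <subtype label>,<parent label>; a blank parent means no parent.
    for row in read_rows(hierarchy_text):
        if not row:
            continue
        label = row[0]
        parent = row[1] if len(row) > 1 else ""
        for name in (label, parent):
            if name and name not in subtypes:
                problems.append(f"hierarchy label {name!r} is not a sample-sheet column")
    return problems


# Toy data mirroring the example above, where Subtype F is a child of Subtype E.
counts = "sample,gene1,gene2\ns1,10,0\ns2,3,7\n"
sheet = "sample,Subtype E,Subtype F\ns1,1,1\ns2,1,0\n"
hierarchy = "Subtype E,\nSubtype F,Subtype E\n"
print(check_inputs(counts, sheet, hierarchy))
```

With real files, the three arguments would be the text of train_counts.csv, sample_sheet.csv, and hierarchy.csv, read with `open(path).read()`.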

References

RNA-seq counts used in tests/test_counts.csv were generated from publicly-available RNA-seq reads published by:

Autry RJ, Paugh SW, Carter R, Shi L, Liu J, Ferguson DC, et al. Integrative genomic analyses reveal mechanisms of glucocorticoid resistance in acute lymphoblastic leukemia. Nature Cancer. 2020;1(3):329-44.

Sample and gene names were changed to random strings to preserve anonymity.
