3. Training a custom model
Given your own counts matrix, TALLSorts offers the ability to train a custom model with custom subtype labels. Below are some example instructions.
- Navigate to the tests/training_example folder.
- Take note of the following files:
  - train_counts.csv: a counts matrix of samples (rows) and gene IDs (columns). Gene labels can be in any format, as long as the data you later test on uses the same gene labels.
  - sample_sheet.csv: CSV file with samples (rows) and subtype labels (columns), representing the true classification of each sample. Cells are 1 if the sample belongs to the subtype, or 0 if not. Note that sample labels must match those in the counts matrix.
  - hierarchy.csv: CSV file that describes the hierarchical relationships of the subtypes. Its rows are of the form <subtype label>,<parent label>. If a subtype has no parent, leave <parent label> blank. Subtype labels should correspond to column headings in the sample sheet CSV file. In this specific case, Subtype E can be further subclassified as either belonging to Subtype F or not.
  - training_params.csv: CSV file that provides arguments to the sklearn LogisticRegression class on which TALLSorts runs. Its rows are subtype labels, and its columns are parameters of the LogisticRegression class. Blank cells take the default values random_state=0, max_iter=10000, tol=0.0001, penalty='l1', solver='saga', C=0.2, class_weight='balanced'.
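The constraints described for these files can be checked before training. The following is a minimal sketch, not part of TALLSorts: the function name `check_training_inputs` is hypothetical, and the assumed layout (a header row plus sample names in the first column for the counts matrix and sample sheet, no header row in the hierarchy file) may differ from the actual example files.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of non-empty rows (lists of cell strings)."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def check_training_inputs(counts_csv, sample_sheet_csv, hierarchy_csv):
    """Return a list of problems found; an empty list means none were found.

    Assumed layout (an assumption, not confirmed by the TALLSorts docs):
    - counts and sample sheet: header row, sample name in the first column
    - hierarchy: no header row, each row is <subtype label>,<parent label>
    """
    problems = []
    counts = read_rows(counts_csv)
    sheet = read_rows(sample_sheet_csv)
    hierarchy = read_rows(hierarchy_csv)

    count_samples = {row[0] for row in counts[1:]}
    subtypes = sheet[0][1:]

    for row in sheet[1:]:
        sample, cells = row[0], row[1:]
        # Every labelled sample should also appear in the counts matrix.
        if sample not in count_samples:
            problems.append(f"sample {sample!r} is missing from the counts matrix")
        # Cells are membership flags: 1 if the sample has the subtype, else 0.
        if any(cell not in ("0", "1") for cell in cells):
            problems.append(f"sample {sample!r} has a cell that is not 0 or 1")

    for row in hierarchy:
        subtype = row[0]
        parent = row[1] if len(row) > 1 else ""  # blank parent = top level
        for label in (subtype, parent):
            if label and label not in subtypes:
                problems.append(f"hierarchy label {label!r} is not a sample sheet column")
    return problems


# Tiny made-up inputs, mirroring the "Subtype F is a child of Subtype E" case.
counts = "sample,gene1,gene2\nS1,10,0\nS2,3,7\n"
sheet = "sample,E,F\nS1,1,1\nS2,0,0\n"
hierarchy = "E,\nF,E\n"
print(check_training_inputs(counts, sheet, hierarchy))
```

To check real files, read each one with `open(path).read()` and pass the text in.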
- Train the model by running the following command. This may take several minutes, depending on the size of your counts matrix.
TALLSorts -m train -s train_counts.csv --ss sample_sheet.csv --hierarchy hierarchy.csv --tp training_params.csv --tc 4 -d trained_model
  - -m train: runs TALLSorts in training mode.
  - -s, --ss, --hierarchy, --tp: paths to the CSV files described in Step 2.
  - -d trained_model: creates a folder trained_model in which to store the trained model object file.
  - --tc 4: uses 4 cores during parallelisation. Defaults to 1 if omitted; adjust the number as appropriate.
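If training is launched from a script, the same command can be assembled in Python. This is a sketch under assumptions: `build_train_command` is a hypothetical helper, it uses only the flags listed above, and it presumes the `TALLSorts` executable is on your PATH.

```python
import shlex


def build_train_command(counts, sample_sheet, hierarchy, params, out_dir, cores=1):
    """Assemble the TALLSorts training command as an argument list."""
    return [
        "TALLSorts",
        "-m", "train",
        "-s", counts,
        "--ss", sample_sheet,
        "--hierarchy", hierarchy,
        "--tp", params,
        "--tc", str(cores),
        "-d", out_dir,
    ]


command = build_train_command(
    "train_counts.csv", "sample_sheet.csv", "hierarchy.csv",
    "training_params.csv", "trained_model", cores=4,
)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.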
- You can find the trained model object custom.pkl.gz under the trained_model directory. You can test that this model object works by running a test on the training counts:
TALLSorts -s train_counts.csv -d train_test_output --mp trained_model/custom.pkl.gz
Note the --mp trained_model/custom.pkl.gz argument, which is the path to the custom model. You can find the results under the train_test_output directory.
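Before passing a model file to --mp, it can help to confirm that the file exists and is readable as gzip. The sketch below assumes, from the .pkl.gz extension alone, that the model is a gzip-compressed file. `model_file_looks_readable` is a hypothetical helper, and it deliberately does not unpickle anything.

```python
import gzip
from pathlib import Path


def model_file_looks_readable(path):
    """Return True if the path is an existing, non-empty gzip stream."""
    path = Path(path)
    if not path.is_file():
        return False
    try:
        with gzip.open(path, "rb") as handle:
            # Reading one byte forces the gzip header to be parsed.
            return bool(handle.read(1))
    except (OSError, EOFError):
        # Not a gzip file (gzip.BadGzipFile is an OSError) or truncated data.
        return False


print(model_file_looks_readable("trained_model/custom.pkl.gz"))
```

This only checks the compression layer. Running the test command above remains the real check that the model works.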
RNA-seq counts used in tests/test_counts.csv were generated from publicly available RNA-seq reads published by:
Autry RJ, Paugh SW, Carter R, Shi L, Liu J, Ferguson DC, et al. Integrative genomic analyses reveal mechanisms of glucocorticoid resistance in acute lymphoblastic leukemia. Nature Cancer. 2020;1(3):329-44.
Sample and gene names were changed to random strings to preserve anonymity.