This is the source code for our paper:
Fernandes, I.K., Vieira, C.C., Dias, K.O.G. et al. Using machine learning to combine genetic and environmental data for maize grain yield predictions across multi-environment trials. Theor Appl Genet 137, 189 (2024). https://doi.org/10.1007/s00122-024-04687-w.
Citation:
```bibtex
@article{fernandes2024,
  author  = {Fernandes, Igor K. and Vieira, Caio C. and Dias, Kaio O. G. and Fernandes, Samuel B.},
  title   = {Using machine learning to combine genetic and environmental data for maize grain yield predictions across multi-environment trials},
  journal = {Theoretical and Applied Genetics},
  year    = {2024},
  month   = {Jul},
  day     = {23},
  volume  = {137},
  number  = {8},
  pages   = {189},
  issn    = {1432-2242},
  doi     = {10.1007/s00122-024-04687-w},
  url     = {https://doi.org/10.1007/s00122-024-04687-w}
}
```
Before you start reproducing the results, here are some important notes:
- You will need a lot of disk space to run all the experiments.
- The scripts were run on an HPC cluster using SLURM, so you may need to rename the job partitions to match your cluster (check the `.sh` files).
After cloning the repository, download the data here, extract it, and put both the `Training_Data` and `Testing_Data` folders inside the `data` folder. Unzip the VCF file `Training_Data/5_Genotype_Data_All_Years.vcf.zip`.
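For reference, the unzip step can be run from the repository root like this (a minimal sketch; the path comes from the data layout described above):

```bash
# extract the genotype VCF in place, keeping it inside Training_Data
unzip data/Training_Data/5_Genotype_Data_All_Years.vcf.zip -d data/Training_Data/
```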
The folder structure should be as follows:

```
Maize_GxE_Prediction/
├── data/
│   ├── Training_Data/
│   └── Testing_Data/
├── src/
├── logs/
├── output/
│   ├── cv0/
│   ├── cv1/
│   └── cv2/
```
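If the `logs` and `output` folders do not exist yet, you can create them up front so the job scripts have somewhere to write (a small convenience sketch based on the tree above; the repository may already ship these folders):

```bash
# create the log and output directories expected by the jobs
mkdir -p logs output/cv0 output/cv1 output/cv2
```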
Install the conda environment:

```bash
conda env create -f environment.yml
```
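Remember to activate the environment before running anything; the name below is hypothetical, so use the name defined in `environment.yml`:

```bash
# list environments, then activate the one created from environment.yml
conda env list
conda activate maize_gxe  # hypothetical name; check environment.yml
```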
Install the R packages:

```r
# from CRAN
install.packages("arrow")
install.packages("data.table")
install.packages("AGHmatrix")
install.packages("devtools")

# asreml (used for the BLUEs and FA models) is a licensed package from VSN International, not a CRAN package
install.packages("asreml")

# from a GitHub source
setRepositories(ind = 1:2)
devtools::install_github("samuelbfernandes/simplePHENOTYPES")
```
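As a quick sanity check (a sketch; the package names are taken from the commands above), confirm that everything loads from the shell:

```bash
# fail fast if any required R package is missing
Rscript -e 'invisible(lapply(c("arrow", "data.table", "AGHmatrix", "simplePHENOTYPES", "asreml"), library, character.only = TRUE))'
```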
- Create BLUEs:

```bash
JOB_BLUES=$(sbatch --parsable 1-job_blues.sh)
```
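The `--parsable` flag makes `sbatch` print only the job ID, which is captured here to chain the later steps with `--dependency`. To monitor a submitted job (a generic SLURM sketch, not specific to this repository):

```bash
# show your queued/running jobs, then the state of the BLUEs job
squeue -u "$USER"
sacct -j "$JOB_BLUES" --format=JobID,JobName,State,Elapsed
```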
- Create the datasets for the cross-validation schemes:

```bash
JOB_DATASETS=$(sbatch --dependency=afterok:$JOB_BLUES --parsable 2-job_datasets.sh)
```
- Filter the VCF and create the kinship matrices (you will need `vcftools` and `plink` here):

```bash
JOB_GENOMICS=$(sbatch --dependency=afterok:$JOB_DATASETS --parsable 3-job_genomics.sh)
```
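Before submitting this step, you can check that both external tools are on your `PATH`; on many clusters they are provided as environment modules (the module names below are hypothetical):

```bash
# verify the external tools are available
command -v vcftools plink || module load vcftools plink  # hypothetical module names
```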
- Create the Kronecker products between the environmental and genomic relationship matrices (this will take some hours):

```bash
JOB_KRON=$(sbatch --dependency=afterok:$JOB_GENOMICS --parsable 4-job_kroneckers.sh)
```
- Fit the E models:

```bash
for i in {1..10}; do sbatch --export=seed=${i} --job-name=Eseed${i} --output=logs/job_e_seed${i}.txt 5-job_e.sh; done
```
- Fit the G and G+E models:

```bash
for i in {1..10}; do sbatch --export=seed=${i} --job-name=Gseed${i} --output=logs/job_g_seed${i}.txt 6-job_g.sh; done
```
- Fit the GxE models (this will take several hours):

```bash
for i in {1..10}; do sbatch --export=seed=${i} --job-name=GxEs${i} --output=logs/job_gxe_seed${i}.txt --dependency=afterok:$JOB_KRON --parsable 7-job_gxe.sh; done
```
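Each iteration submits one job per seed, with `--export=seed=${i}` passing the seed to the script as an environment variable. If a single repetition fails, it can be resubmitted on its own (a sketch using seed 6 as an example; drop the `--dependency` flag if the Kronecker job has already finished):

```bash
# resubmit only one repetition, keeping the same log naming convention
sbatch --export=seed=6 --job-name=GxEs6 --output=logs/job_gxe_seed6.txt --dependency=afterok:$JOB_KRON 7-job_gxe.sh
```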
- Fit the GBLUP FA(1) models (this will take several hours):

```bash
for i in {1..10}; do sbatch --export=seed=${i} --job-name=faS${i} --output=logs/job_fa_seed${i}.txt 8-job_fa.sh; done
```
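To see at a glance which of the submitted model jobs have finished (a generic SLURM accounting sketch; the job-name patterns come from the submission loops above):

```bash
# summarize the state of all model-fitting jobs by name
sacct -X --format=JobName%20,State,Elapsed | grep -E 'Eseed|Gseed|GxEs|faS'
```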
Some files in `output` will be large, particularly the Kronecker files, so you may want to delete them afterwards.
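To see where the space is going before deleting anything (a simple sketch; the exact Kronecker file names depend on the scripts, so check them before removing files):

```bash
# report per-directory disk usage, largest first
du -sh output/* | sort -rh
```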
We can check some results directly from the terminal. Here are some examples:
Check some GxE results:

```bash
find logs/ -name 'gxe_*' | xargs grep 'RMSE' | head
```
Store the SVD explained variances:

```bash
find logs/ -name '*cv*' | xargs grep 'variance' > logs/svd_explained_variance.txt
```
Check the accuracy of the GBLUP FA(1) models in CV0:

```bash
grep '\[1\]' logs/fa_cv0*
```
Check which models are done for GxE in one of the repetitions:

```bash
cat logs/job_gxe_seed6.txt
```
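To skim progress across all ten repetitions at once (a sketch based on the log naming used in the submission loops):

```bash
# show the last few lines of every GxE repetition log
tail -n 5 logs/job_gxe_seed*.txt
```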