The dataset used with this workflow is derived from Fannie Mae’s Single-Family Loan Performance Data with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.
To acquire this dataset, please visit the RAPIDS Datasets Homepage.
The Mortgage workflow is composed of three core phases:
- ETL - Extract, Transform, Load
- Data Conversion
- ML - Training
In the ETL phase, data is:
- Read in from storage
- Transformed to emphasize key features
- Loaded into volatile memory for conversion
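The three ETL steps above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: it uses pandas on the CPU as a stand-in (cuDF provides a similar `read_csv`), and the sample rows, the column names, and the `run_etl` helper are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited sample; the real files have many more columns.
RAW = """loan_id|orig_upb|interest_rate|delinquency_status
1|100000|3.5|0
2|250000|4.25|3
3|180000|3.875|0
"""


def run_etl(source):
    """Hypothetical helper that mirrors the read / transform / load steps."""
    # Extract: read the delimited text from storage (here, an in-memory buffer).
    df = pd.read_csv(source, sep="|")
    # Transform: derive a binary label and keep only the columns of interest.
    df["delinquent"] = (df["delinquency_status"] > 0).astype("int8")
    features = df[["orig_upb", "interest_rate", "delinquent"]]
    # Load: the returned frame stays in memory for the conversion phase.
    return features


features = run_etl(io.StringIO(RAW))
print(features)
```

In this sketch the "load" step is simply returning the in-memory frame; nothing is written back to disk before conversion.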
In the Data Conversion phase, features are:
- Broken into (labels, data) pairs
- Distributed across many workers
- Converted into compressed sparse row (CSR) matrix format for XGBoost
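As a rough illustration of the conversion steps above, the sketch below splits a small table into row blocks (standing in for per-worker partitions), separates each block into a (labels, data) pair, and converts the data to CSR with SciPy. The table values, the choice of the last column as the label, and the `to_labels_and_csr` helper are assumptions made for this example only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical feature table; the last column is assumed to hold the label.
table = np.array([
    [0.0, 3.5, 0.0, 0.0],
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 0.0],
])


def to_labels_and_csr(block):
    """Hypothetical helper: split one block into (labels, CSR data)."""
    labels = block[:, -1]
    data = csr_matrix(block[:, :-1])  # only nonzero entries are stored
    return labels, data


# Split the rows into two blocks, one per (simulated) worker.
pairs = [to_labels_and_csr(block) for block in np.array_split(table, 2)]

for labels, data in pairs:
    # data.data holds the nonzero values, data.indices their column ids,
    # and data.indptr the offsets at which each row starts.
    print(labels, data.data, data.indices, data.indptr)
```

CSR stores only the nonzero values plus two index arrays, which is why it suits wide feature tables in which most entries are zero.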
In the ML training phase, the CSR data is fed into a distributed training session with xgboost.dask.
We regularly benchmark RAPIDS on this workload to measure its performance against both Apache Spark on CPUs and earlier versions of RAPIDS.