Python-based tool for parsing and validating CFPB's RegTech-related data submissions. It uses Pandera, a Pandas-based data testing framework, to define schemas for datasets and to perform all data validations. It is intended to be used as a library for Python-based apps, but can also be used directly via command-line interface.
We are currently focused on implementing the SBL (Small Business Lending) data submission. For details on this dataset and its validations, please see: Filing instructions guide for small business lending data collected in 2024
The following setup must be completed prior to running the CLI utilities or doing any development on the project.
The following software packages are prerequisites to installing this software.
-
Checkout this project
git clone https://github.com/cfpb/regtech-data-validator.git cd regtech-data-validator
-
Install Python packages via Poetry
poetry install
-
Activate Poetry's virtual environment
poetry shell
Note: All Python packages used in project can be found in
pyproject.toml
This project includes the cfpb-val
CLI utility for validating CFPB's RegTech-related
data collection file formats. It currently supports the small business lending (SBL) data
collected for 2024, but may support more formats in the future. This tool is intended for
testing purposes, allowing a quick way to check the validity of a file without having
to submit it through the full filing systems.
$ cfpb-val validate --help
Usage: cfpb-val validate [OPTIONS] PATH
โญโ Arguments โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * path FILE [default: None] [required] โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --context <key>=<value> [example: lei=12345678901234567890] โ
โ --output [csv|json|pandas|table|download] โ
โ [default: table] โ
โ --help Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
-
Validate a file with no findings
$ cfpb-val validate tests/data/sbl-validations-pass.csv status: SUCCESS, findings: 0
-
Validate a file with findings, passing in LEI as context
$ cfpb-val validate tests/data/sbl-validations-fail.csv --context lei=000TESTFIUIDDONOTUSE โญโโโโโโโโโโโโโฌโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ โ finding_no โ record_no โ field_name โ field_value โ validation_severity โ validation_id โ validation_name โ โโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค โ 1 โ 4 โ uid โ 000TESTFIUIDDONOTUSEXBXVID13XTC1 โ error โ E3000 โ uid.duplicates_in_dataset โ โ 2 โ 5 โ uid โ 000TESTFIUIDDONOTUSEXBXVID13XTC1 โ error โ E3000 โ uid.duplicates_in_dataset โ โ 3 โ 0 โ uid โ โ error โ E0001 โ uid.invalid_text_length โ โ 4 โ 1 โ uid โ BXUIDXVID11XTC2 โ error โ E0001 โ uid.invalid_text_length โ โ 5 โ 2 โ uid โ BXUIDXVID11XTC31234567890123456789012345678901 โ error โ E0001 โ uid.invalid_text_length โ โ ... โ ... โ ... โ ... โ ... โ ... โ ... โ โ 115 โ 278 โ po_4_race_baa_ff โ 12345678901234567890123456789012345678901234567890 โ error โ E1000 โ po_4_race_baa_ff.invalid_text_length โ โ 116 โ 290 โ po_4_race_pi_ff โ 12345678901234567890123456789012345678901234567890 โ error โ E1020 โ po_4_race_pi_ff.invalid_text_length โ โ 117 โ 302 โ po_4_gender_flag โ 9001 โ error โ E1040 โ po_4_gender_flag.invalid_enum_value โ โ 118 โ 306 โ po_4_gender_ff โ 12345678901234567890123456789012345678901234567890 โ error โ E1060 โ po_4_gender_ff.invalid_text_length โ โฐโโโโโโโโโโโโโดโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ status: FAILURE, findings: 118, validation phase: Syntactical
-
Validate a file with findings with output in JSON format
$ cfpb-val validate tests/data/sbl-validations-fail.csv --output json [ { "validation": { "id": "E0001", "name": "uid.invalid_text_length", "description": "'Unique identifier' must be at least 21 characters in length and at most 45 characters in length.", "severity": "error" }, "records": [ { "record_no": 0, "fields": [ { "name": "uid", "value": "" } ] }, { "record_no": 1, "fields": [ { "name": "uid", "value": "BXUIDXVID11XTC2" } ] }, { "record_no": 2, "fields": [ { "name": "uid", "value": "BXUIDXVID11XTC31234567890123456789012345678901" } ] } ] }, ... status: FAILURE, findings: 118
This repo includes 2 test datasets, one with all valid data, and one where each line represents a different failed validation, or different permutation of the same failed validation.
We use these test files in for automated test, but can also be passed in via the
cfpb-val
CLI utility for manual testing.
This section is for developer who wish to contribute to this project.
Note: If you simply want to use the cfpb-val tool for testing you data, you don't need to read any further.
- Check functions should focus on reuse.
- Most of the validations share logic with other validations.
- Avoid using lambdas for Check functions.
- They do not promote reuse.
- They are harder to debug.
- They are harder to test.
- Check function signatures should reflect the functionality.
- Check functions should have corresponding unit tests.
- Check definitions' name should be set to validation ID.
- Check new added lines are formatted correctly.
This project uses pytest for automated testing.
$ pytest
======================================================= test session starts =======================================================
platform darwin -- Python 3.12.0, pytest-7.4.0, pluggy-1.3.0 -- /Users/keelerh/Library/Caches/pypoetry/virtualenvs/regtech-data-validator-p5DmiWB0-py3.12/bin/python
cachedir: .pytest_cache
rootdir: /Users/keelerh/Projects/regtech-data-validator
configfile: pyproject.toml
testpaths: tests
plugins: cov-4.1.0, typeguard-4.1.3
collected 112 items
tests/test_check_functions.py::TestInvalidDateFormat::test_with_valid_date PASSED [ 0%]
tests/test_check_functions.py::TestInvalidDateFormat::test_with_invalid_date PASSED [ 1%]
tests/test_check_functions.py::TestInvalidDateFormat::test_with_invalid_day PASSED [ 2%]
tests/test_check_functions.py::TestInvalidDateFormat::test_with_invalid_month PASSED [ 3%]
tests/test_check_functions.py::TestInvalidDateFormat::test_with_invalid_year PASSED [ 4%]
...
tests/test_schema_functions.py::TestValidatePhases::test_with_valid_lei PASSED [ 96%]
tests/test_schema_functions.py::TestValidatePhases::test_with_invalid_data PASSED [ 97%]
tests/test_schema_functions.py::TestValidatePhases::test_with_multi_invalid_data_with_phase1 PASSED [ 98%]
tests/test_schema_functions.py::TestValidatePhases::test_with_multi_invalid_data_with_phase2 PASSED [ 99%]
tests/test_schema_functions.py::TestValidatePhases::test_with_invalid_lei PASSED [100%]
---------- coverage: platform darwin, python 3.12.0-final-0 ----------
Name Stmts Miss Branch BrPart Cover Missing
-----------------------------------------------------------------------------------------
regtech_data_validator/check_functions.py 184 14 78 0 91% 55-59, 111-121, 275-276, 297-298, 420-421
regtech_data_validator/checks.py 15 0 2 0 100%
regtech_data_validator/cli.py 62 62 30 0 0% 1-126
regtech_data_validator/create_schemas.py 61 3 18 3 92% 94, 99, 133
regtech_data_validator/global_data.py 12 0 8 0 100%
regtech_data_validator/phase_validations.py 6 0 0 0 100%
regtech_data_validator/schema_template.py 6 0 0 0 100%
-----------------------------------------------------------------------------------------
TOTAL 346 79 136 3 75%
3 empty files skipped.
Coverage XML written to file coverage.xml
Test coverage details can be found in this project's
python-coverage-comment-action-data
branch.
The (test_csv_to_code_differences.py
) test compares the validation code in
phase_validations.py
against the CSV-based 2024 SBL validation spec
(2024-validations.csv
).
This test checks that the list of validation IDs in one match the other, and will report on IDs that are missing in either.
The test will also validate that all severities (error or warning) match. The test will then
do a hard string compare between the validation descriptions, with a couple of caveats:
- Any python validation check whose description starts with a single quote will first add the single quote to the CSV's description, if one doesn't exist. This is done because if someone modifies the CSV in Excel, Excel will drop the beginning single quote, which it interprets as a formatter telling Excel "this field is a string"
- Certain descriptions in the CSV have 'complex' formatting to produce layouts with lists, new lines and white space that may not compare correctly. Since how validation descriptions will be formatted on the results page for a submission, currently the test will strip off some of this formatting and compare the text.
This test runs automatically as part of our unit testing pipeline. You can also
run the test manually by running the command poetry run pytest tests/test_csv_to_code_differences.py
This will create an errors.csv
file at the root of the repo that can be used to easily view
differences found between the two files.
Normally the pytest will point to the main branch in the sbl-content repo, but you can modify the test to point to a development branch that has upcoming changes, run the test with the above command, and then evaluate what changes may need to be made to the python validation code.
This repository utilizing black
and ruff
libraries to check and fix any
formatting issues.
# Example of Ruff with an error
$ ruff .
tests/test_check_functions.py:205:26: E712 [*] Comparison to `False` should be `cond is False`
Found 1 error.
[*] 1 potentially fixable with the --fix option.
# Example of black with reformatted line
$ black .
reformatted regtech_data_validator/cli.py
All done! โจ ๐ฐ โจ
1 file reformatted, 13 files left unchanged.
This setup uses VS Code's Dev Containers extension to create a standardized development environment, allowing for a quick setup, and a greater guarantee that you'll have all the tools you need to start developing. It does rely heavily on Docker, so some familiarity there is highly recommended.
- Open this repository within VS Code and press
COMMAND + SHIFT + p
on your keyboard. This will open the command bar at the top of your window. - Enter
Dev Containers: Rebuild and Reopen in Container
. VS Code will open a new window and you'll see a status message towards the bottom right of your screen that the container is building and attaching. - This will take a few minutes the first time because Docker needs to build the container without a build cache.
- You may receive a notification that VS Code wants to perform a reload because some extensions could not load. Sometimes this happens because extensions are loaded in conflicting orders and dependencies are not satisfied.
Unit tests can be run through VS Code Test.
CFPB is developing the
RegTech Data Validator
in the open to maximize transparency and encourage
third party contributions. If you want to contribute, please read and abide by
the terms of the License for this project. Pull Requests are always
welcome.
If you have an inquiry or suggestion for the validator or any SBL related code please reach out to us at [email protected]