
More robust DenseMatrix._get_col_stds #436

Merged: 14 commits, Jan 29, 2025

Conversation

mlondschien (Contributor)

xref #414
Checklist

  • Added a CHANGELOG.rst entry

@mlondschien (Contributor, Author)

A non-zero constant column results in

sqrt_arg = transpose_square_dot_weights(self._array, weights) - col_means**2

being close to machine precision, ~1e-15. Then _get_col_stds returns ~1e-8 for that column. The resulting mult of the StandardizedMatrix is large, ~1e8, so at least one entry of np.outer(mult, mult) blows up to ~1e15. As a result, the corresponding entry of StandardizedMatrix.sandwich(np.ones(n)/n) no longer cancels out.

If DenseMatrix._get_col_stds returned a value close to machine precision, mult would be 1 and these instabilities would not occur.

It might be worthwhile to put a check somewhere in the init of StandardizedMatrix: if mult is close to sqrt(machine_precision), the results of sandwich will be inaccurate.
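The cancellation described above can be reproduced outside of tabmat with a minimal plain-NumPy sketch (this is illustrative, not tabmat's actual code):

```python
import numpy as np

# Minimal sketch of the instability (plain NumPy, not tabmat's code):
# the weighted variance of a non-zero constant column, computed via the
# one-pass formula E[w x^2] - (E[w x])^2, suffers catastrophic cancellation.
x = np.full(5, 46.231, dtype=np.float64)  # non-zero constant column
weights = np.full(5, 1 / 5)

col_mean = weights @ x
sqrt_arg = weights @ x**2 - col_mean**2  # mathematically exactly 0

# sqrt_arg is not 0 but a rounding remainder of order eps * x**2 (~1e-13),
# so the resulting "std" is ~1e-6 instead of 0, and its reciprocal (the
# `mult` of StandardizedMatrix) becomes huge.
print(sqrt_arg)
```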

@stanmart (Collaborator)

Nice find, thank you!

@stanmart (Collaborator)

Would the same operation for the sparse matrix also benefit from a similar change? Hopefully we have much fewer non-zero constant columns there, but still :)

sqrt_arg = (
    transpose_square_dot_weights(
        self._array.data,
        self._array.indices,
        self._array.indptr,
        weights,
        weights.dtype,
    )
    - col_means**2
)

@mlondschien (Contributor, Author)

mlondschien commented Jan 24, 2025

Am I seeing this correctly that no tests are run as part of CI?

Yes, because this lives on a different repo.

@mlondschien (Contributor, Author)

Would the same operation for the sparse matrix also benefit from a similar change? Hopefully we have much fewer non-zero constant columns there, but still :)

sqrt_arg = (
    transpose_square_dot_weights(
        self._array.data,
        self._array.indices,
        self._array.indptr,
        weights,
        weights.dtype,
    )
    - col_means**2
)

I guess this would be a more complex change if we want to avoid a loop through all (including zero) entries.
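For reference, here is a hypothetical sketch of how a numerically stable two-pass weighted variance could be done for a CSC matrix without looping over the implicit zeros: their contribution (0 - mean)² can be folded in analytically. The helper name and approach are illustrative, not tabmat code:

```python
import numpy as np
from scipy import sparse

def sparse_col_var_two_pass(A: sparse.csc_matrix, weights: np.ndarray) -> np.ndarray:
    """Weighted column variances of a CSC matrix via the two-pass formula.

    Assumes weights sum to 1. Only stored nonzeros are visited; each
    column's implicit zeros contribute (0 - mean)^2 times their total weight.
    """
    means = np.asarray(A.T @ weights)  # weighted column means
    var = np.zeros(A.shape[1])
    for j in range(A.shape[1]):
        start, end = A.indptr[j], A.indptr[j + 1]
        rows = A.indices[start:end]
        vals = A.data[start:end]
        # contribution of the stored entries
        var[j] = weights[rows] @ (vals - means[j]) ** 2
        # contribution of the implicit zeros: (0 - mean)^2 times their weight
        zero_weight = weights.sum() - weights[rows].sum()
        var[j] += zero_weight * means[j] ** 2
    return var
```

This avoids the one-pass cancellation at the cost of a second pass over the stored entries, which is presumably why the change is more involved on the sparse side.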

@stanmart (Collaborator)

Am I seeing this correctly that no tests are run as part of CI?

We should probably change our CI config such that tests run on pull requests, and not just on pushes.

I guess this would be a more complex change if we want to avoid a loop through all (including zero) entries.

Good point. Let's just skip it then, it should not be an issue anyway.

@stanmart (Collaborator) left a comment:

I will run the benchmarks in a bit to make sure there is no performance regression due to these changes, but otherwise LGTM.

@stanmart (Collaborator)

stanmart commented Jan 24, 2025

Also, could you please add a changelog entry? I think we could consider it a bug fix, but I'm fine with "other changes", too.

Sorry, I found it 🙈

Co-authored-by: Martin Stancsics <[email protected]>
@mlondschien (Contributor, Author)

Any chance we can merge this and do a release :)? I can remove Jan's file and add this as a test. Where should I add it?

@stanmart (Collaborator)

Yes! I can add the test if you don't mind me pushing to this branch. We'll probably want to add a changelog entry, too. How about something like "Improved the robustness of standardization for dense matrices with constant columns"?

@mlondschien (Contributor, Author)

Feel free to push as you'd like.

I already added

Bug fix:

  • A more robust :meth:`DenseMatrix._get_col_stds` results in more accurate :meth:`StandardizedMatrix.sandwich` results.

@stanmart (Collaborator)

Ah, I can't because it's your repo. It's fine though. So I thought something like this might be good in tests/test_matrices.py:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_standardize_constant_cols(dtype):
    X = np.array(
        [
            [46.231056, 126.05263, 144.46439],
            [46.231224, 128.66818, 0.7667693],
            [46.231186, 104.97506, 193.8872],
            [46.230835, 130.10156, 143.88954],
            [46.230896, 116.76007, 7.5629334],
        ],
        dtype=dtype,
    )
    v = np.array(
        [0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype
    )
    weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

    standardized_mat, _, _ = tm.DenseMatrix(X).standardize(
        weights, center_predictors=True, scale_predictors=True
    )

    result = standardized_mat.transpose_matvec(v)
    expected = standardized_mat.toarray().T @ v

    np.testing.assert_allclose(result, expected)

The issue is that the test still fails for float32. Or is that expected and this branch does not fully fix what example.py demonstrates?

@mlondschien (Contributor, Author)

(screenshot) Maybe you are not a maintainer 🤷

@stanmart (Collaborator)

Haha, I thought I was, but who knows 😅

@mlondschien (Contributor, Author)

c02fbc2 still fails.

@stanmart (Collaborator)

Yeah, so it's still failing for float32. Maybe we should add a more direct test instead? It's the standard deviation calculation that was improved, after all.

For example this:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_standardize_almost_constant_cols(dtype):
    X = np.array(
        [
            [46.231056, 126.05263, 144.46439],
            [46.231224, 128.66818, 0.7667693],
            [46.231186, 104.97506, 193.8872],
            [46.230835, 130.10156, 143.88954],
            [46.230896, 116.76007, 7.5629334],
        ],
        dtype=dtype,
    )
    v = np.array(
        [0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype
    )
    weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

    _, means, stds = tm.DenseMatrix(X).standardize(
        weights, center_predictors=True, scale_predictors=True
    )

    decimal = 3 if dtype == np.float32 else 6
    np.testing.assert_almost_equal(means, X.mean(axis=0), decimal=decimal)
    np.testing.assert_almost_equal(stds, X.std(axis=0), decimal=decimal)

fails on main and passes on this branch which is what we want to see.

@stanmart (Collaborator)

stanmart commented Jan 29, 2025

Let me try again if I can make changes to this branch... Never mind :)

@stanmart (Collaborator)

You really do need the limited precision for assert_almost_equal:

    decimal = 3 if dtype == np.float32 else 6
    np.testing.assert_almost_equal(..., decimal=decimal)
    np.testing.assert_almost_equal(..., decimal=decimal)
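For context on why the tolerance has to differ by dtype, the gap follows directly from the machine epsilons of the two floating-point types (a quick standalone check, not part of the PR):

```python
import numpy as np

# float32's machine epsilon is roughly nine orders of magnitude larger
# than float64's, so std estimates of near-constant columns computed in
# float32 carry correspondingly larger relative error.
eps32 = np.finfo(np.float32).eps
eps64 = np.finfo(np.float64).eps
print(eps32, eps64)
```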

@stanmart merged commit 04a6f68 into Quantco:main on Jan 29, 2025 (26 checks passed)
@mlondschien mlondschien deleted the issue-414-b branch January 29, 2025 13:40