
More robust DenseMatrix._get_col_stds #436

Merged: 14 commits, Jan 29, 2025

Conversation

mlondschien (Contributor)

xref #414
Checklist

  • Added a CHANGELOG.rst entry

@mlondschien (Contributor, Author)

A non-zero constant column results in

sqrt_arg = transpose_square_dot_weights(self._array, weights) - col_means**2

being close to machine precision, ~1e-15. Then _get_col_stds returns ~1e-8 for that column. The resulting mult of the StandardizedMatrix is large, ~1e8, so at least one entry of np.outer(mult, mult) blows up to ~1e15. As a result, the corresponding entry of StandardizedMatrix.sandwich(np.ones(n)/n) no longer cancels out.

If DenseMatrix._get_col_stds returned a value close to machine precision, mult would be 1 and these instabilities would not occur.

It might be worthwhile to put a check somewhere in the init of StandardizedMatrix: if mult is close to sqrt(machine_precision), the results of sandwich will be inaccurate.
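The cancellation described above can be reproduced outside of tabmat with a minimal plain-NumPy sketch (this is illustrative, not tabmat's actual code):

```python
import numpy as np

# Minimal sketch of the instability (plain NumPy, not tabmat's code):
# the weighted variance of a non-zero constant column, computed via the
# one-pass formula E[w x^2] - (E[w x])^2, suffers catastrophic cancellation.
x = np.full(5, 46.231, dtype=np.float64)  # non-zero constant column
weights = np.full(5, 1 / 5)

col_mean = weights @ x
sqrt_arg = weights @ x**2 - col_mean**2  # mathematically exactly 0

# sqrt_arg is not 0 but a rounding remainder of order eps * x**2 (~1e-13),
# so the resulting "std" is ~1e-6 instead of 0, and its reciprocal (the
# `mult` of StandardizedMatrix) becomes huge.
print(sqrt_arg)
```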

@stanmart (Collaborator)

Nice find, thank you!

@stanmart (Collaborator)

Would the same operation for the sparse matrix also benefit from a similar change? Hopefully we have much fewer non-zero constant columns there, but still :)

sqrt_arg = (
    transpose_square_dot_weights(
        self._array.data,
        self._array.indices,
        self._array.indptr,
        weights,
        weights.dtype,
    )
    - col_means**2
)

@mlondschien (Contributor, Author)

mlondschien commented Jan 24, 2025

Am I seeing this correctly that no tests are run as part of CI?

Yes, because this lives on a different repo.

@mlondschien (Contributor, Author)

Would the same operation for the sparse matrix also benefit from a similar change? Hopefully we have much fewer non-zero constant columns there, but still :)

sqrt_arg = (
    transpose_square_dot_weights(
        self._array.data,
        self._array.indices,
        self._array.indptr,
        weights,
        weights.dtype,
    )
    - col_means**2
)

I guess this would be a more complex change if we want to avoid a loop through all (including zero) entries.
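For reference, here is a hypothetical sketch of how a numerically stable two-pass weighted variance could be done for a CSC matrix without looping over the implicit zeros: their contribution (0 - mean)² can be folded in analytically. The helper name and approach are illustrative, not tabmat code:

```python
import numpy as np
from scipy import sparse

def sparse_col_var_two_pass(A: sparse.csc_matrix, weights: np.ndarray) -> np.ndarray:
    """Weighted column variances of a CSC matrix via the two-pass formula.

    Assumes weights sum to 1. Only stored nonzeros are visited; each
    column's implicit zeros contribute (0 - mean)^2 times their total weight.
    """
    means = np.asarray(A.T @ weights)  # weighted column means
    var = np.zeros(A.shape[1])
    for j in range(A.shape[1]):
        start, end = A.indptr[j], A.indptr[j + 1]
        rows = A.indices[start:end]
        vals = A.data[start:end]
        # contribution of the stored entries
        var[j] = weights[rows] @ (vals - means[j]) ** 2
        # contribution of the implicit zeros: (0 - mean)^2 times their weight
        zero_weight = weights.sum() - weights[rows].sum()
        var[j] += zero_weight * means[j] ** 2
    return var
```

This avoids the one-pass cancellation at the cost of a second pass over the stored entries, which is presumably why the change is more involved on the sparse side.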

@stanmart (Collaborator)

Am I seeing this correctly that no tests are run as part of CI?

We should probably change our CI config such that tests run on pull requests, and not just on pushes.

I guess this would be a more complex change if we want to avoid a loop through all (including zero) entries.

Good point. Let's just skip it then, it should not be an issue anyway.

@stanmart (Collaborator) left a comment:

I will run the benchmarks in a bit to make sure there is no performance regression due to these changes, but otherwise LGTM.

@stanmart (Collaborator)

stanmart commented Jan 24, 2025

Also, could you please add a changelog entry? I think we could consider it a bug fix, but I'm fine with "other changes", too.

Sorry, I found it 🙈

Co-authored-by: Martin Stancsics <[email protected]>
@mlondschien (Contributor, Author)

Any chance we can merge this and do a release :)? I can remove Jan's file and add this as a test. Where should I add it?

@stanmart (Collaborator)

Yes! I can add the test if you don't mind me pushing to this branch. We'll probably want to add a changelog entry, too. How about something like "Improved the robustness of standardization for dense matrices with constant columns"?

@mlondschien (Contributor, Author)

Feel free to push as you'd like.

I already added

Bug fix:

  • A more robust :meth:`DenseMatrix._get_col_stds` results in more accurate :meth:`StandardizedMatrix.sandwich` results.

@stanmart (Collaborator)

Ah, I can't because it's your repo. It's fine though. So I thought something like this might be good in tests/test_matrices.py:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_standardize_constant_cols(dtype):
    X = np.array(
        [
            [46.231056, 126.05263, 144.46439],
            [46.231224, 128.66818, 0.7667693],
            [46.231186, 104.97506, 193.8872],
            [46.230835, 130.10156, 143.88954],
            [46.230896, 116.76007, 7.5629334],
        ],
        dtype=dtype,
    )
    v = np.array(
        [0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype
    )
    weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

    standardized_mat, _, _ = tm.DenseMatrix(X).standardize(
        weights, center_predictors=True, scale_predictors=True
    )

    result = standardized_mat.transpose_matvec(v)
    expected = standardized_mat.toarray().T @ v

    np.testing.assert_allclose(result, expected)

The issue is that the test still fails for float32. Or is that expected and this branch does not fully fix what example.py demonstrates?

@mlondschien (Contributor, Author)

(screenshot) Maybe you are not a maintainer 🤷

@stanmart (Collaborator)

Haha, I thought I was, but who knows 😅

@mlondschien (Contributor, Author)

c02fbc2 still fails.

@stanmart (Collaborator)

Yeah, so it's still failing for float32. Maybe we should add a more direct test instead? It's the standard deviation calculation that was improved, after all.

For example this:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_standardize_almost_constant_cols(dtype):
    X = np.array(
        [
            [46.231056, 126.05263, 144.46439],
            [46.231224, 128.66818, 0.7667693],
            [46.231186, 104.97506, 193.8872],
            [46.230835, 130.10156, 143.88954],
            [46.230896, 116.76007, 7.5629334],
        ],
        dtype=dtype,
    )
    v = np.array(
        [0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype
    )
    weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

    _, means, stds = tm.DenseMatrix(X).standardize(
        weights, center_predictors=True, scale_predictors=True
    )

    decimal = 3 if dtype == np.float32 else 6
    np.testing.assert_almost_equal(means, X.mean(axis=0), decimal=decimal)
    np.testing.assert_almost_equal(stds, X.std(axis=0), decimal=decimal)

fails on main and passes on this branch which is what we want to see.

@stanmart (Collaborator)

stanmart commented Jan 29, 2025

Let me try again if I can make changes to this branch... Never mind :)

@stanmart (Collaborator)

You really do need the limited precision for assert_almost_equal:

    decimal = 3 if dtype == np.float32 else 6
    np.testing.assert_almost_equal(..., decimal=decimal)
    np.testing.assert_almost_equal(..., decimal=decimal)
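For context on why the tolerance has to differ by dtype, the gap follows directly from the machine epsilons of the two floating-point types (a quick standalone check, not part of the PR):

```python
import numpy as np

# float32's machine epsilon is roughly nine orders of magnitude larger
# than float64's, so std estimates of near-constant columns computed in
# float32 carry correspondingly larger relative error.
eps32 = np.finfo(np.float32).eps
eps64 = np.finfo(np.float64).eps
print(eps32, eps64)
```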

@stanmart merged commit 04a6f68 into Quantco:main on Jan 29, 2025 (26 checks passed)
@mlondschien mlondschien deleted the issue-414-b branch January 29, 2025 13:40