More robust DenseMatrix._get_col_stds
#436
Conversation
A non-zero constant column results in a standard deviation close to machine precision (~1e-15), which then causes problems when the column is scaled by it. It might be worthwhile to put a check somewhere in the init.
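To make the failure mode concrete, here is a minimal reproduction sketch (my own construction, not taken from the issue; it only uses the `tm.DenseMatrix(...).standardize(...)` API that appears in the tests further down, and the exact std value observed depends on dtype and rounding):

```python
import numpy as np
import tabmat as tm

# Column 0 is a non-zero constant; the other columns vary.
X = np.array(
    [
        [46.23, 126.05, 144.46],
        [46.23, 128.67, 0.77],
        [46.23, 104.98, 193.89],
        [46.23, 130.10, 143.89],
        [46.23, 116.76, 7.56],
    ]
)
weights = np.full(X.shape[0], 1 / X.shape[0])

_, means, stds = tm.DenseMatrix(X).standardize(
    weights, center_predictors=True, scale_predictors=True
)

# Without the fix, stds[0] can come out as a tiny non-zero value
# (on the order of machine precision, ~1e-15) instead of exactly 0,
# so the implied scaling factor 1 / stds[0] blows up.
print(stds)
```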
Nice find, thank you!
Would the same operation for the sparse matrix also benefit from a similar change? Hopefully we have many fewer non-zero constant columns there, but still :)

(Referenced code: tabmat/src/tabmat/sparse_matrix.py, lines 297 to 306 at 8662571.)
Am I seeing this correctly that no tests are run as part of CI?

Yes, because this lives on a different repo.
I guess this would be a more complex change if we want to avoid a loop through all (including zero) entries.
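For what it's worth, one way to compute a robust weighted column standard deviation for a CSC matrix without looping over the zero entries is to accumulate the squared deviations of the stored entries and add the zeros' contribution in closed form. This is only an illustrative sketch in plain NumPy/SciPy (the function name and the per-column Python loop are mine, not tabmat's actual implementation):

```python
import numpy as np
import scipy.sparse as sps


def robust_sparse_col_stds(X: sps.csc_matrix, weights: np.ndarray) -> np.ndarray:
    """Weighted column standard deviations of a CSC matrix.

    Uses the two-pass formula sum(w_i * (x_i - mean)^2), but only touches
    the stored (nonzero) entries: each implicit zero contributes
    w_i * (0 - mean)^2 = w_i * mean^2, which can be added in closed form.
    """
    total_weight = weights.sum()
    means = X.T @ weights / total_weight  # weighted column means
    variances = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        start, end = X.indptr[j], X.indptr[j + 1]
        rows = X.indices[start:end]
        vals = X.data[start:end]
        w_nz = weights[rows]
        # contribution of the explicitly stored entries
        var_nz = (w_nz * (vals - means[j]) ** 2).sum()
        # contribution of the implicit zeros
        var_zero = (total_weight - w_nz.sum()) * means[j] ** 2
        variances[j] = (var_nz + var_zero) / total_weight
    return np.sqrt(variances)
```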
We should probably change our CI config such that tests run on pull requests, and not just on pushes.
Good point. Let's just skip it then, it should not be an issue anyway.
I will run the benchmarks in a bit to make sure there is no performance regression due to these changes, but otherwise LGTM.
Sorry, I found it 🙈
Co-authored-by: Martin Stancsics <[email protected]>
Any chance we can merge this and do a release :)? I can remove Jan's file and add this as a test. Where should I add it?
Yes! I can add the test if you don't mind me pushing to this branch. We'll probably want to add a changelog entry, too. How about something like "Improved the robustness of standardization for dense matrices with constant columns"?
Feel free to push as you'd like. I already added
Ah, I can't because it's your repo. It's fine though. So I thought something like this might be good in the test suite:

```python
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_standardize_constant_cols(dtype):
    X = np.array(
        [
            [46.231056, 126.05263, 144.46439],
            [46.231224, 128.66818, 0.7667693],
            [46.231186, 104.97506, 193.8872],
            [46.230835, 130.10156, 143.88954],
            [46.230896, 116.76007, 7.5629334],
        ],
        dtype=dtype,
    )
    v = np.array(
        [0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype
    )
    weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
    standardized_mat, _, _ = tm.DenseMatrix(X).standardize(
        weights, center_predictors=True, scale_predictors=True
    )
    result = standardized_mat.transpose_matvec(v)
    expected = standardized_mat.toarray().T @ v
    np.testing.assert_allclose(result, expected)
```

The issue is that the test still fails for `np.float32`.
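(A guess as to why `np.float32` struggles here: `np.testing.assert_allclose` defaults to `rtol=1e-07`, which is right at the machine epsilon of `float32`, so two legitimately single-precision computations of the same quantity can easily miss that tolerance.)

```python
import numpy as np

eps32 = np.finfo(np.float32).eps        # ~1.19e-07: one ulp at 1.0 in float32
x = np.float32(1.0)
y = np.float32(1.0) + eps32             # differs from x by a single ulp

# The default rtol=1e-07 of assert_allclose is smaller than that ulp,
# so even this minimal float32 rounding difference would be flagged.
print(abs(y - x) <= 1e-7 * abs(x))      # False
```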
Haha, I thought I was, but who knows 😅
c02fbc2 still fails.
Yeah, so it's failing for `np.float32`. For example, this:

```python
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_standardize_almost_constant_cols(dtype):
    X = np.array(
        [
            [46.231056, 126.05263, 144.46439],
            [46.231224, 128.66818, 0.7667693],
            [46.231186, 104.97506, 193.8872],
            [46.230835, 130.10156, 143.88954],
            [46.230896, 116.76007, 7.5629334],
        ],
        dtype=dtype,
    )
    v = np.array(
        [0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype
    )
    weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
    _, means, stds = tm.DenseMatrix(X).standardize(
        weights, center_predictors=True, scale_predictors=True
    )
    decimal = 3 if dtype == np.float32 else 6
    np.testing.assert_almost_equal(means, X.mean(axis=0), decimal=decimal)
    np.testing.assert_almost_equal(stds, X.std(axis=0), decimal=decimal)
```

fails on main and passes on this branch, which is what we want to see.
You really do need the limited precision for `np.float32`:

```python
decimal = 3 if dtype == np.float32 else 6
np.testing.assert_almost_equal(..., decimal=decimal)
np.testing.assert_almost_equal(..., decimal=decimal)
```
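For reference, the 3-decimal tolerance is roughly what single precision can deliver at these magnitudes: the column values are on the order of 1e2, where one `float32` ulp is already about 1.5e-5, so a few accumulated rounding steps quickly land in the 1e-4 range.

```python
import numpy as np

print(np.finfo(np.float32).eps)        # ~1.19e-07 relative precision
print(np.finfo(np.float64).eps)        # ~2.22e-16 relative precision

# Absolute spacing between adjacent float32 values near the data's scale:
print(np.spacing(np.float32(130.0)))   # ~1.53e-05
```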
xref #414
Checklist
- CHANGELOG.rst entry