Fix and optimize bin1d_vec + extend unit tests + fix imprecise bin edge creation #270
Conversation
The tolerance for bin size `h` must be based on eps and the _involved_ numbers in calculating it, not the resulting number. The two involved numbers are of the same order of magnitude, and one of them is `a0` (the first bin edge `bin[0] == min(bins)`), for which we already calculate the tolerance (`a0_tol`). So we can use `a0_tol` also as the tolerance for `h`.
…on dtype
* outsource `tol` from if condition
* reorder some operations/checks
* reorder/group terms in idx calculation to honor floating-point arithmetic
* revise/add comments
... due to assuming/requiring sorted bins – as mentioned in the docstring: "`bins [...] must be monotonically increasing`". This is not checked, presumably because pyCSEP itself never passes unsorted bins to this method (they always originate from ranges). → Added a check at the top, but commented it out; we may activate it in the future if there is a need – for now, it only serves as an additional hint (see the sketch below).
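A sketch of what such a commented-out check could look like (an illustrative assumption, not necessarily the committed code):

# sanity check at the top of bin1d_vec (currently commented out):
# if numpy.any(numpy.diff(bins) < 0):
#     raise ValueError("bins must be monotonically increasing")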
i.e., don't create array copies if the input is already an array.
…operating on arrays ... by simply ensuring that `idx` is an array.
…ex_of` (`bin1d_vec` already does that with `.asarray`)
... even in the unit tests (only used when binning magnitudes)
Notes:
* specifying `tol=0.00001` was not even necessary with the revised tolerance section of commit 949d342 (albeit most of the `tol=0.00001` were introduced by this commit);
* Why? Because magnitude bins are "cleanly" created with `np.arange()` in `csep.core.regions.magnitude_bins`;
* their redundancy is confirmed by passing the corresponding unit test `test_scalar_outside` without `tol=0.00001` using the unfixed `bin1d_vec`;
* yet, I left the optional `tol=None` argument in `bin1d_vec` and the functions that call it – just in case there is a real need to override the tolerance.
* in `test_calc.TestBin1d`:
  * add `test_bin1d_single_bin2`, which inputs a single bin _and_ a single point;
  * split off a new `test_scalar_inside` from `test_scalar_outside`, which checks:
    * _all_ corner cases (*.*5)
    * _all_ bin centers (*.*0)
    * _all_ bin ends (*.*99999999)
    * one large value (10)
* add separate class `TestBin2d` in `test_region`, which makes more realistic unit tests by extending the self-check mentioned in SCECcode#255 and performing it...
  * for three regions: Italy, California, NZ (New Zealand);
  * for all origins, mid-points, and end-corners (in the bin, at the opposite side of the origin);
  * for double precision (_float64_ / _f8_) and single precision (_float32_ / _f4_) of the points (not the bin edges);
  * as a loop (over each point individually) and as a single vector (all points at once).
  * ==> 36 unit test combinations
* (albeit targeting `bin1d_vec`, it also unit-tests the region's `CartesianGrid2D.get_index_of` and by extension `GriddedForecast.get_index_of`)
Spotted by running the unit tests for the NZ region (the end-corner-based unit test `test_bin2d_regions_endcorner` failed). Those imprecise bin edges originated in `utils.calc.cleaner_range`. The crucial modification is using `round` instead of `floor`. The other change – replacing the hard-coded `const` with a flexible `scale` parameter – accounts for the decimal places of the bin edges _and_ the stepping (`const = 100000` would lead to imprecise bin edges if `h < 0.00001`). + simplified `core.regions.magnitude_bins()` to just call (this improved) `cleaner_range()`; it was essentially a copy of the former.
Bonus: using `np.searchsorted` or `np.digitize` at the core

| Enhancement | vectorized | looped |
|---|---|---|
| `np.searchsorted` at the core | 52.8 µs | 41.1 ms |
| `np.digitize` at the core | 54.8 µs | 49.9 ms |

👎
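For context, "at the core" means replacing the floor-division index computation with a sorted lookup – roughly like this illustrative sketch (not the benchmarked code; tolerance handling omitted):

import numpy as np

def bin1d_searchsorted(p, bins):
    # index of the rightmost bin whose left edge is <= p
    return np.searchsorted(bins, p, side='right') - 1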
Bonus: using the decimal package
For the sake of curiosity, I also considered the decimal package as suggested by Bill in the original issue (#255 (comment)) (actually way before this PR and the aha-moment in my comment (#255 (comment))).
Here are the necessary changes:
- bins = numpy.array(bins)
- p = numpy.array(p)
+ from decimal import Decimal
+
+ def array2decimal(x):
+     x = x.squeeze()  # omit superfluous dims
+     if x.ndim:
+         return numpy.array(
+             list(
+                 map(Decimal,
+                     numpy.array2string(
+                         x,
+                         max_line_width=numpy.inf, threshold=numpy.inf
+                     )[1:-1].split())
+             )
+         )
+     else:
+         return numpy.array(Decimal(numpy.array2string(x)))
+
+ p = array2decimal(numpy.array(p))
+ bins = array2decimal(numpy.array(bins))
and
- # Deal with border cases
- a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
- h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
- p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
-
- # absolute tolerance
- if tol is None:
-     idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))
- else:
-     idx = numpy.floor((p + (tol + a0_tol) - a0) / (h - h_tol))
- if h < 0:
-     raise ValueError("grid spacing must be positive and monotonically increasing.")
- # account for floating point uncertainties by considering extreme case
+ idx = (p - a0) / h
But the performance is abysmal due to the mapping via strings (that's how the decimal package circumvents numerical precision issues):

| Enhancement | vectorized | looped |
|---|---|---|
| decimal | 17.1 ms | 2.25 s |

So 320x / 50x slower than the current implementation.
Also, it doesn't pass some unit tests.
Additionally, array2string (https://numpy.org/doc/stable/reference/generated/numpy.array2string.html) must be well configured to avoid creating any weird strings from floats or float arrays due to numerical precision (alternatively, one could represent the region's origins/bins directly as Decimals on initialization, but all that is not necessary since we solved the numerical issues above).
👎
Wow, Marcus! Thank you!
@mherrmann3 The new solution looks much simpler and more optimal than the rounding in the issue. Nice work! I'm encouraged by your tests with the reproducibility package. I'm looking into why some of the builds are failing for different OS versions right now. It's failing due to an error in the …
It looks like it is related to the pinned vcrpy installation we are using in the workflows/build-test.yml file. There was a warning in the vcrpy release notes that this version might not play nicely with urllib3. The last successful builds used … It looks like we have some options here: …
@mherrmann3 do you have any suggestions or thoughts on this?
Hey Bill, I'm not experienced with GitHub's workflows, but I'd prefer the simplest option, number 1, in your list. I don't see any reason for the pinning. It seems that @pabloitu simply pinned … So we could unpin it twice in …
Edit: Note that the most recent version … If we go with …
@pabloitu Any thoughts on this?
⚠ Caution, longer treatise ahead! 🤷♂️
1. Solving numerical issues (#255)
I previously proposed a preliminary fix based on rounding the estimated bin size `h`. But there's a more elegant fix in this line.
This change is sufficient because the prior floating-point operation `h = bins[1] - bins[0]` in the previous line carries the unit roundoff error of the involved numbers (and may introduce a further roundoff error). To account for it, the tolerance must be based on machine epsilon (`eps`) and the involved numbers (not the resulting number). In our case, the two involved numbers in calculating `h` are of the same order of magnitude, and one of them is `a0` (i.e., the first bin edge `bin[0] == min(bins)`), for which we already calculate the tolerance (`a0_tol`). So we can use `a0_tol` also as the tolerance for `h`.
Here's an illustration: to account for the floating-point/roundoff errors, `h_tol` must be larger than the imprecision of `h`. Only the new `h_tol` estimation guarantees that, by reusing `a0_tol = abs(a0) * eps`.
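To make this concrete, here is a minimal numerical sketch (hypothetical example values, not the library code):

import numpy as np

eps = np.finfo(np.float64).eps

bins = np.arange(-118.0, -117.0, 0.1)  # bin edges from floating-point arithmetic
a0 = bins[0]                           # first bin edge: -118.0
h = bins[1] - bins[0]                  # estimated bin size: 0.099999999999994316

imprecision = abs(h - 0.1)             # ~5.7e-15: roundoff inherited from the ~1e2-sized edges

h_tol_old = abs(h) * eps               # ~2.2e-17: scaled to h, smaller than the imprecision
h_tol_new = abs(a0) * eps              # ~2.6e-14: scaled to a0, safely covers the imprecision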
2. Further enhancements
2.1 to `bin1d_vec`
Working with and looking at `bin1d_vec` way too often, I noticed that it can be simplified, made more robust/versatile, and (slightly) sped up by addressing various aspects. In the following, I list each incremental enhancement and measure its performance; I committed them separately for convenience and provide related notes in their descriptions.
To quantify the performance speed-up, I timed the self-check over all origins for California, but – to avoid the overhead in `get_index_of` – only for one coordinate component, by directly calling `bin1d_vec` in two ways:

%timeit -r 300 -n 1 for origin in region.origins(): bin1d_vec(origin[0], region.xs)
%timeit -r 6000 -n 100 bin1d_vec(origins_lon, region.xs)

* replace `min(bins)` with `bins[0]`
* use `.asarray()` instead of `.array()`
2.2 other changes
* In `region.CartesianGrid2D.get_index_of()`, I also removed `numpy.array()` (redundant with the first lines in `bin1d_vec`; should slightly speed up spatial binning)
* removed `tol=0.00001` from `bin1d_vec` calls or functions that call it (only used when binning magnitudes), even in the unit tests

3. Extend unit tests
In `test_calc.TestBin1d`:
* add `test_bin1d_single_bin2`, which inputs a single bin and a single point;
* split off a new `test_scalar_inside` from `test_scalar_outside`, which checks for (magnitude) bin edges `5.95, 6.05, ..., 8.95`:
  * all corner cases (*.*5)
  * all bin centers (*.*0)
  * all bin ends (*.*99999999)
  * one large value (10)

Add a separate class `TestBin2d` in `test_region`, which makes more realistic unit tests by extending the self-check mentioned in "Spatial binning doesn't properly account for floating point precision" (#255) and performing it...
* for three regions: Italy, California, NZ (New Zealand);
* for all origins, mid-points, and end-corners (in the bin, at the opposite side of the origin);
* for double precision (float64 / f8) and single precision (float32 / f4) of the points (not the bin edges);
* as a loop (over each point individually) and as a single vector (all points at once).

==> 36 unit test combinations (see the sketch below)

(Albeit targeting `bin1d_vec`, it also unit-tests the region's `CartesianGrid2D.get_index_of` and by extension `GriddedForecast.get_index_of`.)
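For illustration, the self-check idea boils down to something like this sketch (assuming a region exposing `origins()`, `dh`, and `get_index_of`, with `get_index_of` returning indices in the order of `origins()`; not the literal test code):

import numpy as np

def self_check(region, dtype=np.float64):
    # every origin, shifted to the bin mid-point or the in-bin end-corner,
    # must map back to the index of its own cell
    origins = np.array(list(region.origins()), dtype=dtype)
    lons, lats = origins[:, 0], origins[:, 1]
    expected = np.arange(len(origins))
    for offset in (0.0, region.dh / 2, region.dh * 0.999999):
        idx = region.get_index_of(lons + offset, lats + offset)
        assert np.array_equal(np.asarray(idx), expected), f"offset={offset} failed"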
4. Another issue: imprecise bin edge creation
For the NZ region, the end-corner-based unit test (`test_bin2d_regions_endcorner`) failed; inspecting the problematic corner point for the longitude yields bin edges that – oops! – are not rounded to the first decimal digit despite `region.dh == 0.1`. It took me a while to spot the culprit, but this helped: apparently, the underlying `csep.utils.calc.cleaner_range()` is to blame. It turns out that the particularly chosen `const = 100000` leads to weird numerical imprecision (it doesn't happen if it were 10'000 or 1'000'000): the `np.floor(const * start)` operation yields `16569999.0` instead of `16570000.0`.
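A minimal sketch reproducing the imprecision (start = 165.7 follows from the numbers above):

import numpy as np

const, start = 100000, 165.7     # the hard-coded const and an affected bin-edge start
const * start                    # 16569999.999999998 (float64 roundoff)
np.floor(const * start)          # 16569999.0 -> the off-by-one that corrupts the bin edges
np.round(const * start)          # 16570000.0 -> the intended value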
Numpy's user guide suggests … I tried it, but some bin edges still contained `.*99999...`. Instead, I modified `cleaner_range()` only in some other aspects. The crucial modification is using `round` instead of `floor` – closely related to the problem that led me to this PR. The other change – replacing the hard-coded `const` with a flexible `scale` parameter – is supposed to account for the number of decimal digits of the bin edges and the stepping (`const = 100000` would lead to imprecise bin edges if `h < 0.000'01` or if the bin edges start at `*.000'001` – maybe someone needs that in the future 😉).

In the same commit, I also simplified `core.regions.magnitude_bins()` to just call (this improved) `cleaner_range()`; it was essentially a copy of it.

5. Implication on existing test results
As mentioned in my original comment in issue #255, test results produced after this PR will become irreproducible with past results only if the test catalog contains one or more events whose coordinates align with the region's spatial bin edges, e.g., a coarse single-decimal-digit coordinate like 42.3° and a gridding of 0.1°. Otherwise, this issue and PR are irrelevant.
But even if a catalog does contain such events, I don't expect test results to change significantly – perhaps only if all events have coarse locations. Remember that I spotted this whole issue only due to a difference at the 3rd-5th decimal digit for IGPE compared to an independent binning implementation; the test catalog contained five such events (262 events in total).
I still wanted to assess whether this PR leads to some irreproducibility, so I ran our first reproducibility package using the most recent pyCSEP state (v0.6.3 + 22 commits) – once without and once with all changes in this PR. Eventually, they were exactly the same (up to the last [15th] decimal digit).¹ Apparently, it doesn't involve events with coarse locations (this is a good thing!).
Apart from that, I didn't do other comparisons; feel free to suggest some (they should contain events with coarse locations and/or involve the NZ region).
Closes #255
Footnotes
¹ Btw: they were not exactly the same as the expected output, due to some occasional mismatches at the last (15th) decimal digit. These slight mismatches are likely due to using a different platform/OS and/or a different combination of packages (I installed the most up-to-date versions of the packages specified in pyCSEP's `requirements.yml` + adjusted `numpy==1.22.4` and `matplotlib==3.5.2` to get a compatible environment for the newest pyCSEP version). ↩