Fix and optimize bin1d_vec + extend unit tests + fix imprecise bin edge creation #270

Open · wants to merge 9 commits into main

Conversation

@mherrmann3 (Contributor) commented Feb 6, 2025

⚠ Caution, longer treatise ahead! 🤷‍♂️

1. Solving numerical issues (#255)

I previously proposed a preliminary fix based on rounding the estimated bin size h. But there's a more elegant fix:

-    h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
+    h_tol = a0_tol

in this line.

This change is sufficient because the prior floating-point operation h = bins[1] - bins[0] in the previous line carries the unit roundoff error of the numbers involved (and may introduce a further roundoff error). To account for it, the tolerance must be based on machine epsilon and the involved numbers, not the resulting number. In our case, the two numbers involved in calculating h are of the same order of magnitude, and one of them is a0 (i.e., the first bin edge bins[0] == min(bins)), for which we already calculate the tolerance (a0_tol). So we can reuse a0_tol as the tolerance for h.

Here's an illustration:

       bin1:   31.50000000000000000 (a0)
       bin2:   31.60000000000000142

          h:    0.10000000000000142 (bin2 - a0)
imprecision:    0.00000000000000142 (|h - 0.1|)

        eps:    0.00000000000000022 (FYI)
    |h|*eps:    0.00000000000000002 (previous h_tol)

   |a0|*eps:    0.00000000000000699 (new h_tol)

To account for the floating-point/roundoff errors, h_tol must be larger than the imprecision of h.
Only the new h_tol estimation guarantees that by reusing a0_tol = abs(a0) * eps.
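
As a quick sanity check, here is a small stand-alone numpy snippet (not pyCSEP code; it merely reconstructs the illustration above, assuming the second edge arises as 31.5 + 0.1) showing that only the new tolerance exceeds the imprecision of h:

```python
import numpy

eps = numpy.finfo(numpy.float64).eps
bins = numpy.array([31.5, 31.5 + 0.1])   # second edge carries roundoff: 31.60000000000000142
a0, h = bins[0], bins[1] - bins[0]       # h = 0.10000000000000142, not exactly 0.1

imprecision = abs(h - 0.1)               # ~1.42e-15
old_h_tol = abs(h) * eps                 # ~2.2e-17 -> smaller than the imprecision
new_h_tol = abs(a0) * eps                # ~7.0e-15 -> a0_tol, larger than the imprecision

print(old_h_tol < imprecision < new_h_tol)   # True
```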

FYI: accounting for h_tol would not be necessary if the actual bin size h is accurately specified (e.g., 0.1), i.e., not based on a prior operation. CartesianGrid already stores it as meta-data .dh, which is exactly 0.1 for all currently implemented regions. bin1d_vec could be adapted to receive this bin size as an additional argument. But bin1d_vec is also used for binning magnitudes, for which the bin size is not explicitly stored in the region. Of course we could change that and pass it to bin1d_vec, but I believe it's not necessary (for now).

2. Further enhancements

2.1 to bin1d_vec

Working with and looking at bin1d_vec way too often, I noticed that it can be simplified, made more robust/versatile, and slightly sped up by addressing various aspects. In the following, I list each incremental enhancement and measure its performance; I committed them separately for convenience and provide related notes in their commit descriptions.

To quantify the performance speed-up, I timed the self-check over all origins for California – but, to avoid the overhead of get_index_of, only for one coordinate component – by directly calling bin1d_vec in two ways:

  1. looped (7682 individual calls): %timeit -r 300 -n 1 for origin in region.origins(): bin1d_vec(origin[0], region.xs)
  2. vectorized (vector with 7682 values), which matters much more in practice: %timeit -r 6000 -n 100 bin1d_vec(origins_lon, region.xs)
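
(For reference, origins_lon above is simply the vector of all origin longitudes, roughly:)

```python
import numpy
origins_lon = numpy.array([origin[0] for origin in region.origins()])
```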

| Enhancement | vectorized | looped |
| --- | --- | --- |
| Only the bare fix | 23.8 µs | 36.3 ms |
| 1. Reorganize + make tolerance depend on dtype | 24.1 µs | 37.7 ms |
| 2. Replace .min(bins) with bins[0] | 22.2 µs | 26.3 ms |
| 3. Use .asarray() instead of .array() | 20.7 µs | 24.8 ms |
| 4. Avoid try/except by always operating on arrays | 20.8 µs | 31.4 ms |

Notes about bin1d_vec (for those interested):

  • As mentioned in bin1d_vec's docstring, "bins [...] must be monotonically increasing", because the implemented idx calculation assumes/requires sorted bins (for this reason, there is no need for .min()). This is not checked, presumably because pyCSEP itself never passes unsorted bins to this method (they always originate from ranges). Nevertheless, I added a check at the top, but commented it out; we may activate it in the future if there is a need – for now, it only serves as an additional hint.
  • for passing the unit tests mentioned below, the tolerances that need to be added in the idx calculation depend on the dtype of p:
    • float64 needs neither a0_tol nor p_tol;
    • float32 doesn't need a0_tol, but does need p_tol (since p is float32);
    • --> So a0_tol could be omitted, but to avoid unforeseeable future oddities, let's keep it anyway (and because adding it theoretically makes sense).
  • the grouping and order of terms in the idx calculation influence the roundoff error, see What Every Computer Scientist Should Know About Floating-Point Arithmetic. Illustration:
 40.40000000000000568 (p)
 31.50000000000000000 (a0)
  0.00000000000003728 (p_tol)
  0.00000000000003679 (a0_tol)

  8.90000000000000568 (p - a0)
  0.00000000000007407 (p_tol + a0_tol)

  8.90000000000007674  <--  p + (p_tol + a0_tol) - a0  [ORIGINAL]
  8.90000000000007674  <--  p + p_tol + a0_tol - a0
  8.90000000000008029  <--  p - a0 + (p_tol + a0_tol)
  8.90000000000008029  <--  p - a0 + p_tol + a0_tol    [NEW]
  8.90000000000007851  <--  p + p_tol - a0 + a0_tol

Only the 3rd and 4th options are closest to (and occasionally exactly equal to) the actual sum (and are largest). In the first enhancement step, I used the 4th option.
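
Putting these pieces together, the revised core logic looks roughly like the following hedged sketch (simplified; it is not the exact code of this PR, and names and edge-case handling may differ). The numbered comments map to the enhancement steps in the table above:

```python
import numpy

def bin1d_vec_sketch(p, bins, tol=None, right_continuous=False):
    """Hedged sketch of the revised binning core (not the exact PR code)."""
    bins = numpy.asarray(bins)              # 3. asarray: no copy if input is already an array
    p = numpy.atleast_1d(numpy.asarray(p))  # 4. always operate on arrays (no try/except)
    a0 = bins[0]                            # 2. bins are sorted, so bins[0] == min(bins)
    h = bins[1] - bins[0] if bins.size > 1 else 1.0
    if h < 0:
        raise ValueError("grid spacing must be positive and monotonically increasing.")

    eps = numpy.finfo(numpy.float64).eps
    a0_tol = numpy.abs(a0) * eps
    h_tol = a0_tol                          # the fix from section 1
    p_eps = numpy.finfo(p.dtype).eps if numpy.issubdtype(p.dtype, numpy.floating) else eps
    p_tol = numpy.abs(p) * p_eps if tol is None else tol   # 1. tolerance depends on dtype of p

    # group terms as (p - a0) + tolerances: option 4 of the illustration above
    idx = numpy.floor((p - a0 + p_tol + a0_tol) / (h - h_tol)).astype(numpy.int64)
    idx[p < a0 - a0_tol] = -1               # left of the first bin edge -> outside
    if not right_continuous:
        idx[idx >= bins.size] = -1          # beyond the last bin -> outside
    return idx
```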

2.2 other changes


Notes (for those interested):

  • specifying tol=0.00001 was not even necessary with the tolerance section as revised by commit 949d342c (albeit most of the tol=0.00001 arguments were introduced by that very commit);
    • Why? Because magnitude bins are "cleanly" created with np.arange() in csep.core.regions.magnitude_bins;
    • their redundancy is confirmed by the extended unit test test_scalar_inside (see below) passing without tol=0.00001 even with the unfixed bin1d_vec;
    • yet, I left the optional tol=None argument in bin1d_vec and the functions that call it – just in case there is a real need to override the tolerance.

3. Extend unit tests

  • in test_calc.TestBin1d:
    • add test_bin1d_single_bin2, which inputs a single bin and a single point;
    • split off a new test_scalar_inside from test_scalar_outside, which checks, for the (magnitude) bin edges 5.95, 6.05, ..., 8.95:
      • all corner cases (*.*5)
      • all bin centers (*.*0)
      • all bin ends (*.*99999999)
      • one large value (10)
  • add a separate class TestBin2d in test_region, which provides more realistic unit tests by extending the self-check mentioned in Spatial binning doesn't properly account for floating point precision #255 (a minimal sketch follows the notes below) and performing it...
    • for three regions: Italy, California, NZ (New Zealand);
    • for all origins, mid-points, and end-corners (in the bin, at the opposite side of the origin);
    • for double precision (float64 / f8) and single precision (float32 / f4) of the points (not the bins);
    • as a loop (over each point individually) and as a single vector (all points at once);
    • ==> 36 unit test combinations
    • (albeit targeting bin1d_vec, it also unit-tests the region's CartesianGrid2D.get_index_of and, by extension, GriddedForecast.get_index_of)

Notes (for those interested):

  • the float32-based unit tests only pass after the first enhancement in the previous section;
    • (float32 is not a use case when using pyCSEP's implemented readers normally – read values will eventually be converted to float64 here by using these specified dtypes – but it could happen with a custom loader function and the intention to save some memory; note that the reader module still contains some (incorrect) mentions of float32/f4 in its docstrings).
  • the unit tests for the NZ region only pass after addressing another issue in the next section.
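
For illustration, the self-check behind TestBin2d boils down to something like the following minimal sketch (plain numpy plus the pyCSEP calls already used above; the actual test class parametrizes regions, point types, dtypes, and loop vs. vector mode):

```python
import numpy
from csep.core.regions import california_relm_region
from csep.utils.calc import bin1d_vec

region = california_relm_region()
xs, dh = region.xs, region.dh            # longitude bin edges and nominal bin size (0.1)
expected = numpy.arange(len(xs))

# every origin (left bin edge) must fall back into its own bin ...
numpy.testing.assert_array_equal(bin1d_vec(xs, xs), expected)
# ... and so must the bin mid-points, also when given in single precision
mids = xs + dh / 2
numpy.testing.assert_array_equal(bin1d_vec(mids, xs), expected)
numpy.testing.assert_array_equal(bin1d_vec(mids.astype(numpy.float32), xs), expected)
```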

4. Another issue: imprecise bin edge creation

For the NZ region, the end-corner-based unit test (test_bin2d_regions_endcorner) failed; inspecting the problematic corner point for the Longitude yields:

      origin: 167.79999999999998
           p: 167.89998999999997
          a0: 165.69999
           h: 0.09999999999999432
      p - a0: 2.19999999999996

It took me a while to spot the culprit; but this helped:

region.xs: [165.69999, 165.79999, 165.89999, 165.99999, 166.09999, 166.19999, 166.29999, 166.39999, ...]
dh: 0.1

Oops, they are not rounded to the first decimal digit despite region.dh == 0.1!

FYI, the chain leading to the failed test:

  1. improperly rounded .xs (.*9999)
  2. ... causes a likewise imprecise a0
  3. ... which causes (p - a0 + p_tol + a0_tol) / (h - h_tol) to be just above 22 (22.000099999901586), instead of being just below (21.99999999990184)
  4. ... which eventually nudges idx into the next bin

Apparently, the underlying csep.utils.calc.cleaner_range() is the culprit. It turns out that the particular choice const = 100000 leads to weird numerical imprecision (it doesn't happen if it were 10'000 or 1'000'000):

165.7 * 100000  = 16569999.999999998 <-- the chosen `const`
165.7 * 10000   = 1657000.0
165.7 * 1000000 = 165700000.0

Then the np.floor(const * start) operation yields 16569999.0 instead of 16570000.0.

Numpy's user guide suggests:

Use numpy.linspace if you want the endpoint to be included in the result, or if you are using a non-integer step size.

I tried it, but some bin edges still contained .*99999....

Instead, I modified cleaner_range() only in some other respects. The crucial modification is using round instead of floor – closely related to the problems that led me to this PR. The other change – replacing the hard-coded const with a flexible scale parameter – accounts for the number of decimal digits of the bin edges and the stepping (const = 100000 would lead to imprecise bin edges if h < 0.000'01 or if the bin edges start at *.000'001 – maybe someone needs that in the future 😉).
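
A minimal sketch of that idea (not the exact code in this PR; the helper name and the scale parameter shown here are just for illustration):

```python
import numpy

def cleaner_range_sketch(start, end, h, scale=100000):
    """Build bin edges on an integer grid and scale back; round() avoids the
    floor() truncation shown above (165.7 * 100000 -> 16569999.999999998)."""
    start = numpy.round(scale * start)
    end = numpy.round(scale * end)
    step = numpy.round(scale * h)
    return numpy.arange(start, end + step / 2, step) / scale

# e.g. cleaner_range_sketch(165.69999, 166.09999, 0.1)
# -> array([165.69999, 165.79999, 165.89999, 165.99999, 166.09999])
```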

Note (for those interested):
This sub-issue apparently arises because nz_csep_region() uses numpy.loadtxt() instead of parse_csep_template() (used for reading the California and Italy CSEP regions), which relies on float() from the Python standard library. So this additional fix makes bin creation more robust to input produced with different reader functions.

In the same commit, I also simplified core.regions.magnitude_bins() to just call (this improved) cleaner_range(); it was essentially a copy of it.

5. Implication on existing test results

As mentioned in my original comment on issue #255, test results produced after this PR will differ from past results only if the test catalog contains one or more events whose coordinates align with the region's spatial bin edges, e.g., a coarse single-decimal-digit coordinate like 42.3° and a gridding of 0.1°. Otherwise, this issue and PR are irrelevant.

But even if a catalog does contain such events, I don't expect test results to change significantly – perhaps only if all events have coarse locations. Remember that I spotted this whole issue only due to a difference at the 3rd–5th decimal digit of the IGPE compared to an independent binning implementation; the test catalog contained five such events (out of 262 events in total).

I still wanted to assess whether this PR leads to some irreproducibility, so I ran our first reproducibility package using the most recent pyCSEP state (v0.6.3 + 22 commits) – once without and once with all changes in this PR. The results were exactly the same (up to the last [15th] decimal digit).1 Apparently, the package doesn't involve events with coarse locations (this is a good thing!).

Apart from that, I didn't do other comparisons; feel free to suggest some (they should contain events with coarse locations and/or involve the NZ region).


Closes #255

Footnotes

  1. Btw: they were not exactly the same as the expected output due to some occasional mismatches at the last (15th) decimal digit. These slight mismatches are likely due to using a different platform/OS and/or a different combination of packages (I installed the most up-to-date versions of packages specified in pyCSEP's requirements.yml + adjusted numpy==1.22.4 and matplotlib==3.5.2 to get a compatible environment for the newest pyCSEP version).

@mherrmann3 (Contributor, Author) commented:

Bonus: using np.searchsorted/np.digitize at the core

I tried a completely different approach for calculating idx using numpy's arsenal of convenient functions; appropriate candidates are np.searchsorted and np.digitize.

The code changes would be:

- a0_tol = _get_tolerance(a0)
+ a0_tol = _get_tolerance(bins[-1])  # the last element is largest and gives the largest tolerance
- h_tol = a0_tol  # must be based on *involved* numbers
  p_tol = tol or _get_tolerance(p)

- idx = numpy.floor((p + p_tol - a0 + a0_tol) / (h - h_tol))
- idx = numpy.asarray(idx)  # assure idx is an array
+ if not right_continuous:
+     h_tol = a0_tol
+     p[p + p_tol >= bins[-1] - a0_tol + h - h_tol] = -1  # mark too far right points as 'outside'
+ idx = np.searchsorted(bins - a0_tol, p + p_tol, side='right') - 1
or
+ idx = np.digitize(p + p_tol, bins - a0_tol) - 1

(...)

- idx = idx.astype(numpy.int64)

i.e., no need for h_tol. Despite passing all unit tests, their performance – surprisingly – doesn't hold up against our custom implementation:

| Enhancement | vectorized | looped |
| --- | --- | --- |
| np.searchsorted at the core | 52.8 µs | 41.1 ms |
| np.digitize at the core | 54.8 µs | 49.9 ms |

👎

@mherrmann3 (Contributor, Author) commented Feb 6, 2025

Bonus: using decimal package

For the sake of curiosity, I also considered the decimal package, as suggested by Bill in the original issue (actually way before this PR and the aha-moment in my comment).
Here are the necessary changes:

-    bins = numpy.array(bins)
-    p = numpy.array(p)
+    from decimal import Decimal
+
+    def array2decimal(x):
+        x = x.squeeze()  # omit superfluous dims
+        if x.ndim:
+            return numpy.array(
+                list(
+                    map(Decimal,
+                        numpy.array2string(
+                            x,
+                            max_line_width=numpy.inf, threshold=numpy.inf
+                        )[1:-1].split())
+                    )
+            )
+        else:
+            return numpy.array(Decimal(numpy.array2string(x)))
+
+    p = array2decimal(numpy.array(p))
+    bins = array2decimal(numpy.array(bins))

and

-    # Deal with border cases
-    a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
-    h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
-    p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
-    
-    # absolute tolerance
-    if tol is None:
-        idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))
-    else:
-        idx = numpy.floor((p + (tol + a0_tol) - a0) / (h - h_tol))
-    if h < 0:
-        raise ValueError("grid spacing must be positive and monotonically increasing.")
-    # account for floating point uncertainties by considering extreme case
+    idx = (p - a0) / h

But the performance is abysmal due to mapping via strings (that's how the decimal package works to circumvent numerical precision issues):

| Enhancement | vectorized | looped |
| --- | --- | --- |
| decimal | 17.1 ms | 2.25 s |

So 320x / 50x slower than the current implementation.

Also, it doesn't pass some unit tests.

Additionally, array2string must be configured carefully to avoid creating weird strings from floats or float arrays due to numerical precision (alternatively, one could represent the region's origins/bins directly as Decimals on initialization, but none of that is necessary since we solved the numerical issues above).

👎

mherrmann3 changed the title from "Fix spatial binning precision + extend unit tests + fix imprecise bin edge creation" to "Fix and optimize bin1d_vec + extend unit tests + fix imprecise bin edge creation" on Feb 6, 2025
@mjw98 commented Feb 6, 2025 via email

@wsavran (Collaborator) commented Feb 11, 2025

@mherrmann3 The new solution looks much simpler and better than the rounding proposed in the issue. Nice work! I'm encouraged by your tests with the reproducibility package. I'm looking into why some of the builds are failing for different OS versions right now. They're failing due to an error in the comcat package, but I haven't tracked down why the vcr package throws errors in some versions but not others.

@wsavran (Collaborator) commented Feb 11, 2025

It looks like it is related to the pinned vcrpy installation we are using in the workflows/build-test.yml file:
pip install vcrpy==4.3.1 pytest pytest-cov

There was a warning in the release notes for vcrpy that this version might not play nicely with urllib3. The last successful builds used urllib3==2.2.3, and the failing builds in this PR are using urllib3==2.3.0 for Python >= 3.10.

It looks like we have some options here:

  1. remove the pinned version from .github/workflows/build-test.yml
  2. create a requirements-test.yml that includes sphinx and vcrpy (to mimic the requirements-dev.txt) and use this in the workflow
  3. add vcrpy, pytest, and pytest-cov to the create-args directive under jobs >> steps in .github/workflows/build-test.yml.

@mherrmann3 do you have any suggestions or thoughts on this?

@mherrmann3 (Contributor, Author) commented Feb 13, 2025

Hey Bill, I'm not experienced with GitHub's workflows, but I'd prefer the simplest option in your list, number 1.

I don't see any reason for the pinning. It seems that @pabloitu simply pinned vcrpy to 4.3.1 because it was the most recent version back then (and then re-used this part also for build_pip).

So we could unpin it in both places in .github/workflows/build-test.yml. Will you go ahead?

Edit:

Note that the most recent version, v7.0.0, drops support for Python 3.8. So if we want to keep 3.8, we must pin to v6.0.2; safest would be v5.0.0, since v6.0.0 also introduces some breaking changes whose consequences I cannot anticipate.

If we go with v6.0.2 or no pinning, we could also try adding Python 3.12 to the automatic builds.

@wsavran (Collaborator) commented Feb 13, 2025

@pabloitu Any thoughts on this?
