
pykdtree breaks down for very large number of data points #38

Open
alejandrobll opened this issue Oct 22, 2018 · 1 comment · May be fixed by #134

Comments


alejandrobll commented Oct 22, 2018

I have been using pykdtree to obtain nearest neighbours and it seems that it breaks down for a very large dataset. I managed to reproduce the problem with the following example:

from pykdtree.kdtree import KDTree
import numpy as np

pos = np.random.rand(int(5e8),3)
nb = 32
tree = KDTree(pos)
d, idx = tree.query(pos, k=nb)
h = d[:,nb-1]
print(np.min(h))

Running the previous code shows that the minimum distance to the 32nd nearest neighbour over all particles is zero, which is incorrect and indeed very unlikely. It turns out that zero is assigned not just to one particle: in fact, the distance is zero for a large fraction of the particles. Doing

import numpy as np

k, = np.where(h == 0)
print(len(k))

returns 365782272, i.e. the distance is 0 for ~73% of the whole sample. This is clearly the wrong answer.

I discovered the problem when using py-sphviewer, which relies on pykdtree to find the smoothing length of particles in cosmological simulations. When the number of particles within the simulated volume is very large (several hundred million), pykdtree assigns a wrong distance of 0 between individual particles and their 32nd neighbours.

Any idea what might be causing this weird behaviour? I also checked with both single and double precision.
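(As a possible stop-gap while the root cause is investigated: one workaround is to issue the query in chunks, so that each call's result buffer stays well below 2**32 - 1 entries. This is a sketch, not a verified fix; the `query_in_chunks` helper and the chunk size are my own, and `query` is assumed to behave like pykdtree's `KDTree.query`, returning a `(distances, indices)` pair.)

```python
import numpy as np

def query_in_chunks(query, pos, k, chunk_size=10_000_000):
    """Run query(points, k=k) on slices of pos, keeping each call's
    result buffer (chunk_size * k entries) far below 2**32 - 1.

    query is assumed to return (distances, indices) like
    pykdtree's KDTree.query (e.g. pass tree.query here)."""
    n = pos.shape[0]
    d = np.empty((n, k), dtype=np.float64)
    idx = np.empty((n, k), dtype=np.uint32)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        d[start:stop], idx[start:stop] = query(pos[start:stop], k=k)
    return d, idx
```

With pykdtree this would be called as `query_in_chunks(tree.query, pos, k=32)`; the per-call overhead is small as long as the chunks stay large.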

@storpipfugl
Owner

It's a pointer-arithmetic overflow problem: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c#L1410

I'll look into giving pykdtree an overhaul to support contemporary data set sizes.
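(To illustrate how the index arithmetic goes wrong: with the numbers from the report above, 5e8 points and k=32, the flat offset into the result buffer exceeds what a 32-bit integer can hold and wraps negative. A minimal sketch in pure Python, emulating C's signed 32-bit wraparound:)

```python
def to_int32(x):
    """Emulate signed 32-bit integer wraparound, as in C."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

num_points = 500_000_000      # 5e8 query points, as in the report
k = 32
offset = num_points * k       # 16_000_000_000, far beyond 2**31 - 1

print(offset)                 # 16000000000
print(to_int32(offset))      # -1179869184: a negative, wrapped buffer offset
```

A negative (or merely wrapped) offset silently indexes the wrong slot in the result buffer, which is consistent with the spurious zero distances reported above.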

lkeegan added a commit to lkeegan/pykdtree that referenced this issue Jan 16, 2025
- replace all 32-bit integers with 64-bit integers
- previously the results would silently be incorrect if `num_points * k > 2^32-1`
  - caused by integer overflow (see storpipfugl#38)
  - for e.g. k=20 this occurred for ~200 million points (~a few GB of RAM)
- now the results are correct for any (practical) number of points
  - overflow would now only occur for k=20 with ~10^17 points (> trillions of GB of RAM)
  - resolves storpipfugl#38
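(The thresholds quoted in the commit message can be checked with a little arithmetic; the exact overflow point depends on how the C code forms the index, so treat these as order-of-magnitude figures:)

```python
k = 20

# Maximum point count before num_points * k wraps a 32-bit index:
int32_limit = (2**32 - 1) // k
print(int32_limit)   # 214748364 -> ~200 million points, as stated

# Same limit with signed 64-bit indices:
int64_limit = (2**63 - 1) // k
print(int64_limit)   # ~4.6e17 points, matching the "~10^17" figure
```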
@lkeegan lkeegan linked a pull request Jan 16, 2025 that will close this issue