
pykdtree breaks down for very large number of data points #38

Open
alejandrobll opened this issue Oct 22, 2018 · 1 comment · May be fixed by #134

Comments


alejandrobll commented Oct 22, 2018

I have been using pykdtree to obtain nearest neighbours and it seems that it breaks down for a very large dataset. I managed to reproduce the problem with the following example:

from pykdtree.kdtree import KDTree
import numpy as np

pos = np.random.rand(int(5e8),3)
nb = 32
tree = KDTree(pos)
d, idx = tree.query(pos, k=nb)
h = d[:,nb-1]
print(np.min(h))

Running the previous code shows that the minimum distance to the 32nd nearest neighbour over all particles is zero, which is incorrect and indeed very unlikely. It turns out that zero is assigned not just to one particle: in fact, the distance is zero for a large fraction of the particles. Doing

import numpy as np

k, = np.where(h == 0)
print(len(k))

returns 365782272, i.e. the distance is 0 for ~73% of the whole sample. This is clearly the wrong answer.

I discovered the problem when using py-sphviewer, which relies on pykdtree to find the smoothing length of particles in cosmological simulations. When the number of particles within the simulated volume is very large (several hundred million), pykdtree assigns a wrong distance of 0 between individual particles and their 32nd neighbours.

Any idea what might be causing this weird behaviour? I also checked with both single and double precision.
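(As a possible stop-gap while the root cause is investigated: one workaround is to issue the query in chunks, so that each call's result buffer stays well below 2**32 - 1 entries. This is a sketch, not a verified fix; the `query_in_chunks` helper and the chunk size are my own, and `query` is assumed to behave like pykdtree's `KDTree.query`, returning a `(distances, indices)` pair.)

```python
import numpy as np

def query_in_chunks(query, pos, k, chunk_size=10_000_000):
    """Run query(points, k=k) on slices of pos, keeping each call's
    result buffer (chunk_size * k entries) far below 2**32 - 1.

    query is assumed to return (distances, indices) like
    pykdtree's KDTree.query (e.g. pass tree.query here)."""
    n = pos.shape[0]
    d = np.empty((n, k), dtype=np.float64)
    idx = np.empty((n, k), dtype=np.uint32)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        d[start:stop], idx[start:stop] = query(pos[start:stop], k=k)
    return d, idx
```

With pykdtree this would be called as `query_in_chunks(tree.query, pos, k=32)`; the per-call overhead is small as long as the chunks stay large.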

@storpipfugl
Owner

It's a pointer-arithmetic overflow problem: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c#L1410

I'll look into giving pykdtree an overhaul to support contemporary data set sizes.
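(To illustrate how the index arithmetic goes wrong: with the numbers from the report above, 5e8 points and k=32, the flat offset into the result buffer exceeds what a 32-bit integer can hold and wraps negative. A minimal sketch in pure Python, emulating C's signed 32-bit wraparound:)

```python
def to_int32(x):
    """Emulate signed 32-bit integer wraparound, as in C."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

num_points = 500_000_000      # 5e8 query points, as in the report
k = 32
offset = num_points * k       # 16_000_000_000, far beyond 2**31 - 1

print(offset)                 # 16000000000
print(to_int32(offset))      # -1179869184: a negative, wrapped buffer offset
```

A negative (or merely wrapped) offset silently indexes the wrong slot in the result buffer, which is consistent with the spurious zero distances reported above.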

lkeegan added a commit to lkeegan/pykdtree that referenced this issue Jan 16, 2025
- replace all 32-bit integers with 64-bit integers
- previously the results would silently be incorrect if `num_points * k > 2^32-1`
  - caused by integer overflow (see storpipfugl#38)
  - for e.g. k=20 this occurred for ~200 million points (~a few GB of RAM)
- now the results are correct for any (practical) number of points
  - overflow would now only occur for k=20 with ~10^17 points (> trillions of GB of RAM)
  - resolves storpipfugl#38
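(The thresholds quoted in the commit message can be checked with a little arithmetic; the exact overflow point depends on how the C code forms the index, so treat these as order-of-magnitude figures:)

```python
k = 20

# Maximum point count before num_points * k wraps a 32-bit index:
int32_limit = (2**32 - 1) // k
print(int32_limit)   # 214748364 -> ~200 million points, as stated

# Same limit with signed 64-bit indices:
int64_limit = (2**63 - 1) // k
print(int64_limit)   # ~4.6e17 points, matching the "~10^17" figure
```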
@lkeegan lkeegan linked a pull request Jan 16, 2025 that will close this issue