This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Index out of Bounds ERROR during Large, Sparse MultiClusterIndex Creation #20

Open
kalbmj opened this issue Jul 20, 2018 · 1 comment

Comments


kalbmj commented Jul 20, 2018

Hello,

I am working with a sparse dataset that has a large number of rows and columns:
>>> X_train
<1796130x3231961 sparse matrix of type '<type 'numpy.float64'>' with 207786451 stored elements in Compressed Sparse Row format>

I started with the default parameters for MultiClusterIndex creation and had good results on slices with fewer rows. For example, a subset of 12,000 rows took less than a minute to index, and a training subset of 300,000 rows took less than 20 minutes (both subsets used all columns).

When I attempt to run the same command on the entire dataset, it runs for a little over an hour and then throws the following error:
cp0 = ci.MultiClusterIndex(X_train, Y_train)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 427, in __init__
    distance_type, matrix_size)))
  File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 154, in __init__
    records_data[clustr],
IndexError: index 1776193 is out of bounds for axis 1 with size 1776130
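
One thing that may be worth ruling out: the error reports a size of 1776130 on the indexed axis, while X_train is shown above with 1796130 rows. If those figures are both accurate, the features and the records passed to the index could differ in length. The sketch below is a plain-Python sanity check that could be run before building the index. `check_lengths` is a hypothetical helper, not part of pysparnn, and the mismatch is only an assumption about the cause.

```python
def check_lengths(n_feature_rows, n_records):
    """Return None when the counts match, otherwise a message describing the gap.

    Hypothetical helper (not a pysparnn function). Pass it the number of rows
    in the feature matrix and the number of entries in the records.
    """
    if n_feature_rows == n_records:
        return None
    return "features have %d rows but records have %d entries (difference %d)" % (
        n_feature_rows,
        n_records,
        abs(n_feature_rows - n_records),
    )


# Figures taken from the matrix repr and the IndexError message above.
message = check_lengths(1796130, 1776130)
if message is not None:
    print(message)
```

In practice the two arguments would come from something like `X_train.shape[0]` and `len(Y_train)`, assuming Y_train is a flat sequence.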

Do you have any suggestions for resolving this issue, or tweaking the parameters to make this dataset more efficient when creating the MultiClusterIndex?

Thank you in advance.

Update: I ran into the same issue on both Python 2.7 and 3.7. Each raises an out-of-bounds IndexError on axis 1, though at a different index.


kalbmj commented Jul 21, 2018

Another related question: with data of this size, would it be better to set matrix_size manually to something smaller, so that the tree has more levels than the recommended 2? Thanks in advance.
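
To make the matrix_size question concrete, here is a rough sketch of how tree depth might relate to that parameter. It assumes a balanced tree in which each node holds about matrix_size children, so depth grows as log(N) / log(matrix_size). That relationship is an assumption for illustration, and `approx_levels` is a hypothetical helper, not pysparnn's own calculation.

```python
import math


def approx_levels(num_records, matrix_size):
    """Estimate tree depth for a balanced tree with ~matrix_size children per node.

    Illustrative assumption only; the library may compute depth differently.
    """
    if num_records < 1 or matrix_size < 2:
        raise ValueError("need num_records >= 1 and matrix_size >= 2")
    return math.log(num_records) / math.log(matrix_size)


# Compare a few candidate sizes for the row count reported in this issue.
for size in (2000, 1000, 200):
    print(size, round(approx_levels(1796130, size), 2))
```

Under this assumption, a smaller matrix_size gives a deeper tree with smaller matrices at each node, which is the trade-off being asked about.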
