This repository has been archived by the owner on Aug 31, 2021. It is now read-only.
I am working with a sparse dataset that has many rows and columns:

```
>>> X_train
<1796130x3231961 sparse matrix of type '<type 'numpy.float64'>'
    with 207786451 stored elements in Compressed Sparse Row format>
```
I've started by working with the default params for the MultiClusterIndex creation, and had great luck on slices with a smaller number of rows. For example: a subset of 12,000 rows took less than a minute to index, and a training subset of 300,000 rows took less than 20 minutes to create the MultiClusterIndexes (both of these subsets used all columns).
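For reference, those subset experiments amount to slicing the CSR matrix by rows before indexing. A minimal sketch using a small randomly generated stand-in matrix (the sizes and names here are illustrative, not the real dataset):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small stand-in for the real 1796130x3231961 matrix (hypothetical sizes).
X_train = sparse_random(12000, 50000, density=1e-4, format="csr",
                        dtype=np.float64, random_state=0)
Y_train = np.arange(X_train.shape[0])  # one label per row

# Take a row slice, keeping all columns, as in the experiments above.
subset_X = X_train[:1000]
subset_y = Y_train[:1000]
print(subset_X.shape)  # (1000, 50000)

# Indexing the subset would then look like (pysparnn call as in this report):
# import pysparnn.cluster_index as ci
# cp = ci.MultiClusterIndex(subset_X, subset_y)
```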
When I attempt to run the same command on the entire dataset, it runs for a little over an hour and then throws the following error:

```
>>> cp0 = ci.MultiClusterIndex(X_train, Y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 427, in __init__
    distance_type, matrix_size)))
  File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 154, in __init__
    records_data[clustr],
IndexError: index 1776193 is out of bounds for axis 1 with size 1776130
```
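One detail that may be worth ruling out first: the axis size in the IndexError (1,776,130) is 20,000 smaller than X_train's row count (1,796,130), which is the kind of gap you would see if Y_train had fewer entries than X_train has rows. A quick sanity check along these lines (the helper name is mine, not part of pysparnn):

```python
import numpy as np
from scipy.sparse import csr_matrix

def check_alignment(features, records_data):
    # pysparnn pairs each feature row with one records_data entry,
    # so the two lengths must match before building the index.
    if features.shape[0] != len(records_data):
        raise ValueError("features has %d rows but records_data has %d "
                         "entries" % (features.shape[0], len(records_data)))

# Tiny illustration: 3 feature rows but only 2 labels.
X = csr_matrix(np.eye(3))
try:
    check_alignment(X, ["a", "b"])
except ValueError as err:
    print(err)  # features has 3 rows but records_data has 2 entries
```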
Do you have any suggestions for resolving this issue, or tweaking the parameters to make this dataset more efficient when creating the MultiClusterIndex?
Thank you in advance.
Update: I ran into the same issue with both Python 3.7 and 2.7 (both raise out-of-bounds exceptions, trying to access different index locations for axis 1).
Another related question: with data of this size, would it be best to set matrix_size manually, so that it is smaller and produces more levels in the tree than the recommended 2 levels? Thanks in advance.
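For a rough sense of the trade-off: if a cluster tree with matrix_size k per node and L levels covers on the order of k**L records, the smallest matrix_size for a target depth can be computed directly. This is my back-of-envelope sketch under that assumption, not pysparnn internals:

```python
import math

def matrix_size_for_levels(num_records, levels):
    # Smallest k with k**levels >= num_records.
    k = int(math.ceil(num_records ** (1.0 / levels)))
    while k ** levels < num_records:  # guard against float rounding
        k += 1
    return k

n = 1796130  # row count from the dataset above
print(matrix_size_for_levels(n, 2))  # 1341 -> a 2-level tree
print(matrix_size_for_levels(n, 3))  # 122  -> a smaller per-node size, deeper tree
```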