This repository has been archived by the owner on Aug 31, 2021. It is now read-only.
I am working with a sparse dataset that has many rows and columns:

```
>>> X_train
<1796130x3231961 sparse matrix of type '<type 'numpy.float64'>'
    with 207786451 stored elements in Compressed Sparse Row format>
```
I've started by working with the default params for the MultiClusterIndex creation, and had great luck on slices with a smaller number of rows. For example: a subset of 12,000 rows took less than a minute to index, and a training subset of 300,000 rows took less than 20 minutes to create the MultiClusterIndexes (both of these subsets used all columns).
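For reference, those subset experiments amount to slicing the CSR matrix by rows before indexing. A minimal sketch using a small randomly generated stand-in matrix (the sizes and names here are illustrative, not the real dataset):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small stand-in for the real 1796130x3231961 matrix (hypothetical sizes).
X_train = sparse_random(12000, 50000, density=1e-4, format="csr",
                        dtype=np.float64, random_state=0)
Y_train = np.arange(X_train.shape[0])  # one label per row

# Take a row slice, keeping all columns, as in the experiments above.
subset_X = X_train[:1000]
subset_y = Y_train[:1000]
print(subset_X.shape)  # (1000, 50000)

# Indexing the subset would then look like (pysparnn call as in this report):
# import pysparnn.cluster_index as ci
# cp = ci.MultiClusterIndex(subset_X, subset_y)
```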
When I attempt to run the same command on the entire dataset, it runs for a little over an hour and then throws the following error:

```
>>> cp0 = ci.MultiClusterIndex(X_train, Y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 427, in __init__
    distance_type, matrix_size)))
  File "/Library/Python/2.7/site-packages/pysparnn/cluster_index.py", line 154, in __init__
    records_data[clustr],
IndexError: index 1776193 is out of bounds for axis 1 with size 1776130
```
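One detail that may be worth ruling out first: the axis size in the IndexError (1,776,130) is 20,000 smaller than X_train's row count (1,796,130), which is the kind of gap you would see if Y_train had fewer entries than X_train has rows. A quick sanity check along these lines (the helper name is mine, not part of pysparnn):

```python
import numpy as np
from scipy.sparse import csr_matrix

def check_alignment(features, records_data):
    # pysparnn pairs each feature row with one records_data entry,
    # so the two lengths must match before building the index.
    if features.shape[0] != len(records_data):
        raise ValueError("features has %d rows but records_data has %d "
                         "entries" % (features.shape[0], len(records_data)))

# Tiny illustration: 3 feature rows but only 2 labels.
X = csr_matrix(np.eye(3))
try:
    check_alignment(X, ["a", "b"])
except ValueError as err:
    print(err)  # features has 3 rows but records_data has 2 entries
```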
Do you have any suggestions for resolving this issue, or tweaking the parameters to make this dataset more efficient when creating the MultiClusterIndex?
Thank you in advance.
Update: I ran into the same issue with both Python 3.7 and 2.7 (both raise out-of-bounds exceptions, trying to access different index locations for axis 1).
Another related question: with data of this size, would it be best to set matrix_size manually, so that it is smaller and produces more levels in the tree than the recommended 2 levels? Thanks in advance.
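For a rough sense of the trade-off: if a cluster tree with matrix_size k per node and L levels covers on the order of k**L records, the smallest matrix_size for a target depth can be computed directly. This is my back-of-envelope sketch under that assumption, not pysparnn internals:

```python
import math

def matrix_size_for_levels(num_records, levels):
    # Smallest k with k**levels >= num_records.
    k = int(math.ceil(num_records ** (1.0 / levels)))
    while k ** levels < num_records:  # guard against float rounding
        k += 1
    return k

n = 1796130  # row count from the dataset above
print(matrix_size_for_levels(n, 2))  # 1341 -> a 2-level tree
print(matrix_size_for_levels(n, 3))  # 122  -> a smaller per-node size, deeper tree
```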