Trying to understand Cluster chain shape using hybrid indexer #249
Comments
Hi @preetbawa,
A lot of these 701 clusters might be singleton clusters. The way cluster_chain is set up, if a branch in the tree-based index ends up with just 1 node at level k while the tree depth is d, that singleton branch is extended to depth d by creating (d - k) dummy internal nodes and chaining them together. You can check how many of these 701 clusters are singletons by looking at the columns of the 2076 x 701 matrix: the sum of the jth column is the number of child nodes in cluster j.

The depth parameter controls the depth of the trie in the hybrid index (3 in your case). HybridIndexer uses hierarchical clustering beyond the given depth until each leaf node has at most max_leaf_size datapoints. These singleton clusters are probably created because the index only builds a trie over those 2076 datapoints, which may result in a lot of singleton clusters (this would correspond to the number of unique length-3 prefixes in your datapoints?). After creating the trie, hierarchical clustering is not invoked in this example, as all clusters probably already contain fewer than max_leaf_size=100 datapoints.
Thanks Nishant, I did get the cluster chain part but wanted some confirmation, which you provided in good detail. b) How can we fix the issue we are facing, where each cluster contains only one label? Can we reduce the depth of the trie part of the index to, say, 1 or 2? Would that help?
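The column check described above can be sketched as follows. This is a minimal illustration on a toy matrix, not PECOS code: the helper name `children_per_cluster` and the toy data are hypothetical, and it assumes each level of the chain is a SciPy sparse matrix of shape (n_children, n_clusters) with a nonzero at (i, j) when child i belongs to cluster j.

```python
import numpy as np
import scipy.sparse as sp


def children_per_cluster(level_mat):
    """Count the children of each cluster (hypothetical helper).

    level_mat has shape (n_children, n_clusters). Stored entries are
    counted per column, so the stored values need not all equal 1.
    """
    return sp.csc_matrix(level_mat).getnnz(axis=0)


# Toy level matrix: 5 children, 3 clusters.
# Cluster 0 holds children 0-2; clusters 1 and 2 hold one child each.
rows = np.array([0, 1, 2, 3, 4])
cols = np.array([0, 0, 0, 1, 2])
toy = sp.csr_matrix((np.ones(5, dtype=np.float32), (rows, cols)), shape=(5, 3))

counts = children_per_cluster(toy)
n_singletons = int((counts == 1).sum())
print(counts, n_singletons)
```

For the chain in this thread, the same helper could be applied to the last matrix, for example `children_per_cluster(cluster_chain[len(cluster_chain) - 1])`, using the indexing shown in the original post.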
Yes. Sorry for using nodes and clusters interchangeably. Internal nodes in the tree correspond to clusters containing all leaf nodes (actual datapoints) in the subtree corresponding to the internal node.
Reducing the trie depth to 1 or 2 can help, but you can also post-process to explicitly merge those singleton branches.
Thanks Nishant, that helps.
One quick question: for this issue [https://github.com//issues/247] we are unable to save the model to disk. I checked internally, and the weights matrix and the cluster matrix are both CSR matrices, so I am not sure why we get an OS error, "operation not supported". The people who wrote this code haven't responded in two weeks. Is there another way we can get some help with this issue? It is blocking us from loading this model in an application and doing real-time inference. Thanks.
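One possible reading of "merge those singleton branches" is to fold all singleton clusters that share a parent into a single cluster. The sketch below does this for two adjacent levels on toy data. The function name `merge_singleton_siblings` is hypothetical, it is not part of PECOS, and it assumes every row of both matrices has exactly one nonzero. Whether PECOS accepts a chain modified this way has not been verified here.

```python
import numpy as np
import scipy.sparse as sp


def merge_singleton_siblings(leaf_mat, parent_mat):
    """Merge singleton clusters under the same parent (hypothetical helper).

    leaf_mat:   (n_items, n_clusters), one nonzero per row
    parent_mat: (n_clusters, n_parents), one nonzero per row
    Returns (new_leaf_mat, new_parent_mat) with fewer clusters.
    """
    leaf = sp.csc_matrix(leaf_mat)
    parent = sp.csr_matrix(parent_mat)
    sizes = leaf.getnnz(axis=0)  # items per cluster
    parent_of = np.asarray(parent.argmax(axis=1)).ravel()  # parent of each cluster

    new_id = np.empty(len(sizes), dtype=np.int64)
    merged_slot = {}  # parent id -> new cluster collecting its singletons
    next_id = 0
    for c in range(len(sizes)):
        p = int(parent_of[c])
        if sizes[c] == 1:
            if p not in merged_slot:
                merged_slot[p] = next_id
                next_id += 1
            new_id[c] = merged_slot[p]
        else:
            new_id[c] = next_id
            next_id += 1

    # Remap each item to the new id of its cluster.
    coo = leaf.tocoo()
    new_leaf = sp.csr_matrix(
        (np.ones(coo.nnz, dtype=np.float32), (coo.row, new_id[coo.col])),
        shape=(leaf.shape[0], next_id),
    )
    # Merged clusters keep the parent shared by the clusters they replace.
    new_parent_of = np.empty(next_id, dtype=np.int64)
    new_parent_of[new_id] = parent_of
    new_parent = sp.csr_matrix(
        (np.ones(next_id, dtype=np.float32), (np.arange(next_id), new_parent_of)),
        shape=(next_id, parent.shape[1]),
    )
    return new_leaf, new_parent


# Toy data: 5 items, 4 clusters, 2 parents.
# Cluster 0 = items {0, 1}; clusters 1, 2, 3 are singletons.
leaf = sp.csr_matrix(
    (np.ones(5, dtype=np.float32), ([0, 1, 2, 3, 4], [0, 0, 1, 2, 3])), shape=(5, 4)
)
# Clusters 0, 1, 2 sit under parent 0; cluster 3 sits under parent 1.
par = sp.csr_matrix(
    (np.ones(4, dtype=np.float32), ([0, 1, 2, 3], [0, 0, 0, 1])), shape=(4, 2)
)
new_leaf, new_par = merge_singleton_siblings(leaf, par)
print(new_leaf.shape, new_par.shape)
```

In the toy data, clusters 1 and 2 are merged because they share parent 0, while cluster 3 stays a singleton because it has no singleton sibling under parent 1. A fuller post-processing step would also need a policy for such leftover singletons.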
Description
I am using HybridIndexer for label indexing, with PIFA embeddings for the labels, built by combining the one-hot encoded label matrix with the input feature matrix.
I ran the following code:
# Build the hybrid index using the trie and clustering approaches together,
# with a max trie depth of 3 and a max_leaf_size of 100.
# Note: the import for HybridIndexer is not shown in the original post.
from pecos.utils.cluster_util import ClusterChain

# Hierarchical clustering chain as a list of sparse matrices,
# one sparse matrix representing the clusters at each depth.
cluster_chain: ClusterChain = HybridIndexer.gen(
    feat_mat=label_features,
    label_strs=labels_sorted,
    depth=3,
    max_leaf_size=100,
    seed=0,
    max_iter=40,
    spherical_clustering=True,
)
I then ran this to inspect the shapes:
# Explore the shape of the cluster chain
for i in range(len(cluster_chain)):
    print(type(cluster_chain[i]))
    print(f"shape of current matrix {cluster_chain[i].shape} ")
RESULTS:
shape of current matrix (24, 1)
<class 'scipy.sparse.csr.csr_matrix'>
shape of current matrix (184, 24)
<class 'scipy.sparse.csr.csr_matrix'>
shape of current matrix (701, 184)
<class 'scipy.sparse.csc.csc_matrix'>
shape of current matrix (2076, 701)
It makes sense that there are 4 matrices in the chain, since the trie depth is 3, but does that mean the hierarchical clustering part has only one level? Also, the last matrix has shape (2076, 701). Does that mean there are 701 clusters at the leaf level? I have only 2076 sorted unique labels, so why are so many clusters created at the leaf level, or am I misunderstanding this piece? And why is the hierarchical part of the tree flat, with a depth of only one level? I would assume it branches out after the trie. Or is it that in hybrid indexing you only get one level of clustering after the trie is built, with the clusters forming the last depth level? That might still make sense, but I am not sure why as many as 701 clusters are built for only about 2k labels.