Trying to use HybridIndexer for Label Indexing, run into issue where TrieWrapper has no attribute '_sorted' #244
Comments
I would appreciate feedback on this matter, as we are blocked from using HybridIndexer. From what I can tell, this attribute is not really used. The code in question lives under the PECOS examples path, examples/qp2q/models/indices.py. |
The bug is a result of So there can be two solutions:
Hope this will help resolve the issue! |
Thanks, Nitin, for your response. What I did to bypass the error before your response was the following: I added an __init__ method to TrieWrapper, and wherever child_trie._sorted was being assigned in the code I hardcoded it to True. I am curious what the impact is of traversing children in sorted order or not, especially since we are building an autocomplete solution as well. |
Irrespective of hardcoding _sorted, I will try out your suggestion. Thanks so much. |
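The workaround described above can be illustrated with a minimal sketch. It does not use pygtrie or the PECOS TrieWrapper; `TrieWrapperSketch` and `Node` are hypothetical stand-ins that only mimic the pattern in the traceback, where child wrappers are created without running `__init__` and then copy `_sorted` from the parent. Giving the class a default and setting the attribute in `__init__` means the lookup can no longer fail.

```python
class Node:
    """Hypothetical stand-in for a trie node (not pygtrie's node type)."""

    def __init__(self):
        self.children = {}


class TrieWrapperSketch:
    # Class-level default: instances built without __init__ still have _sorted.
    _sorted = False

    def __init__(self, sorted_children=False):
        self._root = Node()
        self._sorted = sorted_children

    def add(self, key):
        # Insert a string one character at a time.
        node = self._root
        for ch in key:
            node = node.children.setdefault(ch, Node())

    def get_children(self):
        items = list(self._root.children.items())
        if self._sorted:
            # Only the iteration order changes; the set of children is the same.
            items.sort(key=lambda kv: kv[0])
        for child_char, child_root in items:
            # Mirrors the pattern in the traceback: bypass __init__, then copy state.
            child = TrieWrapperSketch.__new__(TrieWrapperSketch)
            child._root = child_root
            child._sorted = self._sorted
            yield child_char, child


trie = TrieWrapperSketch(sorted_children=True)
for word in ["banana", "apple", "cherry"]:
    trie.add(word)
first_chars = [ch for ch, _ in trie.get_children()]
```

In this sketch the flag affects only the order in which children are yielded. Whether the same holds for the real TrieWrapper depends on the pygtrie version in use and is not confirmed in this thread.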
Nitin, I have another question. Once we build clusters using hybrid indexing, how can I visualize those hierarchical clusters? I want to see which label embeddings are in the same cluster. Do the label strings go into those clusters as well, and how can I compare which set of labels ends up in the same cluster? |
Second question: I am trying to follow the example in examples/qp2q/models/pecosq2q.py, and I am not sure why this is being done in this if/else logic. I initially build a one-hot encoding of the labels and then convert it to a CSR matrix, which is Y here for us; then I am trying to use PIFA embedding with the input feature matrix as X (lines 514 - 519).
thanks |
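For the first question (seeing which label strings share a cluster), one possible approach is sketched below. It assumes, without confirmation from this thread, that the last matrix of the cluster chain has one row per label and one column per leaf cluster, so that each nonzero entry pairs a label index with a cluster index. With a scipy sparse matrix those pairs could be read from the `.row` and `.col` attributes of `matrix.tocoo()`; the sketch takes them as plain sequences so it runs without any dependencies. `group_labels_by_cluster` is a hypothetical helper, not a PECOS function.

```python
from collections import defaultdict


def group_labels_by_cluster(label_strs, label_idx, cluster_idx):
    """Map each cluster index to the label strings assigned to it.

    label_idx / cluster_idx are parallel sequences of (row, column)
    coordinates of the nonzero entries of a label-to-cluster matrix.
    """
    groups = defaultdict(list)
    for i, c in zip(label_idx, cluster_idx):
        groups[int(c)].append(label_strs[int(i)])
    return dict(groups)


# Toy data: three labels, the first two assigned to cluster 0.
labels = ["apple", "apply", "banana"]
clusters = group_labels_by_cluster(labels, [0, 1, 2], [0, 0, 1])
for cluster_id, members in sorted(clusters.items()):
    print(cluster_id, members)
```

The label order passed in must match the row order of the matrix, which in the reproduction in this issue would be the sorted list of unique labels.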
I think the default value of
See Line 26 for more details.
If |
One more question, about saving the model. I was able to train the model successfully (at least with no shape errors or other errors). This is the code snippet I used:
from pecos.xmc.xlinear.model import XLinearModel
xlinear_model = XLinearModel.train( )
xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")
But when I try to save the model to disk as shown above, I get the following error:
INFO - pecos.xmc.base - Training Layer 0 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
Stack trace:
/databricks/python/lib/python3.9/site-packages/scipy/sparse/_matrix_io.py in save_npz(file, matrix, compressed)
<array_function internals> in savez(*args, **kwargs)
/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in savez(file, *args, **kwds)
/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in _savez(file, args, kwds, compress, allow_pickle, pickle_kwargs)
/usr/lib/python3.9/zipfile.py in close(self)
OSError: [Errno 95] Operation not supported
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/xlinear/model.py in save(self, model_folder)
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/utils/smat_util.py in save_matrix(tgt, mat)
OSError: [Errno 95] Operation not supported |
I wonder if the above error is again related to some versioning problem from running on Databricks with a different Python version, etc. |
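The trace shows the OSError being raised from zipfile's close() while the .npz file is written directly under /dbfs. One possible explanation, not confirmed in this thread, is that the mounted filesystem rejects an operation zipfile performs when finalizing the archive, rather than anything being wrong with the matrices. A way to test that hypothesis is to save to a local temporary directory first and then copy the result to the target. The sketch below uses only the standard library; `save_via_local_dir` is a hypothetical helper and `save_fn` stands for any callable that writes a model into a folder, such as `xlinear_model.save`.

```python
import shutil
import tempfile
from pathlib import Path


def save_via_local_dir(save_fn, target_dir):
    """Call save_fn on a local temp folder, then copy the files to target_dir."""
    target = Path(target_dir)
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "model"
        local.mkdir()
        # All zip/npz writing happens on the local filesystem.
        save_fn(str(local))
        # Copy the finished files to the final location (Python 3.8+).
        shutil.copytree(local, target, dirs_exist_ok=True)
    return target


# Hypothetical usage, mirroring the snippet in this comment:
# save_via_local_dir(xlinear_model.save,
#                    "/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")
```

If the copy step succeeds where the direct save fails, that points to the target filesystem rather than a Python or PECOS version mismatch. Whether copytree itself works on a given mount has not been tested here.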
Does it say what is the type of the matrix? Perhaps @rofuyu or @OctoberChang might be able to help with this as it looks like an issue with core pecos functionality? |
Let me check why I don't see the error that says 'this type not supported' and describes the type. The code is there with a raise of NotImplemented, but it doesn't show up in the logging in Databricks. |
The matrices W and C under each model_chain element are both sparse CSR matrices, so why is it having issues? Is some other member causing the problem? |
@rofuyu or @OctoberChang, can you please shed light on this issue? It is blocking us from saving the model to disk. |
@preetbawa , I know that you checked that the matrices W and C are sparse_csr matrices but can you share more details about the exact exception being raised here? What exactly does the exception message from Line 98 say? This error would not be raised if the matrix being saved was a scipy sparse_csr matrix. |
Description
We are trying to leverage the XLinear model for an autocomplete suggestion model. Trie plus hierarchical clustering makes sense for our use case, so we are using the HybridIndexer method, and it runs into an error while building the clusters.
How to Reproduce?
For reasons of compliance, I can't put the data here, but the idea is to create a pandas DataFrame with 3 columns:
a) previous_query, prefix, and next_query (next_query is the label). It is easy to create a dummy pandas DataFrame with this data.
b) Here search_session_training_set_sorted is the pandas DataFrame with "previous_query", "prefix", and "next_query" columns, sorted by the label column "next_query".
Build the one-hot encoded label matrix wrapped in a scipy CSR matrix.
label_y_ohe_matrix = csr_matrix(pd.get_dummies(search_session_training_set_sorted["next_query"]).values).astype(np.float32)
Build the sorted unique set of label_strs for the trie part of indexing.
labels_unique = set(search_session_training_set_sorted["next_query"].values.flatten())
labels_unique_sorted = sorted(labels_unique)
Build the prefix tf-idf position-weighted char-level vectorizer and get the tf-idf vectors for each prefix.
input_x_prefix_list = search_session_training_set_sorted["prefix"].tolist()
tf_idf_prefix_vectorizer = PositionProductTfidf(analyzer="char", ngram_range=(1,2), dtype=np.float32, strip_accents="unicode")
input_prefix_matrix = tf_idf_prefix_vectorizer.fit_transform(input_x_prefix_list)
Build the previous_query tf-idf word-level vectorizer and get the tf-idf vectors for all previous_query terms.
input_prev_query_list = search_session_training_set_sorted["previous_query"].tolist()
tf_idf_prev_query_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1,1), dtype=np.float32, strip_accents="unicode")
input_prev_query_matrix = tf_idf_prev_query_vectorizer.fit_transform(input_prev_query_list)
Horizontally stack the previous query and prefix matrices into one input feature matrix (CSR format).
input_feature_matrix = normalize(smat.hstack([input_prev_query_matrix, input_prefix_matrix]), "l2", axis=1)
Build label features using the PIFA embedding method.
label_features = csr_matrix(LabelEmbeddingFactory.create(
label_y_ohe_matrix,
input_feature_matrix,
method="pifa"), dtype=sp.float32)
Do label indexing using the HybridIndexer strategy.
cluster_matrix = HybridIndexer.gen(feat_mat=label_features,
label_strs=labels_unique_sorted,
depth=2,
max_leaf_size=100,
seed=0,
max_iter=20,
spherical_clustering=True
)
This last command generates the error below:
07/11/2023 16:54:16 - INFO - py4j.java_gateway - Received command c on object id p1
07/11/2023 16:54:16 - INFO - main - Starting Hybrid-Trie Indexing
07/11/2023 16:54:16 - INFO - main - Added all labels to trie. Now building trie till depth = 2
in build_cluster_chain(self, depth)
79 def build_cluster_chain(self, depth):
80
---> 81 cluster_chain = self._build_sparse_cluster_chain_helper(depth=depth)
82
83 assert len(cluster_chain) == depth + 1
in _build_sparse_cluster_chain_helper(self, depth)
162 par_child_smat = smat.coo_matrix(np.ones((self.n_children, 1)))
163
--> 164 for child_char, child_trie in self.get_children():
165 child_cluster_chain = child_trie._build_sparse_cluster_chain_helper(depth=depth - 1)
166 all_cluster_chains += [child_cluster_chain]
in get_children(self)
29 child_trie._root = child_root
30 assert isinstance(child_trie._root, pygtrie._Node)
---> 31 child_trie._sorted = self._sorted
32 yield child_char, child_trie
33 elif isinstance(self._root.children, pygtrie._OneChild):
AttributeError: 'TrieWrapper' object has no attribute '_sorted'
Environment
(Add as much information about your environment as possible, e.g. dependencies versions.)