fit_transform and transform on the same features don't return the same value #48
Comments
Hi there! Thank you for using PaCMAP. The result is expected to be different, since in PaCMAP the …
Thanks for your fast reply! I think conceptually it makes more sense for identical incoming points to be projected to the same place as the old points. Users who have used PCA or UMAP before (like me) would expect this behavior. My specific case was writing a test for our software to check that … Full disclosure, I haven't read the PaCMAP paper, so I'm not sure whether what I described here is doable. If it is not possible for PaCMAP to mirror sklearn's …
Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where it is actually useful for identical points not to land in the same place?
Thank you for your suggestion! A warning has been added to the method, and we will think about ways to improve the …
@hyhuang00 Thanks for your effort, you can close this issue if you want.
Ensuring that points with very similar values in the high-dimensional space are placed at close but distinct locations in the low-dimensional space is useful for visualization. It helps the embedding avoid the so-called "crowding problem" during optimization, and it sometimes helps users see that multiple points sit in the same region, forming a cluster. This may be less helpful when the embedding is used for other purposes. Perhaps we can add an option to allow different behavior.
Very true, but 'very similar values' and 'the same value' are two different use cases.
Hi there, I am trying to fit a model on a smaller set and then apply the transform to a bigger set, but I encountered the error below, which I assume is about generating the neighbors. Can you let me know how I can handle it?

AssertionError Traceback (most recent call last)
/tmp/ipykernel_623958/736146467.py in DimRed2(df1, df2, method, dims, pca)
~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
AssertionError: If the annoyindex is not cached, the original dataset must be provided.

And here is my function; X is the smaller set and X2 the big dataset:
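(The function itself isn't preserved in this thread.) Below is a minimal sketch of one way around this assertion, assuming, from the `transform(self, X, basis, ...)` signature visible in the traceback, that the original training data can be handed back in as `basis`; the data and parameter choices are placeholders, not the poster's code:

```python
import numpy as np
import pacmap

rng = np.random.default_rng(0)
X_small = rng.normal(size=(1_000, 20)).astype(np.float32)  # stand-in for the smaller training set
X_big = rng.normal(size=(5_000, 20)).astype(np.float32)    # stand-in for the bigger set to project

reducer = pacmap.PaCMAP(n_components=2)
emb_small = reducer.fit_transform(X_small)

# The assertion says the original dataset must be provided when the Annoy
# index is not cached; the traceback suggests it is passed as `basis`,
# so hand the training data back in here.
emb_big = reducer.transform(X_big, basis=X_small)
print(emb_small.shape, emb_big.shape)
```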
Hello, thank you for PaCMAP, beautiful work. I second this question. I am hitting "AssertionError: If the annoyindex is not cached, the original dataset must be provided." when I call the transform method on a new dataset after the model has already been fit on a previous one. It is desirable to be able to transform new data into an existing embedding space. EDIT: this was because I had not specified save_tree=True. Might be good to spell that out a bit more clearly in the documentation! Thank you :)
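A short sketch of the fix mentioned in this comment, assuming `save_tree` is accepted by the constructor in your pacmap release (hypothetical data; other parameters left at their defaults):

```python
import numpy as np
import pacmap

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2_000, 30)).astype(np.float32)
X_new = rng.normal(size=(500, 30)).astype(np.float32)

# Keeping the Annoy index cached lets transform() be called later
# without handing the original training data back in.
reducer = pacmap.PaCMAP(n_components=2, save_tree=True)
reducer.fit_transform(X_train)

emb_new = reducer.transform(X_new)  # no "annoyindex is not cached" assertion
print(emb_new.shape)
```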
Hi, thanks for developing PaCMAP, lovely work!

I found that using `transform` after using `fit_transform` on the same set of features yields different results. I ran a small example (calling `fit_transform` on a feature matrix and then `transform` on the same matrix), and the two calls return different embeddings.

I would expect the same results, because `fit_transform` should be the combination of `fit` and `transform` (regardless of the implementation details); this is what PCA in sklearn and UMAP do. Is this an intended feature? And if the answer is no, what should we do? I found one possible workaround, but it only solves the problem at the implementation level, not at the conceptual level. Since the returned values from `fit_transform` and `transform` are different, I'm not sure I can trust the output of `transform`.

PS: this has nothing to do with the random seed; with the random seed fixed, I get the same result across runs.
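The original snippet and its printed output aren't preserved in this thread; the sketch below only illustrates the kind of check being described, with made-up data and `save_tree=True` (per the comments above) so that `transform` can be called after fitting:

```python
import numpy as np
import pacmap

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 10)).astype(np.float32)

reducer = pacmap.PaCMAP(n_components=2, save_tree=True)
emb_fit = reducer.fit_transform(X)  # embedding produced during fitting
emb_tr = reducer.transform(X)       # re-projecting the identical points

# PCA in sklearn (and typically UMAP) would give (near-)identical rows here;
# with PaCMAP the two embeddings are expected to differ.
print(np.allclose(emb_fit, emb_tr, atol=1e-3))
```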