
fit_transform and transform on the same feature doesn't return the same value #48

Open · duguyue100 opened this issue Dec 28, 2022 · 9 comments

duguyue100 commented Dec 28, 2022

Hi, thanks for developing PaCMAP, lovely work!

I found that using transform after using fit_transform on the same set of features yields different results.

I ran the following example:

import pacmap
import numpy as np

np.random.seed(0)

init = "pca"  # results can be reproduced also with "random"

reducer = pacmap.PaCMAP(
    n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, save_tree=True
)

features = np.random.randn(100, 30)

reduced_features = reducer.fit_transform(features, init=init)
print(reduced_features[:10])

transformed_features = reducer.transform(features)
print(transformed_features[:10])

It prints:

[[ 0.7728913   3.785831  ]
 [-0.69379026  2.116452  ]
 [-1.7770871  -0.97542125]
 [ 2.5090704   1.8718773 ]
 [-0.06890291 -2.2959301 ]
 [ 1.9657456   1.1580495 ]
 [ 1.0486693  -1.4648851 ]
 [-1.4896832   1.7203271 ]
 [ 0.54106015  2.38868   ]
 [ 3.0175838  -1.9216222 ]]

[[-0.03516154  2.543376  ]
 [-0.467008    1.6641414 ]
 [-0.44973713 -1.535601  ]
 [ 1.0218439   1.5691875 ]
 [-0.30733356 -2.3227684 ]
 [ 0.8294033   1.0432268 ]
 [ 0.10503205 -0.8651409 ]
 [-0.63982046  0.59202313]
 [ 0.38573623  1.5135498 ]
 [ 2.0508025  -1.5033388 ]]

I would expect the same results, because fit_transform should be equivalent to fit followed by transform (regardless of implementation details). This is how PCA in sklearn and UMAP behave.

Is this an intended feature? And if the answer is no, what should we do? One possible workaround I found is:

reducer = reducer.fit(features, init=init)

# Now the following lines return the same feature.
reduced_features = reducer.transform(features)
transformed_features = reducer.transform(features)

But this only solves the problem at the implementation level, not at the conceptual level. Since the returned values from fit_transform and transform are different, I'm not sure I can trust the output of transform.

PS: this has nothing to do with the random seed, since I fixed the random seed, I can get the same result across runs.

hyhuang00 (Collaborator) commented
Hi there! Thank you for using PaCMAP. The results are expected to differ: in PaCMAP, the transform() function treats the input as additional data points that are appended to the original data. In the current version, transform() places each new input near its nearest neighbors' low-dimensional embeddings, so there is no guarantee that the same points will always be placed in the same location. This design choice allows the points to be differentiated. However, as stated in the README, this feature is not finalized, and we welcome any feedback on its design. Is there a reason you want two data points with the same value to be placed at the same location?
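The placement strategy described above can be illustrated with a toy NumPy sketch (this is an illustration of the nearest-neighbor-placement idea only, not PaCMAP's actual implementation; `place_new_points` and `k` are hypothetical names):

```python
import numpy as np

def place_new_points(X_train, emb_train, X_new, k=3):
    """Toy sketch: initialize each new point at the mean of the
    low-dimensional embeddings of its k nearest training neighbors.
    PaCMAP additionally optimizes these positions afterwards, so even
    an exact duplicate of a training point need not land exactly on it."""
    placed = np.empty((len(X_new), emb_train.shape[1]))
    for i, x in enumerate(X_new):
        # Brute-force nearest neighbors in the high-dimensional space
        dists = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(dists)[:k]
        placed[i] = emb_train[nn].mean(axis=0)
    return placed
```

Because the final position also depends on a subsequent optimization over the whole point set, two calls that see the same coordinates can still land in slightly different places.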

duguyue100 (Author) commented Jan 4, 2023

Thanks for your fast reply! Conceptually, I think it makes more sense for identical incoming points to be projected to the same place as the original points. Users who have used PCA or UMAP before (like me) would expect this behavior.

My specific case was writing a test in our software to check that fit_transform and transform produce the same results. Since that test fails, I disabled certain reproducibility behavior in our software for PaCMAP.

Full disclosure: I haven't read the PaCMAP paper, so I'm not sure whether what I described is doable. If it is not possible for PaCMAP to mirror sklearn's fit_transform and transform, then I think a big bold warning belongs in both the README and the documentation.

MattWenham commented
> Is there a reason you want two data points with the same value to be placed at the same location?

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

hyhuang00 (Collaborator) commented
> Thanks for your fast reply! Conceptually, I think it makes more sense for identical incoming points to be projected to the same place as the original points. Users who have used PCA or UMAP before (like me) would expect this behavior.
>
> My specific case was writing a test in our software to check that fit_transform and transform produce the same results. Since that test fails, I disabled certain reproducibility behavior in our software for PaCMAP.
>
> Full disclosure: I haven't read the PaCMAP paper, so I'm not sure whether what I described is doable. If it is not possible for PaCMAP to mirror sklearn's fit_transform and transform, then I think a big bold warning belongs in both the README and the documentation.

Thank you for your suggestion! A warning has been added to the method, and we will think about ways to improve the transform method.

duguyue100 (Author) commented
@hyhuang00 Thanks for your effort, you can close this issue if you want.

hyhuang00 (Collaborator) commented
> Is there a reason you want two data points with the same value to be placed at the same location?

> Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization. It helps the embedding avoid the so-called "crowding problem" during optimization, and it sometimes helps our users see that multiple points occupy the same place, forming a cluster. This might be less helpful when the embedding is used for other purposes. Perhaps we can add an option to allow different behavior.

MattWenham commented
> Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

> Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization.

Very true, but 'very similar values' and 'the same value' are two different use cases.

TCWO commented Aug 17, 2023

Hi there, I am trying to fit a model on a smaller set and then apply transform to a bigger set, but I encountered the error below, which I assume is about generating the neighbors. Can you let me know how to handle it?

AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_623958/2284593526.py in <module>
----> 1 data_all_dr, t_all_dr = DimRed2(data_sampl, data_norm, method = dr, dims=dims)

/tmp/ipykernel_623958/736146467.py in DimRed2(df1, df2, method, dims, pca)
     84
     85 # Now, use the fitted model to transform a larger dataset (X_large)
---> 86 dr = embedding.transform(X2, init='pca', save_pairs=False)
     87
     88 end = time.time()

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
    932                          self.apply_pca, self.verbose)
    933     # Sample pairs
--> 934     self.pair_XP = generate_extra_pair_basis(basis, X,
    935                                              self.n_neighbors,
    936                                              self.tree,

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
    397     npr, dimp = X.shape
    398
--> 399     assert (basis is not None or tree is not None), "If the annoyindex is not cached, the original dataset must be provided."
    400
    401     # Build the tree again if not cached

AssertionError: If the annoyindex is not cached, the original dataset must be provided.

and here is my function; X is the smaller set and X2 the big dataset:

elif method == 'PaCMAP':
    # Slightly different, since we need to transform the dataframe
    # into an array as input for the pacmap function
    start = time.time()
    X = np.asarray(data)
    X = X.reshape(X.shape[0], -1)
    X2 = np.asarray(data2)
    X2 = X2.reshape(X2.shape[0], -1)
    # Setting n_neighbors to None leads to a default choice
    embedding = pacmap.PaCMAP(n_components=dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    # fit the data (the index of transformed data corresponds to the index of the original data)
    #embedding.fit(X, init="pca")
    #dr = embedding.transform(X2)

    # Fit and transform using the smaller dataset (X)
    embedding_small = embedding.fit_transform(X, init='pca', save_pairs=True)

    # Now, use the fitted model to transform the larger dataset (X2)
    dr = embedding.transform(X2, init='pca', save_pairs=False)

    end = time.time()
    t = end - start

escheer commented Mar 14, 2024

Hello, thank you for PaCMAP, beautiful work.

I second this question. I am hitting:

AssertionError: If the annoyindex is not cached, the original dataset must be provided.

when I call the transform method on a new dataset after the model has already been fit on a previous one. It is desirable to be able to transform new data into an existing embedding space.
Can you provide some guidance on this?

EDIT: this was due to the fact that I had not specified save_tree=True. Might be good to spell that out a bit more clearly in the documentation! Thank you :)
