UMAP Roadmap #15
Comments
Here's the smallest notebook I could think of for basic usage demonstration |
Thanks! I was hoping to have some further description in Markdown in the notebook, but this is an excellent beginning. |
Would you mind if we pulled direct quotes from your README for the notebook ('basic usage demonstration' and 'explaining parameter options and their effects')? I'm also currently wrapping the two ideas as one notebook, with a basic usage section at the top, and more in-depth information after that. Thoughts? |
Go ahead and pull whatever you need. It's helpful if you can explore the parameter effects in a little detail. Kyle McDonald had some nice min_dist comparisons here https://twitter.com/kcimc/status/930180473262919685 . Exploring some of the other effects similarly as well ( |
Here's another version, exploring some of the parameters |
I love the metric exploration! The custom metrics nicely show off what can be done, and the effects (e.g. the pure red metric has a clear linear embedding etc.) |
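As a rough illustration of the kind of custom colour metrics discussed here, the sketch below defines a red-channel-only distance and a circular hue distance on RGB triples. The names `red_distance` and `hue_distance` are made up for this example and are not taken from the notebook. Wiring them into UMAP (which may require compiling them with numba) is left out.

```python
import colorsys


def red_distance(a, b):
    """Distance that looks only at the red channel of two RGB triples."""
    return abs(a[0] - b[0])


def hue_distance(a, b):
    """Circular distance between the hues of two RGB triples (values in [0, 1])."""
    hue_a = colorsys.rgb_to_hsv(*a)[0]
    hue_b = colorsys.rgb_to_hsv(*b)[0]
    diff = abs(hue_a - hue_b)
    # Hue wraps around, so take the shorter way round the colour wheel.
    return min(diff, 1.0 - diff)


red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(red_distance(red, blue))
print(hue_distance(red, blue))
```

A metric that ignores everything but one channel can only order points along that channel, which is consistent with the linear embedding mentioned for the pure red metric.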
@Fil I have added in a version of your parameter exploration notebook (with some minor changes and added text commentary and explanation) in the notebooks directory. Have a look and let me know if it looks okay to you. I really appreciate your work on this, so let me know how you would like to be acknowledged within the notebook. |
Wow! I wanted to do the hue and HSL metrics, but didn't think they would turn out that splendid. Thank you! For the credit you should remove "excellent", and can add "for visionscarto.net" after my name for affiliation. I'm preparing another example, will follow up when it's ready :) |
So I had started on a different notebook approach, and decided to see it through to an alpha version. It uses scikit-learn's digits data, so it at least offers a different perspective. Like I said, it's an alpha/early draft version. There's plenty of points that I just got bored of writing instead of coding, but I'm going to go back to them soon. I also blinked and the documentation/code changed so I'll have to update that too. Here it is, That said, I really like the notebook @Fil came up with, and @lmcinnes improved on, I think it offers a better intro to UMAP. |
@CrakeNotSnowman That looks great! To be honest more intros are good, especially if they come from different perspectives, as this one does. There are some really interesting results in there. Sorry about the code and documentation changes; I'm a tinkerer and I can't help it. I definitely look forward to seeing this with any further expository writing. |
What about UMAP for text data (similar to word2vec)? |
I have a colleague who is working on that -- there's some underlying theory to be worked through, but I believe the core ideas are now all in place. The essence of the idea is this: word2vec can be viewed as (in the limit) a matrix factorization problem, which is to say similar to PCA. It should be possible to use manifold learning like UMAP to do the embedding rather than something linear like PCA. Ideally this should capture word similarity better, at the cost that word algebra will no longer work. The details are in what data to embed (something based on a word-word co-occurrence matrix), how to measure distance (negative log likelihoods under a suitable model), and how to interpret the theory around all of that. Progress is being made, but it may be a little while before anything releasable happens. |
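For readers unfamiliar with the term, here is a minimal sketch of building a word-word co-occurrence matrix from a toy corpus. It only covers the "what data to embed" step. The window size, the function names and the use of raw counts are assumptions for illustration, not the approach the colleague is working on, and the distance model and the embedding itself are not shown.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def to_dense_matrix(counts):
    """Turn pair counts into a vocabulary list and a square count matrix."""
    vocab = sorted({word for pair in counts for word in pair})
    index = {word: k for k, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for (a, b), n in counts.items():
        matrix[index[a]][index[b]] = n
    return vocab, matrix


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence_counts(corpus, window=1)
vocab, matrix = to_dense_matrix(counts)
```

Each row of `matrix` is then a candidate high-dimensional representation of one word, which is the kind of input a linear factorization or a manifold learner would reduce.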
Are you also planning to explore other exact and approximate k-nn graph methods? nmslib is a super fast parallelized implementation with a plethora of knn methods. |
I'm forgoing exact knn-graph methods as most are too slow on high dimensional data. I agree that nmslib is impressive but for this project I was hoping to keep the dependencies relatively self-contained. Right now I'm using my own python based implementation of NN-descent (for which kgraph is the reference implementation). The advantages of NN-descent are that it is non-metric space based (just like nmslib), and can be used for direct approximate knn-graph construction rather than building an index and then querying. If someone else wanted to build an optimized UMAP on top of nmslib I would certainly be interested to see it -- it would likely outperform this version due to the parallelism (presuming a suitably parallelised version of the SGD for layout was paired with it). |
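To make "direct approximate knn-graph construction" a little more concrete, here is a heavily simplified pure-Python sketch of the NN-descent idea: start from random neighbours and repeatedly compare "neighbours of neighbours". It is not the implementation used in UMAP or kgraph; it omits sampling, tree-based initialisation and all performance work, and the function names are made up for this example. It needs `k` to be smaller than the number of points.

```python
import random


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def try_insert(neighbors, k, dist, idx):
    """Insert (dist, idx) into a sorted neighbour list if it improves the list."""
    if any(j == idx for _, j in neighbors):
        return False
    if len(neighbors) >= k and dist >= neighbors[-1][0]:
        return False
    neighbors.append((dist, idx))
    neighbors.sort()
    del neighbors[k:]
    return True


def nn_descent(points, k, metric=euclidean, n_iters=10, seed=0):
    """Return, for each point, a sorted list of (distance, index) pairs."""
    rng = random.Random(seed)
    n = len(points)
    # Random initial graph: k distinct neighbours per point, never the point itself.
    graph = []
    for i in range(n):
        others = rng.sample([j for j in range(n) if j != i], k)
        graph.append(sorted((metric(points[i], points[j]), j) for j in others))
    for _iteration in range(n_iters):
        # Candidates of a point: its current neighbours plus its reverse neighbours.
        candidates = [set(j for _, j in graph[i]) for i in range(n)]
        for i in range(n):
            for _, j in graph[i]:
                candidates[j].add(i)
        updates = 0
        # Local join: any two candidates of the same point may be close to each other.
        for i in range(n):
            local = sorted(candidates[i])
            for pos, a in enumerate(local):
                for b in local[pos + 1:]:
                    d = metric(points[a], points[b])
                    updates += try_insert(graph[a], k, d, b)
                    updates += try_insert(graph[b], k, d, a)
        if updates == 0:
            break
    return graph
```

Note that `metric` is only ever called on pairs of points, so nothing here relies on the triangle inequality; that is the sense in which the method works in non-metric spaces.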
It would be useful to have a way to save the UMAP model to a file for transforming future data into the same space. What would it take to get save/load functions? |
@ghannum I admit that I had been hoping that the standard methods for model persistence in sklearn (pickling etc.) would handle this -- is that not working with UMAP, or are you looking for something a little different than what it would provide? This isn't really my area of expertise, so you'll have to excuse my lack of knowledge here. |
@lmcinnes I tried to pickle the model file, but pickling only works for data objects - not classes. I believe the correct approach would be to write one function which puts all of the relevant model data into a list and pickles the list. Then write a load function which loads the pickled data and constructs the model object. |
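A minimal sketch of the save/load pattern proposed above, using a toy class rather than UMAP. `TinyModel`, `save_model` and `load_model` are hypothetical names, and a dict is used in place of the suggested list so that each stored value is labelled. Only plain data is pickled; the load function rebuilds the object.

```python
import io
import pickle


class TinyModel:
    """Hypothetical stand-in for a fitted model; this is not the UMAP class."""

    def __init__(self, n_neighbors=15):
        self.n_neighbors = n_neighbors
        self.center_ = None

    def fit(self, rows):
        n = len(rows)
        self.center_ = [sum(col) / n for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[x - c for x, c in zip(row, self.center_)] for row in rows]


def save_model(model, fileobj):
    # Collect only the plain data needed to rebuild the model.
    state = {"n_neighbors": model.n_neighbors, "center_": model.center_}
    pickle.dump(state, fileobj, pickle.HIGHEST_PROTOCOL)


def load_model(fileobj):
    # Construct a fresh object and restore the fitted values onto it.
    state = pickle.load(fileobj)
    model = TinyModel(n_neighbors=state["n_neighbors"])
    model.center_ = state["center_"]
    return model


buffer = io.BytesIO()
model = TinyModel(n_neighbors=5).fit([[1.0, 2.0], [3.0, 4.0]])
save_model(model, buffer)
buffer.seek(0)
restored = load_model(buffer)
```

The trade-off is that the list of stored attributes has to be kept in sync with the class by hand.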
@ghannum Okay, thanks, I'll try to look into this at some point. At the very least I'll add it to the roadmap. |
@lmcinnes very interested in that feature! |
Unless I am not understanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that shows pickling and unpickling of a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in that same file; the pickle only contains a reference to the metric function.
|
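The note about custom metrics follows from how pickle treats functions in general: it records where to import the function from, not the function's code. A small standard-library sketch of that behaviour, using `math.dist` (Python 3.8+) as a stand-in for a metric and a plain dict as a stand-in for model state:

```python
import math
import pickle

# Stand-in for model state that holds a metric function.
state = {"n_neighbors": 5, "metric": math.dist}

payload = pickle.dumps(state, pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# The restored metric is the very same function object, looked up by name.
print(restored["metric"] is math.dist)
# The payload holds the module and function names, not the function body.
print(b"math" in payload, b"dist" in payload)
```

If the name cannot be resolved when the pickle is loaded (for example because the defining module is not importable there), unpickling fails, which is why the metric has to be defined before loading.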
Thanks for testing that out. I've been learning a little about potential
issues in pickling in pynndescent and it seems that ``pickle.HIGHEST_PROTOCOL``
is the key point here if you are using python2 -- by default pickle in
python2 uses a different protocol that may not support pickling UMAP well.
…On Fri, Jul 20, 2018 at 1:02 PM Joseph Courtney ***@***.***> wrote:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import umap
import pickle

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    stratify=digits.target,
    random_state=42
)

# Custom metric (Chebyshev distance); the pickle stores only a reference to it.
def mydist(x, y):
    return np.max(np.abs(x - y))

trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric=mydist
).fit(X_train)

plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], s=5, c=y_train, cmap='Spectral')
plt.title('Embedding of the training set by UMAP', fontsize=24)
plt.show()
plt.close()

# Save the fitted model, then load it back.
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)

# Transform unseen data with the reloaded model.
test_embedding = trans.transform(X_test)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test, cmap='Spectral')
plt.title('Embedding of the test set by UMAP', fontsize=24)
plt.show()
plt.close()
|
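The protocol point can be checked with the standard library alone. Python 2 defaults to the text-based protocol 0, while `pickle.HIGHEST_PROTOCOL` selects the newest binary protocol the running interpreter supports. The sketch below, run under Python 3, shows how the chosen protocol shows up in the first bytes of the output; the `data` dict is just an example payload.

```python
import pickle

data = {"n_neighbors": 5, "values": [1.0, 2.0, 3.0]}

text_protocol = pickle.dumps(data, protocol=0)    # what Python 2 uses by default
binary_protocol = pickle.dumps(data, protocol=2)  # highest protocol in Python 2
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
# Protocols 2 and up start with a PROTO opcode (0x80) followed by the version.
print(binary_protocol[:2])
print(newest[1] == pickle.HIGHEST_PROTOCOL)
```

All three byte strings load back to the same dict; they differ in encoding and in which object types they can represent efficiently.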
@josephcourtney's example fails when the training data is larger.
No error:
X = np.random.randn(4000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)
Error:
X = np.random.randn(5000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)
Traceback:
|
@bccho : That's a little disconcerting. It seems to be some sort of issue with pickle storing certain objects. At 4096 there is a switch in how knn computation is handled, so that may be responsible, but it is entirely unclear to me where in the whole process this is going astray. It must be in some subobjects of the basic UMAP class, so is likely an issue for those objects in general (scipy sparse matrices perhaps?). I'm away for a few days but I'll try to look into it when I get back. If there is any chance you can switch to python3 that will resolve the issue, but I understand that that is not always an option. |
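The 4000-versus-5000 difference reported above lines up with the 4096 switch. As a sketch only: the function name, the return labels and which side of the boundary 4096 itself falls on are assumptions for illustration, not taken from the UMAP source.

```python
SMALL_DATA_THRESHOLD = 4096  # cutoff mentioned in the thread


def choose_knn_path(n_samples, threshold=SMALL_DATA_THRESHOLD):
    """Pick a nearest-neighbour strategy from the dataset size (illustrative)."""
    if n_samples < threshold:
        return "exact"        # small data: compute all pairwise distances
    return "approximate"      # large data: build an approximate knn graph


print(choose_knn_path(4000))  # the case that pickled without error
print(choose_knn_path(5000))  # the case that failed to unpickle
```

If only the approximate path creates the extra search objects, that would explain why the smaller dataset pickles cleanly, though the thread has not confirmed this at this point.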
Unfortunately I have no control over moving to python 3 (as much as I would like to), but for a workaround, I can try saving individual subobjects to files and re-loading them.
EDIT: After iterating through individual attributes from
EDIT: Here is a functioning workaround for Python 2:
import pickle

def save_umap(umap):
    # Drop the attributes that fail to pickle; note that this modifies the object passed in.
    for attr in ["_tree_init", "_search", "_random_init"]:
        if hasattr(umap, attr):
            delattr(umap, attr)
    return pickle.dumps(umap, pickle.HIGHEST_PROTOCOL)

def load_umap(s):
    umap = pickle.loads(s)
    # Rebuild the dropped attributes from the stored distance function.
    from umap.nndescent import make_initialisations, make_initialized_nnd_search
    umap._random_init, umap._tree_init = make_initialisations(
        umap._distance_func, umap._dist_args
    )
    umap._search = make_initialized_nnd_search(
        umap._distance_func, umap._dist_args
    )
    return umap

import numpy as np
X = np.random.randn(5000, 16)
X_new = np.random.randn(100, 16)

from umap import UMAP
um = UMAP()
um.fit(X)
emb = um.transform(X_new)

pkl = save_umap(um)
um_new = load_umap(pkl)  # no error!
emb_new = um_new.transform(X_new) |
Sorry I'm currently on vacation. I stole a little time, but I won't be able
to look into this properly until Monday.
…On Wed, Aug 29, 2018 at 3:08 PM Byung-Cheol Cho ***@***.***> wrote:
Unfortunately I have no control over moving to python 3 (as much as I
would like to), but for a workaround, I can try saving individual
subobjects to files and re-loading them.
Can you indicate what subobjects and parameters are required for transform
to work correctly?
|
Thanks for finding a workaround! It looks like it was the numba-jitted functions that were not pickling properly, at least under 2.7. I'll have to see if I can figure out a more permanent solution. |
I think that was the problem too. You could probably put the save_umap and get_umap code in as part of __getstate__ and __setstate__ |
That makes sense. I'll add it to my todo list. Thanks.
…On Sat, Sep 8, 2018 at 2:48 PM Byung-Cheol Cho ***@***.***> wrote:
I think that was the problem too. You could probably put the save_umap
and get_umap code in as part of __getstate__ and __setstate__
|
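A minimal sketch of what the `__getstate__` / `__setstate__` suggestion could look like, using a toy class rather than UMAP. `Model`, `make_search` and the `_search` attribute are stand-ins invented for this example: the closure plays the role of a compiled helper that pickle cannot store.

```python
import pickle


def make_search(scale):
    """Stand-in for a compiled helper; closures like this cannot be pickled."""
    def search(x):
        return x * scale
    return search


class Model:
    _REBUILDABLE = ("_search",)  # attributes dropped on save and rebuilt on load

    def __init__(self, scale):
        self.scale = scale
        self._search = make_search(scale)

    def transform(self, x):
        return self._search(x)

    def __getstate__(self):
        # Work on a copy so the live object keeps its helper.
        state = self.__dict__.copy()
        for name in self._REBUILDABLE:
            state.pop(name, None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Recreate the helper from the plain data that was stored.
        self._search = make_search(self.scale)


model = Model(3)
restored = pickle.loads(pickle.dumps(model, pickle.HIGHEST_PROTOCOL))
print(restored.transform(2))
```

Compared with deleting attributes before calling `pickle.dumps`, this leaves the original object usable after saving and lets plain `pickle.dump` and `pickle.load` work without wrapper functions.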
I was able to persist |
@bccho thanks for this fix, been running into the problem with larger training sets with joblib and pickle for the past week. Needed to use Same error as above:
|
I am thinking about using UMAP as a feature extraction method for an IDS project. Is it a good idea? Has anybody done this before? |
It is worth trying, but a lot will depend on the nature of your data. I have seen UMAP used for IDS projects, though usually more as part of an exploratory tool rather than a production pipeline. |
Can you share some of these projects with me? |
Unfortunately I can't share details. Sorry. |
@lmcinnes my two cents is that the issue with umap is the use case. I see that a lot of people do not know what the advantage of using umap instead of t-sne / pca is... |
Couldn't get the pickle.dumps/loads workaround to work (python3.8).
man = self._unserialize_umap(man)
File "/app/jwtauthtest/autoencoder.py", line 223, in _unserialize_umap
umap = pickle.loads(s)
File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in __setstate__
self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in <listcomp>
self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
File "/usr/local/lib/python3.8/site-packages/pynndescent/rp_trees.py", line 1178, in renumbaify_tree
hyperplanes.extend(tree.hyperplanes)
File "/usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 366, in extend
return _extend(self, iterable)
File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "/usr/local/lib/python3.8/site-packages/numba/core/utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7feb5058daf0>) found for signature:
>>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_append at 0x7feb5058d280>) found for signature:
>>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
Rejected as the implementation raised a specific error:
LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
def impl(l, item):
casteditem = _cast(item, itemty)
^
During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (597)
raised from /usr/local/lib/python3.8/site-packages/numba/core/utils.py:81
- Resolution failure for non-literal arguments:
None
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (1051)
File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
def impl(l, iterable):
<source elided>
for i in iterable:
l.append(i)
^
raised from /usr/local/lib/python3.8/site-packages/numba/core/typeinfer.py:994
- Resolution failure for non-literal arguments:
None
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py (101)
File "../usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
return l.extend(iterable)
Also tried |
A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are hard, and some require deeper knowledge of UMAP. Short and medium term tasks should be approachable for many people. Reply to this issue if you are interested in taking up any of them.
Short term items
Medium term items
readthedocs
Longer term items
No priority