Custom dataset functionality #42

Open
DmitryKey opened this issue Sep 30, 2021 · 2 comments

@DmitryKey
Contributor

I need to implement a custom dataset and its handling, and I have been thinking about the easiest way to approach it.

I've implemented a half-way solution that kept me going and allowed me to plug in a custom dataset -- in fact, it is a dataset derived from BIGANN by reducing its dimensionality with a neural network.

I will show the code for what I needed to change, and I am happy to discuss this further!

@maumueller
Collaborator

@DmitryKey I'm not sure where the actual dimensionality reduction happens in #43. It seems that you just needed to place the entry into datasets.py, which is the right approach.
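For illustration, registering a custom dataset in a DATASETS-style dictionary could look like the sketch below. The class name `ReducedBigANN` and the exact attributes are hypothetical, not the repository's actual code; the point is that an entry maps a recognized name to a factory that carries the dataset's own dtype and dimensionality.

```python
# Hypothetical sketch of a DATASETS-dict entry for a BIGANN-derived dataset
# whose vectors were reduced by an external neural network. Names and fields
# are illustrative, not the framework's exact API.

class ReducedBigANN:
    """A reduced-dimensionality dataset derived from BIGANN."""

    def __init__(self, name, nb_vectors, dims, dtype="float32"):
        self.name = name
        self.nb_vectors = nb_vectors
        self.dims = dims      # reduced dimensionality, lower than the original
        self.dtype = dtype    # may differ from the original BIGANN dtype (uint8)

# Registry mapping a recognized dataset name to a factory, in the spirit of
# the DATASETS dictionary in datasets.py.
DATASETS = {
    "reduced-bigann-10M": lambda: ReducedBigANN("reduced-bigann-10M",
                                                10_000_000, 64),
}

ds = DATASETS["reduced-bigann-10M"]()
print(ds.name, ds.dims, ds.dtype)
```

The factory indirection keeps dataset construction lazy, so listing available datasets does not trigger any I/O.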

What is the pipeline that you had in mind that could improve the process?

@DmitryKey
Contributor Author

DmitryKey commented Oct 5, 2021

@maumueller there is no implementation of the dimensionality-reduction step here -- it is done elsewhere, in a separate neural network built by a teammate.

In order to try this new, reduced-dimensionality dataset, I need to treat it as a seventh dataset, if that makes sense, because it has a different dtype than the original (non-reduced) dataset and a different (lower) number of dimensions.

So I was thinking that, in addition to changing datasets.py, I'd need to change the I/O, because my dataset can live somewhere else, such as on local disk or in blob storage.
Another issue I ran into was that I still had to give my dataset a name recognized by the framework -- ideally I would like to control this part as well, but by changing the DATASETS dictionary I don't see how that connects to the I/O, such as the dataset path.

DmitryKey added a commit to DmitryKey/big-ann-benchmarks that referenced this issue Oct 5, 2021