Dynamic sparse embedding layer size for flexible incremental training #8232

Closed · dakshvar22 opened this issue Mar 18, 2021 · 6 comments

Labels: area:rasa-oss/ml, difficulty:medium, research:incremental-training, type:enhancement, type:experiments

dakshvar22 (Contributor) commented Mar 18, 2021

Description of Problem:
Currently, to account for new vocabulary items during incremental training, we allocate a buffer for extra vocabulary inside CountVectorsFeaturizer, which means users have to specify a value for the additional_vocabulary_size option. This is cumbersome to estimate correctly in the very first training run of an assistant. Moreover, once the buffer for extra vocabulary is exhausted, the user has to re-train from scratch, which can be time consuming. Is there a more efficient approach for incremental training?

Overview of the Solution:
One alternative is to account for new vocabulary directly in the architecture of DIET/TED. Both architectures start with a sparse embedding layer that transforms the one-hot vectors of incoming tokens into embedding vectors. The input size of this sparse embedding layer can be computed on the fly during the graph-building stage from the length of the incoming sparse features. In every new incremental training run, if the vocabulary has grown, a new set of weights is initialized in the sparse embedding layer to account for the new vocabulary items.

For example, if the length of the incoming sparse feature vector is 50 in the first run, the input size of the sparse embedding layer will be 50. If in the next fine-tuning run the length has increased by 20, the pre-trained weights corresponding to the existing 50 vocabulary items will be loaded and set appropriately, and an extra set of weights will be initialized for the new 20 dimensions of the sparse feature vector.
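
To make the idea concrete, here is a minimal TensorFlow sketch of a sparse embedding layer whose input size is inferred at graph-building time from the incoming sparse feature dimension. This is illustrative only, not Rasa's actual layer implementation; the class name and the 0.02 standard deviation are assumptions made here.

```python
import tensorflow as tf


class DynamicSparseEmbedding(tf.keras.layers.Layer):
    """Dense layer applied to sparse (one-hot) token features.

    The number of kernel rows equals the current vocabulary size and is
    inferred from the incoming feature dimension when the graph is built.
    """

    def __init__(self, units: int, **kwargs) -> None:
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape: tf.TensorShape) -> None:
        # Last dimension of the incoming sparse features == vocabulary size.
        vocab_size = int(input_shape[-1])
        self.kernel = self.add_weight(
            name="kernel",
            shape=(vocab_size, self.units),
            initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),
            trainable=True,
        )

    def call(self, inputs: tf.SparseTensor) -> tf.Tensor:
        # For one-hot rows this is equivalent to an embedding lookup.
        return tf.sparse.sparse_dense_matmul(inputs, self.kernel)
```

With this setup, a first run with 50 vocabulary items builds a (50, units) kernel, and a later fine-tuning run with 70 items builds a (70, units) kernel without any pre-allocated buffer.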

Things to note:

  1. The vocabulary index of a token must not change across fine-tuning runs. This still has to be handled by CountVectorsFeaturizer, but it is a detail the user does not have to worry about.
  2. Since the size of the sparse features keeps changing across fine-tuning runs, the persisted model data used during load also has to account for that. Subsequently, the new set of weights has to be merged with the existing sparse embedding layer inside train, after the previous weights have already been loaded in load (see the sketch after this list).
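
A hedged sketch of the merge step in point 2, assuming the previous kernel has already been restored in load. The function name and variable names are illustrative, not an existing Rasa API.

```python
import tensorflow as tf


def merge_sparse_embedding_kernel(
    old_kernel: tf.Tensor, new_vocab_size: int
) -> tf.Variable:
    """Reuse pre-trained rows and initialize rows only for new vocabulary items."""
    old_vocab_size, units = old_kernel.shape
    assert new_vocab_size >= old_vocab_size, "vocabulary indices must stay stable"

    # Fresh weights only for the newly added vocabulary items.
    new_rows = tf.keras.initializers.TruncatedNormal(stddev=0.02)(
        shape=(new_vocab_size - old_vocab_size, int(units))
    )
    # Pre-trained rows stay in place; new rows are appended at the end.
    return tf.Variable(tf.concat([old_kernel, new_rows], axis=0))
```

For instance, 50 pre-trained rows plus 20 new vocabulary items would yield a (70, units) kernel, matching the example above.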

Open Questions:

  1. What should the initialization scheme be for the new weights added during the fine-tuning run? It could either be a random normal distribution with the usual mean and standard deviation, or the mean and standard deviation could be computed from the existing weights of that layer (both options are sketched below).
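
A small sketch that makes the two options concrete; the function and its parameters are hypothetical, introduced only for illustration.

```python
import tensorflow as tf


def init_new_rows(
    old_kernel: tf.Tensor, n_new: int, from_existing_stats: bool
) -> tf.Tensor:
    """Initialize weights for `n_new` newly added vocabulary items."""
    units = int(old_kernel.shape[-1])
    if from_existing_stats:
        # Option (b): match the statistics of the already-trained weights.
        mean = tf.reduce_mean(old_kernel)
        std = tf.math.reduce_std(old_kernel)
    else:
        # Option (a): the usual small random-normal initialization.
        mean, std = 0.0, 0.02
    return tf.random.normal(shape=(n_new, units), mean=mean, stddev=std)
```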

Experiments should be run to compare the fine-tuning performance of the proposed approach against the existing one. Theoretically, there shouldn't be much difference in performance.

dakshvar22 added the type:enhancement, difficulty:medium, type:experiments, and area:rasa-oss/ml labels on Mar 18, 2021
Ghostvv (Contributor) commented Mar 18, 2021

> new set of weights will have to be merged with the existing sparse embedding layer after the previous weights are already loaded.

I think it cannot be done in load, since you don't see the new training data there; it should be handled in train.

dakshvar22 (Contributor, Author) commented

Totally correct. I meant doing it in train once the weights are loaded in load. Updating the description above 👍

jupyterjazz (Contributor) commented

I'm working on this issue.
I have written a simple prototype that you can see here, along with some documents about my ideas and an implementation proposal.
At this point, I'm trying to make a draft version work inside Rasa OSS.

tttthomasssss (Contributor) commented

@ka-bu assigned as reviewer.

dakshvar22 (Contributor, Author) commented

@tttthomasssss This is a large issue broken down into 3 smaller issues (and PRs) that you can see linked above. I am already in the process of reviewing them (and close to getting the PRs merged), so I think we can skip adding another reviewer here.

dakshvar22 (Contributor, Author) commented

@jupyterjazz Once this issue is complete, we should verify that the bug in #8496 no longer persists.
