-
I have implemented something like the following example in the docs, where I train by providing context inputs (x) and either true or fake targets (y). The BCE decreases for both the training and holdout sets. However, if I add a Dense layer using the decoder params and calculate the SoftmaxCE, this value increases over time for both the training and holdout sets. How can it be that the network learns to discriminate between real and noise (context, target) pairs, but using the NCEDense params in a Dense layer to feed a softmax gives a loss that gets worse?
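The check that gets worse looks roughly like this (a minimal sketch, assuming the NCE decoder exposes a `(vocab, dim)` weight matrix and a `(vocab,)` bias; the function name is made up):

```python
import mxnet as mx
from mxnet import gluon

def full_softmax_ce(hidden, labels, decoder_weight, decoder_bias):
    """Score the hidden state against *every* class using the weights learned
    by the NCE objective, then measure softmax cross entropy on the true label.
    hidden: (batch, dim), decoder_weight: (vocab, dim), decoder_bias: (vocab,)."""
    logits = mx.nd.dot(hidden, decoder_weight, transpose_b=True) + decoder_bias
    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
    return loss_fn(logits, labels).mean()
```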
-
@zjost thanks for posting the code sample. Are you saying that in this example you wrote your own candidate sampler to produce the negatives?
-
@eric-haibin-lin that's exactly right: I had to implement my own sampler because I couldn't find any pre-made implementations that generated the tuple of three tensors needed for the NCE block.
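The custom sampler is roughly shaped like this (a sketch rather than the actual code; the unigram distribution and the ordering of the returned tensors are assumptions):

```python
import numpy as np
import mxnet as mx

class SimpleUnigramSampler:
    """Draws `num_sampled` negatives per call and returns the three tensors an
    NCE-style block consumes: sampled ids, the expected count of each sampled
    id, and the expected count of the true class (sampling with replacement)."""

    def __init__(self, class_probs, num_sampled):
        self.probs = np.asarray(class_probs, dtype='float64')
        self.probs /= self.probs.sum()
        self.num_sampled = num_sampled

    def __call__(self, true_classes):
        sampled = np.random.choice(len(self.probs), size=self.num_sampled,
                                   replace=True, p=self.probs)
        # expected number of occurrences over `num_sampled` draws is p * k
        exp_count_sampled = self.probs[sampled] * self.num_sampled
        true = (true_classes.asnumpy() if hasattr(true_classes, 'asnumpy')
                else np.asarray(true_classes)).astype('int64')
        exp_count_true = self.probs[true] * self.num_sampled
        return (mx.nd.array(sampled),
                mx.nd.array(exp_count_sampled),
                mx.nd.array(exp_count_true))
```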
-
@zjost Got it. Are you returning the probability of candidates, or the expected count of candidates? The NCEBlock expects the expected count as inputs. Can you try returning the expected count instead? You could also try one of the pre-made candidate samplers rather than the custom one.
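Concretely, under sampling with replacement the two differ only by a scale factor: with `num_sampled` draws, a class with sampling probability `p` has expected count roughly `p * num_sampled`. A tiny sketch (the value 64 is arbitrary):

```python
import numpy as np

num_sampled = 64                          # assumed number of negatives per step
prob = np.array([0.1, 0.01, 0.001])       # per-class sampling probabilities
expected_count = prob * num_sampled       # what the NCE block expects as input
```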
-
@eric-haibin-lin I have re-implemented everything on a new, simpler problem and am getting the same behavior. I have made the correction to return the expected count. As a comparison, I also trained a network with a Dense layer decoder and directly minimized the softmax loss without any negative sampling. On the same dataset, that loss decreases as expected, which at least shows the learning task is possible on this data. Let me share some code.
NCE network:
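Something along these lines (a simplified sketch; the real network uses different sizes and the custom sampler above, so treat the names here as placeholders):

```python
import mxnet as mx
from mxnet import gluon

class NCENet(gluon.Block):
    """Sketch: an embedding encoder plus output embeddings that play the role
    of the NCE decoder weight/bias. Only the true target and the sampled
    negatives are scored on each step."""

    def __init__(self, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.encoder = gluon.nn.Embedding(vocab_size, embed_dim)
        self.decoder_w = gluon.nn.Embedding(vocab_size, embed_dim)  # (vocab, dim)
        self.decoder_b = gluon.nn.Embedding(vocab_size, 1)          # (vocab, 1)

    def forward(self, context, candidates):
        # context: (batch,) ids; candidates: (batch, k) with the true class first
        h = self.encoder(context)                                   # (batch, dim)
        w = self.decoder_w(candidates)                               # (batch, k, dim)
        b = self.decoder_b(candidates).squeeze(axis=2)               # (batch, k)
        logits = mx.nd.batch_dot(w, h.expand_dims(axis=2)).squeeze(axis=2)
        return logits + b                                            # for sigmoid BCE
```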
Training:
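And the loop, roughly (toy data and uniform negatives just to keep the snippet self-contained and runnable; the real run uses the unigram sampler and real (context, target) pairs):

```python
import numpy as np
import mxnet as mx
from mxnet import autograd, gluon

vocab_size, embed_dim, num_sampled, batch_size = 1000, 64, 16, 32

# toy (context, target) pairs purely so the snippet runs end to end
X = mx.nd.array(np.random.randint(0, vocab_size, size=2048))
Y = mx.nd.array(np.random.randint(0, vocab_size, size=2048))
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, Y),
                                   batch_size=batch_size, last_batch='discard')

net = NCENet(vocab_size, embed_dim)
net.initialize(mx.init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})
bce = gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=False)

for epoch in range(5):
    total = 0.0
    for context, target in train_data:
        # uniform negatives here; the real setup draws them from the sampler
        negatives = mx.nd.array(np.random.randint(0, vocab_size,
                                                  size=(batch_size, num_sampled)))
        candidates = mx.nd.concat(target.reshape((-1, 1)), negatives, dim=1)
        labels = mx.nd.concat(mx.nd.ones((batch_size, 1)),
                              mx.nd.zeros((batch_size, num_sampled)), dim=1)
        with autograd.record():
            logits = net(context, candidates)
            loss = bce(logits, labels)
        loss.backward()
        trainer.step(batch_size)
        total += loss.mean().asscalar()
    print('epoch %d, mean BCE %.4f' % (epoch, total / len(train_data)))
```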
Am I using the right parameterization here?
-
I have worked on this more and discovered that if I allow training to keep running, the SoftmaxCE eventually starts to improve. It seems that the loss curve always increases for the first several epochs, but eventually starts decreasing.

I'm still curious to understand why this happens, and whether there are better ways to, e.g., schedule the learning rate to get improved convergence. However, I don't think there's an issue with the code/implementation, so this issue can be closed. I apologize for the false alarm.
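For anyone hitting the same behavior: one easy knob is a decaying learning rate via MXNet's built-in schedulers, e.g. (the step size and decay factor below are arbitrary; `net` is the model from the earlier sketch):

```python
import mxnet as mx
from mxnet import gluon

# halve the learning rate every 10k updates (values are illustrative only)
schedule = mx.lr_scheduler.FactorScheduler(step=10000, factor=0.5)
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 1e-3, 'lr_scheduler': schedule})
```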