Bug in Sections 9.7, 10.4, and 10.7 of D2L-DJL #119

Open
markbookk opened this issue Apr 29, 2021 · 0 comments

Description

There seems to be a bug in the code for these sections that prevents me from replicating the loss results of the D2L Python book using DJL. I have tested various approaches and discussed possible solutions with Zach, Frank, and Lai, and I have tried and implemented all of their suggestions without much result.

Environment

I tested my theories and the suggestions for section 10.4 using this code: https://github.com/markbookk/java-d2l-IDE/tree/main/section10_4. I also set all initializers to ONES, on both the Java and Python sides, to eliminate the randomness of Xavier (or similar) initialization.
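
Concretely, the DJL side does something like the following (a minimal sketch, assuming a DJL version where `Block.setInitializer` takes a `Parameter.Type`; the class name is a placeholder, not the actual section code). The MXNet Python side uses `net.initialize(init.One(), force_reinit=True)` for the same purpose:

```java
import ai.djl.nn.Block;
import ai.djl.nn.Parameter;
import ai.djl.training.initializer.Initializer;

public final class InitToOnes {
    // Set every weight and bias to ones so both frameworks start from
    // identical parameters, removing Xavier/random initialization as a
    // variable in the comparison.
    public static void forceOnes(Block block) {
        block.setInitializer(Initializer.ONES, Parameter.Type.WEIGHT);
        block.setInitializer(Initializer.ONES, Parameter.Type.BIAS);
    }
}
```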

I haven’t completed section 10.7, since these three sections build on the same code base. I originally saw this problem while writing section 9.7; it later reappeared in section 10.4, so it will also occur in section 10.7.

Problem

The problem appears during training: the loss diverges from the expected (Python) result as training continues. I debugged and tried multiple things (detailed below), and the divergence starts right after calling backward. Before backward, the NDArrays of the parameters and the gradients of each parameter are exactly the same as on the Python side. The loss function is also the same (there is a slight difference on the order of 0.0001, which I attribute to floating-point rounding differences between the two runtimes). Although the loss sums match, the prediction result is 0, so that may affect my testing; I tried several ways to change this, but I could not set up an environment where both Python and Java produce comparable predictions.
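
For context, the training step has roughly the following shape in DJL (a sketch, assuming DJL's `Engine`/`GradientCollector` API; the class and variable names are placeholders, not the actual section code). The comments mark where the values still agree with Python and where they stop agreeing:

```java
import ai.djl.engine.Engine;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.nn.Block;
import ai.djl.training.GradientCollector;
import ai.djl.training.ParameterStore;
import ai.djl.training.loss.Loss;

public final class TrainStep {
    // One training step reduced to the essentials; all arguments stand in
    // for whatever the section builds.
    public static float step(Block net, ParameterStore ps, Loss lossFn,
                             NDList inputs, NDList labels) {
        try (GradientCollector collector = Engine.getInstance().newGradientCollector()) {
            NDList pred = net.forward(ps, inputs, true);
            NDArray loss = lossFn.evaluate(labels, pred);
            // Up to this point, parameter sums and the loss sum match the
            // Python run (modulo ~1e-4 floating-point noise).
            collector.backward(loss);
            // After this call, the gradient sums no longer match Python.
            return loss.sum().getFloat();
        }
    }
}
```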

What I did to try and solve it

  • Verified that the Python blocks and the Java blocks were the same
  • Verified that blocks such as Dense/Linear were passing the same parameters to the engines
  • Debugged and inspected values at every step to see where the arrays and gradients diverged
    • As mentioned, the divergence appears right after backward
    • Achieved this by printing the sum of each parameter's NDArray and of its gradient (see the sketch after this list)
  • Verified that the inputs were the same
  • Verified that the loss functions were the same
  • Tried setting a random seed and removing Initializer.ONES
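
The comparison was done with a helper along these lines (a sketch; `ParamDump` and the output format are mine, not from the book code). The Python-side counterpart simply prints `param.data().sum()` and `param.grad().sum()` for each parameter, so the two runs can be diffed line by line:

```java
import ai.djl.ndarray.NDArray;
import ai.djl.nn.Block;
import ai.djl.nn.Parameter;
import ai.djl.util.Pair;

public final class ParamDump {
    // Print one line per parameter: the sum of its values and the sum of
    // its gradient (NaN if no gradient is attached yet).
    public static void dumpSums(Block block) {
        for (Pair<String, Parameter> pair : block.getParameters()) {
            NDArray array = pair.getValue().getArray();
            float valueSum = array.sum().getFloat();
            float gradSum = array.hasGradient()
                    ? array.getGradient().sum().getFloat() : Float.NaN;
            System.out.printf("%s: values=%f grads=%f%n", pair.getKey(), valueSum, gradSum);
        }
    }
}
```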

Why setting a random seed doesn’t work

I did try setting a random seed, and setting the same seed on the Python and Java sides does work in theory: I generated random NDArrays to test this, and they were identical. The problem arises when the sequence of random calls is not the same, or when the calls themselves differ. For example, in my code I rely on DJL's ability to initialize blocks automatically, whereas the Python side calls encoder.initialize() manually and then invokes other methods before calling forward; on the DJL side, initialization happens right before forward is called. This is just one example from my code. I could try to replicate the Python ordering, but that would not fix the underlying problem: MXNet Python and MXNet via DJL do not execute exactly the same code, or the same sequence of random calls.
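
For reference, seeding on the DJL side looks like this (`mx.random.seed(1234)` is the MXNet Python counterpart):

```java
import ai.djl.engine.Engine;

public final class Seeding {
    public static void main(String[] args) {
        // Seed the engine RNG. Both sides now produce the same stream of
        // random numbers, but identical values only follow if both programs
        // draw from that stream in exactly the same order, which automatic
        // (DJL) vs. manual (Python) initialization does not guarantee.
        Engine.getInstance().setRandomSeed(1234);
    }
}
```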

Possible Solutions

  • Setting the random seed immediately before every point where a random value is drawn (e.g., calling random.seed(1234) repeatedly, right before each expected random draw) might work, but I am not sure.
  • Set the same “random” arrays and gradients manually on both the Python and Java sides so the runs can be debugged step by step (see the sketch after this list).
  • Check that all parameters sent to the MXNet engine are the same on both the Python and Java sides
    • I did this, but I may have missed something
  • Debug my code to see whether I simply missed something
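
For the second bullet, one way to get identical “random” starting arrays without matching RNG call order is a deterministic initializer. The sketch below is only an illustration (`DeterministicInit` and the `0.01 * arange` rule are made up for this example, not taken from the book code); the same rule applied on the Python side would yield identical starting parameters:

```java
import ai.djl.nn.Block;
import ai.djl.nn.Parameter;
import ai.djl.training.initializer.Initializer;

public final class DeterministicInit {
    // A deterministic stand-in for "random" initialization: fill each
    // parameter with 0.01 * [0, 1, 2, ...] reshaped to its shape, so the
    // values depend only on the parameter's shape, not on any RNG state.
    public static final Initializer DETERMINISTIC = (manager, shape, dataType) ->
            manager.arange(0f, shape.size(), 1f, dataType).reshape(shape).mul(0.01);

    public static void apply(Block block) {
        block.setInitializer(DETERMINISTIC, Parameter.Type.WEIGHT);
        block.setInitializer(DETERMINISTIC, Parameter.Type.BIAS);
    }
}
```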