I have also profiled a 1080 Ti with the same codebase from this GitHub repository, and have some questions.
The result shows that ‘Issued Warp Per Scheduler’ is only 0.77, which implies poor latency hiding. That looks too low compared to 0.94 on the 1060 and 0.88 on the 1070.
Also, ‘Warp State Statistics’ shows that the bottleneck is ‘Stall Short Scoreboard’, which is related to shared memory operations.
Below are my shared memory profiling results:
Instructions, Requests, %Peak, Bank Conflicts
201326592, 706282140, 76.35, 504955548
Compared with the 1060 and 1070, the instruction count is the same, but there are more requests and more bank conflicts. I guess this might be the reason for the high latency in my experiment.
But I don’t know why the requests and bank conflicts are each about 257274 higher than on the 1060/1070. Could anyone help with that?
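As a quick sanity check on the counters above, here is a small Python sketch that only does arithmetic on the numbers already posted. The variable names are mine, and reading the result as "each extra request is one bank conflict" is an assumption about how the profiler counts, not something the tool output states.

```python
# Counter values copied from the 1080 Ti shared memory table above.
instructions = 201_326_592
requests = 706_282_140
bank_conflicts = 504_955_548

# Difference versus the 1060/1070 runs, as reported in the post.
reported_delta = 257_274

# Requests beyond one per instruction; compare against the conflict counter.
extra_requests = requests - instructions

# Average number of requests each shared memory instruction needed.
requests_per_instruction = requests / instructions

# Size of the reported difference relative to the conflict count.
relative_delta = reported_delta / bank_conflicts

print("extra requests equal bank conflicts:", extra_requests == bank_conflicts)
print("requests per instruction:", round(requests_per_instruction, 3))
print("relative difference in bank conflicts: {:.4%}".format(relative_delta))
```

The requests minus the instructions match the bank conflict counter exactly, so the two "extra" numbers in the table are really one effect. The reported difference of 257274 is a very small fraction of the total.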
The numbers of requests and bank conflicts are data dependent and change for every hash. A random value from a register is used as the load address, so the number of conflicts across a warp is random.
If you run the same block and hash on a 1060, 1070, and 1080 Ti, you should get the same results. Since you ran a different block/hash, you're seeing a 0.05% difference, which is negligible and actually less variation than I would expect.
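To illustrate why the counts depend on the data, here is a rough Python simulation. It is a simplified model, not the actual ProgPoW kernel: it assumes 32 lanes per warp, 32 shared memory banks of 4-byte words, uniformly random load addresses, and that lanes reading the same word are served together. `NUM_WORDS`, `passes_for_warp_load`, and `simulate` are hypothetical names, and the seed only stands in for "a different block/hash".

```python
import random
from collections import defaultdict

WARP_SIZE = 32    # lanes per warp
NUM_BANKS = 32    # assumed number of 4-byte-wide shared memory banks
NUM_WORDS = 4096  # hypothetical table size in 32-bit words


def passes_for_warp_load(word_indices):
    """Passes needed for one warp-wide load under the simplified model.

    Lanes hitting the same word share one access; distinct words that
    map to the same bank are serialized.
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % NUM_BANKS].add(w)
    return max(len(words) for words in words_per_bank.values())


def simulate(seed, num_loads=10000):
    """Return (instructions, requests, bank_conflicts) for one seed."""
    rng = random.Random(seed)
    requests = 0
    for _ in range(num_loads):
        addrs = [rng.randrange(NUM_WORDS) for _ in range(WARP_SIZE)]
        requests += passes_for_warp_load(addrs)
    # Every pass beyond the first one counts as a conflict in this model.
    return num_loads, requests, requests - num_loads


if __name__ == "__main__":
    for seed in (1, 2):  # two seeds stand in for two different hashes
        print(seed, simulate(seed))
```

Running the same seed twice gives identical counters, while a different seed shifts the request and conflict totals slightly with the instruction count unchanged. That is the same pattern as in the profiles being compared.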
There is an NSIGHT profiling result on the web:
https://medium.com/@ifdefelse/understanding-progpow-performance-and-tuning-d72713898db3