This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

PDF operators for the random samplers, and also the Dirichlet #14617

Merged
merged 1 commit into apache:master on Jul 19, 2019

Conversation

david-seiler
Contributor

Description

This PR replaces #14579; when I retargeted that PR from 1.3.x to master, the Jenkins CI build got confused somehow and now refuses to start new test runs (though the Travis build was fine). All the comments from the review of that PR should be addressed in this changeset.

This PR adds operators for computing the densities of samples drawn from any of the various distributions defined in operator/random, as well as their gradients, plus the Dirichlet, even though we don't yet have a sampler for it. There are many changes to test_random.py to test each PDF alongside its distribution; aside from that, the patch should be entirely stand-alone. See pdf_op.cc for more detailed description strings.
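For a sense of the intended API, here is a minimal usage sketch; the operator name random_pdf_normal and its argument names (sample, mu, sigma, is_log) are assumed from the registrations in pdf_op.cc and may differ from what the merged operators actually expose.

import mxnet as mx

# Two rows of parameters; each row of samples is evaluated against its own (mu, sigma).
mu = mx.nd.array([0.0, 2.5])
sigma = mx.nd.array([1.0, 0.5])
samples = mx.nd.random.normal(loc=0, scale=1, shape=(2, 1000))

# Densities of each sample under the corresponding per-row parameters;
# the output has the same shape as `samples`. With is_log=True the operator
# would return log-densities instead.
pdf = mx.nd.random_pdf_normal(sample=samples, mu=mu, sigma=sigma)
log_pdf = mx.nd.random_pdf_normal(sample=samples, mu=mu, sigma=sigma, is_log=True)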

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@piyushghai
Contributor

Thanks for migrating your PR from v1.3.x branch to master.
@mxnet-label-bot Add [pr-awaiting-review, operator]

@david-seiler
Contributor Author

The Python3 TensorRT GPU test failed: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14617/1/pipeline

That failure doesn't look related to the PR: first, make warns that "TensorRT not enabled by default. Please set the MXNET_USE_TENSORRT environment variable to 1 or call mx.contrib.tensorrt.set_use_tensorrt(True) to enable." Then, a bit later, CUDA initialization fails with error 35. I made a trivial change to rerun the tests, and the same thing happened again (along with a numeric differentiation failure that I'll investigate). Is that test known to be flaky?

@david-seiler
Contributor Author

The build checks all pass now. What's the next step to get this merged?

grad_nodes = ['v1', 'v2'] if symbdic['discrete'] else ['v0', 'v1', 'v2']
check_numeric_gradient(test_pdf, [un1, p1, p2], grad_nodes=grad_nodes, atol=backw_atol, rtol=backw_rtol, dtype=dtype)

@with_seed(1000)
Member

any reason why the seed has been fixed here?

Contributor Author

Even in float64, the numeric gradient for the Dirichlet is a little unreliable: sometimes it diverges and generates Infs or NaNs. Without a fixed seed it would still pass almost all of the time, but this is safer.

Contributor Author

Well, I said that, but all the tests were against mxnet 1.3.x. I retested against 1.5 and couldn't reproduce any of the failures, so I've removed the explicit seed.

Member

the explicit seed is still there

Contributor Author

Oops, so it was. But now it's not.

Contributor

@david-seiler It's actually still there

@roywei
Member

roywei commented Apr 29, 2019

@apeforest @eric-haibin-lin could you help review? thanks!

@asmushetzel
Contributor

It would be really good if we can get this in. There are multiple teams I know of that will benefit from it. It also addresses most of the requested features from #12932.


alpha = alpha.reshape((2, 2, num_classes))

for dtype in [np.float32, np.float64]:
Member

any reason why fp16 is not being tested?

Contributor Author

One can have fp16 but the tolerances have to be really loose, on the order of 5e-1, to get scipy and the symbolic forward to agree reliably. I've put it in, but it's something real users should be a little careful with.

Contributor Author

Update: this worked for me locally, but it didn't consistently pass in the Jenkins builds, so I've taken it back out. It's not really a sensible thing to do anyway: the Dirichlet involves a sum of lgammas, and that's just not going to be very stable in fp16 no matter what you do. If users want something like that, they should probably use higher precision internally and then downcast at the end.
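A minimal sketch of that "higher precision internally, downcast at the end" idea, written in plain numpy/scipy rather than MXNet; the helper name is illustrative, not code from this PR.

import numpy as np
from scipy.special import gammaln

def dirichlet_logpdf_as_fp16(x, alpha):
    """Dirichlet log-density evaluated in float64, returned as float16."""
    x64 = np.asarray(x, dtype=np.float64)
    a64 = np.asarray(alpha, dtype=np.float64)
    logpdf = (gammaln(a64.sum()) - gammaln(a64).sum()
              + np.sum((a64 - 1.0) * np.log(x64)))
    return logpdf.astype(np.float16)  # downcast only the final result

print(dirichlet_logpdf_as_fp16([0.2, 0.3, 0.5], [2.5, 1.0, 3.0]))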

Member

Okay.

But if you did want to test those, you could set different tolerance levels for different dtypes with something like rtol = 5e-1 if dtype == np.float16 else 1e-4 (the values here are just placeholders).

Contributor Author

Pardon me, missed this earlier. You can do that, but the thinking here is that you shouldn't: maybe there's some user somewhere who really knows what they're doing and truly wants a float16 Dirichlet, but in the common case it's a bad idea that should be avoided.

@anirudhacharya
Member

Apologies for the delay, and thanks for this contribution. It is very useful.

I have put in a few comments, but the PR looks good to me for the most part. I have tagged some of the committers to review/merge this PR.

@eric-haibin-lin @szha @reminisce @haojin2 can you please take a look?

Contributor

@aaronmarkham left a comment

Seems pretty sparse on the documentation front.

Contributor

@haojin2 left a comment

Some cosmetic issues first, taking a look at the backend code now

backw_atol = 1e-5 if dtype == np.float64 else 1e-3
backw_rtol = 1e-4 if dtype == np.float64 else 5e-2
for use_log in [False, True]:
print("use_log",use_log)
Contributor

Nit: Remove this print

Contributor Author

It's gone.

check_with_device(mx.context.current_context(), 'float64')
check_with_device(mx.context.current_context(), np.float16)
check_with_device(mx.context.current_context(), np.float32)
check_with_device(mx.context.current_context(), np.float64)
Contributor

Nit:

for dtype in [np.float16, np.float32, np.float64]:
    check_with_device(mx.context.current_context(), dtype)

Contributor Author

Done

res = results if use_log else np.exp(results)
check_symbolic_forward(test_pdf, [samples, alpha], [res], atol=forw_atol, rtol=forw_rtol, dtype=dtype)
if dtype == 'float64':
check_numeric_gradient(test_pdf, [samples, alpha], numeric_eps=1e-7, atol=backw_atol, rtol=backw_rtol, dtype=dtype)
Contributor

I saw that the numeric gradient is not checked for fp32; what is the reason behind that? I think we should have coverage for the most commonly used data type. Can you also add a symbolic backward check using check_symbolic_backward?

Contributor Author

Unfortunately, we don't have an independent source of truth for the gradients the way we do for the PDF itself. We check our symbolic forward against the densities given by scipy, but scipy doesn't have functions for analytic gradients of the PDFs, just the same kinds of tools for numeric differentiation that we've got, so there's nothing for check_symbolic_backward to check against.

Similarly, we've found that the numeric gradient is most reliable in float64. The closed-form gradient functions we've written are (assuming they're written correctly at all) more accurate than a numeric approximation can be. Checking the gradient numerically in float64 provides a lot of evidence about whether the gradient functions are written correctly, but checking float32 doesn't add much more: discrepancies there are more likely to be caused by numeric errors in the approximation than by errors in our code.
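For concreteness, this is the kind of float64 central-difference check being described, using scipy's Dirichlet log-density as the function under test; the helper below is illustrative and not the actual test_random.py code.

import numpy as np
from scipy.stats import dirichlet

def numeric_grad(f, x, eps=1e-7):
    """Central-difference gradient of a scalar function f at x, in float64."""
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

alpha = np.array([2.5, 1.0, 3.0])
sample = np.array([0.2, 0.3, 0.5])  # a point on the simplex

# Numeric estimate of d/d_alpha of log p(sample | alpha); an analytic gradient
# implementation would be compared against this with fairly loose tolerances.
print(numeric_grad(lambda a: dirichlet.logpdf(sample, a), alpha))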

const IType2 *cur_alpha = alpha+index*k;
const DType scaling(grad_out[i]*(logpdf ? DType(1) : out[i]));
DType sum_alpha(0);
for ( int j = 0; j < k; ++j ) {
Contributor

Nit: Don't need the extra spaces within the parentheses

for (int j = 0; j < k; ++j) {

Same for several other places.

Contributor Author

Fixed

const std::vector<TBlob>& outputs) {
using namespace mshadow;
CHECK_EQ(inputs.size(), pnum+3);
CHECK_EQ(outputs.size(), pnum+1);
Contributor

Nit: space around +

  CHECK_EQ(inputs.size(), pnum + 3);
  CHECK_EQ(outputs.size(), pnum + 1);

Same applies for some other operators.

Contributor Author

Lots more operator whitespace now. Might be useful to have this in the linter.

@david-seiler david-seiler force-pushed the master branch 6 times, most recently from 2d1d971 to fb4decb on May 7, 2019 16:49
@david-seiler
Contributor Author

Commenting to reopen. I was trying to debug a mysterious linker failure in unix-gpu, which does not occur in mainline but which nevertheless seems to exist independently of any of these changes.

@david-seiler david-seiler reopened this Jun 3, 2019
@david-seiler david-seiler force-pushed the master branch 9 times, most recently from d826ca0 to fd73f6c on June 4, 2019 13:33
@@ -24,7 +24,7 @@ $env:MXNET_HOME=[io.path]::combine($PSScriptRoot, 'mxnet_home')

C:\Python37\Scripts\pip install -r tests\requirements.txt
C:\Python37\python.exe -m nose -v --with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error --with-xunit --xunit-file nosetests_unittest.xml tests\python\unittest
if (! $?) { Throw ("Error running unittest") }
if (! $?) { Throw ("Error running unittest) }
Contributor

Are you sure you want to remove the "?

Contributor Author

Whoops, I was factoring some error-handling code out to PR-15147 and got a little too aggressive. Good catch, fixed now.

@vandanavk
Contributor

@samskalicky @apeforest for review

…r (plus also the PDF of the Dirichlet). Supports probabilities and log-probabilities, as well as gradients.
@sxjscience
Member

LGTM

@sxjscience
Member

This also has not checked the grad_req=kAddTo case. Let's revise it later @haojin2
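One way such a check could look, as a hedged sketch using the Symbol bind API (random_pdf_normal and its argument names are assumptions, as above): run backward once with grad_req='write' and once with grad_req='add' into pre-filled gradient buffers, and verify that the 'add' result equals the pre-filled values plus the written gradients.

import mxnet as mx
import numpy as np

sym = mx.sym.random_pdf_normal(sample=mx.sym.Variable('sample'),
                               mu=mx.sym.Variable('mu'),
                               sigma=mx.sym.Variable('sigma'))
args = {'sample': mx.nd.random.normal(shape=(2, 10)),
        'mu': mx.nd.array([0.0, 1.0]),
        'sigma': mx.nd.array([1.0, 2.0])}

def run_backward(grad_req, seed_grads):
    grads = {k: v.copy() for k, v in seed_grads.items()}
    exe = sym.bind(mx.cpu(), args=args, args_grad=grads, grad_req=grad_req)
    exe.forward(is_train=True)
    exe.backward(mx.nd.ones_like(exe.outputs[0]))
    return grads

zeros = {k: mx.nd.zeros_like(v) for k, v in args.items()}
ones = {k: mx.nd.ones_like(v) for k, v in args.items()}
written = run_backward('write', zeros)  # plain gradients
added = run_backward('add', ones)       # should be 1 + gradient everywhere
for name in args:
    np.testing.assert_allclose(added[name].asnumpy(),
                               1.0 + written[name].asnumpy(), rtol=1e-5)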

@sxjscience sxjscience merged commit b887c06 into apache:master Jul 19, 2019
@ChaiBapchya
Contributor

ChaiBapchya commented Jul 25, 2019

Curious to know whether the pdf operators aren't supposed to have an NDArray API? Currently they're only in the Symbol API, right? @sxjscience @david-seiler
Realized they do have an NDArray API; it's just not mentioned in the docs.
Thanks.

@ChaiBapchya ChaiBapchya mentioned this pull request Jul 25, 2019
anirudhacharya pushed a commit to anirudhacharya/mxnet that referenced this pull request Aug 20, 2019
…r (plus also the PDF of the Dirichlet). Supports probabilities and log-probabilities, as well as gradients. (apache#14617)
Labels
Operator, pr-awaiting-review (PR is waiting for code review)