PHPLIB-461: Fix functional tests for change streams on sharded clusters#659
PHPLIB-461: Fix functional tests for change streams on sharded clusters#659alcaeus merged 2 commits intomongodb:masterfrom
Conversation
d4fa1a0 to
72495a3
Compare
|
Review feedback addressed in separate commit. I'll squash once the build is stable and after updating to a newer driver release that fixes PHPC-1419. |
dc74e7a to
dbd1f0d
Compare
| } | ||
|
|
||
| // Temporarily skip tests because of an issue with change streams in the driver | ||
| $this->markTestSkipped('$changeStreams currently don\'t on replica sets'); |
There was a problem hiding this comment.
The PR itself LGTM, but I realized that removing this skip re-introduced the test failures.
I assume those were not happening in previous Travis builds since this skip was first introduced in 2008bc7 as part of #655?
We can discuss offline about whether to merge as-is for now and defer sorting out the double-free for replica set and sharded replica set tests in PHPLIB-432.
There was a problem hiding this comment.
Yes and no. Not sure what made the errors appear consistently, but the skip only affected sharded clusters, not the replicaset build. With this branch, the replicaset build always fails while the build on a sharded cluster at least passes sometimes.
Since this entire PR deals with fixing tests on sharded clusters, I don't see a reason why to merge this without running the tests on a sharded cluster. I'd first like to investigate the cause for the double free some more before just plain skipping those tests that cause the failures.
|
@alcaeus: Regarding the unexpected passing builds despite the lingering double-free issue (PHPLIB-432), I noticed that the job for this most recent, passing build ran on PHP 7.3.9 (built August 30). IIRC, previous failed builds were using 7.3.2 (built February 11). I wasn't able to find the Travis build history for the branch you were using to research this issue, but you may want to check that to see if your recently passing builds also display a new version. If so, it's possible this was fixed upstream; however, I don't see anything in https://github.com/travis-ci/travis-ci/issues or https://github.com/travis-ci/php-src-builder to suggest it was an intentional fix. EDIT: #638 is also passing after a fresh rebase on |
|
I'm still observing the failure on PHP 7.3.9, see the most recent build log for example: https://travis-ci.com/alcaeus/mongo-php-library/jobs/231388623#L239. This branch was rebased to pull in latest changes from this PR, which interestingly enough no longer shows this behaviour, neither in the PR build on travis-ci.org nor in my branch build on travis-ci.com. I was afraid that the behaviour might only exhibit itself on travis-ci.com (since that uses different infrastructure IIRC), but that doesn't seem to be the case either. Restarting the build a couple of times did not bring those failures back, so I'm even more at a loss than before. |
8384953 to
944e729
Compare
https://jira.mongodb.org/browse/PHPLIB-461
Due to the way
getMoreworks on sharded clusters, we can't rely on the firstgetMorecall after an insert to return the data we expect. I'll quote the response from the query team to my questions about whygetMorebehaves this way.With an
maxAwaitTimeMSof 500 ms we stay below the minimumperiodicNoopIntervalSecsof 1 second, which can cause ourgetMorecommands on shards without data to time out before a no-op is read from the oplog. One way to solve this would be to increase themaxAwaitTimeMSto a value above 1000 ms to ensure thatgetMorecalls don't time out before another no-op is written, but this causes tests to be significantly slower. Furthermore, we should never make any assumptions as to what batch a change stream event is returned in, only that it will eventually be returned in the correct order. To ensure the tests don't unnecessarily fail on sharded clusters, they will now execute multiplegetMorecalls if necessary, only failing if a given number (currently 5) ofgetMorecalls fails to return any data. This was also necessary since resume operations would still yield empty batches on the initial call toaggregate.This PR also contains a workaround for PHPC-1419, where a
NonResumableChangeStreamErrorerror label is not applied to theServerExceptioncreate for aChangeStreamFatalError. This causes the iterator to repeatedly retry resuming the change stream for an unresumable error. This workaround can be removed once the above issue was fixed.