Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shaka 3 rebuffers indefinitely on VDMS dash live stream after 3-5 minutes (works fine in shaka 2.x) #3169

Closed
TheJohnBowers opened this issue Feb 18, 2021 · 19 comments
Assignees
Labels
priority: P0 Broken for everyone; no workaround; urgent status: archived Archived and locked; will not be updated type: bug Something isn't working correctly
Milestone

Comments

@TheJohnBowers
Copy link
Contributor

Have you read the FAQ and checked for duplicate open issues?
Yes

What version of Shaka Player are you using?
3.0.8

Can you reproduce the issue with our latest release version?
Yes

Can you reproduce the issue with the latest code from master?
Yes - But I cannot reproduce it in shaka 2.4/2.5

Are you using the demo app or your own custom app?
Demo App

If custom app, can you reproduce the issue using our demo app?
NA

What browser and OS are you using?
Chrome on macos

For embedded devices (smart TVs, etc.), what model and firmware version are you using?
NA

What are the manifest and license server URIs?

Manifest:
https://content-ausw7.uplynk.com/channel/b965d55496414804891dde9c96928ffb.mpd?rays=abcd&delay=30
License:
https://content.uplynk.com/wv

What did you do?

Just play back any Verizon Media Dash Live stream for for 5 minutes or so and you will get infinite rebuffering

What did you expect to happen?
Playback without infinite rebuffering

What actually happened?

I recorded a little demo video that guides you through what is happening:

https://content.uplynk.com/player5/PjpqgfaCDjuJxkQHgrBvXsa.html

Basically after ~ 3 minutes shaka starts redownloading all slices after a certain slice after ever manifest update it gets to a point where it is downloading more than a dozen slices after each manifest refresh and then it starts infinitely buffering.

@TheModMaker
Copy link
Contributor

Probably a duplicate of #3139, but different enough that I won't mark this as a duplicate.

@TheModMaker TheModMaker added type: bug Something isn't working correctly and removed needs triage labels Feb 23, 2021
@TheModMaker TheModMaker added this to the v3.1 milestone Feb 23, 2021
@avelad
Copy link
Member

avelad commented Mar 25, 2021

@ismena, I have seen that you have assigned this issue, is there any progress with this problem, do you think it will be fixed for the next release?

@ismena
Copy link
Contributor

ismena commented Mar 25, 2021

Huh, apparently I am indeed assigned to this issue o.O
I haven't had a chance to look yet, but okay, I'll bite the bullet and shoot to fix it for 3.1

@ismena ismena assigned theodab and unassigned ismena Apr 5, 2021
@ismena
Copy link
Contributor

ismena commented Apr 5, 2021

Alright, my fix for #3139 doesn't fix this :(
@theodab is stepping in to help me out and make sure this is solved for 3.1

@avelad
Copy link
Member

avelad commented Apr 12, 2021

@theodab @ismena do you have any update?

@avelad
Copy link
Member

avelad commented Apr 15, 2021

@ismena @theodab Do you have an update on this issue? I am very interested.

@joeyparrish joeyparrish added the priority: P0 Broken for everyone; no workaround; urgent label Apr 19, 2021
@joeyparrish joeyparrish assigned joeyparrish and unassigned theodab Apr 19, 2021
@joeyparrish
Copy link
Member

The team was pretty badly divided among other non-open-source projects at Google for a while, but we're mostly back together and focused on Shaka Player again. This morning we spent some time shuffling issues and priorities around, with the goal to get v3.1 done by the end of the month. This morning we decided that this is one of three top-priority issues, and this one has been assigned to me.

I'm beginning work on diagnosis now.

@joeyparrish
Copy link
Member

With v3.0.10, I can reproduce the issue as you describe in your video.

With the latest code from the master branch, I get different behavior. Instead of re-downloading segments, at some point, it stops fetching segments entirely. It doesn't show a buffering spinner.

In that branch, in this state, seeking to the live edge will cause it to fetch a few segments, but then it stops again. The frame on the screen never seems to update.

I see these interesting Chrome media logs on master:

timestamp property value
00:00:50.949 info "DecryptingDemuxerStream: no key for key ID 36A883B024EB4CC9BAF50CC5E30B571E; will resume decrypting after new usable key is available"
00:00:51.380 info "DecryptingDemuxerStream: no key for key ID E9F4E449B45149DA9FE1739A1FDA3F8D; will resume decrypting after new usable key is available"
00:00:51.460 kVideoPlaybackRoughness 18.369
00:00:51.460 kVideoPlaybackFreezing 0.03351
00:00:51.460 kFramerate 30
00:00:51.460 pipeline_buffering_state {"for_suspended_start":false,"reason":"DEMUXER_UNDERFLOW","state":"BUFFERING_HAVE_NOTHING"}

It looks to me like with the master branch, we fetch ahead to meet the buffering goal, but we can't play that buffer out because of a missing key.

So it appears that the re-downloading behavior is fixed in master, revealing another issue which is related to key rotation.

@joeyparrish
Copy link
Member

Normally, the way external key rotation should work is that we see new PSSH boxes in the content at some point, and we use those to trigger a new key session and fetch an updated license. These new PSSH boxes make their way to us through the encrypted event from the video element. This event fires any time the platform encounters a PSSH box in MP4 content.

However, I see your MPD has a PSSH embedded in it. When we see a PSSH embedded in the manifest, we don't listen for encrypted events. Instead, the PSSH in the manifest overrides anything that we might find in the content.

For external key rotation, you don't want this. So you can use the manifest.dash.ignoreDrmInfo config to ignore what's in the manifest and only use the encrypted event.

When I tried this, however, I found that there aren't any encrypted events at all from your content. I've seen this before, in content where the PSSH in the manifest is the only PSSH. To me, this wouldn't make sense in a stream with key rotation, though.

@joeyparrish joeyparrish added the status: bad content Caused by invalid, broken, or unsupported content label Apr 19, 2021
@TheJohnBowers
Copy link
Contributor Author

TheJohnBowers commented Apr 19, 2021

So this sort of key rotation works with shaka 2.4/2.5 No issues. I definitely disagree that this is bad content. If the PSSH can be exclusively included in a manifest and not in the encoded content, I don't see where there is justification to say that key rotation should not work in this scenario.

@joeyparrish
Copy link
Member

Sorry, maybe it was rash to call it bad content. What I should say instead is "I don't understand how this content works". :-)

@joeyparrish joeyparrish removed the status: bad content Caused by invalid, broken, or unsupported content label Apr 19, 2021
@joeyparrish
Copy link
Member

I see where I made a mistake looking at the MPD: there is a PSSH in the manifest per period. So the rotation should be handled by the manifest updates, and PSSH boxes are not needed in the content at all.

I'll double-check that we're reacting to those properly and fetching licenses for them.

@joeyparrish
Copy link
Member

The key rotation issue seems to be caused by acfa1a8, which is not in any release version.

In that change, a call to filter the manifest on update was removed from the DASH parser. This filtering had a non-obvious side-effect of updating the init data in the DRM engine and triggering new sessions for key rotation.

With that call restored (though that may or may not be the best solution), I'm able to play for quite a long time so far. I think that fixed it in the master branch.

One more issue I see is a failed assertion several minutes in: "segment_index.js:656 Assertion failed: Cannot have evicted segments in non-first Periods". I'm seeing this in both v3.0.10 and master, and it doesn't coincide with the playback failure in v3.0.10, so it could be unrelated.

So, here's what's left to sort out:

  • What change in master fixes the re-downloading issue? It should be cherry-picked to v3.0.x.
  • What is the best fix for the broken key rotation in master? This needs to be fixed ASAP.
  • What is causing the failed assertion? Is the assertion wrong, or have we messed up the state of MetaSegmentIndex somewhere? At what change did this assertion start failing?

@TheJohnBowers
Copy link
Contributor Author

If you would like a similar test stream that does not rotate keys every 3 minutes to compare and contrast:

https://content.uplynk.com/channel/410beb25b7db4e4fb2e8c42724c07f67.mpd
https://content.uplynk.com/wv

@joeyparrish
Copy link
Member

  • What change in master fixes the re-downloading issue? It should be cherry-picked to v3.0.x.

I performed a git bisect to track down the fix in master that is still needed in v3.0.x. At each step, I had to manually add back the filter() call I mentioned that was removed in acfa1a8. This process revealed that the redownloading issue was fixed in 7e50866. We will cherry-pick this to v3.0.x, so this should fix the issue in v3.0.11.

@joeyparrish
Copy link
Member

  • What is causing the failed assertion? Is the assertion wrong, or have we messed up the state of MetaSegmentIndex somewhere? At what change did this assertion start failing?

git bisect reveals that this assertion has been failing on your content since 78357ed, which is also when the assertion was first introduced. I'll dig deeper to find out how we can fix it.

@joeyparrish
Copy link
Member

I found the root cause of the issue with the failed assertion, and the result seems to be harmless.

The assertion is meant to double-check our assumptions, namely that we evict content from earlier to later. If we've evicted anything from a period, it means that everything in earlier periods should have been evicted, too.

Except that in practice, the eviction process is driven by updates and merging. When a period vanishes from the manifest in an update (as it should), nothing in the parser triggers the final evict() call on that period's SegmentIndex. The result is that we hold onto one or a very small number of references per index, but playback continues anyway because they will never be used.

@joeyparrish
Copy link
Member

Here's the final summary:

I have a fix for the failed assertion in review. Again, the assertion failure is harmless to playback.

I also have a fix in review for the key rotation failure in the master branch. This will ensure we don't lose key rotation when v3.1 comes out. This issue does not affect v3.0.

The fix for the redownloading issue is already in the master, and I'm working on a round of cherry-picks from master to v3.0.x which will appear in v3.0.11.

Thanks!

shaka-bot pushed a commit that referenced this issue Apr 21, 2021
In #3169, we discovered failing assertions for some multi-Period DASH
live streams.  The failing assertion was meant to ensure that our
eviction logic was correct, and that our assumptions are maintained
properly.

Though the assertion failed, nothing was actually wrong with playback.
But the assertion itself should only fail if there is actually
something wrong.

With the choice between removing the assertion (since playback is
fine) or "fixing" the assertion (to avoid this false negative), I
chose to fix the assertion to retain the value of it to catch actual
bugs in our logic in the future.

The assertion specifies that you should not see evicted segments after
non-evicted segments.  But eviction was only done when merging a
segment index with an old version of the same index (same
Representation and Period).  When an update drops a Period from the
manifest, this means everything in that Period should be evicted.  But
the eviction call wouldn't happen, because we had no new version of
the segment index to update the old one.

An extra call to evict() on each manifest update can fix the state
of the system so that our assertions pass.

The assertion message appeared identically in two different
assertions, so this change also rewrites the assertion messages to
differentiate them and to clarify their meanings.

Issue #3169

Change-Id: Ifeb7a1a8f48af704b028a22d987349209d7ee485
@joeyparrish
Copy link
Member

The fix for the master branch key rotation issue has been merged. Everything should be working in the nightly build after tonight.

The fix for the redownloading issue has been cherry-picked, but not yet pushed to v3.0.x until I can verify that all (29) cherry-picks for this branch so far are correct and passing tests. I have a few test failures to investigate before I can push those changes to that branch, but the fix will be out in v3.0.11 shortly afterward.

joeyparrish pushed a commit that referenced this issue Apr 22, 2021
In live streams, we can evict segments outside the availability window
faster than they disappear from the manifest. If that happens, we used
to evict them several times (add them back in and then evict again).
This caused the eviction counter to increase beyond what it should be
and we had trouble finding segments afterwards.

Closes #3139
Closes #3169

Change-Id: Iafebfaf8e1e9ebb09a64cdf7e09a882115fd8eb6
joeyparrish added a commit that referenced this issue Apr 22, 2021
In #3169, we discovered failing assertions for some multi-Period DASH
live streams.  The failing assertion was meant to ensure that our
eviction logic was correct, and that our assumptions are maintained
properly.

Though the assertion failed, nothing was actually wrong with playback.
But the assertion itself should only fail if there is actually
something wrong.

With the choice between removing the assertion (since playback is
fine) or "fixing" the assertion (to avoid this false negative), I
chose to fix the assertion to retain the value of it to catch actual
bugs in our logic in the future.

The assertion specifies that you should not see evicted segments after
non-evicted segments.  But eviction was only done when merging a
segment index with an old version of the same index (same
Representation and Period).  When an update drops a Period from the
manifest, this means everything in that Period should be evicted.  But
the eviction call wouldn't happen, because we had no new version of
the segment index to update the old one.

An extra call to evict() on each manifest update can fix the state
of the system so that our assertions pass.

The assertion message appeared identically in two different
assertions, so this change also rewrites the assertion messages to
differentiate them and to clarify their meanings.

Issue #3169

Change-Id: Ifeb7a1a8f48af704b028a22d987349209d7ee485
@shaka-bot shaka-bot added the status: archived Archived and locked; will not be updated label Jun 20, 2021
@shaka-project shaka-project locked and limited conversation to collaborators Jun 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
priority: P0 Broken for everyone; no workaround; urgent status: archived Archived and locked; will not be updated type: bug Something isn't working correctly
Projects
None yet
Development

No branches or pull requests

7 participants