-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compare invocation namespaces when handling a cycle and recovering a queue #5432
Conversation
Huge! This was the only remaining bug I had identified from my use. |
Is there anything that can be unit tested here? |
I tried to add a unit test but it seems it's not easy to reproduce the situation based on a unit test. |
Codecov Report
@@ Coverage Diff @@
## master #5432 +/- ##
==========================================
- Coverage 76.81% 76.54% -0.28%
==========================================
Files 241 241
Lines 14630 14632 +2
Branches 616 616
==========================================
- Hits 11238 11200 -38
- Misses 3392 3432 +40
... and 10 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
LGTM then. This is a pretty critical bug since an activation can be delivered to a wrong container so should get this merged as quickly as possible. |
It seems there is a problem in multi-runtime tests. |
I'm not sure why the multiruntime tests are failing now |
As I shared via the dev list, this change is urgent and not related to anything about multi-runtime tests, I would merge this PR without passing the tests in 24 hours unless there is any objection. |
Change lgtm. |
…queue (apache#5432) * Compare invocation namespaces when handling a cycle and recovering a queue * Temporarily enable upterm session for debugging * Revert the upterm change (cherry picked from commit 9371d62)
Description
This is to fix a bug in that ETCD data for non-existent containers persist forever.
This bug happens when a shared action is invoked.
When the queue manager receives an activation message, it finds a queue based on the
docId
only and this is wrong.As a result, the activation message could be sent to the wrong queue as long as the
docId
is same.But the
docId
is same for all shared actions.ex)
/whisk.system/sharedPackage/hello
/style95/myPackage/hello
(docId:/whisk.system/sharedPackage/hello
)/bdoyle/yourPackage/hello
(docId:/whisk.system/sharedPackage/hello
)So an activation for
/style95/myPackage/hello
could be sent to/bdoyle/yourPackage/hello
.Then a memory queue will send the activation to a container for
/style95/myPackage/hello
.The container is initialized with
/style95/myPackage/hello
and it registers the ETCD data for the running container.But after executing the activation for
/bdoyle/yourPackage/hello
, it tries to remove the ETCD data for the running container using the key prefix with/bdoyle/yourPackage/hello
because the key now resides in the container data(WarmData
).Accordingly, the original ETCD data for the running container is not deleted forever.
Please refer to the following logs that I found.
(The shared action is
whisk.system/sharedPackage/hello
and I replaced the name of two namespace tostyle95
andbdoyle
from the original logs just to hide proprietary information.)As you can see above the activation(
98481e836089419d881e836089d19d00
) is originally forstyle95
but it is finally sent to the queue forbdoyle
.And the queue sent this activation to a container with an ID(
f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff
).This container was originally created for
bdoyle
.But when the same container(
f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff
) is being paused, it tries to unwatch endpoint based on thestyle95
key because it executed an activation forstyle95
namespace.Consequently, the data for
bdoyle
(whisk/namespace/bdoyle/whisk.system/sharedPackage/hello/38-5e7fc51a452c685030fdfcec61e3bdf1/invoker0/container/f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff
) is not removed forever unless the invoker is restarted.Related issue and scope
My changes affect the following components
Types of changes
Checklist: