fix flaky penetrationProtectTestWithComputeIfAbsent #872

Merged
merged 7 commits into from
Mar 27, 2024

Conversation

Roiocam
Contributor

@Roiocam Roiocam commented Mar 27, 2024

Resolve #870

Some threads request the key keyPrefix_0, but the condition in the loader does not cover that key, so a race can increment the counter unexpectedly.

@areyouok
Collaborator

I don't follow this. Where does the race occur?

@Roiocam
Contributor Author

Roiocam commented Mar 27, 2024

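I don't follow this. Where does the race occur?

I haven't read the penetration-protection code closely. From my reading of the unit test, each key should be protected by its own lock.

There are 20 threads, and three keys get accessed: 0, 1 and 2. In the first round every load succeeds. In the second round of the loop, if the thread for key_0 gets its time slice first, loadSuccess is incremented by 1 and becomes 4.

This PR avoids that case by making the keys fall on 0 and 1, both of which are covered by the if. There is one point I haven't worked out, though: with the new code, if key_1 gets the time slice, loadSuccess=4 may still occur and the assertion would fail.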

@areyouok
Collaborator

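Penetration protection exists to prevent multiple threads from loading the same key at the same time, which would add load (on the database).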

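If thread 1 is loading key0 and thread 2 comes in, thread 2 has to wait for thread 1's result and does not run the load itself. If thread 1 has already finished loading and thread 3 comes in, the cache hits and no load runs. So key0 can still be loaded only once.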
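The behaviour described above can be illustrated with a minimal sketch. This is not jetcache's implementation; `SingleFlightSketch`, `get` and `countLoads` are hypothetical names, and the sketch assumes one `CompletableFuture` per key is an acceptable way to model "wait for the in-flight load, or hit the cache".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration of per-key load deduplication (not jetcache code).
final class SingleFlightSketch {
    // One future per key: the first caller completes it, later callers join it.
    private final ConcurrentHashMap<String, CompletableFuture<String>> entries =
            new ConcurrentHashMap<>();

    String get(String key, Function<String, String> loader) {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> existing = entries.putIfAbsent(key, mine);
        if (existing != null) {
            // Either a load is in flight (wait for it) or the value is cached (returns at once).
            return existing.join();
        }
        try {
            mine.complete(loader.apply(key));
        } catch (RuntimeException e) {
            entries.remove(key, mine);      // let a later caller retry
            mine.completeExceptionally(e);  // wake up current waiters
            throw e;
        }
        return mine.join();
    }

    // Starts `threads` threads that all request the same key and returns how often the loader ran.
    static int countLoads(int threads) {
        SingleFlightSketch sketch = new SingleFlightSketch();
        AtomicInteger loads = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                sketch.get("key0", k -> {
                    loads.incrementAndGet();
                    return "value-of-" + k;
                });
            });
            workers[i].start();
        }
        start.countDown(); // release all threads together
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loads = " + countLoads(20));
    }
}
```

Because `putIfAbsent` returns `null` for exactly one caller per key, the loader runs once no matter how the 20 threads are scheduled. This sketch has no expiry and no wait timeout, both of which matter for the real test.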

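If the unit test fails, there must be another cause.

Many unit tests rely on sleep to enforce a particular multithreaded execution order, which is fragile. With a long sleep the tests run slowly; with a short one they tend to fail in slow environments. It now looks more reliable to pin down a strict execution order with tools such as CountDownLatch, at the cost of somewhat more complex test code.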
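As a rough sketch of that idea (the class `LatchOrderingSketch` and its event strings are made up for illustration and are not taken from the jetcache tests), two latches can force "second thread arrives while the first is still inside the loader" without any sleep:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration: enforce a strict interleaving with latches instead of sleep.
final class LatchOrderingSketch {

    static List<String> run() {
        List<String> events = new CopyOnWriteArrayList<>();
        CountDownLatch loaderEntered = new CountDownLatch(1); // first thread is inside its "loader"
        CountDownLatch secondArrived = new CountDownLatch(1); // second thread has shown up

        Thread first = new Thread(() -> {
            events.add("first: loader started");
            loaderEntered.countDown();
            await(secondArrived); // stay "in the loader" until the second thread has arrived
            events.add("first: loader finished");
        });
        Thread second = new Thread(() -> {
            await(loaderEntered); // do not proceed before the load is in flight
            events.add("second: arrived during load");
            secondArrived.countDown();
        });

        // Start order does not matter; the latches decide the interleaving.
        second.start();
        first.start();
        join(first);
        join(second);
        return events;
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The three events always appear in the same order, on fast and slow machines alike, which is the property a sleep-based test cannot guarantee.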

@Roiocam
Contributor Author

Roiocam commented Mar 27, 2024

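Thanks for the explanation. I can reproduce it locally. Does penetration protection look a bit off? Threads can be made to run the loader concurrently: remove the sleep, then use debugger breakpoints to stall some of the threads.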

i am loaded:penetrationProtect_0
i am loaded:penetrationProtect_1
i am loaded:penetrationProtect_2
15:59:40.277 [Thread-32] INFO  com.alicp.jetcache.AbstractCache - loader wait timeout:PT0.001S
15:59:40.417 [Thread-34] WARN  com.alicp.jetcache.AbstractCache - loader wait interrupted
i am loaded:penetrationProtect_0
i am loaded:penetrationProtect_1
i am loaded:penetrationProtect_2
i am loaded:penetrationProtect_0
i am loaded:penetrationProtect_0

java.lang.AssertionError: 
Expected :3
Actual   :5

@areyouok
Collaborator

In your log the load timeout here is 1 millisecond. Could that be the cause?
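To make that suspicion concrete, here is a small sketch of how a very short wait timeout could lead to a second load. It assumes that a waiter which times out falls back to running the loader itself, which is what the extra "i am loaded" lines after the timeout message suggest; the class `WaitTimeoutSketch` and its latches are hypothetical and not jetcache code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: a waiter whose timeout is shorter than the load duration
// gives up waiting and loads the key a second time (assumed fallback behaviour).
final class WaitTimeoutSketch {

    static int countLoads(long waitTimeoutMillis) {
        AtomicInteger loads = new AtomicInteger();
        CountDownLatch loaderEntered = new CountDownLatch(1); // owner is inside the loader
        CountDownLatch loadDone = new CountDownLatch(1);      // owner finished loading
        CountDownLatch waiterGaveUp = new CountDownLatch(1);  // waiter stopped waiting

        Thread owner = new Thread(() -> {
            loads.incrementAndGet();
            loaderEntered.countDown();
            // Simulate a load that always outlasts the waiter's timeout.
            awaitUninterruptibly(waiterGaveUp);
            loadDone.countDown();
        });
        Thread waiter = new Thread(() -> {
            awaitUninterruptibly(loaderEntered);
            boolean finished;
            try {
                finished = loadDone.await(waitTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                finished = false;
            }
            if (!finished) {
                loads.incrementAndGet(); // timed out: load the key again ourselves
            }
            waiterGaveUp.countDown();
        });

        owner.start();
        waiter.start();
        try {
            owner.join();
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loads.get();
    }

    private static void awaitUninterruptibly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("loads with 1 ms wait timeout = " + countLoads(1));
    }
}
```

In this sketch the owner is deliberately held until the waiter has given up, so the key is loaded twice. In a real run the same effect would only appear when the load, or a thread paused at a breakpoint, takes longer than the configured wait.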

@areyouok
Collaborator

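Another possible cause: some tests set a short cache expiry, for example 200ms. In a slow environment the entry expires before the next access, so it is loaded again.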
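The effect can be shown with a tiny expire-after-write cache driven by a fake clock. This is only a sketch: `ExpirySketch` and `loadsWithDelay` are hypothetical names, and the 200ms value is taken from the comment above as an example, not from a specific test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical illustration: a short expire-after-write makes a slow second access reload the key.
final class ExpirySketch {
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long expireAfterWriteMillis;
    private final LongSupplier clock; // injected so the example does not depend on real time

    ExpirySketch(long expireAfterWriteMillis, LongSupplier clock) {
        this.expireAfterWriteMillis = expireAfterWriteMillis;
        this.clock = clock;
    }

    String computeIfAbsent(String key, Function<String, String> loader) {
        long now = clock.getAsLong();
        Entry entry = entries.get(key);
        if (entry != null && now < entry.expiresAtMillis) {
            return entry.value; // still fresh: cache hit, loader is not called
        }
        String value = loader.apply(key); // missing or expired: load (again)
        entries.put(key, new Entry(value, now + expireAfterWriteMillis));
        return value;
    }

    // Accesses the same key twice, `delayMillis` apart, and returns how often the loader ran.
    static int loadsWithDelay(long expireMillis, long delayMillis) {
        AtomicLong fakeNow = new AtomicLong();
        AtomicInteger loads = new AtomicInteger();
        ExpirySketch cache = new ExpirySketch(expireMillis, fakeNow::get);
        Function<String, String> loader = k -> {
            loads.incrementAndGet();
            return "value-of-" + k;
        };
        cache.computeIfAbsent("penetrationProtect_0", loader);
        fakeNow.addAndGet(delayMillis); // simulate a fast or a slow environment
        cache.computeIfAbsent("penetrationProtect_0", loader);
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("fast environment: " + loadsWithDelay(200, 50) + " load(s)");
        System.out.println("slow environment: " + loadsWithDelay(200, 250) + " load(s)");
    }
}
```

With a 200ms expiry, a second access after 50ms hits the cache (one load), while a second access after 250ms finds the entry expired and loads again (two loads), which is enough to break an assertion on the exact load count.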

@Roiocam
Contributor Author

Roiocam commented Mar 27, 2024

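Another possible cause: some tests set a short cache expiry, for example 200ms. In a slow environment the entry expires before the next access, so it is loaded again.

That's exactly it. Earlier, while debugging, I saw that expireAfterWrite was very small and tried changing it, but that didn't help. Later, while investigating why this unit test ran twice, I noticed that AbstractEmbeddedCacheTest changes expireAfterWrite and expireAfterAccess frequently. After changing all the related parameters, the test passed.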

@Roiocam Roiocam force-pushed the flaky-penetrationProtect branch from 9b52f6d to ee60359 March 27, 2024 10:52
@areyouok areyouok merged commit 1b4dd04 into alibaba:master Mar 27, 2024
2 checks passed
@Roiocam Roiocam deleted the flaky-penetrationProtect branch March 28, 2024 01:30
Development

Successfully merging this pull request may close these issues.

AbstractCacheTest.penetrationProtectTest flaky