Asynchronous test runs are sometimes not completed correctly #1147

Open
jmaczan opened this issue Nov 9, 2024 · 12 comments

@jmaczan commented Nov 9, 2024

Describe the bug
When running evaluate() with run_async=True, the run sometimes never completes (so any job/task/pipeline that relies on the exit code will fail). Results are neither printed nor emitted; the evaluation is essentially stuck after running the last test case. It doesn't happen every time and follows no obvious pattern, so it looks like a race condition. The issue is likely in the async code in evaluate.py: a_execute_test_cases(), get_or_create_event_loop(), or loop.run_until_complete.

It might be that await asyncio.sleep(throttle_value) leaves the semaphore stuck, or something along those lines. I haven't debugged it beyond a brief static read of the code, though.

To Reproduce
Steps to reproduce the behavior:

  1. Create a list of test cases
  2. Run them using evaluate() with run_async=True (see the sketch below this list)
  3. All tests execute asynchronously, and that's fine
  4. Results are never printed, saved, etc. The run is essentially stuck after the last test case
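
A minimal reproduction sketch, assuming deepeval's public evaluate() API and the AnswerRelevancyMetric (the exact metrics and test-case contents don't seem to matter; the inputs below are placeholders):

    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    # Placeholder test cases; the hang does not depend on the actual content
    test_cases = [
        LLMTestCase(input=f"question {i}", actual_output=f"answer {i}")
        for i in range(20)
    ]

    # Intermittently hangs after the last test case finishes;
    # results are never printed and the process never exits
    evaluate(test_cases, metrics=[AnswerRelevancyMetric()], run_async=True)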

Expected behavior
Tests always finish by printing either results or errors; they should never run indefinitely.

Desktop:

  • OS: Windows 11

@penguine-ip (Contributor)

@jmaczan I've never encountered this issue; can you give us something to reproduce it? For example, the number of test cases, the metrics you're using, etc.

@rjiangnju

@penguine-ip,
I think I've run into a similar issue before. I have about 130 test cases, and the metrics I use are:
Hallucination, Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy

With run_async=False, it finished without any issue.

@jmaczan (Author) commented Nov 15, 2024

@penguine-ip I have about 20 test cases and I use literally the same set of metrics as @rjiangnju. run_async=False solves the issue, but for me it's a no-go, since I run each case multiple times, drop the top and bottom results, and take the median to get more stable metric values.
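
For context, the stabilization setup is roughly the following (a sketch; extract_metric_scores() is a hypothetical helper for pulling per-metric scores out of whatever evaluate() returns, and test_cases/metrics are the objects already described above):

    from statistics import median
    from deepeval import evaluate

    NUM_RUNS = 5  # run the whole suite several times

    runs = []
    for _ in range(NUM_RUNS):
        result = evaluate(test_cases, metrics=metrics, run_async=True)
        runs.append(extract_metric_scores(result))  # hypothetical helper: {metric name: score}

    # For each metric, drop the best and worst run and take the median of the rest
    stabilized = {}
    for name in runs[0]:
        scores = sorted(run[name] for run in runs)
        stabilized[name] = median(scores[1:-1])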

@threeteck

I have the same issue; my metrics are GEval and Answer Relevancy. I run the evaluate method multiple times (about 100 test cases per run) in my script, and sometimes it gets stuck near the end. Running with run_async=False works fine, but it's much slower. I also tried splitting the test cases into smaller groups, but even with just 3 test cases per evaluate call it still gets stuck randomly.

@threeteck

To add to this, I've been running the code in a Jupyter notebook on Windows 10, but when I transferred it to a plain Python file and executed it, everything worked fine. However, since this bug occurs randomly, I may just have gotten lucky.

@jmaczan (Author) commented Nov 15, 2024

I have a similar experience: it fails for me in Azure Pipelines but seems to work fine locally.

@penguine-ip (Contributor)

@jmaczan @threeteck @rjiangnju Hey all, if it works fine locally but not in some environments, it might be because we're writing results to a cache file. When you run evaluate(write_cache=False), it no longer writes to file; can you quickly try this and let me know if it solves the problem?

In other environments, the OS might handle file locking differently. Basically, when we run things async we need to make sure the cache we're writing is always in its most up-to-date state, and that can sometimes cause different coroutines to deadlock each other while reading and writing the file.
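
In other words, something like this (a sketch; only write_cache is changed from what you're already passing, and test_cases/metrics stand in for your existing arguments):

    from deepeval import evaluate

    # Skip writing cached results to disk to rule out the file-locking deadlock
    evaluate(test_cases, metrics=metrics, run_async=True, write_cache=False)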

@rjiangnju

Hi @penguine-ip,

Just tried with write_cache=False and it seems to work on my side. The OS is Windows 11, but the code runs in a Docker container with Debian 12 under WSL. Hope this helps solve the issue.

@jmaczan (Author) commented Nov 17, 2024

Unfortunately it doesn't work for me; I keep hitting the same issue. The pipeline runs on Ubuntu and I'm using deepeval 1.5.0.

@jmaczan (Author) commented Nov 17, 2024

@rjiangnju could you share all the other parameters you use in evaluate(), along with write_cache=False?

@rjiangnju

@jmaczan, here is the command I use:
evaluation_result = evaluate(dataset, metrics, run_async=True, throttle_value=10, max_concurrent=3, show_indicator=True, print_results=False, ignore_errors=True, write_cache=False)

It may also be related to the metrics: I changed them a little, removing Hallucination and adding two GEval metrics.
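
For reference, the two GEval metrics are constructed roughly like this (a sketch; the names and criteria are placeholders, not my exact ones):

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCaseParams

    # Placeholder criteria; the exact wording shouldn't matter for the hang
    correctness = GEval(
        name="Correctness",
        criteria="Judge whether the actual output is factually consistent with the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    )
    completeness = GEval(
        name="Completeness",
        criteria="Judge whether the actual output fully addresses the input.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )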

@penguine-ip (Contributor)
