Asynchronous test runs are sometimes not completed correctly #1147

Open
jmaczan opened this issue Nov 9, 2024 · 12 comments

@jmaczan commented Nov 9, 2024

Describe the bug
When running evaluate() with run_async=True, the run sometimes never completes (so any job/task/pipeline that relies on the exit code will fail). Results are neither printed nor emitted; the evaluation is essentially stuck after running the last test case. It doesn't happen every time and follows no obvious pattern, so it looks like a race condition. The issue is likely in the async code in evaluate.py: a_execute_test_cases(), get_or_create_event_loop(), or loop.run_until_complete.

It might be that await asyncio.sleep(throttle_value) leaves the semaphore stuck, or something along those lines. I haven't debugged it beyond a brief static read of the code, though.

To Reproduce
Steps to reproduce the behavior:

  1. Create a list of test cases
  2. Run them using evaluate() with run_async=True (see the sketch below this list)
  3. All tests execute asynchronously, and that's fine
  4. Results are never printed, saved, etc. The run is essentially stuck after the last test case
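
A minimal reproduction sketch, assuming deepeval's public evaluate() API and the AnswerRelevancyMetric (the exact metrics and test-case contents don't seem to matter; the inputs below are placeholders):

    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    # Placeholder test cases; the hang does not depend on the actual content
    test_cases = [
        LLMTestCase(input=f"question {i}", actual_output=f"answer {i}")
        for i in range(20)
    ]

    # Intermittently hangs after the last test case finishes;
    # results are never printed and the process never exits
    evaluate(test_cases, metrics=[AnswerRelevancyMetric()], run_async=True)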

Expected behavior
Tests always finish by printing either results or errors; they should never run indefinitely.

Desktop:

  • OS: Windows 11

@penguine-ip (Contributor)

@jmaczan I've never encountered this issue; can you give us something to reproduce it? For example, the number of test cases, the metrics you're using, etc.

@rjiangnju

@penguine-ip,
I think I've run into a similar issue before. I have about 130 test cases, and the metrics I use are:
Hallucination, Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy

With run_async=False, it finished without any issue.

@jmaczan (Author) commented Nov 15, 2024

@penguine-ip I have about 20 test cases and I use literally the same set of metrics as @rjiangnju. run_async=False solves the issue, but for me it's a no-go, since I run each case multiple times, drop the top and bottom results, and take the median to get more stable metric values.
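
For context, the stabilization setup is roughly the following (a sketch; extract_metric_scores() is a hypothetical helper for pulling per-metric scores out of whatever evaluate() returns, and test_cases/metrics are the objects already described above):

    from statistics import median
    from deepeval import evaluate

    NUM_RUNS = 5  # run the whole suite several times

    runs = []
    for _ in range(NUM_RUNS):
        result = evaluate(test_cases, metrics=metrics, run_async=True)
        runs.append(extract_metric_scores(result))  # hypothetical helper: {metric name: score}

    # For each metric, drop the best and worst run and take the median of the rest
    stabilized = {}
    for name in runs[0]:
        scores = sorted(run[name] for run in runs)
        stabilized[name] = median(scores[1:-1])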

@threeteck

I have the same issue; my metrics are GEval and Answer Relevancy. I run the evaluate method multiple times (about 100 test cases per run) in my script, and sometimes it gets stuck near the end. Running with run_async=False works fine, but it's much slower. I also tried splitting the test cases into smaller groups, but even with just 3 test cases per evaluate call it still gets stuck randomly.

@threeteck

To add to this, I've been running the code in a Jupyter notebook on Windows 10, but when I transferred it to a plain Python file and executed it, everything worked fine. However, since this bug occurs randomly, I may just have gotten lucky.

@jmaczan (Author) commented Nov 15, 2024

I have a similar experience: it fails for me in Azure Pipelines but seems to work fine locally.

@penguine-ip (Contributor)

@jmaczan @threeteck @rjiangnju Hey all, if it works fine locally but not in some environments, it might be because we're writing results to a cache file. When you run evaluate(write_cache=False), it no longer writes to file; can you quickly try this and let me know if it solves the problem?

In other environments, the OS might handle file locking differently. Basically, when we run things async we need to make sure the cache we're writing is always in its most up-to-date state, and that can sometimes cause different coroutines to deadlock each other while reading and writing the file.
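
In other words, something like this (a sketch; only write_cache is changed from what you're already passing, and test_cases/metrics stand in for your existing arguments):

    from deepeval import evaluate

    # Skip writing cached results to disk to rule out the file-locking deadlock
    evaluate(test_cases, metrics=metrics, run_async=True, write_cache=False)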

@rjiangnju

Hi @penguine-ip,

Just tried with write_cache=False and it seems to work on my side. The OS is Windows 11, but the code runs in a Docker container with Debian 12 under WSL. Hope this helps solve the issue.

@jmaczan (Author) commented Nov 17, 2024

Unfortunately it doesn't work for me; I keep hitting the same issue. The pipeline runs on Ubuntu and I'm using deepeval 1.5.0.

@jmaczan (Author) commented Nov 17, 2024

@rjiangnju could you share all the other parameters you use in evaluate(), along with write_cache=False?

@rjiangnju

@jmaczan, here is the command I use:
evaluation_result = evaluate(dataset, metrics, run_async=True, throttle_value=10, max_concurrent=3, show_indicator=True, print_results=False, ignore_errors=True, write_cache=False)

It may also be related to the metrics: I changed them a little, removing Hallucination and adding two GEval metrics.
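
For reference, the two GEval metrics are constructed roughly like this (a sketch; the names and criteria are placeholders, not my exact ones):

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCaseParams

    # Placeholder criteria; the exact wording shouldn't matter for the hang
    correctness = GEval(
        name="Correctness",
        criteria="Judge whether the actual output is factually consistent with the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    )
    completeness = GEval(
        name="Completeness",
        criteria="Judge whether the actual output fully addresses the input.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )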

@penguine-ip (Contributor)
