Releases: confident-ai/deepeval
Red teaming, safety testing, and improved synthesizer, conversational metrics, multi-modal metrics
In DeepEval 1.4.7, we're releasing:
- LLM red teaming. Safety test your LLM application for 40+ vulnerabilities with 10+ attack enhancements, docs here: https://docs.confident-ai.com/docs/red-teaming-introduction
- Improved synthetic data synthesizer with much more functionality and customizability: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Conversational metrics: Dedicated metrics to evaluate LLM turns
- Multi-modal metrics: Image editing and text to image evaluation
Agentic Evaluation Metric, Custom Evaluation LLMs, and Async for Synthetic Data Generation
In DeepEval v0.21.74, we have:
- Agentic evaluation metric to evaluate tool calling correctness for LLM agents: https://docs.confident-ai.com/docs/metrics-tool-correctness
- Pydantic Schemas to enforce JSON outputs for custom, smaller LLMs: https://docs.confident-ai.com/docs/guides-using-custom-llms
- Asynchronous support for synthetic data generation: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Tracing integration for LlamaIndex and LangChain: https://docs.confident-ai.com/docs/confident-ai-tracing
Verbosity in Metrics, Hyperparameter Logging, Improved Synthetic Data Generation, Better Async Support
In DeepEval v0.21.62, we:
- added an option to print intermediate steps during metric execution, which can be configured via the verbose_mode parameter: https://docs.confident-ai.com/docs/metrics-answer-relevancy#example
- hyperparameters can be logged to Confident AI via the evaluate() function: https://docs.confident-ai.com/docs/getting-started#optimizing-hyperparameters
- Synthetic data generation now gives more realistic results and is more customizable: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
Synthetic Data, Caching, Benchmarks, and GEval improvement
For deepeval's latest release v0.21.15, we release:
- Synthetic Data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Caching. If you're running 10k test cases and the run fails at the 9999th test case, you no longer have to rerun the earlier test cases, as you can read their results from cache using the -c flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache
- Repeats. If you want to repeat each test case for statistical significance, use the -r flag: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
- LLM Benchmarks. Supporting popular benchmarks such as MMLU, HellaSwag, and BIG-Bench Hard (BBH) so anyone can evaluate ANY model on research-backed benchmarks in a few lines of code.
- G-Eval improvements. The G-Eval metric now supports using logprobs of tokens to find the weighted summed score.
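
As a rough illustration of the logprob-weighted score mentioned in the G-Eval item, the sketch below converts the log-probabilities of candidate score tokens into probabilities and takes their weighted average. The function name `weighted_score` and the example values are hypothetical; this is not deepeval's internal implementation, only the general idea under the assumption that each candidate score maps to one token logprob.

```python
import math
from typing import Dict


def weighted_score(score_logprobs: Dict[int, float]) -> float:
    """Probability-weighted average of candidate scores.

    score_logprobs maps each candidate score (e.g. 1..5) to the
    log-probability the model assigned to that score token.
    Hypothetical helper, not part of deepeval's API.
    """
    # Convert log-probabilities back to probabilities.
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any candidate score")
    # Normalize so the weights sum to 1, then take the weighted sum.
    return sum(score * p for score, p in probs.items()) / total


# Made-up example: the model puts 60% on "4", 30% on "5", 10% on "3".
example = {4: math.log(0.6), 5: math.log(0.3), 3: math.log(0.1)}
print(round(weighted_score(example), 2))
```

Compared with taking only the single most likely score, the weighted version yields a finer-grained, continuous value.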
Async Support for Prod
In deepeval v0.20.85:
- asynchronous support throughout deepeval, and no longer using threads. Users can also call individual metrics asynchronously: https://docs.confident-ai.com/docs/metrics-introduction#measuring-metrics-in-async
- improved the way in which you create a custom LLM for evaluation. You'll now have to implement an asynchronous generate() method to use deepeval's async features: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
- strict mode for all metrics!
- improved the evaluate() function for more customizability: https://docs.confident-ai.com/docs/evaluation-introduction#evaluating-without-pytest
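
To make the custom LLM requirement above more concrete, here is a minimal, self-contained sketch of a model wrapper that exposes both a synchronous and an asynchronous generate method. `BaseEvalLLM`, `EchoLLM`, `a_generate`, and `run_all` are hypothetical names used for illustration; the real base class and method names are described in the linked docs and may differ.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class BaseEvalLLM(ABC):
    """Hypothetical stand-in for an evaluation-LLM base class."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...

    @abstractmethod
    async def a_generate(self, prompt: str) -> str:
        ...


class EchoLLM(BaseEvalLLM):
    """Toy model that echoes the prompt instead of calling a real LLM."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

    async def a_generate(self, prompt: str) -> str:
        # A real implementation would await an async client call here.
        await asyncio.sleep(0)
        return self.generate(prompt)


async def run_all(model: BaseEvalLLM, prompts: List[str]) -> List[str]:
    # Run all generations concurrently on the event loop, without threads.
    return list(await asyncio.gather(*(model.a_generate(p) for p in prompts)))


outputs = asyncio.run(run_all(EchoLLM(), ["a", "b"]))
print(outputs)
```

The point of the async method is that many evaluation calls can be awaited concurrently on a single event loop rather than through threads.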
Conversational Metrics and Synthetic Data Generation
In DeepEval's latest release, there is now:
- conversational metrics: https://docs.confident-ai.com/docs/metrics-knowledge-retention. This metric evaluates whether your LLM is able to retain factual information presented to it throughout a conversation
- synthetic data generation. Generate evaluation datasets from scratch: https://docs.confident-ai.com/docs/evaluation-datasets#generate-an-evaluation-dataset
Production Stability
For the newest release, deepeval is now stable for production use:
- reduced package size
- separated functionality of pytest vs deepeval test run command
- included coverage score for summarization
- fixed contextual precision node error
- released docs for better transparency into metrics calculation
- allows users to configure RAGAS metrics for custom embedding models: https://docs.confident-ai.com/docs/metrics-ragas#example
- fixed bugs with checking for package updates
Hugging Face and LlamaIndex integration
For the latest release, DeepEval:
- Supports Hugging Face users by providing real-time evaluations during fine-tuning: https://docs.confident-ai.com/docs/integrations-huggingface
- Supports LlamaIndex users by allowing unit testing of LlamaIndex apps in CI/CD, and by offering metrics in LlamaIndex's evaluators: https://docs.confident-ai.com/docs/integrations-llamaindex
- Improvements to accuracy and reliability in Faithfulness and Answer Relevancy
- Summarization Metric now offers an explanation
- You can now use ANY LLM for evaluation: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
LLM-Evals now support all LangChain chat models
- LLM-Evals (LLM evaluated metrics) now support all of LangChain's chat models.
- LLMTestCase now has execution_time and cost, useful for those looking to evaluate on these parameters
- minimum_score is now threshold instead, meaning you can now create custom metrics that either have a "minimum" or "maximum" threshold
- LLMEvalMetric is now GEval
- LlamaIndex tracing integration: https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#deepeval
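
The rename from minimum_score to threshold in this release can be pictured with a small sketch: a single threshold value acts as either a lower or an upper bound depending on the metric. The class `ThresholdCheck` and the `higher_is_better` flag are hypothetical names for illustration only, not deepeval's metric interface.

```python
from dataclasses import dataclass


@dataclass
class ThresholdCheck:
    """Hypothetical pass/fail check around a single threshold value."""

    threshold: float
    # True: threshold is a minimum (e.g. relevancy).
    # False: threshold is a maximum (e.g. latency or cost).
    higher_is_better: bool = True

    def is_successful(self, score: float) -> bool:
        if self.higher_is_better:
            return score >= self.threshold
        return score <= self.threshold


relevancy = ThresholdCheck(threshold=0.7)
latency = ThresholdCheck(threshold=2.0, higher_is_better=False)

print(relevancy.is_successful(0.8))  # score meets the minimum
print(latency.is_successful(3.5))    # score exceeds the maximum
```

A "maximum" threshold fits measurements such as execution_time or cost, where lower values are better.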
All RAG Metrics now offer score reasoning, and a lot more.
In this release:
- Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, and Contextual Recall all offer reasoning for their given scores.
- Azure OpenAI now supported via a single command in the CLI: https://docs.confident-ai.com/docs/metrics-introduction#using-azure-openai
- New Summarization Metric that uses the QAG framework for its implementation: https://docs.confident-ai.com/docs/metrics-summarization
- Pulling datasets from Confident AI now offers an intermediate step for additional data processing before evaluation: https://docs.confident-ai.com/docs/confident-ai-evaluate-datasets#pull-your-dataset-from-confident-ai
- Decoupled imports from transformers, sentence_transformers, and pandas to reduce package size