
[ENHANCEMENT] how to evaluate a full app behavior? #5274

Open
dcsan opened this issue Nov 5, 2024 · 3 comments
Labels: c/evals, enhancement

Comments


dcsan commented Nov 5, 2024

I can only see how to test an existing LLM directly, e.g. OpenAI etc.
What if I want to test the results of my own app, which includes RAG and other features?

Is there a way to connect to an API endpoint and run evals?

dcsan added the enhancement and triage labels on Nov 5, 2024
github-project-automation bot moved this to 📘 Todo in phoenix on Nov 5, 2024
dosubot bot added the c/evals label on Nov 5, 2024

Jgilhuly (Contributor) commented Nov 5, 2024

Hi @dcsan, thanks for the question!

Phoenix lets you run evals on any part of your application. Most of our prebuilt evaluators focus on LLM responses, but some cover things like evaluating your RAG retriever step. That example walks through how to evaluate retrieved documents for how relevant they are to a user's question, using a prebuilt evaluator we have in Phoenix.
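
For example, here's a minimal sketch of that kind of retrieval-relevance eval. The column names, judge model, and sample data are assumptions, and exact imports can vary a bit by Phoenix version:

```python
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Each row pairs a user question ("input") with a retrieved document ("reference").
df = pd.DataFrame(
    {
        "input": ["How do I reset my password?"],
        "reference": ["To reset your password, open Settings > Security and ..."],
    }
)

relevance_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # any supported judge model
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(relevance_df[["label", "explanation"]])
```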

All Phoenix evals follow the same general loop (sketched below):

  1. Prepare the data you want to evaluate - often this is Phoenix traces
  2. Generate evaluation labels for that data, either with our evaluators (like in the example above) or through another approach
  3. Log the evaluation results back to Phoenix
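
A minimal sketch of that loop against a running Phoenix instance, assuming your app is already sending traces there. The eval name is a placeholder and a trivial stand-in judge keeps the example short; swap in a real evaluator in practice:

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

client = px.Client()

# 1. Prepare the data you want to evaluate: pull your app's spans out of Phoenix.
spans_df = client.get_spans_dataframe()  # indexed by span id

# 2. Generate evaluation labels for that data. A placeholder judge is used here;
#    use phoenix.evals.llm_classify (as above) or your own logic instead.
evals_df = pd.DataFrame(
    {"label": "ok", "score": 1.0},
    index=spans_df.index,
)

# 3. Log the evaluation results back to Phoenix so they appear next to the traces.
client.log_evaluations(
    SpanEvaluations(eval_name="my_custom_eval", dataframe=evals_df)
)
```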

Another good resource to check out would be this walkthrough of building your own eval pipeline. That might be most similar to what you're thinking of.

Hope that helps. Let me know if you have questions there! Or I'm happy to discuss your use case specifically in our community Slack.


dcsan (Author) commented Nov 5, 2024

How can I use this to just call an API and evaluate the response, e.g. with a field like text?

I wanted to use your dashboard to manage test data, then use your tool to eval the results (with an LLM) and view the results in your dashboard. Seems like a pretty common request? I'm not testing OpenAI's API, I'm testing my own.

Also, I can't join the Slack without a blessed domain:

[screenshot: Slack workspace sign-up rejecting the email domain]


Jgilhuly (Contributor) commented Nov 5, 2024

Ah, sorry for the link mix-up. Here's the right one to use: https://join.slack.com/t/arize-ai/shared_invite/zt-22vj03k4k-MlrNEwv5WeswapTs0kNCBw (updated above as well!)

To answer your question, here's what you'd do:

  1. Call your API and trace the request in Phoenix. Because you're using your own API, use these docs (see the sketch below).
  2. Add evaluations to the traces you've captured in Phoenix. Those evaluations could be about any of the information in the traces; they don't have to be related to an LLM call.

Some of the docs use OpenAI as an example, but you can swap in the call to your own API instead.
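
Here's a rough sketch of step 1 for an app exposed as an HTTP endpoint with a text field, along the lines you described. The endpoint URL, span name, and attribute keys are placeholders/assumptions; the docs above cover the exact setup for your stack:

```python
import requests
from phoenix.otel import register  # assumes the arize-phoenix-otel package

# Point traces at your Phoenix instance (the project name is arbitrary).
tracer_provider = register(project_name="my-app")
tracer = tracer_provider.get_tracer(__name__)

def ask_my_app(question: str) -> str:
    # Wrap the call to your own API in a span so the request/response land in Phoenix.
    with tracer.start_as_current_span("my-app-call") as span:
        span.set_attribute("input.value", question)
        response = requests.post(
            "http://localhost:8000/ask",  # hypothetical endpoint for your app
            json={"text": question},
            timeout=30,
        )
        answer = response.json()["text"]
        span.set_attribute("output.value", answer)
        return answer
```

Once those spans are in Phoenix, step 2 is the same evaluate-and-log loop from my earlier comment.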

RogerHYang removed the triage label on Nov 6, 2024