Zero shot polis report to tldr summary #1842
Remarkably stable across multiple calls to Claude 3.5 Sonnet on any given dataset!
TLDR sanity check "evaluation":
Harder to evaluate but needed: evaluating coverage
Examples of stability across calls to Claude 3.5 Sonnet: on the Bowling Green report, and on the New Zealand report.
Feedback from @DZNarayanan on the stability of evaluations in the Bowling Green example: while the summaries appear stable to a general eye, to a specialist who knows that conversation well they actually vary quite a lot in what they put forward, and not all of them are aligned with what the human report put forward. This points to the need for thorough evaluation, both qualitative and quantitative. As discussed with @colinmegill and @DZNarayanan, we know that LLMs do not come with formal guarantees, unlike PCA and k-means, so we have to be deliberate about how we use them and about what empirical guarantees we require before using them.
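One possible starting point for the quantitative side is a rough sketch like the one below, which scores how much several summaries of the same report overlap lexically. This is not what was run in this thread: the function names (`tokens`, `jaccard`, `pairwise_stability`) are made up for illustration, and Jaccard overlap on word sets is only a crude proxy that I am assuming here for simplicity.

```python
from itertools import combinations
import re


def tokens(text: str) -> set[str]:
    # Lowercased word set; a deliberately crude tokenizer (assumption).
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    # Jaccard similarity between the word sets of two summaries.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pairwise_stability(summaries: list[str]) -> float:
    # Mean Jaccard similarity over all pairs of summaries
    # produced by repeated calls on the same report.
    pairs = list(combinations(summaries, 2))
    if not pairs:
        raise ValueError("need at least two summaries")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A low score would flag variability between calls, but a high score says nothing about whether the summaries agree with the human report, so it would complement rather than replace a specialist's qualitative review.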
This issue is a feature! Paste the raw text (copy and paste) of any automatically generated Polis report after this prompt.
Here's a report to test! https://pol.is/report/r7bhuide6netnbr8fxbyh
Instructions