Evaluations in ell #285
Comments
Potentially we can add the following as a nicety for dataset construction:
# Use this as a handicap for users specifying their own datasets; they need to be explicit about the input.
from collections import UserDict
from typing import Any, Dict, List, Union

InputType = Union[Dict[str, Any], List[Any], None]

class Datapoint(UserDict):
    def __init__(self, input: InputType, **rest):
        assert isinstance(input, (dict, list)) or input is None, f"Input must be a dict, list, or None, got {type(input)}"
        # Store the input under the "input" key; passing it positionally would merge its keys into the mapping.
        super().__init__(input=input, **rest)

    @property
    def input(self) -> InputType:
        return self.data["input"]

dataset: List[Datapoint] = [
    Datapoint(input={"text": "Hello world"}, random_heuristic="Hello world", hf_score=0.5),
    Datapoint(input=[1, 2, 3, 4, 5], random_heuristic="List of numbers", hf_score=0.5),
    Datapoint(input=None, random_heuristic="No input", hf_score=0.5),
]
# But datasets, on the other side, will accept arbitrary data types for construction.
# XXX: Need to figure out if we should actually build a basic dataset class with validation or just leave it as a list of dicts.
import json
import pickle
from typing import Any, Dict, Iterable

import pandas as pd

class Dataset:
    def __init__(self, data: Iterable[Dict[str, Any]]):
        # Materialize the iterable so that indexing and len() work.
        self.data = list(data)
        # XXX: Validation
        # If we do this now we can potentially force the user to serialize their data in the data store etc.
        self.validate()

    def __iter__(self):
        return iter(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

    def validate(self):
        for datapoint in self.data:
            if not isinstance(datapoint, dict):
                raise ValueError(f"Each datapoint must be a dictionary, got {type(datapoint)}")
            if "input" not in datapoint:
                raise ValueError("Each datapoint must have an 'input' key", datapoint)
            if not isinstance(datapoint["input"], (list, dict)):
                raise ValueError(f"The 'input' value must be a list or dictionary, got {type(datapoint['input'])}", datapoint)

    @classmethod
    def from_pd(cls, dataframe: pd.DataFrame, input_column: str):
        # Wrap each value so every datapoint carries the required "input" key.
        return cls({"input": value} for value in dataframe[input_column].to_list())

    @classmethod
    def from_jsonl(cls, file_path: str):
        # JSONL holds one JSON object per line, so parse line by line.
        with open(file_path, "r") as file:
            return cls(json.loads(line) for line in file if line.strip())

    @classmethod
    def from_pickle(cls, file_path: str):
        with open(file_path, "rb") as file:
            return cls(pickle.load(file))
We need to solve this antipattern (what if we just want to eval something with a criterion on no dataset?):
import ell

dataset = [
    {
        "input": [],
    }
]*10

@ell.simple(model="gpt-4o")
def write_a_bad_poem():
    return "Write a really poorly written poem."

@ell.simple(model="gpt-4o")
def write_a_good_poem():
    return "Write a really well written poem."

@ell.simple(model="gpt-4o", temperature=0.1)
def is_good_poem(poem: str):
    """Include either yes or no in your response at most once but not both."""
    return f"Is this a good poem yes/no? {poem}"

def score(datapoint, output):
    # Score the judge's verdict rather than the poem text (is_good_poem was otherwise unused).
    return float("yes" in is_good_poem(output).lower())

eval = ell.evaluation.Evaluation(name="poem_eval", dataset=dataset, criteria={"is_good": score})
print("EVALUATING BAD POEM")
result = eval.run(write_a_bad_poem, n_workers=4)
print("EVALUATING GOOD POEM")
result = eval.run(write_a_good_poem, n_workers=4)
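One possible way around the placeholder dataset, as a rough sketch only: a small helper that builds `n` independent empty datapoints. `empty_dataset` is a hypothetical name and not part of ell. It also avoids a gotcha in the snippet above: `[{"input": []}] * 10` repeats a single dict object ten times, which is only safe while nothing mutates a datapoint.

```python
from typing import Any, Dict, List


def empty_dataset(n: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: n datapoints that carry no input at all."""
    # A comprehension creates a fresh dict and list per datapoint,
    # unlike `[{"input": []}] * n`, which aliases one dict n times.
    return [{"input": []} for _ in range(n)]


aliased = [{"input": []}] * 3
independent = empty_dataset(3)

aliased[0]["input"].append("oops")      # visible in every aliased row
independent[0]["input"].append("oops")  # stays local to row 0

print([len(dp["input"]) for dp in aliased])      # [1, 1, 1]
print([len(dp["input"]) for dp in independent])  # [1, 0, 0]
```

A cleaner long-term fix would probably be letting the evaluation take a sample count instead of a dataset, but that is a design decision for the spec, not something this sketch settles.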
Add a migration using alembic.
Okay. So we now have a really good spec for at least invocation labels and invocation labelers, which lets us define arbitrary rubrics using JSON schemas. The big problem now is dataset serialization and, on top of that, storing invocations with param objects. If I run multiple invocations on the same parameters, for example on the same datapoints, I will duplicate the dataset in my store once for every metric there is. This seems like it'll end up being really, really slow, so someone (probably me) needs to come through here and redesign the data model so that this is a lot more efficient.

The other question is whether, in general, we should have a dataset at all. This matters if we want to reserialize the evaluation at some later point in time rather than just store the version of it. For example, if I specify an evaluation, I probably do actually want to look at the data of the evaluation. So the picture in ell Studio would be a list of rows that are part of the evaluation.

As part of the migration, we could build a new parameter object, essentially the input contents for an invocation, and have that hashed. These objects would be stored separately and efficiently. The dataset would then be just a list of these input blobs, each corresponding to invocations. This is actually the true way to serialize it, because we have a bunch of inputs that we're going to reuse every single time with variable outputs. Alternatively, we could keep the invocation contents as the true parameters.

For the evaluation view, we definitely want to have the dataset and then the different evaluation runs. So I guess you would view three tabs: dataset, invocations, and runs/metrics. Clearly, we need first-class support for the dataset, and the dataset schema itself will have an input and then a bunch of other objects on it.
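To make the hashed input blob idea concrete, here is a minimal sketch. It assumes inputs are JSON-serializable and uses SHA-256 over canonical JSON as the content hash; `InputBlobStore`, `input_hash`, and `dataset_hash` are hypothetical names, and the real store would be backed by the database rather than a dict.

```python
import hashlib
import json
from typing import Any, Dict, List


def input_hash(input_contents: Any) -> str:
    """Content hash of one input (assumes JSON-serializable contents)."""
    canonical = json.dumps(input_contents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InputBlobStore:
    """Hypothetical in-memory store: each distinct input is kept once."""

    def __init__(self) -> None:
        self._blobs: Dict[str, Any] = {}

    def put(self, input_contents: Any) -> str:
        key = input_hash(input_contents)
        # setdefault keeps the first copy, so repeated inputs cost nothing extra.
        self._blobs.setdefault(key, input_contents)
        return key

    def get(self, key: str) -> Any:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


def dataset_hash(input_refs: List[str]) -> str:
    """Hash of the ordered list of input hashes, standing in for a dataset id."""
    return hashlib.sha256("\n".join(input_refs).encode("utf-8")).hexdigest()


store = InputBlobStore()
inputs = [{"text": "Hello world"}, [1, 2, 3], {"text": "Hello world"}]

# The dataset becomes a list of references into the blob store.
dataset_refs = [store.put(item) for item in inputs]
```

With this layout, every invocation and every metric would point at the same input hash instead of carrying its own copy of the parameters.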
These input objects themselves would be per datapoint, so one for each row in the dataset. Another issue is that there's basically both a dataset class and a datapoint class; the dataset class contains a list of datapoints. If we wanted to be thorough about this, we would reengineer the blob to separate out these unique objects. And then we could really go down the Weights and Biases route: we would define a dataset object, just like Weave does, and that dataset would automatically be added to the store. That's not necessarily something you want, but inherently it'll be added to the store because we are using the dataset in the eval.

So this is tough, right? I mean, I could just say, "Hey, here's the eval. There's a dataset object. It's this size." If you wanted to actually look into the dataset, you'd need to open it with Python and so on, except once we've actually run the eval for the first time, because running the eval commits it to the database. And that's the philosophy: if we don't run the eval, nothing changes. But what if, for example, I want to change the metric? I don't want to rerun all the completions again. So that's kind of a flaw with this design as well.
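The "change the metric without rerunning completions" problem could be handled by keeping outputs separate from scores. The following is only a sketch of that separation, not how ell does it: `CachedRun` and `fake_lmp` are hypothetical, and it assumes each datapoint's `input` is a list of positional arguments.

```python
from typing import Any, Callable, Dict, List

Datapoint = Dict[str, Any]
Criterion = Callable[[Datapoint, str], float]

call_log: List[tuple] = []


def fake_lmp(*args: Any) -> str:
    """Stand-in for an LMP call; records each call so we can count them."""
    call_log.append(args)
    verdict = "yes" if len(call_log) % 2 else "no"
    return f"poem {len(call_log)}: {verdict}"


class CachedRun:
    """Hypothetical: the stored outputs of one LMP over one dataset."""

    def __init__(self, lmp: Callable[..., str], dataset: List[Datapoint]):
        self.dataset = dataset
        # The expensive step: one completion per datapoint, done once.
        self.outputs = [lmp(*dp["input"]) for dp in dataset]

    def score(self, criteria: Dict[str, Criterion]) -> Dict[str, List[float]]:
        # The cheap step: criteria only read the stored outputs.
        return {
            name: [fn(dp, out) for dp, out in zip(self.dataset, self.outputs)]
            for name, fn in criteria.items()
        }


dataset = [{"input": []} for _ in range(4)]
run = CachedRun(fake_lmp, dataset)

first = run.score({"is_good": lambda dp, out: float("yes" in out)})
second = run.score({"length": lambda dp, out: float(len(out))})

print(len(call_log))  # 4: the second metric reused the stored outputs
```

In store terms this would mean labels reference existing invocations, so adding a labeler later only writes new label rows.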
Overall thinking here is we need to redesign our store spec to have better support for redundant entities.
What about multimodal feedback?
erDiagram
    SerializedLMP ||--o{ Invocation : "has"
    SerializedLMP ||--o{ SerializedLMPUses : "uses/used_by"
    SerializedLMP ||--o{ EvaluationRun : "evaluated_in"
    Invocation ||--|| InvocationContents : "has"
    Invocation ||--o{ InvocationTrace : "consumes/consumed_by"
    Invocation ||--o{ EvaluationResultDatapoint : "labeled_in"
    Evaluation ||--|{ EvaluationLabeler : "has"
    Evaluation ||--|{ EvaluationRun : "has"
    EvaluationRun ||--|{ EvaluationResultDatapoint : "has"
    EvaluationRun ||--|{ EvaluationRunLabelerSummary : "has"
    EvaluationLabeler ||--|{ EvaluationLabel : "has"
    EvaluationLabeler ||--|{ EvaluationRunLabelerSummary : "has"
    EvaluationResultDatapoint ||--|{ EvaluationLabel : "has"
    SerializedLMP {
        string lmp_id PK
        string name
        string source
        LMPType lmp_type
    }
    Invocation {
        string id PK
        string lmp_id FK
    }
    InvocationContents {
        string invocation_id PK,FK
    }
    Evaluation {
        string id PK
        string name
        string dataset_hash
    }
    EvaluationLabeler {
        int id PK
        string name
        EvaluationLabelerType type
    }
    EvaluationRun {
        int id PK
        int evaluation_id FK
        string evaluated_lmp_id FK
    }
    EvaluationResultDatapoint {
        int id PK
        string invocation_being_labeled_id FK
        string evaluation_run_id FK
    }
    EvaluationLabel {
        int labeled_datapoint_id PK,FK
        string labeler_id PK,FK
    }
    EvaluationRunLabelerSummary {
        string evaluation_run_id PK,FK
        string evaluation_labeler_id PK,FK
    }
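As a rough illustration of the label part of the diagram, here is a sketch of three of the tables using Python's built-in `sqlite3`. This is not the project's actual store code or migration. The `label_value` column is hypothetical, and the sketch assumes integer keys throughout even though the diagram lists `labeler_id` and `evaluation_run_id` as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    PRAGMA foreign_keys = ON;

    CREATE TABLE evaluation_labeler (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE evaluation_result_datapoint (
        id INTEGER PRIMARY KEY,
        invocation_being_labeled_id TEXT NOT NULL,
        evaluation_run_id INTEGER NOT NULL
    );

    -- One label per (datapoint, labeler) pair, as in the composite PK above.
    CREATE TABLE evaluation_label (
        labeled_datapoint_id INTEGER NOT NULL
            REFERENCES evaluation_result_datapoint (id),
        labeler_id INTEGER NOT NULL REFERENCES evaluation_labeler (id),
        label_value REAL,  -- hypothetical column, not in the diagram
        PRIMARY KEY (labeled_datapoint_id, labeler_id)
    );
    """
)

conn.execute("INSERT INTO evaluation_labeler (id, name) VALUES (1, 'is_good'), (2, 'length')")
conn.execute(
    "INSERT INTO evaluation_result_datapoint "
    "(id, invocation_being_labeled_id, evaluation_run_id) VALUES (1, 'invocation-abc', 1)"
)
# Two labelers label the same invocation without duplicating it.
conn.executemany(
    "INSERT INTO evaluation_label VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 42.0)],
)

try:
    # A second label from the same labeler on the same datapoint is rejected.
    conn.execute("INSERT INTO evaluation_label VALUES (1, 1, 0.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

label_count = conn.execute("SELECT COUNT(*) FROM evaluation_label").fetchone()[0]
```

The point of the composite key is that one stored invocation can collect any number of labels, one per labeler, which is what removes the per-metric duplication discussed earlier.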
For the no
UX is getting there
This is a major feature release.
Spec: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/evalspec.md
Ramblings: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/thoughtsonevals.md
Example: https://github.com/MadcowD/ell/blob/6afad20bc58a99e9f3fe0a76ff6b7642471d63a7/examples/eval.py
The big ones:
UX/IMPL TODOs
https://testdriven.io/blog/fastapi-sqlmodel/
Next Step TODOS
(Misc todos)
Bugs: