Evaluations in ell #285

Open
34 of 66 tasks
MadcowD opened this issue Oct 4, 2024 · 10 comments

Comments

Owner

MadcowD commented Oct 4, 2024

This is a major feature release.
Spec: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/evalspec.md
Ramblings: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/thoughtsonevals.md
Example: https://github.com/MadcowD/ell/blob/6afad20bc58a99e9f3fe0a76ff6b7642471d63a7/examples/eval.py

The big ones:

  • Tracing
  • Performance Tracking
  • Hand labeling interface (maybe next PR)

UX/IMPL TODOs

Next Step TODOS

(Misc todos)

  • #XXX: Separate this into VersionedEvaluation and Evaluation, because versioning is somewhat expensive if someone has a big eval. Then perhaps we could default to VersionedEval in the docs, or to version=False. Not sure.
  • TODO: Link Invocations to EvalRuns
  • TODO: Link Invocations to InvocationScores.
  • TODO: Write to DB
  • TODO: Build UX for analyzing evals.
  • TODO: Solve (input, labels, score_fn) etc
  • TODO: What about automatic cross validation & splitting.
  • TODO: Consider wandb style metrics later.
  • Need a way to compare evals across metrics
  • Refactor evaluationcardtitle & lmp cardtitle to have the same base class

Bugs:

  • Threading issue
  • Lambda serialization incorrect
  • Metrics from older evals get pulled out on the computation graph
MadcowD added this to the v0.1.0 release milestone Oct 4, 2024
Owner Author

MadcowD commented Oct 4, 2024

#269

MadcowD mentioned this issue Oct 4, 2024
Owner Author

MadcowD commented Oct 5, 2024

Potentially we can add the following as niceties for dataset construction:

import json
import pickle
from collections import UserDict
from collections.abc import Mapping
from typing import Any, Dict, Iterable, List, Optional, Union

import pandas as pd

# Use this as a handicap for users specifying their own datasets: they need to be explicit about the input.
InputType = Optional[Union[Dict[str, Any], List[Any]]]

class Datapoint(UserDict):
    def __init__(self, input: InputType, **rest):
        assert isinstance(input, (dict, list)) or input is None, f"Input must be a dict, list, or None, got {type(input)}"
        # Store the input under the "input" key so the property below can read it back.
        super().__init__(input=input, **rest)

    @property
    def input(self) -> InputType:
        return self.data["input"]

dataset: List[Datapoint] = [
    Datapoint(input={"text": "Hello world"}, random_heuristic="Hello world", hf_score=0.5),
    Datapoint(input=[1, 2, 3, 4, 5], random_heuristic="List of numbers", hf_score=0.5),
    Datapoint(input=None, random_heuristic="No input", hf_score=0.5),
]

# Datasets, on the other hand, will accept arbitrary data types for construction.
#XXX: Need to figure out if we should actually build a basic dataset class with validation or just leave it as a list of dicts.
class Dataset:
    def __init__(self, data: Iterable[Mapping[str, Any]]):
        # Materialize the iterable so that indexing and len() work.
        self.data = list(data)
        #XXX: Validation
        # If we do this now we can potentially force user to serialize their data in the data store etc.
        self.validate()

    def __iter__(self):
        return iter(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

    def validate(self):
        for datapoint in self.data:
            # Check against Mapping rather than dict so that Datapoint (a UserDict) passes.
            if not isinstance(datapoint, Mapping):
                raise ValueError(f"Each datapoint must be a dictionary, got {type(datapoint)}")
            if "input" not in datapoint:
                raise ValueError("Each datapoint must have an 'input' key", datapoint)
            # None is allowed here to stay consistent with Datapoint above.
            if datapoint["input"] is not None and not isinstance(datapoint["input"], (list, dict)):
                raise ValueError(f"The 'input' value must be a list, dictionary, or None, got {type(datapoint['input'])}", datapoint)

    @classmethod
    def from_pd(cls, dataframe: pd.DataFrame, input_column: str):
        # Wrap each value so that every datapoint carries the required 'input' key.
        return cls([{"input": value} for value in dataframe[input_column].to_list()])

    @classmethod
    def from_jsonl(cls, file_path: str):
        # JSONL holds one JSON object per line, so parse line by line.
        with open(file_path, "r") as file:
            return cls([json.loads(line) for line in file if line.strip()])

    @classmethod
    def from_pickle(cls, file_path: str):
        with open(file_path, "rb") as file:
            return cls(pickle.load(file))

Owner Author

MadcowD commented Oct 5, 2024

We need to solve this antipattern (what if we just want to eval something with a criterion on no dataset?):

import ell

dataset = [
    {
        "input": [],
    }
]*10

@ell.simple(model="gpt-4o")
def write_a_bad_poem():
    return "Write a really poorly written poem."


@ell.simple(model="gpt-4o")
def write_a_good_poem():
    return "Write a really well written poem."

@ell.simple(model="gpt-4o", temperature=0.1)
def is_good_poem(poem : str):
    """Include either yes or no in your response at most once but not both."""
    # f-string so that the poem is actually interpolated into the prompt.
    return f"Is this a good poem yes/no? {poem}"

def score(datapoint, output):
    # Ask the judge LMP about the generated poem; a "yes" in its reply scores 1.0.
    return float("yes" in is_good_poem(output).lower())

eval=ell.evaluation.Evaluation(name="poem_eval", dataset=dataset, criteria={"is_good": score})

print("EVALUATING BAD POEM")
result = eval.run(write_a_bad_poem, n_workers=4)

print("EVALUATING GOOD POEM")
result = eval.run(write_a_good_poem, n_workers=4)

Owner Author

MadcowD commented Oct 8, 2024

Add a migration using alembic
https://testdriven.io/blog/fastapi-sqlmodel/

Owner Author

MadcowD commented Oct 8, 2024

Okay. We now have a really good spec, at least for invocation labels and invocation labelers, which lets us define arbitrary rubrics using JSON schemas. The big problem now is dataset serialization and, beyond that, storing invocations with param objects. If I run multiple invocations on the same parameters, for example on the same datapoints, I will duplicate the dataset in my store once for every metric. This seems like it will end up being really slow, so someone (probably me) needs to come through here and redesign the data model to make this a lot more efficient.

The other question is whether, in general, we should have a dataset at all. This matters if we want to reserialize the evaluation at some later point in time rather than just store its version. For example, if I specify an evaluation, I probably do want to look at the evaluation's data. So the picture in ell Studio would be a list of rows that are part of the evaluation.

As part of the migration, we could build a new parameter object, essentially the input contents for an invocation, and have that hashed. These objects would be stored separately and efficiently. The dataset would then be just a list of these blobs, which are the input blobs corresponding to invocations. This is actually the true way to serialize it, because we have a bunch of inputs that we are going to reuse redundantly every single time with variable outputs. Alternatively, we could keep the invocation contents as the true parameters.
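The hashed input blob idea can be sketched as a small content-addressed store. This is only an illustration under stated assumptions: `hash_input` and `InputBlobStore` are hypothetical names, not part of ell, and the sketch assumes inputs are JSON-serializable.

```python
import hashlib
import json
from typing import Any, Dict, List


def hash_input(input_contents: Any) -> str:
    """Return a stable SHA-256 hex digest of JSON-serializable input contents."""
    # sort_keys makes the digest independent of dict key order.
    canonical = json.dumps(input_contents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InputBlobStore:
    """Hypothetical in-memory store that keeps each distinct input exactly once."""

    def __init__(self) -> None:
        self._blobs: Dict[str, Any] = {}

    def put(self, input_contents: Any) -> str:
        blob_id = hash_input(input_contents)
        # Storing an input that is already present is a no-op.
        self._blobs.setdefault(blob_id, input_contents)
        return blob_id

    def get(self, blob_id: str) -> Any:
        return self._blobs[blob_id]

    def __len__(self) -> int:
        return len(self._blobs)


store = InputBlobStore()
# A dataset becomes a list of blob ids rather than a list of copied inputs.
dataset_blob_ids: List[str] = [
    store.put({"text": "Hello world"}),
    store.put([1, 2, 3]),
    store.put({"text": "Hello world"}),  # duplicate input, same id as the first
]
```

With this layout, running more metrics or more evaluation runs over the same datapoints only adds references to existing blob ids, not new copies of the inputs.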

For the evaluation view, we definitely want to have the dataset and then different evaluation runs. So you can view, I guess it would be three tabs: dataset, invocations, and runs/metrics. Clearly, we need to have first-class support for the dataset, and the dataset schema itself will have an input and then a bunch of other objects on it. These input objects themselves would be for each data point, so each row in the dataset. Another issue is that there's a dataset and a datapoint class, basically. The dataset class contains a list of datapoints. If we wanted to be thorough about this, we would reengineer the blob to separate out these unique objects.

And then we could really go the Weights and Biases route. We would basically define a dataset object, just like Weave does. That dataset is then automatically added to the store, which is not necessarily something you want. But it will inherently be added to the store because we are using the dataset in the eval. So this is tough, right? I could just say, "Hey, here's the eval. There's a dataset object. It's this size." If you wanted to actually look into the dataset, you would need to open it with Python and so on, except once the eval has been run for the first time, because running the eval commits it to the database. And that's the philosophy: if we don't run the eval, nothing changes. But what if, for example, I want to change the metric? I don't want to rerun all the completions again. So that's a flaw with this design as well.
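One possible way around the rerun problem, sketched here with hypothetical helpers (`run_once`, `score_stored`) and a fake LMP standing in for a real model call, is to store the outputs of a run once and apply new metrics to the stored pairs. This is not how ell does it; it only illustrates separating generation from scoring.

```python
from typing import Any, Callable, Dict, List, Tuple

Metric = Callable[[Dict[str, Any], str], float]

call_count = 0  # counts how often the (fake) model is actually called


def fake_lmp(name: str) -> str:
    """Stand-in for an expensive LMP call."""
    global call_count
    call_count += 1
    return f"Hello, {name}!"


def run_once(lmp: Callable[..., str], dataset: List[Dict[str, Any]]) -> List[Tuple[Dict[str, Any], str]]:
    """Generate one output per datapoint and keep (datapoint, output) pairs."""
    return [(datapoint, lmp(*datapoint["input"])) for datapoint in dataset]


def score_stored(results: List[Tuple[Dict[str, Any], str]], metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Average each metric over already-stored outputs without calling the LMP."""
    return {
        name: sum(fn(datapoint, output) for datapoint, output in results) / len(results)
        for name, fn in metrics.items()
    }


dataset = [{"input": ["Ada"]}, {"input": ["Grace"]}]
stored = run_once(fake_lmp, dataset)

# Changing the metric afterwards reuses the stored outputs.
length_scores = score_stored(stored, {"length": lambda dp, out: float(len(out))})
name_scores = score_stored(stored, {"has_name": lambda dp, out: float(dp["input"][0] in out)})
```

Here the fake LMP is called once per datapoint, no matter how many metrics are computed afterwards.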

Owner Author

MadcowD commented Oct 8, 2024

Overall thinking here is we need to redesign our store spec to have better support for redundant entities.

Owner Author

MadcowD commented Oct 8, 2024

What about multimodal feedback?

Owner Author

MadcowD commented Oct 9, 2024

erDiagram
    SerializedLMP ||--o{ Invocation : "has"
    SerializedLMP ||--o{ SerializedLMPUses : "uses/used_by"
    SerializedLMP ||--o{ EvaluationRun : "evaluated_in"
    Invocation ||--|| InvocationContents : "has"
    Invocation ||--o{ InvocationTrace : "consumes/consumed_by"
    Invocation ||--o{ EvaluationResultDatapoint : "labeled_in"
    Evaluation ||--|{ EvaluationLabeler : "has"
    Evaluation ||--|{ EvaluationRun : "has"
    EvaluationRun ||--|{ EvaluationResultDatapoint : "has"
    EvaluationRun ||--|{ EvaluationRunLabelerSummary : "has"
    EvaluationLabeler ||--|{ EvaluationLabel : "has"
    EvaluationLabeler ||--|{ EvaluationRunLabelerSummary : "has"
    EvaluationResultDatapoint ||--|{ EvaluationLabel : "has"

    SerializedLMP {
        string lmp_id PK
        string name
        string source
        LMPType lmp_type
    }
    Invocation {
        string id PK
        string lmp_id FK
    }
    InvocationContents {
        string invocation_id PK,FK
    }
    Evaluation {
        string id PK
        string name
        string dataset_hash
    }
    EvaluationLabeler {
        int id PK
        string name
        EvaluationLabelerType type
    }
    EvaluationRun {
        int id PK
        int evaluation_id FK
        string evaluated_lmp_id FK
    }
    EvaluationResultDatapoint {
        int id PK
        string invocation_being_labeled_id FK
        string evaluation_run_id FK
    }
    EvaluationLabel {
        int labeled_datapoint_id PK,FK
        string labeler_id PK,FK
    }
    EvaluationRunLabelerSummary {
        string evaluation_run_id PK,FK
        string evaluation_labeler_id PK,FK
    }
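To make the label side of the diagram concrete, here is a minimal sketch using Python's built-in `sqlite3`. It covers only three of the entities and rests on assumptions: the table names are snake_case guesses, all keys are integers (the diagram mixes `int` and `string`), and `label_value` is a hypothetical column that the diagram does not show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE evaluation_labeler (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE evaluation_result_datapoint (
        id INTEGER PRIMARY KEY,
        invocation_being_labeled_id TEXT NOT NULL,
        evaluation_run_id INTEGER NOT NULL
    );
    CREATE TABLE evaluation_label (
        labeled_datapoint_id INTEGER NOT NULL REFERENCES evaluation_result_datapoint(id),
        labeler_id INTEGER NOT NULL REFERENCES evaluation_labeler(id),
        label_value REAL,  -- hypothetical column, not in the diagram
        PRIMARY KEY (labeled_datapoint_id, labeler_id)
    );
    """
)

conn.execute("INSERT INTO evaluation_labeler (id, name) VALUES (1, 'is_good')")
conn.executemany(
    "INSERT INTO evaluation_result_datapoint VALUES (?, ?, ?)",
    [(1, "inv-a", 1), (2, "inv-b", 1)],
)
conn.executemany(
    "INSERT INTO evaluation_label VALUES (?, ?, ?)",
    [(1, 1, 1.0), (2, 1, 0.0)],
)

# One summary row per (run, labeler), roughly what EvaluationRunLabelerSummary would hold.
summary = conn.execute(
    """
    SELECT d.evaluation_run_id, l.labeler_id, AVG(l.label_value)
    FROM evaluation_label AS l
    JOIN evaluation_result_datapoint AS d ON d.id = l.labeled_datapoint_id
    GROUP BY d.evaluation_run_id, l.labeler_id
    """
).fetchall()
```

The composite primary key on `evaluation_label` means each labeler can label a given datapoint at most once, which matches the `PK,FK` pair in the diagram.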

Owner Author

MadcowD commented Oct 9, 2024

For the no-input specification problem, ideally we do something like this:


# This is the ideal way of doing this.
eval = Evaluation(
    name="swear words",
    n_evals=100, # get more statistical significance
    metrics={"swear_words_appeared": lambda datapoint, output: output.count("shit")})

eval.run(lmp, n_workers=10) # can leverage workers however we want fundamentally.
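A minimal sketch of how an `n_evals` argument could behave, assuming it simply stands in for a dataset of empty inputs. `SketchEvaluation` is a hypothetical stand-in written for illustration, not ell's `Evaluation` class, and it only handles list-style inputs.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Any, Callable, Dict, List, Optional

Metric = Callable[[Dict[str, Any], str], float]


class SketchEvaluation:
    """Hypothetical evaluation that accepts either a dataset or n_evals."""

    def __init__(
        self,
        name: str,
        metrics: Dict[str, Metric],
        dataset: Optional[List[Dict[str, Any]]] = None,
        n_evals: Optional[int] = None,
    ) -> None:
        if (dataset is None) == (n_evals is None):
            raise ValueError("Provide exactly one of dataset or n_evals")
        self.name = name
        self.metrics = metrics
        # With no dataset, synthesize n_evals empty-input datapoints internally.
        self.dataset = dataset if dataset is not None else [{"input": []} for _ in range(n_evals)]

    def run(self, lmp: Callable[..., str], n_workers: int = 1) -> Dict[str, float]:
        def evaluate(datapoint: Dict[str, Any]) -> Dict[str, float]:
            output = lmp(*datapoint["input"])
            return {name: float(fn(datapoint, output)) for name, fn in self.metrics.items()}

        # Threads run the datapoints concurrently; pool.map preserves dataset order.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            rows = list(pool.map(evaluate, self.dataset))
        return {name: mean(row[name] for row in rows) for name in self.metrics}


def constant_lmp() -> str:
    """Stand-in for a zero-argument LMP."""
    return "darn darn"


sketch_eval = SketchEvaluation(
    name="word count",
    n_evals=5,
    metrics={"darn_count": lambda datapoint, output: output.count("darn")},
)
sketch_summary = sketch_eval.run(constant_lmp, n_workers=2)
```

The caller never has to write the `[{"input": []}] * 10` placeholder; the repetition count is an explicit parameter instead.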

Owner Author

MadcowD commented Oct 12, 2024

UX is getting there
