Joining dataframes #16

dave-killough · 2023-04-20T23:36:03Z

Nice. I've been doing similar things with SQL sources. I'm curious why you attach the ask to a specific dataframe. Can multiple dataframes be considered in one request? Thanks

bluecoconut · 2023-04-24T05:43:16Z

Hey @dave-killough ,

I attached to a specific dataframe because it was the easiest to "gather" the local context. In a python kernel, just doing sketch.ask(...) and inspecting the locals() in the same parent as the call is possible but potentially leading to lots of mistaken variables and too many dataframes. Also, this is somewhat rough in jupyter kernel style sessions, where navigating the call stack and finding other variables seemed to be a bit difficult (though if you have a trick here, this might make this really easy).

In many jupyter notebooks, people often re-use the same dataframe as well (editing in place), so the clarity of the answer I felt should be contextualized to a single dataframe.

One option for multiple joining I considered was something like

df.sketch.howto("question", extra_contextual_dfs=[*dfs])

but that didn't feel "easy to type" / "native".

Do you have any ideas or preferences for what would make a good API?

dave-killough · 2023-04-24T13:07:21Z

Thanks for the explanation. Specifying the dataframes to include is important. There are often many working dataframes created in a notebook that should be excluded from end-result queries. It seems useful to specify the dataframes once, and then have a higher level ask function that can connect the dataframes. ChatGPT is quite good at modeling table relationships from just schema, and we're getting effective SQL joins as a result. I expect it could work similarly with dataframe merges.

Maybe something like this:

sketch.dataset([client_df, purchase_df, contact_df])
sketch.ask("who bought the most last year?")

dnk8n · 2023-04-27T07:44:08Z

Asking questions across multiple dataframes would be very interesting. I was hoping to do something like this:

import pandas as pd
import sketch
from sqlalchemy import create_engine

# Define a class to store views
class Views:
    def __init__(self, schema_name, db_url):
        self.views = {}
        self.history = []
        self.engine = create_engine(db_url)
        self.schema_name = schema_name

        # Get all views in the schema
        views = pd.read_sql_query(f"SELECT table_name FROM information_schema.views WHERE table_schema = '{schema_name}'", con=self.engine)

        # Loop through each view and read it into a Pandas dataframe
        for view in views['table_name']:
            df = pd.read_sql_query(f"SELECT * FROM {schema_name}.{view}", con=self.engine)
            self.add_view(view, df)

    def add_view(self, name, df):
        self.views[name] = df

    def ask(self, question):
        answer = sketch.ask(question, self.views.items())
        self.history.append((question, answer))

# Connect to Postgres
db_url = 'postgresql://user:password@localhost:5432/mydatabase'
schema_name = 'my_schema'

# Create an instance of Views
my_views = Views(schema_name, db_url)

# Ask questions of all views and store history
my_views.ask('What are the top problematic projects with respect to planned vs actual time tracked?')
my_views.ask('Which activities appear to require upskilling?')

# Interrogate my_views.history for example

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Joining dataframes #16

Joining dataframes #16

dave-killough commented Apr 20, 2023

bluecoconut commented Apr 24, 2023

dave-killough commented Apr 24, 2023

dnk8n commented Apr 27, 2023

Joining dataframes #16

Joining dataframes #16

Comments

dave-killough commented Apr 20, 2023

bluecoconut commented Apr 24, 2023

dave-killough commented Apr 24, 2023

dnk8n commented Apr 27, 2023