Add summary backfill #1948
| def normalize_summary(summary: str) -> str: | ||
| strip_summary = re.sub(r"^Summary:", "", summary) | 
Remove only the leading Summary: prefix, i.e. when it appears at the very start of the text.
Why not move this and similar comments on the PR to inline comment in the script? Will help future readers.
|  | ||
| def normalize_summary(summary: str) -> str: | ||
| strip_summary = re.sub(r"^Summary:", "", summary) | ||
| lines = strip_summary.splitlines() | 
Split by newlines
| def normalize_summary(summary: str) -> str: | ||
| strip_summary = re.sub(r"^Summary:", "", summary) | ||
| lines = strip_summary.splitlines() | ||
| handle_list_items = [re.sub(r"^- ", "", x) for x in lines] | 
For any line that looks like "- some text here", keep just "some text here".
        
          
llm/normalize_summaries.py (outdated)
      | strip_summary = re.sub(r"^Summary:", "", summary) | ||
| lines = strip_summary.splitlines() | ||
| handle_list_items = [re.sub(r"^- ", "", x) for x in lines] | ||
| handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""] | 
Strip all extraneous whitespace and filter any empty lines
| lines = strip_summary.splitlines() | ||
| handle_list_items = [re.sub(r"^- ", "", x) for x in lines] | ||
| handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""] | ||
| return " ".join(handle_remaining_whitespace) | 
Put everything back together
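Taken together, the review comments above describe the whole function. The following is a sketch assembled from the diff lines quoted in this thread, with those comments moved inline as a reviewer suggests; the sample input in the usage note is made up for illustration and is not real LLM output.

```python
import re


def normalize_summary(summary: str) -> str:
    # Remove only a leading "Summary:" prefix. Without re.MULTILINE, "^" anchors
    # at the start of the whole string, not at the start of every line.
    strip_summary = re.sub(r"^Summary:", "", summary)
    # Split by newlines.
    lines = strip_summary.splitlines()
    # For any line that looks like "- some text here", keep "some text here".
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    # Strip all extraneous whitespace and filter out any empty lines.
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
    # Put everything back together as a single line.
    return " ".join(handle_remaining_whitespace)
```

For example, `normalize_summary("Summary:\n- First point.\n- Second point.\n")` returns `"First point. Second point."`.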
| databases/ | ||
| .secret.local | ||
| .secret.local | ||
| summaries-and-topics.csv | 
This is generated by running llm/backfill_summaries.py and I assume we don't want to accidentally commit that.
3e0c451 to 2652ed9
LGTM
Looking great! Just a few comments, all minor stuff.
| @@ -0,0 +1,9 @@ | |||
| import re | |||
Could you add a top-level comment to each Python file briefly explaining its purpose?
        
          
llm/backfill_summaries.py (outdated)
      | @@ -0,0 +1,82 @@ | |||
| # This script fills any missing 'summary' or 'topics' fields on the data model. | |||
Very minor: recommend changing this to a multiline docstring for readability, i.e.
"""This script...
"""
Recommend also explaining in this comment that, by default, the script queries session 194 bills, writes its output to a local CSV, and also automatically updates the records in Firebase, IIUC.
| import csv | ||
| from normalize_summaries import normalize_summary | ||
|  | ||
| # Application Default credentials are automatically created. | 
Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here?
Good question. As far as I know, yes: see the "Contributing Backend Features to Dev/Prod" section of the main README.md file.
|  | ||
| # Conceptually, we want to return a very consistent format when generating status reports. | ||
| # It would allow us to skip LLM regeneration when moving from dev to production. | ||
| def make_bill_summary(bill_id, status, summary, topics): | 
Not sure I understand the purpose of this. Isn't the standard way to use csv.writer.writerow to just pass it a list of strings, a la here?
Discussed over Zoom, added the doc comment below to try to clarify the purpose.
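To make the exchange above concrete, here is a minimal sketch of how a row-building helper and `csv.writer.writerow` fit together. The `return` line of `make_bill_summary` is copied from the diff; `rows_to_csv`, the bill IDs, `"example_status"`, and the list-valued `topics` are assumptions made up for this illustration, not taken from the PR.

```python
import csv
import io


def make_bill_summary(bill_id, status, summary, topics):
    # Return line as quoted in the diff: every field is formatted as a string,
    # so a missing value (None) ends up as the literal text "None".
    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]


def rows_to_csv(rows):
    """Hypothetical helper: write rows to an in-memory CSV and return the text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        # writerow takes a plain list of strings, one item per column.
        writer.writerow(row)
    return buffer.getvalue()


csv_text = rows_to_csv([
    make_bill_summary("H0001", "failed_summary", None, None),
    make_bill_summary("H0002", "example_status", "A short summary.", ["Topic A", "Topic B"]),
])

# Read the text back to confirm every row has the same four columns.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
```

Because every row goes through the same helper, each line in the file has the same four columns in the same order, whether or not summary generation succeeded.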
        
          
llm/backfill_summaries.py (outdated)
      | return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"] | ||
|  | ||
|  | ||
| bills_ref = db.collection("generalCourts/194/bills") | 
Recommend moving global constants to the top of the script and using the ALL_CAPS naming convention, per PEP 8.
        
          
llm/backfill_summaries.py (outdated)
      |  | ||
| bills_ref = db.collection("generalCourts/194/bills") | ||
| bills = bills_ref.get() | ||
| with open("./summaries-and-topics.csv", "w") as csvfile: | 
Recommend making this filename a global constant surfaced at top of script.
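As a rough sketch of what the top of the script could look like with these suggestions applied (a module docstring plus ALL_CAPS constants): the two string values are taken from the diff, while the constant names and the docstring wording are hypothetical and only reflect the reviewers' description of the script.

```python
"""Backfill missing 'summary' or 'topics' fields on bills.

Illustrative docstring based on the review discussion: by default the script
queries session 194 bills, writes its results to a local CSV file, and writes
the generated values back to Firestore.
"""

# Firestore collection path queried by default (value from the diff; name is hypothetical).
BILLS_COLLECTION_PATH = "generalCourts/194/bills"

# Local CSV output file (value from the diff; name is hypothetical).
OUTPUT_CSV_PATH = "./summaries-and-topics.csv"
```

The rest of the script would then refer to `BILLS_COLLECTION_PATH` and `OUTPUT_CSV_PATH` instead of repeating the string literals.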
| summary = get_summary_api_function(bill_id, document_title, document_text) | ||
| if summary["status"] in [-1, -2]: | ||
| csv_writer.writerow( | ||
| make_bill_summary(bill_id, "failed_summary", None, None) | 
Is this string "failed_summary" going to show up directly on the site? If so, can we make it more informative to the user, e.g., "NOTE: summary generation failed for this bill; summary not available."
It does not end up on the site! Only in the CSV for following up!
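Since the status only lives in the CSV, following up could be as simple as scanning that file for failed rows. This is a sketch under assumptions: the column order (bill ID, status, summary, topics) is inferred from `make_bill_summary`, the file is assumed to have no header row, and `failed_bill_ids` and the sample rows are hypothetical.

```python
import csv
import io

FAILED_STATUS = "failed_summary"  # status string used in the diff


def failed_bill_ids(csv_file):
    """Return the bill IDs whose status column marks a failed summary."""
    return [
        row[0]
        for row in csv.reader(csv_file)
        if len(row) >= 2 and row[1] == FAILED_STATUS
    ]


# Made-up sample content in the assumed four-column layout.
sample = io.StringIO(
    "H0001,failed_summary,None,None\r\n"
    "H0002,example_status,A short summary.,[]\r\n",
    newline="",
)
failed = failed_bill_ids(sample)
```

Against the real output, the same function could be called on `open("./summaries-and-topics.csv", newline="")`.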
| continue | ||
| # Note: `normalize_summary` does some post-processing to clean up the summaries | ||
| # As of 2025-10-21 this was necessary due to the LLM prompt | ||
| summary = normalize_summary(summary["summary"]) | 
It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda?
Yes we do, good call out.
Summary
Add a summary/topics backfill script. This is really intended as a one-off script, but it could be useful if the document trigger didn't work properly.
First test, this looks good enough to attempt to write back to firebase.
https://console.firebase.google.com/u/0/project/digital-testimony-dev/firestore/databases/-default-/data/~2FgeneralCourts~2F194~2Fbills~2FH3872 <- has an example where I set both summary and topics.
Here is an example of the corrected summary formatting with appropriate CSV output.
As additional proof, I can read it back in via pandas
Checklist
Screenshots
Add some screenshots highlighting your changes.
Known issues
If you've run against limitations or caveats, include them here. Include follow-up issues as well.
Steps to test/reproduce
For each feature or bug fix, create a step by step list for how a reviewer can test it out. E.g.: