Collaborator

@chiroptical chiroptical commented Sep 24, 2025

Summary

Add a summary/topics backfill script. This is really intended as a one-off script, but it could be useful if the document trigger doesn't work properly.

First test: this looks good enough to attempt writing back to Firebase.

H3284,generated_summary,Summary: The bill proposes a program to help state agencies and departments prioritize the hiring of military veterans and individuals who have completed service with the Peace Corps, AmeriCorps, and Commonwealth Corps. It aims to improve recruitment, development, and retention of these individuals in public sector jobs. The human resources division would provide certifications to confirm their status as veterans or alumni of these programs. This initiative seeks to enhance employment opportunities for those who have served in these capacities.,None
https://console.firebase.google.com/u/0/project/digital-testimony-dev/firestore/databases/-default-/data/~2FgeneralCourts~2F194~2Fbills~2FH3872 <- has an example where I set both summary and topics.

Here is an example of the corrected summary formatting with appropriate CSV output:

H4602,generated_topics,"The bill proposes to increase the Monson select board from 3 to 5 members, allowing for broader representation. If the bill is passed, three new select board members would be elected at the next annual town election, with varying term lengths based on the number of votes received. After these initial elections, all future select board members would serve 3-year terms. The bill would take effect as soon as it is passed.","[{'topic': 'Political advertising', 'category': 'Government Operations and Elections'}, {'topic': 'Government studies and investigations', 'category': 'Government Operations and Elections'}, {'topic': 'Community life and organization', 'category': 'Housing and Community Development'}]"
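The quoting in the row above is what Python's csv module does by default when a field contains commas. A minimal sketch, assuming the script writes rows with `csv.writer`; the row values below are shortened and illustrative, not the real output:

```python
import csv
import io

# Illustrative row: both the summary and the topics field contain commas.
row = [
    "H4602",
    "generated_topics",
    "The bill proposes to increase the Monson select board from 3 to 5 members, allowing for broader representation.",
    str([{"topic": "Political advertising", "category": "Government Operations and Elections"}]),
]

# Write the row to an in-memory buffer instead of a file.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()

# Reading the text back yields the original four fields, commas intact.
round_tripped = next(csv.reader(io.StringIO(csv_text)))
```

Because the writer quotes any field containing the delimiter, the summary and topics survive a round trip as single fields.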

As additional proof, I can read it back in via pandas:

> python
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_csv("~/summaries-and-topics.csv")
>>> df.head()
  bill_id            status summary topics
0      H1           skipped     NaN    NaN
1     H10  previous_summary     NaN    NaN
2    H100  previous_summary     NaN    NaN
3   H1000  previous_summary     NaN    NaN
4   H1001  previous_summary     NaN    NaN
>>>

@vercel
vercel bot commented Sep 24, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
maple-dev Ready Ready Preview Comment Oct 29, 2025 1:33am



def normalize_summary(summary: str) -> str:
    strip_summary = re.sub(r"^Summary:", "", summary)
Collaborator Author

Remove only the `Summary:` prefix at the very beginning of the string.

Collaborator

Why not move this and similar PR comments into inline comments in the script? It will help future readers.


def normalize_summary(summary: str) -> str:
    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
Collaborator Author

Split by newlines

def normalize_summary(summary: str) -> str:
    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
Collaborator Author

For any line that looks like `- some text here.`, give us `some text here.`

    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
Collaborator Author

Strip all extraneous whitespace and filter any empty lines

    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
    return " ".join(handle_remaining_whitespace)
Collaborator Author

Put everything back together
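Assembled from the hunks above, the whole function can be sketched with the step-by-step PR comments folded in as inline comments. The example input at the bottom is made up for illustration:

```python
import re


def normalize_summary(summary: str) -> str:
    # Remove the "Summary:" prefix. Without re.MULTILINE, "^" only
    # matches at the very start of the string.
    strip_summary = re.sub(r"^Summary:", "", summary)
    # Split by newlines.
    lines = strip_summary.splitlines()
    # Turn list items such as "- some text here." into "some text here."
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    # Strip extraneous whitespace and drop empty lines.
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
    # Put everything back together as a single line.
    return " ".join(handle_remaining_whitespace)


# Made-up input: a prefixed summary followed by a short bullet list.
example = "Summary: The bill does two things.\n\n- It adds seats.\n- It sets terms.  \n"
normalized = normalize_summary(example)
```

Here `normalized` is the single line `The bill does two things. It adds seats. It sets terms.`, which fits in one CSV field without embedded newlines.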

databases/
.secret.local
.secret.local
summaries-and-topics.csv
Collaborator Author

This is generated by running llm/backfill_summaries.py and I assume we don't want to accidentally commit that.

Collaborator

@jicruz96 jicruz96 left a comment

LGTM

Collaborator

@nesanders nesanders left a comment

Looking great! Just a few comments, all minor stuff.

@@ -0,0 +1,9 @@
import re
Collaborator

Can I request that you add a top-level comment to all Python files briefly explaining their purpose?



@@ -0,0 +1,82 @@
# This script fills any missing 'summary' or 'topics' fields on the data model.
Collaborator

Very minor: recommend changing to multiline comment for readability, i.e.

"""This script...
"""

Collaborator

Recommend adding to the comment an explanation that, by default, it queries session 194 bills, writes the output to a local CSV, and also automatically updates the documents in Firebase, IIUC.

import csv
from normalize_summaries import normalize_summary

# Application Default credentials are automatically created.
Collaborator

Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here?

Collaborator Author

Good question. As far as I know, yes: see the "Contributing Backend Features to Dev/Prod" section in the main README.md file.


# Conceptually, we want to return a very consistent format when generating status reports.
# It would allow us to skip LLM regeneration when moving from dev to production.
def make_bill_summary(bill_id, status, summary, topics):
Collaborator

Not sure I understand the purpose of this. Isn't the standard way to use csv.writer.writerow to just pass it a list of strings, a la here?

Collaborator Author

Discussed over Zoom, added the doc comment below to try to clarify the purpose.
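A minimal sketch of how the helper fits with `csv.writer.writerow`: it returns a plain list of strings in a fixed column order, so every status path writes the same row shape. The function body is the one from the diff; the docstring wording and the sample rows are illustrative, not the doc comment that was actually added:

```python
import csv
import io


def make_bill_summary(bill_id, status, summary, topics):
    """Return one CSV row in a fixed column order: bill_id, status, summary, topics.

    Every value is rendered as a string, so a missing summary or topics
    value becomes the literal text "None".
    """
    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]


# Write two illustrative rows to an in-memory buffer instead of a file.
buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(make_bill_summary("H1", "skipped", None, None))
csv_writer.writerow(make_bill_summary("H3284", "generated_summary", "A short summary.", None))
output = buffer.getvalue()
```

Routing every row through one helper keeps the column order in a single place, which is one way to get the consistent format the code comment asks for.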

    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]


bills_ref = db.collection("generalCourts/194/bills")
Collaborator

Recommend moving global constants to top of script and using ALL_CAPS naming convention, per PEP8.


bills_ref = db.collection("generalCourts/194/bills")
bills = bills_ref.get()
with open("./summaries-and-topics.csv", "w") as csvfile:
Collaborator

Recommend making this filename a global constant surfaced at top of script.
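Taken together, the review suggestions about a module docstring and top-of-script constants could look roughly like the sketch below. The constant names are hypothetical, the values are the ones visible in the diff, and the docstring describes the behavior only as understood in the review ("IIUC"):

```python
"""Backfill missing 'summary' or 'topics' fields on bills (illustrative header).

Assumed behavior, per the review discussion: queries session 194 bills by
default, writes the results to a local CSV, and updates the documents in
Firebase.
"""

# Hypothetical constant names; the values are the ones used in the script.
GENERAL_COURT = 194
BILLS_COLLECTION_PATH = f"generalCourts/{GENERAL_COURT}/bills"
OUTPUT_CSV_PATH = "./summaries-and-topics.csv"
```

With constants like these, the later lines would read `db.collection(BILLS_COLLECTION_PATH)` and `open(OUTPUT_CSV_PATH, "w")`, so switching sessions or output files becomes a one-line change.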

summary = get_summary_api_function(bill_id, document_title, document_text)
if summary["status"] in [-1, -2]:
    csv_writer.writerow(
        make_bill_summary(bill_id, "failed_summary", None, None)
Collaborator

Is this string "failed_summary" going to show up directly on the site? If so, can we make it more informative to the user, e.g., "NOTE: summary generation failed for this bill, summary not available."

Collaborator Author

It does not end up on the site! Only in the CSV for following up!
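Since the status only lives in the CSV, following up means filtering that file. A minimal sketch, assuming the CSV has a header row with the column names shown in the pandas output earlier (`bill_id`, `status`, `summary`, `topics`); the rows below are made up:

```python
import csv
import io

# Made-up CSV contents standing in for summaries-and-topics.csv.
csv_text = (
    "bill_id,status,summary,topics\n"
    "H1,skipped,None,None\n"
    "H2,failed_summary,None,None\n"
    "H3,generated_summary,A short summary.,None\n"
)

# Collect the bills whose summary generation failed.
reader = csv.DictReader(io.StringIO(csv_text))
failed_bill_ids = [row["bill_id"] for row in reader if row["status"] == "failed_summary"]
```

For the real file, replacing `io.StringIO(csv_text)` with `open("./summaries-and-topics.csv", newline="")` would give the list of bills to retry.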

    continue
# Note: `normalize_summary` does some post-processing to clean up the summaries
# As of 2025-10-21 this was necessary due to the LLM prompt
summary = normalize_summary(summary["summary"])
Collaborator

It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda?

Collaborator Author

Yes we do, good call out.
