Add summary backfill #1948

chiroptical · 2025-09-24T00:17:58Z

Summary

Add a summary/topics backfill script. This is really intended as a one-off script, but it could be useful if the document trigger didn't work properly.

From a first test, the output looks good enough to attempt writing back to Firebase:

H3284,generated_summary,Summary: The bill proposes a program to help state agencies and departments prioritize the hiring of military veterans and individuals who have completed service with the Peace Corps, AmeriCorps, and Commonwealth Corps. It aims to improve recruitment, development, and retention of these individuals in public sector jobs. The human resources division would provide certifications to confirm their status as veterans or alumni of these programs. This initiative seeks to enhance employment opportunities for those who have served in these capacities.,None

https://console.firebase.google.com/u/0/project/digital-testimony-dev/firestore/databases/-default-/data/~2FgeneralCourts~2F194~2Fbills~2FH3872 <- has an example where I set both summary and topics.

Here is an example of the corrected summary formatting with appropriate CSV output:

H4602,generated_topics,"The bill proposes to increase the Monson select board from 3 to 5 members, allowing for broader representation. If the bill is passed, three new select board members would be elected at the next annual town election, with varying term lengths based on the number of votes received. After these initial elections, all future select board members would serve 3-year terms. The bill would take effect as soon as it is passed.","[{'topic': 'Political advertising', 'category': 'Government Operations and Elections'}, {'topic': 'Government studies and investigations', 'category': 'Government Operations and Elections'}, {'topic': 'Community life and organization', 'category': 'Housing and Community Development'}]"

As additional proof, I can read it back in via pandas:

> python
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_csv("~/summaries-and-topics.csv")
>>> df.head()
  bill_id            status summary topics
0      H1           skipped     NaN    NaN
1     H10  previous_summary     NaN    NaN
2    H100  previous_summary     NaN    NaN
3   H1000  previous_summary     NaN    NaN
4   H1001  previous_summary     NaN    NaN
>>>
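One caveat when reading the file back: judging from the H4602 row above, the topics column appears to hold the Python repr of a list of dicts (single-quoted keys), which is not valid JSON. Below is a minimal sketch of parsing it with the standard library only. It assumes the four column names shown by df.head(); the sample rows are abbreviated or made up, and the helper name parse_topics is hypothetical.

```python
import ast
import csv
import io

# Abbreviated sample in the same shape as the rows shown above (assumed layout).
SAMPLE_CSV = (
    "bill_id,status,summary,topics\n"
    "H1,skipped,None,None\n"
    'H4602,generated_topics,"The bill proposes to increase the Monson select board.",'
    "\"[{'topic': 'Political advertising', "
    "'category': 'Government Operations and Elections'}]\"\n"
)


def parse_topics(cell: str) -> list:
    """Turn the topics cell back into a list of dicts (hypothetical helper)."""
    if cell in ("", "None"):
        return []
    # literal_eval safely evaluates Python literals; json.loads would reject
    # the single-quoted keys.
    return ast.literal_eval(cell)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(row["bill_id"], parse_topics(row["topics"]))
```

If the topics are ever written with json.dumps instead, json.loads would be the matching reader.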

Checklist

On the frontend, I've made my strings translate-able.
If I've added shared components, I've added a storybook story.
I've made pages responsive and look good on mobile.

llm/.gitignore

llm/backfill_summaries.py

chiroptical · 2025-10-15T01:39:21Z

llm/normalize_summaries.py

+
+
+def normalize_summary(summary: str) -> str:
+    strip_summary = re.sub(r"^Summary:", "", summary)


Remove just Summary: that is present at the beginning of the line.

Why not move this and similar PR comments into inline comments in the script? It will help future readers.

chiroptical · 2025-10-15T01:39:33Z

llm/normalize_summaries.py

+
+def normalize_summary(summary: str) -> str:
+    strip_summary = re.sub(r"^Summary:", "", summary)
+    lines = strip_summary.splitlines()


Split by newlines

chiroptical · 2025-10-15T01:40:00Z

llm/normalize_summaries.py

+def normalize_summary(summary: str) -> str:
+    strip_summary = re.sub(r"^Summary:", "", summary)
+    lines = strip_summary.splitlines()
+    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]


For any line that looks like "- some text here.", give us "some text here."

chiroptical · 2025-10-15T01:40:25Z

llm/normalize_summaries.py

+    strip_summary = re.sub(r"^Summary:", "", summary)
+    lines = strip_summary.splitlines()
+    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
+    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]


Strip all extraneous whitespace and filter out any empty lines.

chiroptical · 2025-10-15T01:40:35Z

llm/normalize_summaries.py

+    lines = strip_summary.splitlines()
+    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
+    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
+    return " ".join(handle_remaining_whitespace)


Put everything back together
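Assembled from the diff hunks above, with the review notes folded in as inline comments, the function can be sketched as follows. The function body is taken from the diff; the sample input is made up for illustration.

```python
import re


def normalize_summary(summary: str) -> str:
    # Remove just the "Summary:" prefix at the very start of the text.
    strip_summary = re.sub(r"^Summary:", "", summary)
    # Split by newlines.
    lines = strip_summary.splitlines()
    # Turn "- some text here." into "some text here.".
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    # Strip extraneous whitespace and drop empty lines.
    handle_remaining_whitespace = [
        x.strip() for x in handle_list_items if x.strip() != ""
    ]
    # Put everything back together as a single line.
    return " ".join(handle_remaining_whitespace)


# Made-up input resembling LLM output with a prefix and list items.
raw = "Summary: The bill does two things.\n\n- Adds a program.\n- Funds it.  \n"
print(normalize_summary(raw))
```

One thing to be aware of: the list-item pattern runs before the whitespace strip, so an indented bullet such as "  - text" would keep its dash. Whether the LLM ever emits indented bullets is not something this thread establishes.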

llm/test_bill_on_document_created.py

chiroptical · 2025-10-22T00:24:13Z

llm/.gitignore

 databases/
-.secret.local
+.secret.local
+summaries-and-topics.csv


This is generated by running llm/backfill_summaries.py and I assume we don't want to accidentally commit that.

jicruz96

LGTM

nesanders

Looking great! Just a few comments, all minor stuff.

nesanders · 2025-10-29T00:24:09Z

llm/normalize_summaries.py

@@ -0,0 +1,9 @@
+import re


Can I request that you add a top-level comment to all Python files briefly explaining their purpose?

nesanders · 2025-10-29T00:25:16Z

llm/normalize_summaries.py

+
+
+def normalize_summary(summary: str) -> str:
+    strip_summary = re.sub(r"^Summary:", "", summary)


Why not move this and similar PR comments into inline comments in the script? It will help future readers.

nesanders · 2025-10-29T00:27:43Z

llm/backfill_summaries.py

@@ -0,0 +1,82 @@
+# This script fills any missing 'summary' or 'topics' fields on the data model.


Very minor: recommend changing to a multiline comment for readability, i.e.

"""This script... """

Recommend adding to the comment an explanation that, by default, it queries session 194 bills, writes output to a local CSV, and also automatically updates the documents in Firebase, IIUC.

nesanders · 2025-10-29T00:28:38Z

llm/backfill_summaries.py

+import csv
+from normalize_summaries import normalize_summary
+
+# Application Default credentials are automatically created.


Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here?

Good question. As far as I know, yes: see the "Contributing Backend Features to Dev/Prod" section of the main README.md file.

nesanders · 2025-10-29T00:30:16Z

llm/backfill_summaries.py

+
+# Conceptually, we want to return a very consistent format when generated status reports.
+# It would allow us to skip LLM regeneration when moving from dev to production.
+def make_bill_summary(bill_id, status, summary, topics):


Not sure I understand the purpose of this. Isn't the standard way to use csv.writer.writerow to just pass it a list of strings, a la here?

Discussed over Zoom, added the doc comment below to try to clarify the purpose.
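Since the rationale was settled over Zoom, it is not spelled out in this thread. As a rough illustration only, the helper does return a plain list of strings, which is what csv.writer's writerow accepts, and it guarantees every row has the same four fields. The function body below is copied from the diff; the bill IDs and text are made up, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def make_bill_summary(bill_id, status, summary, topics):
    # Same body as in the diff: each field goes through an f-string,
    # so None is written as the literal text "None".
    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]


buffer = io.StringIO()  # stand-in for the real CSV file
csv_writer = csv.writer(buffer)
csv_writer.writerow(make_bill_summary("H1", "failed_summary", None, None))
csv_writer.writerow(
    make_bill_summary(
        "H2",
        "generated_topics",
        "Raises the board from 3 to 5 members, effective on passage.",
        [{"topic": "Community life and organization"}],
    )
)
written = buffer.getvalue()
print(written)
```

The writer quotes the summary because it contains a comma, so the row still reads back as four fields.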

nesanders · 2025-10-29T00:30:38Z

llm/backfill_summaries.py

+    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]
+
+
+bills_ref = db.collection("generalCourts/194/bills")


Recommend moving global constants to the top of the script and using the ALL_CAPS naming convention, per PEP 8.

nesanders · 2025-10-29T00:30:53Z

llm/backfill_summaries.py

+
+bills_ref = db.collection("generalCourts/194/bills")
+bills = bills_ref.get()
+with open("./summaries-and-topics.csv", "w") as csvfile:


Recommend making this filename a global constant surfaced at top of script.
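Taken together, the suggestions about the docstring and the constants might look like the sketch below. The constant names are hypothetical; the values are the literals from the diff. The docstring wording follows the reviewer's own description, including the "IIUC" point about Firebase writes, so it is an assumption rather than confirmed behavior.

```python
"""Fill in any missing 'summary' or 'topics' fields on bills.

By default this queries session 194 bills, writes one status row per bill to
a local CSV, and also updates the corresponding documents in Firebase.
"""

# Hypothetical constant names; values are the literals used in the diff.
GENERAL_COURT = 194
BILLS_COLLECTION_PATH = f"generalCourts/{GENERAL_COURT}/bills"
CSV_OUTPUT_PATH = "./summaries-and-topics.csv"

# Later in the script these would replace the inline literals, e.g.:
#   bills_ref = db.collection(BILLS_COLLECTION_PATH)
#   with open(CSV_OUTPUT_PATH, "w") as csvfile: ...
print(BILLS_COLLECTION_PATH, CSV_OUTPUT_PATH)
```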

nesanders · 2025-10-29T00:33:44Z

llm/backfill_summaries.py

+        summary = get_summary_api_function(bill_id, document_title, document_text)
+        if summary["status"] in [-1, -2]:
+            csv_writer.writerow(
+                make_bill_summary(bill_id, "failed_summary", None, None)


Is this string "failed_summary" going to show up directly on the site? If so, can we make it more informative to the user, e.g., "NOTE: summary generation failed for this bill, summary not available."

It does not end up on the site! Only in the CSV for following up!
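For the follow-up step, a small sketch of pulling the failed bills out of the CSV is shown below. It assumes the header row matches the columns in the pandas output in the PR description; the function name failed_bill_ids and the sample rows are made up.

```python
import csv
import io


def failed_bill_ids(csv_file) -> list:
    """Return the bill IDs whose status is 'failed_summary' (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    return [row["bill_id"] for row in reader if row["status"] == "failed_summary"]


# Made-up rows in the assumed column layout.
sample = io.StringIO(
    "bill_id,status,summary,topics\n"
    "H1,skipped,None,None\n"
    "H2,failed_summary,None,None\n"
)
print(failed_bill_ids(sample))
```

Against the real file, the same function could be called on an open handle to summaries-and-topics.csv.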

nesanders · 2025-10-29T00:34:59Z

llm/backfill_summaries.py

+            continue
+        # Note: `normalize_summary` does some post-processing to clean up the summaries
+        # As of 2025-10-21 this was necessary due to the LLM prompt
+        summary = normalize_summary(summary["summary"])


It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda?

Yes we do, good call out.

chiroptical marked this pull request as ready for review October 8, 2025 01:21

chiroptical requested review from Mephistic, alexjball, kiminkim724, mertbagt, mvictor55, nesanders, sashamaryl and timblais as code owners October 8, 2025 01:21

chiroptical added 4 commits October 21, 2025 20:42

Initial commit

312dbbb

Fill out TODOs

897e30f

Move it so it works

d0331dd

Update documentation on summary script

a9c6457

chiroptical added 6 commits October 21, 2025 20:42

Update

d27c3fd

Remove temporary exit

e7b721a

Progress

afe6ca1

Update with new CSV writer

7127b23

Minor writerow updates

bfb587e

Minor clean-up

2652ed9

chiroptical force-pushed the add_summary_backfill branch from 3e0c451 to 2652ed9 Compare October 22, 2025 00:42

Name the tests

be10fb5

jicruz96 reviewed Oct 29, 2025

nesanders reviewed Oct 29, 2025

Address feedback

00c76b1

Address feedback

100df2b
