3 changes: 2 additions & 1 deletion llm/.gitignore
@@ -1,4 +1,5 @@
venv/
__pycache__/
databases/
.secret.local
.secret.local
summaries-and-topics.csv
Collaborator Author

This is generated by running llm/backfill_summaries.py and I assume we don't want to accidentally commit that.

82 changes: 82 additions & 0 deletions llm/backfill_summaries.py
@@ -0,0 +1,82 @@
# This script fills any missing 'summary' or 'topics' fields on the data model.
Collaborator

Very minor: recommend changing to multiline comment for readability, i.e.

"""This script...
"""

Collaborator

Recommend expanding the comment to explain that, by default, the script queries session 194 bills, writes its output to a local CSV, and also updates the documents in Firebase directly, IIUC.

# The document must have a 'Title' and 'DocumentText' field to generate them.
#
# Developer notes:
# - you'll need to set the 'OPENAI_API_KEY' environment variable
import firebase_admin
from llm_functions import get_summary_api_function, get_tags_api_function_v2
from firebase_admin import firestore
from bill_on_document_created import get_categories_from_topics, CATEGORY_BY_TOPIC
import csv
from normalize_summaries import normalize_summary

# Application Default credentials are automatically created.
Collaborator

Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here?

Collaborator Author

Good question. As far as I know, yes: see the "Contributing Backend Features to Dev/Prod" section of the main README.md file.

Collaborator

Great! Can we link here?

Collaborator Author

By link, do you mean just noting that it exists, or linking to GitHub directly? Or do you expect that we'll use Sphinx or something similar in the future and add a relative link in the Sphinx docs?

app = firebase_admin.initialize_app()
db = firestore.client()


# Conceptually, we want to return a very consistent format when generating status reports.
# It would allow us to skip LLM regeneration when moving from dev to production.
def make_bill_summary(bill_id, status, summary, topics):
    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]
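As a small illustration of the row format produced by `make_bill_summary`, the sketch below writes one header row and one status row to an in-memory buffer instead of a file. The bill ID `"H1"` is a made-up value for illustration. Note that because every field goes through an f-string, a `None` summary or topics value ends up in the CSV as the literal text `None` rather than as an empty cell.

```python
import csv
import io


def make_bill_summary(bill_id, status, summary, topics):
    # Same helper as in the diff: every field is converted to a string.
    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["bill_id", "status", "summary", "topics"])
# "H1" is a hypothetical bill ID used only for this example.
writer.writerow(make_bill_summary("H1", "skipped", None, None))

rows = buffer.getvalue().splitlines()
print(rows[1])  # H1,skipped,None,None
```

Anyone reading the CSV back would need to treat the string `None` as a missing value, which may or may not be the intent here.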


bills_ref = db.collection("generalCourts/194/bills")
Collaborator

Recommend moving global constants to top of script and using ALL_CAPS naming convention, per PEP8.

bills = bills_ref.get()
with open("./summaries-and-topics.csv", "w") as csvfile:
Collaborator

Recommend making this filename a global constant surfaced at top of script.
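The two recommendations above (constants at the top of the script, and a named constant for the output filename) could look roughly like the following. This is only a sketch: the constant names and the `describe_run` helper are hypothetical, while the values are the ones already hard-coded in the script.

```python
# Hypothetical constant names; the values come from the script in this diff.
GENERAL_COURT_SESSION = 194
BILLS_COLLECTION_PATH = f"generalCourts/{GENERAL_COURT_SESSION}/bills"
OUTPUT_CSV_PATH = "./summaries-and-topics.csv"
CSV_HEADER = ["bill_id", "status", "summary", "topics"]


def describe_run() -> str:
    # Hypothetical helper: a one-line description of what a run reads and writes.
    return f"Reading {BILLS_COLLECTION_PATH}, writing {OUTPUT_CSV_PATH}"


print(describe_run())
```

With this layout, the script body would refer to `db.collection(BILLS_COLLECTION_PATH)` and `open(OUTPUT_CSV_PATH, "w")`, so changing the session or the output location is a one-line edit.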

    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["bill_id", "status", "summary", "topics"])
    for bill in bills:
        document = bill.to_dict()
        bill_id = document["id"]
        document_text = document.get("content", {}).get("DocumentText")
        document_title = document.get("content", {}).get("Title")
        summary = document.get("summary")

        # No document text, skip it because we can't summarize it
        if document_text is None:
            csv_writer.writerow(make_bill_summary(bill_id, "skipped", None, None))
            continue

        # If the summary is already populated move on
        if summary is not None:
            csv_writer.writerow(
                make_bill_summary(bill_id, "previous_summary", None, None)
            )
            continue

        summary = get_summary_api_function(bill_id, document_title, document_text)
        if summary["status"] in [-1, -2]:
            csv_writer.writerow(
                make_bill_summary(bill_id, "failed_summary", None, None)
            )
            continue
        # Note: `normalize_summary` does some post-processing to clean up the summaries
        # As of 2025-10-21 this was necessary due to the LLM prompt
        summary = normalize_summary(summary["summary"])
Collaborator

It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda?

Collaborator Author

Yes we do, good call out.

Collaborator

Thanks, did you file a followup issue?

Collaborator Author

I've not done that, but I totally can do that quick!
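For the follow-up discussed in this thread, one possible shape is a thin wrapper that any caller (the backfill script or the production path) could share. This is only a sketch under assumptions: `summarize_and_normalize` and `fake_get_summary` are hypothetical names, the success status value `1` is assumed, and the failure codes `-1` and `-2` are taken from the check in the backfill script. `normalize_summary` is copied from `llm/normalize_summaries.py` so the example is self-contained.

```python
import re


def normalize_summary(summary: str) -> str:
    # Copied from llm/normalize_summaries.py in this diff.
    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
    return " ".join(handle_remaining_whitespace)


def summarize_and_normalize(get_summary, bill_id, title, text):
    """Hypothetical wrapper: call a summary function, then normalize its output.

    `get_summary` is assumed to return a dict with "status" and "summary" keys,
    with -1 and -2 meaning failure, as in the backfill script.
    """
    result = get_summary(bill_id, title, text)
    if result["status"] in [-1, -2]:
        return None
    return normalize_summary(result["summary"])


def fake_get_summary(bill_id, title, text):
    # Stand-in for the real LLM call; the status value 1 is an assumption.
    return {"status": 1, "summary": "Summary:\n- The bill does X.\n"}


print(summarize_and_normalize(fake_get_summary, "H1", "A title", "Some text"))
```

Passing the summary function in as an argument keeps the wrapper testable without an `OPENAI_API_KEY`.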


        bill.reference.update({"summary": summary})

        # If the topics are already populated, just make a note of it
        topics = document.get("topics")
        if topics is not None:
            csv_writer.writerow(
                make_bill_summary(bill_id, "previous_topics", None, None)
            )

        tags = get_tags_api_function_v2(bill_id, document_title, summary)
        # If the tags fail, make a note and at least write the summary for debugging
        if tags["status"] != 1:
            csv_writer.writerow(make_bill_summary(bill_id, "failed_topics", None, None))
            csv_writer.writerow(
                make_bill_summary(bill_id, "generated_summary", summary, None)
            )
            continue
        topics_and_categories = get_categories_from_topics(
            tags["tags"], CATEGORY_BY_TOPIC
        )
        bill.reference.update({"topics": topics_and_categories})
        csv_writer.writerow(
            make_bill_summary(
                bill_id, "generated_summary_and_topics", summary, topics_and_categories
            )
        )
9 changes: 9 additions & 0 deletions llm/normalize_summaries.py
@@ -0,0 +1,9 @@
import re


def normalize_summary(summary: str) -> str:
    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
Collaborator Author

Strip all extraneous whitespace and filter any empty lines

    return " ".join(handle_remaining_whitespace)
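One edge case may be worth checking against real LLM output: because the bullet pattern `^- ` is applied before each line is stripped, bullets that arrive indented keep their dash. The sketch below reproduces that with the function from this diff and compares it with a hypothetical, more lenient variant (`normalize_summary_lenient` is not part of the PR) that strips each line first. Whether indented bullets actually occur in practice is an assumption.

```python
import re


def normalize_summary(summary: str) -> str:
    # Copied from llm/normalize_summaries.py in this diff.
    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
    return " ".join(handle_remaining_whitespace)


def normalize_summary_lenient(summary: str) -> str:
    # Hypothetical variant: strip each line before removing the bullet marker,
    # and tolerate whitespace before the "Summary:" prefix.
    without_prefix = re.sub(r"^\s*Summary:", "", summary)
    stripped = [line.strip() for line in without_prefix.splitlines()]
    without_bullets = [re.sub(r"^-\s+", "", line) for line in stripped]
    return " ".join(line for line in without_bullets if line)


indented = "Summary:\n  - First point.\n  - Second point.\n"
current_result = normalize_summary(indented)
lenient_result = normalize_summary_lenient(indented)

print(current_result)  # - First point. - Second point.
print(lenient_result)  # First point. Second point.
```

Both versions give the same result on the four inputs covered in `llm/test_normalize_summaries.py`, so the variant would only change behavior for indented input.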
40 changes: 40 additions & 0 deletions llm/test_normalize_summaries.py
@@ -0,0 +1,40 @@
import normalize_summaries


def test_normalize_summary_handles_summary_prefix_and_bullets():
    summary = """Summary:
- The bill allows Joe, the chief of police in Gravity, to continue working.
- The city can require annual health examinations
"""
    assert (
        normalize_summaries.normalize_summary(summary)
        == "The bill allows Joe, the chief of police in Gravity, to continue working. The city can require annual health examinations"
    )


def test_normalize_summary_handles_summary_prefix_and_no_bullets():
    summary = """Summary:
The bill allows Joe, the chief of police in Gravity, to continue working.
"""
    assert (
        normalize_summaries.normalize_summary(summary)
        == "The bill allows Joe, the chief of police in Gravity, to continue working."
    )


def test_normalize_summary_handles_summary_prefix_with_no_linebreak():
    summary = "Summary: The bill allows Joe, the chief of police in Gravity, to continue working."
    assert (
        normalize_summaries.normalize_summary(summary)
        == "The bill allows Joe, the chief of police in Gravity, to continue working."
    )


def test_normalize_summary_handles_bare_summary():
    summary = (
        "The bill allows Joe, the chief of police in Gravity, to continue working."
    )
    assert (
        normalize_summaries.normalize_summary(summary)
        == "The bill allows Joe, the chief of police in Gravity, to continue working."
    )