-
-
Notifications
You must be signed in to change notification settings - Fork 144
Add summary backfill #1948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add summary backfill #1948
Changes from all commits
312dbbb
897e30f
d0331dd
a9c6457
d27c3fd
e7b721a
afe6ca1
7127b23
bfb587e
2652ed9
be10fb5
00c76b1
100df2b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,5 @@ | ||
| venv/ | ||
| __pycache__/ | ||
| databases/ | ||
| .secret.local | ||
| .secret.local | ||
| summaries-and-topics.csv | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| """This script fills any missing 'summary' or 'topics' fields on the data model. | ||
|
|
||
| The document must have a 'Title' and 'DocumentText' field to generate them. The | ||
| script queries only the general court 194 bills, modifies the firebase database | ||
| in-place, and generates a CSV with a description of what happened. The header for | ||
| the CSV is `bill_id,status,summary,topics`. The possible statuses are, | ||
|
|
||
| - `skipped` - the bill doesn't have either a title or text, skip it | ||
| - `previous_summary` - the bill previously had a summary, skip it | ||
| - `failed_summary` - something went wrong when trying to summarize, skip it | ||
| - `previous_topics` - the bill previously had topics, skip it | ||
| - `failed_topics` - something went wrong when trying to generate topics, skip it | ||
| - `generated_summary` - both the summary and topics were generated successfully | ||
|
|
||
| Developer notes: | ||
| - you'll need to set the 'OPENAI_API_KEY' environment variable | ||
| """ | ||
|
|
||
| import firebase_admin | ||
| from llm_functions import get_summary_api_function, get_tags_api_function_v2 | ||
| from firebase_admin import firestore | ||
| from bill_on_document_created import get_categories_from_topics, CATEGORY_BY_TOPIC | ||
| import csv | ||
| from normalize_summaries import normalize_summary | ||
|
|
||
| # Module constants | ||
| FIREBASE_COLLECTION_PATH = "generalCourts/194/bills" | ||
| CSV_SUMMARY_OUTPUT = "./summaries-and-topics.csv" | ||
|
|
||
| # Application Default credentials are automatically created. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good question, as far as I know yes. In |
||
| app = firebase_admin.initialize_app() | ||
| db = firestore.client() | ||
|
|
||
|
|
||
| def make_bill_summary(bill_id, status, summary, topics): | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure I understand the purpose of this. Isn't the standard way to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Discussed over Zoom, added the doc comment below to try to clarify the purpose. |
||
| """Generate a row for csv.writerow | ||
|
|
||
| The goal with this function is to not forget all the arguments to subsequent | ||
| csv.writerow calls. | ||
| """ | ||
| return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"] | ||
|
|
||
|
|
||
| bills_ref = db.collection(FIREBASE_COLLECTION_PATH) | ||
| bills = bills_ref.get() | ||
| with open(CSV_SUMMARY_OUTPUT, "w") as csvfile: | ||
| csv_writer = csv.writer(csvfile) | ||
| csv_writer.writerow(["bill_id", "status", "summary", "topics"]) | ||
| for bill in bills: | ||
| document = bill.to_dict() | ||
| bill_id = document["id"] | ||
| document_text = document.get("content", {}).get("DocumentText") | ||
| document_title = document.get("content", {}).get("Title") | ||
| summary = document.get("summary") | ||
|
|
||
| # No document text or title, skip it because we can't summarize it | ||
| if document_text is None or document_title is None: | ||
| csv_writer.writerow(make_bill_summary(bill_id, "skipped", None, None)) | ||
| continue | ||
|
|
||
| # If the summary is already populated move on | ||
| if summary is not None: | ||
| csv_writer.writerow( | ||
| make_bill_summary(bill_id, "previous_summary", None, None) | ||
| ) | ||
| continue | ||
|
|
||
| summary = get_summary_api_function(bill_id, document_title, document_text) | ||
| if summary["status"] in [-1, -2]: | ||
| csv_writer.writerow( | ||
| make_bill_summary(bill_id, "failed_summary", None, None) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this string There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It does not end up on the site! Only in the CSV for following up! |
||
| ) | ||
| continue | ||
| # Note: `normalize_summary` does some post-processing to clean up the summaries | ||
| # As of 2025-10-21 this was necessary due to the LLM prompt | ||
| summary = normalize_summary(summary["summary"]) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes we do, good call out. |
||
| bill.reference.update({"summary": summary}) | ||
|
|
||
| # If the topics are already populated, just make a note of it | ||
| topics = document.get("topics") | ||
| if topics is not None: | ||
| csv_writer.writerow( | ||
| make_bill_summary(bill_id, "previous_topics", None, None) | ||
| ) | ||
|
|
||
| tags = get_tags_api_function_v2(bill_id, document_title, summary) | ||
| # If the tags fail, make a note and at least write the summary for debugging | ||
| if tags["status"] != 1: | ||
| csv_writer.writerow(make_bill_summary(bill_id, "failed_topics", None, None)) | ||
| csv_writer.writerow( | ||
| make_bill_summary(bill_id, "generated_summary", summary, None) | ||
| ) | ||
| continue | ||
| topics_and_categories = get_categories_from_topics( | ||
| tags["tags"], CATEGORY_BY_TOPIC | ||
| ) | ||
| bill.reference.update({"topics": topics_and_categories}) | ||
| csv_writer.writerow( | ||
| make_bill_summary( | ||
| bill_id, "generated_summary_and_topics", summary, topics_and_categories | ||
| ) | ||
| ) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| """Normalize summary outputs from the LLM | ||
|
|
||
| The summary prompt has some formatting prose that we don't want to persist into | ||
| the database. For example, it prefixes every summary with `Summary:`. We apply a | ||
| few preprocessing steps to every summary to keep things uniform. The steps, | ||
|
|
||
| 1. Remove leading `Summary:` from the input text | ||
| 2. Split any newlines created by unordered lists in the input text | ||
| 3. Remove leading `- ` from the split unordered lists | ||
| 4. Remove any remaining whitespace | ||
| 5. Put everything back together separated with spaces | ||
| """ | ||
|
|
||
| import re | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can I request you add a toplevel comment to all python files briefly explaining their purpose? |
||
|
|
||
|
|
||
| def normalize_summary(summary: str) -> str: | ||
| strip_summary = re.sub(r"^Summary:", "", summary) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Remove just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not move this and similar comments on the PR to inline comment in the script? Will help future readers. |
||
| lines = strip_summary.splitlines() | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Split by newlines |
||
| handle_list_items = [re.sub(r"^- ", "", x) for x in lines] | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For any lines which look like |
||
| handle_remaining_whitespace = [ | ||
| x.strip() for x in handle_list_items if x.strip() != "" | ||
| ] | ||
| return " ".join(handle_remaining_whitespace) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Put everything back together |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| import normalize_summaries | ||
|
|
||
|
|
||
| def test_normalize_summary_handles_summary_prefix_and_bullets(): | ||
| summary = """Summary: | ||
| - The bill allows Joe, the chief of police in Gravity, to continue working. | ||
| - The city can require annual health examinations | ||
| """ | ||
| assert ( | ||
| normalize_summaries.normalize_summary(summary) | ||
| == "The bill allows Joe, the chief of police in Gravity, to continue working. The city can require annual health examinations" | ||
| ) | ||
|
|
||
|
|
||
| def test_normalize_summary_handles_summary_prefix_and_no_bullets(): | ||
| summary = """Summary: | ||
| The bill allows Joe, the chief of police in Gravity, to continue working. | ||
| """ | ||
| assert ( | ||
| normalize_summaries.normalize_summary(summary) | ||
| == "The bill allows Joe, the chief of police in Gravity, to continue working." | ||
| ) | ||
|
|
||
|
|
||
| def test_normalize_summary_handles_summary_prefix_with_no_linebreak(): | ||
| summary = "Summary: The bill allows Joe, the chief of police in Gravity, to continue working." | ||
| assert ( | ||
| normalize_summaries.normalize_summary(summary) | ||
| == "The bill allows Joe, the chief of police in Gravity, to continue working." | ||
| ) | ||
|
|
||
|
|
||
| def test_normalize_summary_handles_bare_summary(): | ||
| summary = ( | ||
| "The bill allows Joe, the chief of police in Gravity, to continue working." | ||
| ) | ||
| assert ( | ||
| normalize_summaries.normalize_summary(summary) | ||
| == "The bill allows Joe, the chief of police in Gravity, to continue working." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generated by running
llm/backfill_summaries.pyand I assume we don't want to accidentally commit that.