Feature: repository search

Britta edited this page Apr 4, 2024 · 50 revisions

We need to show the following materials in our blended search results:

  • Regulations
  • Public policy documents that we link to:
    • Federal Register rules
    • Supplemental content: subregulatory guidance, technical assistance, etc.
  • Uploaded policy document files (internal to CMCS)

The purpose of this page is to explicitly document how we want search to work.

Goals

  • Take less than 5 seconds (ideally much less) to return relevant results.
  • Return meaningful highlighting of where your query term shows up in the document.
  • Interpret what you mean instead of being 100% literal (stop words, stemming, etc.), but not too much.

Good test queries:

  • state
    • At rank filter 0.05, this returns 2800+ results
  • "state plan"
    • At rank filter 0.05, this returns 1400+ results
  • state plan amendment
    • At rank filter 0.05, this returns 2000+ results
    • Should show the stemmed word "amendment" in the headline
  • Medicare
    • At rank filter 0.05, this returns 2100+ results
    • Shouldn't just return results for "medical" at the top
  • personal care services
    • At rank filter 0.05, this returns 1700+ results
    • Should show the stemmed words "personal" and "services" in highlights
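As a rough sketch, the result counts above could be turned into a repeatable check. The `count_results` callable is hypothetical (a stand-in for whatever runs a query and returns a result count at rank filter 0.05), and reading "2800+" as "at least 2800" is an assumption.

```python
# Minimum result counts taken from the test queries listed above
# (all measured at rank filter 0.05).
EXPECTED_MINIMUMS = {
    "state": 2800,
    '"state plan"': 1400,
    "state plan amendment": 2000,
    "Medicare": 2100,
    "personal care services": 1700,
}


def failing_queries(count_results, expected=EXPECTED_MINIMUMS):
    """Return the queries whose result count falls below the expected minimum.

    count_results is a hypothetical callable: query string -> number of results.
    """
    return [
        query
        for query, minimum in expected.items()
        if count_results(query) < minimum
    ]
```

An empty list means every test query still returns at least as many results as documented here. This only covers counts; the highlighting and stemming expectations would still need a manual look.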

Background info about technology

In our Postgres database we have:

  • The full text of regulation sections in scope, imported via eCFR API
  • Metadata about each document:
    • Imported via Federal Register API for post-1994 rules (and hand-corrected as needed)
      • Our FR API parser includes a special step to enable search indexing because the FR website does not allow scraping their normal URLs: we fetch their text-only URL via their API and give that to the Text Extractor Lambda instead of the normal URL.
    • Entered by hand for everything else
  • The full text of most documents, extracted via our Text Extractor Lambda, which uses:
    • Python Requests to grab content from URLs, respecting robots.txt and providing a custom user agent (CMCSeRegsTextExtractorBot/1.0)
    • Google Magika to detect file types
    • AWS Textract to process PDFs, including text detection for scanned documents
    • A variety of open source libraries to process Outlook, Word, Excel, PowerPoint, RTF, TXT, HTML, image, and ZIP files
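The robots.txt check mentioned above can be illustrated with Python's standard library. This is a minimal sketch, not the Text Extractor Lambda's actual code: the helper name `may_fetch` is hypothetical, and it parses robots.txt text passed in as a string rather than downloading it.

```python
from urllib.robotparser import RobotFileParser

# User agent string documented above for the Text Extractor Lambda.
USER_AGENT = "CMCSeRegsTextExtractorBot/1.0"


def may_fetch(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    Hypothetical helper: a real extractor would first download robots.txt
    from the target site before calling this.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, given a robots.txt that disallows `/private/` for all agents, `may_fetch` returns False for URLs under that path and True elsewhere. The same user agent string would also be sent as the `User-Agent` header on the actual request.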

We use Postgres full-text search through Django's built-in support for it (django.contrib.postgres.search).

Context about metadata

For team members, see "Structure and content of resources" (requires login).

Our metadata fields for FR docs, supplemental content, and uploaded files are shown here as they look on a subject page:

[Screenshot, 2023-12-18: metadata fields as displayed on a subject page]

Search result display

Factors for what you can see:

  • Supplemental content and FR docs can be marked "approved" or not approved in the admin panel. Items that aren't approved are visible only in the admin panel (which is available only to logged-in users) and are never shown in search results or elsewhere on the site.
  • If you're not logged in, you cannot see internal documents (uploaded files) in search results or elsewhere on the site.
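The two visibility rules above can be sketched as a single predicate. The `Doc` class and its field names are hypothetical, chosen for illustration rather than taken from our models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a searchable document."""
    title: str
    kind: str  # "fr_doc", "supplemental", or "uploaded_file"
    approved: bool = True  # only meaningful for FR docs and supplemental content


def visible_in_search(doc: Doc, logged_in: bool) -> bool:
    """Apply the two visibility rules described above."""
    # Unapproved FR docs and supplemental content never appear outside the admin panel.
    if doc.kind in ("fr_doc", "supplemental") and not doc.approved:
        return False
    # Internal documents (uploaded files) require a logged-in user.
    if doc.kind == "uploaded_file" and not logged_in:
        return False
    return True
```

Note that logging in does not make unapproved items visible in search: the approval check applies to everyone.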

In search results, we always show the following document metadata if available:

  • Document category
  • Date
  • Subjects
  • Related citations

If the desired keyword(s) exist only in the document metadata (FR doc name or description, supplemental content name or description, uploaded file name or summary, etc.), show that document metadata. This means:

  • FR doc: name (grey metadata) and description (blue link)
  • Supplemental content: name (grey metadata) and description (blue link)
  • Uploaded files: name (blue link) and summary (black text)

If the desired keyword(s) also exist in the extracted document text, show the name and description (grey metadata and blue link) AND:

  • For all types of documents, show the relevant headline (excerpt) from the full-text content, in black text. (For uploaded files, this headline replaces the summary.)
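The display rules above can be summarized in a small function. This is an illustrative sketch: the function name, the `kind` values, and the convention that `headline` is None when the keyword matched only metadata are all assumptions.

```python
def display_parts(kind, name, description, headline=None):
    """Return (style, text) pairs for one search result, in display order.

    kind: "fr_doc", "supplemental", or "uploaded_file" (hypothetical labels).
    description: the description, or the summary for uploaded files.
    headline: excerpt from the extracted full text, or None when the
              keyword(s) matched only the document metadata.
    """
    if kind == "uploaded_file":
        # For uploaded files, the headline replaces the summary.
        body = headline if headline else description
        return [("blue link", name), ("black text", body)]

    parts = [("grey metadata", name), ("blue link", description)]
    if headline:
        parts.append(("black text", headline))
    return parts
```

The document category, date, subjects, and related citations are shown in every case, so they are left out of this sketch.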

Search result ranking

Rank filtering

For background, see Ranking Search Results in the Postgres docs.

When Django directs Postgres to provide results for a query, each potential result for a query gets a ts_rank score. See the definition of ts_rank: "Computes a score showing how well the vector matches the query."

A higher score (for example, 0.1) means the result is more relevant to the query, while a lower score (for example, 0.01) means it is less relevant.

We have an environment variable that tells Postgres how strictly to filter the results: should it show fewer results that are the most relevant, or should it show lots of results, including less relevant results at the end? A higher filter (like 0.1) means show fewer results, and a lower filter (like 0.01) means show lots of results.

The rank filter value for each environment is in our parameter store: BASIC_SEARCH_FILTER and QUOTED_SEARCH_FILTER.

Rank filter is 0.05 in all environments, for both basic (not quoted) and phrase (quoted) search queries.
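To make the filtering concrete, here is a minimal sketch in plain Python. It assumes a query counts as "quoted" when it is wrapped in double quotes, that the filter is a minimum score (inclusive), and that results arrive as (title, score) pairs; the function names are hypothetical and the real filtering happens in the database query.

```python
import os

DEFAULT_RANK_FILTER = 0.05  # value documented above for all environments


def is_quoted(query: str) -> bool:
    """Assumption: a phrase query is one wrapped in double quotes."""
    q = query.strip()
    return len(q) >= 2 and q.startswith('"') and q.endswith('"')


def rank_filter_for(query: str, env=os.environ) -> float:
    """Pick the rank filter for a query from the environment settings."""
    name = "QUOTED_SEARCH_FILTER" if is_quoted(query) else "BASIC_SEARCH_FILTER"
    return float(env.get(name, DEFAULT_RANK_FILTER))


def filter_by_rank(results, threshold):
    """Keep (title, score) pairs at or above threshold, best score first."""
    kept = [r for r in results if r[1] >= threshold]
    return sorted(kept, key=lambda r: r[1], reverse=True)
```

With a filter of 0.05, a result scoring 0.02 is dropped while results scoring 0.05 or 0.09 are kept. Raising the filter to 0.1 would drop all three.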

Weights

To make search faster, we create and automatically maintain a "vector_column" that holds a pre-processed version of each content item. We create the pre-processed version using "weight" values for various parts of an item's metadata and content, so that (for example) a word in the title of a document counts more toward relevance than a word in the body of a document.

Context about decisions we made for weights (login required).

Weights for documents:

  • (FR doc) name: A
  • (Supplemental content) name: A
  • (Uploaded file) name: A
  • (FR doc) description: A
  • (Supplemental content) description: A
  • (Uploaded file) summary: B
  • (Uploaded file) filename: C
  • Date: C
  • Subjects (full names, short names, and abbreviations): D
  • Content: D

We could add citations and FR docket numbers to the weighting list if we want to. We may not need date in that list.

Weights for regulation text sections:

  • Section number: A
  • Section title: A
  • Part title: A
  • Content: B

We may be able to add the subpart title to the weighting list if we want to.
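As an illustration of how weights like these translate into Postgres, the sketch below builds a weighted tsvector expression using the standard `setweight` and `to_tsvector` functions. The column names, the `'english'` text search configuration, and the helper itself are assumptions for illustration, not our actual schema or migration code.

```python
# Hypothetical column names paired with the weights listed above.
DOCUMENT_WEIGHTS = [
    ("name", "A"),
    ("description", "A"),
    ("summary", "B"),
    ("filename", "C"),
    ("date", "C"),
    ("subjects", "D"),
    ("content", "D"),
]

REG_TEXT_WEIGHTS = [
    ("section_number", "A"),
    ("section_title", "A"),
    ("part_title", "A"),
    ("content", "B"),
]


def weighted_vector_sql(fields, config="english"):
    """Build a SQL expression that concatenates weighted tsvectors.

    fields: ordered (column, weight) pairs, with weight one of A, B, C, D.
    config: text search configuration name (assumed to be 'english').
    """
    parts = []
    for column, weight in fields:
        if weight not in ("A", "B", "C", "D"):
            raise ValueError(f"invalid weight {weight!r} for column {column!r}")
        parts.append(
            f"setweight(to_tsvector('{config}', coalesce({column}::text, '')), '{weight}')"
        )
    return " || ".join(parts)
```

By default, `ts_rank` scores the four weight classes D, C, B, and A at 0.1, 0.2, 0.4, and 1.0 respectively, which is why a match in an A-weighted field such as a name counts much more than a match in D-weighted content.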
