index-creation.md

This document describes the process used to create the Federal Website Index, which serves as the target URL list for the Site Scanning engine. The actual code that does this is here.

  • The Federal Website Index is created by combining and processing a number of individual source datasets. The list of datasets is managed here (in the fetch_data function), and the URLs for these datasets are managed here.
  • The specified source datasets are copied and imported into memory. Snapshots of each individual dataset are stored here.
  • One further source dataset is created by taking the list of federal .gov domains and prepending www to each of them.
  • The various source datasets are combined. A snapshot of this combined list is stored here.
  • The combined list of websites is then deduplicated. A snapshot of the deduplicated list is stored here. A list of the websites that are removed in this step is stored here.
  • The list of websites is filtered to remove any entries that should be ignored, as specified by two ignore files: the begins with list is applied first, then the contains list. Note that the contains list only matches when the specified string has a non-alphanumeric character both before and after it (or just after it, if the string is at the beginning). The purpose of this step is to try to remove non-public websites. Snapshots of the list after each filter can be found here (after the begins with list) and here (after the contains list). A list of the websites that are removed by each filter can be found here (for the begins with list) and here (for the contains list).
  • Agency, bureau, and branch information is added to each website by pulling in the relevant information for its base domain from the list of federal .gov domains.
  • The list of websites is then further filtered to only keep those that have a base domain that is on the list of federal .gov domains. This removes non-.gov, non-federal, and expired websites. The result is the Federal Website Index, which can be found here. A list of the websites that are removed at this step can be found here.
  • Note: A summary report of the assembly process can be found here. A summary report of the finalized Federal Website Index can be found here.
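
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's actual code: every function name here (`add_www`, `dedupe`, `matches_contains`, `apply_ignore_lists`, `base_domain`, `build_index`) is hypothetical, the federal .gov domain list is assumed to be a dict keyed by base domain, the base domain is assumed to be the last two labels of the hostname, and the snapshot and report files written at each step are omitted.

```python
import re


def add_www(domains):
    """Extra source dataset: each federal .gov domain with www prepended."""
    return ["www." + d for d in domains]


def dedupe(urls):
    """Return (kept, removed), preserving first-seen order."""
    seen, kept, removed = set(), [], []
    for url in urls:
        if url in seen:
            removed.append(url)
        else:
            seen.add(url)
            kept.append(url)
    return kept, removed


def matches_contains(url, token):
    """True if token is followed by a non-alphanumeric character and is
    preceded by one, or sits at the very beginning of the string."""
    pattern = r"(?:^|[^a-z0-9])" + re.escape(token.lower()) + r"[^a-z0-9]"
    return re.search(pattern, url.lower()) is not None


def apply_ignore_lists(urls, begins_with, contains):
    """Apply the begins-with list first, then the contains list."""
    after_begins = [
        u for u in urls if not any(u.startswith(p) for p in begins_with)
    ]
    after_contains = [
        u for u in after_begins
        if not any(matches_contains(u, t) for t in contains)
    ]
    return after_begins, after_contains


def base_domain(url):
    """Assumption: the base domain is the last two dot-separated labels."""
    return ".".join(url.split(".")[-2:])


def build_index(source_lists, federal_domains, begins_with, contains):
    """federal_domains: dict of base domain -> agency/bureau/branch info."""
    # Combine the source datasets plus the generated www dataset.
    combined = [u.lower() for src in source_lists for u in src]
    combined += add_www(federal_domains)
    deduped, _removed = dedupe(combined)
    _after_begins, filtered = apply_ignore_lists(deduped, begins_with, contains)
    # Attach agency information from the base domain (empty if unknown).
    enriched = [
        {"url": u, **federal_domains.get(base_domain(u), {})} for u in filtered
    ]
    # Keep only websites whose base domain is a federal .gov domain.
    return [e for e in enriched if base_domain(e["url"]) in federal_domains]


if __name__ == "__main__":
    federal = {"gsa.gov": {"agency": "General Services Administration",
                           "bureau": "", "branch": "Executive"}}
    sources = [
        ["gsa.gov", "digital.gsa.gov", "staging.gsa.gov"],
        ["GSA.gov", "example.com", "dev-portal.gsa.gov"],
    ]
    index = build_index(sources, federal, ["staging."], ["dev"])
    print([entry["url"] for entry in index])
```

With this made-up input, the duplicate `GSA.gov` is dropped by deduplication, `staging.gsa.gov` by the begins with list, `dev-portal.gsa.gov` by the contains list, and `example.com` by the final base-domain filter. A hostname such as `developer.gsa.gov` would survive the contains filter for `dev`, because the character after the match is alphanumeric.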