This week we're exploring datasets that the Trump administration has purged.
An effort is underway to back up this publicly funded data before it is lost. This week's dataset contains metadata about CDC datasets backed up on archive.org.
"The removal of HIV- and LGBTQ-related resources from the websites of the Centers for Disease Control and Prevention and other health agencies is deeply concerning and creates a dangerous gap in scientific information and data to monitor and respond to disease outbreaks," the Infectious Disease Society of America said in a statement. "Access to this information is crucial for infectious diseases and HIV health care professionals who care for people with HIV and members of the LGBTQ community and is critical to efforts to end the HIV epidemic."
- Which Bureaus and Programs have the most datasets archived in this collection?
- Explore some of the datasets. What keywords do the datasets have in common? (A starter sketch for these questions follows the loading code below.)
Thank you to Jon Harmon for curating this week's dataset.
# Option 1: tidytuesdayR package
## install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2025-02-11')
## OR
tuesdata <- tidytuesdayR::tt_load(2025, week = 6)
cdc_datasets <- tuesdata$cdc_datasets
fpi_codes <- tuesdata$fpi_codes
omb_codes <- tuesdata$omb_codes
# Option 2: Read directly from GitHub
cdc_datasets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-11/cdc_datasets.csv')
fpi_codes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-11/fpi_codes.csv')
omb_codes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-02-11/omb_codes.csv')
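Once the data are loaded, here is a minimal sketch for the exploration questions above. It assumes only the columns documented in the data dictionaries below, and that tags is a single comma-separated string:

library(tidyverse)

# Datasets per combined agency:bureau code, most common first.
cdc_datasets |>
  dplyr::count(bureau_code, sort = TRUE)

# Datasets per program code.
cdc_datasets |>
  dplyr::count(program_code, sort = TRUE)

# Most common tags, assuming tags is a comma-separated string.
cdc_datasets |>
  tidyr::separate_longer_delim(tags, delim = ", ") |>
  dplyr::count(tags, sort = TRUE)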
- Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
- Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
- Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
- Submit your own dataset!
Data dictionary: cdc_datasets.csv

variable | class | description |
---|---|---|
dataset_url | character | The location to download the metadata about the archived dataset. The dataset itself is at this location with "-meta" removed (replace "-meta.csv" with ".csv"; see the snippet after this table). |
contact_name | character | A name to contact about the dataset. Sometimes this field contains the name of the dataset. |
contact_email | character | A government email to contact about the dataset. Many of these email addresses likely no longer work under the Trump administration. |
bureau_code | character | The combined agency and bureau code for the federal agency, from OMB Circular A-11, Appendix C (see the omb_codes dataset). |
program_code | character | The primary program related to this data asset, from the Federal Program Inventory (see fpi_codes dataset). |
category | character | Main thematic category of the dataset. |
tags | character | Tags (or keywords) to help users discover the dataset. Intended to include terms that would be used by technical and non-technical users. |
publisher | character | The publishing entity and optionally their parent organization(s). |
public_access_level | character | The degree to which this dataset could be made publicly available, regardless of whether it has been made available. Choices: public (Data asset is or could be made publicly available to all without restrictions), restricted public (Data asset is available under certain use restrictions), or non-public (Data asset is not available to members of the public). |
footnotes | character | Additional notes about this dataset. |
license | character | The license or non-license (i.e. Public Domain) status with which the dataset or API has been published. |
source_link | character | The location where the dataset was stored. |
issued | character | Date of formal issuance. |
geographic_coverage | character | The range of spatial applicability of a dataset. Could include a spatial region like a bounding box or a named place. |
temporal_applicability | character | The range of temporal applicability of a dataset (i.e., a start and end date of applicability for the data). |
update_frequency | character | The frequency with which the dataset is published. |
described_by | character | URL to the data dictionary for the dataset. |
homepage | character | Intended for use if a dataset has a human-friendly hub or landing page that users can be directed to for all resources tied to the dataset. |
geographic_unit_of_analysis | character | Likely very similar to geographic_coverage. |
suggested_citation | character | How to cite this dataset. |
geospatial_resolution | character | The sizes of geospatial units included in the dataset. |
references | character | Related documents such as technical information about a dataset, developer documentation, etc. |
glossary_methodology | character | A URL or reference to the glossary or methodology explaining how things were named. |
access_level_comment | character | This may include information regarding access or restrictions based on privacy, security, or other policies. |
analytical_methods_reference | character | Usually a URL describing the methodology. The URL may no longer be available under the Trump administration. |
language | character | The language of the dataset. |
collection | character | The collection of which the dataset is a subset. |
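As noted for dataset_url, each archived data file lives alongside its metadata file. A one-line sketch to derive the data-file URLs:

# Swap the "-meta.csv" suffix for ".csv" to get the data-file URL.
data_urls <- stringr::str_replace(cdc_datasets$dataset_url, "-meta\\.csv$", ".csv")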
Data dictionary: fpi_codes.csv

variable | class | description |
---|---|---|
agency_name | character | The name of the federal agency. |
program_name | character | The name of this program. |
additional_information_optional | character | Other notes. |
agency_code | character | The three-digit code for the agency housing this program. |
program_code | character | The Federal Program Inventory code for this program. |
program_code_pod_format | character | The Federal Program Inventory code for this program in "project open data" format. |
Data dictionary: omb_codes.csv

variable | class | description |
---|---|---|
agency_name | character | The name of the federal agency. |
bureau_name | character | The name of the entity within the agency. |
agency_code | double | The OMB code for this agency. |
bureau_code | double | The OMB code for this bureau within this agency. |
treasury_code | character | The Treasury Department code for this agency. |
cgac_code | character | Common Government-wide Accounting Classification. |
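To turn codes into names, the two lookup tables above can be joined to cdc_datasets. A sketch, assuming bureau_code in cdc_datasets uses the combined "agency:bureau" form described above and that program_code matches the "project open data" format in fpi_codes:

library(tidyverse)

# Count datasets per bureau, with human-readable bureau names.
bureau_counts <- cdc_datasets |>
  # Split the combined code into its agency and bureau parts (assumed
  # "agency:bureau" format).
  tidyr::separate_wider_delim(
    bureau_code,
    delim = ":",
    names = c("agency_code", "bureau_code")
  ) |>
  dplyr::mutate(
    dplyr::across(c(agency_code, bureau_code), as.numeric)
  ) |>
  dplyr::left_join(omb_codes, by = c("agency_code", "bureau_code")) |>
  dplyr::count(bureau_name, sort = TRUE)

# Count datasets per program, with human-readable program names.
program_counts <- cdc_datasets |>
  dplyr::left_join(
    fpi_codes,
    by = c("program_code" = "program_code_pod_format")
  ) |>
  dplyr::count(program_name, sort = TRUE)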
Cleaning script:

library(tidyverse)
library(rvest)
library(janitor)
library(httr2)
index <- rvest::read_html_live("https://archive.org/download/20250128-cdc-datasets")
meta_urls <- index |>
rvest::html_element(".download-directory-listing") |>
rvest::html_table() |>
janitor::clean_names() |>
dplyr::filter(stringr::str_ends(name, "-meta.csv")) |>
dplyr::mutate(
url = paste0(
"https://archive.org/download/20250128-cdc-datasets/",
URLencode(name)
)
) |>
dplyr::select(url)
rm(index)
# As of 2025-02-03, there are 1257 metadata CSVs available. We will load each
# one and widen it, then stitch them all together. This can take a very long
# time.
requests <- meta_urls$url |>
purrr::map(\(url) {
httr2::request(url) |>
httr2::req_retry(
max_tries = 10,
is_transient = \(resp) {
httr2::resp_status(resp) %in% c(429, 500, 503)
},
# Always wait 10 seconds to retry. It seems to be a general throttle,
# but they don't tell us how long they need us to back off.
backoff = \(i) 10
)
})
resps <- httr2::req_perform_sequential(requests, on_error = "continue")
reqs_to_retry <- resps |>
httr2::resps_failures() |>
purrr::map("request")
resps2 <- httr2::req_perform_sequential(reqs_to_retry)
resps <- c(httr2::resps_successes(resps), httr2::resps_successes(resps2))
# Each metadata CSV arrives as a body of field,value lines; read each response
# as a single trimmed string for parsing below.
extract_cdc_dataset_row <- function(resp) {
httr2::resp_body_string(resp) |>
stringr::str_trim()
}
cdc_datasets <- tibble::tibble(
dataset_url = purrr::map_chr(resps, c("request", "url")),
raw = httr2::resps_data(resps, extract_cdc_dataset_row)
) |>
tidyr::separate_longer_delim(raw, delim = "\r\n") |>
dplyr::filter(stringr::str_detect(raw, ",")) |>
tidyr::separate_wider_delim(
raw,
delim = ",",
names = c("field", "value"),
too_many = "merge",
too_few = "align_start"
) |>
# Remove opening/closing quotes and trailing commas.
dplyr::mutate(
value = stringr::str_trim(value),
value = dplyr::if_else(
stringr::str_starts(value, '"') & stringr::str_ends(value, '"') &
!stringr::str_detect(stringr::str_sub(value, 2, -2), '"'),
stringr::str_sub(value, 2, -2),
value
) |>
stringr::str_remove(",\\s*$") |>
dplyr::na_if("") |>
dplyr::na_if("NA") |>
dplyr::na_if("n/a") |>
dplyr::na_if("N/A")
) |>
dplyr::distinct() |>
dplyr::filter(!is.na(value)) |>
tidyr::pivot_wider(
id_cols = c(dataset_url),
names_from = field,
values_from = value,
# Paste the contents of multi-value fields together.
values_fn = \(x) {
paste(unique(x), collapse = "\n")
}
) |>
janitor::clean_names() |>
dplyr::mutate(
tags = purrr::map2_chr(tags, theme, \(tags, theme) {
if (!is.na(theme)) {
paste(tags, theme, sep = ", ")
} else {
tags
}
}),
language = dplyr::case_match(
language,
"English" ~ "en-US",
.default = language
)
) |>
dplyr::mutate(
dplyr::across(
c("public_access_level", "update_frequency"),
tolower
)
) |>
# Drop columns manually identified as meaningless.
dplyr::select(
-resource_name,
-system_of_records,
-theme,
-is_quality_data
)
omb_codes <- readr::read_csv("https://resources.data.gov/schemas/dcat-us/v1.1/omb_bureau_codes.csv") |>
janitor::clean_names() |>
dplyr::mutate(
cgac_code = dplyr::na_if(cgac_code, "n/a")
)
fpi_codes <- readr::read_csv("https://resources.data.gov/schemas/dcat-us/v1.1/FederalProgramInventory_FY13_MachineReadable_091613.csv") |>
janitor::clean_names()
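Finally, one way to save the cleaned tables (the output paths here are illustrative, not the repository's actual layout):

# Write out the cleaned tables for publication.
readr::write_csv(cdc_datasets, "cdc_datasets.csv")
readr::write_csv(fpi_codes, "fpi_codes.csv")
readr::write_csv(omb_codes, "omb_codes.csv")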