
feat: Add Species Habitat Dataset for Faceted Map Examples #684

Draft · wants to merge 32 commits into base: main
Conversation

@dsmedia dsmedia commented Feb 11, 2025

will close #683

This PR adds a new dataset to vega-datasets containing county-level species habitat distribution data for the United States. The dataset is designed to support categorical faceted map examples in Vega visualization libraries, addressing needs discussed in vega/altair#1711 and vega/vega-datasets#683.

Implementation Status

To Do

  • Evaluate USGS API integration for more robust data acquisition
  • Generate sample Altair visualizations to validate species selection
  • Address script warnings and improve error handling
  • Replace USGS species codes with human-readable species names
  • Finalize output format (CSV/Parquet/Arrow based on usage requirements)
  • Enhance script functionality:
    • Add flexible output format options (Parquet/CSV/JSON)
    • Enable custom species selection to empower community extensions
  • Reduce precision of pct to three decimal places
  • Validate percentage calculations through county-specific spot checks
  • Ensure script is sufficiently documented to explain to non-geospatial experts
    • precise meaning of pct column
    • technique(s) used to overlay the raster data on counties
  • update datapackage files with new dataset description
  • Highlight this PR in vega/vega-lite#9292 to motivate resolving that issue, so example visualizations can use .facet(...) rather than a list comprehension (as noted by @dangotbanned)

Dataset Details

Current Implementation (species.csv)

Structure

  • Format: CSV
  • Content: County-level habitat percentages for four US species (American Bullfrog, American Robin, White-tailed Deer, Common Gartersnake)
  • Location: data/ directory
  • Rows: 11,651
  • Columns: 6

Fields

| Field | Type | Description |
| --- | --- | --- |
| item_id | string | Unique identifier for the species data item on ScienceBase |
| common_name | string | Common name of the species (e.g., "American Bullfrog") |
| scientific_name | string | Scientific name of the species (e.g., "Lithobates catesbeianus") |
| gap_species_code | string | GAP Species Code, a unique identifier for the species within the GAP dataset (e.g., "aAMBUx") |
| county_id | int | Combined state and county FIPS code identifying the US county; four or five digits (i.e., not zero-padded) |
| habitat_yearround_pct | float | Percentage of the county area classified as year-round suitable habitat for the species (rounded to 4 decimal places) |

Data Generation

The dataset is generated using scripts/species.py, which implements the approach suggested by @mattijn in this comment. The script:

  1. Downloads habitat raster data from USGS ScienceBase
  2. Processes county-level habitat percentages using exactextract (see the sketch after this list)
  3. Generates data/species.csv
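
For non-geospatial readers, here is a minimal sketch of what the exactextract zonal-statistics step can look like. This is not the actual scripts/species.py: the file names, the "id" column, and the habitat encoding below are assumptions.

```python
import geopandas as gpd
from exactextract import exact_extract

# Hypothetical inputs: county polygons carrying an "id" (FIPS) column, and a
# GAP habitat GeoTIFF whose pixel values encode habitat classes.
counties = gpd.read_file("counties.geojson")

# The "frac" operation reports, for each polygon, the fraction of its area
# covered by each distinct raster value; the year-round habitat percentage is
# then the fraction for whichever value encodes year-round habitat.
stats = exact_extract(
    "habitat.tif",
    counties,
    ["frac"],
    include_cols=["id"],
    output="pandas",
)
```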

Known Issues

  • Runtime warning: Spatial reference system of input features does not exactly match raster
  • Species column currently uses USGS codes (e.g., "aAMBUx", "bAMROx")
    • Will be updated to use descriptive species names
  • Percentage precision to be optimized for typical use cases
  • Check for documentation purposes if Alaska data is out of scope

Future work

Consider expanding the dataset (in a backward-compatible way) to include summer and winter habitat data, plus summer/winter/all-season range data.

Validation of Habitat Percentage Calculation

[Image: side-by-side comparison of the USGS habitat raster and the generated county choropleth, described below]

This image compares the USGS potential habitat map (within this zip file) for bullfrogs (zoomed into the southeastern United States, for clarity) with the output of our code, demonstrating the correct implementation of zonal statistics.

Left: USGS potential habitat map for bullfrogs. This is a raster dataset showing predicted suitable habitat (purple areas) at a 30-meter resolution. It is not aggregated by any administrative boundaries.
Right: Generated choropleth map showing the percentage of suitable habitat within each county. Lighter colors indicate a higher percentage of the county is classified as suitable habitat, while darker colors indicate a lower percentage.

The code successfully overlays the USGS raster data with county boundaries (from us-10m.json, a vector dataset) and uses exactextract to calculate the percentage of each county that falls within the USGS-defined habitat. The resulting map visually confirms that the habitat percentages are being calculated correctly, with higher percentages in counties that overlap significantly with the USGS's predicted habitat area.

Species Habitat Visualization Code
import altair as alt
import pyarrow as pa
import pyarrow.compute as pc  # compute is a submodule and needs an explicit import
import requests
import io
from vega_datasets import data

# Download and load the Arrow file
url = "https://github.com/vega/vega-datasets/raw/c4ff52d22bfa2e70a1b40554cc4c56df216838f3/data/species.arrow"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
table = pa.ipc.RecordBatchFileReader(io.BytesIO(response.content)).read_all()

# Disable row limit for Altair (if using vegafusion, this may not be needed)
alt.data_transformers.disable_max_rows()

# Load US counties topology
counties = alt.topo_feature(data.us_10m.url, 'counties')

# --- Filter for the specific species ---
target_item_id = "58fa3f0be4b0b7ea54524859"

# Use a try-except block for more robust error handling
try:
    # Filter the table based on 'item_id'
    filtered_table = table.filter(pc.equal(table.column('item_id'), target_item_id))

    # Check if the item_id was found
    if filtered_table.num_rows == 0:
        raise ValueError(f"item_id '{target_item_id}' not found in the dataset.")

    # Get the species name (assuming it's unique for the given item_id)
    species_name = filtered_table.column('CommonName')[0].as_py()

except pa.lib.ArrowInvalid as e:
    raise ValueError(f"Invalid item_id format or column not found: {e}") from e
except Exception as e:
    raise RuntimeError(f"An unexpected error occurred: {e}") from e

# --- END MODIFICATION ---

# Create a single chart for the filtered species
chart = alt.Chart(counties, title=species_name).mark_geoshape(tooltip=True).encode(
    color=alt.Color(
        'percent_habitat:Q',
        scale=alt.Scale(
            domain=[0, 1],
            scheme='viridis',
            zero=True,
            nice=False
        ),
        title='Habitat Percentage',
        legend=alt.Legend(format=".0%")
    ),
    tooltip=[
        alt.Tooltip('county_id:N', title='County ID'),
        alt.Tooltip('percent_habitat:Q', title='Habitat %', format='.2%')
    ]
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(
        data=filtered_table,  # Use the filtered table directly
        key='county_id',
        fields=['percent_habitat']
    )
).transform_filter(
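    # Exclude Alaska (FIPS prefix 2) and Hawaii (15), which fall outside the dataset's coverage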
    "indexof(['2', '15'], '' + floor(datum.id / 1000)) == -1"
).transform_calculate(
    percent_habitat="datum.percent_habitat === null ? 0 : datum.percent_habitat"
).project(type='albersUsa').properties(
    width=600, height=400
)

# Display the chart
chart

- Replace direct URL downloads with ScienceBase API client (sciencebasepy)
- Remove niquests dependency in favor of native ScienceBase download handling
- Enhance logging for better debugging of file downloads and extractions
- Add explicit ZIP file cleanup after TIF extraction
- Update dependencies to include sciencebasepy and setuptools

dsmedia commented Feb 13, 2025

Just as a reference for development, this will generate a complete list of species and all species-level metadata.

Generation script
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "pandas",
#     "sciencebasepy",
#     "tqdm",
#     "requests",
# ]
# ///
"""
This script retrieves and processes species identifier data from ScienceBase.

It fetches all child items under a specified parent item ID, extracts identifiers
like ECOS and ITIS codes, and compiles the data into a CSV file. The script
utilizes parallel processing to efficiently handle a large number of items and
includes error handling and retry mechanisms for robustness.
"""
import pandas as pd
from sciencebasepy import SbSession
from collections import defaultdict
from tqdm import tqdm
import concurrent.futures
import time
import requests  # Needed to catch requests.exceptions.HTTPError below


def get_all_item_ids(parent_id: str = "527d0a83e4b0850ea0518326") -> list[str]:
    """
    Retrieves all child item IDs from a given ScienceBase parent item.

    This function efficiently fetches all item IDs that are children of the
    specified parent item ID. It is optimized for performance by using the
    `get_child_ids` method from the ScienceBase API.

    Args:
        parent_id: The ScienceBase item ID of the parent item.
                   Defaults to "527d0a83e4b0850ea0518326" (Habitat Map parent item).

    Returns:
        A list of strings, where each string is a ScienceBase item ID.
        Returns an empty list if there are errors or no child items found.
    """
    print("Retrieving species item IDs from ScienceBase...")
    sb = SbSession()

    try:
        parent_item = sb.get_item(parent_id)
        print(f"Found parent item: '{parent_item.get('title', 'No Title')}'")

        all_ids = list(sb.get_child_ids(parent_id))  # Efficiently get all child IDs

        print(f"Found {len(all_ids)} species items.")
        return all_ids

    except Exception as e:
        print(f"Error retrieving items from ScienceBase: {e}")
        import traceback
        traceback.print_exc()  # Print full traceback for debugging
        return []


def process_single_item(item_id: str, sb: SbSession) -> dict:
    """
    Processes a single ScienceBase item to extract relevant identifier data.

    This function fetches the JSON representation of a ScienceBase item, extracts
    the title and identifiers (like ECOS, ITIS), and returns the data as a dictionary.
    It includes error handling and retry logic for network issues, particularly
    HTTP errors.

    Args:
        item_id: The ScienceBase item ID to process.
        sb: An authenticated SbSession object for interacting with ScienceBase.

    Returns:
        A dictionary containing the extracted item data, including 'item_id',
        'title', and any identifiers found (e.g., 'ECOS', 'ITIS').
        Returns None if the item could not be processed due to errors,
        after retries in case of transient network issues, or if the item is not found (404).
    """
    try:
        item_json = sb.get_item(item_id)
        title = item_json.get('title', 'Unknown Title')

        item_data = defaultdict(lambda: 'Not Available')  # Default value for missing identifiers
        item_data['item_id'] = item_id
        item_data['title'] = title

        if 'identifiers' in item_json:
            for identifier in item_json['identifiers']:
                scheme = identifier.get('scheme', 'Unknown Scheme')
                key = identifier.get('key', 'No Value')
                clean_scheme = scheme.split('/')[-1] if '/' in scheme else scheme  # Extract last part of scheme
                item_data[clean_scheme] = key

        return dict(item_data)

    except requests.exceptions.HTTPError as e:
        print(f"\nHTTPError processing item {item_id}: {e}")
        if e.response.status_code == 404:
            print(f"Item {item_id} not found (404 error). Skipping.")
            return None  # Indicate item not found
        else:
            retries = 3
            for i in range(retries):
                try:
                    print(f"Retrying item {item_id} (attempt {i+1}/{retries})...")
                    time.sleep(2**i)  # Exponential backoff
                    item_json = sb.get_item(item_id)  # Retry fetching the item
                    title = item_json.get('title', 'Unknown Title')

                    item_data = defaultdict(lambda: 'Not Available')
                    item_data['item_id'] = item_id
                    item_data['title'] = title
                    if 'identifiers' in item_json:
                        for identifier in item_json['identifiers']:
                            scheme = identifier.get('scheme', 'Unknown Scheme')
                            key = identifier.get('key', 'No Value')
                            clean_scheme = scheme.split('/')[-1] if '/' in scheme else scheme
                            item_data[clean_scheme] = key
                    return dict(item_data)
                except requests.exceptions.RequestException:
                    if i == retries - 1:
                        print(f"Failed to retrieve item {item_id} after multiple retries.")
                        return None  # Indicate failure after retries
    except Exception as e:
        print(f"\nError processing item {item_id}: {e}")
        return None  # Indicate processing failure


def explore_species_identifiers_parallel(item_ids: list[str], max_workers: int = 10) -> pd.DataFrame:
    """
    Explores and extracts identifiers from ScienceBase items in parallel.

    This function processes a list of ScienceBase item IDs concurrently using
    a thread pool executor. It fetches data for each item using the
    `process_single_item` function and aggregates the results into a pandas DataFrame.

    Args:
        item_ids: A list of ScienceBase item IDs to process.
        max_workers: The maximum number of threads to use for parallel processing.
                     Adjust this value based on your system and network to optimize
                     performance without overloading resources.

    Returns:
        A pandas DataFrame where each row represents a ScienceBase item,
        and columns include 'item_id', 'title', and identifier schemes
        (e.g., 'ECOS', 'ITIS'). Returns an empty DataFrame if no data is processed.
    """
    sb = SbSession()  # Create a single session for all threads
    all_schemes = set()
    results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single_item, item_id, sb) for item_id in item_ids]

        for future in tqdm(concurrent.futures.as_completed(futures), total=len(item_ids), desc="Processing species items"):
            result = future.result()
            if result:
                results.append(result)
                all_schemes.update(key for key in result if key not in ('item_id', 'title')) # Update schemes efficiently

    if not results:
        return pd.DataFrame()  # Return empty DataFrame if no results

    df = pd.DataFrame(results)

    # Ensure all identifier scheme columns are present and reorder
    for scheme in all_schemes:
        if scheme not in df.columns:
            df[scheme] = 'Not Available'
    cols = ['item_id', 'title'] + sorted(list(all_schemes))
    df = df[cols]

    return df


def main():
    """
    Main function to execute the species identifier exploration script.

    This function orchestrates the process of:
    1. Retrieving all species item IDs from ScienceBase.
    2. Processing these items in parallel to extract identifiers.
    3. Generating summary statistics of the extracted data.
    4. Saving the results to a CSV file named 'all_species_identifiers.csv'.
    5. Displaying the first few rows of the resulting DataFrame.
    """
    all_ids = get_all_item_ids()  # Get all species item IDs

    if not all_ids:
        print("No species IDs found. Exiting.")
        return

    print("\nProcessing species item identifiers...")
    df = explore_species_identifiers_parallel(all_ids, max_workers=10)  # Process in parallel

    if df.empty:
        print("No data was processed successfully. Exiting.")
        return

    print("\nIdentifier Summary:")
    print(f"Total species items processed: {len(df)}")
    print("\nIdentifier columns found and count of non 'Not Available' values:")
    for col in df.columns:
        # Corrected line to exclude NaN values from the count
        non_empty_count = df[col].ne('Not Available').sum() - df[col].isna().sum()
        print(f"- {col}: {non_empty_count} items have this identifier")

    output_file = 'all_species_identifiers.csv'
    df.to_csv(output_file, index=False)
    print(f"\nResults saved to '{output_file}'")

    print("\nFirst 5 rows of the data:")
    pd.set_option('display.max_columns', None)  # Display all columns
    pd.set_option('display.max_colwidth', None) # Display full column width
    print(df.head()) # Replaced display(df.head()) with print()


if __name__ == "__main__":
    main()

USGS report with domain-specific explanations

💡 Note: the initial character of GAP_SpeciesCode indicates the taxonomic group: a = amphibians, b = birds, m = mammals, r = reptiles.

💡 Note: The dataset includes 1,590 species and 129 subspecies. The hierarchical relationship between species and subspecies can be inferred from the final character of GAP_SpeciesCode. For example, mPASHc and mPASHp refer to Sorex pacificus cascadensis and Sorex pacificus pacificus, respectively, both subspecies under mPASHx, the Pacific Shrew. The species-level parent should contain the habitat locations for all of its subspecies.
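
As a small illustration of this naming convention (not part of the PR; the helper below is made up for clarity):

```python
def species_parent(gap_code: str) -> str:
    """Map a GAP species/subspecies code to its species-level parent code."""
    # A trailing 'x' marks a species-level code; any other final letter marks
    # a subspecies sharing the same five-character stem.
    return gap_code[:-1] + "x"

assert species_parent("mPASHc") == "mPASHx"  # Sorex pacificus cascadensis -> Pacific Shrew
assert species_parent("mPASHx") == "mPASHx"  # already species-level
```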

| Column Name | Sample Values |
| --- | --- |
| item_id | 58fa4341e4b0b7ea54524b8f; 58fa6abfe4b0b7ea545258e1; 58fa6b9ce4b0b7ea54525905 |
| title | San Simeon Slender Salamander (Batrachoseps incognitus) aSSSSx_CONUS_2001v1 Habitat Map; Hooded Skunk (Mephitis macroura) mHOSKx_CONUS_2001v1 Habitat Map; Santa Rosa Island Fox (Urocyon littoralis santarosae) mISFOr_CONUS_2001v1 Habitat Map |
| CommonName | San Simeon Slender Salamander; Hooded Skunk; Santa Rosa Island Fox |
| GAP_SpeciesCode | aSSSSx; mHOSKx; mISFOr |
| ScientificName | Batrachoseps incognitus; Mephitis macroura; Urocyon littoralis santarosae |
| doi | doi:10.5066/F7X065BT; doi:10.5066/F7FJ2F60; doi:10.5066/F7WM1BSQ |
| itis_tsn_validMatch | 668244; 180563; 726908 |
| itis_tsn_validSpeciesConceptEncompassingUnrecognizedSpeciesConcept | 174233; 174092; 174233 |
| itis_tsn_validSpeciesForUnrecognizedSubspecies | 180347; 180310; 552479 |
| itis_tsn_validSpeciesFromSubspecies | 997723 |
| itis_tsn_validSpeciesFromUnrecognizedSuppopulation | 552521 |
| itis_tsn_validSpeciesWithFormerName | 174024; 914103; 174238 |
| itis_tsn_validSubspeciesFromSpecies | 208992; 209163; 668673 |
| iucn_id_verified | 59125; 41634; 103798425 |
| nsid_acceptedMatch | 102298; 103207; 102640 |
| nsid_acceptedSpeciesConceptEncompassingUnrecognizedSpeciesConcept | 106286; 106231 |
| nsid_acceptedSpeciesFromRecognizedGenus | 102290; 101405; 103451 |
| nsid_acceptedSpeciesFromSubspecies | 902215 |
| nsid_acceptedSpeciesFromUnrecognizedSubspecies | 102073; 101780; 104212 |
| nsid_acceptedSupspeciesFromSpecies | 102277 |

Restructures the species habitat analysis script:

- Implement modular architecture with ScienceBaseClient, RasterSet, and
  HabitatDataProcessor classes for better maintainability
- Integrate sciencebasepy for ZIP-based habitat map downloads from USGS
  ScienceBase, with automatic cleanup of temporary files
- Add multi-format output support (CSV/Parquet/Arrow) with Arrow as default,
  using dictionary encoding for optimized storage and performance
- Enhance metadata by including species common and scientific names from
  ScienceBase API
- Add comprehensive CLI arguments for configuration and debug logging
- Improve robustness with better error handling and type annotations

dsmedia commented Feb 16, 2025

@dangotbanned would welcome your thoughts on whether I'm on the right track here.

@dangotbanned (Member) commented:

@dangotbanned would welcome your thoughts on whether I'm on the right track here.

Thanks @dsmedia

I've only skimmed through, but could you switch from argparse to configuring via a .toml?

You'd be able to remove the defaults and the parsing logic from the script - or at least just validate the options in one place

@dangotbanned (Member) commented:

@dsmedia everything appears to be working so far 🎉

[Image: rendered example output]

* Switch to using zipfile.Path for more Pythonic ZIP file handling
* Enforce expectation of exactly one TIF file per ZIP
* Add error handling for unexpected file counts
Refactors `species.py` to use TOML configuration (`_data/species.toml`)
instead of `argparse`, improving flexibility and maintainability.
Settings include `item_ids`, `vector_fp`, `output_dir`, `output_format`,
and `debug`. Relative paths (e.g., `../data/us-10m.json`) are resolved
relative to the TOML file, and basic validation is added.

`RasterSet.extract_tifs_from_zips()` now uses `zipfile.Path` and enforces
a single `.tif` file per ZIP, raising a `RuntimeError` otherwise.
Type hinting fixes are also included.
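
For reference, a hypothetical sketch of what `_data/species.toml` might contain based on the settings named above; the actual keys, structure, and values in the PR may differ:

```toml
item_ids = [
    "58fa3f0be4b0b7ea54524859", # American Bullfrog
    "58fa449fe4b0b7ea54524c5e", # American Robin
]
vector_fp = "../data/us-10m.json" # resolved relative to this TOML file
output_dir = "../data"
output_format = "arrow"           # one of: csv, parquet, arrow
debug = false
```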

dsmedia commented Feb 19, 2025

Currently this PR provides county-level percentages of year-round predicted habitat for four species. I'm considering expanding the dataset to include both predicted and known ranges across different seasons (summer-only, winter-only, and year-round) for the four species, matching the scope of the USGS report on this dataset. This would:

  1. Better represent the full richness of the USGS GAP species data
  2. Enable more complex faceted visualization examples
  3. Provide an engaging use case for Vega's interactive selections and cross-filtering across temporal and spatial dimensions

This expanded scope would also make the dataset more versatile for the broader Vega community beyond the original faceted map use case. Before formally proposing this expansion, I'll first generate a test dataset to evaluate size, performance, and data quality.

[Image: preview of a seasonal habitat/range visualization]


mattijn commented Feb 19, 2025

Thanks for being so sophisticated with this PR @dsmedia!

I'm personally not really up to date with the arrow format. If the source can be parsed into a dataframe or referenced by URL in the Vega editor, then it is perfect.

Currently the data is in row-oriented format, as that is the easiest way to facet the chart, but it means we pass the 5K-row threshold. If we store the data column-oriented we stay within the 5K-row limit, but there will be additional melt transforms in the specification (sketched after this comment). I'm leaning towards keeping it as is to simplify the chart specification.

I understand that you want to provide wider access to all options of the data source, but don't lose yourself in all available possibilities. What you currently have is really great already.

Once the referenced feature request vega/vega-lite#9389 is in, we can revisit this dataset to make a map like your screenshot, highlighting pixel-based seasonal differences for a single county or a few counties.

But if you do have great ideas to approach this, go for it. I just want to say that what you have is awesome already!
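
For concreteness, a sketch of the column-oriented alternative described above, with one column per species melted back to long form via a fold transform. The file and column names here are hypothetical:

```python
import altair as alt

# Hypothetical wide-format file with one habitat-percentage column per species.
wide_url = "https://example.com/species_wide.csv"

# transform_fold melts the per-species columns into key/value pairs at render
# time, recovering the long layout the faceted chart needs.
base = alt.Chart(wide_url).transform_fold(
    ["bullfrog_pct", "robin_pct", "deer_pct", "gartersnake_pct"],
    as_=["species", "habitat_pct"],
)
```

The long (row-oriented) layout the PR keeps avoids this extra transform, at the cost of exceeding the 5K-row threshold.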

- Rename species columns for consistency and clarity:
  - GAP_Species -> gap_species_code
  - percent_habitat -> habitat_yearround_pct
  - CommonName -> common_name
  - ScientificName -> scientific_name
- Expand module docstring with detailed information about:
  - Data sources and resolution
  - Projection details
  - Habitat value meanings
  - Output format options
- Improve code comments for future extensibility
- reference habitat data source in species.toml
- list alternative output format options in toml
The changes prepare for potential future addition of seasonal
habitat data (and summer/winter habitat data) while maintaining backward compatibility.

dsmedia commented Feb 20, 2025

I'm personally not really up to date with the arrow format. If the source can be parsed into a dataframe or referenced by URL in the Vega editor, then it is perfect.

Happy to revise to CSV if that is deemed best here. I'm curious, though, whether this could be a nice opportunity to create the first standalone dataset in arrow format in the repo. (flights-200k now exists in this format, but alongside a JSON variant.) The arrow format was suggested by @domoritz (along with CSV) if this was intended for use beyond Altair; the compactness of arrow, with its columnar format, seemed to give it the edge. @dangotbanned could perhaps speak more directly to the usability/suitability of the arrow format? I do see that the Altair docs indicate arrow is supported, though experimental:

as a DataFrame that supports the DataFrame Interchange Protocol (contains a dataframe attribute), e.g. polars and pyarrow. This is experimental.
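
In practice that experimental support means a pyarrow.Table can be handed straight to alt.Chart. A sketch with made-up values, assuming a recent Altair:

```python
import altair as alt
import pyarrow as pa

# pyarrow.Table implements the DataFrame Interchange Protocol, so Altair
# accepts it directly as a data source.
table = pa.table({"county_id": [1001, 1003], "habitat_yearround_pct": [0.12, 0.48]})
chart = alt.Chart(table).mark_bar().encode(x="county_id:N", y="habitat_yearround_pct:Q")
```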

I understand that you want to provide wider access to all options of the data source, but don't lose yourself in all available possibilities. What you currently have is really great already.

Once this referenced feature request is in, vega/vega-lite#9389 we can revisit this dataset to make a map like your screenshot to highlight pixel based seasonal differences for a single or few counties.

But if you do have great ideas to approach this, go for it. I just want to say that what you have is awesome already!

This is great perspective. In aa4516f I've kept it as is (with the all-season habitat data only) but reformulated the column names to permit a backward-compatible update in the future. (The data column is now called habitat_yearround_pct, so it would be just a drop-in to add habitat_summer_pct and/or range_winter_pct, for example.)

Separately, @mattijn, exactextract is generating this warning in the terminal.

(RuntimeWarning: Spatial reference system of input features does not exactly match raster)

If harmless, could you help explain/document why this happens (and what it means practically), or can we revise the code to address the conditions causing the warning?
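
One possible revision (a sketch only, assuming the geopandas/rasterio stack the script already uses; the paths here are hypothetical) would be to reproject the county polygons to the raster's exact CRS before extraction:

```python
import geopandas as gpd
import rasterio

# Read the raster's CRS, then reproject the vector layer to match it exactly,
# so exactextract no longer sees mismatched spatial reference systems.
with rasterio.open("habitat.tif") as src:
    raster_crs = src.crs

counties = gpd.read_file("counties.geojson").to_crs(raster_crs)
```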

@dsmedia dsmedia marked this pull request as ready for review February 20, 2025 13:00
@dsmedia dsmedia requested a review from mattijn February 20, 2025 13:03
@dsmedia dsmedia self-assigned this Feb 20, 2025

dangotbanned commented Feb 21, 2025

Data Format Comparison

File Sizes

* species.csv: 953.5 KB

* species.arrow: 651.6 KB

* species.parquet: 82.7 KB

DataFrame Shapes (rows, columns)

* CSV: (11651, 6)

* Arrow: (11651, 6)

* Parquet: (11651, 6)

Perfect, thanks @dsmedia

In that case, a .csv is still very much within the range (by file size) of some of the larger datasets.
If you were to pick only one format, I would pick that for compatibility.

Providing multiple formats will also be a welcome choice after (vega/altair#3631), since those can be prioritised where a backend supports it 😊

Having all 3 also makes sense, now that (vega/vega#3961) looks achievable
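
For reproducibility, a rough sketch of how a comparison like the above could be generated; the paths are assumptions, and pandas reads the Arrow IPC file via read_feather:

```python
from pathlib import Path
import pandas as pd

readers = {
    "data/species.csv": pd.read_csv,
    "data/species.arrow": pd.read_feather,   # Arrow IPC file format
    "data/species.parquet": pd.read_parquet,
}
for path, read in readers.items():
    df = read(path)
    print(f"{path}: {Path(path).stat().st_size / 1024:.1f} KB, shape={df.shape}")
```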


dsmedia commented Feb 21, 2025

Great. I'll update the PR to switch to CSV. When support for the other formats is enabled, we can consider generating parquet and/or arrow for a wider swath of the datasets.


dsmedia commented Feb 21, 2025

@dangotbanned in the diff, I noticed that the string "git+" seems to be appended below in the datapackage.md and datapackage.json files, in a place unrelated to the species dataset. Would you be able to have a quick look to see what's going on there?

[Screenshot: diff showing "git+" appended in datapackage.md and datapackage.json]

@dangotbanned (Member) commented:

@dangotbanned in the diff, I noticed that the string "git+" seems to be appended below in the datapackage.md and datapackage.json files, in a place unrelated to the species dataset. Would you be able to have a quick look to see what's going on there?

Looks like it's pulling through the new URL since https://github.com/vega/vega-datasets/pull/682/files#diff-7ae45ad102eab3b6d7e7896acd08c427a9b25b346470d7bc6507b6481575d519

@domoritz

@domoritz (Member) commented:

I added it since it seems to be recommended by https://publint.dev/[email protected]

[Screenshot: publint recommendation]


mattijn commented Feb 22, 2025

Nice that you added a section on validation in the OP! Cool to have this dataset to display the suitable habitat fractions by county! I did a quick check of this section. You currently state:

Darker colors indicate a higher percentage of the county is classified as suitable habitat, while lighter colors indicate a lower percentage.

But where you write "Darker" it should be "Lighter", and where you write "lighter" it should be "darker".

In the spec you currently have:

alt.Tooltip('county_id:N', title='County ID'),

which gives null for counties that are not included in filtered_table, the dataset from which you look up the suitable habitat fractions:

[Screenshot: tooltip showing a null County ID]

but it is better to do

alt.Tooltip('id:N', title='County ID'),

Since the id is within the counties dataset that contains all counties:
[Screenshot: tooltip showing the county id]

Apart from that all good!

Comment on lines 616 to 642
# --- Basic Configuration Validation ---
if not isinstance(item_ids, list) or not item_ids:
    logger.error("`item_ids` in the TOML file must be a non-empty list.")
    msg = "`item_ids` must be a non-empty list."
    raise TypeError(msg)

if not isinstance(vector_fp, str | Path):
    logger.error("`vector_fp` must be a string or Path.")
    msg = "`vector_fp` must be a string or Path."
    raise TypeError(msg)
vector_fp = Path(vector_fp)

if not isinstance(output_dir, str | Path):
    logger.error("`output_dir` must be a string or Path.")
    msg = "`output_dir` must be a string or Path."
    raise TypeError(msg)
output_dir = Path(output_dir)

if output_format not in {"csv", "parquet", "arrow"}:
    logger.error("Invalid `output_format`: %s", output_format)
    msg = f"Invalid `output_format`: {output_format}"
    raise ValueError(msg)

if not isinstance(debug, bool):
    logger.error("`debug` must be a boolean.")
    msg = "`debug` must be a boolean."
    raise TypeError(msg)
@dangotbanned (Member) commented:

I've not seen this pattern before of both logging and raising an error.
@dsmedia I think just raising the error would be enough?

@dsmedia (Collaborator Author) replied:

simplified now - thank you

dsmedia added a commit to vega/altair that referenced this pull request Feb 22, 2025
This commit introduces a new example to the Altair documentation showcasing choropleth maps faceted by category.

The example visualizes the distribution of suitable habitat for different species across US counties, using the proposed new Species Habitat dataset from vega-datasets (vega/vega-datasets#684).

Key features of this example:

- Demonstrates the use of `alt.Chart.mark_geoshape()` for geographical visualizations.
- Shows how to create faceted maps for comparing categorical data across geographic regions.
- Utilizes the `transform_lookup` and `transform_calculate` transforms for data manipulation within Altair.
- Uses a CSV data file temporarily hosted in the vega-datasets repository branch (pending dataset merge).

This example addresses issue #1711, which requested a faceted map example for the Altair documentation.

Co-authored-by: Mattijn van Hoek <[email protected]>
@dangotbanned (Member) commented:

@dsmedia sorry I haven't had a chance to get to this properly yet

If everyone else is happy, don't let me be the blocker 👍
Left some quick comments and fixed the docs a bit in (9a5d0c2)

dsmedia and others added 2 commits February 22, 2025 16:57
Co-authored-by: Dan Redding <[email protected]>
- Remove redundant error logging before raising exceptions
Modify species dataset processing to explicitly set Continental US counties
to 0% habitat when unassigned, while keeping Alaska and Hawaii counties as
null values. This simplifies Altair visualization by removing the need for
transform_filter and allows for better projection choices.
@dsmedia dsmedia marked this pull request as draft February 26, 2025 11:14
dsmedia and others added 3 commits February 27, 2025 12:30
This commit enhances the GAP habitat analysis script to focus specifically on coterminous (contiguous) US counties by implementing a geographic filtering system based on FIPS codes.

## Changes:
- Added geographic filtering configuration in TOML to exclude Alaska, Hawaii, Puerto Rico, and Virgin Islands
- Refactored county data loading into smaller, specialized methods for better maintainability
- Modified habitat analysis to explicitly handle coterminous US counties, filling missing values with zeros
- Updated documentation to clarify the geographic scope of the analysis
- Improved error handling and logging, especially for county boundary processing
- Changed default output format from arrow to csv

## Technical Details:
- The `HabitatDataProcessor` class now accepts a full configuration object
- Added `_filter_to_coterminous_us` method to process excluded areas based on FIPS codes
- Enhanced `save_results` to maintain complete county coverage with consistent data formatting
- Configuration now includes a structured `[processing.geographic_filter]` section

This change ensures consistent habitat analysis across the coterminous United States by standardizing which areas are included in the analysis.
Use `uv run --group="geo-species" scripts/species.py`
pyproject.toml Outdated
Comment on lines 15 to 25
[dependency-groups]
dev = ["ipython[kernel]>=8.30.0", "ruff>=0.8.2", "taplo>=0.9.3"]
geo-species = [
    "exactextract",
    "geopandas",
    "pandas[pyarrow]",
    "rasterio",
    "requests",
    "sciencebasepy",
    "setuptools",
]
@dangotbanned (Member) commented:

Tried running the script and was a bit alarmed by the install size.
Definitely a case where I wouldn't want to install that on each run - so following (1415aae) they're now locked.

Use the command below to run the script:

uv run --group="geo-species" scripts/species.py

Comment on lines +109 to +133
for item_id in item_ids:
    try:
        item_json = self.sb.get_item(item_id)
        files_info = self.sb.get_item_file_info(item_json)

        for file_info in files_info:
            if "HabMap" in file_info["name"] and file_info["name"].endswith(".zip"):
                zip_path = temp_dir / file_info["name"]
                self.sb.download_file(file_info["url"], str(zip_path))
                downloaded_zips.append(zip_path)

    except Exception as e:  # Catch generic Exception
        if isinstance(e, requests.exceptions.RequestException):
            logger.error("Error downloading files for item ID %s: %s", item_id, e)
        else:
            logger.error("An unexpected error occurred for item ID %s: %s", item_id, e)
        continue  # Go to the next item_id

return sorted(downloaded_zips)  # ALWAYS return the list
@dangotbanned (Member) commented Feb 27, 2025:

Been stuck here for a while with no progress.

System usage seems way too high to be downloading

[Screenshot: system resource usage]

Logs

PS C:\Users\danie\Documents\GitHub\vega-datasets> uv run --group="geo-species" scripts/species.py
Installed 1 package in 24ms
2025-02-27 16:52:39 [INFO] Initializing GAP habitat analysis pipeline
2025-02-27 16:52:39 [INFO] Loading county boundaries from local file: C:\Users\danie\Documents\GitHub\vega-datasets\data\us-10m.json
2025-02-27 16:52:40 [INFO] Successfully loaded county data with these columns: ['id', 'geometry']
2025-02-27 16:52:40 [INFO] Filtering out 4 areas by FIPS code to focus on conterminous US
2025-02-27 16:52:40 [INFO] Excluding 27 counties with FIPS prefix 02 (Alaska)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 02013, 02016, 02020, 02050, 02060, 02068, 02070, 02090, 02100, 02110, 02122, 02130, 02150, 02164, 02170, 02180, 02185, 02188, 02201, 02220, 02232, 02240, 02261, 02270, 02280, 02282, 02290
2025-02-27 16:52:40 [INFO] Excluding 4 counties with FIPS prefix 15 (Hawaii)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 15001, 15003, 15007, 15009
2025-02-27 16:52:40 [INFO] Excluding 74 counties with FIPS prefix 72 (Puerto Rico)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 72001, 72003, 72005, 72007, 72009, 72011, 72013, 72017, 72019, 72021, 72023, 72025, 72027, 72029, 72031, 72035, 72037, 72039, 72041, 72043, 72045, 72047, 72049, 72051, 72053, 72054, 72055, 72057, 72059, 72061, 72063, 72065, 72069, 72071, 72073, 72075, 72077, 72079, 72081, 72083, 72085, 72087, 72089, 72091, 72093, 72095, 72097, 72099, 72101, 72103, 72105, 72107, 72109, 72111, 72113, 72115, 72119, 72121, 72123, 72125, 72127, 72129, 72131, 72133, 72135, 72137, 72139, 72141, 72143, 72145, 72147, 72149, 72151, 72153
2025-02-27 16:52:40 [INFO] Excluding 3 counties with FIPS prefix 78 (Virgin Islands)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 78010, 78020, 78030
2025-02-27 16:52:40 [INFO] Analyzing 3105 counties in conterminous US
2025-02-27 16:52:40 [WARNING] Removed 4 counties with invalid boundaries
2025-02-27 16:52:40 [INFO] Step 1/3: Downloading and extracting habitat data files
2025-02-27 16:52:40 [INFO] Retrieving species information from ScienceBase
2025-02-27 16:52:41 [INFO] Downloading habitat data files
downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to C:\Users\danie\AppData\Local\Temp\tmpnfbi0fvb\bAMROx_CONUS_HabMap_2001v1.zip

Feeling like this might point towards the issue https://docs.astral.sh/ruff/rules/blind-except/

@dangotbanned (Member) commented:

Traceback on CTRL+C

Traceback (most recent call last):
  File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 791, in <module>
    main()
  File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 787, in main
    processor.run()
  File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 702, in run
    tif_files, species_info = self.process_habitat_data(temp_dir)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 470, in process_habitat_data
    zip_files = self.sciencebase_client.download_zip_files(self.item_ids, temp_dir)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 119, in download_zip_files
    self.sb.download_file(file_info["url"], str(zip_path))
  File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\sciencebasepy\SbSession.py", line 919, in download_file
    for chunk in response.iter_content(chunk_size=1024):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\requests\models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\urllib3\response.py", line 967, in stream
    data = self.read(amt=amt, decode_content=decode_content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\urllib3\response.py", line 879, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\urllib3\response.py", line 792, in _raw_read
    with self._error_catcher():
         ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\AppData\Roaming\uv\python\cpython-3.12.8-windows-x86_64-none\Lib\contextlib.py", line 141, in __exit__
    def __exit__(self, typ, value, traceback):

KeyboardInterrupt

@dangotbanned (Member) commented:

@dsmedia I've added the rule, but not fixed it (https://github.com/vega/vega-datasets/actions/runs/13571913413/job/37938835261?pr=684)

Was hoping to review in more detail - but I can't get beyond this step sadly 😞

@mattijn commented:

For me it works fine, downloading data in ~2 minutes and analysis in ~11 minutes:

uv run --group="geo-species" scripts/species.py
Using CPython 3.12.9 interpreter at: /Users/mattijnvanhoek/miniconda3/envs/stable/bin/python
Creating virtual environment at: .venv
   Built stringcase==1.2.0
   Built progress==1.6
   Built petl==1.7.15
Installed 104 packages in 352ms
2025-02-27 21:32:06 [INFO] Initializing GAP habitat analysis pipeline
2025-02-27 21:32:06 [INFO] Loading county boundaries from local file: /Users/mattijnvanhoek/vega/vega-datasets/data/us-10m.json
2025-02-27 21:32:17 [INFO] Successfully loaded county data with these columns: ['id', 'geometry']
2025-02-27 21:32:17 [INFO] Filtering out 4 areas by FIPS code to focus on conterminous US
2025-02-27 21:32:17 [INFO] Excluding 27 counties with FIPS prefix 02 (Alaska)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 02013, 02016, 02020, 02050, 02060, 02068, 02070, 02090, 02100, 02110, 02122, 02130, 02150, 02164, 02170, 02180, 02185, 02188, 02201, 02220, 02232, 02240, 02261, 02270, 02280, 02282, 02290
2025-02-27 21:32:17 [INFO] Excluding 4 counties with FIPS prefix 15 (Hawaii)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 15001, 15003, 15007, 15009
2025-02-27 21:32:17 [INFO] Excluding 74 counties with FIPS prefix 72 (Puerto Rico)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 72001, 72003, 72005, 72007, 72009, 72011, 72013, 72017, 72019, 72021, 72023, 72025, 72027, 72029, 72031, 72035, 72037, 72039, 72041, 72043, 72045, 72047, 72049, 72051, 72053, 72054, 72055, 72057, 72059, 72061, 72063, 72065, 72069, 72071, 72073, 72075, 72077, 72079, 72081, 72083, 72085, 72087, 72089, 72091, 72093, 72095, 72097, 72099, 72101, 72103, 72105, 72107, 72109, 72111, 72113, 72115, 72119, 72121, 72123, 72125, 72127, 72129, 72131, 72133, 72135, 72137, 72139, 72141, 72143, 72145, 72147, 72149, 72151, 72153
2025-02-27 21:32:17 [INFO] Excluding 3 counties with FIPS prefix 78 (Virgin Islands)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 78010, 78020, 78030
2025-02-27 21:32:17 [INFO] Analyzing 3105 counties in conterminous US
2025-02-27 21:32:40 [WARNING] Removed 4 counties with invalid boundaries
2025-02-27 21:32:40 [INFO] Step 1/3: Downloading and extracting habitat data files
2025-02-27 21:32:40 [INFO] Retrieving species information from ScienceBase
2025-02-27 21:32:42 [INFO] Downloading habitat data files
downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/bAMROx_CONUS_HabMap_2001v1.zip
downloading https://www.sciencebase.gov/catalog/file/get/58fa817ae4b0b7ea54525c2f?f=__disk__57%2F6f%2F3b%2F576f3bb4fe3167204a20961184da3f2aa9440b52 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/mWTDEx_CONUS_HabMap_2001v1.zip
downloading https://www.sciencebase.gov/catalog/file/get/58fa3f0be4b0b7ea54524859?f=__disk__81%2F26%2F6f%2F81266f6d642fe5a03255f19055afd9d4d169a06d to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/aAMBUx_CONUS_HabMap_2001v1.zip
downloading https://www.sciencebase.gov/catalog/file/get/58fe0f4fe4b0074928294636?f=__disk__8d%2F6b%2F91%2F8d6b91c206bafddac994009aa7a2ce5ef2d44066 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/rCOGAx_CONUS_HabMap_2001v1.zip
2025-02-27 21:34:33 [INFO] Extracting raster files
2025-02-27 21:34:34 [INFO] Step 2/3: Beginning habitat analysis for 4 species
/Users/mattijnvanhoek/vega/vega-datasets/.venv/lib/python3.12/site-packages/exactextract/exact_extract.py:330: RuntimeWarning: Spatial reference system of input features does not exactly match raster.
  warnings.warn(
Processing 45019: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0/100 [10:56<00:00,  6.57s/it]
2025-02-27 21:45:37 [INFO] Step 3/3: Saving analysis results
2025-02-27 21:45:37 [INFO] Analysis complete. Results saved to: /Users/mattijnvanhoek/vega/vega-datasets/data

And with vega/altair@f1a1b5a I was able to reduce the chart specifications by around 10 lines for the corresponding charts.

@dangotbanned (Member) commented:

For me it works fine, downloading data in ~2 minutes and analysis in ~11 minutes:

Hmm 🤔

@mattijn maybe windows is the issue?

Yours

downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/bAMROx_CONUS_HabMap_2001v1.zip

Mine

downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to C:\Users\danie\AppData\Local\Temp\tmp2jvviidr\bAMROx_CONUS_HabMap_2001v1.zip

@dangotbanned (Member) commented:

It was working for me back in (9a778c6)

#684 (comment)

@dangotbanned (Member) commented Feb 27, 2025:

Changed some things around and it seems to have created the file, but not filled it:

[Screenshot: the created but empty ZIP file]

self.sb.download_file(file_info["url"], str(zip_path))

https://github.com/DOI-USGS/sciencebasepy/blob/329192d6a5d5bb9039dc35441dede4ff084b03a9/sciencebasepy/SbSession.py#L912

@dsmedia (Collaborator Author) commented:

It was working for me back in (9a778c6)

#684 (comment)

The same thing happened to me earlier today, and when checking I could see that the download speed had been throttled significantly. If we can add some kind of counter, it should at least show that progress is being made (a rough sketch follows).
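
Something like this untested sketch could wrap the download with a byte counter instead of SbSession.download_file; the names here are illustrative:

```python
import requests

def download_with_progress(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Stream a file to disk, printing a running byte count."""
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        received = 0
        with open(dest, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                received += len(chunk)
                print(f"\r{dest}: {received / 1e6:.1f} MB", end="")
    print()
```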


dsmedia commented Mar 2, 2025

Issue with exactextract installation

I encountered an installation error when trying to run the script with uv run --group="geo-species" scripts/species.py. The error occurs because:

  1. The current release of exactextract (0.2.0) has its URLs defined in the wrong section of its pyproject.toml file
  2. When installing with newer build tools (like scikit-build-core 0.11.0, released just 2 days ago), this causes a validation error:
    scikit_build_core._vendor.pyproject_metadata.errors.ConfigurationError: Extra keys present in "project": 'url'
    

This has been fixed in a PR that was already merged (isciences/exactextract#155), but the fix is only available in the development version (one possible workaround is sketched below).
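
One untested possibility, pending a fixed release, would be pointing the geo-species dependency group at the development version:

```toml
geo-species = [
    "exactextract @ git+https://github.com/isciences/exactextract",
    # ...remaining geo-species dependencies unchanged
]
```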

Labels: dataset-content (pull requests that add, modify, or update dataset contents)