feat: Add Species Habitat Dataset for Faceted Map Examples #684
base: main
Conversation
- Replace direct URL downloads with ScienceBase API client (sciencebasepy)
- Remove niquests dependency in favor of native ScienceBase download handling
- Enhance logging for better debugging of file downloads and extractions
- Add explicit ZIP file cleanup after TIF extraction
- Update dependencies to include sciencebasepy and setuptools
Just as a reference for development, this will generate a complete list of species and all species-level metadata.
Generation script:
# /// script
# requires-python = ">=3.8"
# dependencies = [
# "pandas",
# "sciencebasepy",
# "tqdm",
# "requests",
# ]
# ///
"""
This script retrieves and processes species identifier data from ScienceBase.
It fetches all child items under a specified parent item ID, extracts identifiers
like ECOS and ITIS codes, and compiles the data into a CSV file. The script
utilizes parallel processing to efficiently handle a large number of items and
includes error handling and retry mechanisms for robustness.
"""
import pandas as pd
from sciencebasepy import SbSession
from collections import defaultdict
from tqdm import tqdm
import concurrent.futures
import time
import requests # Import the requests library
def get_all_item_ids(parent_id: str = "527d0a83e4b0850ea0518326") -> list[str]:
"""
Retrieves all child item IDs from a given ScienceBase parent item.
This function efficiently fetches all item IDs that are children of the
specified parent item ID. It is optimized for performance by using the
`get_child_ids` method from the ScienceBase API.
Args:
parent_id: The ScienceBase item ID of the parent item.
Defaults to "527d0a83e4b0850ea0518326" (Habitat Map parent item).
Returns:
A list of strings, where each string is a ScienceBase item ID.
Returns an empty list if there are errors or no child items found.
"""
print("Retrieving species item IDs from ScienceBase...")
sb = SbSession()
try:
parent_item = sb.get_item(parent_id)
print(f"Found parent item: '{parent_item.get('title', 'No Title')}'")
all_ids = list(sb.get_child_ids(parent_id)) # Efficiently get all child IDs
print(f"Found {len(all_ids)} species items.")
return all_ids
except Exception as e:
print(f"Error retrieving items from ScienceBase: {e}")
import traceback
traceback.print_exc() # Print full traceback for debugging
return []
def process_single_item(item_id: str, sb: SbSession) -> dict:
"""
Processes a single ScienceBase item to extract relevant identifier data.
This function fetches the JSON representation of a ScienceBase item, extracts
the title and identifiers (like ECOS, ITIS), and returns the data as a dictionary.
It includes error handling and retry logic for network issues, particularly
HTTP errors.
Args:
item_id: The ScienceBase item ID to process.
sb: An authenticated SbSession object for interacting with ScienceBase.
Returns:
A dictionary containing the extracted item data, including 'item_id',
'title', and any identifiers found (e.g., 'ECOS', 'ITIS').
Returns None if the item could not be processed due to errors,
after retries in case of transient network issues, or if the item is not found (404).
"""
try:
item_json = sb.get_item(item_id)
title = item_json.get('title', 'Unknown Title')
item_data = defaultdict(lambda: 'Not Available') # Default value for missing identifiers
item_data['item_id'] = item_id
item_data['title'] = title
if 'identifiers' in item_json:
for identifier in item_json['identifiers']:
scheme = identifier.get('scheme', 'Unknown Scheme')
key = identifier.get('key', 'No Value')
clean_scheme = scheme.split('/')[-1] if '/' in scheme else scheme # Extract last part of scheme
item_data[clean_scheme] = key
return dict(item_data)
except requests.exceptions.HTTPError as e:
print(f"\nHTTPError processing item {item_id}: {e}")
if e.response.status_code == 404:
print(f"Item {item_id} not found (404 error). Skipping.")
return None # Indicate item not found
else:
retries = 3
for i in range(retries):
try:
print(f"Retrying item {item_id} (attempt {i+1}/{retries})...")
time.sleep(2**i) # Exponential backoff
item_json = sb.get_item(item_id) # Retry fetching the item
title = item_json.get('title', 'Unknown Title')
item_data = defaultdict(lambda: 'Not Available')
item_data['item_id'] = item_id
item_data['title'] = title
if 'identifiers' in item_json:
for identifier in item_json['identifiers']:
scheme = identifier.get('scheme', 'Unknown Scheme')
key = identifier.get('key', 'No Value')
clean_scheme = scheme.split('/')[-1] if '/' in scheme else scheme
item_data[clean_scheme] = key
return dict(item_data)
except requests.exceptions.RequestException:
if i == retries - 1:
print(f"Failed to retrieve item {item_id} after multiple retries.")
return None # Indicate failure after retries
except Exception as e:
print(f"\nError processing item {item_id}: {e}")
return None # Indicate processing failure
def explore_species_identifiers_parallel(item_ids: list[str], max_workers: int = 10) -> pd.DataFrame:
"""
Explores and extracts identifiers from ScienceBase items in parallel.
This function processes a list of ScienceBase item IDs concurrently using
a thread pool executor. It fetches data for each item using the
`process_single_item` function and aggregates the results into a pandas DataFrame.
Args:
item_ids: A list of ScienceBase item IDs to process.
max_workers: The maximum number of threads to use for parallel processing.
Adjust this value based on your system and network to optimize
performance without overloading resources.
Returns:
A pandas DataFrame where each row represents a ScienceBase item,
and columns include 'item_id', 'title', and identifier schemes
(e.g., 'ECOS', 'ITIS'). Returns an empty DataFrame if no data is processed.
"""
sb = SbSession() # Create a single session for all threads
all_schemes = set()
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [executor.submit(process_single_item, item_id, sb) for item_id in item_ids]
for future in tqdm(concurrent.futures.as_completed(futures), total=len(item_ids), desc="Processing species items"):
result = future.result()
if result:
results.append(result)
all_schemes.update(key for key in result if key not in ('item_id', 'title')) # Update schemes efficiently
if not results:
return pd.DataFrame() # Return empty DataFrame if no results
df = pd.DataFrame(results)
# Ensure all identifier scheme columns are present and reorder
for scheme in all_schemes:
if scheme not in df.columns:
df[scheme] = 'Not Available'
cols = ['item_id', 'title'] + sorted(list(all_schemes))
df = df[cols]
return df
def main():
"""
Main function to execute the species identifier exploration script.
This function orchestrates the process of:
1. Retrieving all species item IDs from ScienceBase.
2. Processing these items in parallel to extract identifiers.
3. Generating summary statistics of the extracted data.
4. Saving the results to a CSV file named 'all_species_identifiers.csv'.
5. Displaying the first few rows of the resulting DataFrame.
"""
all_ids = get_all_item_ids() # Get all species item IDs
if not all_ids:
print("No species IDs found. Exiting.")
return
print("\nProcessing species item identifiers...")
df = explore_species_identifiers_parallel(all_ids, max_workers=10) # Process in parallel
if df.empty:
print("No data was processed successfully. Exiting.")
return
print("\nIdentifier Summary:")
print(f"Total species items processed: {len(df)}")
print("\nIdentifier columns found and count of non 'Not Available' values:")
for col in df.columns:
# Corrected line to exclude NaN values from the count
non_empty_count = df[col].ne('Not Available').sum() - df[col].isna().sum()
print(f"- {col}: {non_empty_count} items have this identifier")
output_file = 'all_species_identifiers.csv'
df.to_csv(output_file, index=False)
print(f"\nResults saved to '{output_file}'")
print("\nFirst 5 rows of the data:")
pd.set_option('display.max_columns', None) # Display all columns
pd.set_option('display.max_colwidth', None) # Display full column width
print(df.head()) # Replaced display(df.head()) with print()
if __name__ == "__main__":
    main()

USGS report with domain-specific explanations
💡Note: the initial character of GAP_SpeciesCode indicates the taxon: a (amphibians), b (birds), m (mammals), r (reptiles).
💡Note: The dataset includes 1,590 species and 129 subspecies. The hierarchical relationship between species and subspecies can be inferred from the final character of GAP_SpeciesCode. For example, mPASHc and mPASHp refer to Sorex pacificus cascadensis and Sorex pacificus pacificus, respectively, which are part of mPASHx, Pacific Shrew. The species parent should contain the habitat locations for all of its subspecies.
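A minimal sketch of how the taxon prefix and subspecies suffix described in these notes could be parsed out of `gap_species_code`; the frame and column names are illustrative, not part of the PR's script.

```python
import pandas as pd

# Illustrative frame; in practice this would be the CSV produced by the script above.
df = pd.DataFrame({"gap_species_code": ["mPASHc", "mPASHp", "mPASHx", "bAMROx"]})

# First character encodes the taxon (a/b/m/r, per the note above).
taxon_map = {"a": "amphibians", "b": "birds", "m": "mammals", "r": "reptiles"}
df["taxon"] = df["gap_species_code"].str[0].map(taxon_map)

# Final character distinguishes subspecies; 'x' marks the species-level parent.
df["is_subspecies"] = df["gap_species_code"].str[-1].ne("x")
df["parent_species_code"] = df["gap_species_code"].str[:-1] + "x"

print(df)
```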
Restructures the species habitat analysis script:
- Implement modular architecture with ScienceBaseClient, RasterSet, and HabitatDataProcessor classes for better maintainability
- Integrate sciencebasepy for ZIP-based habitat map downloads from USGS ScienceBase, with automatic cleanup of temporary files
- Add multi-format output support (CSV/Parquet/Arrow) with Arrow as default, using dictionary encoding for optimized storage and performance
- Enhance metadata by including species common and scientific names from ScienceBase API
- Add comprehensive CLI arguments for configuration and debug logging
- Improve robustness with better error handling and type annotations
@dangotbanned would welcome your thoughts on whether I'm on the right track here.
Thanks @dsmedia I've only skimmed through, but could you switch from `argparse` to a config file (e.g. TOML)? You'd be able to remove the defaults and the parsing logic from the script - or at least just validate the options in one place
@dsmedia everything appears to be working so far 🎉
- Switch to using zipfile.Path for more Pythonic ZIP file handling
- Enforce expectation of exactly one TIF file per ZIP
- Add error handling for unexpected file counts
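A minimal sketch of what this `zipfile.Path` approach might look like; the function name, the RuntimeError message, and the assumption that the TIF sits at the ZIP root are illustrative, not the PR's exact code.

```python
import shutil
import zipfile
from pathlib import Path


def extract_single_tif(zip_fp: Path, out_dir: Path) -> Path:
    """Extract the one expected .tif from a habitat-map ZIP, erroring otherwise."""
    # iterdir() only lists top-level entries, so this assumes the TIF is not nested.
    tifs = [p for p in zipfile.Path(zip_fp).iterdir() if p.name.lower().endswith(".tif")]
    if len(tifs) != 1:
        msg = f"Expected exactly one TIF in {zip_fp.name}, found {len(tifs)}"
        raise RuntimeError(msg)
    target = out_dir / tifs[0].name
    # zipfile.Path members expose open(); stream the bytes out to disk.
    with tifs[0].open("rb") as src, target.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```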
Refactors `species.py` to use TOML configuration (`_data/species.toml`) instead of `argparse`, improving flexibility and maintainability. Settings include `item_ids`, `vector_fp`, `output_dir`, `output_format`, and `debug`. Relative paths (e.g., `../data/us-10m.json`) are resolved relative to the TOML file, and basic validation is added. `RasterSet.extract_tifs_from_zips()` now uses `zipfile.Path` and enforces a single `.tif` file per ZIP, raising a `RuntimeError` otherwise. Type hinting fixes are also included.
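Roughly how such a TOML config could be loaded, with relative paths resolved against the TOML file's location. This is a hedged sketch, not the PR's loader: it assumes the settings sit at the top level of `_data/species.toml`, and `tomllib` requires Python 3.11+ (the `tomli` backport offers the same API on older versions).

```python
import tomllib  # Python 3.11+; use `tomli` as a drop-in on older interpreters
from pathlib import Path


def load_config(toml_fp: Path) -> dict:
    """Read species.toml and resolve path-like settings relative to the TOML file."""
    with toml_fp.open("rb") as f:
        cfg = tomllib.load(f)
    base = toml_fp.parent
    # e.g. "../data/us-10m.json" becomes an absolute path anchored next to _data/
    for key in ("vector_fp", "output_dir"):
        if key in cfg:
            cfg[key] = (base / cfg[key]).resolve()
    return cfg


config = load_config(Path("_data/species.toml"))
```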
Currently this PR provides county-level percentages of year-round predicted habitat for four species. I'm considering expanding the dataset to include both predicted and known ranges across different seasons (summer-only, winter-only, and year-round) for the four species, matching the scope of the USGS report on this dataset.
This expanded scope would also make the dataset more versatile for the broader Vega community beyond the original faceted map use case. Before formally proposing this expansion, I'll first generate a test dataset to evaluate size, performance, and data quality.
Thanks for being so sophisticated with this PR @dsmedia! I'm personally not really up to date with the arrow format. If the source can be parsed into a dataframe or referenced by url in the Vega editor then it is perfect. Currently the data is in row-oriented format as it is the easiest way to facet the chart, but it means we pass the threshold of 5K rows. If we store the data column-oriented we stay within the 5K row limit, but it means there will be additional melt transforms in the specification. I'm leaning towards keeping it as is to simplify the chart specification. I understand that you want to provide wider access to all options of the data source, but don't lose yourself in all available possibilities. What you currently have is really great already. Once the referenced feature request vega/vega-lite#9389 is in, we can revisit this dataset to make a map like your screenshot to highlight pixel-based seasonal differences for a single or few counties. But if you do have great ideas to approach this, go for it. I just want to say that what you have is awesome already!
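To make the row- versus column-oriented trade-off concrete, here's a hedged sketch: the long (row-oriented) shape facets directly but its row count multiplies with the number of species, while the wide shape stays smaller per row count and would need a fold/melt in the chart spec. Column names are illustrative.

```python
import pandas as pd

# Long / row-oriented: one row per (county, species) pair -- facets directly,
# but the row count scales with counties x species.
long_df = pd.DataFrame({
    "county_id": ["01001", "01001", "01003", "01003"],
    "common_name": ["American Bullfrog", "White-tailed Deer"] * 2,
    "habitat_yearround_pct": [12.5, 80.1, 3.2, 75.4],
})

# Wide / column-oriented: one row per county, one column per species.
wide_df = long_df.pivot(index="county_id", columns="common_name",
                        values="habitat_yearround_pct").reset_index()

# Going back to long is essentially the melt/fold the chart spec would have to do.
melted = wide_df.melt(id_vars="county_id", var_name="common_name",
                      value_name="habitat_yearround_pct")
```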
- Rename species columns for consistency and clarity:
  - GAP_Species -> gap_species_code
  - percent_habitat -> habitat_yearround_pct
  - CommonName -> common_name
  - ScientificName -> scientific_name
- Expand module docstring with detailed information about:
  - Data sources and resolution
  - Projection details
  - Habitat value meanings
  - Output format options
- Improve code comments for future extensibility
- Reference habitat data source in species.toml
- List alternative output format options in toml

The changes prepare for potential future addition of seasonal habitat data (and summer/winter habitat data) while maintaining backward compatibility.
Happy to revise to csv if that is deemed best here. I'm curious though if this could be a nice opportunity to create the first standalone dataset in arrow format in the repo.
This is great perspective. In aa4516f I've kept it as is (with the all-season habitat data only) but reformulated the column names to permit a backward-compatible update in the future. (The data column is now called habitat_yearround_pct.) Separately, @mattijn, exactextract is generating this warning in the terminal.
If harmless, could you help explain/document why this happens (and what it means practically), or can we revise the code to address the conditions causing the warning?
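For context on that warning: it typically just means the vector layer's CRS string doesn't exactly match the raster's. A hedged sketch of one way to silence it by reprojecting the counties to the raster's CRS before the zonal step (not necessarily how the PR's script is structured; paths are illustrative):

```python
import geopandas as gpd
import rasterio

# Illustrative inputs: the county boundaries and one extracted HabMap TIF.
counties = gpd.read_file("data/us-10m.json")
if counties.crs is None:
    counties = counties.set_crs("EPSG:4326")  # assumption: lon/lat if no CRS is declared

with rasterio.open("aAMBUx_CONUS_HabMap_2001v1.tif") as src:
    raster_crs = src.crs  # GAP habitat maps are typically in an Albers equal-area projection

# Reproject the polygons so exactextract sees exactly matching spatial references.
counties_matched = counties.to_crs(raster_crs)
```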
Perfect, thanks @dsmedia In that case, a CSV sounds like the right call. Providing multiple formats will also be a welcome choice after (vega/altair#3631), since those can be prioritised where a backend supports it 😊 Having all 3 also makes sense, now that (vega/vega#3961) looks achievable
Great. I'll update the PR to switch to CSV. When support for the other formats is enabled, we can consider generating parquet and/or arrow for a wider swath of the datasets.
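For later reference, emitting all three formats from the same frame is a small step. A hedged sketch using pandas/pyarrow (file names and the choice of dictionary-encoded columns are illustrative):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.read_csv("data/species.csv", dtype={"county_id": "string"})

# Categorical dtypes become Arrow dictionary encoding, keeping the repeated
# species strings compact in the parquet/arrow outputs.
for col in ("common_name", "scientific_name", "gap_species_code"):
    df[col] = df[col].astype("category")

df.to_parquet("data/species.parquet", index=False)

table = pa.Table.from_pandas(df, preserve_index=False)
feather.write_feather(table, "data/species.arrow")
```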
@dangotbanned in the diff, I noticed that the string "git+" seems to be appended below in the datapackage.md and datapackage.json files, in a place unrelated to the species dataset. Would you be able to have a quick look to see what's going on there?
Looks like it's pulling through the new url since https://github.com/vega/vega-datasets/pull/682/files#diff-7ae45ad102eab3b6d7e7896acd08c427a9b25b346470d7bc6507b6481575d519
I added it since it seems to be recommended by https://publint.dev/[email protected]
scripts/species.py
Outdated
# --- Basic Configuration Validation ---
if not isinstance(item_ids, list) or not item_ids:
    logger.error("`item_ids` in the TOML file must be a non-empty list.")
    msg = "`item_ids` must be a non-empty list."
    raise TypeError(msg)

if not isinstance(vector_fp, str | Path):
    logger.error("`vector_fp` must be a string or Path.")
    msg = "`vector_fp` must be a string or Path."
    raise TypeError(msg)
vector_fp = Path(vector_fp)

if not isinstance(output_dir, str | Path):
    logger.error("`output_dir` must be a string or Path.")
    msg = "`output_dir` must be a string or Path."
    raise TypeError(msg)
output_dir = Path(output_dir)

if output_format not in {"csv", "parquet", "arrow"}:
    logger.error("Invalid `output_format`: %s", output_format)
    msg = f"Invalid `output_format`: {output_format}"
    raise ValueError(msg)

if not isinstance(debug, bool):
    logger.error("`debug` must be a boolean.")
    msg = "`debug` must be a boolean."
    raise TypeError(msg)
I've not seen this pattern before of both logging and raising an error.
@dsmedia I think just raising the error would be enough?
simplified now - thank you
This commit introduces a new example to the Altair documentation showcasing choropleth maps faceted by category. The example visualizes the distribution of suitable habitat for different species across US counties, using the proposed new Species Habitat dataset from vega-datasets (vega/vega-datasets#684).
Key features of this example:
- Demonstrates the use of `alt.Chart.mark_geoshape()` for geographical visualizations.
- Shows how to create faceted maps for comparing categorical data across geographic regions.
- Utilizes the `transform_lookup` and `transform_calculate` transforms for data manipulation within Altair.
- Uses a CSV data file temporarily hosted in the vega-datasets repository branch (pending dataset merge).
This example addresses issue #1711, which requested a faceted map example for the Altair documentation.
Co-authored-by: Mattijn van Hoek <[email protected]>
Co-authored-by: Dan Redding <[email protected]>
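Roughly the shape such a faceted geoshape example could take, as a hedged sketch: the species URL is a placeholder for the branch-hosted CSV, the lookup direction follows the standard Vega-Lite "lookup the geometry" pattern, and the real docs example (which also uses `transform_calculate`) likely differs in detail.

```python
import altair as alt
from vega_datasets import data

counties = alt.topo_feature(data.us_10m.url, "counties")
species = "species.csv"  # placeholder URL for the proposed dataset

chart = (
    alt.Chart(species)
    .mark_geoshape()
    # Each (county, species) row looks up its county geometry from the topojson.
    .transform_lookup(
        lookup="county_id",
        from_=alt.LookupData(counties, key="id"),
        as_="geo",
    )
    .encode(
        shape=alt.Shape("geo:G"),
        color=alt.Color("habitat_yearround_pct:Q", title="% of county"),
        facet=alt.Facet("common_name:N", columns=2),
    )
    .project(type="albersUsa")
    .properties(width=300, height=200)
)
```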
- Remove redundant error logging before raising exceptions
Modify species dataset processing to explicitly set Continental US counties to 0% habitat when unassigned, while keeping Alaska and Hawaii counties as null values. This simplifies Altair visualization by removing the need for transform_filter and allows for better projection choices.
This commit enhances the GAP habitat analysis script to focus specifically on coterminous (contiguous) US counties by implementing a geographic filtering system based on FIPS codes.

## Changes:
- Added geographic filtering configuration in TOML to exclude Alaska, Hawaii, Puerto Rico, and Virgin Islands
- Refactored county data loading into smaller, specialized methods for better maintainability
- Modified habitat analysis to explicitly handle coterminous US counties, filling missing values with zeros
- Updated documentation to clarify the geographic scope of the analysis
- Improved error handling and logging, especially for county boundary processing
- Changed default output format from arrow to csv

## Technical Details:
- The `HabitatDataProcessor` class now accepts a full configuration object
- Added `_filter_to_coterminous_us` method to process excluded areas based on FIPS codes
- Enhanced `save_results` to maintain complete county coverage with consistent data formatting
- Configuration now includes a structured `[processing.geographic_filter]` section

This change ensures consistent habitat analysis across the coterminous United States by standardizing which areas are included in the analysis.
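The gist of the FIPS-based filter and the zero-filling described above, as a hedged sketch (the prefix list comes from the commit text; function and column names are illustrative, and the fill step is shown per species):

```python
import geopandas as gpd
import pandas as pd

# FIPS state prefixes excluded from the conterminous-US analysis
# (02 Alaska, 15 Hawaii, 72 Puerto Rico, 78 Virgin Islands).
EXCLUDED_PREFIXES = ("02", "15", "72", "78")


def filter_to_conterminous_us(counties: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Drop counties whose 5-digit FIPS code starts with an excluded state prefix."""
    fips = counties["id"].astype(str).str.zfill(5)
    return counties[~fips.str[:2].isin(EXCLUDED_PREFIXES)].copy()


def fill_missing_habitat(results: pd.DataFrame, conterminous_fips: pd.Series) -> pd.DataFrame:
    """Give every conterminous county a row for one species, with 0% where no habitat was found."""
    full = pd.DataFrame({"county_id": conterminous_fips}).merge(
        results, on="county_id", how="left"
    )
    full["habitat_yearround_pct"] = full["habitat_yearround_pct"].fillna(0.0)
    return full
```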
Use `uv run --group="geo-species" scripts/species.py`
pyproject.toml
Outdated
[dependency-groups]
dev = ["ipython[kernel]>=8.30.0", "ruff>=0.8.2", "taplo>=0.9.3"]
geo-species = [
  "exactextract",
  "geopandas",
  "pandas[pyarrow]",
  "rasterio",
  "requests",
  "sciencebasepy",
  "setuptools",
]
Tried running the script and was a bit alarmed by the install size.
Definitely a case where I wouldn't want to install that on each run - so following (1415aae) they're now locked.
Use the command below to run the script:
uv run --group="geo-species" scripts/species.py
for item_id in item_ids:
    try:
        item_json = self.sb.get_item(item_id)
        files_info = self.sb.get_item_file_info(item_json)

        for file_info in files_info:
            if "HabMap" in file_info["name"] and file_info["name"].endswith(
                ".zip"
            ):
                zip_path = temp_dir / file_info["name"]
                self.sb.download_file(file_info["url"], str(zip_path))
                downloaded_zips.append(zip_path)

    except Exception as e:  # Catch generic Exception
        if isinstance(e, requests.exceptions.RequestException):
            logger.error(
                "Error downloading files for item ID %s: %s", item_id, e
            )
        else:
            logger.error(
                "An unexpected error occurred for item ID %s: %s", item_id, e
            )
        continue  # Go to the next item_id

return sorted(downloaded_zips)  # ALWAYS return the list
Been stuck here for a while with no progress.
System usage seems way too high to be downloading
Logs
PS C:\Users\danie\Documents\GitHub\vega-datasets> uv run --group="geo-species" scripts/species.py
Installed 1 package in 24ms
2025-02-27 16:52:39 [INFO] Initializing GAP habitat analysis pipeline
2025-02-27 16:52:39 [INFO] Loading county boundaries from local file: C:\Users\danie\Documents\GitHub\vega-datasets\data\us-10m.json
2025-02-27 16:52:40 [INFO] Successfully loaded county data with these columns: ['id', 'geometry']
2025-02-27 16:52:40 [INFO] Filtering out 4 areas by FIPS code to focus on conterminous US
2025-02-27 16:52:40 [INFO] Excluding 27 counties with FIPS prefix 02 (Alaska)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 02013, 02016, 02020, 02050, 02060, 02068, 02070, 02090, 02100, 02110, 02122, 02130, 02150, 02164, 02170, 02180, 02185, 02188, 02201, 02220, 02232, 02240, 02261, 02270, 02280, 02282, 02290
2025-02-27 16:52:40 [INFO] Excluding 4 counties with FIPS prefix 15 (Hawaii)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 15001, 15003, 15007, 15009
2025-02-27 16:52:40 [INFO] Excluding 74 counties with FIPS prefix 72 (Puerto Rico)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 72001, 72003, 72005, 72007, 72009, 72011, 72013, 72017, 72019, 72021, 72023, 72025, 72027, 72029, 72031, 72035, 72037, 72039, 72041, 72043, 72045, 72047, 72049, 72051, 72053, 72054, 72055, 72057, 72059, 72061, 72063, 72065, 72069, 72071, 72073, 72075, 72077, 72079, 72081, 72083, 72085, 72087, 72089, 72091, 72093, 72095, 72097, 72099, 72101, 72103, 72105, 72107, 72109, 72111, 72113, 72115, 72119, 72121, 72123, 72125, 72127, 72129, 72131, 72133, 72135, 72137, 72139, 72141, 72143, 72145, 72147, 72149, 72151, 72153
2025-02-27 16:52:40 [INFO] Excluding 3 counties with FIPS prefix 78 (Virgin Islands)
2025-02-27 16:52:40 [INFO] Excluded county FIPS codes: 78010, 78020, 78030
2025-02-27 16:52:40 [INFO] Analyzing 3105 counties in conterminous US
2025-02-27 16:52:40 [WARNING] Removed 4 counties with invalid boundaries
2025-02-27 16:52:40 [INFO] Step 1/3: Downloading and extracting habitat data files
2025-02-27 16:52:40 [INFO] Retrieving species information from ScienceBase
2025-02-27 16:52:41 [INFO] Downloading habitat data files
downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to C:\Users\danie\AppData\Local\Temp\tmpnfbi0fvb\bAMROx_CONUS_HabMap_2001v1.zip
Feeling like this might point towards the issue https://docs.astral.sh/ruff/rules/blind-except/
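For what it's worth, the blind-except refactor could look something like this hedged reworking of the snippet quoted above (same sciencebasepy calls, but only network failures are caught; anything else now propagates instead of being swallowed):

```python
for item_id in item_ids:
    try:
        item_json = self.sb.get_item(item_id)
        files_info = self.sb.get_item_file_info(item_json)
        for file_info in files_info:
            name = file_info["name"]
            if "HabMap" in name and name.endswith(".zip"):
                zip_path = temp_dir / name
                self.sb.download_file(file_info["url"], str(zip_path))
                downloaded_zips.append(zip_path)
    except requests.exceptions.RequestException:
        # Only network-level failures are expected here; log with traceback and move on.
        logger.exception("Error downloading files for item ID %s", item_id)
        continue
    # KeyError, OSError, etc. are no longer silently absorbed.

return sorted(downloaded_zips)
```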
Traceback on CTRL+C
Traceback (most recent call last):
File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 791, in <module>
main()
File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 787, in main
processor.run()
File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 702, in run
tif_files, species_info = self.process_habitat_data(temp_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 470, in process_habitat_data
zip_files = self.sciencebase_client.download_zip_files(self.item_ids, temp_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\danie\Documents\GitHub\vega-datasets\scripts\species.py", line 119, in download_zip_files
self.sb.download_file(file_info["url"], str(zip_path))
File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\sciencebasepy\SbSession.py", line 919, in download_file
for chunk in response.iter_content(chunk_size=1024):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\requests\models.py", line 820, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\urllib3\response.py", line 967, in stream
data = self.read(amt=amt, decode_content=decode_content)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\urllib3\response.py", line 879, in read
data = self._raw_read(amt)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\danie\Documents\GitHub\vega-datasets\.venv\Lib\site-packages\urllib3\response.py", line 792, in _raw_read
with self._error_catcher():
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\danie\AppData\Roaming\uv\python\cpython-3.12.8-windows-x86_64-none\Lib\contextlib.py", line 141, in __exit__
def __exit__(self, typ, value, traceback):
KeyboardInterrupt
@dsmedia I've added the rule, but not fixed it (https://github.com/vega/vega-datasets/actions/runs/13571913413/job/37938835261?pr=684)
Was hoping to review in more detail - but I can't get beyond this step sadly 😞
For me it works fine, downloading data in ~2 minutes and analysis in ~11 minutes:
uv run --group="geo-species" scripts/species.py
Using CPython 3.12.9 interpreter at: /Users/mattijnvanhoek/miniconda3/envs/stable/bin/python
Creating virtual environment at: .venv
Built stringcase==1.2.0
Built progress==1.6
Built petl==1.7.15
Installed 104 packages in 352ms
2025-02-27 21:32:06 [INFO] Initializing GAP habitat analysis pipeline
2025-02-27 21:32:06 [INFO] Loading county boundaries from local file: /Users/mattijnvanhoek/vega/vega-datasets/data/us-10m.json
2025-02-27 21:32:17 [INFO] Successfully loaded county data with these columns: ['id', 'geometry']
2025-02-27 21:32:17 [INFO] Filtering out 4 areas by FIPS code to focus on conterminous US
2025-02-27 21:32:17 [INFO] Excluding 27 counties with FIPS prefix 02 (Alaska)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 02013, 02016, 02020, 02050, 02060, 02068, 02070, 02090, 02100, 02110, 02122, 02130, 02150, 02164, 02170, 02180, 02185, 02188, 02201, 02220, 02232, 02240, 02261, 02270, 02280, 02282, 02290
2025-02-27 21:32:17 [INFO] Excluding 4 counties with FIPS prefix 15 (Hawaii)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 15001, 15003, 15007, 15009
2025-02-27 21:32:17 [INFO] Excluding 74 counties with FIPS prefix 72 (Puerto Rico)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 72001, 72003, 72005, 72007, 72009, 72011, 72013, 72017, 72019, 72021, 72023, 72025, 72027, 72029, 72031, 72035, 72037, 72039, 72041, 72043, 72045, 72047, 72049, 72051, 72053, 72054, 72055, 72057, 72059, 72061, 72063, 72065, 72069, 72071, 72073, 72075, 72077, 72079, 72081, 72083, 72085, 72087, 72089, 72091, 72093, 72095, 72097, 72099, 72101, 72103, 72105, 72107, 72109, 72111, 72113, 72115, 72119, 72121, 72123, 72125, 72127, 72129, 72131, 72133, 72135, 72137, 72139, 72141, 72143, 72145, 72147, 72149, 72151, 72153
2025-02-27 21:32:17 [INFO] Excluding 3 counties with FIPS prefix 78 (Virgin Islands)
2025-02-27 21:32:17 [INFO] Excluded county FIPS codes: 78010, 78020, 78030
2025-02-27 21:32:17 [INFO] Analyzing 3105 counties in conterminous US
2025-02-27 21:32:40 [WARNING] Removed 4 counties with invalid boundaries
2025-02-27 21:32:40 [INFO] Step 1/3: Downloading and extracting habitat data files
2025-02-27 21:32:40 [INFO] Retrieving species information from ScienceBase
2025-02-27 21:32:42 [INFO] Downloading habitat data files
downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/bAMROx_CONUS_HabMap_2001v1.zip
downloading https://www.sciencebase.gov/catalog/file/get/58fa817ae4b0b7ea54525c2f?f=__disk__57%2F6f%2F3b%2F576f3bb4fe3167204a20961184da3f2aa9440b52 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/mWTDEx_CONUS_HabMap_2001v1.zip
downloading https://www.sciencebase.gov/catalog/file/get/58fa3f0be4b0b7ea54524859?f=__disk__81%2F26%2F6f%2F81266f6d642fe5a03255f19055afd9d4d169a06d to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/aAMBUx_CONUS_HabMap_2001v1.zip
downloading https://www.sciencebase.gov/catalog/file/get/58fe0f4fe4b0074928294636?f=__disk__8d%2F6b%2F91%2F8d6b91c206bafddac994009aa7a2ce5ef2d44066 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/rCOGAx_CONUS_HabMap_2001v1.zip
2025-02-27 21:34:33 [INFO] Extracting raster files
2025-02-27 21:34:34 [INFO] Step 2/3: Beginning habitat analysis for 4 species
/Users/mattijnvanhoek/vega/vega-datasets/.venv/lib/python3.12/site-packages/exactextract/exact_extract.py:330: RuntimeWarning: Spatial reference system of input features does not exactly match raster.
warnings.warn(
Processing 45019: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0/100 [10:56<00:00, 6.57s/it]
2025-02-27 21:45:37 [INFO] Step 3/3: Saving analysis results
2025-02-27 21:45:37 [INFO] Analysis complete. Results saved to: /Users/mattijnvanhoek/vega/vega-datasets/data
And with vega/altair@f1a1b5a I was able to reduce the chart specifications by around 10 lines for the corresponding charts.
For me it works fine, downloading data in ~2 minutes and analysis in ~11 minutes:
Hmm 🤔
@mattijn maybe windows is the issue?
Yours
downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to /var/folders/b3/pctp5dwn6t7fm1t4sc4tknb00000gp/T/tmpdfo6ou2c/bAMROx_CONUS_HabMap_2001v1.zip
Mine
downloading https://www.sciencebase.gov/catalog/file/get/58fa449fe4b0b7ea54524c5e?f=__disk__dc%2F2f%2Fa7%2Fdc2fa725a0d36b4f9ceffed282019cf27b9a37f3 to C:\Users\danie\AppData\Local\Temp\tmp2jvviidr\bAMROx_CONUS_HabMap_2001v1.zip
It was working for me back in (9a778c6)
Changed some things around and it seems to have created the file - but it's not filling it
vega-datasets/scripts/species.py
Line 119 in 8551a8d
self.sb.download_file(file_info["url"], str(zip_path))
It was working for me back in (9a778c6)
The same thing happened to me earlier today, and when checking I could see that the issue was that the download speed had been throttled significantly. If we can add some kind of counter it should at least demonstrate the progress.
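One hedged way to get that kind of counter without changing sciencebasepy (whose `download_file`, as far as I know, exposes no progress hook): stream the file with `requests` and report chunks via `tqdm`, so a throttled or stalled download is visible immediately.

```python
from pathlib import Path

import requests
from tqdm import tqdm


def download_with_progress(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `dest`, showing bytes transferred so stalls are visible."""
    with requests.get(url, stream=True, timeout=(10, 60)) as resp:
        resp.raise_for_status()
        total = int(resp.headers.get("Content-Length", 0)) or None
        with dest.open("wb") as f, tqdm(total=total, unit="B", unit_scale=True,
                                        desc=dest.name) as bar:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                bar.update(len(chunk))
    return dest
```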
Issue with exactextract installation
I encountered an installation error when trying to run the script with:
This has been fixed in a PR that was already merged (isciences/exactextract#155), but the fix is only available in the development version.
will close #683
This PR adds a new dataset to `vega-datasets` containing county-level species habitat distribution data for the United States. The dataset is designed to support categorical faceted map examples in Vega visualization libraries, addressing needs discussed in vega/altair#1711 and vega/vega-datasets#683.

Implementation Status
To Do
- `pct` to three decimal places
- `pct` column

Dataset Details
Current Implementation (`species.csv`)
Structure
- `data/` directory

Fields
item_id
common_name
scientific_name
gap_species_code
county_id
habitat_yearround_pct
Data Generation
The dataset is generated using `scripts/species.py`, which implements the approach suggested by @mattijn in this comment. The script uses `exactextract` for the zonal statistics and writes the output to `data/species.csv`.
Known Issues
Spatial reference system of input features does not exactly match raster
Future work
Consider expanding the dataset (in a backward compatible way) to include summer and winter habitat data, and summer/winter/all_season data on range.
Validation of Habitat Percentage Calculation
This image compares the USGS potential habitat map (within this zip file) for bullfrogs (zoomed into the southeastern United States, for clarity) with the output of our code, demonstrating the correct implementation of zonal statistics.
Left: USGS potential habitat map for bullfrogs. This is a raster dataset showing predicted suitable habitat (purple areas) at a 30-meter resolution. It is not aggregated by any administrative boundaries.
Right: Generated choropleth map showing the percentage of suitable habitat within each county. Lighter colors indicate a higher percentage of the county is classified as suitable habitat, while darker colors indicate a lower percentage.
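As an independent sanity check for a single county (not the exactextract pipeline itself), a hedged sketch that masks the habitat raster with one county polygon and counts habitat pixels. It ignores partial cell coverage, so small differences from exactextract are expected; paths, the CRS assumption, and the "nonzero = habitat" assumption are all illustrative.

```python
import geopandas as gpd
import rasterio
from rasterio.mask import mask

# Illustrative inputs: county polygons and one extracted HabMap TIF.
counties = gpd.read_file("data/us-10m.json")
if counties.crs is None:
    counties = counties.set_crs("EPSG:4326")  # assumption: lon/lat if undeclared

with rasterio.open("aAMBUx_CONUS_HabMap_2001v1.tif") as src:
    counties = counties.to_crs(src.crs)
    geom = [counties.iloc[0].geometry]          # a single county as the zone
    img, _ = mask(src, geom, crop=True, filled=False)

inside = ~img.mask[0]                           # cells whose centers fall in the county
habitat = (img.data[0] > 0) & inside            # assumption: nonzero pixels = mapped habitat
pct_habitat = 100 * habitat.sum() / inside.sum()
print(f"~{pct_habitat:.1f}% of this county is predicted habitat")
```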
The code successfully overlays the USGS raster data with county boundaries (from `us-10m.json`, a vector dataset) and uses `exactextract` to calculate the percentage of each county that falls within the USGS-defined habitat. The resulting map visually confirms that the habitat percentages are being calculated correctly, with higher percentages in counties that overlap significantly with the USGS's predicted habitat area.

Species Habitat Visualization Code