Updates to usgs.py routine to use new USGS API #191

Open
kammereraj wants to merge 6 commits into master from USGS_api_updates

@kammereraj kammereraj commented Feb 3, 2026

Summary

This PR migrates the USGS data retrieval module from the legacy NWIS API (dataretrieval.nwis) to the modernized Water Data API (dataretrieval.waterdata). The new API provides continued access to USGS hydrologic data as the legacy services are being phased out.

Breaking Changes

  • Minimum dataretrieval version: Now requires >=1.1.2 (was >=1)
  • Removed parameter: disable_progress_bar removed from get_usgs_data() (was unused)
  • Station metadata columns: begin_date and end_date columns are no longer available in station metadata (not provided by new API)
  • Rate limits changed: Now 50 requests/hour without API key, 1000/hour with API key (was 5/second)

New Features

API Key Support

The modernized Water Data API supports authentication via API key for higher rate limits.

Obtaining an API Key

  1. Register at: https://api.waterdata.usgs.gov/signup
  2. You'll receive a Personal Access Token (PAT)

Configuring the API Key

Option 1: Environment Variable (Recommended)

Set the API_USGS_PAT environment variable:

# Linux/macOS - add to ~/.bashrc or ~/.zshrc
export API_USGS_PAT="your-api-key-here"

# Windows Command Prompt
set API_USGS_PAT=your-api-key-here

# Windows PowerShell
$env:API_USGS_PAT="your-api-key-here"

Option 2: Pass Directly to Functions

from searvey import usgs

# Single station data
df = usgs.get_usgs_station_data(
    usgs_code="01646500",
    api_key="your-api-key-here"
)

# Multiple stations
ds = usgs.get_usgs_data(
    usgs_metadata=stations,
    api_key="your-api-key-here"
)

Rate Limits

| Configuration | Rate Limit |
| --- | --- |
| No API key | 50 requests/hour |
| With API key | 1,000 requests/hour |

A warning is logged once per session if no API key is detected.

Changes

API Migration

| Component | Old (NWIS) | New (Water Data API) |
| --- | --- | --- |
| Module | dataretrieval.nwis | dataretrieval.waterdata |
| Station metadata | nwis.get_info() | waterdata.get_monitoring_locations() |
| Instantaneous data | nwis.get_iv() | waterdata.get_continuous() |
| Site ID format | "01646500" | "USGS-01646500" |
| Time parameter | start, end | time="YYYY-MM-DD/YYYY-MM-DD" |

Code Improvements

  1. Centralized API key handling: New get_usgs_api_key() and _set_api_key_env() functions
  2. Reduced code duplication: Shared _normalize_station_data() function with extracted helpers
  3. Better error handling: Consistent try/except blocks across all API calls
  4. KeyError protection: _get_dataset_from_station_data() now validates site_nos exist in metadata
  5. Warning deduplication: API key warning only logged once per session
  6. Removed unused code: Cleaned up unused imports and variables
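
As an illustration of the KeyError protection in item 4, one way to check requested site numbers against the metadata before building a dataset is sketched below. The function name `split_known_sites` is hypothetical; the actual check inside `_get_dataset_from_station_data()` may look different.

```python
from typing import Iterable, List, Set, Tuple


def split_known_sites(
    requested: Iterable[str], known: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition requested site numbers into those present in the metadata and those missing."""
    found: List[str] = []
    missing: List[str] = []
    for site_no in requested:
        # Membership is tested up front so no later lookup can raise KeyError.
        (found if site_no in known else missing).append(site_no)
    return found, missing
```

A caller could then log the `missing` list and continue with `found` only, instead of failing on the first unknown site.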

New Helper Functions

# ID conversion utilities
usgs.site_no_to_monitoring_location_id("01646500")  # Returns "USGS-01646500"
usgs.monitoring_location_id_to_site_no("USGS-01646500")  # Returns "01646500"

# Time range formatting
usgs.format_time_range(start_date, end_date)  # Returns "YYYY-MM-DD/YYYY-MM-DD"

# API key management
usgs.get_usgs_api_key(api_key=None)  # Returns key from param or env var
usgs.get_usgs_rate_limit(api_key=None)  # Returns configured RateLimit object

# Parameter availability (NEW)
usgs.get_station_parameter_availability(site_nos=["01646500", "01647000"])
# Returns DataFrame with: site_no, has_water_level, has_temperature, has_salinity, has_currents
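
For readers unfamiliar with the new identifier and time formats, the conversions above amount to simple string handling. The standalone sketch below uses hypothetical function names (not the `usgs` helpers themselves) and assumes `datetime.date` inputs.

```python
from datetime import date

PREFIX = "USGS-"  # agency prefix shown in the API migration table


def to_monitoring_location_id(site_no: str) -> str:
    """Add the prefix unless it is already present."""
    return site_no if site_no.startswith(PREFIX) else f"{PREFIX}{site_no}"


def to_site_no(monitoring_location_id: str) -> str:
    """Strip the prefix if present; bare site numbers pass through unchanged."""
    if monitoring_location_id.startswith(PREFIX):
        return monitoring_location_id[len(PREFIX):]
    return monitoring_location_id


def time_range(start: date, end: date) -> str:
    """Format an interval as 'YYYY-MM-DD/YYYY-MM-DD'."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start.isoformat()}/{end.isoformat()}"
```

Site numbers stay strings throughout so that leading zeros such as the one in "01646500" are preserved.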

Parameter Availability Tracking (NEW)

A new feature allows querying which variables are available at each station before attempting data retrieval. Callers can then skip API calls for variables a station does not report, which matters under the new hourly rate limits.

New Function: get_station_parameter_availability()

from searvey import usgs

# Query parameter availability for specific stations
availability = usgs.get_station_parameter_availability(
    site_nos=["01646500", "01647000"]
)
# Returns DataFrame:
#    site_no  has_water_level  has_temperature  has_salinity  has_currents
# 0  01646500            True             True          True         False
# 1  01647000           False            False         False         False

Enhanced get_usgs_stations() with Parameter Availability

The get_usgs_stations() function now accepts an include_parameter_availability parameter:

# Get stations with parameter availability flags
stations = usgs.get_usgs_stations(
    lon_min=-77.2,
    lon_max=-77.0,
    lat_min=38.8,
    lat_max=39.0,
    include_parameter_availability=True,  # NEW parameter
)
# Result includes columns: has_water_level, has_temperature, has_salinity, has_currents

Parameter Code Groups

New constants define which USGS parameter codes map to each variable type:

USGS_WATER_LEVEL_CODES = {"00065", "62614", "62615", "62620", "63158", "63160", ...}
USGS_TEMPERATURE_CODES = {"00010", "00011", "99976", "99980", "99984"}
USGS_SALINITY_CODES = {"00095", "00480", "00096", "70305", "72401", "90860", "90862"}
USGS_CURRENT_CODES = {"00055", "72168", "72254", "72255", "72294", "72321", ...}
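
To show how such code groups can be turned into the `has_*` flags, here is a minimal sketch based on set intersection. It is not the PR's implementation: `availability_flags` is a hypothetical helper, and the water-level and current sets contain only the codes listed above (the description elides the rest).

```python
from typing import Dict, Iterable

# Codes copied from the PR description. The water-level and current sets are
# truncated there ("..."), so they are deliberately incomplete here.
CODE_GROUPS = {
    "has_water_level": {"00065", "62614", "62615", "62620", "63158", "63160"},
    "has_temperature": {"00010", "00011", "99976", "99980", "99984"},
    "has_salinity": {"00095", "00480", "00096", "70305", "72401", "90860", "90862"},
    "has_currents": {"00055", "72168", "72254", "72255", "72294", "72321"},
}


def availability_flags(station_codes: Iterable[str]) -> Dict[str, bool]:
    """Flag a variable as available if the station reports any code in its group."""
    codes = set(station_codes)
    return {flag: not codes.isdisjoint(group) for flag, group in CODE_GROUPS.items()}
```

For example, a station reporting only "00065" and "00010" would be flagged for water level and temperature but not for salinity or currents.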

Efficiency Impact

By checking parameter availability before data retrieval, downstream applications can avoid unnecessary API calls:

| Scenario | Without Availability Check | With Availability Check |
| --- | --- | --- |
| 100 stations, 4 variables | 400 API calls | ~150 API calls (varies) |
| Failed requests | Many (data not available) | Near zero |
| Estimated reduction | - | 50-70% fewer API calls |
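
To put those call counts in terms of the new rate limits, the arithmetic below converts them into hours of quota. The 400 and ~150 call figures and the 50 and 1,000 requests/hour limits come from this description; the function names are illustrative.

```python
# Requests per hour, as stated in the Rate Limits section.
RATE_LIMITS = {"no API key": 50, "with API key": 1000}
# API calls for 100 stations and 4 variables, from the efficiency table.
SCENARIOS = {"without availability check": 400, "with availability check": 150}


def quota_hours(n_calls: int, calls_per_hour: int) -> float:
    """Hours of quota consumed when every call counts against the hourly limit."""
    return n_calls / calls_per_hour


def reduction(before: int, after: int) -> float:
    """Fraction of calls saved."""
    return 1 - after / before


for scenario, n_calls in SCENARIOS.items():
    for config, limit in RATE_LIMITS.items():
        print(f"{scenario}, {config}: {quota_hours(n_calls, limit):.2f} h")
```

Without a key, 400 calls use 8 hours of quota versus 3 hours for 150 calls; the example figures correspond to a 62.5% reduction, inside the 50-70% range quoted in the table.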

Parameter Code Configuration

Parameter codes are now defined in a static dictionary for better maintainability:

USGS_PARAMETER_CODES = {
    "00060": {"name": "Discharge, cubic feet per second", "unit": "ft3/s"},
    "00065": {"name": "Gage height, feet", "unit": "ft"},
    "62614": {"name": "Lake or reservoir water surface elevation above NGVD 1929, feet", "unit": "ft"},
    "62615": {"name": "Lake or reservoir water surface elevation above NAVD 1988, feet", "unit": "ft"},
    "62620": {"name": "Estuary or ocean water surface elevation above NAVD 1988, feet", "unit": "ft"},
    "63158": {"name": "Stream water level elevation above NGVD 1929, in feet", "unit": "ft"},
    "63160": {"name": "Stream water level elevation above NAVD 1988, in feet", "unit": "ft"},
}
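
A lookup against a dictionary of this shape might be wrapped as below. This is only a sketch: `describe_parameter` and its fallback for unknown codes are assumptions, and only two of the entries are reproduced.

```python
from typing import Dict

# Two entries copied from the dictionary above; the rest are omitted for brevity.
PARAMETER_CODES: Dict[str, Dict[str, str]] = {
    "00060": {"name": "Discharge, cubic feet per second", "unit": "ft3/s"},
    "00065": {"name": "Gage height, feet", "unit": "ft"},
}


def describe_parameter(code: str) -> Dict[str, str]:
    """Return name and unit for a code, with a placeholder for unknown codes."""
    # Assumption: unknown codes yield a placeholder instead of raising KeyError.
    return PARAMETER_CODES.get(code, {"name": f"Unknown parameter {code}", "unit": ""})
```

Because the table is static, a lookup like this needs no network request.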

Test Updates

New Test Classes

  • TestAPIKeyManagement: Tests for API key retrieval from parameters and environment
  • TestIDConversion: Tests for site_no ↔ monitoring_location_id conversion
  • TestRateLimitConfiguration: Tests for rate limit configuration with/without API key
  • TestParameterInfo: Tests for parameter code lookup

Updated Tests

  • test_get_usgs_station_data: Updated assertions for instantaneous data (15-min intervals)
  • test_get_usgs_data: Added structure verification for xarray Dataset
  • test_get_usgs_station_data_by_string_enddate: Added assertions for multiple readings
  • test_normalize_empty_data_df: Updated to use new _normalize_station_data signature
  • test_request_nonexistant_data: Updated to use minimal DataFrame fixture

Removed Test Assertions

  • Removed begin_date/end_date dtype checks (columns no longer available)
  • Removed parm_cd from test fixtures (not in new API response)

Usage Examples

Basic Usage

from searvey import usgs

# Get stations in a region
stations = usgs.get_usgs_stations(
    lon_min=-77, lon_max=-75,
    lat_min=38, lat_max=40
)

# Get instantaneous data for a single station (last 7 days)
df = usgs.get_usgs_station_data(
    usgs_code="01646500",
    period=7
)

# Get data for multiple stations as xarray Dataset
ds = usgs.get_usgs_data(
    usgs_metadata=stations,
    period=2
)

With API Key

import os

from searvey import usgs

os.environ["API_USGS_PAT"] = "your-api-key"

# Now all calls use the higher rate limit automatically
stations = usgs.get_usgs_stations()
df = usgs.get_usgs_station_data("01646500")

With Parameter Availability (Efficient Data Retrieval)

from searvey import usgs

# Get stations with parameter availability flags
stations = usgs.get_usgs_stations(
    lon_min=-77.2, lon_max=-76.5,
    lat_min=38.5, lat_max=39.5,
    include_parameter_availability=True,
)

# Filter to only stations with water level data
wl_stations = stations[stations["has_water_level"] == True]
print(f"Stations with water level: {len(wl_stations)} of {len(stations)}")

# Only retrieve data for stations that have the variable
# This avoids unnecessary API calls for stations without data
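station_frames = {}  # keep each station's result instead of overwriting df
for _, station in wl_stations.iterrows():
    station_frames[station["site_no"]] = usgs.get_usgs_station_data(
        usgs_code=station["site_no"],
        parameter_code="00065",  # Gage height
    )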

Dependencies

# pyproject.toml
dataretrieval = ">=1.1.2"  # Required for waterdata.get_continuous()

Checklist

  • Updated dataretrieval version requirement to >=1.1.2
  • Migrated from nwis to waterdata module
  • Added API key support with environment variable configuration
  • Updated rate limiting for new API (50/hour without key, 1000/hour with key)
  • Added ID conversion utilities for new USGS- prefix format
  • Refactored normalization code to reduce complexity
  • Added comprehensive error handling
  • Updated tests for new API response format
  • Added new test classes for API key and ID conversion
  • Fixed all ruff linting issues
  • Fixed all black formatting issues
  • Updated poetry.lock file
  • Added parameter availability tracking via get_station_parameter_availability()
  • Added include_parameter_availability option to get_usgs_stations()
  • Added parameter code groups for water_level, temperature, salinity, and currents

@kammereraj kammereraj self-assigned this Feb 9, 2026
@kammereraj kammereraj added the enhancement and usgs labels Feb 9, 2026
@SorooshMani-NOAA SorooshMani-NOAA self-assigned this Feb 9, 2026

@SorooshMani-NOAA (Contributor) commented

Thank you @kammereraj, I'll take a look at this; I hope you don't mind some delay. At first glance there are some obvious things to fix, such as black styling updates, and we also need to update the tests so that they support getting no-geometry dfs, etc. I'll take a deeper dive later to see if I can address any specific ones.

@SorooshMani-NOAA (Contributor) commented

@kammereraj I added some minor fixes for the test to work, but I cannot get the actual USGS results. I have set up my API key locally, but I get empty results for stations when I try. Can you confirm if it works on your side and you can get the full station list using:

usgs.get_usgs_stations()
