CA-BioSiting is the backend data platform for Cal BioScape, a web-based tool supporting the development of a circular bioeconomy in California's Northern San Joaquin Valley (San Joaquin, Stanislaus, and Merced counties). The platform is developed at the University of Washington Scientific Software Engineering Center (SSEC) as part of the BioCircular Valley initiative -- a multi-institutional collaboration involving Lawrence Berkeley National Laboratory, UC Berkeley, UC Merced, UC Agriculture and Natural Resources, USDA Albany Agricultural Research Station, the Almond Board of California, and BEAM Circular. The initiative is funded through Schmidt Sciences' Virtual Institute on Feedstocks of the Future, with support from the Foundation for Food & Agriculture Research.
Cal BioScape aims to transform the region's abundant but often underutilized agricultural waste streams -- crop residues, almond shells, fruit peels, orchard trimmings -- into valuable bioproducts, sustainable biofuels, and advanced materials. The platform serves feedstock suppliers, biomanufacturing companies, policymakers, and researchers by providing interactive mapping, comprehensive data integration, spatial analysis, and programmatic API access.
Identifying optimal locations for bioconversion facilities requires integrating heterogeneous datasets across multiple spatial and analytical domains. Biomass composition data (proximate, ultimate, compositional, ICP, XRF, XRD, and calorimetry analyses), field sampling records, parcel-level crop mapping (LandIQ), federal agricultural production statistics (USDA Census and Survey), and DOE Billion Ton Study projections all originate from different sources with varying schemas and spatial resolutions. Meaningful siting analysis depends on spatially joining these datasets to enable high-resolution visualization of agricultural feedstocks, spatial buffer and summary queries for radius-based resource aggregation, and temporal filtering across data sources.
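The radius-based resource aggregation described above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the parcel records, field names (`crop`, `lon`, `lat`, `tons`), and tonnage values are hypothetical, and a great-circle (haversine) distance on parcel centroids stands in for a true spatial buffer.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def aggregate_within_radius(parcels, site_lon, site_lat, radius_km):
    """Sum tonnage per crop for parcels whose centroid lies within radius_km."""
    totals = {}
    for p in parcels:
        if haversine_km(site_lon, site_lat, p["lon"], p["lat"]) <= radius_km:
            totals[p["crop"]] = totals.get(p["crop"], 0.0) + p["tons"]
    return totals


# Hypothetical parcel centroids (illustrative values, not project data)
PARCELS = [
    {"crop": "almond", "lon": -120.99, "lat": 37.64, "tons": 120.0},
    {"crop": "almond", "lon": -120.85, "lat": 37.49, "tons": 80.0},
    {"crop": "tomato", "lon": -121.29, "lat": 37.96, "tons": 200.0},
    {"crop": "walnut", "lon": -120.48, "lat": 37.30, "tons": 50.0},
]

# Aggregate feedstock within 30 km of a candidate site
print(aggregate_within_radius(PARCELS, -120.99, 37.64, 30.0))
```

In a PostGIS database, a comparable filter would typically be written with `ST_DWithin` on geometry or geography columns; how CA-BioSiting expresses these queries is not shown here.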
CA-BioSiting provides the data infrastructure to automate ingestion of these datasets, normalize them into a common relational schema with geospatial support (PostgreSQL + PostGIS), and expose them through a REST API for downstream analysis, visualization, and data export.
- Automated ETL pipelines for ingesting biomass characterization data, LandIQ crop mapping, USDA agricultural statistics, and DOE Billion Ton Study records
- Spatially enabled relational database (PostgreSQL + PostGIS) with 15 domain model groups covering field sampling, analytical records, fermentation/pretreatment experiments, and geographic information
- Materialized views that pre-compute spatial joins across datasets (e.g., LandIQ records with crop mapping, USDA records with commodity lookups, Billion Ton records with spatial tile aggregation)
- REST API (FastAPI) for programmatic access to all ingested and derived data with interactive OpenAPI documentation
- Cloud-native deployment on Google Cloud Run with Cloud SQL (PostgreSQL), infrastructure managed as code via Pulumi, and automated CI/CD through GitHub Actions
- Reproducible environments using Pixi for local development and Docker for containerized production deployment
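Because the REST API publishes OpenAPI documentation, one way to discover what it exposes is to read the schema programmatically. The sketch below assumes the service keeps FastAPI's default schema location (`/openapi.json`); the helper names `fetch_openapi` and `list_endpoints` are hypothetical and not part of the project.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_openapi(base_url, timeout=10):
    """Download the OpenAPI schema (FastAPI's default path is /openapi.json)."""
    with urlopen(f"{base_url.rstrip('/')}/openapi.json", timeout=timeout) as resp:
        return json.load(resp)


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs declared in an OpenAPI schema dict."""
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-operation keys such as "parameters"
            if method in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)
```

With the web service running locally, `list_endpoints(fetch_openapi("http://localhost:8000"))` would return the available routes without hard-coding any endpoint names.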
The database schema covers the following research domains:
| Domain | Description |
|---|---|
| Aim 1 Records | Proximate, ultimate, compositional, ICP, XRF, XRD, and calorimetry analyses |
| Aim 2 Records | Fermentation and pretreatment experiment results |
| Field Sampling | Sample collection metadata and location information |
| Sample Preparation | Prepared sample tracking and provenance |
| External Data | LandIQ crop mapping, USDA Census/Survey records, Billion Ton 2023 projections |
| Resource Information | Biomass resource types and characteristics |
| Places & Infrastructure | Geographic locations, addresses, and facility information |
| People & Organizations | Contacts and institutional affiliations |
CA-BioSiting is organized as a PEP 420 namespace package with three independently installable components:
- `ca_biositing.datamodels` -- SQLModel database models, Alembic migrations, and materialized view definitions
- `ca_biositing.pipeline` -- Prefect-orchestrated ETL workflows for data extraction, transformation, and loading
- `ca_biositing.webservice` -- FastAPI REST API for data access
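For readers unfamiliar with PEP 420, the following sketch shows how separately installed directories can contribute subpackages to a single namespace. The names `demo_ns`, `models`, and `api` are hypothetical stand-ins, and the temporary directories only mimic two installed distributions; the project's actual layout may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build two "distributions" that each ship part of the same namespace.
root = Path(tempfile.mkdtemp())
for dist, subpkg, value in [("dist_a", "models", "A"), ("dist_b", "api", "B")]:
    sub_dir = root / dist / "demo_ns" / subpkg
    sub_dir.mkdir(parents=True)
    # Only the subpackage gets an __init__.py; demo_ns itself has none,
    # which is what makes it a PEP 420 namespace package.
    (sub_dir / "__init__.py").write_text(f"NAME = {value!r}\n")
    sys.path.insert(0, str(root / dist))  # mimics installing the distribution

models = importlib.import_module("demo_ns.models")
api = importlib.import_module("demo_ns.api")
demo_ns = importlib.import_module("demo_ns")

print(models.NAME, api.NAME)
# Both directories contribute to the one namespace package
print(len(list(demo_ns.__path__)))
```

The practical consequence is that each component can be installed on its own while still being imported under the shared top-level name.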
The ETL pipeline extracts data from Google Sheets, transforms and validates records with pandas, and loads them into PostgreSQL via SQLAlchemy. Seven materialized views provide pre-computed spatial joins for common query patterns.
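The extract, transform, and load stages can be sketched in miniature as follows. This is a dependency-free stand-in rather than the project's pipeline: the `csv` module replaces a Google Sheets export and pandas, `sqlite3` replaces PostgreSQL and SQLAlchemy, and the column names and values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sheet export; columns and values are illustrative only.
RAW_CSV = """sample_id,crop,moisture_pct
S-001,Almond Shell,8.5
S-002,almond shell ,9.1
S-003,Tomato Pomace,not measured
"""


def extract(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize crop names; reject rows without a numeric moisture value."""
    clean, rejected = [], []
    for row in rows:
        try:
            moisture = float(row["moisture_pct"])
        except ValueError:
            rejected.append(row["sample_id"])
            continue
        clean.append((row["sample_id"], row["crop"].strip().lower(), moisture))
    return clean, rejected


def load(conn, records):
    """Insert validated records and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(sample_id TEXT PRIMARY KEY, crop TEXT, moisture_pct REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO samples VALUES (?, ?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]


conn = sqlite3.connect(":memory:")
records, rejected = transform(extract(RAW_CSV))
loaded = load(conn, records)
print(loaded, rejected)
```

Keeping the three stages as separate functions makes each one testable in isolation, and returning the rejected identifiers keeps validation failures visible instead of silently dropping rows.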
| Workflow | Status |
|---|---|
| CI | |
| CD | |
| Migrations | |
| Build and Push Docker Images | |
| Deploy Staging | |
| Deploy Production | |
| Trigger Staging ETL | |
| Deploy Resource Info to GitHub Pages | |
| Image | Description |
|---|---|
| `ghcr.io/sustainability-software-lab/ca-biositing/pipeline` | ETL pipeline (Prefect flows and worker) |
| `ghcr.io/sustainability-software-lab/ca-biositing/webservice` | FastAPI REST API |
| Component | Technology |
|---|---|
| Language | Python 3 |
| Database | PostgreSQL 15 + PostGIS |
| ORM / Models | SQLModel (SQLAlchemy + Pydantic) |
| Migrations | Alembic |
| Workflow Orchestration | Prefect |
| Web API | FastAPI |
| Geospatial Analysis | QGIS, GeoAlchemy2, Shapely |
| Package Management | Pixi (conda-forge + PyPI) |
| Containerization | Docker, Docker Compose |
| Cloud Deployment | Google Cloud Run, Pulumi |
| CI/CD | GitHub Actions |
```bash
git clone https://github.com/sustainability-software-lab/ca-biositing.git
cd ca-biositing
pixi install
pixi run pre-commit-install
```

```bash
# Create environment file from template
cp resources/docker/.env.example resources/docker/.env
# Start all services (PostgreSQL, Prefect server, worker)
pixi run start-services
# Deploy and run ETL flows
pixi run deploy
pixi run run-etl
# Monitor via Prefect UI at http://localhost:4200
```

```bash
pixi run start-webservice
# API docs at http://localhost:8000/docs
```

Full documentation is available via MkDocs Material and can be previewed locally:

```bash
pixi install -e docs
pixi run -e docs docs-serve
# Open http://127.0.0.1:8000
```

See the development guides in the documentation for details.
This project is licensed under the BSD 3-Clause License. See LICENSE for details.