ca-biositing

CA-BioSiting is the backend data platform for Cal BioScape, a web-based tool supporting the development of a circular bioeconomy in California's Northern San Joaquin Valley (San Joaquin, Stanislaus, and Merced counties). The platform is developed at the University of Washington Scientific Software Engineering Center (SSEC) as part of the BioCircular Valley initiative -- a multi-institutional collaboration involving Lawrence Berkeley National Laboratory, UC Berkeley, UC Merced, UC Agriculture and Natural Resources, USDA Albany Agricultural Research Station, the Almond Board of California, and BEAM Circular. The initiative is funded through Schmidt Sciences' Virtual Institute on Feedstocks of the Future, with support from the Foundation for Food & Agriculture Research.

Cal BioScape aims to transform the region's abundant but often underutilized agricultural waste streams -- crop residues, almond shells, fruit peels, orchard trimmings -- into valuable bioproducts, sustainable biofuels, and advanced materials. The platform serves feedstock suppliers, biomanufacturing companies, policymakers, and researchers by providing interactive mapping, comprehensive data integration, spatial analysis, and programmatic API access.

Motivation

Identifying optimal locations for bioconversion facilities requires integrating heterogeneous datasets across multiple spatial and analytical domains. Biomass composition data (proximate, ultimate, compositional, ICP, XRF, XRD, and calorimetry analyses), field sampling records, parcel-level crop mapping (LandIQ), federal agricultural production statistics (USDA Census and Survey), and DOE Billion Ton Study projections all originate from different sources with varying schemas and spatial resolutions. Meaningful siting analysis depends on spatially joining these datasets to enable high-resolution visualization of agricultural feedstocks, spatial buffer and summary queries for radius-based resource aggregation, and temporal filtering across data sources.

CA-BioSiting provides the data infrastructure to automate ingestion of these datasets, normalize them into a common relational schema with geospatial support (PostgreSQL + PostGIS), and expose them through a REST API for downstream analysis, visualization, and data export.
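To make the radius-based resource aggregation mentioned above concrete, here is a minimal pure-Python sketch. It is an illustration only: the platform itself relies on PostgreSQL + PostGIS for spatial queries, and the `FeedstockSite` record, its fields, the function names, and the sample values below are all hypothetical rather than taken from the project's schema or data.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass(frozen=True)
class FeedstockSite:
    """Hypothetical point source of biomass (not the project's schema)."""
    name: str
    lat: float
    lon: float
    tonnes_per_year: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def biomass_within_radius(sites, lat, lon, radius_km):
    """Sum tonnes_per_year over all sites within radius_km of (lat, lon)."""
    return sum(
        s.tonnes_per_year
        for s in sites
        if haversine_km(lat, lon, s.lat, s.lon) <= radius_km
    )


# Made-up illustrative values, not project data.
sites = [
    FeedstockSite("site_a", 37.64, -120.99, 1200.0),  # roughly Modesto
    FeedstockSite("site_b", 37.30, -120.48, 800.0),   # roughly Merced
    FeedstockSite("site_c", 37.96, -121.29, 500.0),   # roughly Stockton
]
total = biomass_within_radius(sites, 37.64, -120.99, 50.0)
```

With a 50 km radius around the first site, only `site_a` and `site_c` fall inside the buffer. A database-side version would normally push this filter into PostGIS (for example with `ST_DWithin`) so that a spatial index can be used instead of scanning every row.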

Key Features

Automated ETL pipelines for ingesting biomass characterization data, LandIQ crop mapping, USDA agricultural statistics, and DOE Billion Ton Study records
Spatially enabled relational database (PostgreSQL + PostGIS) with 15 domain model groups covering field sampling, analytical records, fermentation/pretreatment experiments, and geographic information
Materialized views that pre-compute spatial joins across datasets (e.g., LandIQ records with crop mapping, USDA records with commodity lookups, Billion Ton records with spatial tile aggregation)
REST API (FastAPI) for programmatic access to all ingested and derived data with interactive OpenAPI documentation
Cloud-native deployment on Google Cloud Run with Cloud SQL (PostgreSQL), infrastructure managed as code via Pulumi, and automated CI/CD through GitHub Actions
Reproducible environments using Pixi for local development and Docker for containerized production deployment

Data Domains

The database schema covers the following research domains:

Domain	Description
Aim 1 Records	Proximate, ultimate, compositional, ICP, XRF, XRD, and calorimetry analyses
Aim 2 Records	Fermentation and pretreatment experiment results
Field Sampling	Sample collection metadata and location information
Sample Preparation	Prepared sample tracking and provenance
External Data	LandIQ crop mapping, USDA Census/Survey records, Billion Ton 2023 projections
Resource Information	Biomass resource types and characteristics
Places & Infrastructure	Geographic locations, addresses, and facility information
People & Organizations	Contacts and institutional affiliations

Architecture

CA-BioSiting is organized as a PEP 420 namespace package with three independently installable components:

ca_biositing.datamodels -- SQLModel database models, Alembic migrations, and materialized view definitions
ca_biositing.pipeline -- Prefect-orchestrated ETL workflows for data extraction, transformation, and loading
ca_biositing.webservice -- FastAPI REST API for data access

The ETL pipeline extracts data from Google Sheets, transforms and validates records with pandas, and loads them into PostgreSQL via SQLAlchemy. Seven materialized views provide pre-computed spatial joins for common query patterns.
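The extract, transform, and load stages described above can be sketched with the standard library alone. This is a simplified stand-in, not the project's pipeline: CSV text replaces the Google Sheets export, plain Python replaces pandas, and an in-memory SQLite database replaces PostgreSQL via SQLAlchemy. The table name, column names, validation rules, and sample rows are invented for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for a sheet export; columns and values are invented.
RAW_SHEET = """sample_id,resource,moisture_pct
S-001,almond shells,8.2
S-002,orchard trimmings,
S-003,fruit peels,71.5
S-003,fruit peels,71.5
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with missing or out-of-range moisture; de-duplicate by sample_id."""
    seen, clean = set(), []
    for row in rows:
        raw = row["moisture_pct"].strip()
        if not raw:
            continue  # missing measurement
        value = float(raw)
        if not 0.0 <= value <= 100.0 or row["sample_id"] in seen:
            continue  # implausible percentage or duplicate sample
        seen.add(row["sample_id"])
        clean.append((row["sample_id"], row["resource"].strip(), value))
    return clean


def load(conn, records):
    """Insert validated records into a hypothetical `sample` table; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sample ("
        "sample_id TEXT PRIMARY KEY, "
        "resource TEXT NOT NULL, "
        "moisture_pct REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(RAW_SHEET)))
```

Keeping the three stages as separate functions mirrors how an orchestrator such as Prefect typically wraps each step as its own task, so a failed load can be retried without re-extracting.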

GitHub Actions

Workflows defined in this repository:
CI
CD
Migrations
Build and Push Docker Images
Deploy Staging
Deploy Production
Trigger Staging ETL
Deploy Resource Info to GitHub Pages

Docker Images

Image	Description
`ghcr.io/sustainability-software-lab/ca-biositing/pipeline`	ETL pipeline (Prefect flows and worker)
`ghcr.io/sustainability-software-lab/ca-biositing/webservice`	FastAPI REST API

Technology Stack

Component	Technology
Language	Python 3
Database	PostgreSQL 15 + PostGIS
ORM / Models	SQLModel (SQLAlchemy + Pydantic)
Migrations	Alembic
Workflow Orchestration	Prefect
Web API	FastAPI
Geospatial Analysis	QGIS, GeoAlchemy2, Shapely
Package Management	Pixi (conda-forge + PyPI)
Containerization	Docker, Docker Compose
Cloud Deployment	Google Cloud Run, Pulumi
CI/CD	GitHub Actions

Quick Start

Prerequisites

Pixi (v0.55.0+)
Docker
Google Cloud credentials (optional, for Google Sheets access)

Installation

git clone https://github.com/sustainability-software-lab/ca-biositing.git
cd ca-biositing
pixi install
pixi run pre-commit-install

Running the ETL Pipeline

# Create environment file from template
cp resources/docker/.env.example resources/docker/.env

# Start all services (PostgreSQL, Prefect server, worker)
pixi run start-services

# Deploy and run ETL flows
pixi run deploy
pixi run run-etl

# Monitor via Prefect UI at http://localhost:4200

Running the Web Service

pixi run start-webservice
# API docs at http://localhost:8000/docs
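
Once the web service is running, one way to discover the available endpoints programmatically is to read the OpenAPI document that backs the interactive docs. The sketch below assumes the service keeps FastAPI's default schema location, `/openapi.json`; the helper names and the `/example` path in the offline sample are hypothetical and do not describe the project's actual routes.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_operations(openapi_doc):
    """Return sorted (METHOD, path) pairs from a parsed OpenAPI document."""
    ops = []
    for path, item in openapi_doc.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                ops.append((key.upper(), path))
    return sorted(ops)


def fetch_openapi(base_url="http://localhost:8000"):
    """Download the schema; assumes FastAPI's default /openapi.json location."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)


# Offline example with an invented document; the path is hypothetical.
sample_doc = {
    "openapi": "3.1.0",
    "paths": {"/example": {"get": {}, "parameters": []}},
}
operations = list_operations(sample_doc)
```

With the service started via `pixi run start-webservice`, calling `list_operations(fetch_openapi())` should list whatever routes the API actually exposes.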

Documentation

Full documentation is available via MkDocs Material and can be previewed locally:

pixi install -e docs
pixi run -e docs docs-serve
# Open http://127.0.0.1:8000

Contributing

See the development guides in the documentation for details on contributing.

License

This project is licensed under the BSD 3-Clause License. See LICENSE for details.
