ca-biositing

CA-BioSiting is the backend data platform for Cal BioScape, a web-based tool supporting the development of a circular bioeconomy in California's Northern San Joaquin Valley (San Joaquin, Stanislaus, and Merced counties). The platform is developed at the University of Washington Scientific Software Engineering Center (SSEC) as part of the BioCircular Valley initiative -- a multi-institutional collaboration involving Lawrence Berkeley National Laboratory, UC Berkeley, UC Merced, UC Agriculture and Natural Resources, USDA Albany Agricultural Research Station, the Almond Board of California, and BEAM Circular. The initiative is funded through Schmidt Sciences' Virtual Institute on Feedstocks of the Future, with support from the Foundation for Food & Agriculture Research.

Cal BioScape aims to transform the region's abundant but often underutilized agricultural waste streams -- crop residues, almond shells, fruit peels, orchard trimmings -- into valuable bioproducts, sustainable biofuels, and advanced materials. The platform serves feedstock suppliers, biomanufacturing companies, policymakers, and researchers by providing interactive mapping, comprehensive data integration, spatial analysis, and programmatic API access.

Motivation

Identifying optimal locations for bioconversion facilities requires integrating heterogeneous datasets across multiple spatial and analytical domains. Biomass composition data (proximate, ultimate, compositional, ICP, XRF, XRD, and calorimetry analyses), field sampling records, parcel-level crop mapping (LandIQ), federal agricultural production statistics (USDA Census and Survey), and DOE Billion Ton Study projections all originate from different sources with varying schemas and spatial resolutions. Meaningful siting analysis depends on spatially joining these datasets to enable high-resolution visualization of agricultural feedstocks, spatial buffer and summary queries for radius-based resource aggregation, and temporal filtering across data sources.

CA-BioSiting provides the data infrastructure to automate ingestion of these datasets, normalize them into a common relational schema with geospatial support (PostgreSQL + PostGIS), and expose them through a REST API for downstream analysis, visualization, and data export.
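
To make the radius-based resource aggregation mentioned above concrete, here is a minimal sketch in plain Python. It is not the project's implementation, which relies on PostGIS; the `Parcel` record, its field names, and the sample values are all hypothetical and made up for illustration. It sums residue tonnage per crop for parcels within a given distance of a candidate site, using the haversine formula for great-circle distance.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass(frozen=True)
class Parcel:
    """Hypothetical parcel record; field names are illustrative only."""
    crop: str
    lat: float
    lon: float
    residue_tonnes: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def residue_within_radius(parcels, site_lat, site_lon, radius_km):
    """Sum residue tonnage per crop for parcels within radius_km of the site."""
    totals = {}
    for p in parcels:
        if haversine_km(site_lat, site_lon, p.lat, p.lon) <= radius_km:
            totals[p.crop] = totals.get(p.crop, 0.0) + p.residue_tonnes
    return totals


# Made-up sample data: two parcels close to the site, one far away.
parcels = [
    Parcel("almond", 37.64, -120.99, 120.0),
    Parcel("almond", 37.66, -121.02, 80.0),
    Parcel("tomato", 37.30, -120.48, 200.0),
]
totals = residue_within_radius(parcels, 37.64, -120.99, radius_km=10.0)
print(totals)
```

In a PostGIS-backed deployment the same question would normally be answered in SQL with a spatial index rather than a Python loop; the sketch only shows the shape of the computation.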

Key Features

  • Automated ETL pipelines for ingesting biomass characterization data, LandIQ crop mapping, USDA agricultural statistics, and DOE Billion Ton Study records
  • Spatially enabled relational database (PostgreSQL + PostGIS) with 15 domain model groups covering field sampling, analytical records, fermentation/pretreatment experiments, and geographic information
  • Materialized views that pre-compute spatial joins across datasets (e.g., LandIQ records with crop mapping, USDA records with commodity lookups, Billion Ton records with spatial tile aggregation)
  • REST API (FastAPI) for programmatic access to all ingested and derived data with interactive OpenAPI documentation
  • Cloud-native deployment on Google Cloud Run with Cloud SQL (PostgreSQL), infrastructure managed as code via Pulumi, and automated CI/CD through GitHub Actions
  • Reproducible environments using Pixi for local development and Docker for containerized production deployment

Data Domains

The database schema covers the following research domains:

| Domain | Description |
| --- | --- |
| Aim 1 Records | Proximate, ultimate, compositional, ICP, XRF, XRD, and calorimetry analyses |
| Aim 2 Records | Fermentation and pretreatment experiment results |
| Field Sampling | Sample collection metadata and location information |
| Sample Preparation | Prepared sample tracking and provenance |
| External Data | LandIQ crop mapping, USDA Census/Survey records, Billion Ton 2023 projections |
| Resource Information | Biomass resource types and characteristics |
| Places & Infrastructure | Geographic locations, addresses, and facility information |
| People & Organizations | Contacts and institutional affiliations |

Architecture

CA-BioSiting is organized as a PEP 420 namespace package with three independently installable components:

  • ca_biositing.datamodels -- SQLModel database models, Alembic migrations, and materialized view definitions
  • ca_biositing.pipeline -- Prefect-orchestrated ETL workflows for data extraction, transformation, and loading
  • ca_biositing.webservice -- FastAPI REST API for data access
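
As a rough illustration of how a PEP 420 namespace package lets separately installed components share one top-level name, the sketch below builds two throwaway "distributions" in a temporary directory and imports both. The namespace `demo_biositing` and the `ROLE` marker are hypothetical stand-ins, not names from this repository; the key point is that the shared top-level directory has no `__init__.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical names for illustration only.
NAMESPACE = "demo_biositing"
PARTS = {"datamodels": "models", "webservice": "api"}


def make_distribution(root: Path, subpackage: str, marker: str) -> Path:
    """Create <root>/<subpackage>-dist/<NAMESPACE>/<subpackage>/__init__.py.

    The NAMESPACE directory deliberately gets no __init__.py, which is what
    makes it a PEP 420 namespace package.
    """
    dist = root / f"{subpackage}-dist"
    pkg = dist / NAMESPACE / subpackage
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(f"ROLE = {marker!r}\n")
    return dist


with tempfile.TemporaryDirectory() as tmp:
    added = [str(make_distribution(Path(tmp), sub, marker))
             for sub, marker in PARTS.items()]
    sys.path[:0] = added  # simulate two independently installed locations
    importlib.invalidate_caches()

    datamodels = importlib.import_module(f"{NAMESPACE}.datamodels")
    webservice = importlib.import_module(f"{NAMESPACE}.webservice")
    namespace = importlib.import_module(NAMESPACE)

    roles = (datamodels.ROLE, webservice.ROLE)
    # A namespace package has no file of its own and spans several directories.
    namespace_has_file = getattr(namespace, "__file__", None) is not None
    portion_count = len(list(namespace.__path__))

    # Undo the sys.path changes before the temporary directory is removed.
    for entry in added:
        sys.path.remove(entry)

print(roles, namespace_has_file, portion_count)
```

Each subpackage resolves from its own directory, so one component can be installed or upgraded without touching the others.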

The ETL pipeline extracts data from Google Sheets, transforms and validates records with pandas, and loads them into PostgreSQL via SQLAlchemy. Seven materialized views provide pre-computed spatial joins for common query patterns.
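
The extract, transform/validate, and load stages described above can be sketched with the standard library alone. This is an assumption-laden stand-in, not the project's pipeline: CSV text replaces the Google Sheets download, plain Python replaces pandas, and an in-memory SQLite database replaces PostgreSQL via SQLAlchemy. The `measurement` table, its columns, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for a sheet export; not real project data.
RAW_SHEET = """sample_id,analysis,value
S-001,moisture,8.4
S-002,moisture,
S-003,ash,not measured
S-004,ash,3.1
"""


def extract(text):
    """Read rows from CSV text (a stand-in for a Google Sheets download)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Keep rows whose value parses as a float; collect the rest as rejects."""
    valid, rejected = [], []
    for row in rows:
        try:
            value = float(row["value"])
        except (TypeError, ValueError):
            rejected.append(row["sample_id"])
            continue
        valid.append(
            (row["sample_id"].strip(), row["analysis"].strip().lower(), value)
        )
    return valid, rejected


def load(conn, records):
    """Insert validated records and return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurement "
        "(sample_id TEXT PRIMARY KEY, analysis TEXT NOT NULL, value REAL NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", records)
    return conn.execute("SELECT COUNT(*) FROM measurement").fetchone()[0]


conn = sqlite3.connect(":memory:")
valid, rejected = transform(extract(RAW_SHEET))
loaded = load(conn, valid)
print(f"loaded={loaded} rejected={rejected}")
```

Keeping rejected rows in a separate list, rather than silently dropping them, makes it easier to report data-quality problems back to whoever maintains the source sheet.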

GitHub Actions

  • CI
  • CD
  • Migrations
  • Build and Push Docker Images
  • Deploy Staging
  • Deploy Production
  • Trigger Staging ETL
  • Deploy Resource Info to GitHub Pages

Docker Images

| Image | Description |
| --- | --- |
| ghcr.io/sustainability-software-lab/ca-biositing/pipeline | ETL pipeline (Prefect flows and worker) |
| ghcr.io/sustainability-software-lab/ca-biositing/webservice | FastAPI REST API |

Technology Stack

| Component | Technology |
| --- | --- |
| Language | Python 3 |
| Database | PostgreSQL 15 + PostGIS |
| ORM / Models | SQLModel (SQLAlchemy + Pydantic) |
| Migrations | Alembic |
| Workflow Orchestration | Prefect |
| Web API | FastAPI |
| Geospatial Analysis | QGIS, GeoAlchemy2, Shapely |
| Package Management | Pixi (conda-forge + PyPI) |
| Containerization | Docker, Docker Compose |
| Cloud Deployment | Google Cloud Run, Pulumi |
| CI/CD | GitHub Actions |

Quick Start

Prerequisites

  • Pixi (v0.55.0+)
  • Docker
  • Google Cloud credentials (optional, for Google Sheets access)

Installation

git clone https://github.com/sustainability-software-lab/ca-biositing.git
cd ca-biositing
pixi install
pixi run pre-commit-install

Running the ETL Pipeline

# Create environment file from template
cp resources/docker/.env.example resources/docker/.env

# Start all services (PostgreSQL, Prefect server, worker)
pixi run start-services

# Deploy and run ETL flows
pixi run deploy
pixi run run-etl

# Monitor via Prefect UI at http://localhost:4200

Running the Web Service

pixi run start-webservice
# API docs at http://localhost:8000/docs
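
Once the service is running, one way to discover what it exposes without guessing endpoint names is to read its OpenAPI schema. The sketch below assumes the service keeps FastAPI's default schema route, `/openapi.json`, which this README does not confirm. The `list_operations` and `fetch_schema` helpers are illustrative, and the paths in `example_schema` are made up so the example runs offline.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_operations(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for method in item:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                operations.append((method.upper(), path))
    return sorted(operations)


def fetch_schema(base_url="http://localhost:8000"):
    """Download the schema from a running service (assumes the default route)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


# Offline example with a made-up schema; the paths below are hypothetical.
example_schema = {
    "openapi": "3.1.0",
    "paths": {
        "/health": {"get": {}},
        "/items/{item_id}": {"get": {}, "delete": {}, "parameters": []},
    },
}
operations = list_operations(example_schema)
for method, path in operations:
    print(method, path)
```

Against a live instance, `list_operations(fetch_schema())` would list the actual routes, provided the schema route has not been disabled or moved.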

Documentation

Full documentation is available via MkDocs Material and can be previewed locally:

pixi install -e docs
pixi run -e docs docs-serve
# Open http://127.0.0.1:8000

Contributing

See the development guides in the documentation for details.

License

This project is licensed under the BSD 3-Clause License. See LICENSE for details.
