Civic Spiegel: NY Civic Research Assistant

Civic Spiegel is a civic intelligence platform that makes New York City and New York State government data legible and searchable. Spiegel is German for mirror — the platform holds a clear, undistorted, non-partisan reflection of official government records back to the residents those records affect.

Using Retrieval-Augmented Generation (RAG), residents can ask plain-English questions and receive answers grounded in official legislative transcripts, bill text, and meeting minutes — with citations to the source documents.

The Problem

NYC and NYS governments produce thousands of pages of public records every month (hearing transcripts, bill text, committee minutes, budget documents), spread across more than a dozen separate portals, each with its own formats and search interface. The information is technically public but practically inaccessible to most residents.

With Civic Spiegel, users receive simple, personalized policy briefings instead of having to read confusing government documents.

Most residents don’t know:

❖ what policies exist

❖ who their local legislative representatives actually are

❖ what the impacts of each bill might be

❖ how those decisions affect their neighborhood and daily life

The Solution

Civic Spiegel scrapes official sources daily, embeds the text with a sentence-embedding model (BAAI/bge-small-en-v1.5), stores the vectors in Postgres via pgvector, and surfaces the most relevant records when you ask a question. The LLM answers directly from the retrieved official documents, with citations, rather than from training memory.

✨ Features

| Feature | Description |
| --- | --- |
| Policy Briefings | Ask anything about NYC/NYS policy. The RAG engine retrieves official transcripts, bills, and meeting records. Llama 3.1 8B synthesizes a structured answer with source citations. |
| Ask Spiegel | Floating chat widget available on every page. Multi-turn conversation with user profile context. Falls back to GPT-4o-mini when the document index has no matching content. |
| Legislative Directory | 290+ elected officials representing NY across all five government levels, searchable by borough, district, party, committee, subcommittee, and caucus. Updated daily. |
| Explore Maps | NYC Council district choropleth with address lookup. NYS statewide ArcGIS embed. Civic Hub map with location pins and six toggleable boundary layers. |
| Civic Calendar | Nine official public meeting calendars, hearing schedules, and livestreams in one place. |
| Accessibility | Seven display settings (large text, high contrast, reduce motion, underline links, readable font, focus mode, color-blind friendly) plus browser text-to-speech. |

📐 Architecture & Tech Stack

| Layer | Technology | Notes |
| --- | --- | --- |
| Frontend | Next.js 14 + TypeScript + Tailwind CSS | App Router, hosted on Vercel |
| Backend | FastAPI (Python 3.11) | Hosted on Render free tier |
| Database | Neon Serverless Postgres + pgvector | 6-table schema, 384-dim embeddings |
| Embeddings | BAAI/bge-small-en-v1.5 via FastEmbed | ONNX, CPU-only, 384-dim float32 |
| Primary LLM | Llama 3.1 8B via Groq Cloud | RAG-grounded structured briefings |
| Fallback LLM | GPT-4o-mini via OpenAI | When retrieval tier is 'none' |
| Scraping | Cheerio (TypeScript) + BeautifulSoup (Python) | HTML + API scrapers |
| NLP | spaCy en_core_web_sm | Policy area classification + NER |
| Automation | GitHub Actions (×2 daily workflows) | 06:00 UTC, zero manual intervention |
| Geospatial | react-simple-maps + Leaflet.js | Choropleth + pin maps |

Total infrastructure cost: $0 — Vercel, Render, Neon, and GitHub Actions all on free tiers.

🛠️ System Architecture

All data is gathered by two automated workflows:

┌─────────────── GITHUB ACTIONS (daily 06:00 UTC) ────────────────────┐
│                                                                      │
│  run_pipeline.yml (Python)        refresh-politicians.yml (Node)    │
│  ┌──────────────────────────┐     ┌──────────────────────────────┐  │
│  │ Legistar REST API        │     │ council.nyc.gov (HTML)       │  │
│  │ NYS Senate Open Leg. API │     │ nyassembly.gov (HTML)        │  │
│  │ NYC Open Data (Socrata)  │     │ legislation.nysenate.gov API │  │
│  │ NYT Regional RSS         │     │ house.gov (HTML)             │  │
│  │        ↓                 │     │ OpenStates REST + GraphQL    │  │
│  │ Clean → Classify         │     │          ↓                   │  │
│  │ Chunk  → Embed           │     │ Normalize → Merge            │  │
│  │        ↓                 │     │          ↓                   │  │
│  │ Neon Postgres            │     │ politicians.json             │  │
│  │ (DocumentChunk +         │     │ (committed to repo)          │  │
│  │  pgvector embeddings)    │     └──────────────────────────────┘  │
│  └──────────────────────────┘                                       │
└──────────────────────────────────────────────────────────────────────┘
               │                               │
               ▼                               ▼
      FastAPI Backend (Render)        Next.js Frontend (Vercel)
      /api/chat   (RAG)               /api/civic/chat    (proxy)
      /api/politicians                /api/civic/floating-chat
      /api/health                     /api/civic/politicians
      /api/policies                   /api/llm/chat  (direct OpenAI)
      /districts

Key Architecture Decisions

Single Postgres for relational + vector data — pgvector extension collapses the relational DB and vector store into one Neon instance. No Pinecone/Chroma needed. Politicians, districts, legislation, votes, documents, and 384-dim embeddings all live in one place with full SQL join support.
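As a rough illustration of what "full SQL join support" buys, the sketch below builds one parameterized query that ranks chunks by pgvector cosine distance and joins them to their parent document in the same statement. Only the `embedding` column and the `<=>` operator come from this README; the table names `document_chunk` and `document`, the columns `title`, `source_url`, `policy_area`, and `document_id`, and both helper functions are assumptions made for the example, not the project's actual schema.

```python
from typing import Sequence


def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_search_query(query_vector: Sequence[float], policy_area: str, limit: int = 8):
    """Return (sql, params) for a filtered nearest-neighbour search.

    Table and column names are hypothetical placeholders.
    """
    sql = (
        "SELECT c.id, c.text, d.title, d.source_url, "
        "c.embedding <=> %s::vector AS distance "   # cosine distance
        "FROM document_chunk AS c "
        "JOIN document AS d ON d.id = c.document_id "
        "WHERE d.policy_area = %s "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (to_vector_literal(query_vector), policy_area, limit)
```

The returned pair can be handed to any DB-API style cursor (`cursor.execute(sql, params)`); with a separate vector store, the relational filter and the join would need a second round trip.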

Frontend proxy pattern — The browser never calls the Python backend directly. It calls Next.js /api/civic/* routes which forward to FastAPI. Centralizes error handling, caching, and timeout management.

Static JSON for politicians — The representative directory for NY legislature is pre-built into public/data/politicians.json by GitHub Actions and committed to the repo. Served as a static asset with a 24-hour CDN cache. No scraping happens at request time.
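Because the directory is just a static JSON file, filtering it needs no database. The sketch below shows the idea in Python for illustration only (the project's own filter logic lives in the TypeScript frontend). The record fields `name`, `level`, `borough`, and `party`, the sample records, and the `filter_politicians` helper are all hypothetical; the real shape of `politicians.json` is not documented here.

```python
import json

# Hypothetical sample in the spirit of public/data/politicians.json.
SAMPLE_JSON = """
[
  {"name": "Example Member A", "level": "city_council", "borough": "Brooklyn", "party": "Democratic"},
  {"name": "Example Member B", "level": "state_senate", "borough": "Queens", "party": "Republican"},
  {"name": "Example Member C", "level": "city_council", "borough": "Queens", "party": "Democratic"}
]
"""


def filter_politicians(records, **criteria):
    """Keep records whose fields equal every given criterion (case-insensitive)."""
    def matches(record):
        return all(
            str(record.get(field, "")).lower() == str(value).lower()
            for field, value in criteria.items()
        )
    return [record for record in records if matches(record)]


records = json.loads(SAMPLE_JSON)
queens_council = filter_politicians(records, borough="queens", level="city_council")
```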

Dual LLM — Groq (Llama 3.1 8B) is the primary LLM for RAG-grounded answers. OpenAI GPT-4o-mini is the fallback when the document retrieval tier is 'none' or the RAG response is empty. This ensures the chat always has something useful to say.
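The fallback rule described above can be sketched as a small decision function. This is an illustration in Python, not the project's orchestration code; the function name, the `"rag"` and `"fallback"` labels, and the `fallback` callable are assumptions. Only the condition itself (tier is `'none'` or the RAG response is empty) comes from this README.

```python
from typing import Callable, Optional, Tuple


def answer_with_fallback(
    retrieval_tier: str,
    rag_answer: Optional[str],
    fallback: Callable[[], str],
) -> Tuple[str, str]:
    """Return (answer, source), where source is 'rag' or 'fallback'.

    The fallback callable stands in for a call to the secondary LLM and is
    only invoked when the grounded answer cannot be used.
    """
    has_rag_answer = bool(rag_answer and rag_answer.strip())
    if retrieval_tier != "none" and has_rag_answer:
        return rag_answer, "rag"
    return fallback(), "fallback"
```

Passing the fallback as a callable keeps the second model from being called (and billed) when the grounded answer is available.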

GitHub Actions as ML compute — The Python pipeline uses FastEmbed + spaCy, which together need ~2–4GB RAM. Render's free tier has 512MB. GitHub Actions' Ubuntu runners have 7GB. We use Actions for all the heavy lifting and Render only for live query-time inference.

🏛️ Five Levels of Government Coverage

| Level | Body | Members (NY) | Term | What They Control |
| --- | --- | --- | --- | --- |
| City Council | NYC City Council | 51 | 4 years (2-term limit) | City budget, local laws, zoning (ULURP), sanitation, parks |
| State Assembly | NYS Assembly | 150 | 2 years | State legislation (lower chamber), education, labor, state budget |
| State Senate | NYS Senate | 63 | 2 years | State legislation (upper chamber), judicial confirmations, budget oversight |
| U.S. House | US House | 26 | 2 years | Federal laws, appropriations, constituent services |
| U.S. Senate | US Senate | 2 | 6 years | Federal legislation, treaties, cabinet/judicial confirmations |

Key insight: Many policies NYC residents think are City Hall decisions are actually state law — rent stabilization, bail reform, MTA funding. This is why covering all five levels matters.

🗂️ Data Sources

See all data used in: Civic Spiegel - Data Sources

Document Corpus (ingested by Python pipeline daily)

| Source | Type | Auth |
| --- | --- | --- |
| NYC Council Legistar REST API | Bills, resolutions, matter text | `NYC_COUNCIL_API_KEY` |
| NYC Open Data: Council Meetings (m48u-yjt8) | Finalized meeting records | None |
| NYS Senate Open Legislation: Bills | Session bills with summaries | `NYS_SENATE_API_KEY` |
| NYS Senate Open Legislation: Transcripts | Floor + hearing full text | `NYS_SENATE_API_KEY` |
| NYT Regional RSS | Supplementary news context | None |

Legislative Directory (TypeScript scraper daily)

| Source | What It Provides |
| --- | --- |
| council.nyc.gov/districts/ + per-member pages | 51 Council members, committees, caucuses, contact info |
| nyassembly.gov/mem/ + /comm/ pages | 150 Assembly members, committee assignments |
| legislation.nysenate.gov/api/3/members/ | 63 Senate member profiles |
| house.gov/representatives | 26 NY House members |
| OpenStates REST + GraphQL | Party affiliation, committee enrichment for Assembly + Senate |
| Hardcoded | US Senators Schumer + Gillibrand |

Geospatial Boundaries (static files in `/public`)

6 GeoJSON files: NYC Council districts, NYC boroughs, NYC NTAs (195 neighborhoods), Congressional districts, NYS Senate districts, NYS Assembly districts.

Address geocoding via NYC Planning Labs geocoder (free, no key required).
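Once an address has been geocoded to a longitude/latitude pair, matching it to a district is a point-in-polygon test against the boundary GeoJSON. The sketch below uses plain ray casting so it runs without dependencies; it is not the project's implementation (the repo's `geo_crosswalk.py` is described as using Shapely). The `district` property name and the function names are assumptions, and points that fall exactly on a boundary are not handled specially.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count crossings of a ray running east from the point."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        if (yi > lat) != (yj > lat):  # edge straddles the point's latitude
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, rings):
    """GeoJSON polygon: first ring is the outline, the rest are holes."""
    if not rings or not point_in_ring(lon, lat, rings[0]):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in rings[1:])


def find_district(lon, lat, features, key="district"):
    """Return the property `key` of the first feature containing the point."""
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry["type"] == "MultiPolygon":
            polygons = geometry["coordinates"]
        else:
            continue  # points and lines cannot contain an address
        if any(point_in_polygon(lon, lat, rings) for rings in polygons):
            return feature["properties"].get(key)
    return None
```

For six boundary layers and a single lookup per request a linear scan like this is usually fast enough; a spatial index only starts to matter for bulk crosswalks.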

RAG Pipeline — How Answers Are Generated

1. User question arrives with optional profile demographics (borough, ZIP, interests).
2. Location augmentation: borough and neighborhood terms derived from the user's ZIP are appended to the query.
3. Embedding: the augmented query becomes a 384-dim vector via the HuggingFace API, with local FastEmbed as the fallback.
4. pgvector cosine search: `ORDER BY embedding <=> query_vector LIMIT top_k×8`.
5. Window expansion: each result is expanded by ±1 neighboring chunks for fuller context.
6. Three-tier fallback: vector → lexical ILIKE → recency; `retrieval_tier` is returned in the response.
7. LLM generation: Groq Llama 3.1 8B produces a structured JSON briefing or plain markdown.
8. OpenAI fallback: if `retrieval_tier='none'`, GPT-4o-mini answers from general knowledge.
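The retrieval steps above (cosine search, window expansion, three-tier fallback) can be sketched in memory, without a database. This is a simplified model rather than the backend's code: the `Chunk` fields, the tier labels `"vector"`, `"lexical"`, and `"recency"`, and the `max_distance` cut-off that decides when the vector tier counts as a miss are all assumptions. Only the tier order, the ±1 window, and the `'none'` tier come from this README.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    index: int              # position of the chunk inside its document
    text: str
    embedding: list[float]
    published: str          # ISO date string, used only by the recency tier


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def expand_window(hits, chunks, radius=1):
    """Add the +/- radius neighbours of every hit, without duplicates."""
    seen, expanded = set(), []
    for hit in hits:
        neighbours = sorted(
            (c for c in chunks
             if c.doc_id == hit.doc_id and abs(c.index - hit.index) <= radius),
            key=lambda c: c.index,
        )
        for chunk in neighbours:
            key = (chunk.doc_id, chunk.index)
            if key not in seen:
                seen.add(key)
                expanded.append(chunk)
    return expanded


def retrieve(query_text, query_vector, chunks, top_k=2, max_distance=0.5):
    """Return (retrieval_tier, context_chunks)."""
    # Tier 1: nearest chunks by cosine distance (assumed cut-off).
    scored = sorted(
        (cosine_distance(c.embedding, query_vector), i) for i, c in enumerate(chunks)
    )
    hits = [chunks[i] for distance, i in scored[:top_k] if distance <= max_distance]
    tier = "vector"
    if not hits:
        # Tier 2: case-insensitive substring match, standing in for SQL ILIKE.
        needle = query_text.lower()
        hits = [c for c in chunks if needle in c.text.lower()][:top_k]
        tier = "lexical"
    if not hits:
        # Tier 3: newest chunks, regardless of content.
        hits = sorted(chunks, key=lambda c: c.published, reverse=True)[:top_k]
        tier = "recency"
    if not hits:
        return "none", []
    return tier, expand_window(hits, chunks)
```

Returning the tier alongside the chunks is what lets the caller decide whether the answer is grounded or whether the general-knowledge fallback should take over.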

⚙️ Environment Variables

| Variable | Used By | Required For |
| --- | --- | --- |
| `DATABASE_URL` | Backend, pipeline | All DB operations |
| `NYS_SENATE_API_KEY` | TS scraper + Python pipeline | Senate members, bills, transcripts |
| `NYC_COUNCIL_API_KEY` | Python pipeline | Legistar bill text |
| `OPEN_STATES_API_KEY` | TS scraper | Party + committee enrichment |
| `GROQ_API_KEY` | Backend | RAG LLM responses |
| `OPENAI_API_KEY` | Frontend (Next.js) | Floating chat fallback + /chat page |
| `OPENAI_MODEL` | Frontend | Optional model override (default: `gpt-4o-mini`) |
| `HF_TOKEN` | Backend embed.py | HuggingFace API (optional, avoids rate limits) |
| `BACKEND_PROD_URL` | cron/keep_alive.py | Keep-alive pings to Render |
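A missing key tends to surface late, as a failed scrape or an empty chat reply, so a start-up check can save some debugging. The sketch below is one possible way to do it and is not part of the repo. The variable names come from the table above, but the grouping into `backend` and `pipeline` is a reading of the "Used By" column, and the `missing_variables` helper is hypothetical.

```python
import os
from typing import List, Mapping

# Assumed grouping, derived from the "Used By" column above.
REQUIRED = {
    "backend": ["DATABASE_URL", "GROQ_API_KEY"],
    "pipeline": ["DATABASE_URL", "NYS_SENATE_API_KEY", "NYC_COUNCIL_API_KEY"],
}


def missing_variables(component: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """List required variables that are unset or blank for one component."""
    return [name for name in REQUIRED[component] if not env.get(name, "").strip()]


if __name__ == "__main__":
    for component in REQUIRED:
        missing = missing_variables(component)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{component}: {status}")
```

`HF_TOKEN` is left out on purpose because the table marks it as optional.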

🚀 Getting Started

Prerequisites

- Python 3.11+
- Node.js 20+
- Neon Postgres account (with pgvector enabled)
- Groq API key
- OpenAI API key

Setup

```bash
# 1. Clone the repo
git clone <repo-url>
cd civic-spiegel

# 2. Backend setup
cd backend
cp ../.env.example .env   # Fill in DATABASE_URL, GROQ_API_KEY, etc.
pip install -r requirements.txt
python init_db.py          # Initialize all 6 tables + pgvector

# 3. Populate historical data (required for chat to have context)
cd ../pipeline
python backfill_history.py # Backfills 2021, 2023, 2025 session years

# 4. Start the backend
cd ../backend
uvicorn main:app --reload  # Runs on localhost:8000

# 5. Frontend setup
cd ../frontend
cp .env.local.example .env.local  # Fill in OPENAI_API_KEY, NYS_SENATE_API_KEY, NYC_COUNCIL_API_KEY, etc.
npm install
npm run dev                # Runs on localhost:3000
```

Automated Pipeline (GitHub Actions)

1. Add all required secrets to GitHub Repository Secrets (Settings → Secrets → Actions).
2. Push to main. The two workflows (run_pipeline.yml, refresh-politicians.yml) run on schedule at 06:00 UTC daily.
3. Both can also be triggered manually via workflow_dispatch in the Actions tab.

Refreshing Politicians Manually

```bash
cd frontend
npm run refresh-politicians
# Writes to public/data/politicians.json
# Commit and push to update the live site
```

🧬 Project Structure

civic-spiegel/
├── frontend/                    # Next.js 14 app (TypeScript)
│   ├── src/
│   │   ├── app/
│   │   │   └── api/
│   │   │       ├── civic/
│   │   │       │   ├── politicians/route.ts   # 4-layer fallback data route
│   │   │       │   ├── chat/route.ts          # RAG proxy to FastAPI
│   │   │       │   ├── floating-chat/route.ts # Dual-LLM orchestration
│   │   │       │   └── policies/route.ts      # Policy feed proxy
│   │   │       └── llm/
│   │   │           └── chat/route.ts          # Direct OpenAI for /chat page
│   │   ├── lib/
│   │   │   ├── politicians.ts       # Types, cache, filter logic
│   │   │   ├── api.ts               # All client-side API calls
│   │   │   ├── scrapers.ts          # Server-side scraper functions
│   │   │   ├── policy-reply.ts      # RAG response normalization
│   │   │   └── useProfile.ts        # localStorage profile hook
│   │   └── components/civiq/
│   │       ├── HomeShell.tsx        # Main page orchestrator
│   │       ├── FloatingChatBot.tsx  # Ask Spiegel widget
│   │       ├── PoliticianCards.tsx  # Representative directory
│   │       ├── CivicMap.tsx         # 4-tab map system
│   │       ├── PolicyBriefingPanel.tsx
│   │       ├── AccessibilityWidget.tsx
│   │       ├── OnboardingModal.tsx
│   │       ├── SettingsModal.tsx
│   │       └── ...
│   ├── public/
│   │   ├── data/politicians.json        # Pre-built rep cache (auto-committed)
│   │   ├── boundaries-districts.geojson
│   │   ├── boundaries-boroughs.geojson
│   │   ├── boundaries-neighborhoods.geojson
│   │   ├── boundaries-congressional.geojson
│   │   ├── boundaries-nys-senate.geojson
│   │   └── boundaries-nys-assembly.geojson
│   └── scripts/
│       └── refresh-politicians.ts   # CLI: scrape all reps → politicians.json
│
├── backend/                     # FastAPI (Python)
│   ├── main.py                  # All endpoints + RAG orchestration
│   ├── schema.py                # SQLModel tables (6 tables + pgvector)
│   ├── db.py                    # Neon engine with pool config
│   ├── embed.py                 # Query-time embedding (HF API → FastEmbed fallback)
│   └── llm_engine.py            # Groq + mock mode LLM wrapper
│
├── pipeline/                    # Python document ingestion
│   ├── run_pipeline.py          # Unified runner (4 scrapers in sequence)
│   ├── backfill_history.py      # One-time historical import (2021, 2023, 2025)
│   ├── base_scraper.py          # Abstract base: DB insert, dedup, junk filter
│   ├── embedding_engine.py      # Sentence chunking + FastEmbed vectors
│   ├── tag_classifier.py        # spaCy NER + keyword policy classification
│   └── scrapers/
│       ├── nyc_council_legistar.py
│       ├── nyc_council_meetings.py
│       ├── nyc_council_rss.py
│       ├── nys_senate_bills.py
│       └── nys_senate_transcripts.py
│
├── scripts/                     # One-time utility scripts
│   ├── seed_politicians.py      # Seed Politician table from NYC Open Data
│   ├── sync_council_members.py  # Sync council members + cache districts.geojson
│   └── geo_crosswalk.py         # Shapely: ZIP + NTA → council district mapping
│
├── cron/
│   └── keep_alive.py            # Pings /api/health to prevent Render sleep
│
├── docs/
│   ├── DATABASE_ARCHITECTURE.md # Schema and design decisions
│   └── DOMAINS_AND_NUANCES.md   # NYC/NYS political context for data/ML team
│
└── .github/workflows/
    ├── run_pipeline.yml          # Daily 06:00 UTC: Python pipeline → Neon
    └── refresh-politicians.yml   # Daily 06:00 UTC: TS scraper → politicians.json

ScreenShot - Examples

🤝 Contributing

Read docs/DOMAINS_AND_NUANCES.md for the NYC/NYS political context before making any data or ML changes. Fork or clone this project to run the app locally.

