A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, Markdown, and XML files with lazy evaluation and intelligent optimizations.
π Full Documentation | π Quick Start | π¬ Discussions
# Query a CSV file (explicit source)
$ sqlstream query data.csv "SELECT * FROM data WHERE age > 25"
# Query with inline file path (source inferred from SQL)
$ sqlstream query "SELECT * FROM 'data.csv' WHERE age > 25"
# Query JSON with nested paths
$ sqlstream query "users.json#json:data.users" "SELECT name, email FROM users"
# Query JSONL (JSON Lines) files
$ sqlstream query logs.jsonl "SELECT timestamp, level, message FROM logs WHERE level = 'ERROR'"
# Query HTML tables
$ sqlstream query "report.html#html:0" "SELECT * FROM report WHERE revenue > 1000000"
# Query Markdown tables
$ sqlstream query "README.md#markdown:1" "SELECT column1, column2 FROM readme"
# Query XML files
$ sqlstream query "data.xml#xml:record" "SELECT name, age FROM data WHERE age > 25"
# Query S3 files with partitions
$ sqlstream query "SELECT * FROM 's3://my-bucket/data/year=2024/' WHERE date > '2024-01-01'"
# Join multiple formats (CSV + Parquet + JSON)
$ sqlstream query "SELECT c.name, o.total, u.email
FROM 'customers.csv' c
JOIN 'orders.parquet' o ON c.id = o.customer_id
JOIN 'users.json#json:users' u ON c.user_id = u.id"
# Interactive shell with full TUI
$ sqlstream shell- π Pure Python - No database installation required
- π Multiple Formats - CSV, Parquet, JSON, JSONL, HTML, Markdown, XML with nested path syntax, HTTP URLs, S3 buckets
- β‘ 100x+ Faster - DuckDB backend for complex SQL, Pandas backend for simple queries
- π JOIN Support - INNER, LEFT, RIGHT, FULL OUTER joins across different file formats
- π Aggregations - GROUP BY with COUNT, SUM, AVG, MIN, MAX, DISTINCT
- π’ Rich Type System - 10 data types (INTEGER, FLOAT, DECIMAL, STRING, JSON, BOOLEAN, DATE, TIME, DATETIME, NULL) with automatic inference
- βοΈ S3 Support - Query files directly from Amazon S3 with partition support
- π¨ Beautiful Output - Rich tables, JSON, CSV, Parquet, Markdown formatting
- π₯οΈ Advanced Interactive Shell - Multiple tabs, sidebars, layout cycling, backend toggle, state persistence, file browser
- π Smart Optimizations - Column pruning, predicate pushdown, limit pushdown, lazy evaluation
- π REST API Ready - Query HTTP endpoints and APIs (coming soon)
- π¦ Lightweight - Minimal dependencies, works everywhere
Using uv (recommended)
Basic (CSV only):
uv tool install sqlstreamAll features:
uv tool install "sqlstream[all]"Multiple Sub Dependencies:
uv tool install "sqlstream[interactive,pandas,s3,http,html,duckdb]"Optional Dependencies:
pandas- Pandas backend for 10-100x speedupduckdb- DuckDB backend for 100x+ speedup and advanced SQLparquet- Parquet file supports3- Amazon S3 file accesshttp- HTTP/HTTPS data sourceshtml- HTML table extraction (requires pandas, html5lib, beautifulsoup4)interactive- Interactive shell with rich TUIall- All features combined
Basic (CSV only):
pip install sqlstreamAll features:
pip install "sqlstream[all]"See Installation Guide for more options.
# Simple query
$ sqlstream query data.csv "SELECT name, salary FROM data WHERE salary > 80000"
# With pandas backend for performance
$ sqlstream query data.csv "SELECT * FROM data" --backend pandas
# JSON output
$ sqlstream query data.csv "SELECT * FROM data" --format json
# Interactive shell with TUI
$ sqlstream shell data.csv$ sqlstream shellEnhanced TUI Features:
Query Management:
- Multiple Tabs (
Ctrl+T/Ctrl+W) - Work on multiple queries simultaneously - State Persistence - Auto-save tabs, queries, and layout between sessions
- Query History (
Ctrl+Up/Down) - Navigate previous queries with multiline support - Auto-completion - Schema-aware suggestions for tables and columns
- Syntax Highlighting - SQL syntax with Dracula theme
Sidebars & Layout:
- Dynamic Sidebars (
F2/F3) - Schema browser, File explorer, Filter, Export, Config - Layout Cycling (
Ctrl+L) - Resize query editor: 50%, 60%, 70%, 80%, 100% - File Browser (
Ctrl+O) - Tree view with directory navigation - Schema Browser - Real-time schema and type information
Execution & Performance:
- Backend Toggle (
F5orCtrl+B) - Cycle: Auto β DuckDB β Pandas β Python - Execution Plan (
F4) - View query optimization steps - Async Execution - Responsive UI during long queries
- Cancel Queries (
Ctrl+C) - Stop running queries
Results Management:
- Advanced Filtering (
Ctrl+F) - Column-specific or global search - Smart Export (
Ctrl+X) - CSV, JSON, or Parquet with format selection - Pagination - 100 rows per page, configurable
- Column Sorting - Click headers to sort
- Live Stats - Row counts and filter status
Keyboard Shortcuts:
Ctrl+Delete/Ctrl+Backspace- Word-aware deletionCtrl+QorCtrl+D- Exit with auto-save
from sqlstream import query
# Execute query with explicit source
results = query("data.csv").sql("SELECT * FROM data WHERE age > 25")
# Execute query with inline source (extracted from SQL)
results = query().sql("SELECT * FROM 'data.csv' WHERE age > 25")
# Iterate over results
for row in results:
print(row)
# Or convert to list
results_list = query().sql("SELECT * FROM 'data.csv'").to_list()Full documentation: https://subhayu99.github.io/sqlstream
Key sections:
- Quick Start Guide - Get started in 5 minutes
- SQL Reference - Supported SQL syntax
- CLI Reference - Command-line interface
- Python API - Programmatic usage
- Examples - Real-world examples
- Troubleshooting - Common issues and solutions
- Architecture - How it works
Current Phase: 9 (Enhanced Interactive Shell - Complete!)
- β Phase 0-2: Core query engine with Volcano model
- β Phase 3: Parquet support
- β Phase 4: Aggregations & GROUP BY
- β Phase 5: JOIN operations (INNER, LEFT, RIGHT)
- β Phase 5.5: Pandas backend (10-100x speedup)
- β Phase 6: HTTP data sources
- β Phase 7: CLI with beautiful output
- β Phase 7.5: Interactive mode with Textual
- β Phase 7.6: Inline file path support
- β Phase 8: Type system & schema inference
- β Phase 9: Enhanced interactive shell (multiple tabs, state persistence, file browser, query plan)
- β Phase 10: HTML & Markdown readers with table extraction
- β Phase 11: Enhanced type system (Decimal, DateTime, Date, Time, JSON)
- π§ Phase 12: Comprehensive testing & documentation (15% coverage β 80% target)
Test Coverage: 560 tests, 15% coverage (actively improving)
SQLStream offers three execution backends:
| Backend | Speed | Use Case |
|---|---|---|
| Python | Baseline | Learning, small files (<100K rows) |
| Pandas | 10-100x faster | Basic queries, large files (>100K rows) |
| DuckDB | 100x+ faster | Complex SQL, analytics, huge files (10M+ rows) |
Benchmark (1M rows):
- Python backend: 52s
- Pandas backend: 0.8s β‘ 65x faster
SQLStream uses the Volcano iterator model for query execution:
SQL Query β Parser β AST β Planner β Optimizer β Executor β Results
β
(Column Pruning, Predicate Pushdown,
Lazy Evaluation)
Key concepts:
- Lazy Evaluation: Rows are processed on-demand
- Column Pruning: Only read columns that are used
- Predicate Pushdown: Apply filters early to reduce data scanned
- Three Backends: Pure Python (learning), Pandas (performance), and DuckDB (full SQL)
See Architecture Guide for details.
Contributions are welcome! See Contributing Guide for details.
Development setup:
# Clone repository
git clone https://github.com/subhayu99/sqlstream.git
cd sqlstream
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
ruff format .
ruff check .MIT License - see LICENSE for details.
Built with β€οΈ by the SQLStream Team
π Documentation β’ π Issues β’ π¬ Discussions