Skip to content

A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, and Markdown files with lazy evaluation and intelligent optimizations.

Notifications You must be signed in to change notification settings

subhayu99/sqlstream

Repository files navigation

SQLStream Logo

A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, Markdown, and XML files with lazy evaluation and intelligent optimizations.

Tests Coverage Documentation License Downloads

πŸ“– Full Documentation | πŸš€ Quick Start | πŸ’¬ Discussions


Quick Example

# Query a CSV file (explicit source)
$ sqlstream query data.csv "SELECT * FROM data WHERE age > 25"

# Query with inline file path (source inferred from SQL)
$ sqlstream query "SELECT * FROM 'data.csv' WHERE age > 25"

# Query JSON with nested paths
$ sqlstream query "users.json#json:data.users" "SELECT name, email FROM users"

# Query JSONL (JSON Lines) files
$ sqlstream query logs.jsonl "SELECT timestamp, level, message FROM logs WHERE level = 'ERROR'"

# Query HTML tables
$ sqlstream query "report.html#html:0" "SELECT * FROM report WHERE revenue > 1000000"

# Query Markdown tables
$ sqlstream query "README.md#markdown:1" "SELECT column1, column2 FROM readme"

# Query XML files
$ sqlstream query "data.xml#xml:record" "SELECT name, age FROM data WHERE age > 25"

# Query S3 files with partitions
$ sqlstream query "SELECT * FROM 's3://my-bucket/data/year=2024/' WHERE date > '2024-01-01'"

# Join multiple formats (CSV + Parquet + JSON)
$ sqlstream query "SELECT c.name, o.total, u.email
  FROM 'customers.csv' c
  JOIN 'orders.parquet' o ON c.id = o.customer_id
  JOIN 'users.json#json:users' u ON c.user_id = u.id"

# Interactive shell with full TUI
$ sqlstream shell

Features

  • πŸš€ Pure Python - No database installation required
  • πŸ“Š Multiple Formats - CSV, Parquet, JSON, JSONL, HTML, Markdown, XML with nested path syntax, HTTP URLs, S3 buckets
  • ⚑ 100x+ Faster - DuckDB backend for complex SQL, Pandas backend for simple queries
  • πŸ”— JOIN Support - INNER, LEFT, RIGHT, FULL OUTER joins across different file formats
  • πŸ“ˆ Aggregations - GROUP BY with COUNT, SUM, AVG, MIN, MAX, DISTINCT
  • πŸ”’ Rich Type System - 10 data types (INTEGER, FLOAT, DECIMAL, STRING, JSON, BOOLEAN, DATE, TIME, DATETIME, NULL) with automatic inference
  • ☁️ S3 Support - Query files directly from Amazon S3 with partition support
  • 🎨 Beautiful Output - Rich tables, JSON, CSV, Parquet, Markdown formatting
  • πŸ–₯️ Advanced Interactive Shell - Multiple tabs, sidebars, layout cycling, backend toggle, state persistence, file browser
  • πŸ” Smart Optimizations - Column pruning, predicate pushdown, limit pushdown, lazy evaluation
  • 🌐 REST API Ready - Query HTTP endpoints and APIs (coming soon)
  • πŸ“¦ Lightweight - Minimal dependencies, works everywhere

Installation

Using uv (recommended)

Basic (CSV only):

uv tool install sqlstream

All features:

uv tool install "sqlstream[all]"

Multiple Sub Dependencies:

uv tool install "sqlstream[interactive,pandas,s3,http,html,duckdb]"

Optional Dependencies:

  • pandas - Pandas backend for 10-100x speedup
  • duckdb - DuckDB backend for 100x+ speedup and advanced SQL
  • parquet - Parquet file support
  • s3 - Amazon S3 file access
  • http - HTTP/HTTPS data sources
  • html - HTML table extraction (requires pandas, html5lib, beautifulsoup4)
  • interactive - Interactive shell with rich TUI
  • all - All features combined

Using pip

Basic (CSV only):

pip install sqlstream

All features:

pip install "sqlstream[all]"

See Installation Guide for more options.

Quick Start

CLI Usage

# Simple query
$ sqlstream query data.csv "SELECT name, salary FROM data WHERE salary > 80000"

# With pandas backend for performance
$ sqlstream query data.csv "SELECT * FROM data" --backend pandas

# JSON output
$ sqlstream query data.csv "SELECT * FROM data" --format json

# Interactive shell with TUI
$ sqlstream shell data.csv

Interactive Shell

$ sqlstream shell

Enhanced TUI Features:

Query Management:

  • Multiple Tabs (Ctrl+T/Ctrl+W) - Work on multiple queries simultaneously
  • State Persistence - Auto-save tabs, queries, and layout between sessions
  • Query History (Ctrl+Up/Down) - Navigate previous queries with multiline support
  • Auto-completion - Schema-aware suggestions for tables and columns
  • Syntax Highlighting - SQL syntax with Dracula theme

Sidebars & Layout:

  • Dynamic Sidebars (F2/F3) - Schema browser, File explorer, Filter, Export, Config
  • Layout Cycling (Ctrl+L) - Resize query editor: 50%, 60%, 70%, 80%, 100%
  • File Browser (Ctrl+O) - Tree view with directory navigation
  • Schema Browser - Real-time schema and type information

Execution & Performance:

  • Backend Toggle (F5 or Ctrl+B) - Cycle: Auto β†’ DuckDB β†’ Pandas β†’ Python
  • Execution Plan (F4) - View query optimization steps
  • Async Execution - Responsive UI during long queries
  • Cancel Queries (Ctrl+C) - Stop running queries

Results Management:

  • Advanced Filtering (Ctrl+F) - Column-specific or global search
  • Smart Export (Ctrl+X) - CSV, JSON, or Parquet with format selection
  • Pagination - 100 rows per page, configurable
  • Column Sorting - Click headers to sort
  • Live Stats - Row counts and filter status

Keyboard Shortcuts:

  • Ctrl+Delete/Ctrl+Backspace - Word-aware deletion
  • Ctrl+Q or Ctrl+D - Exit with auto-save

Python API

from sqlstream import query

# Execute query with explicit source
results = query("data.csv").sql("SELECT * FROM data WHERE age > 25")

# Execute query with inline source (extracted from SQL)
results = query().sql("SELECT * FROM 'data.csv' WHERE age > 25")

# Iterate over results
for row in results:
    print(row)

# Or convert to list
results_list = query().sql("SELECT * FROM 'data.csv'").to_list()

Documentation

Full documentation: https://subhayu99.github.io/sqlstream

Key sections:

Development Status

Current Phase: 9 (Enhanced Interactive Shell - Complete!)

  • βœ… Phase 0-2: Core query engine with Volcano model
  • βœ… Phase 3: Parquet support
  • βœ… Phase 4: Aggregations & GROUP BY
  • βœ… Phase 5: JOIN operations (INNER, LEFT, RIGHT)
  • βœ… Phase 5.5: Pandas backend (10-100x speedup)
  • βœ… Phase 6: HTTP data sources
  • βœ… Phase 7: CLI with beautiful output
  • βœ… Phase 7.5: Interactive mode with Textual
  • βœ… Phase 7.6: Inline file path support
  • βœ… Phase 8: Type system & schema inference
  • βœ… Phase 9: Enhanced interactive shell (multiple tabs, state persistence, file browser, query plan)
  • βœ… Phase 10: HTML & Markdown readers with table extraction
  • βœ… Phase 11: Enhanced type system (Decimal, DateTime, Date, Time, JSON)
  • 🚧 Phase 12: Comprehensive testing & documentation (15% coverage β†’ 80% target)

Test Coverage: 560 tests, 15% coverage (actively improving)

Performance

SQLStream offers three execution backends:

Backend Speed Use Case
Python Baseline Learning, small files (<100K rows)
Pandas 10-100x faster Basic queries, large files (>100K rows)
DuckDB 100x+ faster Complex SQL, analytics, huge files (10M+ rows)

Benchmark (1M rows):

  • Python backend: 52s
  • Pandas backend: 0.8s ⚑ 65x faster

Architecture

SQLStream uses the Volcano iterator model for query execution:

SQL Query β†’ Parser β†’ AST β†’ Planner β†’ Optimizer β†’ Executor β†’ Results
                                          ↓
                            (Column Pruning, Predicate Pushdown,
                             Lazy Evaluation)

Key concepts:

  • Lazy Evaluation: Rows are processed on-demand
  • Column Pruning: Only read columns that are used
  • Predicate Pushdown: Apply filters early to reduce data scanned
  • Three Backends: Pure Python (learning), Pandas (performance), and DuckDB (full SQL)

See Architecture Guide for details.

Contributing

Contributions are welcome! See Contributing Guide for details.

Development setup:

# Clone repository
git clone https://github.com/subhayu99/sqlstream.git
cd sqlstream

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
ruff format .
ruff check .

License

MIT License - see LICENSE for details.


Built with ❀️ by the SQLStream Team

πŸ“– Documentation β€’ πŸ› Issues β€’ πŸ’¬ Discussions

About

A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, and Markdown files with lazy evaluation and intelligent optimizations.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages