feat: add 9 advanced features with performance-optimized implementations #14

Merged
Sanady merged 3 commits into main from feat/advanced-features on Apr 2, 2026

@Sanady (Owner) commented Apr 2, 2026

Summary

Adds 9 major new feature modules to dataforge-py, all zero-runtime-dependency, with __slots__ on every class, lazy-loaded imports, and performance-optimized hot paths. Includes comprehensive test coverage (1870 tests pass, 9 skipped for optional TUI dependency).

New Feature Modules

Phase 1: Correlated/Conditional Fields (constraints.py + data/correlations/geo.py)

  • Constraint engine with dependency DAG and topological ordering
  • 6 constraint types: depends_on (geographic), temporal (before/after), correlate (statistical via Cholesky), conditional (value-dependent pools), range, and custom
  • Two-pass generation: independent columns batch (column-first), dependent columns row-by-row
  • Dict-based field specs in Schema: {"field": "address.city", "depends_on": "country"}
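
The dependency ordering described above can be sketched with the standard library's `graphlib`. This is only an illustration of the idea, not the code in constraints.py: the `specs` dict, the `generation_order` helper, and every field name other than `address.city` are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical field specs: plain strings are independent columns,
# dicts declare a dependency through "depends_on".
specs = {
    "country": "address.country",
    "city": {"field": "address.city", "depends_on": "country"},
    "postcode": {"field": "address.postcode", "depends_on": "city"},
    "email": "internet.email",
}


def generation_order(specs):
    """Split columns into (independent, dependent), dependents in topological order."""
    graph = {}
    for name, spec in specs.items():
        dep = spec.get("depends_on") if isinstance(spec, dict) else None
        graph[name] = {dep} if dep else set()
    # static_order() yields every node after all of its predecessors
    # and raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(graph).static_order())
    independent = [name for name in order if not graph[name]]
    dependent = [name for name in order if graph[name]]
    return independent, dependent


independent, dependent = generation_order(specs)
```

Under this split, the independent columns could be generated in a column-first batch, and the dependent columns filled row by row in the returned order, which mirrors the two-pass approach listed above.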

Phase 2A: Time-Series Generation (timeseries.py)

  • TimeSeriesSchema with trend, seasonality (sinusoidal), noise (Gaussian), anomaly injection, spiky patterns
  • Interval parsing ("1h", "30m", "1d"), regime changes, missing data gaps, range clamping
  • Column-first assembly with pre-computed boolean flags for maximum throughput
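
To make the trend + seasonality + noise composition concrete, here is a small stdlib-only sketch. It does not use `TimeSeriesSchema`; the function names `parse_interval` and `make_series` and all parameter names are assumptions chosen for illustration.

```python
import math
import random
import re
from datetime import datetime, timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_interval(text):
    """Turn strings such as "1h", "30m" or "1d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def make_series(start, interval, count, *, base=100.0, trend=0.5,
                amplitude=10.0, period=24, noise_std=1.0, seed=0):
    """Return (timestamp, value) pairs: linear trend + sinusoid + Gaussian noise."""
    rng = random.Random(seed)
    step = parse_interval(interval)
    rows = []
    for i in range(count):
        seasonal = amplitude * math.sin(2 * math.pi * i / period)
        value = base + trend * i + seasonal + rng.gauss(0.0, noise_std)
        rows.append((start + i * step, value))
    return rows


series = make_series(datetime(2026, 1, 1), "1h", 48)
```

Anomaly injection, regime changes, and gaps would layer on top of this base signal; they are left out to keep the sketch short.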

Phase 2B: Data Quality/Chaos Mode (chaos.py)

  • ChaosTransformer with configurable injection: nulls, type mismatches, boundary values, duplicates, whitespace, encoding chaos, format inconsistency, truncation
  • Integrates as post-processing step in Schema via chaos= parameter
  • Pre-checks active transforms to skip no-op paths
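
A rough sketch of the post-processing idea, including the pre-check that skips the no-op path. This is not the `ChaosTransformer` API; `inject_chaos` and its parameters are hypothetical names, and only two of the listed transforms are shown.

```python
import random


def inject_chaos(rows, *, null_rate=0.0, whitespace_rate=0.0, seed=0):
    """Return rows with some values nulled or padded with whitespace."""
    # Pre-check: with no active transform, return the input untouched
    # so the standard path pays nothing.
    if null_rate <= 0 and whitespace_rate <= 0:
        return rows
    rng = random.Random(seed)
    out = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if null_rate > 0 and rng.random() < null_rate:
                value = None
            elif (whitespace_rate > 0 and isinstance(value, str)
                    and rng.random() < whitespace_rate):
                value = f"  {value} "
            new_row[key] = value
        out.append(new_row)
    return out


clean = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 41}]
dirty = inject_chaos(clean, null_rate=0.1, whitespace_rate=0.2, seed=42)
```

The input rows are copied rather than mutated, so the clean data stays available for comparison in tests.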

Phase 3A: Schema Inference (inference.py)

  • SchemaInferrer: analyze CSV/DataFrame/database/list-of-dicts to auto-create matching Schema
  • Pipeline: type detection, semantic type detection (regex for email/phone/UUID/etc.), distribution analysis, null rate, Schema building
  • Cached field aliases and pre-compiled regex for numeric detection
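
The detection step of such a pipeline might look roughly like the following. The patterns, the 0.9 threshold, and the `infer_column` helper are assumptions for illustration and are not taken from inference.py.

```python
import re

# Pre-compiled once at import time; the patterns are deliberately loose.
_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "uuid": re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"),
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?"),
}


def infer_column(values, threshold=0.9):
    """Guess a semantic type and null rate for one column of raw values."""
    non_null = [v for v in values if v not in (None, "")]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0
    detected = "string"  # fallback when no pattern matches often enough
    for name, pattern in _PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.fullmatch(str(v)))
        if non_null and hits / len(non_null) >= threshold:
            detected = name
            break
    return {"type": detected, "null_rate": null_rate}


profile = infer_column(["a@b.co", "c@d.org", None, "e@f.io"])
```

A full inferrer would then map each detected type to a provider field and carry the null rate into the generated Schema.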

Phase 3B: Data Anonymization (anonymizer.py)

  • Deterministic PII replacement using HMAC-SHA256 derived seeds
  • Value cache for consistent mapping (same input to same output across tables)
  • Format-preserving anonymization for emails, phones; streaming CSV support
  • RNG swap optimization instead of forge.copy() per unique value
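
As an illustration of HMAC-derived seeding with a value cache, consider the sketch below. The class name, the method, and the choice to preserve only the local-part length are assumptions for the example; the real anonymizer.py may differ in all of these.

```python
import hashlib
import hmac
import random
import string


class DeterministicAnonymizer:
    """Map each input to a stable fake value derived from a secret key."""

    __slots__ = ("_key", "_cache")

    def __init__(self, secret: bytes):
        self._key = secret
        self._cache = {}

    def _rng_for(self, kind, value):
        # HMAC-SHA256 over "kind:value" gives a key-dependent digest;
        # its first 8 bytes seed a private RNG for this value only.
        message = f"{kind}:{value}".encode("utf-8")
        digest = hmac.new(self._key, message, hashlib.sha256).digest()
        return random.Random(int.from_bytes(digest[:8], "big"))

    def email(self, value):
        cached = self._cache.get(("email", value))
        if cached is not None:
            return cached
        rng = self._rng_for("email", value)
        local = value.partition("@")[0]
        # Keep the local-part length, replace the domain with example.com.
        fake_local = "".join(
            rng.choice(string.ascii_lowercase) for _ in range(max(len(local), 1))
        )
        result = f"{fake_local}@example.com"
        self._cache[("email", value)] = result
        return result


anonymizer = DeterministicAnonymizer(b"demo-secret")
masked = anonymizer.email("alice@corp.com")
```

Because the seed depends only on the secret and the input, two separate instances sharing a secret produce the same replacement, which is what keeps joins across tables intact.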

Phase 4A: Database Seeding (seeder.py)

  • DatabaseSeeder using SQLAlchemy (optional dep) with table introspection
  • Batched INSERTs with dialect-specific optimizations (PostgreSQL COPY, MySQL FK checks, SQLite pragmas)
  • Auto-schema from table structure using heuristic mappings
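
The batching pattern can be shown without SQLAlchemy. The sketch below swaps in the stdlib `sqlite3` module, so it illustrates the technique (introspect columns, insert in fixed-size batches, one transaction per batch) rather than `DatabaseSeeder` itself; `seed_table` and `batched` are hypothetical helpers.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def seed_table(conn, table, columns, rows, batch_size=1000):
    """Insert rows in batches, committing once per batch."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    total = 0
    for chunk in batched(rows, batch_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany(sql, chunk)
        total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA synchronous = OFF")  # trades durability for insert speed
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Introspection: PRAGMA table_info returns one row per column, name at index 1.
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
inserted = seed_table(conn, "users", columns,
                      ((i, f"user{i}") for i in range(2500)))
```

The table and column names are interpolated into the SQL string, so in real use they should come from introspection or a trusted allow-list, never from user input.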

Phase 4B: OpenAPI/JSON Schema Import (openapi.py)

  • OpenAPIParser with $ref resolution, type mapping (string+format to provider field)
  • Handles enum, pattern, numeric ranges, arrays, nested objects
  • New regexify() method in backend.py for pattern-based generation
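
A simplified view of local `$ref` resolution and type mapping, assuming only in-document references of the form `#/a/b/c`. The helper names and the provider field names in `FORMAT_TO_FIELD` are hypothetical and are not the mapping used by `OpenAPIParser`.

```python
def resolve_ref(document, ref):
    """Follow a local JSON Pointer reference such as '#/components/schemas/User'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are supported: {ref!r}")
    node = document
    for part in ref[2:].split("/"):
        # JSON Pointer escapes: "~1" means "/", "~0" means "~".
        node = node[part.replace("~1", "/").replace("~0", "~")]
    return node


def resolve(document, schema, _seen=()):
    """Return a copy of schema with nested $ref entries inlined."""
    if "$ref" in schema:
        ref = schema["$ref"]
        if ref in _seen:  # this sketch rejects recursive schemas
            raise ValueError(f"circular reference: {ref}")
        return resolve(document, resolve_ref(document, ref), _seen + (ref,))
    out = dict(schema)
    if "properties" in out:
        out["properties"] = {
            name: resolve(document, sub, _seen)
            for name, sub in out["properties"].items()
        }
    if "items" in out:
        out["items"] = resolve(document, out["items"], _seen)
    return out


# Hypothetical (type, format) -> provider field mapping.
FORMAT_TO_FIELD = {
    ("string", "email"): "internet.email",
    ("string", "uuid"): "misc.uuid4",
}


def field_for(schema):
    return FORMAT_TO_FIELD.get((schema.get("type"), schema.get("format")))


doc = {"components": {"schemas": {
    "Address": {"type": "object", "properties": {"city": {"type": "string"}}},
    "User": {"type": "object", "properties": {
        "email": {"type": "string", "format": "email"},
        "address": {"$ref": "#/components/schemas/Address"},
    }},
}}}
user = resolve(doc, {"$ref": "#/components/schemas/User"})
```

After resolution, `user` contains the Address schema inline, so a generator can walk it without consulting the document again.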

Phase 4C: Streaming to Message Queues (streaming.py)

  • HttpEmitter (stdlib urllib, zero-dep), KafkaEmitter (optional), RabbitMQEmitter (optional)
  • TokenBucketRateLimiter using time.monotonic()
  • Batch emission via schema.generate(count=chunk) for throughput
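
For readers unfamiliar with the technique, a token bucket driven by `time.monotonic()` can be sketched as follows. This is a generic version, not the `TokenBucketRateLimiter` from streaming.py; the class name, the `try_acquire` method, and the injectable `clock` parameter are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    __slots__ = ("rate", "capacity", "_tokens", "_last", "_clock")

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False instead of blocking."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


bucket = TokenBucket(rate=100, capacity=10)
allowed = bucket.try_acquire(10)  # the initial burst fits in a full bucket
```

A monotonic clock is used because it never jumps backwards when the wall clock is adjusted, so the elapsed time stays non-negative. When emitting in batches, a caller would request `n` equal to the chunk size and sleep briefly whenever `try_acquire` returns False.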

Phase 5: TUI Schema Builder (tui/)

  • Textual-based interactive app (optional dep) with provider browser, schema builder, live preview, export
  • CLI integration via dataforge --tui

Integration Updates

  • schema.py: chaos= parameter, constraint engine, stream_to/stream_to_http/stream_to_kafka methods, None sentinel optimization for standard path
  • core.py: chaos=, timeseries(), infer_schema(), infer_schema_from_csv()
  • backend.py: regexify() with character classes, quantifiers, alternation, escape sequences
  • cli.py: --tui/--infer/--anonymize/--chaos args
  • pyproject.toml: Optional extras (kafka, rabbitmq, tui, db, all)

Performance (vs main branch, same machine)

| Metric | Main | Feature Branch | Change |
|---|---|---|---|
| Startup (init) | 4.5 us | 4.0 us | -12% (faster) |
| Startup (first field) | 50.5 us | 44.5 us | -12% (faster) |
| Schema generate (100K) | 357,521 rows/s | 376,918 rows/s | +5.4% |
| Schema to_csv (10K) | 282,779 rows/s | 333,500 rows/s | +17.9% |
| Schema stream_jsonl (10K) | 179,469 rows/s | 199,710 rows/s | +11.3% |
| Scalar calls | -- | -- | 0 regressions (171/179 improved) |

Test Coverage

  • 9 new test files covering all feature modules
  • 1870 tests pass, 9 skipped (TUI tests -- textual not installed)
  • 0 failures

Files Changed

  • 29 files added/modified, ~7,300 insertions
  • Zero provider files modified -- all new features are additive
  • All new module imports are lazy (no startup impact)

Sanady added 2 commits April 1, 2026 23:56
github.actor reflects the last pusher, not the PR author. When a
human rebases a Dependabot branch, actor changes and the skip
no longer matches. Use github.event.pull_request.user.login
which always returns the PR author regardless of who pushed.

Add constraint engine, time-series, chaos mode, schema
inference, anonymizer, database seeder, OpenAPI parser,
streaming emitters, and TUI builder.

All modules use __slots__, lazy imports, and zero runtime
deps. Includes comprehensive test suite (1870 passed) and
benchmarks showing 5-18% schema speedup, 12% faster
startup, zero scalar regressions.
@Sanady force-pushed the feat/advanced-features branch from e84e965 to a080f05 on April 2, 2026 15:58
Add 10 example files covering time-series, schema inference,
chaos testing, constraints, PII anonymization, database seeding,
OpenAPI import, streaming, TUI, and combined real-world scenarios.

Update README with documentation for all 9 advanced features,
updated benchmarks (343K rows/s schema generation), new ToC
entries, installation instructions with optional extras, and
examples directory listing.
@Sanady force-pushed the feat/advanced-features branch from 048cdb4 to c995c80 on April 2, 2026 16:31
@Sanady Sanady merged commit d9d8a51 into main Apr 2, 2026
6 checks passed
@Sanady Sanady deleted the feat/advanced-features branch April 2, 2026 16:32