Project: Scrybe - Browser Behavior Intelligence System
Style: TigerStyle (following TigerBeetle principles)
Version: 0.2.0
Status: Post-Review Refinement → Ready for Implementation
Last Updated: 2025-01-22
All RFCs have been refined following a multi-disciplinary review.
Critical blockers addressed:
- ✅ API authentication (HMAC signatures)
- ✅ Replay attack prevention (nonce validation)
- ✅ DoS protection (bounded collections)
- ✅ GDPR compliance (cookie consent)
- ✅ Graceful shutdown
- ✅ Secrets management
See CHANGELOG-v0.2.0.md for complete details.
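Two of the blockers above, replay prevention and bounded collections, can be illustrated together. The following is a minimal sketch, not the design from the RFCs: `NonceCache` and its methods are hypothetical names, and it assumes nonces arrive as strings and that the oldest entry may be evicted once the cache is full.

```rust
use std::collections::{HashSet, VecDeque};

/// Bounded replay guard that remembers at most `capacity` recent nonces.
/// Hypothetical type for illustration; not taken from the RFCs.
pub struct NonceCache {
    capacity: usize,
    seen: HashSet<String>,
    order: VecDeque<String>, // insertion order, oldest at the front
}

impl NonceCache {
    /// Returns `None` for a zero capacity, since such a cache could
    /// never remember a nonce.
    pub fn new(capacity: usize) -> Option<Self> {
        if capacity == 0 {
            return None;
        }
        Some(NonceCache {
            capacity,
            seen: HashSet::with_capacity(capacity),
            order: VecDeque::with_capacity(capacity),
        })
    }

    /// Returns `true` if the nonce is fresh (and records it), or `false`
    /// if it is still in the cache, which indicates a possible replay.
    pub fn check_and_insert(&mut self, nonce: &str) -> bool {
        if self.seen.contains(nonce) {
            return false;
        }
        // Evict the oldest nonce so memory use stays bounded.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(nonce.to_owned());
        self.order.push_back(nonce.to_owned());
        true
    }

    /// Number of nonces currently remembered.
    pub fn len(&self) -> usize {
        self.order.len()
    }
}
```

Because eviction forgets old nonces, a sketch like this would need to be paired with a timestamp check so that an evicted nonce is rejected for being too old rather than accepted as fresh.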
Scrybe is a high-fidelity browser behavior intelligence system designed to detect and understand automation with forensic granularity. It acts as both a data collector and behavior profiler, enabling real-time bot detection, historical session analysis, and ML model training.
- JavaScript SDK - Browser-side signal collection agent
- Rust Ingestion Gateway - High-performance HTTP API for receiving telemetry
- Enrichment Pipeline - Fingerprinting, geo-resolution, anomaly detection
- ClickHouse Storage - Analytical database for session telemetry
- Redis Cache - Fast session correlation and rate limiting
- Analyst UI - React dashboard for visualization and analysis
File: RFC-0001-architecture.md (10KB)
Status: Complete
Created: 2025-01-22
Defines the overall system architecture for Scrybe:
- System design: Multi-layer signal collection (network, browser, behavioral)
- Module structure: Rust crates organization (core, ingestion, enrichment, storage, cache)
- Data flow: Browser → Ingestion → Enrichment → Storage
- Core types: Session, NetworkSignals, BrowserSignals, BehavioralSignals, Fingerprint
- Performance targets: < 5ms ingestion, < 50ms enrichment
- Dependencies: Axum, ClickHouse, Redis, serde, chrono
- TigerStyle compliance: Safety, simplicity, correctness, performance
Key Decisions:
- Rust for performance and safety
- ClickHouse for analytical queries
- Redis for real-time session cache
- JavaScript SDK for client-side collection
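The core types named above might be laid out roughly as follows. This is only a sketch: the type names come from the RFC summary, but every field is an assumption drawn from signals mentioned elsewhere in this document (IP, JA3, canvas hash, WebGL vendor and renderer, mouse entropy, scroll smoothness), and the real definitions in RFC-0001 may differ.

```rust
use std::net::IpAddr;

/// Server-side signals. Fields are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct NetworkSignals {
    pub ip: IpAddr,
    pub ja3: Option<String>, // absent when TLS metadata is unavailable
}

/// Browser-side fingerprint inputs. Fields are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct BrowserSignals {
    pub canvas_hash: String,
    pub webgl_vendor: String,
    pub webgl_renderer: String,
}

/// Behavioral measurements. Fields are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct BehavioralSignals {
    pub mouse_entropy: f64,
    pub scroll_smoothness: f64,
}

/// A fingerprint as a fixed 32-byte digest (the SHA-256 output size).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Fingerprint(pub [u8; 32]);

impl Fingerprint {
    /// Lowercase hex rendering, 64 characters long.
    pub fn to_hex(&self) -> String {
        self.0.iter().map(|b| format!("{:02x}", b)).collect()
    }
}

/// One collected session. The fingerprint is optional because, in this
/// sketch, it is only filled in by the enrichment stage.
#[derive(Debug, Clone, PartialEq)]
pub struct Session {
    pub session_id: String,
    pub network: NetworkSignals,
    pub browser: BrowserSignals,
    pub behavioral: BehavioralSignals,
    pub fingerprint: Option<Fingerprint>,
}
```

Using a fixed-size array for the digest, rather than a `String`, is one way to make a wrongly sized fingerprint unrepresentable.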
File: RFC-0002-javascript-sdk.md (15KB)
Status: Complete
Created: 2025-01-22
Defines the browser-side collection agent:
- Signal categories: Network, browser, behavioral
- Canvas fingerprinting: SHA-256 hash of canvas rendering
- WebGL fingerprinting: Vendor, renderer, extensions
- Audio fingerprinting: Audio context signature
- Behavioral collection: Mouse entropy, scroll smoothness, timing patterns
- Transport: Beacon API (primary), Fetch API (fallback)
- Privacy safeguards: No PII, no input values, opt-out support
- Performance: < 20ms initialization, < 30KB bundle size
- Browser support: Chrome 90+, Firefox 88+, Safari 14+
Key Features:
- Non-blocking async collection
- Comprehensive signal coverage
- Privacy-aware (no PII tracking)
- Lightweight and fast
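"Mouse entropy" is not defined in this summary. One plausible reading is the Shannon entropy of pointer movement directions, sketched below in Rust to match the rest of the workspace (the SDK itself would need an equivalent in TypeScript). The function name, the `(dx, dy)` input shape, and the bucketing scheme are all assumptions.

```rust
use std::f64::consts::PI;

/// Shannon entropy (in bits) of movement directions, bucketed into
/// `buckets` equal angular sectors. `deltas` holds (dx, dy) pointer
/// movements; zero-length movements are ignored.
/// Hypothetical helper; the RFC may define mouse entropy differently.
pub fn direction_entropy(deltas: &[(f64, f64)], buckets: usize) -> f64 {
    if buckets == 0 {
        return 0.0;
    }
    let mut counts = vec![0usize; buckets];
    let mut total = 0usize;
    for &(dx, dy) in deltas {
        if dx == 0.0 && dy == 0.0 {
            continue;
        }
        // atan2 yields [-PI, PI]; map it to a fraction of a full turn.
        let turn = (dy.atan2(dx) + PI) / (2.0 * PI);
        // The modulo wraps the angle PI into the same sector as -PI.
        let idx = ((turn * buckets as f64) as usize) % buckets;
        counts[idx] += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}
```

Under this definition a perfectly straight scripted movement scores 0 bits, while movement spread evenly over `buckets` sectors scores `log2(buckets)` bits.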
File: RFC-0003-ingestion-gateway.md (14KB)
Status: Complete
Created: 2025-01-22
Defines the HTTP ingestion service built with Axum:
- API endpoint: POST /api/v1/ingest
- Server-side signals: IP, TLS (JA3/JA4), HTTP headers
- Middleware stack: Tracing, CORS, rate limiting, request ID
- Validation: Payload schema, timestamp checks, size limits
- Error handling: Type-safe errors, appropriate status codes
- Rate limiting: 100 req/min per IP, 1000 req/min per session
- Performance: < 5ms latency (p99), 10k req/sec throughput
- Observability: Structured logging, Prometheus metrics
Key Technologies:
- Axum (web framework)
- Tower (middleware)
- tokio (async runtime)
- redis (session cache)
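The validation and error-handling bullets above could translate into something like the following std-only sketch. It deliberately avoids Axum types so it stands alone. The limits `MAX_PAYLOAD_BYTES` and `MAX_CLOCK_SKEW_MS`, the `IngestError` variants, and the `validate` function are hypothetical and not taken from RFC-0003.

```rust
use std::fmt;

/// Assumed limits for illustration; the RFC may specify other values.
pub const MAX_PAYLOAD_BYTES: usize = 64 * 1024;
pub const MAX_CLOCK_SKEW_MS: u64 = 5 * 60 * 1000;

/// Type-safe ingestion errors, each mapped to one HTTP status code.
#[derive(Debug, PartialEq, Eq)]
pub enum IngestError {
    PayloadTooLarge { size: usize },
    TimestampOutOfRange { skew_ms: u64 },
    RateLimited,
}

impl IngestError {
    /// HTTP status code a gateway could return for this error.
    pub fn status_code(&self) -> u16 {
        match self {
            IngestError::PayloadTooLarge { .. } => 413,
            IngestError::TimestampOutOfRange { .. } => 400,
            IngestError::RateLimited => 429,
        }
    }
}

impl fmt::Display for IngestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            IngestError::PayloadTooLarge { size } => {
                write!(f, "payload of {} bytes exceeds the limit", size)
            }
            IngestError::TimestampOutOfRange { skew_ms } => {
                write!(f, "client clock differs from server by {} ms", skew_ms)
            }
            IngestError::RateLimited => write!(f, "rate limit exceeded"),
        }
    }
}

impl std::error::Error for IngestError {}

/// Checks the size limit and the timestamp window before any parsing.
pub fn validate(
    payload_len: usize,
    client_ts_ms: u64,
    server_ts_ms: u64,
) -> Result<(), IngestError> {
    if payload_len > MAX_PAYLOAD_BYTES {
        return Err(IngestError::PayloadTooLarge { size: payload_len });
    }
    let skew_ms = client_ts_ms.abs_diff(server_ts_ms);
    if skew_ms > MAX_CLOCK_SKEW_MS {
        return Err(IngestError::TimestampOutOfRange { skew_ms });
    }
    Ok(())
}
```

Keeping the status mapping in one `match` means the compiler flags any new error variant that lacks a status code.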
File: RFC-0004-fingerprinting-enrichment.md (17KB)
Status: Complete
Created: 2025-01-22
Defines the enrichment pipeline:
- Composite fingerprinting: Deterministic device identification (SHA-256)
- Component hashes: Network, browser, behavioral, device
- Geo/ASN resolution: MaxMind GeoIP2 integration
- MinHash similarity: Fingerprint clustering via LSH
- Anomaly detection: Behavioral, timing, header, fingerprint anomalies
- Bot probability scoring: Weighted combination of anomaly scores
- Pipeline stages: Fingerprint → Geo → Similarity → Anomaly → Assembly
- Performance: < 50ms enrichment (p99)
- Caching: GeoIP lookups cached in memory
Key Algorithms:
- SHA-256 for deterministic hashing
- MinHash for similarity detection
- Jaccard similarity for clustering
- ML-based anomaly detection
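To show how MinHash approximates Jaccard similarity between two fingerprint token sets, here is a small std-only sketch. The function names are hypothetical. It uses `DefaultHasher` seeded with an index as a stand-in for a family of hash functions, which is an assumption made for brevity; its output is not guaranteed to be stable across Rust releases, so a real pipeline would need a fixed, documented hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Builds a MinHash signature: for each of `num_hashes` seeded hash
/// functions, keep the minimum hash value over all tokens.
pub fn minhash_signature(tokens: &[&str], num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            tokens
                .iter()
                .map(|token| {
                    let mut hasher = DefaultHasher::new();
                    seed.hash(&mut hasher); // the seed selects the hash function
                    token.hash(&mut hasher);
                    hasher.finish()
                })
                .min()
                .unwrap_or(u64::MAX) // empty token set
        })
        .collect()
}

/// Estimates Jaccard similarity as the fraction of signature positions
/// that agree. Returns 0.0 for empty or mismatched signatures.
pub fn estimated_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

The signature depends only on the set of tokens and not on their order, so two sessions reporting the same attributes in a different order compare as identical.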
File: RFC-0005-storage-schema.md (16KB)
Status: Complete
Created: 2025-01-22
Defines the ClickHouse schema and queries:
- Main table: sessions (partitioned by date, TTL 90 days)
- Primary key: (timestamp, session_id)
- Indexes: Bloom filter (fingerprint), token bloom (IP), minmax (bot_probability)
- Materialized views: hourly_stats, fingerprint_clusters, anomaly_events
- Rust integration: clickhouse-rs client, batch writes
- Query patterns: Time-range scans, fingerprint lookups, anomaly aggregations
- Performance: 100k writes/sec, < 100ms queries (p99)
- Compression: > 50:1 ratio with zstd
Key Features:
- Columnar storage for fast analytics
- Automatic TTL-based cleanup
- Materialized views for common queries
- High-cardinality optimization
File: RFC-0006-session-management.md (14KB)
Status: Complete
Created: 2025-01-22
Defines Redis-based session cache:
- Data structures: Hash (session metadata), Set (fingerprint → sessions), String (rate limits)
- Key naming: session:{uuid}, fingerprint:{hash}, ratelimit:{ip}:{window}
- TTL strategy: 24 hours for sessions, 60 seconds for rate limits
- Session correlation: Fingerprint + IP matching
- Rate limiting: Per-IP counters with sliding windows
- Anomaly feed: Sorted sets for real-time alerts
- Performance: < 1ms latency (p99)
- Connection pooling: deadpool-redis
Key Operations:
- Fast session lookups
- Real-time fingerprint correlation
- Rate limit enforcement
- Anomaly event streaming
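The `ratelimit:{ip}:{window}` key pattern suggests counters bucketed by time window. The sketch below builds such keys and counts requests in a `HashMap` that stands in for Redis. It is a fixed-window approximation, not the sliding-window scheme the RFC describes. The constants reuse the 60-second TTL and the 100 req/min per-IP limit stated in this document; the names `rate_limit_key` and `InMemoryLimiter` are hypothetical.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

pub const WINDOW_SECS: u64 = 60;
pub const LIMIT_PER_IP: u32 = 100;

/// Builds a key of the form `ratelimit:{ip}:{window}`, where the window
/// is the index of the current 60-second bucket.
pub fn rate_limit_key(ip: IpAddr, now_secs: u64) -> String {
    format!("ratelimit:{}:{}", ip, now_secs / WINDOW_SECS)
}

/// In-memory stand-in for the Redis counters. Unlike Redis keys with a
/// TTL, entries here are never expired, so this is for illustration only.
#[derive(Default)]
pub struct InMemoryLimiter {
    counters: HashMap<String, u32>,
}

impl InMemoryLimiter {
    /// Counts the request and reports whether it is within the limit.
    pub fn allow(&mut self, ip: IpAddr, now_secs: u64) -> bool {
        let count = self
            .counters
            .entry(rate_limit_key(ip, now_secs))
            .or_insert(0);
        *count = count.saturating_add(1);
        *count <= LIMIT_PER_IP
    }
}
```

Against a real Redis instance the same logic would typically be an increment on the key followed by setting its expiry when the counter is first created.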
File: RFC-0007-security-privacy.md (12KB)
Status: Complete
Created: 2025-01-22
Defines all security and privacy safeguards:
- PII protection: Never collect input values, form data, personal info
- Opt-out mechanisms: DNT header, JavaScript flag, meta tag, localStorage
- Data minimization: Only collect behavioral patterns, not identities
- Cryptographic hashing: Salted SHA-256 for all identifiers
- Salt rotation: Every 30 days (breaks long-term tracking)
- Data retention: Auto-delete after 90 days (ClickHouse TTL)
- GDPR compliance: Right to access, erasure, portability
- Transport security: TLS required, secure cookies
- Rate limiting: Multi-layer (IP, session, global)
- Input validation: Strict schema validation, size limits
- Audit logging: Security events tracked
Privacy Principles:
- Observe behavior, not identity
- No PII collection
- Transparent data practices
- User control (opt-out)
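The 30-day salt rotation and 90-day retention rules reduce to simple clock arithmetic. The sketch below assumes Unix timestamps in seconds and fixed-length periods counted from the epoch; the function names are hypothetical, and the RFC may anchor rotation differently.

```rust
const SECS_PER_DAY: u64 = 24 * 60 * 60;

/// Salt rotation period from this document: 30 days.
pub const ROTATION_PERIOD_SECS: u64 = 30 * SECS_PER_DAY;
/// Retention period from this document: 90 days.
pub const RETENTION_SECS: u64 = 90 * SECS_PER_DAY;

/// Index of the salt in effect at `unix_secs`. Identifiers hashed in
/// different epochs use different salts and therefore do not match.
pub fn salt_epoch(unix_secs: u64) -> u64 {
    unix_secs / ROTATION_PERIOD_SECS
}

/// Whether two events can be linked through a salted identifier.
pub fn linkable(a_secs: u64, b_secs: u64) -> bool {
    salt_epoch(a_secs) == salt_epoch(b_secs)
}

/// Whether a record created at `created_secs` is past retention at
/// `now_secs`. A clock that runs backwards never expires a record.
pub fn is_expired(created_secs: u64, now_secs: u64) -> bool {
    now_secs.saturating_sub(created_secs) >= RETENTION_SECS
}
```

With this scheme two visits one second apart can fall into different epochs if they straddle a rotation boundary, which is the intended cost of preventing long-term tracking.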
Goal: Working end-to-end pipeline
- Set up Rust workspace
- Implement core types (Session, Fingerprint, etc.)
- Build ingestion gateway (Axum)
- Redis cache integration
- ClickHouse schema setup
Deliverables:
- Ingestion API accepting session data
- Redis session cache working
- ClickHouse writes successful
Goal: Fingerprinting and anomaly detection
- Implement composite fingerprinting
- GeoIP integration (MaxMind)
- MinHash similarity engine
- Anomaly detection algorithms
- Enrichment pipeline executor
Deliverables:
- Deterministic fingerprints generated
- Geo/ASN enrichment working
- Anomaly scores calculated
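RFC-0004 describes bot probability as a weighted combination of anomaly scores. A normalized weighted average is one simple way to realize that, sketched here under stated assumptions: the four score categories are the ones listed in the RFC summary, but the `PerSignal` struct, the clamping to [0, 1], and the rejection of invalid input are illustrative choices.

```rust
/// One value per anomaly category; used for both scores and weights.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PerSignal {
    pub behavioral: f64,
    pub timing: f64,
    pub header: f64,
    pub fingerprint: f64,
}

/// Weighted average of anomaly scores, normalized by the weight total so
/// the result stays in [0, 1]. Returns `None` for non-finite input,
/// negative weights, or an all-zero weight vector.
pub fn bot_probability(scores: &PerSignal, weights: &PerSignal) -> Option<f64> {
    let pairs = [
        (scores.behavioral, weights.behavioral),
        (scores.timing, weights.timing),
        (scores.header, weights.header),
        (scores.fingerprint, weights.fingerprint),
    ];
    let mut weighted_sum = 0.0;
    let mut weight_total = 0.0;
    for &(score, weight) in pairs.iter() {
        if !score.is_finite() || !weight.is_finite() || weight < 0.0 {
            return None;
        }
        // Scores outside [0, 1] are clamped rather than rejected.
        weighted_sum += score.clamp(0.0, 1.0) * weight;
        weight_total += weight;
    }
    if weight_total > 0.0 {
        Some(weighted_sum / weight_total)
    } else {
        None
    }
}
```

For example, a behavioral score of 1.0 with weight 2 and zero scores elsewhere with weights 1, 1, and 0 gives 2 / 4 = 0.5.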
Goal: Browser signal collection
- Canvas fingerprinting
- WebGL fingerprinting
- Audio fingerprinting
- Behavioral collectors (mouse, scroll)
- Transport layer (Beacon API)
- Bundle and optimize (< 30KB)
Deliverables:
- Working JavaScript SDK
- Signal collection functional
- Privacy safeguards implemented
Goal: Analyst dashboard and queries
- React UI setup
- ClickHouse query interface
- Session visualization
- Fingerprint comparison
- Anomaly filtering
- Real-time event feed
Deliverables:
- Functional analyst dashboard
- Common queries optimized
- Visualization components
Goal: Production-ready deployment
- Load testing (10k req/sec)
- Security audit
- Performance optimization
- Monitoring and alerting
- Documentation
- Deployment automation
Deliverables:
- Performance benchmarks met
- Security audit passed
- Monitoring in place
- Deployment scripts
All code must adhere to TigerStyle principles:
- No panic! in production code
- No .unwrap() or .expect() (except in tests)
- No unsafe (or explicit justification)
- All invariants validated at boundaries
- Explicit > implicit
- Clear > clever
- Minimal abstractions
- Boring solutions preferred
- Type-driven design
- Invalid states unrepresentable
- Comprehensive tests (> 90% coverage)
- Deterministic behavior
- Fast by default
- Benchmarked (not guessed)
- Zero-copy where possible
- Pre-allocation for known sizes
- Minimal, well-vetted crates
- Pinned versions
- Regular audits (cargo audit)
- No unnecessary features
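Several of these rules (no `.unwrap()`, invariants validated at boundaries, invalid states unrepresentable) can be shown with one small pattern: a newtype whose only constructor validates its input and returns a `Result`. The `BotProbability` and `ProbabilityError` names below are hypothetical examples, not types from the RFCs.

```rust
/// A probability guaranteed to be finite and within [0, 1].
/// The inner field is private, so `new` is the only way to build one.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct BotProbability(f64);

#[derive(Debug, Clone, PartialEq)]
pub enum ProbabilityError {
    Unparseable,
    NotFinite,
    OutOfRange(f64),
}

impl BotProbability {
    /// Validates the invariant once, at the boundary.
    pub fn new(value: f64) -> Result<Self, ProbabilityError> {
        if !value.is_finite() {
            return Err(ProbabilityError::NotFinite);
        }
        if !(0.0..=1.0).contains(&value) {
            return Err(ProbabilityError::OutOfRange(value));
        }
        Ok(BotProbability(value))
    }

    /// Returns the validated value.
    pub fn get(self) -> f64 {
        self.0
    }
}

/// Parses untrusted text, propagating failures with `?` instead of
/// calling `.unwrap()`.
pub fn parse_probability(raw: &str) -> Result<BotProbability, ProbabilityError> {
    let value: f64 = raw
        .trim()
        .parse()
        .map_err(|_| ProbabilityError::Unparseable)?;
    BotProbability::new(value)
}
```

Code that receives a `BotProbability` never needs to re-check the range, because no out-of-range value can be constructed outside this module.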
| Component | Metric | Target | Acceptable | Unacceptable |
|---|---|---|---|---|
| JS SDK | Init time | < 10ms | < 20ms | > 20ms |
| JS SDK | Bundle size | < 20KB | < 30KB | > 30KB |
| Ingestion | Latency (p99) | < 5ms | < 10ms | > 10ms |
| Ingestion | Throughput | 10k/s | 5k/s | < 1k/s |
| Enrichment | Latency (p99) | < 50ms | < 100ms | > 100ms |
| Redis | Latency (p99) | < 1ms | < 2ms | > 2ms |
| ClickHouse | Write (p99) | < 100ms | < 500ms | > 500ms |
| ClickHouse | Query (p99) | < 100ms | < 500ms | > 500ms |
- Each module > 90% coverage
- Fast (< 10s total)
- Deterministic (no flakes)
- Mock external services
- End-to-end flows
- Real Redis/ClickHouse (test containers)
- Cross-browser (Playwright)
- Performance benchmarks
- 10k req/sec sustained
- Measure p50, p99, p999
- Memory usage under load
- Graceful degradation
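Reporting p50, p99, and p999 requires a percentile definition. The sketch below uses the nearest-rank method with integer arithmetic, which keeps results deterministic. The function name, the per-mille parameter (500 for p50, 990 for p99, 999 for p999), and the microsecond unit are assumptions made for illustration.

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// `permille` is the percentile in tenths of a percent: 500 for p50,
/// 990 for p99, 999 for p999. Returns `None` for empty input or a
/// `permille` outside 1..=1000.
pub fn percentile_permille(samples_us: &[u64], permille: u32) -> Option<u64> {
    if samples_us.is_empty() || permille == 0 || permille > 1000 {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // 1-based rank: ceil(permille * n / 1000), always within 1..=n.
    let rank = (u64::from(permille) * n + 999) / 1000;
    sorted.get((rank - 1) as usize).copied()
}
```

With fewer than 1000 samples the p999 value is simply the maximum, so a load test needs enough requests for the high percentiles to be meaningful.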
- No PII collected
- Opt-out mechanisms tested
- TLS enforced (HTTPS only)
- Rate limiting functional
- Input validation comprehensive
- SQL injection tests passed
- XSS prevention verified
- Dependency audit clean (cargo audit)
- Security headers configured
- GDPR compliance verified
[workspace.dependencies]
# Core
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
chrono = { version = "0.4", features = ["serde"] }
uuid = { version = "1.6", features = ["v4", "serde"] }
# Web framework
axum = "0.7"
tokio = { version = "1", features = ["full"] }
tower = "0.4"
tower-http = { version = "0.5", features = ["trace", "cors"] }
# Storage
clickhouse = "0.11"
redis = { version = "0.24", features = ["tokio-comp"] }
deadpool-redis = "0.14"
# Cryptography
sha2 = "0.10"
blake3 = "1.5"
md5 = "0.7"
# GeoIP
maxminddb = "0.24"
# Telemetry
tracing = "0.1"
tracing-subscriber = "0.3"
# Error handling
thiserror = "1.0"
anyhow = "1.0"
# Testing
proptest = "1.4"
criterion = "0.5"
{
"dependencies": {},
"devDependencies": {
"typescript": "^5.3.0",
"esbuild": "^0.19.0",
"@playwright/test": "^1.40.0"
}
}
Ingestion:
- Requests per second
- Latency (p50, p95, p99)
- Error rate
- Payload size distribution
Enrichment:
- Enrichment time per stage
- GeoIP cache hit rate
- Fingerprint similarity matches
- Anomaly detection rate
Storage:
- ClickHouse write throughput
- Query latency
- Disk usage
- Compression ratio
Redis:
- Cache hit rate
- Memory usage
- Connection pool utilization
- Rate limit triggers
- Ingestion latency > 10ms (p99) → Page on-call
- Error rate > 1% → Alert Slack
- Disk usage > 80% → Alert ops
- Rate limit exceeded 10x → Investigate potential attack
- TigerStyle: https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md
- JA3/JA4: https://engineering.salesforce.com/tls-fingerprinting-with-ja3-and-ja3s-247362855ced/
- Cloudflare Bot Management: https://www.cloudflare.com/products/bot-management/
- FingerprintJS: https://github.com/fingerprintjs/fingerprintjs
- ClickHouse: https://clickhouse.com/docs/
- Redis: https://redis.io/documentation
- Axum: https://docs.rs/axum/latest/axum/
- GDPR: https://gdpr.eu/
- OWASP Top 10: https://owasp.org/www-project-top-ten/
- Vision: ../vision.md
- Development Rules: /home/ops/Project/mimicron/DEVELOPMENT_RULES.md
- Security Rules: Memory (security.md)
- Testing Rules: Memory (testing.md)
- TigerStyle Rules: Memory (tigerstyle.md)
| Date | Decision | Rationale |
|---|---|---|
| 2025-01-22 | Use TigerStyle principles | Safety, simplicity, correctness aligned with project needs |
| 2025-01-22 | Rust + Axum for ingestion | Performance, safety, excellent async ecosystem |
| 2025-01-22 | ClickHouse for analytics | Optimized for time-series, high-cardinality data |
| 2025-01-22 | Redis for session cache | < 1ms latency, simple key-value model |
| 2025-01-22 | MinHash for similarity | Fast approximate matching, scalable |
| 2025-01-22 | No PII collection | Privacy-first design, GDPR compliance |
| 2025-01-22 | 90-day retention | Balance between analysis needs and privacy |
| 2025-01-22 | Salted hashes rotated monthly | Prevent long-term tracking |
- ✅ Complete RFCs - All 7 RFCs written
- ⏳ Review & approval - Team review of RFC designs
- ⏳ Set up workspace - Initialize Rust workspace
- ⏳ Begin Phase 1 - Core infrastructure implementation
- ⏳ Iterate on design - Refine based on implementation learnings
Last Updated: 2025-01-22
Maintainer: Zuub Engineering
Status: Ready for implementation review