aryan105825/README.md

Aryan Rajput

AI Engineer · Rust inference · RAG pipelines · Multimodal AI · CPU-constrained LLM serving

I build the infrastructure layer of AI systems — not the wrappers around APIs, but the engines underneath them. My focus is local LLM serving, retrieval-augmented generation, multimodal pipelines, and agent systems that work correctly under real constraints.


What I've Built

Nexus — Three-service local AI workspace

Constraint: 8 GB Docker memory · CPU only · No GPU · No cloud inference

A production-designed system built to answer one question: how do you run a complete AI pipeline locally when memory is the hard limit?

Rust inference engine (infer-engine)

  • Pre-allocated memory pool — 25 slots × 4 MB, Drop-based reclamation, leaks structurally impossible
  • Bounded job queue (depth 20) with try_send — overload rejected at the boundary, never absorbed
  • LRU model registry (max 5) — deterministic memory under model-churn workloads
  • ~2 µs scheduler overhead (Criterion) · 3.93 ms P95 at 50 VUs · 342 req/s at 0.00% error rate (k6)
  • Zero GC pauses. Zero allocation at request time.
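
The admission-control and eviction ideas in the list above can be sketched in a few lines. This is a Python analog for illustration only, not the engine's Rust code: `BoundedJobQueue`, `LruModelRegistry`, `Overloaded`, and the `loader` callback are hypothetical names, and `queue.Queue.put_nowait` stands in for a Rust channel's `try_send`. Only the depth (20) and registry size (5) come from the list.

```python
import queue
from collections import OrderedDict

QUEUE_DEPTH = 20  # from the list above: bounded job queue, depth 20
MAX_MODELS = 5    # from the list above: LRU model registry, max 5


class Overloaded(Exception):
    """Raised when the queue is full; a server would map this to an error status."""


class BoundedJobQueue:
    """Hypothetical stand-in for the engine's bounded queue."""

    def __init__(self, depth=QUEUE_DEPTH):
        self._q = queue.Queue(maxsize=depth)

    def submit(self, job):
        try:
            # put_nowait never blocks: it enqueues or raises queue.Full,
            # so overload is rejected at the boundary instead of absorbed.
            self._q.put_nowait(job)
        except queue.Full:
            raise Overloaded("job queue full") from None

    def take(self):
        return self._q.get_nowait()


class LruModelRegistry:
    """Hypothetical stand-in for the LRU model registry."""

    def __init__(self, capacity=MAX_MODELS):
        self._capacity = capacity
        self._models = OrderedDict()

    def get_or_load(self, name, loader):
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        model = loader(name)
        self._models[name] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict the least recently used
        return model

    def loaded(self):
        return list(self._models)
```

With both bounds fixed, worst-case residency is known up front: the 25 × 4 MB pool accounts for 100 MB, and at most five models are ever loaded at once.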

Python RAG pipeline (rag-main)

  • Phi-3.5-mini-instruct Q4_K_M at ~143 tok/s warm · 0.91s warm latency · 4.73s cold start
  • Embedding delegated to Rust over loopback HTTP — saves ~500 MB baseline RAM vs PyTorch in-process
  • Guardrails run synchronously before StreamingResponse — HTTP error semantics preserved
  • 3,936 MB free RAM at peak load — guardrail floor: 500 MB
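
Why the guardrails run before streaming starts can be shown without any web framework. The sketch below is an assumption-laden illustration, not the rag-main code: `check_guardrails`, `handle_query`, `GuardrailError`, and the 400/503 status choices are hypothetical. Only the 500 MB floor comes from the list above. The point is ordering: once a streaming body has begun, the status line is already sent, so every check has to finish first.

```python
MIN_FREE_MB = 500  # guardrail floor from the list above


class GuardrailError(Exception):
    """Carries the HTTP status a failed check should produce (statuses assumed)."""

    def __init__(self, status, detail):
        super().__init__(detail)
        self.status = status
        self.detail = detail


def check_guardrails(query, free_mb):
    # Runs to completion before any token is generated.
    if not query.strip():
        raise GuardrailError(400, "empty query")
    if free_mb < MIN_FREE_MB:
        raise GuardrailError(503, "insufficient free memory")


def handle_query(query, free_mb, generate):
    """Return (status, body_iterator); `generate` is a hypothetical token source."""
    try:
        check_guardrails(query, free_mb)
    except GuardrailError as err:
        # The stream was never opened, so a real error status is still possible.
        return err.status, iter([err.detail])
    return 200, generate(query)
```

If the same checks lived inside the token generator, a failure would surface only after a 200 had been committed, and the client would see a truncated stream instead of an error status.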

Next.js 14 frontend (nexus-frontend)

  • SSE via Route Handler bypass — Next.js rewrites buffer response bodies (ECONNRESET); Route Handler pipes ReadableStream directly
  • AbortController wired to three abort triggers: query change, palette close, stop button
  • Live knowledge graph via Supabase Realtime subscription on note saves

Every failure mode returns the correct HTTP status. The system rejects work rather than degrading silently.


BehaviorLens — Multimodal RAG behavioral intelligence pipeline

Problem: Naive VLM-on-video pipelines hallucinate ~18% of the time — confabulated UI labels, missed micro-interactions, fabricated causal chains.

An open-source event-driven pipeline that analyzes user sessions by grounding Vision-Language Model outputs in deterministic browser SDK events, attacking the hallucination problem at the architectural level rather than through prompting.

  • TypeScript SDK captures DOM mutations, clicks, rage clicks, scroll positions, form events, JS errors — in-browser PII masking before data leaves the client
  • Three-phase timestamp sync: calibration → EMA drift detection (α=0.3) → quarantine for >500ms drift. ±50ms frame-event alignment.
  • Event-driven frame sampling: frames extracted only on trigger events — ~85% reduction vs. uniform 1fps sampling
  • LangGraph agent router dispatches to Transcription / Anomaly / KPI / Behavior agents by event priority; Pydantic-validated outputs with retry on failure
  • Every model output requires a timestamp_citation referencing a verified SDK event ID — hallucination target: <2% (down from 18% baseline)
  • Provider-agnostic: one .env line switches from Gemini 2.5 Flash → Claude Sonnet 4 + Kafka MSK + Pinecone
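
The drift-detection and citation rules in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BehaviorLens code: `DriftTracker`, `observe`, and `citation_is_grounded` are hypothetical names, the quarantine test is assumed to apply to the smoothed drift rather than to single samples, and the output is assumed to be a dict with a `timestamp_citation` key. The constants (α = 0.3, 500 ms) come from the list.

```python
from dataclasses import dataclass

ALPHA = 0.3                # EMA smoothing factor from the list above
QUARANTINE_DRIFT_MS = 500  # drift beyond this quarantines the session


@dataclass
class DriftTracker:
    """Tracks clock drift between SDK events and video frames (hypothetical)."""

    offset_ms: float           # calibration phase: SDK clock minus video clock
    ema_drift_ms: float = 0.0
    quarantined: bool = False

    def observe(self, sdk_ts_ms, video_ts_ms):
        # Drift is how far the current offset has moved from the calibrated one.
        drift = (sdk_ts_ms - video_ts_ms) - self.offset_ms
        self.ema_drift_ms = ALPHA * drift + (1 - ALPHA) * self.ema_drift_ms
        # Assumption: quarantine triggers on the smoothed value, not raw samples.
        if abs(self.ema_drift_ms) > QUARANTINE_DRIFT_MS:
            self.quarantined = True
        return self.ema_drift_ms


def citation_is_grounded(output, verified_event_ids):
    """Accept a model output only if it cites a verified SDK event ID."""
    return output.get("timestamp_citation") in verified_event_ids
```

An output that cites no event, or an event the SDK never recorded, fails `citation_is_grounded` and can be rejected or retried before it reaches a report.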

[DevSecOps Autopilot](https://gitlab.com/gitlab-ai-hackathon/participants/35222941) — Four-agent security pipeline on GitLab Duo

Trigger: Mention or assign to any Merge Request. The pipeline does the rest.

A multi-agent system built on the GitLab Duo Agent Platform. One MR mention triggers a four-stage sequential pipeline that gathers MR context, scans the code, maps findings to security standards, and posts a structured review with fix snippets, with zero human involvement.

  • MR Context Builder — pulls diffs, file contents, existing SAST findings, CI failures into a single JSON object. No analysis, no commentary.
  • Security Scanner — scans added_lines for 12 vulnerability classes (including SQLi, XSS, SSRF, hardcoded secrets, IDOR, prototype pollution, and eval misuse). Each finding carries HIGH/MEDIUM/LOW confidence. Verdicts on existing GitLab findings: CONFIRMED / FALSE_POSITIVE / NEEDS_INVESTIGATION.
  • Compliance Mapper — enriches with OWASP Top 10 (2021), CWE, CVSS 3.1 (chain-of-thought scored), P0–P3 priority, merge recommendation: BLOCK / REVIEW / PASS.
  • Remediation Writer — posts structured code review, creates GitLab vulnerability issues for P0/P1 findings, confirms or dismisses existing SAST findings. Fix snippets touch only the vulnerable lines.
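
One way the Compliance Mapper's scoring could feed the merge recommendation is sketched below. The severity bands are the standard CVSS v3.1 qualitative ratings. Everything else is an assumption made for illustration: the severity-to-priority table, the rule that P0/P1 blocks and P2 triggers review, and the names `cvss_severity`, `PRIORITY_BY_SEVERITY`, and `merge_recommendation`. The pipeline's actual mapping is not stated in this README.

```python
def cvss_severity(score):
    """Map a CVSS v3.1 base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base score must be between 0.0 and 10.0")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


# Assumed mapping, for illustration only.
PRIORITY_BY_SEVERITY = {
    "CRITICAL": "P0",
    "HIGH": "P1",
    "MEDIUM": "P2",
    "LOW": "P3",
    "NONE": "P3",
}


def merge_recommendation(scores):
    """Assumed rule: any P0/P1 blocks, any P2 needs review, otherwise pass."""
    priorities = {PRIORITY_BY_SEVERITY[cvss_severity(s)] for s in scores}
    if priorities & {"P0", "P1"}:
        return "BLOCK"
    if "P2" in priorities:
        return "REVIEW"
    return "PASS"
```

Under these assumed rules, an MR with findings scored 9.8 and 3.1 would be blocked, while one with only a 5.3 would be sent for review.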

Stack

Systems      Rust · Axum · crossbeam · tract-onnx · tower · Prometheus
AI/ML        llama.cpp · Phi-3.5-mini · sentence-transformers · ONNX · LangGraph · GitLab Duo Agent Platform
Backend      Python · FastAPI · Node.js · Supabase · pgvector · PostgreSQL · Redis Streams · Pydantic
Frontend     Next.js 14 · TypeScript · React · Tiptap · ReactFlow · Zustand · D3
Infra        Docker Compose · k6 · Criterion · Prometheus · Vercel · FFmpeg

Writing

TinyLlama 1.1B consistently fabricated relationships between unrelated entities in multi-document context windows. Every prompt mitigation failed. Here's why model scale was the only fix, and how I diagnosed it.


Background

B.Tech Computer Science — NIET
DeepLearning.AI: Deep Learning · NLP · GANs · TensorFlow · Generative AI · Mathematics for ML
CS50x (Harvard)


Currently

  • Building on Nexus — batching, speculative decoding, extended observability
  • Building on BehaviorLens — production Kafka integration, Pinecone vector store
  • Open to remote AI infrastructure and LLM systems roles — US and EU firms
  • IST (UTC+5:30) — 4–6 hr overlap with EU, available late IST for US East Coast

📬 aryan105825@gmail.com · LinkedIn

Pinned

  1. behaviorlens (Public)

    Multimodal RAG pipeline that grounds VLM outputs in deterministic SDK events, reducing hallucinations through event-verified reasoning and synchronized video-event analysis.

    Python

  2. nexus-system (Public)

    High-performance AI platform featuring a Rust-based inference engine, memory-safe architecture, and cost-optimized RAG pipeline delivering 342 req/s with sub-4 ms P95 latency.