🧠
🎓 Final-year PhD: 90% caffeine ☕, 10% gradient descent 🤖. Say hi! 👋
bettyguo/README.md

Dongxin (Betty) Guo

Final-year PhD @ HKU CS  ·  Hong Kong
I prove the architectural limits of LLM reasoning, and build the systems that route around them.

Homepage  ·  Google Scholar  ·  ORCID  ·  OpenReview  ·  LinkedIn  ·  X / Twitter  ·  Email

About

I'm a final-year PhD candidate in the Department of Computer Science at The University of Hong Kong, advised by Prof. Siu-Ming Yiu. My research sits at the intersection of three threads that keep refusing to be separate:

  • What transformers can actually reason about. Tight architectural bounds, plus the tool-delegation systems those bounds force you to build.
  • Trustworthy LLMs in regulated settings. Compliance-grade explainability, distribution-free coverage, atomic claim verification.
  • Serving infrastructure that respects both. Workflow-atomic GPU scheduling with per-tenant fairness guarantees.

Theorems tell you what cannot be done. Systems make precise what can.

The cycle runs both ways: deployment surfaces the limits worth proving, and the proofs become the constraints that keep deployment honest.


🎉 News

  • [05.2026] 🎉 Accepted to TMLR: Tight Bounds and Fundamental Impossibility for Knowledge Editing Side Effects in Transformers. Computable bounds on edit side effects, plus an impossibility theorem ruling out perfect locality and generalization at once.
  • [05.2026] 🎉🎉🎉 Two papers accepted to ICML 2026: The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary (Main) and Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements (Position Track).
  • [05.2026] ✨ On the postdoc market for Fall 2026. Trustworthy / compliance-grade AI, multi-agent systems & mechanism design, LLM theory, and serving systems. Reach out at bettyguo@connect.hku.hk.
  • [05.2026] 📝 Serving as a reviewer for NeurIPS, EMNLP, ACM Multimedia (Main & Dataset Tracks), and UAI.
  • [04.2026] 🏆🏆🏆 Four papers accepted to ACL 2026 Industry Track: FinGround (atomic claim verification), RouteNLP (conformal LLM routing), AgentEval (DAG-structured agent evaluation), and ComplianceNLP (KG-augmented regulatory gap detection).
  • [04.2026] 🚀🚀🚀 SAGA accepted to HPDC 2026. It's a workflow-atomic scheduler for AI agent inference on GPU clusters, with per-tenant fairness guarantees that hold under real multi-tenant load.
  • [03.2026] 📣 Adaptive Retrieval for Large Reasoning Models accepted to SIGIR 2026. When to retrieve during reasoning, with bounds, not heuristics.
  • [02.2026] 💼 Conformal-bound risk management at Brain Investing, our HKU FinTech spin-out, is now running against live P&L. The lab's coverage work has finally made it onto a real trading book.
  • [01.2026] 🛠️ Shipped multi-tenant scheduling and conformal-coverage pipelines at Stellaris AI for native-safe foundation-model deployment in regulated industries.
  • [09.2025] 🎓 Began the final year of PhD at HKU CS, advised by Prof. Siu-Ming Yiu. Thesis focuses on the theory-meets-deployment cycle: bounds on transformer reasoning, and the systems those bounds force.
  • [08.2025] 🏅 Continuing Cyberport Incubation (2023–2025 intake). That keeps an unbroken 2018–2025 funding run going across TSSSU, HKSTP Incu-Tech, HKU iAXON Deep Tech, and Cyberport.

Theory. Production. Curation.

Nine ICML / SIGIR / ACL / HPDC / TMLR papers this cycle. Five candidate post-Transformer architectures under test. A retrieval method at 117 stars. Conformal-bound risk on a live trading book. Built across HKU CS, Stellaris AI, and Brain Investing.


⚡   At a Glance

9 papers, 2026 cycle
ICML × 2  ·  SIGIR  ·  HPDC  ·  TMLR  ·  ACL Industry × 4

85+ original OSS repos
architectures  ·  research  ·  agents  ·  benchmarks  ·  tools  ·  curation

2 in production
Stellaris AI  ·  Brain Investing  ·  conformal-bound risk, live P&L

10 years of funding
TSSSU  ·  HKSTP  ·  Cyberport ×2  ·  iAXON  ·  continuous since 2018

HKU CS  ·  Stellaris AI  ·  Brain Investing

Python  ·  PyTorch  ·  C++  ·  Go  ·  Rust  ·  TypeScript  ·  Jupyter  ·  DuckDB  ·  Next.js  ·  OpenTelemetry  ·  MCP


🌟   Showcase

Four projects worth a second look:

ReaLM-Retrieve

SIGIR 2026. When to retrieve during reasoning, with bounds rather than heuristics. Highest-cloned repo in this account.

Python  ·  ⭐ 117  ·  🍴 13  ·  breakout

🚀   SAGA

HPDC 2026. Workflow-atomic GPU-cluster scheduler for AI agents. Within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated C++ kernels and LangChain / AutoGen / CrewAI bridges.

Python C++  ·  concrete-metric flagship

🧬   Vannevar

Open-source agentic harness with citation-grade memory: source URI, temporal validity window, append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable.

Rust  ·  flagship infrastructure

Five post-Transformer architecture candidates: CASCADE, CHIMERA, HELIX, MNEMOSYNE, NOESIS. Per-token routing, tokenizer-free byte models, and latent-space continuous-thought reasoning.

Python  ·  frontier research program


📚   Selected Publications

| Paper | Venue | Code |
| --- | --- | --- |
| Tight Bounds and Fundamental Impossibility for Knowledge Editing Side Effects in Transformers | TMLR | ke-bounds |
| The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary | ICML 2026 | deterministic-horizon |
| When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models | SIGIR 2026 | realm-retrieve |
| Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements | ICML 2026 Position | position paper |
| SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters | HPDC 2026 | SAGA |
| FinGround: Atomic Claim Verification for Financial LLM Outputs | ACL 2026 Industry | FinGround |
| ComplianceNLP: KG-Augmented Regulatory Gap Detection | ACL 2026 Industry | ComplianceNLP |
| RouteNLP: Conformal LLM Routing | ACL 2026 Industry | RouteNLP |
| AgentEval: DAG-Structured Agent Evaluation | ACL 2026 Industry | AgentEval |

Full publication list, PDFs, and BibTeX at bettyguo.github.io.


🧭   Research Threads

Three lines that keep crossing in our papers. Each thread proves a bound and ships the system that meets it.

🧠   Reasoning & tool use

What softmax attention can realize at inference time, and what it provably cannot. The matching upper and lower bounds become the spec for the tool-delegation layer above them.

📄   The Deterministic Horizon (ICML 2026)  ·  Adaptive Retrieval for Large Reasoning Models (SIGIR 2026)  ·  code: deterministic-horizon, realm-retrieve

🛡️   Trustworthy LLMs for regulated settings

Explainability and verification that survive financial-services audit, not benchmark conditions. Distribution-free coverage, atomic claim verification, knowledge-graph-augmented regulatory gap detection, and provable bounds on knowledge-editing side effects.

📄   Tight Bounds and Fundamental Impossibility for Knowledge Editing Side Effects in Transformers (TMLR)  ·  Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements (ICML 2026 Position)  ·  FinGround, ComplianceNLP (ACL 2026 Industry × 2)  ·  code: ke-bounds, FinGround, ComplianceNLP, TrustKGRAG

⚡   Serving & agent infrastructure

Workflow-atomic GPU scheduling with per-tenant fairness guarantees that hold under real multi-tenant load. DAG-structured evaluation harnesses and conformal routing for agent cascades.

📄   SAGA (HPDC 2026)  ·  RouteNLP, AgentEval (ACL 2026 Industry × 2)  ·  code: SAGA, RouteNLP, AgentEval


📐   Method, in four habits

How we approach problems, across every thread:

  1. Tight bounds with explicit constants. Upper and lower bounds in the same paper. No asymptotic hand-waving.
  2. Impossibility paired with construction. When a thing can't be done, that result becomes a design constraint, not a stopping point.
  3. Guarantees that survive reality. Distribution-free coverage, conformal prediction, fair scheduling. No idealized assumptions.
  4. Theory and the system that meets it, shipped together. The proof tells the algorithm what to achieve; the algorithm tells the proof what's worth bounding.
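
Habit 3 names distribution-free coverage and conformal prediction. For readers new to the idea, here is a minimal sketch of split conformal prediction in plain Python. It is illustrative only and not taken from any repo in this account: the toy data, the absolute-residual score, and the names `conformal_quantile`, `sample`, and `model` are all assumptions made for the example.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(scores)
    # Taking the ceil((n + 1) * (1 - alpha))-th smallest score is what gives
    # the marginal coverage guarantee under exchangeability.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def model(x):
    """Hypothetical point predictor (stands in for any fitted model)."""
    return 2.0 * x


def sample(rng):
    """Toy data: y = 2x + standard Gaussian noise (an assumption)."""
    x = rng.uniform(0.0, 10.0)
    return x, 2.0 * x + rng.gauss(0.0, 1.0)


rng = random.Random(0)

# Calibration: score each held-out point by its absolute residual.
calibration = [sample(rng) for _ in range(500)]
scores = [abs(y - model(x)) for x, y in calibration]
q_hat = conformal_quantile(scores, alpha=0.1)

# The interval for a new x is [model(x) - q_hat, model(x) + q_hat].
test_points = [sample(rng) for _ in range(2000)]
covered = sum(abs(y - model(x)) <= q_hat for x, y in test_points)
coverage = covered / len(test_points)
print(f"q_hat={q_hat:.3f}, empirical coverage={coverage:.3f}")
```

The guarantee is marginal and relies only on exchangeability of calibration and test points, which is why no distributional assumption on the noise appears in `conformal_quantile`.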

"Theorems tell you what cannot be done. Systems make precise what can."


🗂️   What lives in this account

85+ original public repos. Research code behind every paper, the architecture program we are betting on next, and the developer infrastructure our team relies on every day across HKU CS, Stellaris AI, and Brain Investing.
Browse the full index → github.com/bettyguo?tab=repositories

🧬 5 architectures  ·  🔬 22 research  ·  🔌 8 MCP servers  ·  🤖 6 agent systems
🧪 8 eval & safety  ·  🔭 6 interpretability  ·  🛠️ 9 dev tools  ·  📚 14 curated maps

🧬   Post-Transformer architectures

An exploratory program: candidate sequence architectures beyond attention.

Repo What it is
research-prototypes The program. Five post-Transformer candidates evaluated head to head. Each now has its own repo below.
cascade-lm CASCADE. Cascaded, multi-stage sequence processing.
chimera-lm CHIMERA. Per-token learned routing across SSM, sliding-window, and full attention.
helix-lm HELIX. Tokenizer-free, byte-level hierarchical entropy-linked information exchange.
mnemosyne-lm MNEMOSYNE. Memory-centric sequence architecture.
noesis-lm NOESIS. Continuous-thought reasoning LM that thinks in latent space and allocates its own thinking budget.

🔬   Research code

One repo per paper. Theory and the system that meets it, in the same artifact.

Reasoning, retrieval & serving

Repo What it is
deterministic-horizon ICML '26 companion. Bounds on extended reasoning, and the regime where tool delegation becomes necessary. Explicit constants.
realm-retrieve ReaLM-Retrieve · SIGIR '26 companion. When to retrieve during reasoning, with bounds rather than heuristics. 117 stars.
SAGA HPDC '26 companion. Workflow-atomic GPU-cluster scheduler. Within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated C++ kernels and LangChain / AutoGen / CrewAI bridges.
RouteNLP ACL '26 Industry companion. Conformal-coverage router for LLM cascade serving.
AgentEval ACL '26 Industry companion. DAG-structured evaluation harness for multi-step agents.
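
The SAGA entry above quotes a ratio against Bélády-optimal eviction. As background, the sketch below shows what that offline baseline is: on a miss with a full cache, evict the block whose next use lies farthest in the future. This is not SAGA's code. The trace, the use of miss count as the comparison metric, and the LRU comparison policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Misses under Bélády's offline-optimal policy (needs the full trace)."""
    # next_use[i] = index of the next access to trace[i] after i, or inf.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block -> index of its next use
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the block that will not be needed for the longest time.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Misses under least-recently-used eviction, an online policy."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used block
        cache[block] = True
    return misses


# Hypothetical access trace over KV-cache blocks "a".."d".
trace = ["a", "b", "c", "a", "b", "d", "a", "b", "c", "d"]
optimal = belady_misses(trace, capacity=3)
online = lru_misses(trace, capacity=3)
print(f"Belady misses={optimal}, LRU misses={online}, ratio={online / optimal:.2f}")
```

Because Bélády's policy needs future accesses, it serves only as a yardstick: an online scheduler is judged by how close it gets to this bound on recorded traces.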

Trustworthy & regulated AI

Repo What it is
ke-bounds TMLR '26 companion. Computable bounds on knowledge-editing side effects, plus the impossibility result ruling out perfect locality and generalization at once.
FinGround ACL '26 Industry companion. Three-stage verify-then-ground pipeline for financial document QA that detects and mitigates hallucinations.
ComplianceNLP ACL '26 Industry companion. KG-augmented regulatory gap detection.
TrustKGRAG Probabilistic certified robustness and anomaly detection against knowledge-graph poisoning in RAG.
conformalized-neural-operators Distribution-free, spatially adaptive UQ for neural-operator PDE surrogates via physics-informed conformal prediction.
VerBPM Temporal-logic framework for formal verification and repair of LLM-generated business process models.
NeSyDisc Neuro-symbolic declarative process discovery with consistency guarantees.

Learning theory & systems

Repo What it is
SafeAnchor Safety-preserving continual domain adaptation of LLMs via Fisher-based subspace identification and orthogonal gradient projection.
SigGate-GT Sigmoid-gated attention for graph transformers. Eliminates over-smoothing and stabilizes training via element-wise output gating.
pac-learned-index PAC learning with tight VC-dimension bounds and provable sample-complexity guarantees for learned database indexes.
JoinPAC PAC learnability for join cardinality estimation. Decomposition bounds, drift detection, hybrid-estimation guarantees.
AdaptQO Structure-aware bandit optimization for learned query hints, with semi-bandit feedback, monotone pruning, and predictive convergence guarantees.
neural-precond-spectral Spectral-equivalence theory with mesh-independent convergence bounds for neural-operator preconditioning of PDE systems.

LLM science: behavior, collapse & brains

Repo What it is
iterated-collapse Discriminative tests of iterated-learning predictions for LLM model collapse: non-monotonic compositionality, cross-linguistic regularization, the compression and communication tradeoff.
llm-statistical-preemption Causal and correlational evidence for statistical preemption in LLMs, dissociating negative-knowledge acquisition from entrenchment across English verb-construction alternations.
cross-lingual-brain-llm Cross-lingual alignment between brain activity and LLM representations.
sae-brain-topography Sparse-autoencoder decomposition of brain–LLM alignment with a priori cortical semantic topography mapping.

🔌   MCP servers

Eight live integrations across our research workflow: code, data, papers, knowledge bases.

Repo What it is Lang
mcp-gateway Turns any OpenAPI 3.x spec into a Model Context Protocol server. Auth, rate-limiting, OpenTelemetry baked in. Go
mcp-postgres Postgres MCP server for agents. Four-tier safety: role grants, pglast AST guard, per-tx envelope, audit log. Schema introspection, EXPLAIN analysis, pgvector. PG 13 to 17. Python
mcp-jupyter MCP server for Jupyter. Live kernel state (variables, dataframes, plots, tracebacks) instead of just the .ipynb JSON. Python
mcp-wandb-2 Analytical MCP server for Weights & Biases: hparam importance, sweep summaries, run-delta analysis, inline charts, gated Launch actions. Python
paperbase-mcp Research-grade MCP composing arXiv, Semantic Scholar, and OpenAlex. Related work, citation graphs, BibTeX in your chat. Python
mcp-overleaf MCP server and Skills bundle for finishing a LaTeX paper: bib cleanup, venue rule packs, latexdiff, related-work drafting. Python
obsidian_mcp MCP plus 7 Claude skills for Obsidian vaults. Read, search, write, and link notes from Claude / Cursor / ChatGPT. Filesystem-direct, local-first, round-trip safe. Python
semantic-grep Local semantic code search. CLI and MCP server, all on your machine. Pre-alpha. Python

🤖   Agent systems & runtimes

Local-first when possible; verifiable when not.

Repo What it is Lang
Vannevar Open-source agentic harness with citation-grade memory. Every fact carries a source URI, a temporal validity window, and an append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable. Rust
agent-memory Verifiable memory for LLM agents. Every recalled claim is HMAC-signed back to its originating trajectory span. Python
computer_use_agent Open-source local-VLM browser agent. AT-tree-first routing with VLM fallback, refusals enforced in code, honest benchmarks including the failure atlas. Python
whisper_agent Hands-free local voice agent: faster-whisper STT, local LLM with tool use, TTS. Runs entirely on your machine. Python
agent-tracer-2 OpenTelemetry-native, local-first observability for AI agents. DuckDB on disk, Next.js viewer on localhost, no SaaS. Adapters for Anthropic, OpenAI, LangGraph, AutoGen, CrewAI. Python
local-deep-research Self-hosted deep-research agent: multi-step query planning, source synthesis, report generation. Ollama / llama.cpp / vLLM friendly, with SearXNG, FAISS, and BM25. Python
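
The agent-memory entry above describes claims that are HMAC-signed back to their originating trajectory span. The sketch below shows the general technique using only the Python standard library. It is not agent-memory's API: the payload layout and the names `sign_claim`, `verify_claim`, and `span_id` are hypothetical.

```python
import hashlib
import hmac
import json


def sign_claim(key: bytes, claim: str, span_id: str) -> str:
    """Return a hex HMAC-SHA256 tag binding a claim to a trajectory span."""
    # Canonical JSON (sorted keys) so the same fields always hash identically.
    payload = json.dumps(
        {"claim": claim, "span_id": span_id}, sort_keys=True
    ).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_claim(key: bytes, claim: str, span_id: str, tag: str) -> bool:
    """Check a recalled claim against its stored tag in constant time."""
    expected = sign_claim(key, claim, span_id)
    return hmac.compare_digest(expected, tag)


# Hypothetical key and span identifier, for illustration only.
key = b"demo-key-do-not-use-in-production"
tag = sign_claim(key, "The invoice total was 42 USD.", "span-0007")

print(verify_claim(key, "The invoice total was 42 USD.", "span-0007", tag))
print(verify_claim(key, "The invoice total was 24 USD.", "span-0007", tag))
```

Binding the span identifier into the signed payload means a claim cannot be silently re-attributed to a different span without invalidating the tag.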

🔭   Interpretability

Make model internals visible, on a laptop.

Repo What it is Lang
see-the-ai-think Watch an LLM think. Visualizes sparse-autoencoder features firing live across every token. Runs on a laptop, no GPU required. Python
llm-fossils Reproducible catalog of LLM behaviors that vanished as models scaled. Jupyter

Plus web inspectors: prompt-x-ray, tokenviewer, policy-microscope, catch-the-ai-lying.


🧪   Benchmarks, audits & red-teaming

Reproducible by default. Probe for contamination, leakage, and reward hacks before declaring a number.

Repo What it is Lang
agent_eval Open-source benchmark for Claude Code skill bundles. Pass@k plus cost plus reliability, content-addressed leaderboard across Anthropic / OpenAI / Google. Python
bench_audit Probes for agent benchmarks: contamination, gold-answer leaks, harness-injection vulnerabilities, reward hacking. CIs on every result. Python
benchprobe Audits AI-agent benchmarks for the eight exploit families catalogued by Berkeley / RDI. Python
agent-backtest-lab Statistical-rigor audit harness for LLM trading-agent frameworks, with a leakage firewall. Python
ai-red-team-in-a-box Red-team toolkit for probing LLM systems. Python
rag-bench Small, reproducible benchmark for RAG pipelines. Python
agent-arena Arena-style framework for head-to-head agent comparison. Python
paper-replay Replay and reproduce paper experiments with locked seeds, environments, and artifacts. Python
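
The agent_eval entry above reports Pass@k. The sketch below shows the commonly used unbiased pass@k estimator from code-generation evaluation: with `n` samples per task of which `c` pass, the probability that at least one of `k` drawn samples passes is 1 - C(n-c, k) / C(n, k). Whether agent_eval computes it exactly this way is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c passing."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples per task, 3 of them pass.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

Averaging this quantity over tasks gives the benchmark-level number. Using the combinatorial form avoids the bias of simply drawing `k` samples once and checking whether any passes.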

🛠️   Developer tools & skills

Quality layers, lockfiles, and ergonomics for the agent stack.

Repo What it is Lang
promptlock Production prompt workflow: semantic diff, eval-on-PR, lockfile, drift detection, and rollback for plain-markdown prompts in your repo. Go
rigging Typed, trust-bearing, schema-mediated coupling layer that composes heterogeneous components. Python
skill-forge-2 Quality layer for Claude Code Skills: lint, test, and bench before you ship. Rust
browser-skills 15 reusable, agent-agnostic browser recipes plus an MCP server. Cookie banners, infinite scroll, calendar widgets, all solved once. Python
diagram-skills Generate validated diagrams across Mermaid, PlantUML, Graphviz, D2, and Excalidraw. MCP server, CLI, and Claude Code skills. Python
capture-engine Capture any web page as vector PDF, standalone HTML, or high-DPR raster. Local-only, MV3. TypeScript
paper_pod Local-first audio overviews for academic papers. Take an arXiv URL, PDF, or BibTeX in, get an 8 to 15 minute two-host podcast out. Python
paper2repro Paper to reproducible experiment scaffold. Python
test_forge Test-generation toolkit for Python research code. Python

📚   Curated knowledge

What we had to learn the hard way, written down for the next person.

📓   Atlases & annotated notebooks

Repo What it is
awesome-llm-circuits-atlas Interactive atlas of discovered circuits and SAE features in large language models, with Colab reproductions on open-weights models.
awesome-reasoning-models-theory Theory-first map of why reasoning models (o1/o3, DeepSeek-R1, Claude-thinking, Qwen-QwQ) actually work. 8 chapters, 60+ annotated papers, 13 models compared, 5 reproduction notebooks, live benchmarks.
retrieval-from-scratch Modern Information Retrieval from scratch in PyTorch. BM25, dense bi-encoders, ColBERT late interaction, cross-encoder reranking, and RAG, in annotated notebooks that run on a single GPU.
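
The retrieval-from-scratch entry above starts from BM25. A minimal sketch of Okapi BM25 scoring in plain Python follows. It is not the notebook's code: whitespace tokenization, the smoothed IDF variant that adds 1 inside the logarithm, the defaults `k1=1.5` and `b=0.75`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the +1 keeps it non-negative for common terms.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term-frequency saturation with document-length normalization.
            tf = term_freq[term]
            denom = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append(score)
    return scores


# Hypothetical three-document corpus.
docs = [
    "dense retrieval with bi-encoders",
    "bm25 is a sparse lexical retrieval baseline",
    "cross-encoder reranking after retrieval",
]
scores = bm25_scores("bm25 retrieval", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(f"best match: {docs[best]!r}")
```

The rare term "bm25" dominates the ranking because "retrieval" appears in every document and so carries a small IDF weight.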

🗺️   Maps, lists & roadmaps

Repo What it is
awesome-why-llms-work Falsifiable-hypothesis atlas of why LLMs work. Five competing research programmes, 41 tracked claims with epistemic status (🟢🟡🔴⚪) and named falsifiers.
awesome-llm-reasoning-foundations Curated, rigorously-verified map of the theoretical foundations of LLM reasoning: transformer expressivity, chain-of-thought error bounds, circuit complexity, logical characterizations, learnability.
llm-impossibility-results Verified, assumption-explicit catalog of published impossibility and lower-bound results for LLMs and AI agents: circuit-complexity ceilings, hallucination bounds, watermarking impossibility, alignment.
awesome-llm-theory Companion list: theory papers for LLM behavior, expressiveness, and learnability.
build-your-own-ai Master modern AI by building it from scratch: curated index of the best build-it-yourself guides for tokenizers, attention, training, RAG, agents, and evals.
awesome-research-agents Opinionated, curated list of agents, skills, MCP servers, and tools ML researchers actually use.
ai-engineer-roadmap Interactive end-to-end roadmap for AI engineers. 12 stages, 122 nodes, 276 link-verified resources from math prerequisites to the research frontier.
harness-engineer-roadmap Interactive roadmap for harness engineering: the agent loop, tool layers, context engineering, memory, retrieval, eval.
awesome-llm-trading-agents Curated, verified map of the LLM-trading-agent ecosystem: frameworks, papers, and tools.
OpenProblems A credit-rating-agency-style platform for open problems in LLM and AI research.
llm-interview-prep Interview-prep notebook for LLM and ML-systems roles.

🏭   Deployment

Translational work. Coverage proofs and scheduling guarantees in production, against real workloads.

  • Stellaris AI. Conformal-coverage pipelines and multi-tenant scheduling for native-safe foundation models, in regulated deployments.
  • Brain Investing. HKU FinTech spin-out. Conformal-bound risk management running against live P&L. The lab's coverage work, in a real trading book.

🏅   Service & Recognition


📬   Availability

Postdoc, Fall 2026

Open to positions where theory and deployment share a research agenda.

Areas. Trustworthy & compliance-grade AI  ·  Multi-agent systems & mechanism design  ·  LLM theory (descriptive complexity, in-context reasoning)  ·  Serving systems for inference.

Reach me at   bettyguo@connect.hku.hk


Dongxin (Betty) Guo  ·  The University of Hong Kong  ·  Department of Computer Science
homepage  ·  scholar  ·  orcid  ·  openreview  ·  linkedin
Last updated May 2026

Pinned

  1. OpenProblems (Public)

    A credit-rating-agency-style platform for open problems in LLM and AI research, with 5 rating dimensions per problem, immutable action log, leaderboards, bilingual EN/FR.

    TypeScript  ·  ⭐ 5

  2. realm-retrieve (Public)

    When to Retrieve During Reasoning: Adaptive RAG for Large Reasoning Models

    Python  ·  ⭐ 117  ·  🍴 13

  3. deterministic-horizon (Public)

    When extended chain-of-thought stops helping, and tool delegation becomes the only way forward.

    Python  ·  ⭐ 9  ·  🍴 6

  4. SAGA (Public)

    Workflow-atomic GPU-cluster scheduler for AI agents, within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated C++ kernels and LangChain/AutoGen/CrewAI bridges.

    Python  ·  ⭐ 5  ·  🍴 1

  5. FinGround (Public)

    FinGround is a three-stage verify-then-ground pipeline for financial document question answering that detects and mitigates LLM hallucinations.

    Python  ·  ⭐ 7

  6. rigging (Public)

    The typed, trust-bearing, schema-mediated coupling layer that composes heterogeneous harnessed agents into a single coherent system.

    Python  ·  ⭐ 5