research-direction: cross-method benchmark at fixed compute budget on N2/Cr2 active-space series (only realistic publishable angle) #51

@thc1006

TL;DR

Each NQS+SCI method paper reports its own numbers on its own setup. No paper systematically compares wall-time-to-accuracy across HI-NQS-SQD, GTNN-SCI, ADAPT-QSCI, ph-AFQMC+SQD, and NNQS-Transformer at a fixed compute budget on the same hardware. This is a workshop/benchmark paper opportunity, not a new-method paper.

What's open

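The literature is full of "method X achieves accuracy Y" claims, but nobody plots the Pareto frontier at fixed compute. Reported results typically look like this (illustrative placeholders, not real numbers):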

  • Method A: chem-acc in 4h on H100
  • Method B: chem-acc in 1h on H200 with 4× memory
  • Method C: 5 mHa off in 30min, chem-acc in 8h

Researchers have to reproduce each method themselves, and often fail because of insufficient detail or missing dependencies.

Proposal: build the missing benchmark

Test molecules (N2-CAS series, fixed CASSCF orbitals, fixed reference):

  • N2-CAS(10,12) — 24Q, Hilbert 627k
  • N2-CAS(10,15) — 30Q, Hilbert 9M
  • N2-CAS(10,20) — 40Q, Hilbert 240M
  • N2-CAS(10,26) — 52Q, Hilbert 4.3B
  • Cr2-CAS(12,18) — 36Q (multireference test)
  • Cr2-CAS(12,26) — 52Q (harder multireference)
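
The qubit counts and Hilbert-space sizes above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: a closed-shell Sz = 0 sector (equal numbers of alpha and beta electrons), no spatial-symmetry reduction, and one qubit per spin orbital. The helper names `cas_dimension` and `qubit_count` are hypothetical and are not part of qvartools.

```python
from math import comb


def cas_dimension(n_electrons: int, n_orbitals: int) -> int:
    """Count Sz = 0 determinants in CAS(n_electrons, n_orbitals).

    Assumes an even electron count split equally between alpha and beta,
    and ignores any reduction from point-group symmetry.
    """
    if n_electrons % 2:
        raise ValueError("this sketch assumes an even electron count")
    n_alpha = n_electrons // 2
    # Alpha and beta strings are chosen independently.
    return comb(n_orbitals, n_alpha) ** 2


def qubit_count(n_orbitals: int) -> int:
    """One qubit per spin orbital (two per spatial orbital)."""
    return 2 * n_orbitals


# (electrons, orbitals) for the systems listed above.
SYSTEMS = {
    "N2-CAS(10,12)": (10, 12),
    "N2-CAS(10,15)": (10, 15),
    "N2-CAS(10,20)": (10, 20),
    "N2-CAS(10,26)": (10, 26),
    "Cr2-CAS(12,18)": (12, 18),
    "Cr2-CAS(12,26)": (12, 26),
}

for name, (n_elec, n_orb) in SYSTEMS.items():
    dim = cas_dimension(n_elec, n_orb)
    print(f"{name}: {qubit_count(n_orb)}Q, {dim:,} determinants")
```

Under these assumptions the N2 entries reproduce the rounded figures in the list (627,264 determinants for CAS(10,12), for example).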

Methods to compare (5 baselines + ours):

  1. HI-NQS-SQD (qvartools pipeline 010)
  2. ADAPT-QSCI (JCTC 4c00846)
  3. GTNN-SCI (JCTC 5c01429)
  4. NNQS-Transformer / QiankunNet (arXiv:2306.16705)
  5. ph-AFQMC + SQD trial (JCTC 5c01407)
  6. HCI gold reference (compactness baseline)

Fixed compute budget: e.g., 4h on an H200 with 12 CPUs. Run each method until it terminates on its own or exhausts the budget.
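
The "terminate or budget-exhaust" rule could be enforced by a thin wrapper around each method. The sketch below assumes every method can be exposed as an iterator that yields its current energy estimate once per outer iteration. That interface, the name `run_with_budget`, and the status strings are hypothetical choices for illustration, not an existing qvartools API.

```python
import time
from typing import Callable, Iterable, List, Tuple

Trace = List[Tuple[float, float]]  # (elapsed_seconds, energy)


def run_with_budget(
    steps: Iterable[float],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Trace, str]:
    """Consume energies from `steps` until it ends or the budget is spent.

    The budget is only checked between iterations, so a single long
    iteration can overrun it. A production harness would also need a
    hard timeout at the process level.
    """
    start = clock()
    trace: Trace = []
    for energy in steps:
        elapsed = clock() - start
        trace.append((elapsed, energy))
        if elapsed >= budget_s:
            return trace, "budget_exhausted"
    return trace, "terminated"


if __name__ == "__main__":
    # Toy stand-in for a method: three quick iterations, made-up energies.
    def toy_method():
        for energy in (-1.00, -1.01, -1.02):
            time.sleep(0.01)
            yield energy

    trace, status = run_with_budget(toy_method(), budget_s=4 * 3600)
    print(status, trace)
```

Passing the clock as a parameter keeps the wrapper testable without waiting on real wall time.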

Metrics:

  • Final E vs HCI gold standard
  • Wall time to chem-acc (1.6 mHa)
  • Subspace size at termination
  • |c|² histogram tail thickness
  • Sample efficiency (configs / unique-important-determinants)
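
Three of these metrics reduce to short functions over a per-run log. The sketch below assumes each run is recorded as a list of `(elapsed_seconds, energy_in_Ha)` pairs and that the |c|² values of the final subspace are available as a flat list. The function names and the importance `threshold` are hypothetical, and what counts as an "important" determinant would have to be fixed by the benchmark protocol.

```python
from typing import List, Optional, Sequence, Tuple

CHEM_ACC_HA = 1.6e-3  # chemical accuracy threshold used above, in Hartree


def final_error(trace: Sequence[Tuple[float, float]], e_ref: float) -> float:
    """Signed error of the last recorded energy against the reference."""
    return trace[-1][1] - e_ref


def time_to_chem_acc(
    trace: Sequence[Tuple[float, float]],
    e_ref: float,
    tol: float = CHEM_ACC_HA,
) -> Optional[float]:
    """Elapsed time of the first entry within `tol` of the reference.

    Returns None if the run never reached the tolerance.
    """
    for elapsed, energy in trace:
        if abs(energy - e_ref) <= tol:
            return elapsed
    return None


def sample_efficiency(
    n_sampled: int, weights: List[float], threshold: float
) -> float:
    """Configs sampled per unique determinant with |c|^2 >= threshold."""
    important = sum(1 for w in weights if w >= threshold)
    return n_sampled / important if important else float("inf")


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    trace = [(60.0, -1.000), (120.0, -1.004), (180.0, -1.0049)]
    print(time_to_chem_acc(trace, e_ref=-1.005))
    print(final_error(trace, e_ref=-1.005))
    print(sample_efficiency(1000, [0.5, 0.3, 1e-9], threshold=1e-6))
```

Returning `None` for runs that never reach chem-acc keeps those runs visible in the failure-mode analysis instead of silently dropping them.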

Output:

  • Pareto frontier plot per system size
  • Failure mode analysis per method
  • Reproducibility recipe (Docker image + scripts)
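
For the Pareto plot, the frontier can be extracted from `(label, wall_time, error)` points with a single sorted pass. This is a minimal sketch that treats both wall time and absolute error as quantities to minimize. The function name `pareto_frontier` and the sample points are hypothetical placeholders, not benchmark results.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, wall_time, abs_error)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by increasing wall time.

    A point is kept only if its error is strictly lower than that of
    every point that is at least as fast.
    """
    frontier: List[Point] = []
    best_error = float("inf")
    # Sort by time, breaking ties by error so the best tied point wins.
    for label, wall_time, error in sorted(points, key=lambda p: (p[1], p[2])):
        if error < best_error:
            frontier.append((label, wall_time, error))
            best_error = error
    return frontier


if __name__ == "__main__":
    # Placeholder data: (label, hours, error in mHa). Not real measurements.
    points = [
        ("A", 4.0, 1.0),
        ("B", 1.0, 1.5),
        ("C", 0.5, 5.0),
        ("D", 2.0, 3.0),
    ]
    for label, hours, err in pareto_frontier(points):
        print(f"{label}: {hours} h, {err} mHa")
```

In this toy input, D is dropped because B is both faster and more accurate; the remaining points trace the trade-off curve that would be plotted per system size.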

Why this is publishable

  1. Workshop venues (e.g., NeurIPS ML4Sci, ICML SciML, IEEE QCE) actively want benchmark papers
  2. The reproducibility crisis in NQS is real: multiple implementations claim chem-acc, but the results are rarely reproduced independently
  3. Fills a clear gap: a single literature search turned up no equivalent paper

Why this is NOT new method research

To be clear: this is engineering benchmarking, not algorithmic novelty. The target is a workshop venue, not a top-tier journal.

Effort: 3-6 months

  • ~1 month: implement / wrap each baseline in qvartools
  • ~1 month: run benchmarks on H200 cluster (nano4 access)
  • ~1 month: analysis + plotting + figure generation
  • ~1 month: writing + iterating

Risks

  • Each baseline implementation might require significant porting work
  • Authors of original methods might dispute "our reimplementation" — mitigate by using their official code where available
  • Reviewers might say "not novel enough" — counter with "but rigorous benchmarking is the literature gap we identify"
