EternWang · EternWang · Jun 11, 2026 · Jun 11, 2026
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,32 @@
+# Contributing
+
+LegalScope is currently a public research preview with a strict release boundary.
+Contributions are welcome for documentation, metadata cleanup, public helper
+code, tests, and release-process improvements.
+
+Please do not submit:
+
+- private workbooks;
+- full prompts or reference answers;
+- model-output matrices;
+- human review sheets or adjudication notes;
+- non-de-identified legal source documents;
+- provider credentials, local paths, logs, or API keys.
+
+## Development
+
+```bash
+python -m pip install -r requirements.txt
+python -m pytest -q
+```
+
+## Documentation Changes
+
+Keep public files aligned with the project name `LegalScope`. Historical
+internal names should not be introduced into public documentation.
+
+## Privacy Review
+
+Any proposed row-level data release should be reviewed for source licensing,
+de-identification, re-identification risk, provider terms, and paper review
+requirements before it is merged.
diff --git a/README.md b/README.md
@@ -1,40 +1,16 @@
-# LegalScope: Measuring Exam-to-Case Transfer in LLM Legal Reasoning
+# LegalScope
 
-LegalScope studies a simple question with high stakes for legal AI evaluation:
-do strong public legal-exam scores actually transfer to real-case legal reasoning?
+**LegalScope** is a benchmark for measuring whether large language model
+performance on public legal exams transfers to real-case legal analysis.
 
-I built LegalScope as an independent first-author benchmark project that pairs scalable
-public legal-exam tasks with lawyer-reviewed, de-identified Chinese civil judgment
-analysis. The public repository is intentionally a preview: it documents the research
-question, benchmark design, evaluation counts, scoring protocol, release boundary, and
-reproducible helper code without publishing the paper draft, full workbook, model
-outputs, human review sheets, or non-de-identified case materials.
+The project pairs a scalable public-exam track with a lawyer-reviewed,
+de-identified Chinese civil-judgment track. The public repository is a
+research release scaffold: it documents the benchmark design, evaluation
+counts, scoring protocol, public figures, metadata, and helper utilities while
+withholding private workbooks, full prompts, model outputs, human review
+sheets, and non-de-identified legal materials.
 
-## Start Here
-
-| If you want to understand... | Read |
-| --- | --- |
-| The research idea and motivation | [Project Brief](docs/PROJECT_BRIEF.md) |
-| Main empirical findings and figures | [Results Summary](docs/RESULTS_SUMMARY.md) |
-| Dataset scope and release boundary | [Data Card](docs/DATA_CARD.md) |
-| Scoring design | [Scoring Rubric](docs/SCORING_RUBRIC.md) |
-| Human validation protocol | [Annotation Protocol](docs/ANNOTATION_PROTOCOL.md) |
-
-## What I Contributed
-
-- Built a dual-track benchmark connecting public legal-exam evaluation with
-  lawyer-reviewed real-case legal analysis.
-- Designed a paired issue-stance protocol for Chinese civil judgments, so the same
-  factual background can be tested under supporting and opposing legal positions.
-- Developed two scoring protocols: reference-aware 0-4 exam scoring and a calibrated
-  real-case rubric for citation relevance, constraint extraction, and argument
-  validity.
-- Validated automated scores against human legal review and identified constraint
-  extraction as the main real-case failure mode.
-
-## Benchmark at a Glance
-
-<img src="assets/figures/paper_collection_pipeline.png" alt="LegalScope benchmark construction pipeline" width="920">
+## Snapshot
 
 | Component | Count |
 | --- | ---: |
@@ -45,71 +21,126 @@ outputs, human review sheets, or non-de-identified case materials.
 | Model groups evaluated | 20 |
 | Public-exam model responses | 17,360 |
 | Real-case model responses | 1,520 |
-| Total dataset model responses | 18,880 |
+| Total model responses | 18,880 |
 | Human-validation responses | 1,800 |
 
-The pipeline figure above is rendered from `8.pdf`, which is referenced by the paper
-source. The full paper PDF is not committed to this repository.
+<img src="assets/figures/paper_collection_pipeline.png" alt="LegalScope benchmark construction pipeline" width="920">
+
+## What LegalScope Tests
+
+LegalScope asks whether exam performance is a reliable proxy for applied legal
+reasoning. The public-exam track uses reference-aware open-ended legal-exam
+questions. The real-case track asks models to write stance-aware Chinese legal
+analysis over de-identified civil-judgment materials under closed-record
+constraints.
+
+The benchmark separates three questions that are often blurred together:
+
+1. How well do models answer public legal-exam questions?
+2. How well do the same models reason over bounded real-case legal facts?
+3. Do rankings, reasoning-mode gains, and evaluator agreement transfer across
+   those settings?
 
 ## Main Findings
 
 - Public-exam scores correlate with Chinese real-case scores at the model level
-  (Pearson `r = 0.835`, Spearman `rho = 0.661`), but rankings and reasoning-mode gains
-  do not transfer uniformly.
-- Real-case legal reasoning exposes a constraint-extraction bottleneck: models write
-  fluent legal arguments more easily than they recover the operative legal and factual
-  conditions that control those arguments.
+  (Pearson `r = 0.835`, Spearman `rho = 0.661`), but the transfer is incomplete.
+- Real-case legal reasoning exposes a constraint-extraction bottleneck: models
+  often produce fluent legal prose while missing operative rule conditions,
+  factual boundaries, stance requirements, or evidence limits.
 - Automated evaluation aligns strongly with human review on public-exam answers
   (answer-level Pearson `r = 0.925`) but weakens on real-case analysis
-  (`r = 0.450`), showing why expert-grounded evaluation remains important.
+  (`r = 0.450`), motivating expert-grounded validation for high-stakes legal
+  evaluation.
 
 ## Repository Map
 
-```text
-assets/figures/
-  paper_collection_pipeline.png
-  paper_score_distribution.png
-  paper_transfer_model_judge.png
-  paper_transfer_human.png
-data/
-  README.md
-  metadata/dataset_summary.json
-  metadata/model_groups.csv
-  metadata/source_composition.csv
-  sample/README.md
-docs/
-  PROJECT_BRIEF.md
-  RESULTS_SUMMARY.md
-  DATA_CARD.md
-  SCORING_RUBRIC.md
-  ANNOTATION_PROTOCOL.md
-  AI_WORKFLOW.md
-  FIGURE_SOURCES.md
-  RELEASE_STATUS.md
-scripts/
-  extract_public_sample.py
-src/legalscope/
-  workbook.py
-tests/
-  test_workbook.py
+| Path | Purpose |
+| --- | --- |
+| `assets/figures/` | Public-safe figures rendered from paper figure sources. |
+| `data/metadata/` | Machine-readable counts, model groups, and source composition. |
+| `data/sample/` | Reserved for public row-level samples after release review. |
+| `docs/PROJECT_BRIEF.md` | Research question, benchmark design, and paper-facing summary. |
+| `docs/RESULTS_SUMMARY.md` | Public-facing result figures and transfer metrics. |
+| `docs/DATA_CARD.md` | Dataset scope, intended use, limitations, and release boundary. |
+| `docs/SCORING_RUBRIC.md` | Public-exam and real-case scoring protocols. |
+| `docs/ANNOTATION_PROTOCOL.md` | Human-validation protocol and review focus. |
+| `docs/AI_WORKFLOW.md` | AI-assisted workflow and human-control safeguards. |
+| `docs/PROVENANCE.md` | Public-safe construction and processing provenance. |
+| `docs/HUGGINGFACE_RELEASE.md` | Hugging Face dataset-card release plan. |
+| `scripts/` | Lightweight public helper scripts. |
+| `src/legalscope/` | Small Python utilities for authorized local workbooks. |
+| `tests/` | Unit tests for public helpers. |
+
+## Quickstart
+
+Install the dependencies and run the test suite from a local checkout:
+
+```bash
+python -m pip install -r requirements.txt
+python -m pytest -q
+```
+
+Inspect the public metadata:
+
+```bash
+python - <<'PY'
+import json
+from pathlib import Path
+
+summary = json.loads(Path("data/metadata/dataset_summary.json").read_text(encoding="utf-8"))
+print(summary["project"])
+print(summary["counts"]["dataset_model_responses_total"])
+PY
+```
+
+Use the workbook helpers only with authorized local workbooks:
+
+```python
+from legalscope.workbook import summarize_workbook
+
+for sheet in summarize_workbook("private_authorized_workbook.xlsx"):
+    print(sheet.title, sheet.data_rows, sheet.model_count)
 ```
 
 ## Public Release Boundary
 
 This repository does not publish:
 
-- the paper draft or PDF;
+- the paper draft, review PDF, or LaTeX source;
 - the full benchmark workbook;
-- complete prompts, reference answers, model answers, or row-level model-output
-  matrices;
-- lawyer review sheets or adjudication notes;
+- complete prompt matrices, reference answers, model answers, or row-level
+  model-output tables;
+- human review sheets, adjudication notes, or reviewer annotations;
 - non-de-identified judgments or private source documents.
 
-The public code is a reproducibility scaffold for collaborators with authorized local
-access to the private workbook. It is not enough to reconstruct the full benchmark from
-the public repository alone.
+The public materials are sufficient to understand the research design, scope,
+counts, release boundary, and public-facing results. They are not sufficient to
+reconstruct the full benchmark.
+
+## Hugging Face Release
+
+A public preview prepared for a Hugging Face dataset card is described in
+[`docs/HUGGINGFACE_RELEASE.md`](docs/HUGGINGFACE_RELEASE.md). The recommended
+first Hub release is a metadata-only preview containing this README-style
+dataset card plus the public metadata files. Row-level samples should be added
+only after source redistribution, privacy, and review constraints are cleared.
+
+## Citation
+
+Citation metadata is provided in [`CITATION.cff`](CITATION.cff). The final paper
+citation should replace the placeholder citation once the paper has a stable
+public identifier.
+
+## License
+
+Code and public documentation in this repository are released under the MIT
+License unless otherwise noted. This license does not grant redistribution
+rights for withheld source documents, full workbooks, model outputs, or private
+review materials.
 
 ## Disclaimer
 
-LegalScope is a research benchmark for model evaluation. It is not legal advice, a
-legal research product, or a substitute for jurisdiction-specific legal review.
+LegalScope is a research benchmark for model evaluation. It is not legal advice,
+a legal research product, or a substitute for jurisdiction-specific legal
+review.
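
The Main Findings in the README diff above report both a Pearson and a Spearman coefficient for model-level transfer. A minimal sketch of how the two can diverge is given here, assuming made-up toy scores (not LegalScope data) and hypothetical helper names `pearson`, `ranks`, and `spearman`. It is an illustration of the two statistics, not the project's analysis code.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy scores for five hypothetical models (illustrative only).
# One strong model dominates both tracks; the other four are shuffled.
exam_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
case_scores = [1.4, 1.0, 1.3, 1.1, 5.0]

print(f"Pearson r    = {pearson(exam_scores, case_scores):.3f}")
print(f"Spearman rho = {spearman(exam_scores, case_scores):.3f}")
```

In this toy setup a single high-scoring model pulls the linear correlation up while the rank order among the remaining models is scrambled, which is one way a Pearson value can exceed a Spearman value. Whether that pattern explains the reported LegalScope coefficients is not something the public materials establish.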
diff --git a/docs/AI_WORKFLOW.md b/docs/AI_WORKFLOW.md
@@ -1,38 +1,78 @@
 # AI-Assisted Research Workflow
 
-LegalScope uses LLMs as research tools while keeping source selection, legal review,
-release decisions, and paper claims under human control.
+LegalScope uses AI systems as research tools while keeping dataset design,
+source selection, de-identification, legal review, scoring decisions, release
+decisions, and paper claims under human control.
 
 ## Pipeline Overview
 
-1. Collect public legal-exam sources and de-identified civil-judgment materials.
+1. Collect public legal-exam sources and candidate Chinese civil-judgment
+   materials.
 2. Parse, normalize, redact, deduplicate, and audit source records.
-3. Build standardized public-exam and real-case prompt templates.
+3. Construct public-exam prompts and real-case issue-stance prompts.
 4. Generate model answers across 20 model groups.
-5. Score public-exam answers with reference-aware scoring.
-6. Score real-case answers with the A/B/C legal-reasoning rubric.
-7. Validate selected rows against human legal review.
-8. Analyze transfer, human agreement, length effects, and error patterns.
+5. Score public-exam answers with reference-aware 0-4 scoring.
+6. Score real-case answers with the citation, constraint, and argument rubric.
+7. Calibrate the real-case rubric against human legal review.
+8. Analyze exam-to-case transfer, human agreement, score distributions, length
+   effects, and error patterns.
+9. Prepare public documentation and metadata while withholding sensitive
+   artifacts.
+
+## Public-Safe Script Provenance
+
+The Drive public-material folder documents 78 copied and annotated pipeline
+scripts grouped into five stages:
+
+| Stage | Scripts | Public-safe description |
+| --- | ---: | --- |
+| Public bar source collection and cleaning | 16 | Collection, parsing, normalization, duplicate repair, reference repair, and source-audit scripts. |
+| Chinese real-case prompt construction | 7 | Judgment preview, issue/stance prompt construction, repair, rerun, and workbook writeback scripts. |
+| Model answer generation | 17 | Model catalog, provider runners, batch launchers, answer merge, and answer writeback scripts. |
+| Scoring and regrading | 29 | Public-exam scoring, real-case rubric calibration, blind packets, V2 scoring, and validation utilities. |
+| Final conversion, translation, and release | 9 | Workbook-to-JSON conversion, English cleanup, metadata repair, and public release packaging. |
+
+Some internal scripts retain absolute paths or require provider credentials.
+They should be treated as provenance records and rerun only after path,
+credential, privacy, and redistribution checks.
 
 ## Where AI Assistance Is Used
 
-AI tools may help draft transformation code, normalize text, prepare prompt templates,
-generate model answers under controlled settings, and identify candidate failure modes
-for inspection.
+AI tools may help:
+
+- draft transformation code;
+- normalize and translate text;
+- prepare prompt templates;
+- generate model answers under controlled settings;
+- score answers according to documented rubrics;
+- identify candidate failure modes for human inspection;
+- prepare release documentation.
+
+## Human-Controlled Steps
+
+AI tools do not replace:
 
-AI tools do not replace source-selection decisions, de-identification review, final
-legal judgment, manuscript claims, licensing review, or release decisions.
+- source-selection decisions;
+- privacy and de-identification review;
+- legal-domain review;
+- final scoring policy;
+- human-validation judgments;
+- licensing and redistribution decisions;
+- paper claims and release decisions.
 
 ## Safeguards
 
-- De-identification before public release.
-- Separate scorer-side references and prompt-facing text.
-- Stance and closed-book constraints for real-case prompts.
-- Human validation for selected public-exam and real-case rows.
-- Public release boundary for full prompts, model outputs, and review sheets.
+- Separate public prompt-facing text from scorer-side references.
+- Keep real-case prompts closed-record and stance-constrained.
+- Mask names, institutions, identifiers, and other sensitive details before
+  public release.
+- Validate selected rows with human legal review.
+- Preserve a clear public/private release boundary.
+- Withhold full prompts, model outputs, human review sheets, and non-de-identified
+  source documents until a later release review.
 
 ## Public Repository Boundary
 
-This repository keeps documentation, selected paper figures, high-level metadata, and
-small workbook utilities. Full data and review artifacts remain private until privacy,
-licensing, and review constraints are resolved.
+This repository includes documentation, selected figures, high-level metadata,
+and lightweight helper code. The full data workflow remains private until
+privacy, licensing, and review constraints are resolved.
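
The workflow document above notes that some internal scripts retain absolute paths or require provider credentials and should be rerun only after path and credential checks. A minimal sketch of a coarse first-pass scan is given here, assuming hypothetical regex patterns and a hypothetical `scan_text` helper; none of these names come from the repository, and the scan does not replace manual privacy or redistribution review.

```python
import re

# Hypothetical patterns: a real check would be tuned to the project's
# platforms and providers. These are assumptions for illustration.
PATTERNS = {
    # Windows drive prefixes or common Unix home/mount roots.
    "absolute_path": re.compile(r"(?:[A-Za-z]:\\|/(?:home|Users|mnt|root)/)[^\s'\"]+"),
    # Assignments such as API_KEY = "..." or token: '...'.
    "credential": re.compile(
        r"(?i)(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
}


def scan_text(text):
    """Return (line_number, label) pairs for suspicious lines.

    The matched text is deliberately not returned, so that a real
    credential is never echoed into logs or review comments.
    """
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


# Made-up script content for demonstration.
sample = "\n".join([
    'OUT_DIR = "/home/alice/legal/run1"',
    'API_KEY = "not-a-real-key"',
    'rows = load_rows("data/metadata/model_groups.csv")',
])

for lineno, label in scan_text(sample):
    print(f"line {lineno}: possible {label}")
```

Reporting only the line number and label keeps the scanner itself from leaking what it finds. Regex scans of this kind miss obfuscated or indirectly loaded secrets, so a clean result is a starting point rather than a clearance.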