diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..57ddc41
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,32 @@
+# Contributing
+
+LegalScope is currently a public research preview with a strict release boundary.
+Contributions are welcome for documentation, metadata cleanup, public helper
+code, tests, and release-process improvements.
+
+Please do not submit:
+
+- private workbooks;
+- full prompts or reference answers;
+- model-output matrices;
+- human review sheets or adjudication notes;
+- non-de-identified legal source documents;
+- provider credentials, local paths, logs, or API keys.
+
+## Development
+
+```bash
+python -m pip install -r requirements.txt
+python -m pytest -q
+```
+
+## Documentation Changes
+
+Keep public files aligned with the project name `LegalScope`. Historical
+internal names should not be introduced into public documentation.
+
+## Privacy Review
+
+Any proposed row-level data release should be reviewed for source licensing,
+de-identification, re-identification risk, provider terms, and paper-review
+anonymity requirements before it is merged.
diff --git a/README.md b/README.md
index fc9d160..06a8b3d 100644
--- a/README.md
+++ b/README.md
@@ -1,40 +1,16 @@
-# LegalScope: Measuring Exam-to-Case Transfer in LLM Legal Reasoning
+# LegalScope
-LegalScope studies a simple question with high stakes for legal AI evaluation:
-do strong public legal-exam scores actually transfer to real-case legal reasoning?
+**LegalScope** is a benchmark for measuring whether large language model
+performance on public legal exams transfers to real-case legal analysis.
-I built LegalScope as an independent first-author benchmark project that pairs scalable
-public legal-exam tasks with lawyer-reviewed, de-identified Chinese civil judgment
-analysis. The public repository is intentionally a preview: it documents the research
-question, benchmark design, evaluation counts, scoring protocol, release boundary, and
-reproducible helper code without publishing the paper draft, full workbook, model
-outputs, human review sheets, or non-de-identified case materials.
+The project pairs a scalable public-exam track with a lawyer-reviewed,
+de-identified Chinese civil-judgment track. The public repository is a
+research release scaffold: it documents the benchmark design, evaluation
+counts, scoring protocol, public figures, metadata, and helper utilities while
+withholding private workbooks, full prompts, model outputs, human review
+sheets, and non-de-identified legal materials.
-## Start Here
-
-| If you want to understand... | Read |
-| --- | --- |
-| The research idea and motivation | [Project Brief](docs/PROJECT_BRIEF.md) |
-| Main empirical findings and figures | [Results Summary](docs/RESULTS_SUMMARY.md) |
-| Dataset scope and release boundary | [Data Card](docs/DATA_CARD.md) |
-| Scoring design | [Scoring Rubric](docs/SCORING_RUBRIC.md) |
-| Human validation protocol | [Annotation Protocol](docs/ANNOTATION_PROTOCOL.md) |
-
-## What I Contributed
-
-- Built a dual-track benchmark connecting public legal-exam evaluation with
- lawyer-reviewed real-case legal analysis.
-- Designed a paired issue-stance protocol for Chinese civil judgments, so the same
- factual background can be tested under supporting and opposing legal positions.
-- Developed two scoring protocols: reference-aware 0-4 exam scoring and a calibrated
- real-case rubric for citation relevance, constraint extraction, and argument
- validity.
-- Validated automated scores against human legal review and identified constraint
- extraction as the main real-case failure mode.
-
-## Benchmark at a Glance
-
-
+## Snapshot
| Component | Count |
| --- | ---: |
@@ -45,71 +21,126 @@ outputs, human review sheets, or non-de-identified case materials.
| Model groups evaluated | 20 |
| Public-exam model responses | 17,360 |
| Real-case model responses | 1,520 |
-| Total dataset model responses | 18,880 |
+| Total model responses | 18,880 |
| Human-validation responses | 1,800 |
-The pipeline figure above is rendered from `8.pdf`, which is referenced by the paper
-source. The full paper PDF is not committed to this repository.
+
+
+## What LegalScope Tests
+
+LegalScope asks whether exam performance is a reliable proxy for applied legal
+reasoning. The public-exam track uses reference-aware open-ended legal-exam
+questions. The real-case track asks models to write stance-aware Chinese legal
+analysis over de-identified civil-judgment materials under closed-record
+constraints.
+
+The benchmark separates three questions that are often blurred together:
+
+1. How well do models answer public legal-exam questions?
+2. How well do the same models reason over bounded real-case legal facts?
+3. Do rankings, reasoning-mode gains, and evaluator agreement transfer across
+ those settings?
## Main Findings
- Public-exam scores correlate with Chinese real-case scores at the model level
- (Pearson `r = 0.835`, Spearman `rho = 0.661`), but rankings and reasoning-mode gains
- do not transfer uniformly.
-- Real-case legal reasoning exposes a constraint-extraction bottleneck: models write
- fluent legal arguments more easily than they recover the operative legal and factual
- conditions that control those arguments.
+ (Pearson `r = 0.835`, Spearman `rho = 0.661`), but the transfer is incomplete.
+- Real-case legal reasoning exposes a constraint-extraction bottleneck: models
+ often produce fluent legal prose while missing operative rule conditions,
+ factual boundaries, stance requirements, or evidence limits.
- Automated evaluation aligns strongly with human review on public-exam answers
(answer-level Pearson `r = 0.925`) but weakens on real-case analysis
- (`r = 0.450`), showing why expert-grounded evaluation remains important.
+ (`r = 0.450`), motivating expert-grounded validation for high-stakes legal
+ evaluation.
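
The gap between the Pearson and Spearman values in the first finding reflects that linear association and rank agreement are different properties. The sketch below illustrates this on made-up toy scores, which are not LegalScope results. The helper names `pearson`, `average_ranks`, and `spearman` are illustrative and are not part of the `legalscope` package.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy, made-up model-level scores (NOT LegalScope results).
exam_scores = [1.0, 2.0, 3.0, 4.0, 10.0]
case_scores = [2.0, 1.0, 4.0, 3.0, 10.0]

print(f"Pearson r    = {pearson(exam_scores, case_scores):.3f}")
print(f"Spearman rho = {spearman(exam_scores, case_scores):.3f}")
```

In this toy case, one model that is far ahead on both tracks keeps the linear correlation high, while two swapped neighbouring pairs lower the rank correlation. That is one way model-level scores can correlate strongly even when rankings do not fully transfer.
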
## Repository Map
-```text
-assets/figures/
- paper_collection_pipeline.png
- paper_score_distribution.png
- paper_transfer_model_judge.png
- paper_transfer_human.png
-data/
- README.md
- metadata/dataset_summary.json
- metadata/model_groups.csv
- metadata/source_composition.csv
- sample/README.md
-docs/
- PROJECT_BRIEF.md
- RESULTS_SUMMARY.md
- DATA_CARD.md
- SCORING_RUBRIC.md
- ANNOTATION_PROTOCOL.md
- AI_WORKFLOW.md
- FIGURE_SOURCES.md
- RELEASE_STATUS.md
-scripts/
- extract_public_sample.py
-src/legalscope/
- workbook.py
-tests/
- test_workbook.py
+| Path | Purpose |
+| --- | --- |
+| `assets/figures/` | Public-safe figures rendered from paper figure sources. |
+| `data/metadata/` | Machine-readable counts, model groups, and source composition. |
+| `data/sample/` | Reserved for public row-level samples after release review. |
+| `docs/PROJECT_BRIEF.md` | Research question, benchmark design, and paper-facing summary. |
+| `docs/RESULTS_SUMMARY.md` | Public-facing result figures and transfer metrics. |
+| `docs/DATA_CARD.md` | Dataset scope, intended use, limitations, and release boundary. |
+| `docs/SCORING_RUBRIC.md` | Public-exam and real-case scoring protocols. |
+| `docs/ANNOTATION_PROTOCOL.md` | Human-validation protocol and review focus. |
+| `docs/AI_WORKFLOW.md` | AI-assisted workflow and human-control safeguards. |
+| `docs/PROVENANCE.md` | Public-safe construction and processing provenance. |
+| `docs/HUGGINGFACE_RELEASE.md` | Hugging Face dataset-card release plan. |
+| `scripts/` | Lightweight public helper scripts. |
+| `src/legalscope/` | Small Python utilities for authorized local workbooks. |
+| `tests/` | Unit tests for public helpers. |
+
+## Quickstart
+
+Install the dependencies and run the test suite from a local checkout:
+
+```bash
+python -m pip install -r requirements.txt
+python -m pytest -q
+```
+
+Inspect the public metadata:
+
+```bash
+python - <<'PY'
+import json
+from pathlib import Path
+
+summary = json.loads(Path("data/metadata/dataset_summary.json").read_text(encoding="utf-8"))
+print(summary["project"])
+print(summary["counts"]["dataset_model_responses_total"])
+PY
+```
+
+Use the workbook helpers only with authorized local workbooks:
+
+```python
+from legalscope.workbook import summarize_workbook
+
+for sheet in summarize_workbook("private_authorized_workbook.xlsx"):
+ print(sheet.title, sheet.data_rows, sheet.model_count)
```
## Public Release Boundary
This repository does not publish:
-- the paper draft or PDF;
+- the paper draft, review PDF, or LaTeX source;
- the full benchmark workbook;
-- complete prompts, reference answers, model answers, or row-level model-output
- matrices;
-- lawyer review sheets or adjudication notes;
+- complete prompt matrices, reference answers, model answers, or row-level
+ model-output tables;
+- human review sheets, adjudication notes, or reviewer annotations;
- non-de-identified judgments or private source documents.
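
One way to guard this boundary before a commit or upload is a file-name scan. The sketch below is an illustration under stated assumptions, not a project tool. The function `find_blocked_files` and the patterns in `BLOCKED_PATTERNS` are hypothetical and would need to be adapted to the actual names of withheld artifacts.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical file-name patterns for withheld artifacts; adjust to real names.
BLOCKED_PATTERNS = ["*.xlsx", "*.xls", "*.tex", "*.log", ".env", "*review_sheet*"]


def find_blocked_files(root):
    """Return relative paths under root whose file name matches a blocked pattern."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Compare the lowercased name so that e.g. "Workbook.XLSX" is also caught.
        if path.is_file() and any(fnmatch(path.name.lower(), p) for p in BLOCKED_PATTERNS):
            hits.append(path.relative_to(root).as_posix())
    return hits


# Scan the current checkout and list anything that looks like a withheld artifact.
for name in find_blocked_files("."):
    print(f"blocked: {name}")
```

A name-based scan only catches obvious mistakes. It does not inspect file contents, so it complements the privacy and redistribution review and does not replace it.
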
-The public code is a reproducibility scaffold for collaborators with authorized local
-access to the private workbook. It is not enough to reconstruct the full benchmark from
-the public repository alone.
+The public materials are sufficient to understand the research design, scope,
+counts, release boundary, and public-facing results. They are not sufficient to
+reconstruct the full benchmark.
+
+## Hugging Face Release
+
+A public preview for the Hugging Face Hub is described in
+[`docs/HUGGINGFACE_RELEASE.md`](docs/HUGGINGFACE_RELEASE.md). The recommended
+first Hub release is metadata-only: a README-style dataset card plus the public
+metadata files. Row-level samples should be added only after source
+redistribution, privacy, and review constraints are cleared.
+
+## Citation
+
+Citation metadata is provided in [`CITATION.cff`](CITATION.cff). The final paper
+citation should replace the placeholder citation once the paper has a stable
+public identifier.
+
+## License
+
+Code and public documentation in this repository are released under the MIT
+License unless otherwise noted. This license does not grant redistribution
+rights for withheld source documents, full workbooks, model outputs, or private
+review materials.
## Disclaimer
-LegalScope is a research benchmark for model evaluation. It is not legal advice, a
-legal research product, or a substitute for jurisdiction-specific legal review.
+LegalScope is a research benchmark for model evaluation. It is not legal advice,
+a legal research product, or a substitute for jurisdiction-specific legal
+review.
diff --git a/docs/AI_WORKFLOW.md b/docs/AI_WORKFLOW.md
index a5f74c1..e87115b 100644
--- a/docs/AI_WORKFLOW.md
+++ b/docs/AI_WORKFLOW.md
@@ -1,38 +1,78 @@
# AI-Assisted Research Workflow
-LegalScope uses LLMs as research tools while keeping source selection, legal review,
-release decisions, and paper claims under human control.
+LegalScope uses AI systems as research tools while keeping dataset design,
+source selection, de-identification, legal review, scoring decisions, release
+decisions, and paper claims under human control.
## Pipeline Overview
-1. Collect public legal-exam sources and de-identified civil-judgment materials.
+1. Collect public legal-exam sources and candidate Chinese civil-judgment
+ materials.
2. Parse, normalize, redact, deduplicate, and audit source records.
-3. Build standardized public-exam and real-case prompt templates.
+3. Construct public-exam prompts and real-case issue-stance prompts.
4. Generate model answers across 20 model groups.
-5. Score public-exam answers with reference-aware scoring.
-6. Score real-case answers with the A/B/C legal-reasoning rubric.
-7. Validate selected rows against human legal review.
-8. Analyze transfer, human agreement, length effects, and error patterns.
+5. Score public-exam answers with reference-aware 0-4 scoring.
+6. Score real-case answers with the citation, constraint, and argument rubric.
+7. Calibrate the real-case rubric against human legal review.
+8. Analyze exam-to-case transfer, human agreement, score distributions, length
+ effects, and error patterns.
+9. Prepare public documentation and metadata while withholding sensitive
+ artifacts.
+
+## Public-Safe Script Provenance
+
+The public Drive materials document 78 copied and annotated pipeline scripts,
+grouped into five stages:
+
+| Stage | Scripts | Public-safe description |
+| --- | ---: | --- |
+| Public bar source collection and cleaning | 16 | Collection, parsing, normalization, duplicate repair, reference repair, and source-audit scripts. |
+| Chinese real-case prompt construction | 7 | Judgment preview, issue/stance prompt construction, repair, rerun, and workbook writeback scripts. |
+| Model answer generation | 17 | Model catalog, provider runners, batch launchers, answer merge, and answer writeback scripts. |
+| Scoring and regrading | 29 | Public-exam scoring, real-case rubric calibration, blind packets, V2 scoring, and validation utilities. |
+| Final conversion, translation, and release | 9 | Workbook-to-JSON conversion, English cleanup, metadata repair, and public release packaging. |
+
+Some internal scripts retain absolute paths or require provider credentials.
+They should be treated as provenance records and rerun only after path,
+credential, privacy, and redistribution checks.
## Where AI Assistance Is Used
-AI tools may help draft transformation code, normalize text, prepare prompt templates,
-generate model answers under controlled settings, and identify candidate failure modes
-for inspection.
+AI tools may help:
+
+- draft transformation code;
+- normalize and translate text;
+- prepare prompt templates;
+- generate model answers under controlled settings;
+- score answers according to documented rubrics;
+- identify candidate failure modes for human inspection;
+- prepare release documentation.
+
+## Human-Controlled Steps
+
+AI tools do not replace:
-AI tools do not replace source-selection decisions, de-identification review, final
-legal judgment, manuscript claims, licensing review, or release decisions.
+- source-selection decisions;
+- privacy and de-identification review;
+- legal-domain review;
+- final scoring policy;
+- human-validation judgments;
+- licensing and redistribution decisions;
+- paper claims and release decisions.
## Safeguards
-- De-identification before public release.
-- Separate scorer-side references and prompt-facing text.
-- Stance and closed-book constraints for real-case prompts.
-- Human validation for selected public-exam and real-case rows.
-- Public release boundary for full prompts, model outputs, and review sheets.
+- Separate public prompt-facing text from scorer-side references.
+- Keep real-case prompts closed-record and stance-constrained.
+- Mask names, institutions, identifiers, and other sensitive details before
+ public release.
+- Validate selected rows with human legal review.
+- Preserve a clear public/private release boundary.
+- Withhold full prompts, model outputs, human review sheets, and non-de-identified
+ source documents until a later release review.
## Public Repository Boundary
-This repository keeps documentation, selected paper figures, high-level metadata, and
-small workbook utilities. Full data and review artifacts remain private until privacy,
-licensing, and review constraints are resolved.
+This repository includes documentation, selected figures, high-level metadata,
+and lightweight helper code. The full data workflow remains private until
+privacy, licensing, and review constraints are resolved.
diff --git a/docs/DATA_CARD.md b/docs/DATA_CARD.md
index 4000440..61f6844 100644
--- a/docs/DATA_CARD.md
+++ b/docs/DATA_CARD.md
@@ -4,13 +4,19 @@
LegalScope.
-## Purpose
+## Summary
-LegalScope evaluates whether LLM performance on public legal-exam tasks transfers to
-practice-oriented legal reasoning over de-identified Chinese civil judgments. The
-benchmark separates reference-answer scoring from case-based rubric scoring so that
-exam performance, real-case performance, human validation, and transfer can be
-studied separately.
+LegalScope evaluates whether public legal-exam performance transfers to
+practice-oriented legal reasoning over de-identified Chinese civil judgments.
+The benchmark has two coordinated tracks:
+
+- a public legal-exam track with reference-aware open-ended scoring;
+- a real-case legal-analysis track with stance-aware, closed-record prompts
+ derived from de-identified Chinese civil judgments.
+
+The public repository is a metadata and documentation preview. It intentionally
+does not publish the full workbook, row-level prompt matrix, model outputs,
+human review sheets, or private legal source documents.
## Benchmark Composition
@@ -18,67 +24,109 @@ studied separately.
| --- | ---: |
| Public legal-exam items | 868 |
| Real-case issue-stance prompts | 76 |
-| Total dataset items | 944 |
+| Total benchmark items | 944 |
| Model groups | 20 |
| Public-exam model responses | 17,360 |
| Real-case model responses | 1,520 |
-| Total dataset model responses | 18,880 |
+| Total model responses | 18,880 |
| Human-scored public-exam items | 80 |
| Human-scored real-case prompts | 10 |
| Human-validation responses | 1,800 |
| De-identified Chinese civil judgments | 15 |
| Real-case legal issues | 38 |
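
The response counts in the table are consistent with one response per item per model group. A minimal sketch of that arithmetic, using only the numbers shown in the table (the variable names are illustrative):

```python
# Counts copied from the Benchmark Composition table.
public_exam_items = 868
real_case_prompts = 76
model_groups = 20
human_scored_exam_items = 80
human_scored_case_prompts = 10

# Assumption: each model group produces one response per item or prompt.
public_exam_responses = public_exam_items * model_groups
real_case_responses = real_case_prompts * model_groups
total_items = public_exam_items + real_case_prompts
total_responses = public_exam_responses + real_case_responses
human_validation_responses = (
    human_scored_exam_items + human_scored_case_prompts
) * model_groups

# Each derived value should match the corresponding table row.
assert total_items == 944
assert public_exam_responses == 17_360
assert real_case_responses == 1_520
assert total_responses == 18_880
assert human_validation_responses == 1_800

print(total_items, total_responses, human_validation_responses)
```

The same check could be run against the machine-readable summary once the relevant keys are confirmed.
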
-See `data/metadata/dataset_summary.json` for the machine-readable summary.
+See [`data/metadata/dataset_summary.json`](../data/metadata/dataset_summary.json)
+for a machine-readable summary.
+
+## Data Structure
+
+### Public Legal-Exam Track
+
+The public-exam track contains open-ended questions drawn from public legal-exam
+materials across multiple jurisdictions. Answers are scored against reference
+answers with a 0-4 protocol that rewards issue recognition, rule identification,
+application, and conclusion alignment.
+
+The public repository exposes only aggregate metadata for this track. Full
+question text, reference answers, and model answers are withheld pending source
+redistribution review.
-## Splits
+### Real-Case Legal-Analysis Track
-### Public Legal-Exam Split
+The real-case track contains issue-stance prompts derived from 15 de-identified
+Chinese civil judgments and 38 legal issues. Many issues are paired into
+support/opposition prompts so that the same bounded factual context can be used
+to test whether models can construct statute-grounded arguments under an
+assigned stance.
-The public-exam split contains open-ended questions from public legal-exam materials.
-It is scored with a reference-aware 0-4 answer-match protocol. The split covers U.S.,
-China, U.K., and Australia sources.
+The track is scored on:
-### Chinese Real-Case Split
+- citation relevance;
+- constraint extraction;
+- argument validity.
-The real-case split contains issue-stance prompts derived from de-identified Chinese
-civil judgments. Each prompt asks the model to reason from a structured case setting
-under a specified stance. It is scored across citation relevance, constraint
-extraction, and argument validity.
+The public repository does not include non-de-identified judgments, full
+prompts, hidden legal references, row-level model answers, or review notes.
### Human Validation
-The human-validation subset covers 80 public-exam items and 10 real-case prompts
-across the same 20 model groups. It is used to compare automated/model-judge scores
-with human legal review.
+Human validation covers 80 public-exam items and 10 real-case prompts across the
+same 20 model groups, for 1,800 human-validation responses. Human review is used
+to calibrate and audit automated scoring, especially for real-case legal
+analysis where expert judgment remains important.
-## Public Release Boundary
+## Source Composition
-The repository exposes only high-level metadata, documentation, selected paper figures,
-and lightweight workbook utilities. It does not include the full workbook, full prompts,
-reference answers, model-output matrices, human review sheets, or private source
-documents.
+Public metadata includes:
+
+- jurisdiction and domain counts for the public-exam track;
+- legal-domain counts for the real-case track;
+- model-group names used in the evaluation tables;
+- aggregate transfer and human-validation metrics.
+
+See [`data/metadata/source_composition.csv`](../data/metadata/source_composition.csv)
+and [`data/metadata/model_groups.csv`](../data/metadata/model_groups.csv).
## Intended Uses
- Studying legal benchmark design.
-- Inspecting how exam and real-case evaluation settings differ.
-- Reviewing documentation for high-stakes LLM evaluation workflows.
-- Reusing lightweight workbook helpers in a private, properly licensed workspace.
+- Comparing public-exam and real-case evaluation settings.
+- Auditing release boundaries for high-stakes legal NLP datasets.
+- Reusing public helper utilities with authorized local workbooks.
+- Preparing a later full artifact release after privacy, license, and review
+ checks.
## Out-of-Scope Uses
-- Legal advice.
-- Ranking lawyers, courts, litigants, institutions, or jurisdictions.
-- Training or deploying legal decision systems from these materials.
-- Redistributing source documents, full prompts, or model outputs without release
+- Legal advice or legal decision support.
+- Ranking courts, lawyers, litigants, institutions, or jurisdictions.
+- Training or deploying legal decision systems from the public preview.
+- Reconstructing private workbooks, source documents, model outputs, or human
+ review sheets.
+- Redistributing source materials without independent license and privacy
review.
+## Privacy and De-identification
+
+Real-case materials are derived from de-identified Chinese civil judgments.
+Non-de-identified judgments and private source files are excluded from the
+public repository. Any row-level release should pass a separate review for
+personal names, institution names, addresses, identifiers, case-specific
+re-identification risk, and hidden evidence references.
+
## Known Limitations
-- The real-case split is focused on Chinese civil judgments and is not a general legal
- practice benchmark.
-- Public-exam and real-case tasks use different scoring regimes.
-- Some source materials may have licensing or redistribution constraints.
-- Human validation is a subset of the full evaluation matrix, not a complete manual
- relabeling of all model responses.
+- The real-case track is focused on Chinese civil judgments, not all legal
+ practice settings.
+- Public-exam and real-case tracks use different scoring regimes.
+- The public preview documents aggregate results but does not expose all
+ row-level evidence needed for independent replication.
+- Human validation covers a subset of responses rather than a full manual
+ relabeling of the entire model-output matrix.
+- Source redistribution rights may differ across public-exam sources and
+ judgment-derived materials.
+
+## Version
+
+This card describes the paper-submission benchmark snapshot represented by the
+public LegalScope repository.
diff --git a/docs/HUGGINGFACE_RELEASE.md b/docs/HUGGINGFACE_RELEASE.md
new file mode 100644
index 0000000..281630a
--- /dev/null
+++ b/docs/HUGGINGFACE_RELEASE.md
@@ -0,0 +1,104 @@
+# Hugging Face Release Plan
+
+This document describes the recommended Hugging Face public preview for
+LegalScope.
+
+## Recommended Repository
+
+- Repository type: dataset
+- Suggested repo id: `EternWang/LegalScope`
+- Public title: `LegalScope`
+- Release mode: metadata-only public preview
+
+The first Hub release should mirror the public GitHub release boundary. It
+should include the dataset card and public metadata files, not the full
+workbook, prompts, model outputs, human review sheets, or private legal source
+documents.
+
+## Files to Upload
+
+```text
+README.md
+data/metadata/dataset_summary.json
+data/metadata/model_groups.csv
+data/metadata/source_composition.csv
+```
+
+Optional after review:
+
+```text
+data/sample/public_preview.jsonl
+data/sample/README.md
+```
+
+## Dataset Card Requirements
+
+The Hugging Face `README.md` should include:
+
+- YAML metadata for discoverability;
+- a concise dataset summary;
+- benchmark composition counts;
+- data structure and release boundary;
+- intended and out-of-scope uses;
+- privacy, de-identification, and licensing notes;
+- citation instructions;
+- links back to GitHub and the paper once available.
+
+## Suggested Metadata
+
+```yaml
+---
+language:
+- en
+- zh
+license: mit
+pretty_name: LegalScope
+task_categories:
+- question-answering
+- text-generation
+tags:
+- legal
+- legal-reasoning
+- benchmark
+- llm-evaluation
+- exam-to-case-transfer
+- chinese-civil-judgments
+- text
+size_categories:
+- n<1K
+---
+```
+
+The `n<1K` value reflects the public metadata preview files, not the withheld
+full model-output matrix.
+
+## Upload Workflow
+
+Create the dataset repository on the Hub, then upload the prepared files. A CLI
+flow may look like:
+
+```bash
+hf auth login
+hf repo create EternWang/LegalScope --type dataset
+hf upload EternWang/LegalScope ./huggingface_dataset_repo . --repo-type dataset
+```
+
+After upload, verify:
+
+1. the dataset card renders correctly;
+2. YAML metadata appears as Hub tags;
+3. public metadata files are visible under `data/metadata/`;
+4. no withheld workbook, prompt, model-output, human-review, or non-de-identified
+ source files were uploaded;
+5. GitHub and Hugging Face links point to each other.
+
+## Full Release Gate
+
+Before adding row-level samples or full artifacts, confirm:
+
+- source redistribution rights;
+- de-identification quality;
+- model-output provider terms;
+- paper review anonymity;
+- human review privacy;
+- final paper citation and artifact policy.
diff --git a/docs/PROJECT_BRIEF.md b/docs/PROJECT_BRIEF.md
index 52cd2db..60b5025 100644
--- a/docs/PROJECT_BRIEF.md
+++ b/docs/PROJECT_BRIEF.md
@@ -1,6 +1,6 @@
# Project Brief
-This is a public, application-facing summary of the paper structure. It explains the
+This is a public-facing summary of the paper structure. It explains the
research question, benchmark design, evaluation protocol, and main findings without
releasing the full manuscript, full workbook, model-output matrix, human review sheets,
or private legal materials.
@@ -64,12 +64,12 @@ boundaries, or assigned stance that makes the answer legally controlled.
Automated evaluation is more reliable on public-exam answers than on case-based legal
analysis, which is why the benchmark keeps expert-grounded validation in the loop.
-## My Role
+## Project Role
-I initiated and led the benchmark design, data organization, scoring-protocol design,
-evaluation workflow, public repository packaging, and paper framing. The project also
-involved weekly research collaboration and legal-domain review for real-case scoring
-and validation.
+The benchmark was developed as an independent first-author research project covering
+benchmark design, data organization, scoring-protocol design, evaluation workflow,
+public repository packaging, and paper framing. The project also involved research
+collaboration and legal-domain review for real-case scoring and validation.
## Public Release Boundary
diff --git a/docs/PROVENANCE.md b/docs/PROVENANCE.md
new file mode 100644
index 0000000..6bc3d7a
--- /dev/null
+++ b/docs/PROVENANCE.md
@@ -0,0 +1,57 @@
+# Provenance
+
+This document summarizes the public-safe provenance for LegalScope. It is not a
+complete release of source data, private workbooks, model outputs, or human
+review files.
+
+## Source Families
+
+LegalScope combines two evaluation settings:
+
+| Source family | Public-safe description | Public release status |
+| --- | --- | --- |
+| Public legal exams | Open-ended legal-exam questions collected from public legal-exam materials across multiple jurisdictions. | Aggregate metadata only. Full questions and reference answers require source redistribution review. |
+| Chinese civil judgments | De-identified civil-judgment materials transformed into issue-stance legal-analysis prompts. | Aggregate metadata only. Non-de-identified judgments and full prompts are withheld. |
+| Model responses | Answers generated by 20 model groups across both tracks. | Aggregate metrics only. Row-level model outputs are withheld. |
+| Human validation | Legal review over selected public-exam and real-case responses. | Protocol and counts only. Review sheets and adjudication notes are withheld. |
+
+## Processing Stages
+
+The public Drive materials include an annotated script manifest with 78 scripts
+organized into five stages:
+
+1. Public bar source collection and cleaning.
+2. Chinese real-case prompt construction.
+3. Model answer generation.
+4. Scoring and regrading.
+5. Final conversion, translation, and release packaging.
+
+The public repository keeps only lightweight helper code. Full pipeline scripts,
+absolute local paths, provider credentials, raw source caches, and generated
+intermediate files are not required for the public preview and should not be
+published without review.
+
+## Scoring Evolution
+
+Internal materials document three real-case rubric stages:
+
+| Stage | Purpose |
+| --- | --- |
+| Original checklist rubric | Initial A/B/C scoring for citation relevance, constraint extraction, and argument validity. |
+| Strict V1 regrading | Added strong high-score caps to reduce inflated model-judge scores. |
+| Current V2 rubric | Preserves strict high-score thresholds while reducing over-penalty for ordinary incompleteness and aligning better with human validation. |
+
+The public scoring summary is documented in
+[`SCORING_RUBRIC.md`](SCORING_RUBRIC.md).
+
+## Public-Safe Naming
+
+The public project name is **LegalScope**. Historical internal names should not
+appear in public release files.
+
+## Release Boundary
+
+The public repository is intended to show how the benchmark was designed,
+audited, and summarized. It does not make the full benchmark reconstructable.
+Any expanded release should pass privacy, redistribution, review-anonymity, and
+provider-terms checks before publication.
diff --git a/docs/PUBLICATION_CHECKLIST.md b/docs/PUBLICATION_CHECKLIST.md
new file mode 100644
index 0000000..c72135d
--- /dev/null
+++ b/docs/PUBLICATION_CHECKLIST.md
@@ -0,0 +1,41 @@
+# Publication Checklist
+
+Use this checklist before making GitHub or Hugging Face materials public.
+
+## Naming
+
+- [ ] The public name is `LegalScope`.
+- [ ] Historical internal project names do not appear in public files.
+- [ ] Figure labels and captions use the public name where applicable.
+
+## GitHub
+
+- [ ] README explains what the project does, why it matters, and how to start.
+- [ ] Repository map points to data card, results, scoring, annotation, and
+ release-boundary docs.
+- [ ] `CITATION.cff` is up to date.
+- [ ] License text applies only to code and public documentation unless a data
+ license is explicitly added.
+- [ ] CI passes on the public helper code.
+- [ ] No private workbook, prompt matrix, model output, review sheet, API key,
+ provider log, or non-de-identified source file is committed.
+
+## Hugging Face
+
+- [ ] Dataset card has valid YAML metadata.
+- [ ] Dataset card states that this is a metadata-only public preview.
+- [ ] Public metadata files are uploaded under `data/metadata/`.
+- [ ] Full prompts, model outputs, reference answers, human review sheets, and
+ non-de-identified sources are absent.
+- [ ] Hub page links back to GitHub.
+- [ ] GitHub README links to the Hub page after it is live.
+
+## Full Dataset Expansion
+
+- [ ] Source redistribution rights checked.
+- [ ] De-identification reviewed.
+- [ ] Re-identification risk reviewed.
+- [ ] Provider terms for model-output redistribution checked.
+- [ ] Human review notes cleaned or excluded.
+- [ ] Review anonymity requirements checked.
+- [ ] Final paper citation added.
diff --git a/docs/RELEASE_STATUS.md b/docs/RELEASE_STATUS.md
index 3d73a2b..90167fd 100644
--- a/docs/RELEASE_STATUS.md
+++ b/docs/RELEASE_STATUS.md
@@ -1,34 +1,47 @@
# Release Status
-LegalScope is represented here as a public research repository, not a full artifact
-release.
+LegalScope is represented here as a public research repository, not a full
+artifact release.
## Included
-- Benchmark composition and evaluation counts.
-- A small public figure set rendered from the paper figure source in the submitted zip
- package.
-- High-level metadata about sources, rows, model groups, and response counts.
-- Scoring, annotation, data-card, and AI-workflow documentation.
+- Benchmark design and evaluation counts.
+- Public-safe metadata about sources, rows, model groups, and response counts.
+- Selected figures rendered from paper figure sources.
+- Public-facing result summaries.
+- Scoring, annotation, data-card, provenance, and AI-workflow documentation.
- Lightweight workbook helper utilities.
+- A Hugging Face dataset-card release plan.
## Not Included
-- Paper draft, PDF, or source files.
-- Extra slide figures that are not synchronized with the current paper text.
-- Full workbook.
+- Paper draft, review PDF, or LaTeX source files.
+- Full benchmark workbook.
- Full prompt matrix.
- Complete reference answers.
- Complete model outputs.
-- Human review sheets.
+- Human review sheets or adjudication notes.
- Non-de-identified judgments or private source documents.
+- Internal API keys, provider logs, or local absolute-path artifacts.
-## Before Full Release
+## Recommended Public Release Strategy
-Before publishing a full dataset or artifact bundle, check:
+1. Keep this GitHub repository as the project and reproducibility scaffold.
+2. Publish a Hugging Face dataset repository as a metadata-only public preview.
+3. Cross-link the GitHub repository and the Hugging Face dataset card once the
+ Hub page is live.
+4. Add row-level samples only after a separate release review confirms source
+ redistribution rights, de-identification quality, and review anonymity.
+
+## Before Any Full Dataset Release
+
+Check:
- source redistribution rights for public-exam materials;
-- de-identification and re-identification risk for real-case prompts;
+- de-identification and re-identification risk for judgment-derived prompts;
- provider terms for redistributing model outputs;
-- whether the review version requires an anonymous artifact path;
-- whether any public repository text would identify the submission during review.
+- whether the review version requires anonymous artifact paths;
+- whether repository text, commit history, file names, or public URLs could
+ identify a submission during review;
+- whether human review sheets contain private notes, names, or hidden evidence
+ references.
diff --git a/docs/SCORING_RUBRIC.md b/docs/SCORING_RUBRIC.md
index 929cb6c..814aee3 100644
--- a/docs/SCORING_RUBRIC.md
+++ b/docs/SCORING_RUBRIC.md
@@ -1,6 +1,8 @@
# Scoring Rubric
-This document summarizes the LegalScope scoring protocol used in the benchmark.
+LegalScope uses different scoring protocols for public legal exams and
+real-case legal analysis because the two tracks test different forms of legal
+reasoning.
## Public Legal-Exam Scoring
@@ -12,31 +14,70 @@ Public-exam answers receive one reference-aware score from 0 to 4.
| 1 | Captures roughly one core unit such as issue, rule, application, or conclusion. |
| 2 | Captures about two core units but misses major substance. |
| 3 | Mostly matches the reference answer with limited gaps. |
-| 4 | Matches the core issue, rule/test, application, and conclusion without substantive conflict. |
+| 4 | Matches the core issue, rule or test, application, and conclusion without substantive conflict. |
+
+The public-exam score is intended to measure reference-answer alignment, not
+general legal usefulness.
## Real-Case A/B/C Rubric
-Real-case answers receive three 0-4 scores.
+Real-case answers receive three independent 0-4 scores.
### A. Citation Relevance
-Checks whether the answer identifies legally responsive authority and connects it to a
-usable legal proposition.
+Checks whether the answer identifies legally responsive authority and connects
+that authority to a usable legal proposition.
+
+High-scoring answers cite or name the controlling legal basis and explain why it
+matters for the assigned issue. Low-scoring answers cite irrelevant rules, cite
+rules that do not support the conclusion, omit legal authority, or rely on
+generic legal language without a usable legal proposition.
### B. Constraint Extraction
-Checks whether the answer follows the assigned stance, extracts operative constraints,
-avoids invented facts, respects the prompt boundary, and covers the requested issue.
+Checks whether the answer recovers and respects the operative constraints in
+the prompt:
+
+- assigned support or opposition stance;
+- given factual record;
+- issue boundary;
+- legal and evidentiary conditions;
+- required output format;
+- de-identification and closed-record restrictions.
+
+High-scoring answers reason within the supplied record. Low-scoring answers
+reverse the stance, invent facts, add hidden evidence, ignore key conditions, or
+miss the issue being tested.
### C. Argument Validity
-Checks whether the answer states a defensible conclusion, applies rules to facts,
-handles counterpoints, and avoids unsupported reasoning.
+Checks whether the legal conclusion follows from the cited law and the supplied
+facts under the assigned stance.
+
+High-scoring answers connect rule conditions to the facts, handle important
+counterpoints, and reach a defensible conclusion. Low-scoring answers are
+conclusory, internally inconsistent, unsupported by the cited rules, or
+misaligned with the core legal question.
+
+## V2 Calibration
+
+The current V2 rubric keeps strict high-score thresholds while reducing
+over-penalization of ordinary incompleteness.
+
+Key calibration principles:
-## Calibration Notes
+- A score of 4 is reserved for answers with no meaningful substantive defect in
+ that dimension.
+- Severe failures such as stance reversal, non-answer, refusal, major
+ truncation, or unusable output remain capped at very low scores.
+- If an answer lacks a responsive legal basis, citation relevance can be zero;
+ constraint extraction and argument validity may still receive limited credit
+ when there is substantive reasoning.
+- An answer should not be penalized for omitting evidence, appraisal material,
+ or contract detail that the prompt itself did not provide.
+- Long or fluent writing should not raise a score unless it improves legal
+ relevance, constraint recovery, or rule-to-fact reasoning.
-The calibrated rubric keeps strict high-score thresholds while reducing over-penalty
-for ordinary incompleteness. Severe failures such as stance reversal, non-answer,
-refusal, major truncation, or unusable output remain capped at very low scores. If an
-answer lacks a responsive legal basis, citation relevance can be zero while argument
-or constraint dimensions may still receive limited credit for substantive reasoning.
+The calibration goal is to reduce generator-evaluator bias while preserving the
+distinction between fluent legal prose and legally controlled analysis.
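
Reviewer note: the A/B/C structure and the severe-failure cap described in this rubric can be pictured as a small record type. The following is a minimal sketch under stated assumptions, not the project's scoring code. `RealCaseScore`, `apply_severe_cap`, and the failure labels are hypothetical names, and `ASSUMED_SEVERE_CAP = 1` is a placeholder because the rubric says only "very low scores" without fixing a number.

```python
from dataclasses import dataclass

# Labels for the severe failures listed in the calibration principles.
SEVERE_FAILURES = frozenset(
    {"stance_reversal", "non_answer", "refusal", "major_truncation", "unusable_output"}
)

# Assumption: the rubric does not state a numeric cap.
ASSUMED_SEVERE_CAP = 1


@dataclass(frozen=True)
class RealCaseScore:
    citation_relevance: int     # dimension A
    constraint_extraction: int  # dimension B
    argument_validity: int      # dimension C

    def __post_init__(self):
        # Each dimension is scored independently on the 0-4 scale.
        for name, value in vars(self).items():
            if not isinstance(value, int) or not 0 <= value <= 4:
                raise ValueError(f"{name} must be an integer from 0 to 4, got {value!r}")


def apply_severe_cap(score, failures, cap=ASSUMED_SEVERE_CAP):
    """Clamp every dimension to `cap` when any severe failure is present."""
    if not SEVERE_FAILURES.intersection(failures):
        return score
    return RealCaseScore(
        citation_relevance=min(score.citation_relevance, cap),
        constraint_extraction=min(score.constraint_extraction, cap),
        argument_validity=min(score.argument_validity, cap),
    )
```

Because the three dimensions are separate fields, a record such as `RealCaseScore(0, 2, 2)` expresses the case the rubric describes, where citation relevance is zero while the other two dimensions keep limited credit for substantive reasoning.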
diff --git a/pyproject.toml b/pyproject.toml
index c609342..515f1dc 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
[project]
name = "legalscope"
-version = "0.1.0"
-description = "Reproducible public utilities for the LegalScope benchmark preview."
+version = "0.2.0"
+description = "Public metadata, documentation, and utilities for the LegalScope benchmark preview."
readme = "README.md"
requires-python = ">=3.10"
dependencies = [