From facf04b4d0037792886acdea7f8ec329173b753c Mon Sep 17 00:00:00 2001 From: Codex Date: Thu, 11 Jun 2026 19:08:27 +0800 Subject: [PATCH] Prepare LegalScope public release materials --- CONTRIBUTING.md | 32 ++++++ README.md | 189 ++++++++++++++++++++-------------- docs/AI_WORKFLOW.md | 82 +++++++++++---- docs/DATA_CARD.md | 128 ++++++++++++++++------- docs/HUGGINGFACE_RELEASE.md | 104 +++++++++++++++++++ docs/PROJECT_BRIEF.md | 12 +-- docs/PROVENANCE.md | 57 ++++++++++ docs/PUBLICATION_CHECKLIST.md | 41 ++++++++ docs/RELEASE_STATUS.md | 45 +++++--- docs/SCORING_RUBRIC.md | 71 ++++++++++--- pyproject.toml | 4 +- 11 files changed, 586 insertions(+), 179 deletions(-) create mode 100644 CONTRIBUTING.md create mode 100644 docs/HUGGINGFACE_RELEASE.md create mode 100644 docs/PROVENANCE.md create mode 100644 docs/PUBLICATION_CHECKLIST.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..57ddc41 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,32 @@ +# Contributing + +LegalScope is currently a public research preview with a strict release boundary. +Contributions are welcome for documentation, metadata cleanup, public helper +code, tests, and release-process improvements. + +Please do not submit: + +- private workbooks; +- full prompts or reference answers; +- model-output matrices; +- human review sheets or adjudication notes; +- non-de-identified legal source documents; +- provider credentials, local paths, logs, or API keys. + +## Development + +```bash +python -m pip install -r requirements.txt +python -m pytest -q +``` + +## Documentation Changes + +Keep public files aligned with the project name `LegalScope`. Historical +internal names should not be introduced into public documentation. + +## Privacy Review + +Any proposed row-level data release should be reviewed for source licensing, +de-identification, re-identification risk, provider terms, and paper review +requirements before it is merged. 
diff --git a/README.md b/README.md index fc9d160..06a8b3d 100644 --- a/README.md +++ b/README.md @@ -1,40 +1,16 @@ -# LegalScope: Measuring Exam-to-Case Transfer in LLM Legal Reasoning +# LegalScope -LegalScope studies a simple question with high stakes for legal AI evaluation: -do strong public legal-exam scores actually transfer to real-case legal reasoning? +**LegalScope** is a benchmark for measuring whether large language model +performance on public legal exams transfers to real-case legal analysis. -I built LegalScope as an independent first-author benchmark project that pairs scalable -public legal-exam tasks with lawyer-reviewed, de-identified Chinese civil judgment -analysis. The public repository is intentionally a preview: it documents the research -question, benchmark design, evaluation counts, scoring protocol, release boundary, and -reproducible helper code without publishing the paper draft, full workbook, model -outputs, human review sheets, or non-de-identified case materials. +The project pairs a scalable public-exam track with a lawyer-reviewed, +de-identified Chinese civil-judgment track. The public repository is a +research release scaffold: it documents the benchmark design, evaluation +counts, scoring protocol, public figures, metadata, and helper utilities while +withholding private workbooks, full prompts, model outputs, human review +sheets, and non-de-identified legal materials. -## Start Here - -| If you want to understand... 
| Read | -| --- | --- | -| The research idea and motivation | [Project Brief](docs/PROJECT_BRIEF.md) | -| Main empirical findings and figures | [Results Summary](docs/RESULTS_SUMMARY.md) | -| Dataset scope and release boundary | [Data Card](docs/DATA_CARD.md) | -| Scoring design | [Scoring Rubric](docs/SCORING_RUBRIC.md) | -| Human validation protocol | [Annotation Protocol](docs/ANNOTATION_PROTOCOL.md) | - -## What I Contributed - -- Built a dual-track benchmark connecting public legal-exam evaluation with - lawyer-reviewed real-case legal analysis. -- Designed a paired issue-stance protocol for Chinese civil judgments, so the same - factual background can be tested under supporting and opposing legal positions. -- Developed two scoring protocols: reference-aware 0-4 exam scoring and a calibrated - real-case rubric for citation relevance, constraint extraction, and argument - validity. -- Validated automated scores against human legal review and identified constraint - extraction as the main real-case failure mode. - -## Benchmark at a Glance - -LegalScope benchmark construction pipeline +## Snapshot | Component | Count | | --- | ---: | @@ -45,71 +21,126 @@ outputs, human review sheets, or non-de-identified case materials. | Model groups evaluated | 20 | | Public-exam model responses | 17,360 | | Real-case model responses | 1,520 | -| Total dataset model responses | 18,880 | +| Total model responses | 18,880 | | Human-validation responses | 1,800 | -The pipeline figure above is rendered from `8.pdf`, which is referenced by the paper -source. The full paper PDF is not committed to this repository. +LegalScope benchmark construction pipeline + +## What LegalScope Tests + +LegalScope asks whether exam performance is a reliable proxy for applied legal +reasoning. The public-exam track uses reference-aware open-ended legal-exam +questions. 
The real-case track asks models to write stance-aware Chinese legal +analysis over de-identified civil-judgment materials under closed-record +constraints. + +The benchmark separates three questions that are often blurred together: + +1. How well do models answer public legal-exam questions? +2. How well do the same models reason over bounded real-case legal facts? +3. Do rankings, reasoning-mode gains, and evaluator agreement transfer across + those settings? ## Main Findings - Public-exam scores correlate with Chinese real-case scores at the model level - (Pearson `r = 0.835`, Spearman `rho = 0.661`), but rankings and reasoning-mode gains - do not transfer uniformly. -- Real-case legal reasoning exposes a constraint-extraction bottleneck: models write - fluent legal arguments more easily than they recover the operative legal and factual - conditions that control those arguments. + (Pearson `r = 0.835`, Spearman `rho = 0.661`), but the transfer is incomplete. +- Real-case legal reasoning exposes a constraint-extraction bottleneck: models + often produce fluent legal prose while missing operative rule conditions, + factual boundaries, stance requirements, or evidence limits. - Automated evaluation aligns strongly with human review on public-exam answers (answer-level Pearson `r = 0.925`) but weakens on real-case analysis - (`r = 0.450`), showing why expert-grounded evaluation remains important. + (`r = 0.450`), motivating expert-grounded validation for high-stakes legal + evaluation. 
## Repository Map -```text -assets/figures/ - paper_collection_pipeline.png - paper_score_distribution.png - paper_transfer_model_judge.png - paper_transfer_human.png -data/ - README.md - metadata/dataset_summary.json - metadata/model_groups.csv - metadata/source_composition.csv - sample/README.md -docs/ - PROJECT_BRIEF.md - RESULTS_SUMMARY.md - DATA_CARD.md - SCORING_RUBRIC.md - ANNOTATION_PROTOCOL.md - AI_WORKFLOW.md - FIGURE_SOURCES.md - RELEASE_STATUS.md -scripts/ - extract_public_sample.py -src/legalscope/ - workbook.py -tests/ - test_workbook.py +| Path | Purpose | +| --- | --- | +| `assets/figures/` | Public-safe figures rendered from paper figure sources. | +| `data/metadata/` | Machine-readable counts, model groups, and source composition. | +| `data/sample/` | Reserved for public row-level samples after release review. | +| `docs/PROJECT_BRIEF.md` | Research question, benchmark design, and paper-facing summary. | +| `docs/RESULTS_SUMMARY.md` | Public-facing result figures and transfer metrics. | +| `docs/DATA_CARD.md` | Dataset scope, intended use, limitations, and release boundary. | +| `docs/SCORING_RUBRIC.md` | Public-exam and real-case scoring protocols. | +| `docs/ANNOTATION_PROTOCOL.md` | Human-validation protocol and review focus. | +| `docs/AI_WORKFLOW.md` | AI-assisted workflow and human-control safeguards. | +| `docs/PROVENANCE.md` | Public-safe construction and processing provenance. | +| `docs/HUGGINGFACE_RELEASE.md` | Hugging Face dataset-card release plan. | +| `scripts/` | Lightweight public helper scripts. | +| `src/legalscope/` | Small Python utilities for authorized local workbooks. | +| `tests/` | Unit tests for public helpers. 
| + +## Quickstart + +Install the public helper package from a local checkout: + +```bash +python -m pip install -r requirements.txt +python -m pytest -q +``` + +Inspect the public metadata: + +```bash +python - <<'PY' +import json +from pathlib import Path + +summary = json.loads(Path("data/metadata/dataset_summary.json").read_text()) +print(summary["project"]) +print(summary["counts"]["dataset_model_responses_total"]) +PY +``` + +Use the workbook helpers only with authorized local workbooks: + +```python +from legalscope.workbook import summarize_workbook + +for sheet in summarize_workbook("private_authorized_workbook.xlsx"): + print(sheet.title, sheet.data_rows, sheet.model_count) ``` ## Public Release Boundary This repository does not publish: -- the paper draft or PDF; +- the paper draft, review PDF, or LaTeX source; - the full benchmark workbook; -- complete prompts, reference answers, model answers, or row-level model-output - matrices; -- lawyer review sheets or adjudication notes; +- complete prompt matrices, reference answers, model answers, or row-level + model-output tables; +- human review sheets, adjudication notes, or reviewer annotations; - non-de-identified judgments or private source documents. -The public code is a reproducibility scaffold for collaborators with authorized local -access to the private workbook. It is not enough to reconstruct the full benchmark from -the public repository alone. +The public materials are sufficient to understand the research design, scope, +counts, release boundary, and public-facing results. They are not sufficient to +reconstruct the full benchmark. + +## Hugging Face Release + +A Hugging Face dataset-card-ready public preview is described in +[`docs/HUGGINGFACE_RELEASE.md`](docs/HUGGINGFACE_RELEASE.md). The recommended +first Hub release is a metadata-only preview containing this README-style +dataset card plus the public metadata files. 
Row-level samples should be added +only after source redistribution, privacy, and review constraints are cleared. + +## Citation + +Citation metadata is provided in [`CITATION.cff`](CITATION.cff). The final paper +citation should replace the placeholder citation once the paper has a stable +public identifier. + +## License + +Code and public documentation in this repository are released under the MIT +License unless otherwise noted. This license does not grant redistribution +rights for withheld source documents, full workbooks, model outputs, or private +review materials. ## Disclaimer -LegalScope is a research benchmark for model evaluation. It is not legal advice, a -legal research product, or a substitute for jurisdiction-specific legal review. +LegalScope is a research benchmark for model evaluation. It is not legal advice, +a legal research product, or a substitute for jurisdiction-specific legal +review. diff --git a/docs/AI_WORKFLOW.md b/docs/AI_WORKFLOW.md index a5f74c1..e87115b 100644 --- a/docs/AI_WORKFLOW.md +++ b/docs/AI_WORKFLOW.md @@ -1,38 +1,78 @@ # AI-Assisted Research Workflow -LegalScope uses LLMs as research tools while keeping source selection, legal review, -release decisions, and paper claims under human control. +LegalScope uses AI systems as research tools while keeping dataset design, +source selection, de-identification, legal review, scoring decisions, release +decisions, and paper claims under human control. ## Pipeline Overview -1. Collect public legal-exam sources and de-identified civil-judgment materials. +1. Collect public legal-exam sources and candidate Chinese civil-judgment + materials. 2. Parse, normalize, redact, deduplicate, and audit source records. -3. Build standardized public-exam and real-case prompt templates. +3. Construct public-exam prompts and real-case issue-stance prompts. 4. Generate model answers across 20 model groups. -5. Score public-exam answers with reference-aware scoring. -6. 
Score real-case answers with the A/B/C legal-reasoning rubric. -7. Validate selected rows against human legal review. -8. Analyze transfer, human agreement, length effects, and error patterns. +5. Score public-exam answers with reference-aware 0-4 scoring. +6. Score real-case answers with the citation, constraint, and argument rubric. +7. Calibrate the real-case rubric against human legal review. +8. Analyze exam-to-case transfer, human agreement, score distributions, length + effects, and error patterns. +9. Prepare public documentation and metadata while withholding sensitive + artifacts. + +## Public-Safe Script Provenance + +The Drive public-material folder documents 78 copied and annotated pipeline +scripts grouped into five stages: + +| Stage | Scripts | Public-safe description | +| --- | ---: | --- | +| Public bar source collection and cleaning | 16 | Collection, parsing, normalization, duplicate repair, reference repair, and source-audit scripts. | +| Chinese real-case prompt construction | 7 | Judgment preview, issue/stance prompt construction, repair, rerun, and workbook writeback scripts. | +| Model answer generation | 17 | Model catalog, provider runners, batch launchers, answer merge, and answer writeback scripts. | +| Scoring and regrading | 29 | Public-exam scoring, real-case rubric calibration, blind packets, V2 scoring, and validation utilities. | +| Final conversion, translation, and release | 9 | Workbook-to-JSON conversion, English cleanup, metadata repair, and public release packaging. | + +Some internal scripts retain absolute paths or require provider credentials. +They should be treated as provenance records and rerun only after path, +credential, privacy, and redistribution checks. ## Where AI Assistance Is Used -AI tools may help draft transformation code, normalize text, prepare prompt templates, -generate model answers under controlled settings, and identify candidate failure modes -for inspection. 
+AI tools may help: + +- draft transformation code; +- normalize and translate text; +- prepare prompt templates; +- generate model answers under controlled settings; +- score answers according to documented rubrics; +- identify candidate failure modes for human inspection; +- prepare release documentation. + +## Human-Controlled Steps + +AI tools do not replace: -AI tools do not replace source-selection decisions, de-identification review, final -legal judgment, manuscript claims, licensing review, or release decisions. +- source-selection decisions; +- privacy and de-identification review; +- legal-domain review; +- final scoring policy; +- human-validation judgments; +- licensing and redistribution decisions; +- paper claims and release decisions. ## Safeguards -- De-identification before public release. -- Separate scorer-side references and prompt-facing text. -- Stance and closed-book constraints for real-case prompts. -- Human validation for selected public-exam and real-case rows. -- Public release boundary for full prompts, model outputs, and review sheets. +- Separate public prompt-facing text from scorer-side references. +- Keep real-case prompts closed-record and stance-constrained. +- Mask names, institutions, identifiers, and other sensitive details before + public release. +- Validate selected rows with human legal review. +- Preserve a clear public/private release boundary. +- Withhold full prompts, model outputs, human review sheets, and non-de-identified + source documents until a later release review. ## Public Repository Boundary -This repository keeps documentation, selected paper figures, high-level metadata, and -small workbook utilities. Full data and review artifacts remain private until privacy, -licensing, and review constraints are resolved. +This repository includes documentation, selected figures, high-level metadata, +and lightweight helper code. 
The full data workflow remains private until +privacy, licensing, and review constraints are resolved. diff --git a/docs/DATA_CARD.md b/docs/DATA_CARD.md index 4000440..61f6844 100644 --- a/docs/DATA_CARD.md +++ b/docs/DATA_CARD.md @@ -4,13 +4,19 @@ LegalScope. -## Purpose +## Summary -LegalScope evaluates whether LLM performance on public legal-exam tasks transfers to -practice-oriented legal reasoning over de-identified Chinese civil judgments. The -benchmark separates reference-answer scoring from case-based rubric scoring so that -exam performance, real-case performance, human validation, and transfer can be -studied separately. +LegalScope evaluates whether public legal-exam performance transfers to +practice-oriented legal reasoning over de-identified Chinese civil judgments. +The benchmark has two coordinated tracks: + +- a public legal-exam track with reference-aware open-ended scoring; +- a real-case legal-analysis track with stance-aware, closed-record prompts + derived from de-identified Chinese civil judgments. + +The public repository is a metadata and documentation preview. It intentionally +does not publish the full workbook, row-level prompt matrix, model outputs, +human review sheets, or private legal source documents. ## Benchmark Composition @@ -18,67 +24,109 @@ studied separately. | --- | ---: | | Public legal-exam items | 868 | | Real-case issue-stance prompts | 76 | -| Total dataset items | 944 | +| Total benchmark items | 944 | | Model groups | 20 | | Public-exam model responses | 17,360 | | Real-case model responses | 1,520 | -| Total dataset model responses | 18,880 | +| Total model responses | 18,880 | | Human-scored public-exam items | 80 | | Human-scored real-case prompts | 10 | | Human-validation responses | 1,800 | | De-identified Chinese civil judgments | 15 | | Real-case legal issues | 38 | -See `data/metadata/dataset_summary.json` for the machine-readable summary. 
+See [`data/metadata/dataset_summary.json`](../data/metadata/dataset_summary.json) +for a machine-readable summary. + +## Data Structure + +### Public Legal-Exam Track + +The public-exam track contains open-ended questions drawn from public legal-exam +materials across multiple jurisdictions. Answers are scored against reference +answers with a 0-4 protocol that rewards issue recognition, rule identification, +application, and conclusion alignment. + +The public repository exposes only aggregate metadata for this track. Full +question text, reference answers, and model answers are withheld pending source +redistribution review. -## Splits +### Real-Case Legal-Analysis Track -### Public Legal-Exam Split +The real-case track contains issue-stance prompts derived from 15 de-identified +Chinese civil judgments and 38 legal issues. Many issues are paired into +support/opposition prompts so that the same bounded factual context can be used +to test whether models can construct statute-grounded arguments under an +assigned stance. -The public-exam split contains open-ended questions from public legal-exam materials. -It is scored with a reference-aware 0-4 answer-match protocol. The split covers U.S., -China, U.K., and Australia sources. +The track is scored on: -### Chinese Real-Case Split +- citation relevance; +- constraint extraction; +- argument validity. -The real-case split contains issue-stance prompts derived from de-identified Chinese -civil judgments. Each prompt asks the model to reason from a structured case setting -under a specified stance. It is scored across citation relevance, constraint -extraction, and argument validity. +The public repository does not include non-de-identified judgments, full +prompts, hidden legal references, row-level model answers, or review notes. ### Human Validation -The human-validation subset covers 80 public-exam items and 10 real-case prompts -across the same 20 model groups. 
It is used to compare automated/model-judge scores -with human legal review. +Human validation covers 80 public-exam items and 10 real-case prompts across the +same 20 model groups, for 1,800 human-validation responses. Human review is used +to calibrate and audit automated scoring, especially for real-case legal +analysis where expert judgment remains important. -## Public Release Boundary +## Source Composition -The repository exposes only high-level metadata, documentation, selected paper figures, -and lightweight workbook utilities. It does not include the full workbook, full prompts, -reference answers, model-output matrices, human review sheets, or private source -documents. +Public metadata includes: + +- jurisdiction and domain counts for the public-exam track; +- legal-domain counts for the real-case track; +- model-group names used in the evaluation tables; +- aggregate transfer and human-validation metrics. + +See [`data/metadata/source_composition.csv`](../data/metadata/source_composition.csv) +and [`data/metadata/model_groups.csv`](../data/metadata/model_groups.csv). ## Intended Uses - Studying legal benchmark design. -- Inspecting how exam and real-case evaluation settings differ. -- Reviewing documentation for high-stakes LLM evaluation workflows. -- Reusing lightweight workbook helpers in a private, properly licensed workspace. +- Comparing public-exam and real-case evaluation settings. +- Auditing release boundaries for high-stakes legal NLP datasets. +- Reusing public helper utilities with authorized local workbooks. +- Preparing a later full artifact release after privacy, license, and review + checks. ## Out-of-Scope Uses -- Legal advice. -- Ranking lawyers, courts, litigants, institutions, or jurisdictions. -- Training or deploying legal decision systems from these materials. -- Redistributing source documents, full prompts, or model outputs without release +- Legal advice or legal decision support. 
+- Ranking courts, lawyers, litigants, institutions, or jurisdictions. +- Training or deploying legal decision systems from the public preview. +- Reconstructing private workbooks, source documents, model outputs, or human + review sheets. +- Redistributing source materials without independent license and privacy review. +## Privacy and De-identification + +Real-case materials are derived from de-identified Chinese civil judgments. +Non-de-identified judgments and private source files are excluded from the +public repository. Any row-level release should pass a separate review for +personal names, institution names, addresses, identifiers, case-specific +re-identification risk, and hidden evidence references. + ## Known Limitations -- The real-case split is focused on Chinese civil judgments and is not a general legal - practice benchmark. -- Public-exam and real-case tasks use different scoring regimes. -- Some source materials may have licensing or redistribution constraints. -- Human validation is a subset of the full evaluation matrix, not a complete manual - relabeling of all model responses. +- The real-case track is focused on Chinese civil judgments, not all legal + practice settings. +- Public-exam and real-case tracks use different scoring regimes. +- The public preview documents aggregate results but does not expose all + row-level evidence needed for independent replication. +- Human validation covers a subset of responses rather than a full manual + relabeling of the entire model-output matrix. +- Source redistribution rights may differ across public-exam sources and + judgment-derived materials. + +## Version + +This card describes the paper-submission benchmark snapshot represented by the +public LegalScope repository. 
diff --git a/docs/HUGGINGFACE_RELEASE.md b/docs/HUGGINGFACE_RELEASE.md new file mode 100644 index 0000000..281630a --- /dev/null +++ b/docs/HUGGINGFACE_RELEASE.md @@ -0,0 +1,104 @@ +# Hugging Face Release Plan + +This document describes the recommended Hugging Face public preview for +LegalScope. + +## Recommended Repository + +- Repository type: dataset +- Suggested repo id: `EternWang/LegalScope` +- Public title: `LegalScope` +- Release mode: metadata-only public preview + +The first Hub release should mirror the public GitHub release boundary. It +should include the dataset card and public metadata files, not the full +workbook, prompts, model outputs, human review sheets, or private legal source +documents. + +## Files to Upload + +```text +README.md +data/metadata/dataset_summary.json +data/metadata/model_groups.csv +data/metadata/source_composition.csv +``` + +Optional after review: + +```text +data/sample/public_preview.jsonl +data/sample/README.md +``` + +## Dataset Card Requirements + +The Hugging Face `README.md` should include: + +- YAML metadata for discoverability; +- a concise dataset summary; +- benchmark composition counts; +- data structure and release boundary; +- intended and out-of-scope uses; +- privacy, de-identification, and licensing notes; +- citation instructions; +- links back to GitHub and the paper once available. + +## Suggested Metadata + +```yaml +--- +language: +- en +- zh +license: mit +pretty_name: LegalScope +task_categories: +- question-answering +- text-generation +tags: +- legal +- legal-reasoning +- benchmark +- llm-evaluation +- exam-to-case-transfer +- chinese-civil-judgments +- text +size_categories: +- n<1K +--- +``` + +The `n<1K` value reflects the public metadata preview files, not the withheld +full model-output matrix. + +## Upload Workflow + +Create the dataset repository on the Hub, then upload the prepared files. 
A CLI +flow may look like: + +```bash +hf auth login +hf repo create EternWang/LegalScope --repo-type dataset +hf upload EternWang/LegalScope ./huggingface_dataset_repo . --repo-type dataset +``` + +After upload, verify: + +1. the dataset card renders correctly; +2. YAML metadata appears as Hub tags; +3. public metadata files are visible under `data/metadata/`; +4. no withheld workbook, prompt, model-output, human-review, or non-de-identified + source files were uploaded; +5. GitHub and Hugging Face links point to each other. + +## Full Release Gate + +Before adding row-level samples or full artifacts, confirm: + +- source redistribution rights; +- de-identification quality; +- model-output provider terms; +- paper review anonymity; +- human review privacy; +- final paper citation and artifact policy. diff --git a/docs/PROJECT_BRIEF.md b/docs/PROJECT_BRIEF.md index 52cd2db..60b5025 100644 --- a/docs/PROJECT_BRIEF.md +++ b/docs/PROJECT_BRIEF.md @@ -1,6 +1,6 @@ # Project Brief -This is a public, application-facing summary of the paper structure. It explains the +This is a public-facing summary of the paper structure. It explains the research question, benchmark design, evaluation protocol, and main findings without releasing the full manuscript, full workbook, model-output matrix, human review sheets, or private legal materials. @@ -64,12 +64,12 @@ boundaries, or assigned stance that makes the answer legally controlled. Automated evaluation is more reliable on public-exam answers than on case-based legal analysis, which is why the benchmark keeps expert-grounded validation in the loop. -## My Role +## Project Role -I initiated and led the benchmark design, data organization, scoring-protocol design, -evaluation workflow, public repository packaging, and paper framing. The project also -involved weekly research collaboration and legal-domain review for real-case scoring -and validation. 
+The benchmark was led as an independent first-author research project covering +benchmark design, data organization, scoring-protocol design, evaluation workflow, +public repository packaging, and paper framing. The project also involved research +collaboration and legal-domain review for real-case scoring and validation. ## Public Release Boundary diff --git a/docs/PROVENANCE.md b/docs/PROVENANCE.md new file mode 100644 index 0000000..6bc3d7a --- /dev/null +++ b/docs/PROVENANCE.md @@ -0,0 +1,57 @@ +# Provenance + +This document summarizes the public-safe provenance for LegalScope. It is not a +complete release of source data, private workbooks, model outputs, or human +review files. + +## Source Families + +LegalScope combines two evaluation settings: + +| Source family | Public-safe description | Public release status | +| --- | --- | --- | +| Public legal exams | Open-ended legal-exam questions collected from public legal-exam materials across multiple jurisdictions. | Aggregate metadata only. Full questions and reference answers require source redistribution review. | +| Chinese civil judgments | De-identified civil-judgment materials transformed into issue-stance legal-analysis prompts. | Aggregate metadata only. Non-de-identified judgments and full prompts are withheld. | +| Model responses | Answers generated by 20 model groups across both tracks. | Aggregate metrics only. Row-level model outputs are withheld. | +| Human validation | Legal review over selected public-exam and real-case responses. | Protocol and counts only. Review sheets and adjudication notes are withheld. | + +## Processing Stages + +The public Drive materials include an annotated script manifest with 78 scripts +organized into five stages: + +1. Public bar source collection and cleaning. +2. Chinese real-case prompt construction. +3. Model answer generation. +4. Scoring and regrading. +5. Final conversion, translation, and release packaging. 
+ +The public repository keeps only lightweight helper code. Full pipeline scripts, +absolute local paths, provider credentials, raw source caches, and generated +intermediate files are not required for the public preview and should not be +published without review. + +## Scoring Evolution + +Internal materials document three real-case rubric stages: + +| Stage | Purpose | +| --- | --- | +| Original checklist rubric | Initial A/B/C scoring for citation relevance, constraint extraction, and argument validity. | +| Strict V1 regrading | Added strong high-score caps to reduce inflated model-judge scores. | +| Current V2 rubric | Preserves strict high-score thresholds while reducing over-penalty for ordinary incompleteness and aligning better with human validation. | + +The public scoring summary is documented in +[`SCORING_RUBRIC.md`](SCORING_RUBRIC.md). + +## Public-Safe Naming + +The public project name is **LegalScope**. Historical internal names should not +appear in public release files. + +## Release Boundary + +The public repository is intended to show how the benchmark was designed, +audited, and summarized. It does not make the full benchmark reconstructable. +Any expanded release should pass privacy, redistribution, review-anonymity, and +provider-terms checks before publication. diff --git a/docs/PUBLICATION_CHECKLIST.md b/docs/PUBLICATION_CHECKLIST.md new file mode 100644 index 0000000..c72135d --- /dev/null +++ b/docs/PUBLICATION_CHECKLIST.md @@ -0,0 +1,41 @@ +# Publication Checklist + +Use this checklist before making GitHub or Hugging Face materials public. + +## Naming + +- [ ] The public name is `LegalScope`. +- [ ] Historical internal project names do not appear in public files. +- [ ] Figure labels and captions use the public name where applicable. + +## GitHub + +- [ ] README explains what the project does, why it matters, and how to start. +- [ ] Repository map points to data card, results, scoring, annotation, and + release-boundary docs. 
+- [ ] `CITATION.cff` is up to date. +- [ ] License text applies only to code and public documentation unless a data + license is explicitly added. +- [ ] CI passes on the public helper code. +- [ ] No private workbook, prompt matrix, model output, review sheet, API key, + provider log, or non-de-identified source file is committed. + +## Hugging Face + +- [ ] Dataset card has valid YAML metadata. +- [ ] Dataset card states that this is a metadata-only public preview. +- [ ] Public metadata files are uploaded under `data/metadata/`. +- [ ] Full prompts, model outputs, reference answers, human review sheets, and + non-de-identified sources are absent. +- [ ] Hub page links back to GitHub. +- [ ] GitHub README links to the Hub page after it is live. + +## Full Dataset Expansion + +- [ ] Source redistribution rights checked. +- [ ] De-identification reviewed. +- [ ] Re-identification risk reviewed. +- [ ] Provider terms for model-output redistribution checked. +- [ ] Human review notes cleaned or excluded. +- [ ] Review anonymity requirements checked. +- [ ] Final paper citation added. diff --git a/docs/RELEASE_STATUS.md b/docs/RELEASE_STATUS.md index 3d73a2b..90167fd 100644 --- a/docs/RELEASE_STATUS.md +++ b/docs/RELEASE_STATUS.md @@ -1,34 +1,47 @@ # Release Status -LegalScope is represented here as a public research repository, not a full artifact -release. +LegalScope is represented here as a public research repository, not a full +artifact release. ## Included -- Benchmark composition and evaluation counts. -- A small public figure set rendered from the paper figure source in the submitted zip - package. -- High-level metadata about sources, rows, model groups, and response counts. -- Scoring, annotation, data-card, and AI-workflow documentation. +- Benchmark design and evaluation counts. +- Public-safe metadata about sources, rows, model groups, and response counts. +- Selected figures rendered from paper figure sources. +- Public-facing result summaries. 
+- Scoring, annotation, data-card, provenance, and AI-workflow documentation. - Lightweight workbook helper utilities. +- A Hugging Face dataset-card release plan. ## Not Included -- Paper draft, PDF, or source files. -- Extra slide figures that are not synchronized with the current paper text. -- Full workbook. +- Paper draft, review PDF, or LaTeX source files. +- Full benchmark workbook. - Full prompt matrix. - Complete reference answers. - Complete model outputs. -- Human review sheets. +- Human review sheets or adjudication notes. - Non-de-identified judgments or private source documents. +- Internal API keys, provider logs, or local absolute-path artifacts. -## Before Full Release +## Recommended Public Release Strategy -Before publishing a full dataset or artifact bundle, check: +1. Keep this GitHub repository as the project and reproducibility scaffold. +2. Publish a Hugging Face dataset repository as a metadata-only public preview. +3. Link the GitHub repository and Hugging Face dataset card after the Hub page is + live. +4. Add row-level samples only after a separate release review confirms source + redistribution rights, de-identification quality, and review anonymity. + +## Before Any Full Dataset Release + +Check: - source redistribution rights for public-exam materials; -- de-identification and re-identification risk for real-case prompts; +- de-identification and re-identification risk for judgment-derived prompts; - provider terms for redistributing model outputs; -- whether the review version requires an anonymous artifact path; -- whether any public repository text would identify the submission during review. +- whether the review version requires anonymous artifact paths; +- whether repository text, commit history, file names, or public URLs could + identify a submission during review; +- whether human review sheets contain private notes, names, or hidden evidence + references. 
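Some of the release checks listed above (local absolute-path artifacts, API keys) can be partly automated before a manual review. The following is a minimal sketch under stated assumptions, not the project's actual tooling: the function names `scan_text` and `scan_tree`, the regular expressions, and the file-suffix list are all hypothetical and do not come from the LegalScope helper code. A pattern scan can only flag candidates; it does not replace the privacy, licensing, and review-anonymity checks.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumptions, not taken from the repository).
PATTERNS = {
    # Common home-directory prefixes and Windows drive paths.
    "absolute local path": re.compile(r"(?:/Users/|/home/|[A-Za-z]:\\)"),
    # Strings shaped like provider tokens, e.g. an "sk-" or "hf_" prefix.
    "API-key-like string": re.compile(r"\b(?:sk|hf)[-_][A-Za-z0-9]{16,}\b"),
}


def scan_text(text: str) -> list[str]:
    """Return one finding per (line, pattern) match in the given text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {lineno}: {label}")
    return findings


def scan_tree(
    root: Path,
    suffixes: tuple[str, ...] = (".md", ".py", ".toml", ".json", ".csv"),
) -> dict[str, list[str]]:
    """Scan text files under root and map each flagged path to its findings."""
    report = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            findings = scan_text(text)
            if findings:
                report[str(path)] = findings
    return report


if __name__ == "__main__":
    # Print every flagged file in the current working tree.
    for name, findings in scan_tree(Path(".")).items():
        for finding in findings:
            print(f"{name}: {finding}")
```

Run from the repository root, the script lists candidate lines for a human reviewer. Commit history, file names, and public URLs, which the checklist also mentions, would need separate checks that this sketch does not attempt.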
diff --git a/docs/SCORING_RUBRIC.md b/docs/SCORING_RUBRIC.md index 929cb6c..814aee3 100644 --- a/docs/SCORING_RUBRIC.md +++ b/docs/SCORING_RUBRIC.md @@ -1,6 +1,8 @@ # Scoring Rubric -This document summarizes the LegalScope scoring protocol used in the benchmark. +LegalScope uses different scoring protocols for public legal exams and +real-case legal analysis because the two tracks test different forms of legal +reasoning. ## Public Legal-Exam Scoring @@ -12,31 +14,70 @@ Public-exam answers receive one reference-aware score from 0 to 4. | 1 | Captures roughly one core unit such as issue, rule, application, or conclusion. | | 2 | Captures about two core units but misses major substance. | | 3 | Mostly matches the reference answer with limited gaps. | -| 4 | Matches the core issue, rule/test, application, and conclusion without substantive conflict. | +| 4 | Matches the core issue, rule or test, application, and conclusion without substantive conflict. | + +The public-exam score is intended to measure reference-answer alignment, not +general legal usefulness. ## Real-Case A/B/C Rubric -Real-case answers receive three 0-4 scores. +Real-case answers receive three independent 0-4 scores. ### A. Citation Relevance -Checks whether the answer identifies legally responsive authority and connects it to a -usable legal proposition. +Checks whether the answer identifies legally responsive authority and connects +that authority to a usable legal proposition. + +High-scoring answers cite or name the controlling legal basis and explain why it +matters for the assigned issue. Low-scoring answers cite irrelevant rules, cite +rules that do not support the conclusion, omit legal authority, or rely on +generic legal language without a usable legal proposition. ### B. Constraint Extraction -Checks whether the answer follows the assigned stance, extracts operative constraints, -avoids invented facts, respects the prompt boundary, and covers the requested issue. 
+Checks whether the answer recovers and respects the operative constraints in +the prompt: + +- assigned support or opposition stance; +- given factual record; +- issue boundary; +- legal and evidentiary conditions; +- required output format; +- de-identification and closed-record restrictions. + +High-scoring answers reason within the supplied record. Low-scoring answers +reverse the stance, invent facts, add hidden evidence, ignore key conditions, or +miss the issue being tested. ### C. Argument Validity -Checks whether the answer states a defensible conclusion, applies rules to facts, -handles counterpoints, and avoids unsupported reasoning. +Checks whether the legal conclusion follows from the cited law and the supplied +facts under the assigned stance. + +High-scoring answers connect rule conditions to the facts, handle important +counterpoints, and reach a defensible conclusion. Low-scoring answers are +conclusory, internally inconsistent, unsupported by the cited rules, or +misaligned with the core legal question. + +## V2 Calibration + +The current V2 rubric keeps strict high-score thresholds while reducing +over-penalty for ordinary incompleteness. + +Key calibration principles: -## Calibration Notes +- A score of 4 is reserved for answers with no meaningful substantive defect in + that dimension. +- Severe failures such as stance reversal, non-answer, refusal, major + truncation, or unusable output remain capped at very low scores. +- If an answer lacks a responsive legal basis, citation relevance can be zero; + constraint extraction and argument validity may still receive limited credit + when there is substantive reasoning. +- Missing hidden evidence, hidden appraisal material, or unshown contract + detail should not be treated as a failure unless the prompt itself provided + that material. +- Long or fluent writing should not raise a score unless it improves legal + relevance, constraint recovery, or rule-to-fact reasoning. 
-The calibrated rubric keeps strict high-score thresholds while reducing over-penalty -for ordinary incompleteness. Severe failures such as stance reversal, non-answer, -refusal, major truncation, or unusable output remain capped at very low scores. If an -answer lacks a responsive legal basis, citation relevance can be zero while argument -or constraint dimensions may still receive limited credit for substantive reasoning. +The calibration goal is to reduce generator-evaluator bias while preserving the +distinction between fluent legal prose and legally controlled analysis. diff --git a/pyproject.toml b/pyproject.toml index c609342..515f1dc 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta" [project] name = "legalscope" -version = "0.1.0" -description = "Reproducible public utilities for the LegalScope benchmark preview." +version = "0.2.0" +description = "Public metadata, documentation, and utilities for the LegalScope benchmark preview." readme = "README.md" requires-python = ">=3.10" dependencies = [