Merged
32 changes: 32 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,32 @@
# Contributing

LegalScope is currently a public research preview with a strict release boundary.
Contributions are welcome for documentation, metadata cleanup, public helper
code, tests, and release-process improvements.

Please do not submit:

- private workbooks;
- full prompts or reference answers;
- model-output matrices;
- human review sheets or adjudication notes;
- non-de-identified legal source documents;
- provider credentials, local paths, logs, or API keys.
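A contributor could screen staged file names against the list above before opening a pull request. The sketch below is only an illustration: the suffixes, name fragments, and the `flag_paths` helper are hypothetical and are not part of the LegalScope repository or its release policy.

```python
from pathlib import PurePosixPath

# Hypothetical deny-list; these suffixes and name fragments are illustrative
# assumptions, not a policy defined by the project.
BLOCKED_SUFFIXES = {".xlsx", ".xls", ".log"}
BLOCKED_NAME_PARTS = ("review_sheet", "adjudication", "model_output")


def flag_paths(paths):
    """Return the paths whose names suggest they cross the release boundary."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        name = path.name.lower()
        blocked_suffix = path.suffix.lower() in BLOCKED_SUFFIXES
        blocked_name = any(part in name for part in BLOCKED_NAME_PARTS)
        if blocked_suffix or blocked_name:
            flagged.append(raw)
    return flagged


# Example file names are made up for the demonstration.
staged = [
    "docs/DATA_CARD.md",
    "data/private_workbook.xlsx",
    "notes/adjudication_round2.md",
]
print(flag_paths(staged))
```

A name-based check like this cannot detect sensitive content inside an otherwise harmless file, so it would complement the manual review rather than replace it.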

## Development

```bash
python -m pip install -r requirements.txt
python -m pytest -q
```

## Documentation Changes

Keep public files aligned with the project name `LegalScope`. Historical
internal names should not be introduced into public documentation.

## Privacy Review

Any proposed row-level data release should be reviewed for source licensing,
de-identification, re-identification risk, provider terms, and paper review
requirements before it is merged.
189 changes: 110 additions & 79 deletions README.md
@@ -1,40 +1,16 @@
# LegalScope: Measuring Exam-to-Case Transfer in LLM Legal Reasoning
# LegalScope

LegalScope studies a simple question with high stakes for legal AI evaluation:
do strong public legal-exam scores actually transfer to real-case legal reasoning?
**LegalScope** is a benchmark for measuring whether large language model
performance on public legal exams transfers to real-case legal analysis.

I built LegalScope as an independent first-author benchmark project that pairs scalable
public legal-exam tasks with lawyer-reviewed, de-identified Chinese civil judgment
analysis. The public repository is intentionally a preview: it documents the research
question, benchmark design, evaluation counts, scoring protocol, release boundary, and
reproducible helper code without publishing the paper draft, full workbook, model
outputs, human review sheets, or non-de-identified case materials.
The project pairs a scalable public-exam track with a lawyer-reviewed,
de-identified Chinese civil-judgment track. The public repository is a
research release scaffold: it documents the benchmark design, evaluation
counts, scoring protocol, public figures, metadata, and helper utilities while
withholding private workbooks, full prompts, model outputs, human review
sheets, and non-de-identified legal materials.

## Start Here

| If you want to understand... | Read |
| --- | --- |
| The research idea and motivation | [Project Brief](docs/PROJECT_BRIEF.md) |
| Main empirical findings and figures | [Results Summary](docs/RESULTS_SUMMARY.md) |
| Dataset scope and release boundary | [Data Card](docs/DATA_CARD.md) |
| Scoring design | [Scoring Rubric](docs/SCORING_RUBRIC.md) |
| Human validation protocol | [Annotation Protocol](docs/ANNOTATION_PROTOCOL.md) |

## What I Contributed

- Built a dual-track benchmark connecting public legal-exam evaluation with
lawyer-reviewed real-case legal analysis.
- Designed a paired issue-stance protocol for Chinese civil judgments, so the same
factual background can be tested under supporting and opposing legal positions.
- Developed two scoring protocols: reference-aware 0-4 exam scoring and a calibrated
real-case rubric for citation relevance, constraint extraction, and argument
validity.
- Validated automated scores against human legal review and identified constraint
extraction as the main real-case failure mode.

## Benchmark at a Glance

<img src="assets/figures/paper_collection_pipeline.png" alt="LegalScope benchmark construction pipeline" width="920">
## Snapshot

| Component | Count |
| --- | ---: |
@@ -45,71 +21,126 @@ outputs, human review sheets, or non-de-identified case materials.
| Model groups evaluated | 20 |
| Public-exam model responses | 17,360 |
| Real-case model responses | 1,520 |
| Total dataset model responses | 18,880 |
| Total model responses | 18,880 |
| Human-validation responses | 1,800 |

The pipeline figure above is rendered from `8.pdf`, which is referenced by the paper
source. The full paper PDF is not committed to this repository.
<img src="assets/figures/paper_collection_pipeline.png" alt="LegalScope benchmark construction pipeline" width="920">

## What LegalScope Tests

LegalScope asks whether exam performance is a reliable proxy for applied legal
reasoning. The public-exam track uses reference-aware open-ended legal-exam
questions. The real-case track asks models to write stance-aware Chinese legal
analysis over de-identified civil-judgment materials under closed-record
constraints.

The benchmark separates three questions that are often blurred together:

1. How well do models answer public legal-exam questions?
2. How well do the same models reason over bounded real-case legal facts?
3. Do rankings, reasoning-mode gains, and evaluator agreement transfer across
those settings?

## Main Findings

- Public-exam scores correlate with Chinese real-case scores at the model level
(Pearson `r = 0.835`, Spearman `rho = 0.661`), but rankings and reasoning-mode gains
do not transfer uniformly.
- Real-case legal reasoning exposes a constraint-extraction bottleneck: models write
fluent legal arguments more easily than they recover the operative legal and factual
conditions that control those arguments.
(Pearson `r = 0.835`, Spearman `rho = 0.661`), but the transfer is incomplete.
- Real-case legal reasoning exposes a constraint-extraction bottleneck: models
often produce fluent legal prose while missing operative rule conditions,
factual boundaries, stance requirements, or evidence limits.
- Automated evaluation aligns strongly with human review on public-exam answers
(answer-level Pearson `r = 0.925`) but weakens on real-case analysis
(`r = 0.450`), showing why expert-grounded evaluation remains important.
(`r = 0.450`), motivating expert-grounded validation for high-stakes legal
evaluation.
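The model-level correlations quoted above are standard statistics. As a minimal sketch of how Pearson and Spearman coefficients could be computed from per-model mean scores, the following uses only the standard library. The function names are illustrative and the scores are made-up placeholders, not LegalScope results; the project's actual analysis code is not published here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder per-model means (hypothetical values, one entry per model group).
exam_scores = [3.1, 2.8, 3.5, 2.2, 3.0]
case_scores = [2.0, 2.1, 2.6, 1.5, 1.9]
print(pearson(exam_scores, case_scores), spearman(exam_scores, case_scores))
```

Because Spearman depends only on rank order, it can be noticeably lower than Pearson when models with similar scores swap places between the two tracks.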

## Repository Map

```text
assets/figures/
paper_collection_pipeline.png
paper_score_distribution.png
paper_transfer_model_judge.png
paper_transfer_human.png
data/
README.md
metadata/dataset_summary.json
metadata/model_groups.csv
metadata/source_composition.csv
sample/README.md
docs/
PROJECT_BRIEF.md
RESULTS_SUMMARY.md
DATA_CARD.md
SCORING_RUBRIC.md
ANNOTATION_PROTOCOL.md
AI_WORKFLOW.md
FIGURE_SOURCES.md
RELEASE_STATUS.md
scripts/
extract_public_sample.py
src/legalscope/
workbook.py
tests/
test_workbook.py
| Path | Purpose |
| --- | --- |
| `assets/figures/` | Public-safe figures rendered from paper figure sources. |
| `data/metadata/` | Machine-readable counts, model groups, and source composition. |
| `data/sample/` | Reserved for public row-level samples after release review. |
| `docs/PROJECT_BRIEF.md` | Research question, benchmark design, and paper-facing summary. |
| `docs/RESULTS_SUMMARY.md` | Public-facing result figures and transfer metrics. |
| `docs/DATA_CARD.md` | Dataset scope, intended use, limitations, and release boundary. |
| `docs/SCORING_RUBRIC.md` | Public-exam and real-case scoring protocols. |
| `docs/ANNOTATION_PROTOCOL.md` | Human-validation protocol and review focus. |
| `docs/AI_WORKFLOW.md` | AI-assisted workflow and human-control safeguards. |
| `docs/PROVENANCE.md` | Public-safe construction and processing provenance. |
| `docs/HUGGINGFACE_RELEASE.md` | Hugging Face dataset-card release plan. |
| `scripts/` | Lightweight public helper scripts. |
| `src/legalscope/` | Small Python utilities for authorized local workbooks. |
| `tests/` | Unit tests for public helpers. |

## Quickstart

Install the dependencies and run the tests from a local checkout:

```bash
python -m pip install -r requirements.txt
python -m pytest -q
```

Inspect the public metadata:

```bash
python - <<'PY'
import json
from pathlib import Path

summary = json.loads(Path("data/metadata/dataset_summary.json").read_text())
print(summary["project"])
print(summary["counts"]["dataset_model_responses_total"])
PY
```
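The public counts can also be checked for internal consistency. The sketch below assumes the two track counts are stored next to the total; only `dataset_model_responses_total` appears in the snippet above, so the other two key names are hypothetical. The values are copied from the Snapshot table.

```python
def check_response_totals(counts):
    """Return True when the two track counts add up to the reported total."""
    expected = (
        counts["public_exam_model_responses"]
        + counts["real_case_model_responses"]
    )
    return expected == counts["dataset_model_responses_total"]


# Values from the Snapshot table; the first two key names are assumptions.
counts = {
    "public_exam_model_responses": 17360,
    "real_case_model_responses": 1520,
    "dataset_model_responses_total": 18880,
}
print(check_response_totals(counts))
```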

Use the workbook helpers only with authorized local workbooks:

```python
from legalscope.workbook import summarize_workbook

for sheet in summarize_workbook("private_authorized_workbook.xlsx"):
    print(sheet.title, sheet.data_rows, sheet.model_count)
```

## Public Release Boundary

This repository does not publish:

- the paper draft or PDF;
- the paper draft, review PDF, or LaTeX source;
- the full benchmark workbook;
- complete prompts, reference answers, model answers, or row-level model-output
matrices;
- lawyer review sheets or adjudication notes;
- complete prompt matrices, reference answers, model answers, or row-level
model-output tables;
- human review sheets, adjudication notes, or reviewer annotations;
- non-de-identified judgments or private source documents.

The public code is a reproducibility scaffold for collaborators with authorized local
access to the private workbook. It is not enough to reconstruct the full benchmark from
the public repository alone.
The public materials are sufficient to understand the research design, scope,
counts, release boundary, and public-facing results. They are not sufficient to
reconstruct the full benchmark.

## Hugging Face Release

A Hugging Face dataset-card-ready public preview is described in
[`docs/HUGGINGFACE_RELEASE.md`](docs/HUGGINGFACE_RELEASE.md). The recommended
first Hub release is a metadata-only preview containing this README-style
dataset card plus the public metadata files. Row-level samples should be added
only after source redistribution, privacy, and review constraints are cleared.

## Citation

Citation metadata is provided in [`CITATION.cff`](CITATION.cff). The final paper
citation should replace the placeholder citation once the paper has a stable
public identifier.

## License

Code and public documentation in this repository are released under the MIT
License unless otherwise noted. This license does not grant redistribution
rights for withheld source documents, full workbooks, model outputs, or private
review materials.

## Disclaimer

LegalScope is a research benchmark for model evaluation. It is not legal advice, a
legal research product, or a substitute for jurisdiction-specific legal review.
LegalScope is a research benchmark for model evaluation. It is not legal advice,
a legal research product, or a substitute for jurisdiction-specific legal
review.
82 changes: 61 additions & 21 deletions docs/AI_WORKFLOW.md
@@ -1,38 +1,78 @@
# AI-Assisted Research Workflow

LegalScope uses LLMs as research tools while keeping source selection, legal review,
release decisions, and paper claims under human control.
LegalScope uses AI systems as research tools while keeping dataset design,
source selection, de-identification, legal review, scoring decisions, release
decisions, and paper claims under human control.

## Pipeline Overview

1. Collect public legal-exam sources and de-identified civil-judgment materials.
1. Collect public legal-exam sources and candidate Chinese civil-judgment
materials.
2. Parse, normalize, redact, deduplicate, and audit source records.
3. Build standardized public-exam and real-case prompt templates.
3. Construct public-exam prompts and real-case issue-stance prompts.
4. Generate model answers across 20 model groups.
5. Score public-exam answers with reference-aware scoring.
6. Score real-case answers with the A/B/C legal-reasoning rubric.
7. Validate selected rows against human legal review.
8. Analyze transfer, human agreement, length effects, and error patterns.
5. Score public-exam answers with reference-aware 0-4 scoring.
6. Score real-case answers with the citation, constraint, and argument rubric.
7. Calibrate the real-case rubric against human legal review.
8. Analyze exam-to-case transfer, human agreement, score distributions, length
effects, and error patterns.
9. Prepare public documentation and metadata while withholding sensitive
artifacts.

## Public-Safe Script Provenance

The Drive public-material folder documents 78 copied and annotated pipeline
scripts grouped into five stages:

| Stage | Scripts | Public-safe description |
| --- | ---: | --- |
| Public bar source collection and cleaning | 16 | Collection, parsing, normalization, duplicate repair, reference repair, and source-audit scripts. |
| Chinese real-case prompt construction | 7 | Judgment preview, issue/stance prompt construction, repair, rerun, and workbook writeback scripts. |
| Model answer generation | 17 | Model catalog, provider runners, batch launchers, answer merge, and answer writeback scripts. |
| Scoring and regrading | 29 | Public-exam scoring, real-case rubric calibration, blind packets, V2 scoring, and validation utilities. |
| Final conversion, translation, and release | 9 | Workbook-to-JSON conversion, English cleanup, metadata repair, and public release packaging. |

Some internal scripts retain absolute paths or require provider credentials.
They should be treated as provenance records and rerun only after path,
credential, privacy, and redistribution checks.
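A path and credential check of the kind described above might start with a simple text scan. The following is a rough sketch under stated assumptions: the regular expressions and the `audit_script` helper are hypothetical, cover only a few common patterns, and are not taken from the LegalScope pipeline.

```python
import re

# Illustrative patterns only; real scripts may need a broader or stricter set.
# Quoted strings that start with a typical home or mount directory, or a drive letter.
ABSOLUTE_PATH = re.compile(r"""["'](?:/(?:Users|home|mnt)/|[A-Za-z]:\\)[^"']*["']""")
# Assignments of a quoted literal to a name containing api_key, secret, or token.
CREDENTIAL_HINT = re.compile(r"""(?i)\b\w*(?:api_key|secret|token)\w*\s*=\s*["'][^"']+["']""")


def audit_script(source):
    """Return (line_number, reason) pairs for lines that need review before a rerun."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        if ABSOLUTE_PATH.search(line):
            findings.append((number, "absolute path"))
        if CREDENTIAL_HINT.search(line):
            findings.append((number, "credential-like assignment"))
    return findings


# Made-up script text for the demonstration.
sample = (
    'WORKBOOK = "/Users/someone/private/workbook.xlsx"\n'
    'API_KEY = "not-a-real-key"\n'
    'OUT = "data/metadata/out.json"\n'
)
for finding in audit_script(sample):
    print(finding)
```

A scan like this only surfaces candidates for human review; it says nothing about privacy or redistribution constraints, which still need the separate checks listed above.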

## Where AI Assistance Is Used

AI tools may help draft transformation code, normalize text, prepare prompt templates,
generate model answers under controlled settings, and identify candidate failure modes
for inspection.
AI tools may help:

- draft transformation code;
- normalize and translate text;
- prepare prompt templates;
- generate model answers under controlled settings;
- score answers according to documented rubrics;
- identify candidate failure modes for human inspection;
- prepare release documentation.

## Human-Controlled Steps

AI tools do not replace:

AI tools do not replace source-selection decisions, de-identification review, final
legal judgment, manuscript claims, licensing review, or release decisions.
- source-selection decisions;
- privacy and de-identification review;
- legal-domain review;
- final scoring policy;
- human-validation judgments;
- licensing and redistribution decisions;
- paper claims and release decisions.

## Safeguards

- De-identification before public release.
- Separate scorer-side references and prompt-facing text.
- Stance and closed-book constraints for real-case prompts.
- Human validation for selected public-exam and real-case rows.
- Public release boundary for full prompts, model outputs, and review sheets.
- Separate public prompt-facing text from scorer-side references.
- Keep real-case prompts closed-record and stance-constrained.
- Mask names, institutions, identifiers, and other sensitive details before
public release.
- Validate selected rows with human legal review.
- Preserve a clear public/private release boundary.
- Withhold full prompts, model outputs, human review sheets, and non-de-identified
source documents until a later release review.

## Public Repository Boundary

This repository keeps documentation, selected paper figures, high-level metadata, and
small workbook utilities. Full data and review artifacts remain private until privacy,
licensing, and review constraints are resolved.
This repository includes documentation, selected figures, high-level metadata,
and lightweight helper code. The full data workflow remains private until
privacy, licensing, and review constraints are resolved.