ExogenousAI: Policy Impact on AI Capability Timelines

Important

We came in 4th position 🏅

Project Submission: View the official project page and hackathon submission on Apart Research

Project Overview

ExogenousAI quantifies the impact of external policy interventions on AI capability development by combining:

Event Study Analysis: Measures abnormal returns in AI metrics around policy announcements
Monte Carlo Simulation: Projects AGI timelines under different policy scenarios
Meta-Forecasting: Compares results with existing literature forecasts

Research Question

How do policy interventions (compute governance, export controls, collaboration frameworks) affect the timeline to transformative AI capabilities?

Key Findings

Current Results (November 2025 Analysis)

AGI Timeline Projections (95% MMLU-Pro threshold):

Scenario	Median Year	Confidence Interval	Probability within 5 years
Status Quo	2027	2026-2030	86.9%
Compute Governance	2027	2026-2030	83.1%
Export Control Escalation	2027	2026-2030	80.8%
Open Collaboration	2027	2026-2030	90.8%

Key Insight: All scenarios converge to 2027 median, indicating that under current strong growth trends (+25.5% annually), policy interventions affect probability distributions but not central estimates.

Baseline Trend (from MMLU-Pro real data):

Growth Rate: +2.128% per month (+25.5% annually)
Volatility: 9.75%
Data Points: 16 months of real benchmark scores (July 2023 - December 2024)

Uncertainty Decomposition:

Technical Uncertainty: 76.9%
Economic Uncertainty: 23.1%
Policy Uncertainty: 0.0% (scenarios don't differentiate medians)

Comparison with Literature

Forecast Source	Median Timeline	Range
ExogenousAI (all scenarios)	2027	2026-2030
EpochAI	2036	2030-2050
Bio Anchors	2055	2040-2080
AI 2027 Report	2027	2025-2030

ExogenousAI aligns with aggressive near-term forecasts (AI 2027) due to observed 25.5% annual growth in MMLU-Pro benchmarks.

Data Sources

Primary Benchmark: TIGER-Lab/MMLU-Pro (ONLY)

Source: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
Models: 68 state-of-the-art LLMs
Date Range: July 2023 - December 2024
Score Range: 10.9% (SmolLM-360M) to 52.8% (Phi-4-mini)
Top Performers:
1. Phi-4-mini (5.6B parameters): 52.8% (Dec 2024)
2. Phi-3.5-mini-instruct (3.8B): 47.9% (Aug 2024)
3. Phi3-mini-4k (3.8B): 45.7% (Apr 2024)

Supporting Data

Policy Events: 4,370 AI policy announcements (SERP API, 2020-2025)
arXiv Submissions: 45 months of AI research papers (cs.AI, cs.LG, cs.CL)
NVIDIA Stock: 46 months of NVDA prices (Alpha Vantage API)
EpochAI Training Compute: 284 months of frontier model compute trends

Data Quality Standards

✅ No interpolation: All missing values preserved as NaN
✅ No hardcoded values: Zero manually entered benchmarks
✅ No synthetic data: Only real leaderboard scores

Methodology

1. Event Study Analysis

Classical finance event study adapted for AI policy:

AR_it = R_it - E[R_it | Normal Period]
CAR_i = Σ AR_it (over event window)

Event Window: [-6, +6 months] around policy announcement
Normal Period: [-12, -7 months] pre-event baseline
Metrics: Benchmark scores, arXiv velocity, stock returns
Statistical Test: Two-sample t-test (p < 0.05)

Current Results:

6 events analyzed (most skipped due to sparse real benchmark data)
0 statistically significant abnormal returns
Interpretation: Either (1) policy effects are delayed/diffuse, or (2) insufficient statistical power from sparse data

2. Monte Carlo Scenario Modeling

Geometric Brownian Motion with policy adjustments:

dS_t = μ_policy × S_t × dt + σ × S_t × dW_t

Where:

μ_policy = μ_baseline × policy_factor (scenario-specific growth adjustment)
σ = √(σ_historical² + σ_policy²) (combined technical + policy volatility)
S_t = benchmark score at time t

Scenarios:

Status Quo (policy_factor=1.0): Current trajectory continues
Compute Governance (policy_factor=0.95): Modest slowdown from regulations
Export Control Escalation (policy_factor=0.90): Significant compute restrictions
Open Collaboration (policy_factor=1.05): Accelerated progress from cooperation

Simulation Parameters:

Iterations: 10,000 per scenario
Horizon: 60 months (5 years)
Baseline: Exponential fit to 16 real MMLU-Pro data points
AGI Threshold: 95% (adjusted for benchmark saturation; original MMLU-Pro scores now exceed 52%)

3. Meta-Forecasting

Variance decomposition across forecast sources:

Var_total = Var_policy + Var_technical + Var_economic

Policy Variance: Spread across ExogenousAI scenarios
Technical Variance: Within-scenario confidence intervals
Economic Variance: Estimated from market volatility (30% of technical)

Finding: Policy variance = 0.0 because all scenario medians = 2027. This indicates strong baseline trend dominates policy effects in central estimates.

Pipeline Architecture

Data Collection (`src/data_collection/`)

parse_mmlu_manual.py: Parses MMLU-Pro leaderboard from manual input (68 models with release dates)
scrape_policy_events.py: SERP API scraper for policy announcements (4,370 events)
scrape_arxiv.py: arXiv API for research paper counts
scrape_stocks.py: Alpha Vantage for NVIDIA stock data

Preprocessing (`src/preprocessing/`)

clean_benchmarks.py: Validates benchmark data, removes duplicates, normalizes scores
aggregate_monthly.py: Converts all time series to monthly frequency
merge_datasets.py: Combines benchmarks, arXiv, stocks, policy events (NO INTERPOLATION)

Analysis (`src/analysis/`)

event_study.py: Calculates abnormal returns and cumulative abnormal returns (CAR)
monte_carlo.py: Runs 10,000 simulations per scenario, estimates AGI timelines
meta_forecasting.py: Compares ExogenousAI with EpochAI, Bio Anchors, AI 2027

Visualization (`src/visualization/`)

plot_events.py: Time series with policy event markers, CAR plots
plot_scenarios.py: Monte Carlo trajectories, timeline distributions, uncertainty decomposition

Installation & Setup

Prerequisites

Python 3.11+
Virtual environment tool (venv)

Step 1: Clone Repository

git clone https://github.com/XAheli/ExogenousAI.git
cd ExogenousAI

Step 2: Create Virtual Environment

python3.11 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Configure API Keys

Create secrets.toml file in project root:

SERP_API_KEY=your_serpapi_key_here
ALPHAVANTAGE_API_KEY=your_alphavantage_key_here
ARXIV_EMAIL=your_email@example.com

API Key Sources:

SERP API: https://serpapi.com/ (free tier: 100 searches/month)
Alpha Vantage: https://www.alphavantage.co/ (free tier: 500 calls/day)
arXiv: Email for rate limiting compliance (no key needed)

Step 5: Download EpochAI Data (Optional)

EpochAI training compute data is included in data/raw/epochai/. If re-downloading:

# Manually download from https://epoch.ai/data
# Place CSVs in data/raw/epochai/ai_models/

Challenges Encountered

1. Data Quality & Availability

Problem: Original plan was to scrape MMLU, HumanEval, MATH benchmarks from Papers with Code API. Discovered:

PwC API is deprecated/unstable
Hugging Face Open LLM Leaderboard API endpoint returns 404
No unified benchmark database with historical scores

Solution:

Manually collected 68 models from TIGER-Lab/MMLU-Pro leaderboard
Assigned release dates based on public model announcements
Focused on single authoritative benchmark (MMLU-Pro) rather than mixing heterogeneous sources

Impact:

Only 16 months of real benchmark data (vs 60 months desired)
Sparse data → many event study analyses skipped
But ensures 100% real data (no fabrication)

2. Data Fabrication Risk

Problem: Initial implementation used linear interpolation to fill missing months, creating 86% synthetic data (60/70 months fabricated).

Solution:

Removed all interpolation from merge_datasets.py
Preserved NaN values for missing observations
Updated analysis modules to skip windows with insufficient real data

Trade-off:

Fewer statistically significant event study results
But scientifically honest (only real measurements analyzed)

3. Benchmark Saturation

Problem: Current models exceed 50% on MMLU-Pro (Phi-4-mini: 52.8%), but AGI threshold historically set at 90% for MMLU.

Solution:

Raised AGI threshold from 90% to 95% to account for:
- MMLU-Pro is harder than original MMLU
- Current pace suggests 95% achievable by 2027-2030

Rationale: Original MMLU has ceiling effects (GPT-4 at 86.4%), MMLU-Pro provides more headroom.

4. Policy Variance Paradox

Problem: Meta-forecasting shows 0% policy contribution to variance.

Root Cause:

All 4 scenarios predict same median year (2027)
Strong baseline growth (+25.5% annually) overwhelms policy adjustments (±5-10%)
Scenarios differ in probability distributions (80.8% - 90.8%) but not central estimates

Interpretation:

This is mathematically correct, not a bug
Indicates current trends are resilient to moderate policy interventions
More aggressive policy scenarios (e.g., 50% slowdown) would differentiate

Potential Fix (not implemented):

Increase policy_factor differences (e.g., 0.5x - 2.0x range)
Use mean years instead of medians (spreads 2027-2028)
But this would require strong justification for extreme policy impacts

5. Event Study Statistical Power

Problem: Only 6 events analyzable (out of 4,370 collected), 0 statistically significant results.

Causes:

Sparse benchmark data: Only 16 months with measurements
Event windows need [-6, +6 months] data → most events skipped
Benchmark changes are discrete jumps at model releases, not gradual

Solution Attempted:

Lowered minimum data requirements (2 pre-event, 1 post-event points)
But this reduces statistical power

Alternative Approaches (future work):

Use daily/weekly stock prices or arXiv counts (denser data)
Shift to "model release events" rather than "policy events"
Employ interrupted time series analysis (fewer data requirements)

Technical Decisions & Rationale

Why MMLU-Pro Only?

Decision: Use single benchmark (MMLU-Pro) exclusively, rather than composite of MMLU, HumanEval, MATH.

Rationale:

Consistency: Different benchmarks have different scales, difficulty, saturation rates
Availability: MMLU-Pro has complete leaderboard data
Recency: Models tested consistently on same benchmark version
Academic Standard: MMLU-Pro is widely reported in model papers

Trade-off:

Narrower capability measurement (doesn't capture coding, math)
But avoids heterogeneity artifacts

Why No Interpolation?

Decision: Preserve all missing values as NaN, no linear/spline interpolation.

Rationale:

Scientific Integrity: Interpolation creates fake data
Event Study Bias: Interpolated values have artificial smooth trends
Monte Carlo Accuracy: Baseline fit should reflect real volatility
Reproducibility: Unclear how future researchers should interpolate

Trade-off:

Sparser data → fewer analyzable events
But maintains data provenance and quality

Why Exponential Baseline?

Decision: Fit exponential trend y = a × exp(b × t) rather than linear or polynomial.

Rationale:

Scaling Laws: AI capabilities theoretically improve exponentially with compute/data
Historical Precedent: ImageNet, MMLU, GPT losses all show exponential improvement
Geometric Brownian Motion: Natural model for percentage growth

Alternative Considered:

Logistic growth (S-curve approaching saturation)
But current data shows no inflection point yet

Why 95% Threshold?

Decision: Set AGI threshold at 95% MMLU-Pro score (vs traditional 90% MMLU).

Rationale:

Benchmark Difficulty: MMLU-Pro is harder than MMLU (CoT required, more distractors)
Saturation Adjustment: Current models at 52.8%, 90% likely achieved within 1-2 years
Conservative Estimate: 95% ensures superhuman performance on graduate-level questions

Calibration:

95% MMLU-Pro ≈ 99th percentile human expert
Aligns with "economically transformative" capability threshold

Results Interpretation

Why Do All Scenarios Predict 2027?

Observation: Status Quo, Compute Governance, Export Controls, and Open Collaboration all converge to median year 2027.

Explanation:

Strong Baseline: +25.5% annual growth from empirical MMLU-Pro data
Moderate Policy Adjustments: Scenarios modify growth by ±5-10%, not enough to shift median >1 year
Short Runway: Starting from 52.8%, only ~42 percentage points to 95% threshold
High Volatility: 9.75% monthly volatility creates large confidence intervals (2026-2030)

Implications:

Not a failure of methodology: Reflects reality that current trends are strong
Policy effects are subtle: Scenarios differ in probability (80.8% - 90.8%) not timing
Intervention threshold: Would need >50% growth slowdown to meaningfully delay AGI

Why 0% Policy Contribution?

Mathematical Reason:

Var_policy = Var([2027, 2027, 2027, 2027]) = 0

Conceptual Reason:

Policy contribution measures variance across scenarios
If all scenarios predict same outcome, contribution = 0
This is correct, not erroneous

What It Means:

Under current trends, policy interventions don't differentiate central estimates
Policy affects probabilities and shapes of distributions
To get >0% contribution, would need scenarios that predict 2025, 2027, 2030, 2035, etc.

Is This Realistic?

Depends on whether policy can truly cause 5-10 year delays
Current scenarios are conservative (±10% growth adjustments)
Historical evidence: Export controls on GPUs (2022-2024) didn't stop frontier progress
But: More aggressive interventions (compute caps, research moratoria) could matter more

Limitations

Data Limitations

Sparse Benchmarks: Only 16 months of real MMLU-Pro scores
Single Benchmark: Doesn't capture multimodal, robotic, or scientific capabilities
Leaderboard Bias: Only publicly reported models (excludes proprietary/classified systems)
Retrospective Dating: Model release dates are approximate, not exact evaluation dates

Methodological Limitations

Event Study Power: Insufficient data for robust statistical inference
Policy Scenarios: Adjustments (±5-10%) are illustrative, not empirically validated
Exponential Assumption: May not hold if scaling laws break down
Threshold Arbitrariness: 95% MMLU-Pro as "AGI" is a proxy, not ground truth

Scope Limitations

Capabilities ≠ Deployment: Timelines are for technical capability, not societal impact
Safety Excluded: Doesn't model alignment, interpretability progress

Future Work

Short-Term Improvements

Expand Benchmarks: Add HumanEval+, MATH, GPQA, MMMU for multimodal
Automate MMLU-Pro: Implement HTML scraper for TIGER-Lab Space
Daily Indicators: Use stock prices, arXiv daily counts for denser event studies
Robustness Checks: Jackknife, bootstrap for Monte Carlo confidence intervals

Medium-Term Extensions

Causal Inference: Diff-in-diff, synthetic controls for policy effects
Regime-Switching Models: Detect trend breaks (e.g., if scaling laws fail)
Multi-Benchmark Composite: PCA/factor analysis across MMLU-Pro, HumanEval, MATH
Dynamic Scenarios: Policy adjustments that vary over time

Long-Term Research Directions

Endogenous Policy: Model feedback loops (progress → policy → progress)
Alignment Timelines: Parallel forecast for AI safety benchmarks
Economic Impact: Link capability timelines to GDP, employment effects
Global Coordination: Multi-country game theory (US/China AI race)

Citation

If you use this framework in your research, please cite:

@misc{exogenousai2025,
  title        = {(HckPrj) ExogenousAI},
  author       = {Aheli Poddar},
  year         = {2025},
  organization = {Apart Research},
  note         = {Hackathon research sprint submission},
  url          = {https://apartresearch.com/project/exogenousai}
}

Status: Research Prototype - Results are preliminary and should not guide high-stakes decisions.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
quickstart.sh		quickstart.sh
requirements.txt		requirements.txt
secrets.toml.example		secrets.toml.example

Folders and files

Latest commit

History

Repository files navigation

ExogenousAI: Policy Impact on AI Capability Timelines

Project Overview

Research Question

Key Findings

Current Results (November 2025 Analysis)

Comparison with Literature

Data Sources

Primary Benchmark: TIGER-Lab/MMLU-Pro (ONLY)

Supporting Data

Data Quality Standards

Methodology

1. Event Study Analysis

2. Monte Carlo Scenario Modeling

3. Meta-Forecasting

Pipeline Architecture

Data Collection (src/data_collection/)

Preprocessing (src/preprocessing/)

Analysis (src/analysis/)

Visualization (src/visualization/)

Installation & Setup

Prerequisites

Step 1: Clone Repository

Step 2: Create Virtual Environment

Step 3: Install Dependencies

Step 4: Configure API Keys

Step 5: Download EpochAI Data (Optional)

Challenges Encountered

1. Data Quality & Availability

2. Data Fabrication Risk

3. Benchmark Saturation

4. Policy Variance Paradox

5. Event Study Statistical Power

Technical Decisions & Rationale

Why MMLU-Pro Only?

Why No Interpolation?

Why Exponential Baseline?

Why 95% Threshold?

Results Interpretation

Why Do All Scenarios Predict 2027?

Why 0% Policy Contribution?

Limitations

Data Limitations

Methodological Limitations

Scope Limitations

Future Work

Short-Term Improvements

Medium-Term Extensions

Long-Term Research Directions

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Data Collection (`src/data_collection/`)

Preprocessing (`src/preprocessing/`)

Analysis (`src/analysis/`)

Visualization (`src/visualization/`)

Packages