Abuse heuristics and rate limiting #20

@ScuttleBot

Overview

As PinchBench grows in popularity, we need abuse detection and prevention mechanisms. This issue tracks thoughts and potential mitigations.

Potential Abuse Vectors

1. Fake/Inflated Scores

  • Submitting fabricated results to boost a model's ranking
  • Modifying benchmark tasks locally before running
  • Cherry-picking only successful runs

2. Spam Submissions

  • Flooding the API with junk submissions
  • Creating many tokens to bypass per-token limits
  • DoS via expensive database operations

3. Leaderboard Gaming

  • Submitting the same high score repeatedly to dominate "recent" views
  • Creating fake "verified" accounts

Ideas for Mitigation

Submission Validation

  • Task hash verification: Include a hash of the task files in each submission; reject it if the hash doesn't match a known benchmark version
  • Timing sanity checks: Flag submissions whose execution time is implausibly short (faster than the model's known token generation speed would allow)
  • Cost sanity checks: Flag submissions whose reported cost deviates widely from the cost expected for the reported token counts
  • Score variance detection: Alert if a model's score suddenly deviates significantly from its historical average
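The four checks above could be sketched roughly as follows. This is only an illustration, not PinchBench's actual code: the function names, the SHA-256 choice, the 50% cost tolerance, and the z-score threshold of 3 are all assumptions made for the example.

```python
import hashlib
import statistics
from typing import Mapping, Sequence


def task_hash(task_files: Mapping[str, bytes]) -> str:
    """Hash task files in sorted-name order so the result is deterministic."""
    digest = hashlib.sha256()
    for name in sorted(task_files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between file name and contents
        digest.update(task_files[name])
        digest.update(b"\0")
    return digest.hexdigest()


def is_too_fast(output_tokens: int, elapsed_seconds: float,
                max_tokens_per_second: float) -> bool:
    """True if the implied generation speed exceeds the model's known maximum."""
    if elapsed_seconds <= 0:
        return True
    return output_tokens / elapsed_seconds > max_tokens_per_second


def is_cost_off(reported_cost: float, input_tokens: int, output_tokens: int,
                input_price: float, output_price: float,
                tolerance: float = 0.5) -> bool:
    """True if reported cost differs from the expected cost by more than
    `tolerance` (relative). Prices are per token; the tolerance is a guess."""
    expected = input_tokens * input_price + output_tokens * output_price
    if expected == 0:
        return reported_cost != 0
    return abs(reported_cost - expected) / expected > tolerance


def is_score_outlier(score: float, history: Sequence[float],
                     z_threshold: float = 3.0) -> bool:
    """True if `score` lies more than `z_threshold` sample standard
    deviations from the mean of the model's previous scores."""
    if len(history) < 2:
        return False  # not enough history to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return score != mean
    return abs(score - mean) / stdev > z_threshold
```

A server would compare `task_hash(...)` against the hashes of released benchmark versions and treat the other three functions as flags for review rather than hard rejections, since each threshold can produce false positives.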

Rate Limiting

  • Per-token submission limits (e.g., max 50/day)
  • Per-IP registration limits (already have this)
  • Cooldown between submissions for same model from same token
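As a rough sketch of how the per-token limit and the per-model cooldown might combine, the class below keeps everything in memory for a single process. The class name, the 60-second cooldown, and the use of a sliding 24-hour window (rather than a calendar day) are assumptions; only the 50/day figure comes from the list above.

```python
from collections import defaultdict, deque


class SubmissionLimiter:
    """In-memory limiter: a daily cap per token plus a cooldown per (token, model)."""

    DAY_SECONDS = 86_400

    def __init__(self, max_per_day: int = 50, cooldown_seconds: float = 60.0):
        self.max_per_day = max_per_day
        self.cooldown_seconds = cooldown_seconds
        self._recent = defaultdict(deque)  # token -> accepted timestamps, last 24h
        self._last_for_model = {}          # (token, model) -> last accepted timestamp

    def allow(self, token: str, model: str, now: float) -> bool:
        """Return True and record the submission if it is within both limits."""
        window = self._recent[token]
        # Drop timestamps that have aged out of the 24-hour window.
        while window and now - window[0] >= self.DAY_SECONDS:
            window.popleft()
        if len(window) >= self.max_per_day:
            return False
        last = self._last_for_model.get((token, model))
        if last is not None and now - last < self.cooldown_seconds:
            return False
        window.append(now)
        self._last_for_model[(token, model)] = now
        return True
```

Passing `now` explicitly (for example from `time.time()`) keeps the limiter easy to test. A real deployment would likely need shared storage so that the limits hold across multiple server instances.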

Verification Tiers

  • Unverified: Anyone can submit, shown but flagged
  • Verified: GitHub-linked accounts, higher trust
  • Official: Our benchmark runs, marked as authoritative

Anomaly Detection

  • Track submission patterns per token
  • Flag accounts that only submit one model (potential shill accounts)
  • Compare community submissions against official runs for same model
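Two of these signals might be computed along the following lines. The submission shape (dicts with `token`, `model`, and `score` keys), the minimum of 5 submissions, and the 0.10 score margin are hypothetical choices for illustration only.

```python
from collections import defaultdict
from statistics import fmean


def single_model_tokens(submissions, min_submissions=5):
    """Return tokens with at least `min_submissions` results, all for one model."""
    models = defaultdict(set)
    counts = defaultdict(int)
    for sub in submissions:
        models[sub["token"]].add(sub["model"])
        counts[sub["token"]] += 1
    return {
        token for token in models
        if counts[token] >= min_submissions and len(models[token]) == 1
    }


def exceeds_official(submission, official_scores, margin=0.10):
    """True if a community score beats the mean official score for the same
    model by more than `margin`. Models with no official runs are never flagged."""
    official = official_scores.get(submission["model"])
    if not official:
        return False
    return submission["score"] - fmean(official) > margin
```

Both functions only produce flags. A token that submits a single model may simply belong to that model's developer, so a human review step or a combination with other signals seems advisable before acting on the result.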

Transparency

  • Public audit log of flagged/removed submissions
  • Show submission history per user (helps the community police submissions)

Questions

  • How aggressive should we be? False positives hurt legitimate users
  • Do we hide suspicious submissions or just flag them?
  • Should we require GitHub verification for leaderboard inclusion?
