Consider Adopting Python + Polars GTFS Validator #2129

@ryan-mahoney

Proposal

First, I want to acknowledge the incredible work MobilityData has put into the canonical validator. The rule coverage, the test infrastructure, and the care taken with the spec are genuinely impressive. This proposal is offered in that spirit, as a contribution to something worth building on, not a criticism of what exists.

Background

I created a Python port of the validator as part of an experiment while working on a different transit data tool I am building that incorporates the MobilityData validator as a core component. Here is the port: https://github.com/TransitOPS/gtfs-validator

The conversion was carried out using the SpecOps methodology developed by Mark Headd, a specification-driven approach to modernizing legacy systems that uses the existing spec and behavior as the authoritative source of truth rather than doing a direct code translation.

The core orchestrator and validations have already been converted to Python using Polars as the data processing layer. Some things that came out of that work:

  • Tested against real-world feeds. The implementation has been run against the SEPTA and MBTA feeds to validate compatibility with large, production GTFS data from major agencies.
  • Performance is comparable. Benchmarks across a range of feed sizes show similar throughput between the Java and Python implementations. Polars is Rust-backed and handles the join-heavy, filter-intensive work of GTFS validation without meaningful regression. The Python code can probably be optimized further; I did not spend long on performance.
  • Python has more test cases. The Python implementation has expanded test coverage beyond what exists in the Java codebase, partly because writing and running tests has less friction.
  • There is a template file for writing new validations. The repo includes a template designed to be handed to an LLM so that someone can describe a data quality issue in plain language and get a working, spec-aligned validation back with minimal effort.

I am also planning to add more GTFS pathways validation, starting with the open issues in this repo.

The core accessibility argument

Running the current validator requires a Java runtime, which for many non-developer users (transit planners, GIS analysts, researchers) can be an unfamiliar dependency to install and troubleshoot. For those audiences, this can be a real barrier to adoption. Even with many years of software experience, I find the Java tooling, and the language itself, off-putting.

There is a second dimension to this. A growing number of people work with data through LLM interfaces (Claude, ChatGPT, etc.). These environments typically have Python available out of the box but do not include a Java runtime. In practice, trying to use the Java validator there means the LLM session gets consumed by environment setup rather than the actual transit data problem.

With a Python validator, someone can upload a GTFS feed (or subset) to an LLM interface and get validation results with no local setup at all. The template file makes contributing new rules accessible to the same audience.

A simple referential integrity check in Polars illustrates the readability point:

import polars as pl

stops = pl.read_csv("stops.txt")
stop_times = pl.read_csv("stop_times.txt")

# Are there stop_times referencing stops that do not exist?
invalid = stop_times.filter(
    ~pl.col("stop_id").is_in(stops["stop_id"])
)

A transit planner can read that and understand what it is checking.

Existing backlog issues this would resolve

There are a number of open issues in this repo related to Java runtime friction, distribution challenges, and Java-specific technical debt. A Python-based implementation could resolve several of them at once rather than requiring separate solutions for each. It might be worth reviewing the backlog through that lens as part of this conversation, since fixing those issues individually would take effort without delivering much value to users.

Request for consideration

This is genuinely a request for conversation.

I understand this might be a controversial recommendation, and I won't be offended if it is dismissed without conversation or ruled out after discussion. My intent is to be helpful 🙏

I am happy to share the current work, answer questions, or help scope whatever would make this useful to the project. If you made it this far, thank you for reading!
