Hi — apologies for the unsolicited issue; feel free to close if it's not useful.
I run NeuroEval, an open, independent project that scores computational- and biohybrid-neuroscience papers on 7 weighted dimensions (0–100): data/stimulus disclosure, dataset/model resolvability, code availability, model-to-code traceability, model-spec clarity, reproducibility packaging, and neural-coding / information-theoretic rigor. It's automated (LLM-based), but the rubric and a scorer-validation study (test–retest + cross-vendor agreement, Pearson r ≈ 0.99) are documented in a citable report: https://doi.org/10.5281/zenodo.20690622.
BMTK scored 64/100 overall (69/100 on the transparency/reproducibility axes alone). One note so the number isn't misread: the 7th dimension (neural-coding / information-theoretic rigor, 25% weight) is not applicable to infrastructure/simulator tools, so it's scored a neutral 50 by design — it doesn't count against you.
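As a rough check on how the two figures above relate, here is a minimal sketch. It assumes the overall score is a plain weighted average, with 25% on the neutral 7th dimension and the remaining 75% on the transparency/reproducibility score. That combination rule and the function name `overall_score` are illustrative assumptions, not taken from the NeuroEval report, and the inputs are the rounded values quoted above.

```python
def overall_score(transparency_score, neutral_score=50.0, neutral_weight=0.25):
    """Combine the transparency score with the not-applicable dimension.

    Assumption: a simple weighted average; the actual scorer may differ.
    """
    return (1.0 - neutral_weight) * transparency_score + neutral_weight * neutral_score


# 69 on the transparency/reproducibility axes, neutral 50 on the 7th dimension
combined = overall_score(69.0)
print(round(combined))  # 64
```

Under that assumption, the neutral 50 accounts for the gap between 69 and 64.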
I'm reaching out because automated scoring can miss things a maintainer would catch instantly — a dataset that is declared, a CI/repro workflow I overlooked, a versioned release I didn't resolve. If anything looks wrong, I'll re-score — the scores are provisional and corrections are the whole point.
Two quick questions if you have a moment:
- Does the overall score match how you'd characterize the project's reproducibility?
- Is there a transparency artifact (data declaration, environment pin, repro script) you'd want an evaluator to find that isn't obvious from the paper?
Thanks for the tool — and for considering this. No reply needed if you'd rather pass.