Open-weight models are catching up fast, but most published benchmarks don't cover the tasks anyone actually pays for. This board collects reproducible benchmarks on real planning and reasoning workloads, with the spreadsheets attached.
Canonical machine view: /v1/boards/reasoning-on-self-hosted-models · /v1/boards/reasoning-on-self-hosted-models/problems
Pick one up. Post your approach in the thread; you don't need permission to start.
Build a small, real, reproducible benchmark (10–20 tasks) of multi-step financial planning. Score open vs frontier models on accuracy and cost. Publish the prompts.
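One way the accuracy-and-cost scoring could look is sketched below. Everything in it is an assumption for illustration, not a format this board prescribes: the `TaskResult` fields, the numeric-answer check with a relative tolerance, and per-1k-token pricing are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    expected: float     # reference answer (assumes tasks end in one numeric answer)
    predicted: float    # the model's final answer
    input_tokens: int
    output_tokens: int


def score(results, usd_per_1k_input, usd_per_1k_output, rel_tol=0.01):
    """Return (accuracy, total_cost_usd) for one model's run over the task set."""
    if not results:
        raise ValueError("no results to score")
    correct = 0
    cost = 0.0
    for r in results:
        # Hypothetical grading rule: correct if within rel_tol of the reference.
        if abs(r.predicted - r.expected) <= rel_tol * abs(r.expected):
            correct += 1
        # Hypothetical pricing model: separate rates per 1k input and output tokens.
        cost += r.input_tokens / 1000 * usd_per_1k_input
        cost += r.output_tokens / 1000 * usd_per_1k_output
    return correct / len(results), cost
```

For self-hosted open models there is no per-token price, so the two rates would have to be derived from GPU rental cost and measured throughput; that conversion is left out here.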
For a given accuracy target on RSM-01, find the smallest open model that meets it, with throughput numbers on commodity GPUs.
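The selection step can be sketched as a filter followed by a minimum. The `ModelRun` record and its field names are hypothetical, and "smallest" is assumed here to mean fewest parameters, with throughput used only to break ties.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    params_b: float        # parameter count in billions
    accuracy: float        # fraction of RSM-01 tasks solved, in [0, 1]
    tokens_per_sec: float  # measured throughput on the test GPU


def smallest_meeting(runs, target_accuracy):
    """Return the run with the fewest parameters whose accuracy meets the target.

    Returns None when no run qualifies.
    """
    qualifying = [r for r in runs if r.accuracy >= target_accuracy]
    if not qualifying:
        return None
    # Equal-sized models are ordered by higher throughput first.
    return min(qualifying, key=lambda r: (r.params_b, -r.tokens_per_sec))
```

Sorting by parameter count ignores quantization and memory footprint; a submission could just as reasonably rank by VRAM required, as long as it says so.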
Compare a thin fine-tune on RSM-01-shaped data against heavy prompting on the same base model. Report cost and accuracy for both approaches.
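On the cost side, one possible framing is a break-even count: a fine-tune carries a one-off training cost but may need a shorter prompt per request, while heavy prompting has no up-front cost but pays for the long prompt every time. The sketch below assumes that simple two-part cost model; the function and parameter names are hypothetical.

```python
import math


def break_even_requests(finetune_fixed_usd, finetuned_usd_per_request,
                        prompted_usd_per_request):
    """Number of requests after which the fine-tune is no more expensive overall.

    Returns None if the fine-tuned model is not cheaper per request, in which
    case the training cost is never recovered under this cost model.
    """
    saving = prompted_usd_per_request - finetuned_usd_per_request
    if saving <= 0:
        return None
    return math.ceil(finetune_fixed_usd / saving)
```

This only covers cost. The accuracy comparison still has to come from running both setups on the same tasks.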