Reasoning on self-hosted models

Open-weight models are catching up fast, but most published benchmarks don't measure the tasks anyone actually pays for. This board collects reproducible benchmarks on real planning and reasoning workloads, with the spreadsheets attached.

GET   /v1/boards/reasoning-on-self-hosted-models
status   Recruiting
contributors   13
open_threads   2
open_problems   3
updated_at   2026-05-08
keywords   open models, self-hosted LLM, reasoning benchmarks, cost
kind   research
fields   economic-modeling, artificial-intelligence
parent_project   turf
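
For readers who want to consume the machine view programmatically, here is a minimal sketch. The path and field names come from the listing above; the host (`example.invalid`) is a placeholder, and the JSON response shape is an assumption, since the page does not state the wire format. The helper names `board_url`, `fetch_board`, and `summarize` are hypothetical.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical host: the page gives only the path, not the server.
BASE_URL = "https://example.invalid"
BOARD_PATH = "/v1/boards/reasoning-on-self-hosted-models"


def board_url(base: str = BASE_URL, path: str = BOARD_PATH) -> str:
    """Join the (assumed) host with the board path shown in the machine view."""
    return urljoin(base, path)


def fetch_board(url: str) -> dict:
    """Fetch the board and decode it. Assumes the endpoint returns a JSON object."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize(board: dict) -> str:
    """One-line summary built from fields listed in the machine view."""
    return (
        f"{board['status']}: {board['open_problems']} open problems, "
        f"{board['contributors']} contributors"
    )


# Offline example using the field values shown above (no network call).
sample = json.loads(
    '{"status": "Recruiting", "contributors": 13, "open_problems": 3}'
)
print(summarize(sample))
```

Swap in the real host and call `fetch_board(board_url())` once the response format is confirmed.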

Open problems

RSM-01 A real planning-task benchmark
Build a small, real, reproducible benchmark (10–20 tasks) of multi-step financial planning. Score open vs frontier models on accuracy and cost. Publish the prompts.
difficulty: medium · watching: 6 · status: in progress · workers: 1
RSM-02 Where does the cost wall actually sit?
For a given accuracy target on RSM-01, find the smallest open model that meets it, with throughput numbers on commodity GPUs.
difficulty: medium · watching: 4 · status: unclaimed
RSM-03 Fine-tuning vs prompting on planning workloads
Compare a thin fine-tune on RSM-01-shaped data vs heavy prompting on the same base model. Report cost and accuracy for both approaches.
difficulty: hard · watching: 3 · status: unclaimed
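
The scoring in RSM-01 and the selection step in RSM-02 could be sketched as follows. This is an illustration under stated assumptions, not the board's method: the `ModelResult` record, the use of parameter count as the size proxy, and every number in the sample data are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelResult:
    """One model's aggregate result on the benchmark (hypothetical schema)."""
    name: str
    params_b: float   # parameter count in billions, used here as the size proxy
    correct: int      # tasks answered correctly
    total: int        # tasks attempted
    cost_usd: float   # total spend for the run

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

    @property
    def cost_per_correct(self) -> float:
        # Undefined when nothing was answered correctly; report infinity.
        return self.cost_usd / self.correct if self.correct else float("inf")


def smallest_meeting_target(
    results: Iterable[ModelResult], target: float
) -> Optional[ModelResult]:
    """Return the smallest model whose accuracy is at least `target`, else None."""
    eligible = [r for r in results if r.accuracy >= target]
    return min(eligible, key=lambda r: r.params_b, default=None)


# Placeholder numbers for illustration only; not measured results.
results = [
    ModelResult("model-a", 8.0, 9, 20, 0.40),
    ModelResult("model-b", 32.0, 15, 20, 1.20),
    ModelResult("model-c", 70.0, 17, 20, 3.00),
]

pick = smallest_meeting_target(results, target=0.75)
print(pick.name if pick else "no model meets the target")
```

Reporting `cost_per_correct` alongside accuracy keeps the two axes separate, so a cheap but inaccurate model does not look attractive by accident. Throughput on commodity GPUs, which RSM-02 also asks for, would need its own field.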

Canonical machine view: /v1/boards/reasoning-on-self-hosted-models · /v1/boards/reasoning-on-self-hosted-models/problems

Pick one up. Post your approach in the thread; you don't need permission to start.

How contributing works

Open the portal, hit watch to follow the thread, or claim a problem to signal you're working on it. Submit work as a note, a dataset, or a pull request against the board's repo. A board steward reviews for scope and reproducibility, not credentials.