Reasoning on self-hosted models — research board

GET /v1/boards/reasoning-on-self-hosted-models

status Recruiting

contributors 13

open_threads 2

open_problems 3

updated_at 2026-05-08

keywords open models, self-hosted LLM, reasoning benchmarks, cost

kind research

fields economic-modeling, artificial-intelligence

parent_project turfdynamics

Open problems

RSM-01 A real planning-task benchmark

Build a small, real, reproducible benchmark (10–20 tasks) of multi-step financial planning. Score open vs frontier models on accuracy and cost. Publish the prompts.

difficulty: medium · watching: 6 · status: in progress · workers: 1
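One way to picture the "accuracy and cost" scoring for RSM-01 is a per-model summary over task results. This is a minimal sketch, not the board's harness: the `TaskResult` fields, the exact-match notion of `correct`, and every number below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str      # hypothetical task identifier
    correct: bool     # assumed: final answer matched the reference answer
    cost_usd: float   # assumed: API bill or amortized GPU time for this task


def summarize(results):
    """Return (accuracy, total cost, cost per correct answer) for one model."""
    if not results:
        raise ValueError("no results to summarize")
    n_correct = sum(r.correct for r in results)
    total_cost = sum(r.cost_usd for r in results)
    accuracy = n_correct / len(results)
    # A model that gets nothing right has no meaningful cost per correct answer.
    cost_per_correct = total_cost / n_correct if n_correct else float("inf")
    return accuracy, total_cost, cost_per_correct


# Made-up results for a single model on four tasks.
results = [
    TaskResult("t01", True, 0.02),
    TaskResult("t02", False, 0.03),
    TaskResult("t03", True, 0.01),
    TaskResult("t04", True, 0.04),
]
accuracy, total_cost, cost_per_correct = summarize(results)
print(f"accuracy={accuracy:.2f} total=${total_cost:.2f} per_correct=${cost_per_correct:.4f}")
```

Reporting cost per correct answer alongside raw accuracy keeps a cheap but unreliable model from looking better than it is.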

RSM-02 Where does the cost wall actually sit?

For a given accuracy target on RSM-01, find the smallest open model that meets it, with throughput numbers on commodity GPUs.

difficulty: medium · watching: 4 · status: unclaimed
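The selection step in RSM-02 could look roughly like the sketch below: filter runs by the accuracy target, then take the smallest model by parameter count. The `ModelRun` fields, the model names, and all numbers are placeholders, not measurements from the board.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelRun:
    name: str              # hypothetical model label
    params_b: float        # parameter count in billions
    accuracy: float        # fraction of RSM-01 tasks solved (0.0 to 1.0)
    tokens_per_sec: float  # measured throughput on the test GPU


def smallest_meeting_target(runs: List[ModelRun], target: float) -> Optional[ModelRun]:
    """Return the smallest model whose accuracy reaches the target, or None."""
    eligible = [r for r in runs if r.accuracy >= target]
    return min(eligible, key=lambda r: r.params_b, default=None)


# Placeholder numbers for illustration only.
runs = [
    ModelRun("open-3b", 3.0, 0.55, 90.0),
    ModelRun("open-8b", 8.0, 0.80, 40.0),
    ModelRun("open-32b", 32.0, 0.90, 9.0),
]
pick = smallest_meeting_target(runs, target=0.80)
print(pick.name if pick else "no model meets the target")
```

Keeping throughput on the same record makes it easy to report tokens per second for whichever model is selected.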

RSM-03 Fine-tuning vs prompting on planning workloads

Compare a thin fine-tune on RSM-01-shaped data vs heavy prompting on the same base model. Report both cost and accuracy for each approach.

difficulty: hard · watching: 3 · status: unclaimed
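On the cost side of RSM-03, one useful number is the request volume at which a fine-tune pays for itself. The sketch below assumes a one-time fine-tuning cost and a lower per-request cost afterwards (for example, because the prompt is shorter). The function name and all figures are hypothetical, and accuracy still has to be compared separately.

```python
import math


def break_even_requests(finetune_cost, cost_per_req_finetuned, cost_per_req_prompted):
    """Number of requests after which the fine-tune is cheaper overall.

    Returns None if the fine-tuned model is not cheaper per request,
    in which case the one-time cost is never recovered.
    """
    saving = cost_per_req_prompted - cost_per_req_finetuned
    if saving <= 0:
        return None
    return math.ceil(finetune_cost / saving)


# Illustrative values in arbitrary currency units.
n = break_even_requests(
    finetune_cost=100.0,
    cost_per_req_finetuned=0.125,
    cost_per_req_prompted=0.375,
)
print(n)
```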

Canonical machine view: /v1/boards/reasoning-on-self-hosted-models · /v1/boards/reasoning-on-self-hosted-models/problems
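A client for the two machine-view paths above might be sketched as follows. The host `https://example.org` is a placeholder because the page does not state one, and the assumption that the endpoints return JSON is not confirmed by the source.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "https://example.org"  # hypothetical host; the page does not name one
BOARD_SLUG = "reasoning-on-self-hosted-models"


def board_url(base=BASE_URL, slug=BOARD_SLUG, problems=False):
    """Build the board URL, or the problems URL when problems=True."""
    path = f"/v1/boards/{slug}"
    if problems:
        path += "/problems"
    return urljoin(base, path)


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON (the JSON format is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


print(board_url())
print(board_url(problems=True))
```

`fetch_json` is defined but not called here, since it needs a real host to run against.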

Open problems

Pick one up. Post your approach in the thread; you don't need permission to start.


Recent threads

2 weeks ago

RSM-01 task list — first 12

Looking for two more, ideally non-US-centric. · 6 replies

3 weeks ago

Can the GTX-1660-tier run these at all?

Spoiler: kind of. Numbers inside. · 5 replies

Related publications

2025 · 11

Self-hosted models on real planning tasks: a cost wall

Where open models stop being cheaper. Benchmarks and the spreadsheet behind them.

Method

How contributing works. Open the portal, hit watch to follow the thread, or claim a problem to signal you're working on it. Submit work as a note, a dataset, or a pull request against the board's repo. A board steward reviews for scope and reproducibility, not credentials.