{"slug":"reasoning-on-self-hosted-models","title":"Reasoning on self-hosted models","tag":"Open models","category":"models","status":"Recruiting","kind":"research","fields":["economic-modeling","artificial-intelligence"],"parent_project":"turf","abstract":"Open-weight models are catching up fast, but most published benchmarks aren't the tasks anyone actually pays for. This board collects reproducible benchmarks on real planning and reasoning workloads, with the spreadsheets attached.","description":"How far can open, self-hosted LLMs go on real financial and planning tasks before cost or accuracy forces a frontier model? Benchmarks welcome.","keywords":["open models","self-hosted LLM","reasoning benchmarks","cost"],"contributors":13,"open_threads":2,"open_problems":3,"updated_at":"2026-05-08","lead":{"handle":"elena-volkova","name":"Elena Volkova","role":"Steward, Reasoning on self-hosted models"},"core":[{"handle":"elena-volkova","name":"Elena Volkova","role":"Steward, Reasoning on self-hosted models","affiliation":"Independent researcher"},{"handle":"jenny-lin","name":"Jenny Lin","role":"ML lead, generative design","affiliation":"PhD, ETH Zürich"}],"problems":[{"id":"RSM-01","title":"A real planning-task benchmark","body":"Build a small, real, reproducible benchmark (10–20 tasks) of multi-step financial planning. Score open vs frontier models on accuracy and cost. Publish the prompts.","difficulty":"medium","watching":6,"status":"in progress","workers":1},{"id":"RSM-02","title":"Where does the cost wall actually sit?","body":"For a given accuracy target on RSM-01, find the smallest open model that meets it, with throughput numbers on commodity GPUs.","difficulty":"medium","watching":4,"status":"unclaimed"},{"id":"RSM-03","title":"Fine-tuning vs prompting on planning workloads","body":"Compare a thin fine-tune on RSM-01-shaped data vs heavy prompting on the same base model. Cost + accuracy across both axes.","difficulty":"hard","watching":3,"status":"unclaimed"}],"recent_threads":[{"slug":"rsm-01-task-list","title":"RSM-01 task list — first 12","blurb":"Looking for two more, ideally non-US-centric.","authorHandle":"elena-volkova","postedAgo":"2 weeks ago","replies":6},{"slug":"gtx-1660-feasibility","title":"Can the GTX-1660-tier run these at all?","blurb":"Spoiler: kind of. Numbers inside.","authorHandle":"elena-volkova","postedAgo":"3 weeks ago","replies":5}],"related_publications":[{"slug":"self-hosted-models-cost-wall","title":"Self-hosted models on real planning tasks: a cost wall","kind":"Method","published":"2025-11-22","url":"https://xooplab.com/publications/self-hosted-models-cost-wall"}],"url":"https://xooplab.com/boards/reasoning-on-self-hosted-models"}