When Generic Benchmarks Fail: Building a Sales-Domain Evaluation Bench from Scratch
How I built Tenacious-Bench — a 240-task domain-specific benchmark for a B2B sales agent — trained a SimPO LoRA judge, and lifted held-out preference accurac...
Dev.to | May 2, 2026 | Nati A