AI article
We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"
We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of...
Dev.to | Jun 11, 2026 | Maya Andersson