We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"

We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of...

Dev.to | Jun 11, 2026 | Maya Andersson
