More eval traces will not stabilize your kappa. Stratify the ones you have

TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63...

Dev.to | Jun 9, 2026 | Maya Andersson
