Tech article
More eval traces will not stabilize your kappa. Stratify the ones you have
TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63...
Dev.to | Jun 9, 2026 | Maya Andersson