Hotnews2

lesswrong

lesswrong 23h ago

Grounding Coding Agents via Dixit

[Epistemic status: ideas in this post are mine. I've published them previously in the form summarized by Claude, but this got auto-rejected. Here, I present them in my own voice. The ideas are still not evaluated, but I am working on implementing them to see if this works in practice. Still, the ideas presented here are my best bet on what could work in practice.
lesswrong 41h ago

ARENA 7.0 Impact Report

The impact report from ARENA’s previous iteration, ARENA 6.0, is available here. Summary: ARENA 7.0 took place at the London Initiative for Safe AI (LISA) between January 5th and February 6th, 2026.
lesswrong 37h ago

Untrusted monitoring: extra bits

The following are some further notes related to untrusted monitoring I had while working on our untrusted monitoring paper. The sections are mostly independent of each other. Untrusted-only Monitoring Doesn’t Work: In some of our experiments we looked at the situation where the trusted monitor TM is missing (untrusted-only monitoring) as a means of stress-testing the situation where it is not useful.
lesswrong 1h ago

Noticing a Teacher's Password Pattern

Yudkowsky writes about Guessing the Teacher's Password as an abstract educational concept. At a young age, perhaps ten years old, I had guessed one commonly used meta-password: in the Finnish school system it's typical for multiple-choice answers to include options that are somewhat similar, and often the actual answer can be reasoned out without knowing much at all about the actual topic. Here's an example from the 2024 admission exam for technical universities.