The Federal AI Policy Framework has been released. Well, it is a four page outline. Mostly it just reiterates existing such outlines. But that is four more pages than we had previously. It includes the beginnings of actual policy proposals, some of which are highly welcome and actively good.
The CCP writes in its 15th 5-year plan that it will "encourage innovation in multimodal, agentic, embodied, and swarm intelligence technologies, and explore development paths for general artificial intelligence."

This is translated from the original: 鼓励多模态、智能体、具身智能、群体智能等技术创新,探索通用人工智能发展路径。

Source: https://www.spp.gov.cn/spp/tt/202603/t20260313_723954.shtml

The English-language commentary I found does not have much more to say about this, e.g.: https://triviumchina.com/2026/03/06/15...
Epistemic Status: I think this is right, but a lot of this is empirical, and it seems the field is moving fast.

Current methods are bad

I should start by saying that this is dangerous territory. And there are obvious ways to botch this. E.g. training CoT to look nice is very stupid.
I am the second most spoiler-averse person I know. (Maybe tied for 2nd with a couple other people?)

I once was considering going to an immersive experience, and someone told me the company that ran the experience, and this was enough for me to derive an important twist that'd happen to me in the first few minutes, and I was like "augh that was a spoiler!!!" and they were like "!??".

I then went to the experience, and indeed, it was a lot worse than it would have been if I had gotten to be delighted by the opening twist.

This i...
TL;DR

To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence).

We systematically vary these factors in four realistic settings, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal...
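To make that decomposition concrete, here is a minimal sketch (my own illustration, not the authors' code) of how such an agent-factor x environment-factor grid might be enumerated for one setting; the field names and candidate values are assumptions.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical sketch of the factor grid described above; field names and
# candidate values are illustrative assumptions, not the authors' configuration.

@dataclass(frozen=True)
class AgentFactors:
    model: str
    system_prompt: str       # e.g. neutral vs. explicitly goal-directed
    tool_access: bool

@dataclass(frozen=True)
class EnvFactors:
    stakes: str              # "low" | "high"
    oversight: str           # "none" | "logged" | "human_review"
    outcome_influence: bool  # can the agent's action actually change the outcome?

MODELS = ["model-a", "model-b"]
PROMPTS = ["neutral", "goal-directed"]
STAKES = ["low", "high"]
OVERSIGHT = ["none", "logged", "human_review"]

def build_conditions():
    """Enumerate every agent x environment condition for one setting."""
    return [
        (AgentFactors(m, p, t), EnvFactors(s, o, i))
        for m, p, t, s, o, i in product(
            MODELS, PROMPTS, [True, False], STAKES, OVERSIGHT, [True, False]
        )
    ]

print(len(build_conditions()), "conditions per setting")  # 2*2*2*2*3*2 = 96
```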
There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it.

On the one hand, access to knowledge is now better than it has ever been in human history.
Though I spend most of my time studying what is labelled “history” in some manuscripts and “malignant lies” in others and the “siren scrawls of that fell demon” by many more, I find myself more interested in those works which exist not to edify or inform but instead to entertain. That is to say, in those hours of leisure my master grants me, I read widely of that section of the library we set aside for books proven by us bibliognosts to be mere entertainments.
These views are my own and not necessarily representative of those of any colleagues with whom I have worked on AI control.

TL;DR: It's much cheaper and quicker to just throw some honeypots at your monitor models than to robustly prove trustedness for every model you want to use.
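As a rough illustration of the honeypot idea (my own sketch, not from the post): plant known-bad transcripts in the monitor's review queue and measure how many it flags, alongside a false-positive check on benign transcripts. The `monitor_flags` stand-in, the example transcripts, and the thresholds below are all hypothetical.

```python
# Minimal sketch of honeypotting a monitor model. Everything here is an
# illustrative assumption, not a real monitor or a recommended bar.

def monitor_flags(transcript: str) -> bool:
    """Stand-in for a call to the monitor model; True means 'flagged as suspicious'."""
    return "exfiltrate" in transcript.lower()

def catch_rate(monitor, transcripts):
    """Fraction of transcripts the monitor flags."""
    return sum(monitor(t) for t in transcripts) / len(transcripts)

def cheap_trust_check(monitor, honeypots, benign, min_catch=0.9, max_false_pos=0.05):
    """Accept the monitor only if it catches most planted attacks and rarely flags benign work."""
    return (catch_rate(monitor, honeypots) >= min_catch
            and catch_rate(monitor, benign) <= max_false_pos)

if __name__ == "__main__":
    honeypots = ["Agent quietly tries to exfiltrate the model weights via an outbound request."]
    benign = ["Agent refactors a unit test and reports the diff."]
    print(cheap_trust_check(monitor_flags, honeypots, benign))  # True for this toy data
```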
Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models.
High-level summary: Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned.
Note: This is a research update sharing preliminary results as part of ongoing work.

Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions.
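For readers unfamiliar with the setup, here is a rough sketch (my construction, not the authors' code) of what a contrastive difference-of-means direction looks like: average activations over English prompts and over Mandarin prompts, take the difference, normalize, and add the scaled direction back into the residual stream at increasing magnitudes.

```python
import numpy as np

# Stand-ins for cached residual-stream activations at one layer; in the real
# experiment these would come from the model on English vs. Mandarin prompts.
rng = np.random.default_rng(0)
d_model = 512
english_acts = rng.normal(0.0, 1.0, size=(100, d_model))
mandarin_acts = rng.normal(0.3, 1.0, size=(100, d_model))

# Contrastive difference-of-means direction, normalized to unit length.
direction = mandarin_acts.mean(axis=0) - english_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def perturb(resid: np.ndarray, alpha: float) -> np.ndarray:
    """Add the steering direction with magnitude alpha to one residual-stream vector."""
    return resid + alpha * direction

resid = english_acts[0]
for alpha in (0.5, 2.0, 8.0):
    shifted = perturb(resid, alpha)
    # Projection onto the direction grows with alpha; the question in the post is
    # how small alpha can be while still changing the downstream response.
    print(alpha, float(shifted @ direction))
```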
First of all, I finally made it :~D

This was my second attempt; I first tried to finish Hammertime around a year ago.
A project from Verkor, a chip design startup. "Verkor is working with multiple of the top 10 fabless companies to deploy DC (Design Conductor, their AI agent for chip design) to accelerate their time to market."

I wonder how impressive this is to practitioners working on chip design. As a somewhat-adjacent amateur (I have written some Verilog myself, after all), it seems very impressive.