Hotnews

lesswrong 43h ago 19°

The Federal AI Policy Framework: An Improvement, But My Offer Is (Still Almost) Nothing

The Federal AI Policy Framework has been released. Well, it is a four page outline. Mostly it just reiterates existing such outlines. But that is four more pages than we had previously. It includes the beginnings of actual policy proposals, some of which are highly welcome and actively good.
lesswrong 19h ago 14°

China declares AGI development to be a part of 5-year plan

The CCP writes in its 15th 5-year plan that it will: "Encourage innovation in multimodal, agentic, embodied, and swarm intelligence technologies, and explore development paths for general artificial intelligence." This is translated from the original: 鼓励多模态、智能体、具身智能、群体智能等技术创新,探索通用人工智能发展路径。 Source: https://www.spp.gov.cn/spp/tt/202603/t20260313_723954.shtml The English-language commentary I found does not have much more to say about this, e.g.: https://triviumchina.com/2026/03/06/15...
lesswrong 37h ago 14°

The Future of Aligning Deep Learning systems will probably look like "training on interp"

Epistemic Status: I think this is right, but a lot of this is empirical, and it seems the field is moving fast. Current methods are bad. I should start by saying that this is dangerous territory, and there are obvious ways to botch this. E.g. training CoT to look nice is very stupid.
lesswrong 8h ago 13°

The Toy Story Saga is not yet finished

I am the second most spoiler-averse person I know. I once was considering going to an immersive experience, and someone told me the company that ran the experience, and this was enough for me to derive an important twist that'd happen to me in the first few minutes, and I was like "augh that was a spoiler!!!" and they were like "!??". I then went to the experience, and indeed, it was a lot worse than it would have been if I had gotten to be delighted by the opening twist. This i...
lesswrong 14h ago 10°

Key to Life No. 9: Access

There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it. On the one hand, access to knowledge is now better than it has ever been in human history.
lesswrong 15h ago 10°

Understanding when and why agents scheme

TL;DR: To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence). We systematically vary these factors in four realistic settings, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal...
lesswrong 45h ago

The Distaff Texts

Though I spend most of my time studying what is labelled “history” in some manuscripts and “malignant lies” in others and the “siren scrawls of that fell demon” by many more, I find myself more interested in those works which exist not to edify or inform but instead to entertain. That is to say, in those hours of leisure my master grants me, I read widely of that section of the library we set aside for books proven by us bibliognosts to be mere entertainments.
lesswrong 8h ago

Pre-Review of Toy Story 5

I am the second most spoiler-averse person I know. (Maybe tied for 2nd with a couple other people?).
lesswrong 46h ago

Untrusted Monitoring is Default; Trusted Monitoring is not

These views are my own and not necessarily representative of those of any colleagues with whom I have worked on AI control. TL;DR: It's much cheaper and quicker to just throw some honeypots at your monitor models than to robustly prove trustedness for every model you want to use.
lesswrong 44h ago

Confusion around the term reward hacking

Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models.
lesswrong 33h ago

The Hot Mess Paper Conflates Three Distinct Failure Modes

High-level summary: Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned.
lesswrong 39h ago

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Note: This is a research update sharing preliminary results as part of ongoing work. Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions.
lesswrong 14h ago

My Hammertime Final Exam

Firstly, I finally made it :~D It's my second attempt; I first tried to finish Hammertime around a year ago.
lesswrong 37h ago

An agent autonomously builds a 1.5 GHz Linux-capable RISC-V CPU

A project from Verkor, a chip design startup. "Verkor is working with multiple of the top 10 fabless companies to deploy DC (Design Conductor; their AI agent for chip design) to accelerate their time to market". I wonder how impressive this is for practitioners working on chip design. As a somewhat-adjacent amateur (I have written some Verilog myself), it seems very impressive.
lesswrong 17h ago

China Derangement Syndrome

Often I see people claim it’s essential for America to win the AI race against China (in whatever sense) for reasons like these: “What is the reason we want America to win the AI race? It’s because we want to make sure free open societies can defend themselves” (Alec Stapp) “We should seek to win the race to global AI technological superiority and ensure that China does not…
lesswrong 25h ago

Grounding Coding Agents via Dixit

[Epistemic status: ideas in this post are mine. I've published them previously in the form summarized by Claude, but this got auto-rejected. Here, I present them in my own voice. The ideas are still not evaluated, but I am working on implementing them to see if this works in practice. Still, the ideas presented here are my best bet on what could work in practice.