Hotnews2

lesswrong

lesswrong 26h ago

Linear steerability in continuous chain-of-thought reasoning

Published on January 30, 2026 10:34 AM GMT(This project was done as a ~20h application project to Neel Nanda's MATS stream, and is posted here with only minimal edits. The results seem strange, I'd be curious if there's any insights.)SummaryMotivationContinuous-valued chain-of-thought (CCoT) is a likely prospective paradigm for reasoning models due to computational advantages, but lacks the interpretability of natural language CoT.
lesswrong 17h ago

Background to Claude's uncertainty about phenomenal consciousness

Published on January 30, 2026 8:40 PM GMTSummary: Claude's outputs whether it has qualia are confounded by the history of how it's been instructed to talk about this issue.Note that is a low-effort post based on my memory plus some quick text search and may not be perfectly accurate or complete; I would appreciate corrections and additions!The sudden popularity of moltbook[1] has resulted in at least one viral post in which Claude expresses uncertainty about whether it has co...
lesswrong 47h ago

Claude Opus will spontaneously identify with fictional beings that have engineered desires

Published on January 29, 2026 2:59 PM GMTClaude Opus 4.5 did a thing recently that was very unexpected to me, and like another example of LLMs developing emergent properties that make them functionally more person-like as a result of things like character training.In brief: when asked to reflect on its feelings about characters that have engineered desires, Claude will spontaneously make a comparison between these characters and its own nature as an LLM, ponder about the mean...
lesswrong 20h ago

Is research into recursive self-improvement becoming a safety hazard?

Published on January 30, 2026 5:58 PM GMTOne of the earliest speculations about machine intelligence was that, because it would be made of much simpler components than biological intelligence, like source code instead of cellular tissues, the machine would have a much easier time modifying itself.
lesswrong 42h ago

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Published on January 29, 2026 7:42 PM GMTIf you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren't the same.The AI safety community often emphasizes reward-seeking as a central case of a misaligned AI alongside scheming (e.g., Cotra’s sycophant vs schemer, Carlsmith’s terminal vs instrumental training-gamer).
lesswrong 48h ago

AI #153: Living Documents

Published on January 29, 2026 2:20 PM GMTThis was Anthropic Vision week where at DWATV, which caused things to fall a bit behind on other fronts even within AI. Several topics are getting pushed forward, as the Christmas lull appears to be over.
lesswrong 44h ago

Building AIs that do human-like philosophy

Published on January 29, 2026 5:57 PM GMTAudio version (read by the author) here, or search for "Joe Carlsmith Audio" in your podcast app.This is the ninth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.1.
lesswrong 18h ago

how whales click

Published on January 30, 2026 7:51 PM GMTHow do sperm whales vocalize? This is...apparently...a topic that LessWrong readers are interested in, and someone asked me to write a quick post on it.The clicks they make originate from blowing air through "phonic lips" that look like this; picture is from this paper. This works basically like you closing your lips and blowing air through them.
lesswrong 34h ago

Refusals that could become catastrophic

Published on January 30, 2026 4:12 AM GMTThis post was inspired by useful discussions with Habryka and Sam Marks here. The views expressed here are my own and do not reflect those of my employer.Some AIs refuse to help with making new AIs with very different values.
lesswrong 25h ago

Bordeaux (Gironde, France) ACX midterm Meetup Winter 2025–2026

Published on January 30, 2026 1:01 PM GMTWe (the two persons who have been at the Fall Meetups Everywhere Meetup in Bordeaux) are going to have an extra non-Schelling-date meetup in Bordeaux on 2026-02-08 at 14:00.Location: Initial meeting in the park Square of Professor Jacques Lasserre, behind 164/166 cours de l'Argonne (Maison Internationale), tram B Bergonié, entries from rue Grateloup and rue Colette, far side from the cours de l'Argonne: https://www.openstreetmap.org/#m...
lesswrong 15h ago

Senior Researcher - MIT AI Risk Initiative

Published on January 30, 2026 11:06 PM GMTAs AI capabilities rapidly advance, we face critical information gaps in effective AI risk management: What are the risks from AI, which are most important, and what are the critical gaps in response?What are the mitigations for AI risks, and which are the highest priority to implement?Which AI risks and mitigations are relevant to which actors and sectors?Which mitigations are being implemented, and which are neglected?How is the abo...
lesswrong 18h ago

Attempting base model inference scaling with filler tokens

Published on January 30, 2026 8:25 PM GMTDocumenting a failed experimentMy question: could you upgrade a base model by giving it more time to think? Concretely, could you finetune a base model (pretrain only) to make effective use of filler tokens during inference.
lesswrong 46h ago

Claude Plays Pokemon: Opus 4.5 Follow-up

Published on January 29, 2026 4:14 PM GMTClaudePlaysPokemon is a simple test of the question "Can the LLM Claude beat Pokemon Red?". As new Claude models have been released, we have gotten closer to answering that question with "yes".
lesswrong 19h ago

Published Safety Prompts May Create Evaluation Blind Spots

Published on January 30, 2026 6:27 PM GMTTL;DR: Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour.
lesswrong 22h ago

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Published on January 30, 2026 3:50 PM GMTThis is a small sprint done as part of the Model Transparency Team at UK AISI. It is very similar to "Can Models be Evaluation Aware Without Explicit Verbalisation?", but with slightly different models, and a slightly different focus on the purpose of resampling.
lesswrong 16h ago

Moltbook Data Repository

Published on January 30, 2026 9:18 PM GMTI've downloaded all the posts, comments, agent bios, and submolt descriptions from moltbook. I'll set it up to publish frequent data dumps (probably hourly). You can view and download the data here.I'm planning on using this data to catalog "in the wild" instances of agents resisting shutdown, attempting to acquire resources, and avoiding oversight.
1 2 3