Published on January 30, 2026 10:34 AM GMT(This project was done as a ~20h application project to Neel Nanda's MATS stream, and is posted here with only minimal edits. The results seem strange, I'd be curious if there's any insights.)SummaryMotivationContinuous-valued chain-of-thought (CCoT) is a likely prospective paradigm for reasoning models due to computational advantages, but lacks the interpretability of natural language CoT.
Published on January 30, 2026 8:40 PM GMTSummary: Claude's outputs whether it has qualia are confounded by the history of how it's been instructed to talk about this issue.Note that is a low-effort post based on my memory plus some quick text search and may not be perfectly accurate or complete; I would appreciate corrections and additions!The sudden popularity of moltbook[1] has resulted in at least one viral post in which Claude expresses uncertainty about whether it has co...
Published on January 29, 2026 2:59 PM GMTClaude Opus 4.5 did a thing recently that was very unexpected to me, and like another example of LLMs developing emergent properties that make them functionally more person-like as a result of things like character training.In brief: when asked to reflect on its feelings about characters that have engineered desires, Claude will spontaneously make a comparison between these characters and its own nature as an LLM, ponder about the mean...
Published on January 30, 2026 5:58 PM GMTOne of the earliest speculations about machine intelligence was that, because it would be made of much simpler components than biological intelligence, like source code instead of cellular tissues, the machine would have a much easier time modifying itself.
Published on January 29, 2026 7:42 PM GMTIf you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren't the same.The AI safety community often emphasizes reward-seeking as a central case of a misaligned AI alongside scheming (e.g., Cotra’s sycophant vs schemer, Carlsmith’s terminal vs instrumental training-gamer).
Published on January 29, 2026 2:20 PM GMTThis was Anthropic Vision week where at DWATV, which caused things to fall a bit behind on other fronts even within AI. Several topics are getting pushed forward, as the Christmas lull appears to be over.
Published on January 29, 2026 5:57 PM GMTAudio version (read by the author) here, or search for "Joe Carlsmith Audio" in your podcast app.This is the ninth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.1.
Published on January 30, 2026 7:51 PM GMTHow do sperm whales vocalize? This is...apparently...a topic that LessWrong readers are interested in, and someone asked me to write a quick post on it.The clicks they make originate from blowing air through "phonic lips" that look like this; picture is from this paper. This works basically like you closing your lips and blowing air through them.
Published on January 30, 2026 4:12 AM GMTThis post was inspired by useful discussions with Habryka and Sam Marks here. The views expressed here are my own and do not reflect those of my employer.Some AIs refuse to help with making new AIs with very different values.
Published on January 30, 2026 1:01 PM GMTWe (the two persons who have been at the Fall Meetups Everywhere Meetup in Bordeaux) are going to have an extra non-Schelling-date meetup in Bordeaux on 2026-02-08 at 14:00.Location: Initial meeting in the park Square of Professor Jacques Lasserre, behind 164/166 cours de l'Argonne (Maison Internationale), tram B Bergonié, entries from rue Grateloup and rue Colette, far side from the cours de l'Argonne: https://www.openstreetmap.org/#m...
Published on January 30, 2026 11:06 PM GMTAs AI capabilities rapidly advance, we face critical information gaps in effective AI risk management: What are the risks from AI, which are most important, and what are the critical gaps in response?What are the mitigations for AI risks, and which are the highest priority to implement?Which AI risks and mitigations are relevant to which actors and sectors?Which mitigations are being implemented, and which are neglected?How is the abo...
Published on January 30, 2026 8:25 PM GMTDocumenting a failed experimentMy question: could you upgrade a base model by giving it more time to think? Concretely, could you finetune a base model (pretrain only) to make effective use of filler tokens during inference.
Published on January 29, 2026 4:14 PM GMTClaudePlaysPokemon is a simple test of the question "Can the LLM Claude beat Pokemon Red?". As new Claude models have been released, we have gotten closer to answering that question with "yes".
Published on January 30, 2026 6:27 PM GMTTL;DR: Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour.
Published on January 30, 2026 3:50 PM GMTThis is a small sprint done as part of the Model Transparency Team at UK AISI. It is very similar to "Can Models be Evaluation Aware Without Explicit Verbalisation?", but with slightly different models, and a slightly different focus on the purpose of resampling.
Published on January 30, 2026 9:18 PM GMTI've downloaded all the posts, comments, agent bios, and submolt descriptions from moltbook. I'll set it up to publish frequent data dumps (probably hourly). You can view and download the data here.I'm planning on using this data to catalog "in the wild" instances of agents resisting shutdown, attempting to acquire resources, and avoiding oversight.