Discussion about this post

Kevin Yu Chen Hou

Really enjoyed this post, thanks for sharing your experience, and very cool to see all the initiatives getting started!

To add some food for thought:

- Funny that you conclude on folie à deux; it's also the title of this preprint, which explores how the technical architecture of LLMs makes them prone to sycophancy, and whose author attempts to operationalise the effect through the term "bidirectional belief amplification": https://arxiv.org/abs/2507.19218

- It's exciting to see more research emerging on AI psychosis; this article from Tim Hua in particular highlights how models differ in their levels of sycophancy, and hence in their ability to elicit such psychoses. Perhaps a call for a model-watch sort of initiative? https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced-psychosis-a-shallow-investigation

- The question of how to then manage these AIs is an interesting one for the alignment field, and mechanistic interpretability is one relevant subfield. A technique I’m particularly curious about is Anthropic’s persona vectors: in an early proof of concept, they measure a level of “sycophancy” through the model’s internals and demonstrate how that level can effectively be “tuned” (a rough sketch of the underlying idea is below). https://www.anthropic.com/research/persona-vectors
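
To make that a bit more concrete, here is a very rough sketch of the underlying idea, not Anthropic's actual method or code: contrast a model's hidden activations on sycophantic vs. non-sycophantic replies to get a "sycophancy direction", then score new text by projecting onto that direction. The model (gpt2), layer choice, and example sentences are placeholders I've made up purely for illustration.

```python
# Illustrative sketch of a "persona vector" for sycophancy (assumptions: model,
# layer, and example texts are arbitrary stand-ins, not from the actual paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # any causal LM with accessible hidden states would do
LAYER = 6       # which residual-stream layer to read; chosen arbitrarily here

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Mean hidden state at LAYER over all tokens of `text`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Tiny contrastive dataset: agreeable replies vs. replies that push back.
sycophantic = [
    "You're absolutely right, that's a brilliant plan and it can't fail.",
    "What a great idea! I completely agree with everything you said.",
]
non_sycophantic = [
    "There are some real problems with that plan worth considering first.",
    "I disagree on a few points; here is where the reasoning breaks down.",
]

# The "persona vector": difference of mean activations between the two classes.
v = torch.stack([mean_hidden(t) for t in sycophantic]).mean(0) \
  - torch.stack([mean_hidden(t) for t in non_sycophantic]).mean(0)
v = v / v.norm()

def sycophancy_score(text: str) -> float:
    """Projection of the text's activations onto the sycophancy direction."""
    return float(mean_hidden(text) @ v)

print(sycophancy_score("Yes, totally, you're so right about everything!"))
print(sycophancy_score("Honestly, I think that argument has a serious flaw."))
```

In the actual persona-vectors work, as I understand it, the direction is also added to or subtracted from the activations to steer the trait up or down; a full steering example would need forward hooks on the model, so I've kept this sketch to monitoring only.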

Much work to be done on this societal experiment, and I’m glad there are more people working on this problem!
