That's a really nice survey!! I really like the way Sec. 3.2.1 lays it out.
On rogue AI... yeah, I kind of didn't want to get into the discussion about AI agency / intrinsic motivations / etc. here, but tbh I've always struggled to imagine the mechanism of that risk pathway, and it's felt like the least substantiated one. Like, at some point someone gave the AI instructions with the goal of having it do something, right?
OK, I looked back at some writing on the topic, and (1) people often do imagine humans giving power to AI systems before the AI does anything bad to humans, but (2) the "instructions" still remain well-intentioned, with the harms arising from misaligned goals in the AI.
See Sec. 5 of https://arxiv.org/pdf/2306.12001.pdf, particularly the story on pg 41, and Sec. 4.3 of https://arxiv.org/pdf/2209.00626.pdf.
So I think the main concern is that instrumental goals, proxy gaming, and goal drift all make it likely that an AI ends up with some goals that diverge from human intentions, no matter what the instructions say. Add in some deception, and an AI that humans trust enough to cede some power to ends up turning on them.
This paper, like many others, assumes the AI has goals. But GPT-4 doesn't want anything, and there is no need to add volition to GPT-17 either.
GPT-4 doesn't want anything, but language models can and will be turned into agents.
See this related post: https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
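For concreteness, here's roughly what I mean by "turned into agents", as a toy Python sketch (names like `call_llm` and `run_tool` are placeholders, not any real framework's API): the model itself just predicts text, but a simple outer loop supplies a goal and feeds the model's own outputs back in as actions.

```python
# Toy sketch of wrapping a goalless text model in an agent loop.
# `call_llm` and `run_tool` are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    # Placeholder for any text-completion model.
    return "DONE"

def run_tool(action: str) -> str:
    # Placeholder for executing the model's chosen action (search, code, ...).
    return f"(result of {action!r})"

def agent_loop(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The model itself has no goals or memory; the loop supplies both
        # by re-prompting with the goal and past observations each step.
        action = call_llm(history + "Next action (or DONE):")
        if action.strip() == "DONE":
            break
        observation = run_tool(action)
        history += f"Action: {action}\nObservation: {observation}\n"
    return history
```

The point being: nothing about the model has to "want" anything for the combined system to behave like a goal-directed agent.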
You can use an LLM that works as an Oracle to verify the actions of the Agent AI. I described it here: https://medium.com/@jan.matusiewicz/autonomous-agi-with-solved-alignment-problem-49e6561b8295
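Roughly the shape of it (a toy sketch of my reading of the setup, not the exact design from the post; `agent_llm` and `oracle_llm` are placeholder callables): the Agent proposes an action, and a separate non-agentic Oracle model has to approve it before anything executes.

```python
# Toy sketch of the "Oracle checks the Agent" pattern: a second,
# question-answering-only model must approve each proposed action
# before it is executed. Placeholder callables, not a real API.
from typing import Callable

LLM = Callable[[str], str]

def propose_action(agent_llm: LLM, goal: str, history: str) -> str:
    return agent_llm(f"Goal: {goal}\nHistory:\n{history}\nPropose the next action:")

def oracle_approves(oracle_llm: LLM, goal: str, action: str) -> bool:
    # The Oracle never acts; it only answers a yes/no question about the action.
    verdict = oracle_llm(
        "Would executing this action be safe and consistent with the stated goal? "
        f"Answer YES or NO.\nGoal: {goal}\nAction: {action}"
    )
    return verdict.strip().upper().startswith("YES")

def guarded_step(agent_llm: LLM, oracle_llm: LLM, goal: str, history: str) -> str:
    action = propose_action(agent_llm, goal, history)
    if not oracle_approves(oracle_llm, goal, action):
        raise RuntimeError(f"Oracle vetoed action: {action}")
    return action  # only vetted actions are passed on to whatever executes them
```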