So many great quotes here! I've been seeing so many different opinions on 'AI alignment' that it has been difficult to get an accurate picture of what's going on. Really appreciate this overview.
thanks for reading!!
A nice essay!
Unfortunately, I don't see how alignment can be solved at all, as long as there is more than one human who desires power. People are fundamentally not aligned and will use AI to further their agenda (not necessarily with bad intent). It is a little surprising that a philosopher would call it "a technical problem", while it seems to be a part of the human condition.
The most we can hope for is risk mitigation, and for that we need much better fundamental understanding of the principles of statistical learning than what we currently have.
thanks for reading! I'm probably also skeptical of capital-s "Solving" alignment but still feel like maybe we can do better than we're currently doing. i hope theory can help with that!
I hope we can do better. I don't see any alternative to fundamental theory, though. These models are already too smart/human-like to think we can control them any other way. The only other option is restricting access, which seems totalitarian and probably ultimately ineffective.
Nice post!
> It’s just a bit too coincidental to me that the major alignment research directions just so happen to be incredibly well-designed to building better products.
Agreed, this is also a big worry of mine. See also this paper critical of RLHF (slight self-plug): https://arxiv.org/pdf/2307.15217.pdf
> the pathways to AI x-risk ultimately require a society where relying on — and trusting — algorithms for making consequential decisions is not only commonplace, but encouraged and incentivized…All of the AI x-risk scenarios involve a world where we have decided to abdicate responsibility to an algorithm.
I think this is notably missing the argument from people worried about rogue AI. The worry there is about an unaligned AI breaking out, without us intentionally ceding power to it. I'm not saying that's likely or unlikely – it's just missing an argument I often hear. Maybe even doing some large-scale pretraining of an AI model with the idea of it being useful downstream for replacing human labor counts as abdicating responsibility? Or there's the case where some tech person with developer access to GPT-7 decides to create the next autoGPT and hook it up to the internet, and maybe that leads to x-risk. That isn't really abdication of any significant societal responsibility to AI so much as OpenAI selling a useful email/code-first-draft machine and it then going poorly for us when made into an agent.
> There is a rich and nuanced discussion to be had about when and whether algorithms can be used to improve human decision-making… And there is a large community of activists, academics, and community organizers who have been pushing this conversation for years.
True and important
that's a really nice survey!! really like the way s3.2.1 lays it out.
on rogue AI... yeah I kind of didn't want to get into the discussion about AI agency / intrinsic motivations / etc. here, but tbh i've always really struggled imagining the mechanism of that risk pathway and felt it was the least substantiated. Like at some point someone gave it instructions with the goal of having it do something, right?
Ok, I looked back at some writing on the topic, and (1) people often do imagine humans giving power to AI systems before the AI does bad things to humans, but (2) the "instructions" still remain well-intentioned, with the harms arising from misaligned goals in the AI.
See Sec. 5 of https://arxiv.org/pdf/2306.12001.pdf, particularly the story on pg 41, and Sec. 4.3 of https://arxiv.org/pdf/2209.00626.pdf.
So I think the main concern is that instrumental goals / proxy gaming / goal drift are all likely to mean that an AI has some goals that diverge from human intentions, no matter what the instructions are. Then you add in some deception, and an AI which humans trust enough to cede some power to ends up turning on humans.
This paper, like many others, assumes AI has goals. But GPT-4 doesn't want anything, and there is no need to add volition to GPT-17 either.
GPT-4 doesn't want anything, but language models can and will be turned into agents.
See this related post: https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
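For what it's worth, "turning a language model into an agent" usually just means wrapping it in a loop that feeds its outputs back in as tool calls. A toy sketch of that scaffolding (Python; `call_llm` and the tool set are hypothetical placeholders, not any particular vendor's API):

```python
# Toy agent loop: the model itself has no goals, but the scaffolding around it
# repeatedly asks it for an action and then executes whatever comes back.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever chat-completion API you use."""
    raise NotImplementedError

TOOLS = {
    "search": lambda arg: f"(pretend search results for {arg!r})",
    "write_file": lambda arg: "(pretend the file was written)",
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # Ask the model for its next action in a simple "tool: argument" format.
        reply = call_llm(history + "Next action as 'tool: argument', or DONE:")
        if reply.strip() == "DONE":
            break
        tool_name, _, argument = reply.partition(":")
        tool = TOOLS.get(tool_name.strip(), lambda a: "unknown tool")
        observation = tool(argument.strip())
        history += f"Action: {reply}\nObservation: {observation}\n"
    return history
```

The model still doesn't "want" anything here; it's the loop around it that makes it act in the world.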
You can use an LLM that works as an Oracle to verify the actions of the Agent AI. I described it here: https://medium.com/@jan.matusiewicz/autonomous-agi-with-solved-alignment-problem-49e6561b8295
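If I'm reading the proposal right, the idea is to put a second, tool-less LLM in the loop as a gate on the agent's proposed actions before they run. A rough sketch under that assumption (reusing the hypothetical `call_llm` from above; the approval prompt is just a placeholder):

```python
def oracle_approves(proposed_action: str, context: str) -> bool:
    """Ask a separate, tool-less 'oracle' model whether a proposed action is safe.

    The oracle never acts itself; it only answers a yes/no question about the
    agent's plan, which is what lets it serve as a verifier.
    """
    verdict = call_llm(
        "You are a safety reviewer. Reply with exactly APPROVE or REJECT.\n"
        f"Context:\n{context}\n"
        f"Proposed action:\n{proposed_action}\n"
    )
    return verdict.strip().upper().startswith("APPROVE")

# Inside the agent loop above, each action would only run if the oracle signs off:
#     if oracle_approves(reply, history):
#         observation = tool(argument.strip())
#     else:
#         observation = "action rejected by oracle"
```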
Seconding what Peter is saying about the argument for rogue AI being missing. Historically, the largest concern in the alignment community has been misaligned AI arising from instrumental convergence: for example, if you put an AI in charge of producing a lot of paperclips, it ends up taking over the world. Maybe you thought it was safe because you had only instructed it to operate in a very limited domain, but the AI decided that the best way to achieve that goal was to take over the world. And maybe very little power had been ceded to AI, but it didn't really matter because it out-smarted you.
(Another classic thought experiment involves an AI in a box. Eliezer claimed it would break out. In this scenario, you hadn't ceded any power, but that wasn't sufficient to stop the AI.)
As an EA-aligned person who's concerned about AI x-risk, I found a lot to agree with here!
I'm pretty wary of OpenAI / Anthropic's alignment strategies (but not sure if they’re wrong). And I completely agree that the technical aspect of alignment is just one part of a larger problem, which involves many society-wide decisions not to deploy AI in dangerous ways. I also think “critihype” is a real concern — my understanding is that similar concerns have been responsible for the AI x-risk movement’s wariness about things like government lobbying. If you say "AI will be powerful and might kill us all" to someone at the Pentagon, the Pentagon might just hear "AI will be powerful [and we should get it first]," making the problem worse.
(I don't think AI x-risk advocates have been extremely successful at avoiding things like critihype elsewhere. E.g., OpenAI was founded largely due to x-risk concerns and might now be making things worse.)
I think my main disagreement is that you seem to envision ‘retaining human responsibility’ as the default path that we would have to actively *decide* to move away from. I think technological change doesn’t usually look like this. If AI models are able to automate lots of human jobs, then any company / organization / government that automates those jobs will have a big efficiency advantage, putting pressure on every other organization to also automate. Preventing this means preventing *each individual* organization from grabbing what might look like huge, cheap efficiency gains. Talking about a “we” who are deciding whether or not to abdicate responsibility obscures the difficulty, since “we” are not a unified body and aren’t by default making worldwide collective decisions about AI at all.
It’s exactly because of the (potential) advanced capabilities of AI that I expect “deciding not to abdicate responsibility” to be extremely difficult, and highly unlikely without targeted preparation and regulation.
This heavily influences whether it makes sense to discuss AI x-risk. If not abdicating responsibility is relatively easy, then people talking about x-risk are possibly making things worse: hyping up the tech, possibly giving rhetorical cover for more AI work that somehow leads to more abdication of responsibility.
In my view, talking about advanced AI is the only way to be ready to do the extremely difficult task of *retaining* control at the crucial moments.
But I want to close by noting that, at least from this perspective, our positions still aren’t all that far apart. Plausibly they mostly boil down to an empirical question about how quickly AI capabilities will advance, and how far, rather than anything deeply ideological.
"The “financial sidequest” sidesteps both of these issues, which captures my real concern here: the existence of financial incentives means that alignment work often turns into product development in disguise rather than actually making progress on mitigating long-term harms"
It's worth noting that the alignment community has had these kinds of worries from the beginning. Admittedly, there was very little public discussion of this until recently. It seems like many people kept their concerns about OpenAI, for example, on the low for quite a while, as they didn't want to antagonise them or make it harder to influence them, but over time people have started being more willing to speak up about these concerns and the possibility that certain companies may be engaged in "safety-washing" (https://www.lesswrong.com/posts/PY3HEHc5fMQkTrQo4/beware-safety-washing).
Nice piece. Of course, a (relatively) short piece like this can't go into the details of a decades-long debate (granted: much of it esoteric). What I really do like about this article is that it makes sure to get some important stuff right. Technology isn't deterministic; we can influence its course. There's path-dependency and all that, but still. It's about recognizing that responsibility, which, it seems to me, is the opposite of speculating about how AI might take over.
One can sense (this has been palpable in the last year or so, and before that mostly in the esoteric communities) that the fear and the hope of AI reaching problematic levels of capability come in the same bag; they travel together, and many of the top people in the industry carry this weird mixture of fear and hope around. Psychologists of social behavior who used to study prepper mentalities should jump at this group as an object of study.
But actual catastrophic outcomes, as this article points out, depend on us giving up power, decisions, control, responsibility, judgment and so on in sequential steps. So it's not really about levels of intelligence but about our willingness to hand these things over to machines, a desire that is driven largely by the same old boring capitalism. Now, we should remember that while we must question claims of technological determinism, we must still pay attention to the dynamics of complex systems at scale; while not autonomous or fully automatic, they will behave in ways we didn't foresee. This, of course, is an old story and requires no impressive intelligence on the part of the machine.
Anyway, here's a quote from the article which I think sums up what we must not forget in these discussions, and this is what makes it an important contribution:
"The punchline is this: the pathways to AI x-risk ultimately require a society where relying on — and trusting — algorithms for making consequential decisions is not only commonplace, but encouraged and incentivized. It is precisely this world that the breathless speculation about AI capabilities makes real."
We build the world that, when built, becomes hard to oversee and control. And we do it by bolstering a discourse about the dangers of such a world. By presenting it at once as a grand vision and a mass devastation. I think it's consistent with Turing's take on this in 1950: why bother discussing whether machines can or will be intelligent – soon enough (end of 20th century was his hunch), people will talk and behave as though they are.
To me, this is always what it comes down to: what are we (those in power and those with a right to vote and protest) willing to accept as true and reasonable?