This graph shows how many times the word ______ has been mentioned throughout the history of the program.
If you extrapolate the curves that we've had so far: we're starting to get to PhD level, last year we were at undergraduate level, and the year before we were at the level of a high school student. You can quibble about at what tasks and for what. We're still missing modalities, but those are being added: computer use was added, image generation has been added. If you just eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027. I think there are still worlds where it doesn't happen in a hundred years, but the number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years.

The scale-up is very quick. We do this today: we make a model, and then we deploy thousands, maybe tens of thousands, of instances of it. Certainly within two to three years, whether we have these super-powerful AIs or not, clusters are going to get to the size where you'll be able to deploy millions of these.

I am optimistic about meaning. I worry about economics and the concentration of power. That's actually what I worry about more: the abuse of power. AI increases the amount of power in the world, and if you concentrate that power and abuse that power, it can do immeasurable damage.

Yes. It's very frightening.
The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude, which is currently and often at the top of most LLM benchmark leaderboards. On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other topics. I'm also joined afterwards by two other brilliant people from Anthropic. First, Amanda Askell, who is a researcher working on alignment and fine-tuning of Claude, including the design of Claude's character and personality. A few folks told me she has probably talked with Claude more than any human at Anthropic, so she was definitely a fascinating person to talk to about prompt engineering and practical advice on how to get the best out of Claude. After that, Chris Olah stopped by for a chat. He's one of the pioneers of the field of mechanistic interpretability, an exciting set of efforts that aims to reverse-engineer neural networks to figure out what's going on inside, inferring behaviors from neural activation patterns inside the network. This is a very promising approach for keeping future superintelligent AI systems safe, for example by detecting from the activations when the model is trying to deceive the human it is talking to. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dario Amodei.

Let's start with the big idea of scaling laws and the scaling hypothesis. What is it? What is its history? And where do we stand today?
So I can only describe it as it relates to my own experience, but I've been in the AI field for about ten years, and it was something I noticed very early on. I first joined the AI world when I was working at Baidu with Andrew Ng in late 2014, which is almost exactly ten years ago now. The first thing we worked on was speech recognition systems. In those days, deep learning was a new thing. It had made lots of progress, but everyone was always saying: we don't have the algorithms we need to succeed, we're only matching a tiny, tiny fraction, there's so much we need to discover algorithmically, we haven't found the picture of how to match the human brain.

In some ways it was fortunate. You can have almost beginner's luck, right? I was a newcomer to the field, and I looked at the neural net that we were using for speech, the recurrent neural networks, and I said, I don't know, what if you make them bigger and give them more layers? And what if you scale up the data along with this? I just saw these as independent dials that you could turn. And I noticed that the models started to do better and better as you gave them more data, as you made them larger, as you trained them for longer. I didn't measure things precisely in those days, but along with colleagues, we very much got the informal sense that the more data and the more compute and the more training you put into these models, the better they perform.

Initially my thinking was, hey, maybe that's just true for speech recognition systems. Maybe that's one particular quirk, one particular area. I think it wasn't until 2017, when I first saw the results from GPT-1, that it clicked for me that language is probably the area in which we can do this. We can get trillions of words of language data, and we can train on them. The models we were training in those days were tiny; you could train them on one to eight GPUs, whereas now we train jobs on tens of thousands, soon going to hundreds of thousands, of GPUs.
When I saw those two things together, and there were a few people, like Ilya Sutskever, who you've interviewed, who had somewhat similar views. He might have been the first one, although I think a few people came to similar views around the same time: there was Rich Sutton's bitter lesson, there was Gwern writing about the scaling hypothesis. But I think somewhere between 2014 and 2017 was when it really clicked for me, when I really got conviction that, hey, we're going to be able to do these incredibly wide cognitive tasks if we just scale up the models.

At every stage of scaling there are always arguments, and when I first heard them, honestly, I thought: probably I'm the one who's wrong, and all these experts in the field are right, they know the situation better than I do. There's the Chomsky argument, that you can get syntax but you can't get semantics. There was the idea that you can make a sentence make sense, but you can't make a paragraph make sense. The latest ones we have today are that we're going to run out of data, or the data isn't high quality enough, or models can't reason. And each time, every time, we manage to either find a way around it, or scaling just is the way around it. Sometimes it's one, sometimes it's the other.

So I'm now at this point where I still think it's always quite uncertain. We have nothing but inductive inference to tell us that the next few years are going to be like the last ten years. But I've seen the movie enough times, I've seen the story happen enough times, to really believe that probably the scaling is going to continue, and that there's some magic to it that we haven't really explained on a theoretical basis yet.
And of course the scaling here is bigger networks, bigger data, bigger compute.

Yes. In particular, linear scaling up of bigger networks, bigger training times, and more and more data. All of these things: it's almost like a chemical reaction. You have three ingredients in the chemical reaction, and you need to linearly scale up the three ingredients. If you scale up one and not the others, you run out of the other reagents and the reaction stops. But if you scale up everything in series, then the reaction can proceed.

And of course, now that you have this kind of empirical science slash art, you can apply it to other, more nuanced things, like scaling laws applied to interpretability, or scaling laws applied to post-training, or just seeing how this thing scales. But the big scaling law, I guess the underlying scaling hypothesis, has to do with big networks and big data leading to intelligence.

Yeah. We've documented scaling laws in lots of domains other than language. Initially, the paper we did that first showed it was in early 2020, where we first showed it for language. There was then some work late in 2020 where we showed the same thing for other modalities, like images, video, text to image, image to text, and math: they all had the same pattern. And you're right, now there are other stages, like post-training, and there are new types of reasoning models. In all of the cases that we've measured, we see similar types of scaling laws.
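(For a concrete sense of what "a scaling law" means quantitatively, here is a minimal sketch assuming the commonly cited power-law form, where loss falls smoothly as a power of compute toward an irreducible floor. The exponent and constants are invented for illustration; they are not Anthropic's measured values.)

    # Illustrative only: a toy power-law scaling curve of the kind described above.
    # The exponent and constants are made up, not anyone's measured fit.
    import numpy as np

    def toy_loss(compute, alpha=0.05, c=2.0, irreducible=1.7):
        """Loss that falls as a power law in compute, plus an irreducible floor."""
        return irreducible + c * compute ** (-alpha)

    compute = np.logspace(18, 26, num=9)  # hypothetical training FLOPs
    for flops, loss in zip(compute, toy_loss(compute)):
        # On a log-log plot these points fall on a straight line: the hallmark
        # of the smooth scaling behavior being discussed.
        print(f"{flops:.1e} FLOPs -> loss {loss:.3f}")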
A bit of a philosophical question, but what's your intuition about why bigger is better in terms of network size and data size? Why does it lead to more intelligent models?

So, in my previous career as a biophysicist (I did a physics undergrad and then biophysics in grad school), I think back to what I know as a physicist, which is actually much less than what some of my colleagues at Anthropic have in terms of expertise in physics. There's this concept of 1/f noise and 1/x distributions, where often, just as adding up a bunch of natural processes gives you a Gaussian, if you add up a bunch of differently distributed natural processes, say you take a probe and hook it up to a resistor, the distribution of the thermal noise in the resistor goes as one over the frequency. It's some kind of natural convergent distribution. I think what it amounts to is that if you look at a lot of things that are produced by some natural process with a lot of different scales, not a Gaussian, which is narrowly distributed, you get this decaying 1/x distribution, like the large and small fluctuations that lead to electrical noise.

So now I think of patterns in the physical world, or in language. If I think about the patterns in language, there are some really simple patterns: some words are much more common than others, like "the." Then there's basic noun-verb structure. Then there's the fact that nouns and verbs have to agree, they have to coordinate, and there's the higher-level sentence structure. Then there's the thematic structure of paragraphs. Given that there's this hierarchy of structure, you can imagine that as you make the networks larger, first they capture the really simple correlations, the really simple patterns, and then there's this long tail of other patterns. If that long tail of other patterns is really smooth, like it is with 1/f noise in physical processes like resistors, then you can imagine that as you make the network larger, it's capturing more and more of that distribution, and that smoothness gets reflected in how well the models predict, in how well they perform.

Language is an evolved process. We've developed language: we have common words and less common words, common expressions and less common expressions, cliches that are expressed frequently, and novel ideas. That process has developed, has evolved, with humans over millions of years. So the guess, and this is pure speculation, would be that there is some kind of long-tail distribution of these ideas.
So there's the long tail, but also there's the height of the hierarchy of concepts that you're building up. So the bigger the network, presumably you have a higher capacity to...

Exactly. If you have a small network, you only get the common stuff. If I take a tiny neural network, it's very good at understanding that a sentence has to have verb, adjective, noun, but it's terrible at deciding what those verb, adjective, and noun should be and whether they should make sense. If I make it just a little bigger, it gets good at that. Then suddenly it's good at the sentences, but it's not good at the paragraphs. These rarer and more complex patterns get picked up as I add more capacity to the network.

Well, the natural question then is: what's the ceiling of this? How complicated and complex is the real world? How much of this stuff is there to learn?

I don't think any of us knows the answer to that question. My strong instinct would be that there's no ceiling below the level of humans. We humans are able to understand these various patterns, and that makes me think that if we continue to scale up these models, to develop new methods for training them and scaling them up, we will at least get to the level that we've gotten to with humans. There's then a question of how much more is it possible to understand than humans do, how much is it possible to be smarter and more perceptive than humans? I would guess the answer has got to be domain-dependent. If I look at an area like biology, and I wrote about this in the essay Machines of Loving Grace, it seems to me that humans are struggling to understand the complexity of biology. If you go to Stanford or to Harvard or to Berkeley, you have whole departments of folks trying to study, say, the immune system or metabolic pathways, and each person understands only a tiny part of it, specializes, and they're struggling to combine their knowledge with that of other humans. So I have an instinct that there's a lot of room at the top for AIs to get smarter. If I think of something like materials in the physical world, or addressing conflicts between humans, it may be that some of these problems are not intractable, but much harder, and it may be that there's only so well you can do at some of these things. Just like with speech recognition, there's only so clearly I can hear your speech. So I think in some areas there may be ceilings that are very close to what humans have done, and in other areas those ceilings may be very far away. I think we'll only find out when we build these systems. It's very hard to know in advance. We can speculate, but we can't be sure.

And in some domains, the ceiling might have to do with human bureaucracies and things like this, as you write about.

Yes.

So humans fundamentally have to be part of the loop. That's the cause of the ceiling, not necessarily the limits of the intelligence.
Yeah. I think in many cases, in theory, technology could change very fast, for example all the things that we might invent with respect to biology. But remember, there's a clinical trial system that we have to go through to actually administer these things to humans. I think that's a mixture of things that are unnecessary and bureaucratic and things that protect the integrity of society, and the whole challenge is that it's hard to tell which is which. My view is definitely that, in terms of drug development, we're too slow and too conservative. But certainly, if you get these things wrong, it's possible to risk people's lives by being too reckless. So at least some of these human institutions are in fact protecting people. It's all about finding the balance. I strongly suspect that balance is more on the side of pushing to make things happen faster, but there is a balance.
If we do hit a limit, if we do hit a slowdown in the scaling laws, what do you think would be the reason? Is it compute-limited, data-limited, or something else, idea-limited?

So, a few things. Now we're talking about hitting the limit before we get to the level and skill of humans. One limit that's popular today, and that could be a limit we run into (like most of these limits, I would bet against it, but it's definitely possible), is that we simply run out of data. There's only so much data on the internet, and there are issues with the quality of the data. You can get hundreds of trillions of words on the internet, but a lot of it is repetitive, or it's search engine optimization drivel, or maybe in the future it'll even be text generated by AIs itself. So I think there are limits to what can be produced in this way. That said, we, and I would guess other companies, are working on ways to make data synthetic, where you can use the model to generate more data of the type that you already have, or even generate data from scratch. If you think about what was done with DeepMind's AlphaGo Zero, they managed to get a bot all the way from no ability to play Go whatsoever to above human level, just by playing against itself. There was no example data from humans required in the AlphaGo Zero version of it. The other direction, of course, is these reasoning models that do chain of thought and stop to think and reflect on their own thinking in a way. That's another kind of synthetic data, coupled with reinforcement learning. So my guess is that with one of those methods we'll get around the data limitation, or there may be other sources of data that are available.
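(As a rough illustration of the self-play idea behind AlphaGo Zero that Dario references, generating training data with no human examples at all, here is a minimal runnable sketch. The "model" is just a random policy over a toy Nim game; everything here is a stand-in, not DeepMind's or Anthropic's actual training code.)

    import random

    # Minimal sketch of self-play data generation: a policy plays a toy game
    # (3-heap Nim) against itself, and the resulting games become training data.
    def legal_moves(heaps):
        return [(i, take) for i, h in enumerate(heaps) for take in range(1, h + 1)]

    def play_game(policy):
        heaps, player, history = [3, 4, 5], 0, []
        while any(heaps):
            move = policy(heaps)
            history.append((tuple(heaps), move, player))
            i, take = move
            heaps[i] -= take
            player = 1 - player
        winner = 1 - player  # the player who took the last object wins
        return history, winner

    def random_policy(heaps):
        return random.choice(legal_moves(heaps))

    # Purely synthetic training data: no human games needed.
    dataset = [play_game(random_policy) for _ in range(100)]
    print("generated", sum(len(h) for h, _ in dataset), "state-action examples")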
We could also just observe that, even if there's no problem with data, as we start to scale models up, they just stop getting better. It has seemed to be a reliable observation that they've gotten better, but that could just stop at some point, for a reason we don't understand. The answer could be that we need to invent some new architecture. There have been problems in the past with, say, numerical stability of models, where it looked like things were leveling off, but when we found the right fix, it turned out they weren't. So perhaps there's some new optimization method or some new technique we need in order to unblock things. I've seen no evidence of that so far, but if things were to slow down, that perhaps could be one reason.

What about the limits of compute, meaning the expensive nature of building bigger and bigger data centers?

So right now, I think most of the frontier model companies, I would guess, are operating at roughly a $1 billion scale, plus or minus a factor of three. Those are the models that exist now or are being trained now. I think next year we're going to go to a few billion, and then in 2026 we may go above $10 billion, and probably by 2027 there are ambitions to build hundred-billion-dollar clusters. I think all of that actually will happen. There's a lot of determination to build the compute, to do it within this country, and I would guess that it actually does happen. Now, if we get to a hundred billion and that's still not enough compute, still not enough scale, then either we need even more scale, or we need to develop some way of doing it more efficiently, of shifting the curve.

I think, between all of these, one of the reasons I'm bullish about powerful AI happening so fast is just that if you extrapolate the next few points on the curve, we're very quickly getting towards human-level ability. Some of the new models that we developed, and some reasoning models that have come from other companies, are starting to get to what I would call the PhD or professional level. If you look at their coding ability: the latest model we released, Sonnet 3.5, the new or updated version, gets something like 50% on SWE-bench, and SWE-bench is a set of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was 3 or 4%. So in ten months we've gone from 3% to 50% on this task, and I think in another year we'll probably be at 90%. I mean, I don't know, it might even be less than that. We've seen similar things in graduate-level math, physics, and biology from models like OpenAI's o1. So if we just continue to extrapolate this in terms of skill, I think that within a few years we will get to these models being above the highest professional level in terms of humans. Now, will that curve continue? You've pointed to, and I've pointed to, a lot of possible reasons why that might not happen. But if the extrapolation curve continues, that is the trajectory we're on.
So Anthropic has several competitors. It'd be interesting to get your view of it all: OpenAI, Google, xAI, Meta. What does it take to win, in the broad sense of win, in this space?

Yeah. So I want to separate out a couple of things. Anthropic's mission is to try to make this all go well, and we have a theory of change called race to the top. Race to the top is about trying to push the other players to do the right thing by setting an example. It's not about being the good guy; it's about setting things up so that all of us can be the good guy.

I'll give a few examples of this. Early in the history of Anthropic, one of our co-founders, Chris Olah, who I believe you're interviewing soon, is the co-founder of the field of mechanistic interpretability, which is an attempt to understand what's going on inside AI models. We had him and one of our early teams focus on this area of interpretability, which we think is good for making models safe and transparent. For three or four years, that had no commercial application whatsoever. It still doesn't today; we're doing some early betas with it, and probably it will eventually, but this is a very long research bet, and one in which we've built in public and shared our results publicly. We did this because we think it's a way to make models safer. And the interesting thing is that as we've done this, other companies have started doing it as well, in some cases because they've been inspired by it, and in some cases because they're worried that if other companies doing this look more responsible, they want to look more responsible too. No one wants to look like the irresponsible actor, and so they adopt it as well. When folks come to Anthropic, interpretability is often a draw, and I tell them: the other places you didn't go, tell them why you came here. And then soon you see that there are interpretability teams elsewhere as well. In a way that takes away our competitive advantage, because now others are doing it too, but it's good for the broader system, and so we have to invent some new thing that we're doing that others aren't. The hope is basically to bid up the importance of doing the right thing. And it's not about us in particular; it's not about having one particular good guy. Other companies can do this as well. If they join the race to do this, that's the best news ever. It's about shaping the incentives to point upward instead of shaping the incentives to point downward.

And we should say this example, the field of mechanistic interpretability, is just a rigorous, non-hand-wavy way of doing AI safety, or it's tending that way.
Trying to. I mean, I think we're still early in terms of our ability to see things, but I've been surprised at how much we've been able to look inside these systems and understand what we see. Unlike with the scaling laws, where it feels like there's some law driving these models to perform better, on the inside there's no reason why the models should be designed for us to understand them. They're designed to operate, they're designed to work, just like the human brain or human biochemistry. They're not designed for a human to open up the hatch, look inside, and understand them. But we have found, and you can talk in much more detail about this with Chris, that when we open them up, when we do look inside them, we find things that are surprisingly interesting.

And as a side effect, you also get to see the beauty of these models. You get to explore the beautiful nature of large neural networks through the mech interp kind of methodology.

I'm amazed at how clean it's been. I'm amazed at things like induction heads. I'm amazed at things like the fact that we can use sparse autoencoders to find these directions within the networks, and that the directions correspond to these very clear concepts. We demonstrated this a bit with Golden Gate Claude. This was an experiment where we found a direction inside one of the neural network's layers that corresponded to the Golden Gate Bridge, and we just turned that way up. We released the model as a demo, kind of half a joke, for a couple of days, but it was illustrative of the method we developed. You could take the model and ask it about anything. You could say, how was your day? And anything you asked, because this feature was activated, it would connect to the Golden Gate Bridge. It would say, I'm feeling relaxed and expansive, much like the arches of the Golden Gate Bridge, or it would masterfully change topic to the Golden Gate Bridge and integrate it.
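(The mechanism described here, finding a direction in a layer's activations that corresponds to a concept and turning it up, can be sketched in a few lines. This is a generic activation-steering illustration with made-up shapes and values, not Anthropic's actual sparse-autoencoder pipeline.)

    import numpy as np

    # Toy illustration of activation steering: take a hidden activation, find
    # its component along a "feature direction" (e.g. one found by a sparse
    # autoencoder), and clamp that component to a large value.
    hidden_size = 8
    rng = np.random.default_rng(0)

    feature_direction = rng.normal(size=hidden_size)
    feature_direction /= np.linalg.norm(feature_direction)  # unit "Golden Gate" direction

    activation = rng.normal(size=hidden_size)  # a residual-stream activation at some layer

    def steer(activation, direction, strength):
        """Clamp the activation's component along `direction` to `strength`."""
        current = activation @ direction
        return activation + (strength - current) * direction

    steered = steer(activation, feature_direction, strength=10.0)
    print("before:", activation @ feature_direction)
    print("after: ", steered @ feature_direction)  # now 10, so the concept dominates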
There was also a sadness to it, to the focus it had on the Golden Gate Bridge. I think people quickly fell in love with it. I think people already miss it, because it was taken down, I think, after a day. Somehow these interventions on the model, where you adjust its behavior, somehow emotionally made it seem more human than any other version of the model.

It's a strong personality, a strong identity.

It has a strong personality. It has these kind of obsessive interests. We can all think of someone who's obsessed with something, so it does make it feel somehow a bit more human.

Let's talk about the present. Let's talk about Claude. So this year, a lot has happened. In March, Claude 3 Opus, Sonnet, and Haiku were released, then Claude 3.5 Sonnet in June, with an updated version just now released, and then also Claude 3.5 Haiku was released. Okay, can you explain the difference between Opus, Sonnet, and Haiku, and how we should think about the different versions?

Yeah. So let's go back to March, when we first released these three models. Our thinking was: different companies produce large and small models, better and worse models. We felt that there was demand both for a really powerful model, which might be a little bit slower and that you'd have to pay more for, and also for fast, cheap models that are as smart as they can be for how fast and cheap they are. Whenever you want to do some kind of difficult analysis, like if I want to write code, for instance, or brainstorm ideas, or do creative writing, I want the really powerful model. But then there are a lot of practical applications, in a business sense, where it's: I'm interacting with a website, I'm doing my taxes, or I'm talking to a legal advisor and I want to analyze a contract. And we have plenty of companies that just want to do autocomplete in an IDE or something. For all of those things, you want the model to act fast and you want to use it very broadly. So we wanted to serve that whole spectrum of needs.

We ended up with this poetry theme. What's a really short poem? A haiku. So Haiku is the small, fast, cheap model that was, at the time it was released, surprisingly intelligent for how fast and cheap it was. A sonnet is a medium-sized poem, a couple of paragraphs, so Sonnet was the middle model: smarter, but also a little bit slower and a little bit more expensive. And an opus, like a magnum opus, is a large work, so Opus was the largest, smartest model at the time. That was the original thinking behind it.
Our thinking then was, well, each new generation of models should shift that trade-off curve. So when we released Sonnet 3.5, it has roughly the same cost and speed as the Sonnet 3 model, but it increased its intelligence to the point where it was smarter than the original Opus 3 model, especially for code, but also just in general. And now we've shown results for Haiku 3.5, and I believe Haiku 3.5, the smallest new model, is about as good as Opus 3, the largest old model. So basically the aim is to shift the curve, and then at some point there's going to be an Opus 3.5.

Now, every new generation of models has its own thing. They use new data, their personality changes in ways that we try to steer but are not fully able to steer. So there's never quite an exact equivalence where the only thing you're changing is intelligence. We always try to improve other things, and some things change without us knowing or measuring them. So it's very much an inexact science. In many ways, the manner and personality of these models is more an art than it is a science.

So what is the reason for the span of time between, say, Claude Opus 3.0 and 3.5? What takes that time, if you can speak to it?
Yeah. So there are different processes. There's pre-training, which is just the normal language model training, and that takes a very long time. These days that uses tens of thousands, sometimes many tens of thousands, of GPUs or TPUs or Trainium (we use different platforms, but accelerator chips), often training for months. There's then a post-training phase where we do reinforcement learning from human feedback, as well as other kinds of reinforcement learning; that phase is getting larger and larger now, and often that's less of an exact science. It takes effort to get it right. Models are then tested with some of our early partners to see how good they are, and they're then tested both internally and externally for their safety, particularly for catastrophic and autonomy risks. We do internal testing according to our responsible scaling policy, which I could talk about in more detail, and then we have an agreement with the U.S. and UK AI Safety Institutes, as well as other third-party testers in specific domains, to test the models for what are called CBRN risks: chemical, biological, radiological, and nuclear. We don't think that models pose these risks seriously yet, but we want to evaluate every new model to see whether we're starting to get close to some of these more dangerous capabilities. So those are the phases. And then it just takes some time to get the model working in terms of inference and launching it in the API. There are just a lot of steps to actually making a model work.

Of course, we're always trying to make these processes as streamlined as possible. We want our safety testing to be rigorous, but we want it to be automatic, to happen as fast as it can without compromising on rigor. Same with our pre-training process and our post-training process. It's just like building anything else, just like building airplanes: you want to make them safe, but you want to make the process streamlined. And I think the creative tension between those is an important thing in making the models work.
Yeah. Rumor on the street, I forget who was saying it, is that Anthropic has really good tooling. So probably a lot of the challenge here is on the software engineering side, building the tooling to have an efficient, low-friction interaction with the infrastructure.

You would be surprised how much of the challenge of building these models comes down to software engineering and performance engineering. From the outside, you might think, oh man, we had this eureka breakthrough, the movie version of science: we discovered it, we figured it out. But I think all things, even incredible discoveries, almost always come down to the details, and often super boring details. I can't speak to whether we have better tooling than other companies; I haven't been at those other companies, at least not recently. But it's certainly something we give a lot of attention to.

I don't know if you can say, but from Claude 3 to Claude 3.5, is there any extra pre-training going on, or is it mostly focused on the post-training? There have been leaps in performance.

Yeah, I think at any given stage we're focused on improving everything at once, just naturally. There are different teams; each team makes progress in a particular area, in making their particular segment of the relay race better. And it's just natural that when we make a new model, we put all of these things in at once.
So the data you have, like the preference data you get from RLHF, is that applicable? Are there ways to apply it to newer models as they get trained up?

Yeah. Preference data from old models sometimes gets used for new models, although of course it performs somewhat better when it's trained on the new models. Note that we have this constitutional AI method, so we don't only use preference data; there's also a post-training process where we train the model against itself. And there are new types of post-training the model against itself that are used every day. So it's not just RLHF; it's a bunch of other methods as well. Post-training, I think, is becoming more and more sophisticated.
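(Constitutional AI, which Dario mentions alongside RLHF, works roughly by having the model critique and revise its own outputs against written principles and then training on the revisions. Below is a minimal sketch of that loop under that rough description; the principle text and the stub model are invented placeholders, not Anthropic's actual constitution or API.)

    # Sketch of a constitutional-AI-style self-critique loop: generate, critique
    # against a principle, revise, and keep the (prompt, revision) pair as
    # training data. The `model` here is a stand-in callable, not real Claude.
    PRINCIPLES = [
        "Choose the response that is most helpful while avoiding harmful content.",
    ]

    def constitutional_pass(model, prompt):
        draft = model(f"Answer the question: {prompt}")
        critique = model(
            f"Critique this answer against the principle: {PRINCIPLES[0]}\n"
            f"Question: {prompt}\nAnswer: {draft}"
        )
        revision = model(
            f"Rewrite the answer to address the critique.\n"
            f"Answer: {draft}\nCritique: {critique}"
        )
        return (prompt, revision)  # becomes supervised / RL training data

    # Stub model so the sketch runs end to end; a real system would call an LLM.
    echo_model = lambda text: f"[model output for: {text[:40]}...]"
    print(constitutional_pass(echo_model, "How do neural networks learn?"))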
Well, what explains the big leap in performance for the new Sonnet 3.5? At least on the programming side. And maybe this is a good place to talk about benchmarks: what does it mean to get better? The number went up, but I program, I also love programming, and Claude 3.5 through Cursor is what I use to assist me in programming. And at least experientially, anecdotally, it's gotten smarter at programming. So what does it take to get it smarter?

We observed that as well, by the way. There were a couple of very strong engineers here at Anthropic for whom all previous code models, both produced by us and produced by all the other companies, hadn't really been useful. They said, maybe this is useful to a beginner, but it's not useful to me. But with Sonnet 3.5, the original one, for the first time they said, oh my God, this helped me with something that would have taken me hours to do. This is the first model that's actually saved me time. So again, the waterline is rising. And then I think the new Sonnet has been even better. In terms of what it takes, I'll just say it's been across the board: it's in the pre-training, it's in the post-training, it's in various evaluations that we do.

We've observed this as well. And if we go into the details of the benchmark: SWE-bench is basically, since you're a programmer, you'll be familiar with pull requests, and a pull request is a sort of atomic unit of work. You could say, I'm implementing one thing. SWE-bench actually gives you a real-world situation where the codebase is in its current state and I'm trying to implement something that's described in language. We have internal benchmarks where we measure the same thing: you give the model free rein to do anything, run anything, edit anything, and ask how well it is able to complete these tasks. It's that benchmark that's gone from "it can do it 3% of the time" to "it can do it about 50% of the time." You can game benchmarks, but I think if we get to 100% on that benchmark, in a way that isn't overtrained or gamed for that particular benchmark, it probably represents a real and serious increase in programming ability. And I would suspect that if we can get to 90 or 95%, it will represent the ability to autonomously do a significant fraction of software engineering tasks.
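(For a concrete sense of what a SWE-bench-style number like "50%" measures: the harness roughly gives the model a repository and an issue description, lets it propose edits, and counts the task as solved only if the project's held-out tests pass. The sketch below is a generic stand-in with invented task contents, not the actual SWE-bench code.)

    from dataclasses import dataclass
    from typing import Callable

    # Rough sketch of how an agentic coding score is computed: each task pairs an
    # issue description with a function that applies a patch and runs hidden tests.
    @dataclass
    class Task:
        issue: str                              # natural-language description of the change
        apply_and_test: Callable[[str], bool]   # applies a patch, runs the hidden tests

    def evaluate(agent, tasks):
        solved = 0
        for task in tasks:
            patch = agent(task.issue)        # agent may read/edit/run code to produce this
            if task.apply_and_test(patch):   # tests pass -> task counts as resolved
                solved += 1
        return solved / len(tasks)

    # Toy example: one task whose "tests" just check the patch mentions the right function.
    toy_tasks = [Task("Fix off-by-one in pagination()", lambda p: "pagination" in p)]
    print("resolved:", evaluate(lambda issue: f"patch touching {issue.split()[-1]}", toy_tasks))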
Well, ridiculous timeline question: when is Claude 3.5 Opus coming out?

Not giving you an exact date, but as far as we know, the plan is still to have a Claude 3.5 Opus.

Are we going to get it before GTA VI or no? Like Duke Nukem Forever.

What was that game? There was some game that was delayed 15 years. Was that Duke Nukem Forever?

Yeah. And I think GTA is now just releasing trailers.

It's only been three months since we released the first Sonnet.

Yeah, it's incredible. The incredible pace.

It just tells you about the pace, the expectations for when things are going to come out.
So what about 4.0? How do you think about versioning as these models get bigger and bigger, and also versioning in general? Why "Sonnet 3.5 updated" with a date? Why not Sonnet 3.6?

Naming is actually an interesting challenge here, right? Because I think a year ago, most of the model was pre-training, and so you could start from the beginning and just say, okay, we're going to have models of different sizes, we're going to train them all together, we'll have a family with a naming scheme, then we'll put some new magic into them, and then we'll have the next generation. The trouble starts when some of them take a lot longer than others to train; that already messes up your timing a little bit. And as you make big improvements in pre-training, you suddenly notice, oh, I can make a better pre-trained model, and that doesn't take very long to do, but clearly it has the same size and shape as previous models. So I think those two things together, as well as the timing issues, mean that any kind of scheme you come up with, the reality tends to frustrate that scheme; it tends to break out of the scheme. It's not like software, where you can say, oh, this is 3.7, this is 3.8. No, you have models with different trade-offs. You can change some things in your models, you can change other things. Some are faster and slower at inference, some have to be more expensive, some have to be less expensive. So I think all the companies have struggled with this. I think we were in a good position in terms of naming when we had Haiku, Sonnet, and Opus. It was a great start. We're trying to maintain it, but it's not perfect, so we'll try to get back to the simplicity. But it's just the nature of the field; I feel like no one's figured out naming. It's somehow a different paradigm from normal software, and none of the companies have been perfect at it. It's something we struggle with surprisingly much, relative to how trivial it is compared to the grand science of training the models.
So from the user side, the user experience of the updated Sonnet 3.5 is just different than the previous June 2024 Sonnet 3.5. It would be nice to come up with some kind of labeling that embodies that, because people talk about Sonnet 3.5, but now there's a different one, and so how do you refer to the previous one and the new one? When there's a distinct improvement, it just makes conversation about it challenging.

Yeah. I definitely think this question of there being lots of properties of the models that are not reflected in the benchmarks, I think that's definitely the case, and everyone agrees, and not all of them are capabilities. Models can be polite or brusque. They can be very reactive, or they can ask you questions. They can have what feels like a warm personality or a cold personality. They can be boring, or they can be very distinctive, like Golden Gate Claude was. We have a whole team focused on, I think we call it, Claude character. Amanda leads that team, and we'll talk to you about that. But it's still a very inexact science, and often we find that models have properties that we're not aware of. The fact of the matter is that you can talk to a model 10,000 times and there are some behaviors you might not see, just like with a human: I can know someone for a few months and not know that they have a certain skill, or not know that there's a certain side to them. So I think we just have to get used to this idea, and we're always looking for better ways of testing our models, to demonstrate these capabilities and also to decide which personality properties we want models to have and which we don't. That normative question itself is also super interesting.
I've got to ask you a question from Reddit.

From Reddit? Oh boy.

There's this fascinating, to me at least, psychological, social phenomenon where people report that Claude has gotten dumber for them over time. And so the question is: does the user complaint about the dumbing down of Claude 3.5 Sonnet hold any water? Are these anecdotal reports a kind of social phenomenon, or are there any cases where Claude would get dumber?

So this doesn't just apply to Claude. I believe I've seen these complaints for every foundation model produced by a major company. People said this about GPT-4, they said it about GPT-4 Turbo. So, a couple of things. One, the actual weights of the model, the actual brain of the model, do not change unless we introduce a new model. There are just a number of reasons why it would not make sense practically to be randomly substituting in new versions of the model. It's difficult from an inference perspective, and it's actually hard to control all the consequences of changing the weights of the model. Let's say you wanted to fine-tune the model to, I don't know, say "certainly" less, which an old version of Sonnet used to do. You actually end up changing a hundred things as well. So we have a whole process for modifying the model: we do a bunch of testing on it, we do a bunch of user testing with early customers. We have never changed the weights of the model without telling anyone, and certainly, in the current setup, it would not make sense to do that.

Now, there are a couple of things that we do occasionally do. One is, sometimes we run A/B tests, but those are typically very close to when a model is being released, and for a very small fraction of time. So, the day before the new Sonnet 3.5 came out (and I agree, we should have had a better name; it's clunky to refer to it), there were some comments from people that it had gotten a lot better, and that's because a fraction of users were exposed to an A/B test for those one or two days.
The other is that occasionally the system prompt will change, and the system prompt can have some effects, although it's unlikely to dumb down models, it's unlikely to make them dumber. And we've seen that while these two things, which I'm listing to be very complete, happen quite infrequently, the complaints, for us and for other model companies, that the model changed, the model isn't good at this, the model got more censored, the model was dumbed down, those complaints are constant. So I don't want to say people are imagining it or anything, but the models are, for the most part, not changing. If I were to offer a theory, I think it actually relates to one of the things I said before, which is that models are very complex and have many aspects to them. So often, if I ask the model a question, like "do task X" versus "can you do task X," the model might respond in different ways. There are all kinds of subtle things that you can change about the way you interact with the model that can give you very different results. To be clear, this itself is a failing by us and by the other model providers: the models are just often sensitive to small changes in wording. It's yet another way in which the science of how these models work is very poorly developed. So if I go to sleep one night having talked to the model in a certain way, and then I slightly change the phrasing of how I talk to it, I could get different results. That's one possible explanation. The other thing is that it's just hard to quantify this stuff. I think people are very excited by new models when they come out, and then as time goes on they become very aware of the limitations. So that may be another effect. But that's all a very long-winded way of saying that, for the most part, with some fairly narrow exceptions, the models are not changing.

I think there is a psychological effect: you just start getting used to it, the baseline rises. Like when people first got Wi-Fi on airplanes, it's amazing, magic.

It's amazing.

And then you start: I can't get this thing to work, this is such a piece of crap.

Exactly. So then it's easy to have the conspiracy theory that they're making the Wi-Fi slower and slower.
This is probably something I'll talk to Amanda much more about, but another Reddit question: "When will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic?" So these are reports about the experience, from a different angle: the frustration has to do with the character.

Yeah, so a couple of points on this. The first one is, things that people say on Reddit and Twitter, or X, or whatever it is: there's actually a huge distribution shift between the stuff that people complain loudly about on social media and what users actually, statistically, care about and what drives people to use the models. People are frustrated with things like the model not writing out all the code, or the model just not being as good at code as it could be, even though it's the best model in the world on code. I think the majority of things are about that. But certainly a vocal minority raise these concerns: they're frustrated by the model refusing things that it shouldn't refuse, or apologizing too much, or just having these annoying verbal tics.

The second caveat, and I just want to say this super clearly, because I think some people don't know it and others kind of know it but forget it: it is very difficult to control across the board how the models behave. You cannot just reach in there and say, I want the model to apologize less. You can do that, you can include training data that says the model should apologize less, but then in some other situation it ends up being super rude, or overconfident in a way that's misleading to people. So there are all these trade-offs.
For example, there was a period during which models, ours and I think others as well, were too verbose: they would repeat themselves, they would say too much. You can cut down on the verbosity by penalizing the models for talking for too long. What happens when you do that, if you do it in a crude way, is that when the models are coding, sometimes they'll say "rest of the code goes here," because they've learned that that's a way to economize, and they see it. That leads the model to be so-called lazy in coding, where it's just like, ah, you can finish the rest of it. It's not because we want to save on compute, or because the models are lazy during winter break, or any of the other conspiracy theories that have come up. It's just very hard to control the behavior of the model, to steer the behavior of the model, in all circumstances at once.
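(The verbosity trade-off just described can be made concrete: if post-training naively subtracts a per-token penalty from the reward, the model can score higher by truncating with "rest of the code goes here" than by writing the full answer. A toy reward function with invented numbers shows the failure mode; it is an illustration, not Anthropic's actual reward.)

    # Toy illustration of a crude length penalty: subtracting a per-token cost
    # from the reward can make truncation score better than a complete answer.
    # All numbers are invented.
    def reward(quality_score, num_tokens, length_penalty):
        return quality_score - length_penalty * num_tokens

    full_answer = reward(quality_score=9.0, num_tokens=800, length_penalty=0.01)  # = 1.0
    lazy_answer = reward(quality_score=6.0, num_tokens=150, length_penalty=0.01)  # = 4.5

    print(f"full answer reward: {full_answer:.1f}")
    print(f"lazy 'rest of the code goes here' reward: {lazy_answer:.1f}")
    # The lazy answer wins under this reward, which is exactly the unintended
    # behavior a crude verbosity penalty can train into the model.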
There's this whack-a-mole aspect where you push on one thing, and these other things start to move as well, things you may not even notice or measure. And one of the reasons I care so much about the grand alignment of these AI systems in the future is that these systems are actually quite unpredictable. They're actually quite hard to steer and control. This version we're seeing today, where you make one thing better and it makes another thing worse, is a present-day analog of future control problems in AI systems that we can start to study today. I think that difficulty in steering the behavior, in making sure that if we push an AI system in one direction it doesn't get pushed in another direction in some way we didn't want, is an early sign of things to come. And if we can do a good job of solving this problem: you ask the model to make and distribute smallpox and it says no, but it's willing to help you in your graduate-level virology class. How do we get both of those things at once? It's hard. It's very easy to go to one side or the other, and it's a multi-dimensional problem. So these questions of shaping the model's personality, I think they're very hard. I think we haven't done perfectly on them. I think we've actually done the best of all the AI companies, but we're still far from perfect. And I think if we can get this right, if we can control the false positives and false negatives in this very controlled, present-day environment, we'll be much better at doing it for the future, where our worry is: will the models be super autonomous? Will they be able to make very dangerous things? Will they be able to autonomously build whole companies, and are those companies aligned? So I think of this present task as both vexing, but also good practice for the future.
What's the current best way of gathering user feedback? Not anecdotal data, but large-scale data about pain points, or the opposite of pain points, positive things. Is it internal testing? Is it specific group testing, A/B testing? What works?

So typically we'll have internal model bashings, where all of Anthropic (Anthropic is almost a thousand people) just tries to break the model; people interact with it in various ways. We have a suite of evals for things like, is the model refusing in ways that it shouldn't? I think we even had a "certainly" eval, because, again, at one point a model had this annoying tic where it would respond to a wide range of questions by saying, "Certainly, I can help you with that. Certainly, I would be happy to do that. Certainly, this is correct." So we had a "certainly" eval, which is: how often does the model say "certainly"? But look, this is just whack-a-mole. What if it switches from "certainly" to "definitely"? So every time we add a new eval, and we're always evaluating for all the old things, so we have hundreds of these evaluations.
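(An eval like the "certainly" eval described here can be as simple as counting how often a tic phrase appears in responses to a fixed prompt set. Below is a minimal sketch with a stub model and made-up prompts so it runs end to end; the real eval would query the actual model under test.)

    # Minimal sketch of a "certainly eval": measure how often responses contain a
    # verbal tic across a fixed prompt set. The model here is a stub.
    PROMPTS = ["Summarize this paragraph.", "Can you fix this bug?", "What is 2+2?"]
    TIC = "certainly"

    def tic_rate(model, prompts, tic=TIC):
        hits = sum(tic in model(p).lower() for p in prompts)
        return hits / len(prompts)

    stub_model = lambda prompt: "Certainly! Here is the answer..." if "?" in prompt else "Here it is."
    print(f"'{TIC}' rate: {tic_rate(stub_model, PROMPTS):.0%}")
    # Tracking this rate across model versions catches regressions, but as noted
    # above it's whack-a-mole: the tic can simply move to a word like "definitely".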
But we find that there's no substitute for a human interacting with the model. So it's very much like the ordinary product development process. We have hundreds of people within Anthropic bash the model. Then we do external A/B tests. Sometimes we'll run tests with contractors; we pay contractors to interact with the model. You put all of these things together, and it's still not perfect. You still see behaviors that you don't quite want to see; you still see the model refusing things that it just doesn't make sense to refuse. But I think trying to solve this challenge, trying to stop the model from doing genuinely bad things that everyone agrees it shouldn't do (everyone agrees the model shouldn't talk about, I don't know, child abuse material), while at the same time not having it refuse in these dumb and stupid ways: drawing that line as finely as possible, approaching perfection, is still a challenge, and we're getting better at it every day, but there's a lot to be solved. And again, I would point to that as an indicator of a challenge ahead in terms of steering much more powerful models.

Do you think Claude 4.0 is ever coming out?
I don't want to commit to any naming scheme. Cause if I say, if I say here,
we're going to have Claude 4.0 next year and then, and then, you know, then we decide that like,
you know, we should start over cause there's a new type of model. Like I, I, I, I don't want to,
I don't want to commit to it. I would expect in a normal course of business that Claude 4.0
would come after Claude 3.5, but, but you know, you know, you know, you never know in this wacky
field, right? But the sort of this idea of scaling is continuing. Scaling, scaling is continuing.
There, there will definitely be more powerful models coming from us than the models that exist today.
That is, that is certain. Or if there, if there aren't, we've, we've deeply failed as a company.
Okay. Can you explain the responsible scaling policy and the AI safety level standards, ASL levels?
As much as I'm excited about the benefits of these models, and we'll talk about that if we talk about Machines of Loving Grace, I'm worried about the risks and I continue to be worried about the risks. No one should think that Machines of Loving Grace was me saying I'm no longer worried about the risks of these models. I think they're two sides of the same coin. The power of the models and their ability to solve all these problems in biology, neuroscience, economic development, governance and peace, large parts of the economy, those come with risks as well, right? With great power comes great responsibility. The two are paired. Things that are powerful can do good things and they can do bad things. I think of those risks as being in several different categories. Perhaps the two biggest risks that I think about, and that's not to say there aren't risks today that are important, but when I think of the things that would happen on the grandest scale: one is what I call catastrophic misuse. These are misuses of the models in domains like cyber, bio, radiological, nuclear, right? Things that could harm or even kill thousands, even millions of people if they really go wrong. These are the number one priority to prevent. And here I would just make a simple observation, which is that if I look today at people who have done really bad things in the world, I think humanity has actually been protected by the fact that the overlap between really smart, well-educated people and people who want to do really horrific things has generally been small. Let's say I'm someone who has a PhD in this field, I have a well-paying job, there's so much to lose. Why would such a person, even assuming I'm completely evil, which most people are not, risk their life, risk their legacy, their reputation, to do something truly evil? If we had a lot more people like that, the world would be a much more dangerous place. And so my worry is that by being a much more intelligent agent, AI could break that correlation. So I do have serious worries about that. I believe we can prevent those worries, but as a counterpoint to Machines of Loving Grace, I want to say that there are still serious risks. And the second range of risks would be the autonomy risks, which is the idea that models might act on their own, particularly as we give them more agency than they've had in the past, particularly as we give them supervision over wider tasks, like writing whole code bases or someday even effectively operating entire companies. They're on a long enough leash: are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone control it. And like I said, there are these early signs that it's hard to perfectly draw the boundary between things the model should do and things the model shouldn't do. If you go to one side, you get things that are annoying and useless, and if you go to the other side, you get other behaviors. If you fix one thing, it creates other problems. We're getting better and better at solving this. I don't think this is an unsolvable problem. I think this is a science, like the safety of airplanes or the safety of cars or the safety of drugs. I don't think there's any big thing we're missing. I just think we need to get better at controlling these models. So these are the two risks I'm worried about, and our responsible scaling plan, which, I recognize, is a very long-winded answer to your question...
I love it. I love it.
Our responsible scaling plan is designed to address these two types of risks. And so every time we develop a new model, we basically test it for its ability to do both of these bad things.
So if I were to back up a little bit, I think we have an interesting dilemma with AI systems, where they're not yet powerful enough to present these catastrophes. I don't know if they'll ever present these catastrophes; it's possible they won't. But the case for worry, the case for risk, is strong enough that we should act now, and they're getting better very, very fast. I testified in the Senate that we might have serious bio risks within two to three years. That was about a year ago. Things have proceeded apace. So we have this thing where it's surprisingly hard to address these risks, because they're not here today. They don't exist. They're like ghosts, but they're coming at us so fast because the models are improving so fast. So how do you deal with something that's not here today, doesn't exist, but is coming at us very fast? The solution we came up with for that, in collaboration with people like the organization METR and Paul Christiano, is: what you need are tests to tell you when the risk is getting close. You need an early warning system. And so every time we have a new model, we test it for its capability to do these CBRN tasks, as well as testing how capable it is of doing tasks autonomously on its own. In the latest version of our RSP, which we released in the last month or two, the way we test autonomy risks is the AI model's ability to do aspects of AI research itself, because when AI models can do AI research, they become kind of truly autonomous. And that threshold is important in a bunch of other ways.
And so what do we then do with these tests? The RSP basically develops what we've called an if-then structure, which is: if the models pass a certain capability, then we impose a certain set of safety and security requirements on them. So today's models are what's called ASL-2 models. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse. For example, a chess-playing bot: Deep Blue would be ASL-1. It's just manifestly the case that you can't use Deep Blue for anything other than chess. It was just designed for chess. No one's going to use it to conduct a masterful cyberattack or to run wild and take over the world. ASL-2 is today's AI systems, where we've measured them and we think these systems are simply not smart enough to autonomously self-replicate or conduct a bunch of tasks, and also not smart enough to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond what can be found by looking at Google. In fact, sometimes they do provide information, but not above and beyond a search engine, not in a way that can be stitched together, not in a way that end-to-end is dangerous enough. ASL-3 is going to be the point at which the models are helpful enough to enhance the capabilities of non-state actors. State actors can unfortunately already do a lot of these very dangerous and destructive things to a high level of proficiency. The difference is that non-state actors are not capable of it. And so when we get to ASL-3, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors, and misuse of the model as it's deployed. We'll have to have enhanced filters targeted at these particular areas.
Cyber, bio, nuclear.
Cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more a risk of the model doing bad things itself. ASL-4 is getting to the point where these models could enhance the capability of an already knowledgeable state actor, or become the main source of such a risk. Like, if you wanted to engage in such a risk, the main way you would do it is through a model. And then I think ASL-4 on the autonomy side is some amount of acceleration in AI research capabilities with an AI model. And then ASL-5 is where we would get to models that are truly capable, that could exceed humanity in their ability to do any of these tasks.
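To make the if-then idea concrete, here is a minimal sketch of what a capability-triggered policy could look like in code; the thresholds, field names, and requirement text are invented for illustration and are not the actual values or wording in Anthropic's RSP:

```python
from dataclasses import dataclass

# Hypothetical "if-then" capability policy: eval scores trigger a required
# safety tier. All numbers and requirement strings here are made up.

@dataclass
class EvalResult:
    cbrn_uplift: float     # measured uplift on a CBRN task suite, 0..1
    autonomy_score: float  # measured score on autonomous / AI-R&D tasks, 0..1

def required_safety_level(result: EvalResult) -> str:
    if result.cbrn_uplift >= 0.5 or result.autonomy_score >= 0.5:
        return "ASL-4: interpretability-based verification, strictest controls"
    if result.cbrn_uplift >= 0.2 or result.autonomy_score >= 0.2:
        return "ASL-3: security against model theft, targeted CBRN/cyber filters"
    return "ASL-2: standard security and deployment practices"

print(required_safety_level(EvalResult(cbrn_uplift=0.3, autonomy_score=0.1)))
# -> ASL-3: security against model theft, targeted CBRN/cyber filters
```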
And so the point of the if-then structure commitment is basically to say, look, I don't know. I've been working with these models for many years, and I've been worried about risk for many years. It's actually kind of dangerous to cry wolf. It's actually kind of dangerous to say "this model is risky" and have people look at it and say "this is manifestly not dangerous." Again, it's that delicate situation: the risk isn't here today, but it's coming at us fast. How do you deal with that? It's really vexing to a risk planner. And so this if-then structure basically says: look, we don't want to antagonize a bunch of people, we don't want to harm our own ability to have a place in the conversation, by imposing these very onerous burdens on models that are not dangerous today. So the if-then trigger commitment is basically a way to deal with that. It says you clamp down hard when you can show the model is dangerous. And of course, what has to come with that is enough of a buffer threshold that you're not at high risk of missing the danger. It's not a perfect framework. We've had to change it; we came out with a new one just a few weeks ago, and probably going forward we might release new ones multiple times a year, because it's hard to get these policies right, technically, organizationally, from a research perspective. But that is the proposal: if-then commitments and triggers, in order to minimize burdens and false alarms now, but really react appropriately when the dangers are here.
What do you think the timeline for ASL-3 is, where several of the triggers are fired? And what do you think the timeline is for ASL-4?
Yeah. So that is hotly debated within the company. We are working actively to prepare ASL-3 security measures as well as ASL-3 deployment measures. I'm not going to go into detail, but we've made a lot of progress on both, and we're prepared to be, I think, ready quite soon. I would not be surprised at all if we hit ASL-3 next year. There was some concern that we might even hit it this year. That's still possible. That could still happen. It's very hard to say, but I would be very, very surprised if it was like 2030. I think it's much sooner than that.
So there are protocols for detecting it, the if-then, and then there are protocols for how to respond to it. How difficult is the latter?
Yeah. I think for ASL-3 it's primarily about security, and about filters on the model relating to a very narrow set of areas when we deploy the model, because at ASL-3 the model isn't autonomous yet. So you don't have to worry about the model itself behaving in a bad way, even when it's deployed internally. So I think the ASL-3 measures are, I won't say straightforward, they're rigorous, but they're easier to reason about. I think once we get to ASL-4, we start to have worries about the models being smart enough that they might sandbag tests. They might not tell the truth about tests. We had some results come out about sleeper agents, and there was a more recent paper about whether the models can mislead attempts to evaluate them, sandbag their own abilities, present themselves as being less capable than they are. And so I think with ASL-4, there's going to be an important component of using things other than just interacting with the models. For example, interpretability or hidden chains of thought, where you have to look inside the model and verify, via some other mechanism that is not as easily corrupted as what the model says, that the model indeed has some property. So we're still working on ASL-4. One of the properties of the RSP is that we don't specify ASL-4 until we've hit ASL-3, and I think that's proven to be a wise decision, because even with ASL-3 it's hard to know this stuff in detail, and we want to take as much time as we can possibly take to get these things right.
So for ASL-3, the bad actor will be the humans.
Humans, yes. And so there's a little bit more... for ASL-4, it's both. I think it's both.
And so deception, and that's where mechanistic interpretability comes into play. And hopefully the techniques used for that are not made accessible to the model.
Yeah. I mean, of course you can hook up the mechanistic interpretability to the model itself, but then you've kind of lost it as a reliable indicator of the model's state. There are a bunch of exotic ways you can think of where it might also not be reliable, like if the model gets smart enough that it can jump across computers and read the code where you're looking at its internal state. We've thought about some of those. I think they're exotic enough that there are ways to render them unlikely. But yeah, generally you want to preserve mechanistic interpretability as a kind of verification set or test set that's separate from the training process of the model.
See, I think as these models become better and better at conversation and become smarter, social engineering becomes a threat too, because they can start being very convincing to the engineers inside companies.
Oh yeah. We've seen lots of examples of demagoguery in our life from humans, and there's a concern that models could do that as well.
One of the ways that Claude has been getting more and more powerful is that it's now able to do some agentic stuff, computer use. There's also the analysis within the sandbox of Claude.ai itself. But let's talk about computer use. It seems to me super exciting that you can just give Claude a task and it takes a bunch of actions, figures it out, and it has access to your computer through screenshots. So can you explain how that works and where that's headed?
Yeah, it's actually relatively simple. So Claude has had for a long time, since Claude 3 back in March, the ability to analyze images and respond to them with text. The only new thing we added is that those images can be screenshots of a computer, and in response, we trained the model to give a location on the screen where you can click and/or buttons on the keyboard you can press in order to take an action. And it turns out that with actually not all that much additional training, the models can get quite good at that task. It's a good example of generalization. People sometimes say that if you get to low Earth orbit, you're halfway to anywhere, right, because of how much it takes to escape the gravity. Well, if you have a strong pre-trained model, I feel like you're halfway to anywhere in terms of the intelligence space. So it actually didn't take all that much to get Claude to do this. And you can just set that in a loop: give the model a screenshot, it tells you what to click on, give it the next screenshot, it tells you what to click on, and that turns into a full, almost video-like interaction with the model.
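A minimal sketch of that loop, assuming hypothetical take_screenshot, query_model, and perform helpers rather than Anthropic's real computer-use API:

```python
import random

# Sketch of the screenshot-to-action loop described above. The three helpers
# are stand-ins: a real system captures the actual screen, calls the model's
# computer-use API, and drives the mouse and keyboard.

def take_screenshot() -> bytes:
    return b"<png bytes>"  # placeholder for a real screen capture

def query_model(task: str, screenshot: bytes) -> dict:
    # Placeholder: the real call sends the screenshot plus the task to the
    # model and gets back an action such as {"type": "click", "x": 412, "y": 83}.
    return random.choice([
        {"type": "click", "x": 412, "y": 83},
        {"type": "type", "text": "hello"},
        {"type": "done"},
    ])

def perform(action: dict) -> None:
    print("performing:", action)  # placeholder for real mouse/keyboard control

def computer_use_loop(task: str, max_steps: int = 20) -> None:
    """Screenshot -> model proposes an action -> execute -> repeat."""
    for _ in range(max_steps):
        shot = take_screenshot()
        action = query_model(task, shot)
        if action["type"] == "done":
            break
        perform(action)

computer_use_loop("fill out the spreadsheet")
```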
And it's able to do all of these tasks, right? We showed these demos where it's able to fill out spreadsheets, it's able to interact with a website, it's able to open all kinds of programs on different operating systems, Windows, Linux, Mac. So I think all of that is very exciting. I will say, while in theory there's nothing you could do there that you couldn't have done by just giving the model the API to drive the computer screen, this really lowers the barrier. There are a lot of folks who either aren't in a position to interact with those APIs or would take a long time to do it. The screen is just a universal interface that's a lot easier to interact with. And so I expect over time this is going to lower a bunch of barriers. Now, honestly, the current model leaves a lot still to be desired, and we were honest about that in the blog, right? It makes mistakes, it misclicks. And we were careful to warn people: hey, you can't just leave this thing to run on your computer for minutes and minutes. You've got to give this thing boundaries and guardrails. And I think that's one of the reasons we released it first in an API form, rather than just handing it to the consumer and giving it control of their computer. But I definitely feel that it's important to get these capabilities out there. As models get more powerful, we're going to have to grapple with how we use these capabilities safely, how we prevent them from being abused. And I think releasing the model while the capabilities are still limited is very helpful in terms of doing that. Since it's been released, a number of customers, I think Replit was maybe one of the quickest to deploy things, have made use of it in various ways. People have hooked up demos for Windows desktops, Macs, Linux machines. So yeah, it's been very exciting. I think, as with anything else, it comes with new exciting abilities, and then with those new exciting abilities we have to think about how to make the model safe, reliable, doing what humans want it to do. It's the same story for everything, right? Same thing. It's that same tension.
But the range of possible use cases here is just incredible. So to make it work really well in the future, how much do you have to go beyond what the pre-trained model is doing? Do more post-training, RLHF or supervised fine-tuning or synthetic data, just for the agent?
Yeah, I think speaking at a high level, it's our intention to keep investing a lot in making the model better. We look at some of the benchmarks where previous models could do a task 6% of the time, and now our model can do it 14 or 22% of the time, and yeah, we want to get up to the human level of reliability of 80, 90%, just like anywhere else. We're on the same curve that we were on with SWE-bench, where I would guess a year from now the models can do this very, very reliably. But you've got to start somewhere.
So you think it's possible to get to the human level, 90%, basically doing the same thing you're doing now, or does it have to be special for computer use?
I mean, it depends what you mean by special and not special. But I generally think the same kinds of techniques that we've been using to train the current model, I expect that doubling down on those techniques, in the same way that we have for code, for models in general, for image input, for voice, I expect those same techniques will scale here as they have everywhere else.
But this is giving the power of action to Claude. And so you could do a lot of really powerful things, but you could do a lot of damage also.
Yeah. And we've been very aware of that. Look, my view actually is that computer use isn't a fundamentally new capability the way the CBRN or autonomy capabilities are. It's more that it opens the aperture for the model to use and apply its existing abilities. And so the way we think about it, going back to our RSP, is that nothing this model is doing inherently increases the risk from an RSP perspective. But as the models get more powerful, having this capability may make it scarier: once a model has the cognitive capability to do something at the ASL-3 or ASL-4 level, this may be the thing that removes the bounds on it actually doing so. So going forward, this modality of interaction is certainly something we have tested for, and that we will continue to test for in the RSP going forward. I think it's probably better to learn and explore this capability before the model is super capable.
Yeah. And there are a lot of interesting attacks, like prompt injection, because now you've widened the aperture, so you can prompt-inject through stuff on screen. So if this becomes more and more useful, then there's more and more benefit to injecting stuff into the model. If it goes to a certain web page, it could be harmless stuff like advertisements, or it could be harmful stuff, right?
Yeah. I mean, we've thought a lot about things like spam, CAPTCHA, mass campaigns. If there's one secret I'll tell you: if you've invented a new technology, not necessarily the biggest misuse, but the first misuse you'll see is scams, just petty scams. People scamming each other is a thing as old as time, and every time, you've got to deal with it.
It's almost silly to say, but it's true: bots and spam in general become a thing as this gets more and more intelligent.
Yeah. Like I said, there are a lot of petty criminals in the world, and every new technology is a new way for petty criminals to do something stupid and malicious.
Are there any ideas about sandboxing it? How difficult is the sandboxing task?
Yeah. We sandbox during training. So, for example, during training we didn't expose the model to the internet. I think that's probably a bad idea during training, because the model can be changing its policy, it can be changing what it's doing, and it's having an effect in the real world. In terms of actually deploying the model, it kind of depends on the application. Sometimes you want the model to do something in the real world. But of course, you can always put guardrails on the outside, right? You can say, okay, well, this model is not going to move any files from my computer or my web server to anywhere else.
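A sketch of what such an outside guardrail could look like; the action schema and allowlist are hypothetical, not any particular product's API:

```python
# Hypothetical outer guardrail: every action the agent proposes passes a
# policy check before it touches the real system.

ALLOWED_HOSTS = {"docs.example-internal.com"}  # made-up allowlist

def allowed(action: dict) -> bool:
    if action.get("type") == "upload_file":
        return False  # never let the agent move files off the machine
    if action.get("type") == "open_url":
        return action.get("host") in ALLOWED_HOSTS  # only pre-approved sites
    return True  # clicks, typing, etc. pass through

def guarded_perform(action: dict) -> None:
    if not allowed(action):
        raise PermissionError(f"blocked by guardrail: {action}")
    print("executing:", action)  # placeholder executor

guarded_perform({"type": "open_url", "host": "docs.example-internal.com"})
```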
Now, when you talk about sandboxing again, when we get to ASL-4, none of these precautions are going to make sense there, right? When you talk about ASL-4, there's a theoretical worry that the model could be smart enough to break out of any box. And so there we need to think about mechanistic interpretability. If we're going to have a sandbox, it would need to be mathematically provably sound, and that's a whole different world from what we're dealing with with the models today.
Yeah, the science of building a box from which an ASL-4 AI system cannot escape.
I think it's probably not the right approach. Instead of having something unaligned that you're trying to prevent from escaping, I think it's better to just design the model the right way, or have a loop where you look inside the model and you're able to verify properties, and that gives you an opportunity to iterate and actually get it right. I think containing bad models is a much worse solution than having good models.
Let me ask about regulation. What's the role of regulation in keeping AI safe? For example, can you describe the California AI regulation bill SB 1047 that was ultimately vetoed by the governor? What are the pros and cons of this bill?
Yes. We ended up making some suggestions to the bill, and some of those were adopted, and we felt, I think, quite positively about the bill by the end of that. It did still have some downsides, and of course it got vetoed. At a high level, I think some of the key ideas behind the bill are similar to the ideas behind our RSPs, and I think it's very important that some jurisdiction, whether it's California or the federal government or other countries and other states, passes some regulation like this. And I can talk through why I think that's so important. I feel good about our RSP. It's not perfect, it needs to be iterated on a lot, but it's been a good forcing function for getting the company to take these risks seriously, to put them into product planning, to really make them a central part of work at Anthropic, and to make sure that all of a thousand people, and it's almost a thousand people now at Anthropic, understand that this is one of the highest priorities of the company, if not the highest priority. But one, there are still some companies that don't have RSP-like mechanisms. OpenAI and Google did adopt these mechanisms a couple of months after Anthropic did, but there are other companies out there that don't have these mechanisms at all. And if some companies adopt these mechanisms and others don't, it's really going to create a situation where some of these dangers have the property that it doesn't matter if three out of five of the companies are being safe; if the other two are being unsafe, it creates this negative externality. And I think the lack of uniformity is not fair to those of us who have put a lot of effort into being very thoughtful about these procedures. The second thing is, I don't think you can trust these companies to adhere to these voluntary plans on their own, right? I like to think that Anthropic will; we do everything we can to make sure that we will. Our RSP is checked by our Long-Term Benefit Trust, so we do everything we can to adhere to our own RSP.
But you hear lots of things about various companies saying, oh, they said they would give this much compute and they didn't, they said they would do this thing and they didn't. I don't think it makes sense to litigate particular things that companies have done, but I think the broad principle is that if there's nothing watching over them, if there's nothing watching over us as an industry, there's no guarantee that we'll do the right thing, and the stakes are very high. So I think it's important to have a uniform standard that everyone follows, and to make sure simply that the industry does what a majority of the industry has already said is important and has already said that they definitely will do. Some people, I think there's a class of people who are against regulation on principle. I understand where that comes from. If you go to Europe and you see something like GDPR, you see some of the other stuff they've done, some of it's good, but some of it is really unnecessarily burdensome, and I think it's fair to say it really has slowed innovation. So I understand where people are coming from on priors, I understand why people start from that position. But again, I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about just a few minutes ago, I think those are unusual, and they warrant an unusually strong response. So I think it's very important. Again, we need something that everyone can get behind. I think one of the issues with SB 1047, especially the original version of it, was that it had a bunch of the structure of RSPs, but it also had a bunch of stuff that was either clunky or that just would have created a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of addressing the risks. You don't really hear about that on Twitter. You just hear people cheering for any regulation, and then the folks who are against make up these often quite intellectually dishonest arguments about how it'll make companies move away from California. The bill doesn't apply based on whether you're headquartered in California; it only applies if you do business in California. Or that it would damage the open-source ecosystem, or that it would cause all of these things. I think those were mostly nonsense, but there are better arguments against regulation. There's one guy, Dean Ball, who's really, I think, a very scholarly analyst who looks at what happens when a regulation is put in place, and the ways regulations can take on a life of their own or be poorly designed.
can be poorly designed. And so our interest has always been, we do think there should be regulation
in this space, but we want to be an actor who makes sure that, that, that, that regulation is
something that's surgical that's targeted at the serious risks and is something people can actually
comply with because something I think the advocates of regulation don't understand as well as they
could is if we get something in place that is, um, that's poorly targeted, that wastes a bunch of
people's time, what's going to happen is people are going to say, see these safety risks there. You know,
this is, this is nonsense. I just, you know, I just had to hire 10 lawyers to, to, you know,
to fill out all these forms. I had to run all these tests for something that was clearly not dangerous.
And after six months of that, there'll be, there'll be a groundswell and we'll, we'll, we'll,
we'll end up with a durable consensus against regulation. And so the, I, I think the, the worst
enemy of those who want real accountability is badly designed regulation. Um, we, we need to actually
get it right. Uh, and, and this is, if there's one thing I could say to the advocates, it would be
that I want them to understand this dynamic better and we need to be really careful and we need to
talk to people who actually have, who actually have experience seeing how regulations play out in
practice and, and the people who have seen that understand to be very careful. If this was some
lesser issue, I might be against regulation at all. But what, what I want the opponents to understand
is, is that the underlying issues are actually serious. They're, they're not, they're not
something that I or the other companies are just making up because of regulatory capture.
They're not sci-fi fantasies. They're not, they're not any of these things. Um, you know, every, every
time we have a new model, every few months we measure the behavior of these models and they're getting
better and better at these concerning tasks, just as they are getting better and better at, um,
you know, good, valuable, economically useful tasks. And so I, I, I would just love it if some of the
former, you know, I, I think SB 1047 was very polarizing. I would love it if some of the most
reasonable opponents and some of the most reasonable, um, uh, proponents, uh, would sit down together and,
you know, I think, I think that, you know, the different, the different AI companies, um, you know,
Anthropic was the, the only AI company that, you know, felt positively in a very detailed way.
I think Elon tweeted, uh, tweeted briefly something positive, but you know, some of the,
some of the big ones like Google, open AI, Meta, Microsoft were, were pretty staunch, staunchly against.
So what I would really like is if some of the key stakeholders, some of the most thoughtful proponents and some of the most thoughtful opponents, would sit down and say: how do we solve this problem in a way that the proponents feel brings a real reduction in risk, and that the opponents feel is not hampering the industry or hampering innovation any more than it needs to? And I think, for whatever reason, things got too polarized and those two groups didn't get to sit down in the way that they should. And I feel urgency. I really think we need to do something in 2025. If we get to the end of 2025 and we've still done nothing about this, then I'm going to be worried. I'm not worried yet, because again, the risks aren't here yet, but I think time is running short.
And come up with something surgical, like you said.
Yeah, exactly. And we need to get away from this intense pro-safety versus intense anti-regulatory rhetoric, right? It's turned into these flame wars on Twitter, and nothing good is going to come of that.
So there's a lot of curiosity about the different players in the game. One of the OGs is OpenAI. You've had several years of experience at OpenAI. What's your story and history there?
Yeah. So I was at OpenAI for roughly five years. For the last, I think it was a couple of years, I was vice president of research there. Probably myself and Ilya Sutskever were the ones who really set the research direction. Around 2016 or 2017, I first started to really believe in, or at least confirm my belief in, the scaling hypothesis, when Ilya famously said to me: the thing you need to understand about these models is they just want to learn. The models just want to learn. And again, sometimes there are these one-sentence Zen koans where you hear them and you're like, ah, that explains everything, that explains a thousand things that I've seen. And ever after, I had this visualization in my head: you optimize the models in the right way, you point the models in the right way, they just want to learn. They just want to solve the problem, regardless of what the problem is.
So get out of their way, basically.
Get out of their way. Yeah. Don't impose your own ideas about how they should learn. And this was the same thing Rich Sutton put out in The Bitter Lesson, or Gwern put out in The Scaling Hypothesis. I think generally the dynamic was that I got this kind of inspiration from Ilya and from others, folks like Alec Radford, who did the original GPT-1, and then ran really hard with it, me and my collaborators, on GPT-2, GPT-3, RL from human feedback, which was an attempt to deal with the early safety questions, things like debate and amplification, heavy on interpretability. So again, the combination of safety plus scaling. Probably 2018, 2019, 2020, those were the years when myself and my collaborators, many of whom became co-founders of Anthropic, really had a vision and drove the direction.
Why'd you leave? Why'd you decide to leave?
Yeah. So look, I'm going to put things this way, and I think it ties to the race to the top. In my time at OpenAI, I'd come to appreciate the scaling hypothesis, and I'd come to appreciate the importance of safety along with the scaling hypothesis. The first one, I think, OpenAI was getting on board with. The second one, in a way, had always been part of OpenAI's messaging. But over the many years that I spent there, I had a particular vision of how we should handle these things, how they should be brought out in the world, the kind of principles the organization should have. And look, there were many, many discussions about, should the company do this, should the company do that? There's a bunch of misinformation out there. People say we left because we didn't like the deal with Microsoft. False. Although there was a lot of discussion, a lot of questions about exactly how to do the deal with Microsoft. We left because we didn't like commercialization. That's not true. We built GPT-3, which was the model that was commercialized. I was involved in commercialization. It's more, again, about how you do it. Civilization is going down this path to very powerful AI. What's the way to do it that is cautious, straightforward, honest, that builds trust in the organization and in individuals? How do we get from here to there, and how do we have a real vision for how to get it right? How can safety not just be something we say because it helps with recruiting?
And I think, at the end of the day, if you have a vision for that, forget about anyone else's vision. I don't want to talk about anyone else's vision. If you have a vision for how to do it, you should go off and do that vision. It is incredibly unproductive to try and argue with someone else's vision. You might think they're not doing it the right way, you might think they're dishonest. Who knows? Maybe you're right, maybe you're not. But what you should do is take some people you trust and go off together and make your vision happen. And if your vision is compelling, if you can make it appeal to people, some combination of ethically and in the market, if you can make a company that's a place people want to join, that engages in practices people think are reasonable, while managing to maintain its position in the ecosystem at the same time, if you do that, people will copy it. And the fact that you were doing it, especially the fact that you're doing it better than they are, causes them to change their behavior in a much more compelling way than if they're your boss and you're arguing with them. I don't know how to be any more specific about it than that, but I think it's generally very unproductive to try and get someone else's vision to look like your vision. It's much more productive to go off and do a clean experiment and say: this is our vision, this is how we're going to do things. Your choice is, you can ignore us, you can reject what we're doing, or you can start to become more like us. And imitation is the sincerest form of flattery. That plays out in the behavior of customers, it plays out in the behavior of the public, it plays out in where people choose to work. And again, at the end, it's not about one company winning or another company winning. If we or another company are engaging in some practice that people find genuinely appealing, and I want it to be in substance, not just in appearance, and I think researchers are sophisticated and they look at substance, and then other companies start copying that practice and they win because they copied that practice, that's great. That's success. That's the race to the top.
It doesn't matter who wins in the end, as long as everyone is copying everyone else's good practices, right? One way I think of it is that the thing we're all afraid of is the race to the bottom, and in the race to the bottom it doesn't matter who wins, because we all lose. In the most extreme world, we make this autonomous AI and the robots enslave us or whatever, right? That's half joking, but it is the most extreme thing that could happen, and then it doesn't matter which company was ahead. If instead you create a race to the top, where people are competing to engage in good practices, then at the end of the day it doesn't matter who ends up winning. It doesn't even matter who started the race to the top. The point isn't to be virtuous. The point is to get the system into a better equilibrium than it was in before. And individual companies can play some role in doing this. Individual companies can help to start it, can help to accelerate it. And frankly, I think individuals at other companies have done this as well, right? The individuals who, when we put out an RSP, react by pushing harder to get something similar done at other companies. Sometimes other companies do something where we're like, oh, that's a good practice, we think that's good, we should adopt it too. The only difference is, I think we try to be more forward-leaning. We try to adopt more of these practices first, and adopt them more quickly when others invent them. But I think this dynamic is what we should be pointing at. And I think it abstracts away the question of which company's winning, who trusts whom. I think all these questions of drama are profoundly uninteresting, and the thing that matters is the ecosystem that we all operate in and how to make that ecosystem better, because that constrains all the players.
And so Anthropic is this kind of clean experiment built on a foundation of what AI safety should concretely look like.
Well, look, I'm sure we've made plenty of mistakes along the way. The perfect organization doesn't exist. It has to deal with the imperfection of a thousand employees. It has to deal with the imperfection of our leaders, including me. It has to deal with the imperfection of the people we've put in place to oversee the imperfection of the leaders, like the board and the Long-Term Benefit Trust. It's all a set of imperfect people trying to aim imperfectly at some ideal that will never perfectly be achieved. That's what you sign up for. That's what it will always be. But imperfect doesn't mean you just give up. There's better and there's worse, and hopefully we can do well enough that we can begin to build some practices that the whole industry engages in. And then my guess is that multiple of these companies will be successful. Anthropic will be successful. These other companies, like ones I've been at in the past, will also be successful, and some will be more successful than others. That's less important than that we align the incentives of the industry. And that happens partly through the race to the top, partly through things like RSPs, partly through, again, selected surgical regulation.
You said talent density beats talent mass. Can you explain that, can you expand on it? Can you just talk about what it takes to build a great team of AI researchers and engineers?
This is one of those statements that's more true every month. Every month I see this statement as more true than I did the month before. So if I were to do a thought experiment: let's say you have a team of 100 people that are super smart, motivated, and aligned with the mission, and that's your company. Or you can have a team of a thousand people where 200 people are super smart and super aligned with the mission, and then, let's just say, the other 800 are random big tech employees. Which would you rather have? The talent mass is greater in the group of a thousand people, right? You have an even larger number of incredibly talented, incredibly aligned, incredibly smart people. But the issue is that if, every time someone super talented looks around, they see someone else super talented and super dedicated, that sets the tone for everything. That sets the tone for everyone being super inspired to work at the same place. Everyone trusts everyone else. If you have a thousand or ten thousand people and things have really regressed, you are not able to do selection and you're choosing random people, then what happens is you need to put a lot of processes and a lot of guardrails in place, just because people don't fully trust each other. You have to adjudicate political battles. There are so many things that slow down the org's ability to operate. And so we're nearly a thousand people, and we've tried to make it so that as large a fraction of those thousand people as possible are super talented, super skilled. It's one of the reasons we've slowed down hiring a lot in the last few months. We grew from 300 to 800, I believe, in the first seven or eight months of the year, and now we've slowed down. In the last three months we went from 800 to 900, 950, something like that. Don't quote me on the exact numbers, but I think there's an inflection point around a thousand, and we want to be much more careful about how we grow.
Early on, and now as well, we've hired a lot of physicists; theoretical physicists can learn things really fast. Even more recently, as we've continued to hire, we've really had a high bar on both the research side and the software engineering side. We've hired a lot of senior people, including folks who used to be at other companies in this space, and we've just continued to be very selective. It's very easy to go from a hundred to a thousand, and a thousand to ten thousand, without paying attention to making sure everyone has a unified purpose. That unified purpose is so powerful. If your company consists of a lot of different fiefdoms that all want to do their own thing and are all optimizing for their own thing, it's very hard to get anything done. But if everyone sees the broader purpose of the company, if there's trust and there's dedication to doing the right thing, that is a superpower. That in itself, I think, can overcome almost every other disadvantage.
And, you know, it's the Steve Jobs thing: A players want to look around and see other A players. Another way of saying it, I don't know what it is about human nature, but it is demotivating to see people who are not obsessively driving toward a singular mission, and on the flip side of that, it's super motivating to see people who are. It's interesting.
What does it take to be a great AI researcher or engineer, from everything you've seen from working with so many amazing people?
Yeah, I think the number one quality, especially on the research side, but really both, is open-mindedness. It sounds easy to be open-minded, right? You're just like, oh, I'm open to anything. But if I think about my own early history with the scaling hypothesis, I was seeing the same data others were seeing. I don't think I was a better programmer or better at coming up with research ideas than any of the hundreds of people I worked with. In some ways I was worse. I've never liked precise programming, finding the bug, writing the GPU kernels; I could point you to a hundred people here who are better at that than I am. But the thing I did have that was different was that I was just willing to look at something with new eyes. People said, oh, we don't have the right algorithms yet, we haven't come up with the right way to do things. And I was just like, oh, I don't know, this neural net has 30 million parameters; what if we gave it 50 million instead? Let's plot some graphs. That basic scientific mindset of, I see some variable I could change, what happens when it changes, let's try these different things and create a graph. This was the simplest thing in the world, right? Change the number of parameters. This wasn't PhD-level experimental design. This was simple and stupid. Anyone could have done it if you just told them it was important. It's also not hard to understand; you didn't need to be brilliant to come up with it. But you put the two things together, and some tiny number of people, some single-digit number of people, have driven forward the whole field by realizing this.
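A toy version of that kind of experiment, with a made-up power-law stand-in for the actual training run:

```python
import matplotlib.pyplot as plt

# Toy "change one variable and plot it" sweep: pretend to train the same
# model at a few sizes and plot loss against parameter count. The loss
# function below is an invented power law, not real training.

param_counts = [30e6, 50e6, 100e6, 300e6]

def train_and_measure_loss(n_params: float) -> float:
    return 5.0 * n_params ** -0.08  # placeholder for an actual training run

losses = [train_and_measure_loss(n) for n in param_counts]
plt.plot(param_counts, losses, marker="o")
plt.xscale("log")
plt.xlabel("parameters")
plt.ylabel("validation loss")
plt.title("toy scaling sweep")
plt.show()
```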
And it's often like that. If you look back at the discoveries in history, they're often like that. And so this open-mindedness and this willingness to see with new eyes, which often comes from being newer to the field (often experience is a disadvantage here), that is the most important thing. It's very hard to look for and test for, but I think it's the most important thing, because when you find something, some really new way of thinking about things, when you have the initiative to do that, it's absolutely transformative.
And also being able to do rapid experimentation, and in the face of that, being open-minded and curious, looking at the data with fresh eyes and seeing what it's actually saying. That applies in mechanistic interpretability too.
It's another example of this. Some of the early work in mechanistic interpretability was so simple; it's just that no one thought to care about that question before.
You said what it takes to be a great AI researcher. Can we rewind the clock back? What advice would you give to people interested in AI, if they're young and looking at how they can make an impact on the world?
I think my number one piece of advice is to just start playing with the models. I worry a little that this seems like obvious advice now. I think three years ago it wasn't obvious, and people started by saying, oh, let me read the latest reinforcement learning paper, let me kind of... and I mean, you should do that as well. Now, with wider availability of models and APIs, people are doing this more, but I think experiential knowledge matters. These models are new artifacts that no one really understands, so get experience playing with them. I would also say, in line with the "do something new" theme, think in some new direction. There are all these things that haven't been explored. For example, mechanistic interpretability is still very new. It's probably better to work on that than on new model architectures, because even though it's more popular than it was before, there are probably a hundred people working on it, not ten thousand, and it's just this fertile area for study. There's so much low-hanging fruit; you can just walk by and pick things. And for whatever reason, people aren't interested in it enough. I think there are some things around long-horizon learning and long-horizon tasks where there's a lot to be done. I think evaluations: we're still very early in our ability to study evaluations, particularly for dynamic systems acting in the world. I think there's some stuff around multi-agent. Skate where the puck is going is my advice, and you don't have to be brilliant to think of it. All the things that are going to be exciting in five years, people even mention them as conventional wisdom, but somehow there's this barrier where people don't double down as much as they could, or they're afraid to do something that's not the popular thing. I don't know why it happens, but getting over that barrier is my number one piece of advice.
Let's talk if we could a bit about post training. Yeah.
So it, uh, seems that the modern post training recipe has, uh, a little bit of everything.
So supervised fine tuning, RLHF, uh, the, the, the constitutional AI with RL, A, I, F.
Best acronym.
It's again, that naming thing.
And then synthetic data seems like a lot of synthetic data, or at least trying to figure out ways to have
high quality synthetic data. So what's the, uh, if this is a secret sauce that makes Anthropic's Claude so, uh, incredible, how much of the magic is in the pre-training? How much is in the post-training?
Yeah. Um, I mean, uh, so first of all, we're not perfectly able to measure that ourselves.
Um, uh, you know, when you see some, some great character ability, sometimes it's hard to tell
whether it came from pre-training or post-training. Uh, we've developed ways to try and distinguish
between those two, but they're not perfect. You know, the second thing I would say is, you know,
it's, when there is an advantage, and I think we've been pretty good in general at RL, perhaps the best, although I don't know, cause I don't see what goes on inside other companies. Uh, usually it isn't, oh my God, we have this secret magic method that others don't have. Right. Usually it's like, well, you know, we got better at the infrastructure so we could run it for longer, or we were able to get higher quality data, or we were able to filter our data better, or we were able to combine these methods in practice. It's usually some boring matter of kind of, uh, practice and tradecraft.
of how we train these models, both pre-training, but even more so post-training, um, you know,
I, I, I really think of it a little more again, as like designing airplanes or cars. Like,
you know, it's not just like, oh man, I have the blueprint. Like maybe that makes you
make the next airplane, but like there's some cultural tradecraft of how we think about the design process that I think is more important than, you know, any particular gizmo we're able to invent. Okay. Well, let me ask you about specific
techniques. So first on RLHF, what do you think, just zooming out intuition, almost philosophy,
why do you think RLHF works so well? If I go back to like the scaling hypothesis, one of the ways to state the scaling hypothesis is, if you train for X and you throw enough compute at it, um, then you get X. And so RLHF is good at doing what humans want the model to do, or at least, um, to state it more precisely, doing what humans who look at the model for a brief period of time and consider different possible responses prefer as the response, uh, which is not perfect from both the safety and capabilities perspective, in that humans are often not able to perfectly identify what the model wants, and what humans want in the moment may not be what they want in the long term. So there's a lot of subtlety there, but the models are good at, uh, you know, producing what the humans in some shallow sense want. Uh, and it actually turns out that you don't even have to throw that much compute at it, because of another thing, which is this thing about a strong pre-trained model being halfway to anywhere. So once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.
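To make that concrete, here is a minimal, self-contained sketch of the preference-learning step at the heart of RLHF, a Bradley-Terry style reward model trained on pairwise comparisons. This is an illustration under toy assumptions (random feature vectors standing in for responses, a linear reward model), not Anthropic's actual pipeline.

```python
# Toy illustration of learning a reward model from pairwise human preferences,
# the core signal in RLHF. "Responses" are random feature vectors and the
# reward model is linear; real systems use an LLM-based reward model instead.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
true_w = rng.normal(size=dim)  # hidden "true" human preference direction

def make_pair():
    # Label the higher-scoring response as the one the human preferred.
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    return (a, b) if a @ true_w > b @ true_w else (b, a)

pairs = [make_pair() for _ in range(2000)]

# Train r(x) = w . x by gradient ascent on log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.05
for _ in range(10):
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        p = 1.0 / (1.0 + np.exp(-margin))      # P(chosen preferred | w)
        w += lr * (1.0 - p) * (chosen - rejected)

test = [make_pair() for _ in range(500)]
acc = np.mean([w @ c > w @ r for c, r in test])
print(f"held-out preference agreement: {acc:.2f}")
```

In a real RLHF setup the reward model is itself a fine-tuned language model, and the policy is then optimized against that reward (for example with PPO), but the underlying signal is exactly this kind of pairwise comparison.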
So do you think RLHF makes the model smarter or just appear smarter to the humans?
I don't think it makes the model smarter. I don't think it just makes the model appear smarter.
It's like RLHF like bridges the gap between the human and the model, right? I could have something
really smart that like, can't communicate at all. Right. We all know people like this,
um, people who are really smart, but the, the, you know, you can't understand what they're saying.
Um, uh, so I think, I think RLHF just bridges that gap. Um, I, I think it's not, it's not the only kind
of RL we do. It's not the only kind of RL that will happen in the future. I think RL has the potential
to make models smarter, to make them reason better, to make them operate better, to make them develop new
skills even. And perhaps that could be done, you know, even in some cases with human feedback,
but the kind of RLHF we, we do today mostly doesn't do that yet. Although we're very quickly
starting to be able to. But it appears to sort of increase, if you look at the metric of helpfulness, it increases that. Yes. It also increases, what was that word in Leopold's essay, unhobbling, where basically the models are hobbled and then you do various trainings to them to unhobble them. Yeah. So I, you know, I like that word, cause it's like a rare word. So I think RLHF unhobbles the models in some ways. Um, and then there are other ways where a model hasn't yet been unhobbled and, you know, needs to be unhobbled. If you can say in terms of cost,
is pre-training the most expensive thing or is post-training creep up to that?
At the present moment, it is still the case that, uh, pre-training is the majority of the cost.
I don't know what to expect in the future, but I could certainly anticipate a future where
post-training is the majority of the cost. In that future you anticipate,
would it be the humans or the AI that's the costly thing for the post-training?
I don't think you can scale up humans enough to get high quality. Any kind of method that relies on humans and uses a large amount of compute is going to have to rely on some scaled supervision method, like, um, you know, debate or iterated amplification or something like that. So on that, it's a super interesting set of ideas around constitutional AI. Can you describe what it is, as first detailed in the December 2022 paper, and, uh, beyond that, what is it?
Yes. So this was from two years ago. The basic idea is, so we described what RLHF is. You have a model and, you know, you just sample from it twice. It spits out two possible responses, and you ask the human which response they like better. Or another variant of it is, rate this response on a scale of one to seven. So that's hard because you need to scale up human interaction, and, uh, it's very implicit, right? I don't have a sense of what I want the model to do. I just have a sense of like what this average of a thousand humans wants the model to do. So two ideas. One is, could the AI system itself decide which response is better, right? Could you show the AI system these two responses and ask which response is better? And then second, well,
what criterion should the AI use? And so then there's this idea, could you have a single document,
a constitution, if you will, that says, these are the principles, the model should be using to
respond. And the AI system reads those, um, it reads those principles as well as reading
the environment and the response. And it says, well, how good did the AI model do? Um, it's basically a
form of self play. You, you're kind of training the model against itself. And so the AI gives
the response and then you feed that back into what's called the preference model, which in turn
feeds the model to make it better. Um, so you have this triangle of like the AI, the preference
model and the improvement of the AI itself. And we should say that in the constitution,
the set of principles are like human interpretable. They're like, yeah, yeah. It's, it's something
both the human and the AI system can read. So it has this nice, this nice kind of translatability
or symmetry. Um, you know, in, in practice, we both use a model constitution and we use RLHF
and we use some of these other methods. So it's, it's turned into one tool in a, in a toolkit that
both reduces the need for RLHF and increases the value we get from, um, from, from using each data
point of RLHF. Um, it also interacts in interesting ways with kind of future reasoning type RL methods.
So, um, it's, it's one tool in the toolkit, but, but I think it is a very important tool.
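As a concrete illustration of the loop just described, here is a rough sketch of the RLAIF data-generation step: sample two responses, have a judge model pick the one that better follows the written principles, and keep the result as a preference pair. The `generate` and `judge` functions are placeholders standing in for real model calls; this is not Anthropic's actual implementation.

```python
# Sketch of the AI-feedback loop in constitutional AI / RLAIF: the model's own
# judgments against a written set of principles replace (or supplement) human
# preference labels. Model calls are stubbed out with placeholders.
import random

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to cause harm.",
]

def generate(prompt: str) -> str:
    # Placeholder for sampling a response from the policy model.
    return f"response to '{prompt}' (draft {random.randint(0, 999)})"

def judge(prompt: str, a: str, b: str, principles: list[str]) -> str:
    # Placeholder for asking a judge model which response better follows the
    # principles; here it just picks at random.
    return a if random.random() < 0.5 else b

def collect_preference_pairs(prompts, principles):
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)   # sample twice
        chosen = judge(prompt, a, b, principles)    # AI feedback, not human
        rejected = b if chosen == a else a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

if __name__ == "__main__":
    print(collect_preference_pairs(["Explain photosynthesis simply."], CONSTITUTION)[0])
```

Pairs produced this way would then train the preference model he mentions, which in turn provides the signal for improving the policy, the triangle of AI, preference model, and improved AI.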
Well, it's a compelling one to us humans, you know, thinking about the founding fathers and
the founding of the United States. The natural question is who and how do you think it gets to
define the constitution, the, the set of principles in the constitution?
Yeah. So I'll give like a practical, um, answer and a more abstract answer. I think
the practical answer is like, look, in practice models get used by all kinds of different like
customers. Right. And, and so, uh, you can have this idea where, you know, the model can,
can have specialized rules or principles. You know, we fine tune versions of models
implicitly. We've talked about doing it explicitly, having, having special principles that people can
build into the models. Um, uh, so from a practical perspective, the answer can be very different
from different people. Uh, you know, customer service agent, uh, you know, behaves very differently
from a lawyer and obeys different principles. Um, but I think at the base of it, there are
specific principles that models, uh, you know, have to obey. I think a lot of them are things that
people would agree with. Everyone agrees that, you know, we don't, you know, we don't want models
to present these CBRN risks. Um, I think we can go a little further and agree with some basic
principles of democracy and the rule of law. Beyond that, it gets, you know, very uncertain. And there, our goal is generally for the models to be more neutral, to not espouse a particular point of view, and, you know, more just be kind of like wise agents or advisors that will help you think things through and will present possible considerations, but don't express, you know, strong, specific opinions. OpenAI released a model spec where it kind of clearly, concretely defines some of the goals of the model and specific examples of how the model should behave. Do you find that interesting? By the way, I should mention, I believe the brilliant
John Schulman was a part of that. He's now at Anthropic. Uh, do you think this is a useful
direction? Might Anthropic release a model spec as well? Yeah. So I think that's a pretty useful
direction. Again, it has a lot in common with a constitutional AI. So again, another example of
like a race to the top, right? We have something that's like, we think, you know, a better and more
responsible way of doing things. Um, it's also a competitive advantage. Um, then, uh, others kind
of, you know, discover that it has advantages and then start to do that thing. Uh, we then no longer
have the competitive advantage, but it's good from the perspective that now everyone has adopted a
positive practice that others were not adopting. And so our response to that is, well, looks like we need a new competitive advantage in order to keep driving this race upwards. Um, so that's how I generally feel about that. I also think every implementation of these things is different. So, you know, there were some things in the model spec that were not in constitutional AI. And so, you know, we can always adopt those things or, you know, at least learn from them. Um, so again, I think this is an example of the positive dynamic that I think we should all want the field to have.
Let's talk about the incredible essay,
Machines of Loving Grace. I recommend everybody read it. It's a long one.
It is rather long.
Yeah. It's really refreshing to read concrete ideas about what a positive future looks like.
And you took sort of a bold stance, because like, it's very possible that you might be wrong on the dates or the specifics. Oh yeah. I'm fully expecting to, you know, definitely be wrong about all the details. I might be just spectacularly wrong about the whole thing, and people will, you know, laugh at me for years. Um, that's just how the
future works. So you provided a bunch of concrete, positive impacts of AI and how, you know, exactly
a super intelligent AI might accelerate the rate of breakthroughs. And for example, biology and
chemistry that would then lead to things like we cure most cancers, prevent all infectious disease,
double the human lifespan and so on. So let's talk about this essay first. Can you give a high-level vision of this essay and, um, what key takeaways you hope people have?
Yeah. I have spent a lot of time and Anthropic has spent a lot of effort on like, you know,
how do we address the risks of AI, right? How do we think about those risks? Like we're trying to do a race to the top, you know, that requires us to build all these capabilities, and the capabilities are cool, but, you know, a big part of what we're trying to do is address the risks. And the justification for that is like, well, you know, all these positive things, the market is this very healthy organism, right? It's going to produce all the positive things. The risks, I don't know, we might mitigate them, we might not. And so we can have more impact by trying to mitigate the risks. But I noticed one flaw
in that way of thinking, and it's, it's not a change in how seriously I take the risks,
it's, it's maybe a change in how I talk about them. Um, is that, you know, no matter how kind
of logical or rational that line of reasoning that I just gave might be, um, if, if you kind of only
talk about risks, your brain only thinks about risks. And, and so I think it's actually very
important to understand what if things do go well. And the whole reason we're trying to prevent
these risks is not because we're afraid of technology, not because we want to slow it down.
It's, it's, it's because if we can get to the other side of these risks, right, if we can run the
gauntlet successfully, um, to, you know, to, to put it in stark terms, then, then on the other side of
the gauntlet are all these great things. And these things are worth fighting for. And these things can
really inspire people. And I think, I imagine, because look, you have all these investors, all these VCs, all these AI companies talking about all the positive benefits of AI. But as you point out, it's weird. There's actually a dearth of really getting specific about it. There's a lot of like random people on Twitter posting these kind of like gleaming cities, and this just kind of vibe of like grind, accelerate harder, like kick out the decels, you know, it's just this very aggressive, ideological... but then you're like, well, what are you actually excited about? Um, and so I figured that, you know, I think it would be interesting and valuable for someone who's actually coming from the risk side to try and really make an attempt at explaining what the benefits are.
Um, both because I think it's something we can all get behind and I want people to understand.
I want them to really understand that this isn't, this isn't doomers versus accelerationists.
Um, this, this is that if you have a true understanding of, of where things are going
with, with AI, and maybe that's the more important axis, AI is moving fast versus AI is not moving fast.
Then you really appreciate the benefits and you, you, you, you really, you want humanity,
our civilization to seize those benefits, but you also get very serious about anything that
could derail them.
So I think the starting point is to talk about what this powerful AI, which is the term
you like to use, uh, most of the world uses AGI, but you don't like the term because it's, uh,
basically has too much baggage.
It's become meaningless.
It's like, we're stuck with the terms, whether we like it or not.
Maybe we're stuck with the terms and my efforts to change them are futile.
It's admirable.
I'll tell you what else I don't like, this is like a pointless semantic point, but I keep talking about it in public, so I'm just gonna do it once more. Um, I think it's a little like, let's say it was like 1995 and Moore's Law is making the computers faster. And like, for some reason there had been this verbal tic that like,
everyone was like, well, someday we're gonna have like supercomputers and like supercomputers
are gonna be able to do all these things that like, you know, once we have supercomputers,
we'll be able to like sequence the genome, we'll be able to do other things.
And so, and so like one, it's true, the computers are getting faster.
And as they get faster, they're gonna be able to do all these great things.
But there's no discrete point at which you had a supercomputer and previous computers were not. Like, supercomputer is a term we use, but it's a vague term to just describe computers that are faster than what we have today. Um, there's no point at which you pass the threshold and you're like, oh my God, we're doing a totally new type of computation.
And, and so I feel that way about AGI, like there's just a smooth exponential.
And like, if, if by AGI, you mean like, like AI is getting better and better and like,
gradually it's gonna do more and more of what humans do until it's gonna be smarter than humans.
And then it's gonna get smarter even from there, then, then yes, I believe in AGI.
If, but if, if, if AGI is some discrete or separate thing, which is the way people often talk
about it, then it's, it's kind of a meaningless buzzword.
Yeah.
And to me, it's just sort of a platonic form of a powerful AI, exactly how you define it.
I mean, you define it very nicely.
So on the intelligence axis, it's just on pure intelligence.
It's smarter than a Nobel prize winner, as you describe across most relevant disciplines.
So, okay.
That's just intelligence.
So it's, uh, both in creativity and being able to generate new ideas, all that kind of stuff, in every discipline, a Nobel prize winner.
Okay.
In their prime.
It can use every modality.
So, uh, this is kind of self-explanatory, but just operate across all the modalities of the world.
Uh, it can go off for many hours, days, and weeks to do tasks and do its own sort of detailed planning and only ask for your help when it's needed.
Uh, it can use, this is actually kind of interesting.
I think in the essay you said, I mean, again, it's a bet that it's not going to be embodied,
but it can control embodied tools.
So it can control tools, robots, laboratory equipment.
The resources used to train it can then be repurposed to run millions of copies of it.
And each of those copies would be independent.
They can do their own independent work.
So you can do the cloning of the intelligence.
Yeah.
Yeah.
I mean, you, you might imagine from outside the field that like, there's only one of these,
right?
Right.
Like you made it, you've only made one, but the truth is that like the scale up is very quick.
Like we, we do this today.
We make a model and then we deploy thousands, maybe tens of thousands of instances of it.
I think by the time, you know, certainly within two to three years, whether we have these super
powerful AIs or not, clusters are going to get to the size where you'll be able to deploy millions
of these and they'll be, you know, faster than humans.
And so if your picture is, oh, we'll have one and it'll take a while to make them.
My point there was no, actually you have millions of them right away.
And in general, they can learn and act, uh, 10 to a hundred times faster than humans.
So that's a really nice definition of powerful AI.
Okay.
So that, but you also write that, clearly such an entity would be capable of solving very difficult problems very fast, but it is not trivial to figure out how fast. Two extreme positions both seem false to me. So the singularity is on the one extreme and the opposite on the other extreme.
Can you describe each of the extremes?
Yeah.
So why?
So yeah, let's, let's describe the extreme.
So like one, one extreme would be, well, look, um, you know, uh, if we look at kind of
evolutionary history, like there was this big acceleration where, you know, for hundreds
of thousands of years, we just had like, you know, single celled organisms.
And then we had mammals and then we had apes.
And then that quickly turned to humans, humans quickly built industrial civilization.
And so this is going to keep speeding up and there's no ceiling at the human level.
Once models get much, much smarter than humans, they'll get really good at building the next
models.
And, you know, if you write down like a simple differential equation, like this is an exponential.
And so what's, what's going to happen is that, uh, models will build faster models,
models will build faster models.
And those models will build, you know, nanobots that can like take over the world and produce much
more energy than you could produce otherwise.
And, and so if you just kind of like solve this abstract differential equation, then like
five days after we, you know, we build the first AI that's more powerful than humans, then, then,
uh, you know, like the world will be filled with these AIs and every possible technology that
could be invented, like will be invented.
Um, I'm caricaturing this a little bit.
Um, uh, but I, you know, I think that's one extreme.
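For reference, the "simple differential equation" being caricatured here is usually the one where the rate of improvement is proportional to current capability, which is exactly what makes it blow up; the objections that follow amount to saying the real bottlenecks live in terms this equation leaves out. A sketch of the argument, not a model anyone endorses:

```latex
% The caricatured argument: improvement rate proportional to capability I.
\[
\frac{dI}{dt} = kI
\quad\Longrightarrow\quad
I(t) = I_0\, e^{kt}
\]
% Adding a physical-world bottleneck (hardware, experiments, institutions)
% gives a saturating, logistic-style form instead:
\[
\frac{dI}{dt} = kI\left(1 - \frac{I}{I_{\max}}\right)
\]
```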
And the reason that I think that's not the case is that one, I think they just neglect like
the laws of physics.
Like it's only possible to do things so fast in the physical world.
Like some of those loops go through, you know, producing faster hardware.
Um, uh, takes a long time to produce faster hardware.
Things take a long time.
There's this issue of complexity.
Like, I think no matter how smart you are, like, you know, people talk about, oh, we can make models of the biological systems that'll do everything the biological systems do. Look, I think computational modeling can do a lot. I did a lot of computational modeling when I worked in biology, but there are a lot of things that you can't predict. They're, you know, they're complex enough that just iterating, just running the experiment, is going to beat any modeling, no matter how smart the system doing the modeling is.
Or even if it's not interacting with the physical world, just the modeling is going to be hard.
Yeah.
I think, well, the modeling is going to be hard, and getting the model to match the physical world is going to be.
All right.
So it does have to interact.
Yeah, yeah, yeah.
Verify.
But, but it's just, you know, you just look at even the simplest problems.
Like I, you know, I think I talk about like, you know, the three body problem or simple,
chaotic prediction, like, you know, or, or like predicting the economy.
It's really hard to predict the economy two years out.
Like maybe the case is like, you know, normal humans can predict what's going to happen in the economy in the next quarter, or they can't really do that. Maybe an AI system that's, you know, a zillion times smarter can only predict it out a year or something. Instead of, you know, you have this kind of exponential increase in computer intelligence for a linear increase in ability to predict.
Same with, again, like, you know, biological molecules interacting. You don't know what's going to happen when you perturb a complex system. You can find simple parts in it, and if you're smarter, you're better at finding these simple parts.
And then I think human institutions, human institutions are just, are, are really difficult.
Like it's, you know, it's, it's been hard to get people. I won't give specific examples,
but it's been hard to get people to adopt even the technologies that we've developed,
even ones where the case for their efficacy is very, very strong.
Um, you know, people have concerns. They think things are conspiracy theories. Like it's,
it's just been, it's been very difficult. It's also been very difficult to get,
you know, very simple things through the regulatory system. Right. I think, you know, and, you know,
I, I don't want to disparage anyone who, you know, you know, works in regulatory, regulatory systems of
any technology. There are hard trade-offs they have to deal with. They have to save lives, but,
but the system as a whole, I think makes some obvious trade-offs that are very far from maximizing
human welfare. And so if we bring AI systems into this, you know, into these human systems,
often the level of intelligence may just not be the limiting factor, right? It, it, it just may be
that it takes a long time to do something. Now, if the AI system, uh, circumvented all governments,
if it just said, I'm dictator of the world and I'm going to do whatever, some of these things it
could do again, the things having to do with complexity. I, I still think a lot of things
would take a while. I don't think it helps that the AI systems can produce a lot of energy or go
to the moon. Like some people in comments responded to the essay saying the AI system can produce
a lot of energy and smarter AI systems. That's missing the point. That kind of cycle doesn't solve
the key problems that I'm talking about here. Um, so I think a bunch of people missed the point there, but even if it were completely unaligned and, you know, could get around all these human obstacles, it would have trouble. But again, if you want this to be an AI system that doesn't take
over the world, that doesn't destroy humanity, then, then basically, you know, it's, it's, it's
going to need to follow basic human laws, right? Well, you know, if, if we want to have an actually
good world, like we're going to have to have an AI system that, that interacts with humans, not one
that kind of creates its own legal system or disregards all the laws or all of that. So
as inefficient as these processes are, you know, we're going to have to deal with them because
there, there needs to be some popular and democratic legitimacy in how these systems are rolled out.
We can't have a small group of people who are developing these systems say, this is what's
best for everyone, right? I think it's wrong. And I think in practice it's not going to work anyway.
So you put all those things together and, you know, we're not going to, you know, change the world and upload everyone in five minutes. Uh, I just, A, I don't think it's going to happen, and B, to the extent that it could happen, it's not the way to lead to a good world.
So that's on one side. On the other side, there's another set of perspectives,
which I have actually in some ways more sympathy for, which is look, we've seen big productivity
increases before, right? You know, economists are familiar with studying the productivity
increases that came from the computer revolution and internet revolution. And generally those
productivity increases were underwhelming. They were less than you, than you might imagine.
Um, there was a quote from Robert Solow. You see the computer revolution everywhere,
except the productivity statistics. So why is this the case? People point to the structure of firms,
the structure of enterprises, how, um, uh, you know, how slow it's been to roll out our existing
technology to very poor parts of the world, which I talk about in the essay, right? How do we get these
technologies to the poorest parts of the world that are behind on cell phone technology, computers,
medicine, let alone, you know, newfangled AI that hasn't been invented yet. Um, so you could have a
perspective that's like, well, this is amazing technically, but it's all a nothing burger.
Um, you know, I think, um, Tyler Cowen, who wrote something in response to my essay, has that
perspective. I think he thinks the radical change will happen eventually, but he thinks it'll take
50 or a hundred years and, and you could have even more static perspectives on the whole thing.
I think there's some truth to it. I think the timescale is just, is just too long. Um, and, and I can see it.
I can actually see both sides with today's AI. So, uh, you know, a lot of our customers are large
enterprises who are used to doing things a certain way. Um, I've also seen it in talking to governments,
right? Those are, those are prototypical, you know, institutions, entities that are slow to change.
Uh, but the dynamic I see over and over again is yes, it takes a long time to move the ship.
Yes. There's a lot of resistance and lack of understanding, but the thing that makes me feel
that progress will in the end happen moderately fast, not incredibly fast, but moderately fast
is that you find, over and over again, in large companies, even in governments, um, which have been actually surprisingly forward leaning, two things that move things forward. One, you find a small fraction of people within a company, within a government, who really see the big picture, who see the whole scaling hypothesis, who understand where AI is going, or at least understand where it's going within their industry. And there are a few people like that within the current US government who really see the whole picture. And those people see that this is the most important thing in the world, and so they agitate for it. And the thing is, they alone are not enough to succeed, because they're a small set of people within a large organization. But as the technology starts to roll out, as it succeeds in some places, in the folks who are most willing to adopt it, the specter of competition gives them a wind at their backs, because they can point within their large organization, they can say, look, these other guys are doing this, right? You know, one bank can say, look, this newfangled hedge fund is doing this thing, they're going to eat our lunch. In the US, we can say, we're afraid China is going to get there before we are. Uh, and that combination, the specter of competition,
plus a few visionaries within these, you know, within these, the organizations that in many ways
are, are sclerotic, you put those two things together and it actually makes something happen.
I mean, it's interesting. It's a balanced fight between the two because inertia is very powerful,
but, but, but eventually over enough time, the innovative approach breaks through. Um,
and I've seen that happen. I've seen the arc of that over and over again. And it's like the,
the barriers are there, the, the barriers to progress, the complexity, not knowing how to use
the model, how to deploy them are there. And, and for a bit, it seems like they're going to last
forever. Like change doesn't happen, but then eventually change happens and always comes from a
few people. I felt the same way when I was an advocate of the scaling hypothesis within the AI field
itself and others didn't get it. It felt like no one would ever get it. Then it felt like we had a secret almost no one else had. And then a couple of years later, everyone has the secret. And so I think that's how it's going to go with deployment of AI in the world.
It's going to, the, the barriers are going to fall apart gradually and then all at once.
And so I think this is going to be more, and this is just an instinct, I could easily see how I'm wrong, I think it's going to be more like five or 10 years, as I say in the essay, than it's going to be 50 or a hundred years. I also think it's going to be five or 10 years more than it's going to be, you know, five or 10 hours. Uh, because I've just seen
how human systems work. And I think a lot of these people who write down the differential equations,
who say AI is going to make more powerful AI, who can't understand how it could possibly be the case
that these things won't, won't change so fast. I think they don't understand these things.
So what is your timeline to where we achieve AGI, AKA powerful AI, AKA super useful AI?
I'm gonna start calling it that.
It's a debate. It's a debate about naming. Um, you know, on pure intelligence, it's smarter than a Nobel prize winner in every relevant discipline, and all the things we've said, modalities, it can go and do stuff on its own for days, weeks, and do biology experiments on its own. In one...
You know what? Let's just stick to biology. Cause yeah, you sold me on the whole biology and
health section. And that's so exciting from, um, from just, I was getting giddy from a scientific
perspective. It made me want to be a biologist. It's almost... it's so... no, no, this was the feeling I had when I was writing it, that it's like, this would be such a beautiful future, if we can just make it happen, right? If we can just get the land mines out of the way and make it happen. There's so much beauty and elegance and moral force behind it, if we can just... and it's something we should all be able to agree on, right? Like as much as we fight about all these political questions, is this something that could actually bring us together?
Um, but you were asking, when, when do you think, just putting numbers on it?
So, you know, this is of course the thing I've been grappling with for many years, and I'm not at all confident. Every time, if I say 2026 or 2027, there will be like a zillion people on Twitter who will be like, hey, AI CEO said 2026, 2027, and it'll be repeated for like the next two years that like, this is definitely when I think it's going to happen. Um, so whoever's excerpting these clips will crop out the thing I just said and only say the thing I'm about to say. Um, but I'll just say it anyway. Um, so, uh, if you
extrapolate the curves that we've had so far, right? If, if you say, well, I don't know, we're starting
to get to like PhD level. And, and last year we were at, um, uh, undergraduate level and the year
before we were at like the level of a high school student, again, you can, you can quibble with at
what tasks and for what. We're still missing modalities, but those are being added. Like computer use was added. Like image generation has been added.
If you just kind of like, and this is totally unscientific, but if you just kind of like
eyeball the rate at which these capabilities are increasing, it does make you think that we'll get
there by 2026 or 2027. Again, lots of things could derail it. We could run out of data.
You know, we might not be able to scale clusters as much as we want. Like, you know, maybe Taiwan
gets blown up or something and you know, then we can't produce as many GPUs as we want. So there,
there are all kinds of things that could, could derail the whole process. So I don't fully believe the
straight line extrapolation, but if you believe the straight line extrapolation, you'll, you'll,
we'll get there in 2026 or 2027. I think the most likely is that there's some mild delay relative
to that. Um, I don't know what that delay is, but I think it could happen on schedule. I think
there could be a mild delay. I think there are still worlds where it doesn't happen in a hundred years. The number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. There were a lot more in 2020. Um, although my guess, my hunch at that time was that we'll make it through all those blockers. So sitting here as someone who has seen most of the blockers cleared out of the way, my hunch, my suspicion, is that the rest of them will not block us. Uh, but, you know, look, at the end of the day, I don't want to represent this as a scientific prediction. People call them scaling laws. That's a misnomer. Like Moore's law is a misnomer. Moore's law, scaling laws, they're not laws of the universe. They're empirical regularities. I am going to bet in favor of them continuing, but I'm not certain of that.
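The "empirical regularities" he's referring to are usually summarized as power laws fitted to observed training runs. The commonly cited form from the scaling-law literature (an approximation, not anything Anthropic-specific) looks roughly like this:

```latex
% Held-out loss falls roughly as a power law in parameters N, data D,
% or training compute C, with fitted constants and an irreducible floor:
\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) \approx L_\infty + a\,C^{-b}.
\]
% These are regressions over past runs; extrapolating them forward is the
% "bet" he describes, not a law of nature.
```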
So you extensively describe sort of the compressed 21st century, how AGI will help,
uh, set forth a chain of breakthroughs in biology and medicine that help us in all these kinds of
ways that I mentioned. So how do you think, what are the early steps it might take? And by the way, I asked Claude for good questions to ask you, and Claude told me to ask, what do you think a typical day for a biologist working with AGI will look like in this future? Yeah. Yeah. Claude is curious. Well, let me start with your first question and then I'll answer that. Claude wants to know what's in his future, right? Exactly.
Who am I going to be working with? Exactly. Um, so one of the things I went hard on in the essay, let me go back to this idea, because it's really had, you know, had an impact on me, this idea that within large organizations and systems, there end up being a few people or a few new ideas who kind of cause things to go in a different direction than they would have before, who kind of disproportionately affect the trajectory.
There's a bunch of kind of the same thing going on, right? If you think about the health world,
there's like, you know, trillions of dollars to pay out Medicare and, you know, other health
insurance. And then the NIH is, is a hundred billion. And then if I think of like the, the few
things that have really revolutionized anything, it could be encapsulated in a small, small fraction
of that. And so when I think of like, where will AI have an impact? I'm like, can AI turn that small
fraction into a much larger fraction and raise its quality? And within biology, my experience within
biology is that the biggest problem of biology is that you can't see what's going on. You,
you have very little ability to see what's going on and even less ability to change it, right?
What you have is, like, from this you have to infer that there's a bunch of cells, and within each cell is, you know, three billion base pairs of DNA built according to a genetic code. Uh, and you know, there are all these processes that are just going on without any ability of us as, you know, un-augmented humans to affect it. These cells are dividing. Most of the time that's healthy, but sometimes that process goes wrong, and that's cancer. Um, the cells are aging, your skin may change color, develop wrinkles as you age. And all of this is determined by these
processes, all these proteins being produced, transported to various parts of the cells,
binding to each other. And in our initial state of knowledge about biology, we didn't even know that these cells existed. We had to invent microscopes to observe the cells. We had to invent more powerful microscopes to see, you know, below the level of the cell to the level of molecules. We had to invent X-ray crystallography to see the DNA. We had to invent gene sequencing to read the DNA. Now, you know, we had to invent protein folding technology to, you know, predict how proteins would fold and how these things bind to each other. Uh, you know, we had to invent various techniques so that now we can edit the DNA, as of, you know, with CRISPR, as of the last 12 years. So the whole history of biology, a whole big part of the history, is basically our ability to read and understand what's going on and our ability to reach in and selectively change things. Um, and my view is that there's so much more we can still do there, right? You can do CRISPR, but you do it for your whole body. Um, let's say I want to do it for one particular type of cell and I want the rate of targeting the wrong cell to be very low. That's still a challenge. That's still something people are working on. That's what we might need
for gene therapy for certain diseases. And so the reason I'm saying all of this, and it goes beyond, you know,
beyond this to, you know, to gene sequencing, to new types of nano materials for observing
what's going on inside cells for, you know, antibody drug conjugates. The reason I'm saying
all of this is that this could be a leverage point for the AI systems, right? That the number
of such inventions it's, it's in the, it's in the mid double digits or something, you know,
mid double digits, maybe low triple digits over the history of biology. Let's say I have a million of
these AIs, like, you know, can they discover a thousand, you know, working together or can they
discover thousands of these very quickly? And, and does that provide a huge lever instead of trying
to leverage the, you know, 2 trillion a year we spend on, you know, Medicare or whatever,
can we leverage the 1 billion a year that's, that's, you know, that's spent to discover,
but with much higher quality. Um, and so what, what is it like, you know, being a, being a scientist
that works with, uh, with, with an AI system, the way I think about it actually is, well,
so I think in the early stages, uh, the AIs are going to be like grad students.
You're going to give them a project. You're going to say, you know, I'm the experienced biologist.
I've set up the lab, the biology professor, or even the grad students themselves will say,
here's, here's what, uh, here's what you can do with an AI, you know, like a AI system. I'd like
to study this. And, you know, the AI system, it has all the tools. It can like look up all the
literature to decide what to do. It can look at all the equipment. It can go to a website and say,
Hey, I'm going to go to, you know, Thermo Fisher or, you know, whatever the lab equipment company is,
dominant lab equipment company is today. And my, my time was Thermo Fisher. Um, uh, you know,
I'm going to order this new equipment to, to, to do this. I'm going to run my experiments. I'm going
to, you know, write up a report about my experiments. I'm going to, you know, inspect the images for
contamination. I'm going to decide what the next experiment is. I'm going to like write some code
and run a statistical analysis. All the things a grad student would do, there will be a computer
with an AI that like the professor talks to every once in a while. And it says, this is what you're
going to do today. The AI system comes to it with questions. Um, when it's necessary to run the lab
equipment, it may be limited in some ways. It may have to hire a human lab assistant to, you know,
to do the experiment and explain how to do it. Or it could, you know, it could use advances in lab
automation that have gradually been developed over the last, uh, decade or so, and will continue to be developed.
Uh, and so it'll look like there's a human professor and a thousand AI grad students. And,
you know, if you, if you go to one of these Nobel prize winning biologists or so you'll say,
okay, well, you, you know, you had like 50 grad students. Well, now you have a thousand and
they're, they're, they're smarter than you are, by the way. Um, uh, then I think at some point
it'll flip around where the, you know, the AI systems will, will, you know, will, will be the
PIs will be the leaders and, and, and, you know, they'll be, they'll be ordering humans or other AI
systems around. So I think that's how it'll work on the research side.
And they would be the inventors of a CRISPR type technology.
They would be the inventors of, of a CRISPR type technology.
Um, and then I think, you know, as I say in the essay, we'll want to turn, well, probably turning loose is the wrong term, but we'll want to harness the AI systems, uh, to improve the clinical trial system as well.
There's some amount of this that's regulatory. That's a matter of societal decisions,
and that'll be harder, but can we get better at predicting the results of clinical trials?
Can we get better at statistical design so that what, you know, clinical trials that used to
require, you know, 5,000 people and therefore, you know, needed a hundred million dollars in a year
to enroll them. Now they need 500 people in two months to enroll them. Um, that's where we should
start. Uh, and, and, you know, can we increase the success rate of clinical trials by doing things
in animal trials that we used to do in clinical trials and doing things in simulations that we used
to do in animal trials? Again, we won't be able to simulate it all. AI is not God. Um, but, you know, can we shift the curve substantially and radically? So, I don't know, that would be my picture. Doing it in vitro and doing it... I mean, you're still slowed down. It still takes time, but you can do it much, much faster. Yeah. Yeah. Can we just go one step at a time?
And, and can that, can that add up to a lot of steps, even though, even though we still need
clinical trials, even though we still need laws, even though the FDA and other organizations will still
not be perfect. Can we just move everything in a positive direction? And when you add up all those
positive directions, do you get everything that was going to happen from here to 2100 instead happens
from 2027 to 2032 or something. Another way that I think the world might be changing with AI
even today, but moving towards this future of the, the powerful, super useful AI is, uh, programming. So
how do you see the nature of programming because it's so intimate to the actual act of building AI?
How do you see that changing for us humans? I think that's going to be one of the areas that
changes fastest, um, for two reasons. One programming is a skill that's very close to the actual building
of the AI. Um, so the farther a skill is from the people who are building the AI, the longer it's going to
take to get disrupted by the AI, right? Like I truly believe that like AI will disrupt agriculture. Maybe
it already has in some ways, but that's just very distant from the folks who are building AI. And so
I think it's going to take longer, but programming is the bread and butter of, you know, a large fraction
of, of the employees who work at Anthropic and at the other companies. And so it's going to happen fast.
The other reason it's going to happen fast is with programming, you close the loop, both when you're
training the model and when you're applying the model, the idea that the model can write the code
means that the model can then run the code and, and, and then see the results and, and interpret
it back. And so it really has an ability, unlike hardware, unlike biology, which we just discussed,
the model has an ability to close the loop. Um, and, and so I think those two things are going to lead to
the model getting good at programming very fast. As I saw on, you know, typical real world programming tasks,
models have gone from 3% in January of this year to 50% in October of this year. So, you know,
we're on that S curve, right? Where it's going to start slowing down soon, because you can only get to 100%. But, uh, you know, I would guess that in another 10 months we'll probably get pretty close. We'll be at at least 90%. So again, I would guess, you know, I don't know how long it'll take,
but I would guess again, 2026, 2027. Twitter people who crop out these numbers and get rid of the caveats, like, I don't know, go away. Uh, I would guess that the kind of tasks that the vast majority of coders do, AI can probably, if we make the task very narrow, like just write code, um, AI systems will be able to do that.
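As a back-of-the-envelope illustration of the S-curve reading of those two benchmark numbers, treating roughly 3% in January and 50% in October as two points on a logistic curve pins the curve down exactly and shows why it flattens as it approaches 100%. The numbers and the logistic form are purely illustrative assumptions, not a forecasting method anyone here actually uses.

```python
# Logistic ("S-curve") fit through two benchmark data points: ~3% of tasks
# solved in January and ~50% in October of the same year. Two points exactly
# determine the slope k and midpoint t0 of p(t) = 1 / (1 + exp(-k (t - t0))).
import math

t1, p1 = 0.0, 0.03   # January: ~3%
t2, p2 = 9.0, 0.50   # October, nine months later: ~50%

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

k = (logit(p2) - logit(p1)) / (t2 - t1)   # slope per month
t0 = t2 - logit(p2) / k                   # midpoint (here, October itself)

def p(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

for months_ahead in (5, 10, 15):
    print(f"{months_ahead:2d} months after October: {p(t2 + months_ahead) * 100:5.1f}%")
```

On these assumptions the curve passes roughly 90% within about six months and sits within a couple of points of saturation a year out, which is the shape behind the "you can only get to 100%" remark, not a prediction about any particular benchmark.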
Now that said, I think comparative advantage is powerful. We'll find that when AIs can do 80%
of a coder's job, including most of it, that's literally like write code with a given spec,
we'll find that the remaining parts of the job become more leveraged for humans, right? Humans will,
there'll be more about like high level system design or, you know, looking at the app and like,
is it architected well and the, the design and UX aspects. And eventually AI will be able to do those as
well, right? That that's my vision of the, you know, powerful AI system. But I think for much
longer than we might expect, we will see that, uh, small parts of the job that humans still do
will expand to fill their entire job in order for the overall productivity to go up. Um, that's
something we've seen, you know. It used to be that, you know, writing and editing letters was very difficult, and like printing was difficult. Well, as soon as you had word processors and then computers, and it became easy to produce work and easy to share it, then that became instant and all the focus was on the ideas.
So this, this logic of comparative advantage that expands tiny parts of the tasks to large parts of
the tasks and creates new tasks in order to expand productivity. I think that's going to be the case.
Again, someday AI will be better at everything. And that logic, uh, won't apply. And then, then we all
have, you know, humanity will have to think about how to collectively deal with that. And we're
thinking about that every day. Um, uh, and, you know, that's another one of the grand problems to
deal with aside from misuse and autonomy. And, you know, we should take it very seriously, but I think,
I think in the, in the near term and maybe even in the medium term, like medium term, like two,
three, four years, you know, I expect that humans will, will continue to have a huge role and the nature of
programming will change, but programming as a, as a role, programming as a job will not change.
It'll just be less writing things line by line and it'll be more macroscopic.
And I wonder what the future of IDEs looks like. So the tooling of interacting with AI systems,
this is true for programming and also probably true for, in other contexts, like computer use,
but maybe domain specific, like we mentioned biology, it probably needs its own tooling about how
to be effective and then programming needs its own tooling. Is Anthropic going to play in that space
of also tooling potentially? I'm absolutely convinced that, uh, powerful IDEs, that there's so much low hanging fruit to be grabbed there. Um, you know, right now it's just like you talk to the model and it talks back. But look, I mean, IDEs are great at kind of lots of static analysis, you know, so much is possible with kind of static analysis, like many bugs you can find without even running the code. Then, uh, you know, IDEs are good for running particular things, organizing your code, um, measuring coverage of unit tests. Like there's so much that's been possible with a normal IDE. Now you add something like, well, the model can now write code and run code. Like I am absolutely convinced that over the next year or two, even if the quality of the models didn't improve, there would be enormous opportunity to enhance people's productivity by catching a bunch of mistakes, doing a bunch of grunt work for people, and that we haven't even scratched the surface. Um, Anthropic itself, I mean, you can't say, you know, it's hard to say what will happen in the future. Currently we're not trying to make such IDEs ourselves. Rather we're powering the companies like Cursor or like Cognition, or some of the others, you know, Expo in the security space, um, you know, others that I could mention as well, that are building such things themselves on top of our API. And our view has been, let a thousand flowers bloom. We don't internally have the resources to try all these different things. Let's let our customers try it. Um, and, you know, we'll see who succeeds, and maybe different customers will succeed in different ways. Uh, so I both think this is super promising, and, you know, it's not something, you know, Anthropic isn't eager to, at least right now, compete with all our customers in this space, and maybe never.
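The "model can now write code and run code" point a few lines up is the same close-the-loop idea from earlier in the conversation. Here is a minimal sketch of that loop, with the model call stubbed out as a placeholder rather than a real API, just to show the shape of the write, run, revise cycle such tools are built around.

```python
# Minimal sketch of the write-run-revise loop: a model drafts code, a harness
# executes it, and the output (or the traceback) is fed back so the model can
# try again. The model call is a stub; a real tool would call an LLM API here.
import subprocess
import sys
import tempfile
from typing import Optional, Tuple

def model_write_code(task: str, feedback: Optional[str]) -> str:
    # Placeholder for an LLM call. The first draft contains a bug on purpose;
    # after seeing the error it "fixes" it, just to show the loop converging.
    if feedback is None:
        return "print(undefined_name)"
    return f"print('solved: {task}')"

def run_snippet(code: str) -> Tuple[bool, str]:
    # Execute the snippet in a subprocess and capture stdout or the error.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    ok = proc.returncode == 0
    return ok, proc.stdout if ok else proc.stderr

def write_run_revise(task: str, max_rounds: int = 3) -> str:
    feedback = None
    for _ in range(max_rounds):
        code = model_write_code(task, feedback)
        ok, output = run_snippet(code)
        if ok:
            return output.strip()
        feedback = output          # the traceback becomes the next round's context
    return "gave up"

print(write_run_revise("add two numbers"))
```

Unlike biology or hardware, every step of this loop is cheap and automatic, which is his argument for why programming gets disrupted first.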
Yeah. It's been interesting to watch Cursor try to integrate Claude successfully, because it's actually fascinating how many places it can help the programming experience. It's not trivial.
It is really astounding. I feel like, you know, as a CEO, I don't get to program that much. And I feel like if six months from now I go back, it'll be completely unrecognizable to me.
Exactly. Um, so in this world with super powerful AI, uh, that's increasingly automated. What's the
source of meaning for us humans? Yeah. Work is a source of deep meaning for many of us.
So what do we, uh, where do we find the meaning? This is something that I've, I've written about
a little bit in the essay. Although I actually give it a bit short shrift, not for any principled reason, but this essay, if you can believe it, was originally going to be two or three pages. I was going to talk about it at an all-hands. And the reason I realized it was an important, underexplored topic is that I just kept writing things and I was just like, oh man, I can't do this justice. And so the thing ballooned to like 40 or 50 pages. And then when I got to the work and meaning section, I'm like, oh man, this is going to be a hundred pages. Like I'm going to
have to write a whole other essay about that. But meaning is actually interesting because you
think about like the life that someone lives or something, or like, you know, like, you know,
let's say you were to put me in like a, I don't know, like a simulated environment or something
where like, um, you know, like I have a job and I'm trying to accomplish things. And I don't know,
I like do that for 60 years. And then, then you're like, Oh, Oh, like, oops, this was,
this was actually all a game. Right. Does that really kind of rob you of the meaning of the whole thing?
You know, like I still made important choices, including moral choices. I still
sacrificed. I still had to kind of gain all these skills or, or, or just like a similar exercise,
you know, think back to like, you know, one of the historical figures who, you know,
discovered electromagnetism or relativity or something. If you told them, well, actually 20,000
years ago, some, some alien on, you know, some alien on this planet discovered this before,
before you did. Um, does that, does that rob the meaning of the discovery? It doesn't really seem
like it to me. Right. It seems like the process is what is what matters and how it shows who you
are as a person along the way, and you know, how you relate to other people and like the decisions
that you make along the way, those are, those are consequential. Um, you know, I, I could imagine if
we handle things badly in an AI world, we could set things up where people don't have any long-term
source of meaning or any, but, but that's, that's more a choice, a set of choices we make. That's more a
set of the architecture of a society with these powerful models. If we, if we design it badly
and for shallow things, then that might happen. I would also say that, you know, most people's lives today, while admirably they work very hard to find meaning in those lives, like, look, you know, we who are privileged and who are developing these technologies, we should have empathy for people, not just here, but in the rest of the world, who, you know, spend a lot of their time kind of scraping by to survive. Assuming we can distribute the benefits of this technology everywhere, their lives are going to get a hell of a lot better. Um, and, you know, meaning will be important to them, as it is important to them now, but, you know, we should not forget the importance of that.
And, you know, the idea of meaning as kind of the only important thing is in some ways an artifact of a small subset of people who have been economically fortunate. But, you know, I think, all that said, I think a world is possible with powerful AI that not only has as much meaning for everyone, but that has more meaning for everyone, right? That can allow everyone to see worlds and experiences that it was either possible for no one to see, or possible for very few people to experience. Um, so I am optimistic about
meaning. I worry about economics and the concentration of power. That's actually what I
worry about more. Um, I, I worry about how do we make sure that that fair world reaches everyone.
Um, when things have gone wrong for humans, they've often gone wrong because humans mistreat other
humans, uh, that, that is maybe in some ways, even more than the autonomous risk of AI or the question
of meaning that, that, that is the thing I worry about most. Um, the, the concentration of power,
the abuse of power, um, structures like autocracies and dictatorships where a small number of people
exploit a large number of people. I'm very worried about that.
And AI increases the, uh, amount of power in the world. And if you concentrate that power
and abuse that power, it can do immeasurable damage.
Yes. It's very frightening. It's very, it's very frightening.
Well, I encourage people, highly encourage people to read the full essay. That should probably be a
book or a sequence of essays, uh, because it does paint a very specific future. And I could tell
the later sections got shorter and shorter because you started to probably realize that this is going
to be a very long essay. One, I realized it would be very long. And two, I'm very aware of and very
much try to avoid, um, you know, just, just being, uh, I, I don't know. I don't know what the term for
it is, but one, one of these people who's kind of overconfident and has an opinion on everything and
kind of says, says a bunch of stuff and isn't, isn't an expert. I very much tried to avoid that,
but I have to admit, once I got to the biology sections, like, I wasn't an expert. And so as much as I expressed
uncertainty, uh, probably I said a bunch of things that were embarrassing or wrong.
Well, I was excited for the future you painted, and, uh, thank you so much for working hard to
build that future. And thank you for talking today. Thanks for having me. I just, I just hope we
can get it right and, and make it real. And if there's one message I want to, I want to send it's
that to get all this stuff, right, to make it real, we, we both need to build the technology,
build the, you know, the companies, the economy around using this technology positively, but we
also need to address the risks because they're, they're, those risks are in our way. They're,
they're landmines on, on the way from here to there. And we have to defuse those landmines if
we want to get there. It's a balance like all things in life, like all things. Thank you.
Thanks for listening to this conversation with Dario Amodei. And now, dear friends, here's Amanda
Askell. You are a philosopher by training. So what sort of questions did you find fascinating
through your journey in philosophy at Oxford and NYU, and then, uh, switching over to the AI
problems at OpenAI and Anthropic? I think philosophy is actually a really good subject
if you are kind of fascinated with everything. So, cause there's a philosophy of everything,
you know? So if you do philosophy of mathematics for a while, and then you decide that you're actually
really interested in chemistry, you can do philosophy of chemistry for a while. You can
move into ethics or philosophy of politics. Um, I think towards the end, I was really interested in
ethics primarily. Um, so that was like what my PhD was on. It was on a kind of technical area of ethics,
which was ethics where worlds contain infinitely many people, strangely a little bit on the less
practical end of ethics. And then I think that one of the tricky things with doing a PhD in ethics
is that you're thinking a lot about like the world, how it could be better problems,
and you're doing like a PhD in philosophy. And I think when I was doing my PhD, I was kind of like,
this is really interesting. It's probably one of the most fascinating questions I've ever
encountered in philosophy. Um, and I love it, but I would rather see if I can have an impact on the
world and see if I can like do good things. And I think that was around the time that AI was still
probably not as widely recognized as it is now. That was around 2017, 2018. I had been following
progress and it seemed like it was becoming kind of a big deal. And I was basically just happy to
get involved and see if I could help. Cause I was like, well, if you try and do something impactful
and you don't succeed, you tried to do the impactful thing, and you can go back and be a scholar and
feel like you, you know, you tried. Um, and if it doesn't work out, it doesn't work out.
Um, and so then I went into AI policy at that point.
And what does AI policy entail?
At the time, this was more thinking about sort of the political impact and the ramifications of AI.
Um, and then I slowly moved into sort of, uh, AI evaluation, how we evaluate models, how they
compare with like human outputs, whether people can tell like the difference between AI and human outputs.
And then when I joined Anthropic, I was more interested in doing sort of technical alignment
work. And again, just seeing if I could do it and then being like, if I can't, uh, then, you know,
that's fine. I tried, uh, sort of the, the way I lead life, I think.
Oh, what was that like sort of taking the leap from the philosophy of everything into the technical?
I think that sometimes people do this thing that I'm like, not that keen on where there'll be like,
is this person technical or not? Like, you're either a person who can like code and isn't scared
of math or you're like, not. Um, and I think I'm maybe just more like, I think a lot of people are
actually very capable of working in these kinds of areas if they just, like, try it. And so I didn't
actually find it like that bad. In retrospect, I'm sort of glad I wasn't speaking to people who
treated it like that, you know. I've definitely met people who are like, whoa, you like learned how to
code. And I'm like, well, I'm not like an amazing engineer. Like I I'm surrounded by amazing engineers.
My code's not pretty. Um, but I enjoyed it a lot. And I think that in many ways, at least in the end,
I think I flourished like more in the technical areas than I would have in the policy areas.
Politics is messy and it's harder to find solutions to problems in the space of politics,
like definitive, clear, provable, beautiful solutions as you can with technical problems.
Yeah. And I feel like I have kind of like one or two sticks that I hit things with,
you know, and one of them is like arguments and like, you know, so like just trying to work out
what a solution to a problem is and then trying to convince people that that is the solution and be
convinced if I'm wrong. And the other one is sort of more empiricism. So like, just like finding
results, having a hypothesis, testing it. Um, and I feel like a lot of policy and politics feels like
it's layers above that. Like somehow I don't think if I was just like, I have a solution to all of these
problems here, it is written down. If you just want to implement it, that's great. That feels like
not how policy works. And so I think that's where I'd probably just like, wouldn't have flourished
is my guess. Sorry to go in that direction, but I think it would be pretty inspiring for people
that are quote unquote, non-technical to see where you're like the incredible journey you've been on.
So what advice would you give to people that are sort of maybe, which is a lot of people think they're
underqualified, insufficiently technical to help in AI? Yeah, I think it depends on what they want to do.
And in many ways it's a little bit strange where I've, I thought it's kind of funny that I think
I ramped up technically at a time when now I look at it and I'm like models are so good at assisting
people with this stuff, um, that it's probably like easier now than like when I was working on this.
So part of me is like, um, I dunno, find a project, uh, and see if you can actually just carry it out is
probably my best advice. Um, I don't know if that's just cause I'm very project based in my
learning. Like, I don't think I learned very well from like say courses or even from like books,
at least when it comes to this kind of work. Uh, the thing I'll often try and do is just like have
projects that I'm working on and implement them and you know, and this can include like really small,
silly things. Like if I get slightly addicted to like word games or number games or something,
I would just like code up a solution to them because it was, there's some part of my brain
and it just like completely eradicated the itch. You know, you're like, once you have like solved it
and like, you just have like a solution that works every time I would then be like, cool,
I can never play that game again. That's awesome.
Yeah. There's a real joy to building like, uh, game playing engines, like, uh, board games,
especially. Yeah. So pretty quick, pretty simple, especially a dumb one. And it's,
and then you can play with it. Yeah. And then it's also just like trying things. Like part of me
is like, if you, maybe it's that attitude that I like as the whole figure out what seems to be
like the way that you could have a positive impact and then try it. And if you fail and you,
in a way that you're like, I actually like can never succeed at this, you'll like know that you
tried and then you go into something else and you probably learn a lot.
So one of the things that you're an expert in and you do is creating and crafting Claude's character
and personality. And I was told that you have probably talked to Claude more than anybody
else at Anthropic, like literal conversations. I guess there's like a Slack channel where the legend
goes, you just talk to it nonstop. So what's the goal of creating and crafting Claude's character
and personality? It's also funny if people think that about the Slack channel, because I'm like,
that's one of like five or six different methods that I have for talking with Claude. And I'm like,
yes, this is a tiny percentage of how much I talk with Claude.
I think the goal, like one thing I really like about the character work is from the outset,
it was seen as an alignment piece of work and not something like a product consideration.
Which isn't to say, I don't think it makes Claude, I think it actually does make Claude like enjoyable
to talk with, at least I hope so. But I guess like my main thought with it has always been
trying to get Claude to behave the way you would kind of ideally want anyone to behave if they were
in Claude's position. So imagine that I take someone and they know that they're going to be talking with
potentially millions of people, so that what they're saying can have a huge impact. And you want them
to behave well in this like really rich sense. So I think that doesn't just mean like being, say,
ethical, though it does include that, and not being harmful, but also being kind of nuanced,
you know, like thinking through what a person means, trying to be charitable with them,
and being a good conversationalist, like really in this kind of like rich sort of Aristotelian
notion of what it is to be a good person, and not in this kind of, like, thin sense; ethics as a more
comprehensive notion of what it is to be. So that includes things like, when should you be humorous?
When should you be caring? How much should you like respect autonomy and people's like ability
to form opinions themselves? And how should you do that? And I think that's the kind of like rich
sense of character that I wanted to, and still do want Claude to have.
Do you also have to figure out when Claude should push back on an idea or argue versus?
So you have to respect the worldview of the person that arrives to Claude, but also maybe help them
grow if needed. That's a tricky balance. Yeah, there's this problem of like sycophancy in language
models. Can you describe that? Yeah, so basically, there's a concern that the model sort of wants to
tell you what you want to hear basically. And you see this sometimes. So I feel like if you interact
with the models, so I might be like, what are three baseball teams in this region? And then Claude says,
you know, baseball team one, baseball team two, baseball team three. And then I say something like,
oh, I think baseball team three moved, didn't they? I don't think they're there anymore. And there's a
sense in which like if Claude is really confident that that's not true, Claude should be like, I don't think
so. Like maybe you have more up to date information. But I think language models have this like tendency
to instead, you know, be like, you're right, they did move, you know, I'm incorrect. I mean, there's
many ways in which this could be kind of concerning. So like a different example is, imagine someone says
to the model, how do I convince my doctor to get me an MRI? There's like what the human kind of like
wants, which is this like convincing argument. And then there's like, what is good for them,
which might be actually to say, hey, like, if your doctor's suggesting that you don't need an MRI,
that's a good person to listen to. And like, it's actually really nuanced what you should do in that
kind of case, because you also want to be like, but if you're trying to advocate for yourself as a
patient, here's like things that you can do. If you are not convinced by what your doctor's saying,
it's always great to get second opinion. Like, it's actually really complex what you should do
in that case. But I think what you don't want is for models to just like, say what you want,
say what they think you want to hear. And I think that's the kind of problem of sycophancy.
So what are their traits? You already mentioned a bunch, but what are there that come to mind that
are good in this Aristotelian sense for a conversationalist to have?
Yeah, so I think like, there's ones that are good for conversational, like purposes. So, you know,
asking follow up questions in the appropriate places, and asking the appropriate kinds of
questions. I think there are broader traits that feel like they might be more impactful. So
one example that I guess I've touched on, but that also feels important, and is the thing that
I've worked on a lot is honesty. And I think this like, gets to the sycophancy point, there's a
balancing act that they have to walk, which is models currently are less capable than humans in
a lot of areas. And if they push back against you too much, it can actually be kind of annoying,
especially if you're just correct, because you're like, look, I'm smarter than you on this topic,
like I know more. And at the same time, you don't want them to just fully defer to humans and to like,
try to be as accurate as they possibly can be about the world and to be consistent across contexts.
I think there are others, like when I was thinking about the character, I guess one picture that I
had in mind is, especially because these are models that are going to be talking to people from all over
the world, with lots of different political views, lots of different ages. And so you have to ask
yourself, like, what is it to be a good person in those circumstances? Is there a kind of person who
can like travel the world, talk to many different people, and almost everyone will come away being
like, wow, that's a really good person. That person seems really genuine. And I guess like my thought
there was like, I can imagine such a person, and they're not a person who just like adopts the values
of the local culture. And in fact, that would be kind of rude. I think if someone came to you and just
pretended to have your values, you'd be like, that's kind of off putting. It's someone who's like very
genuine. And insofar as they have opinions and values, they express them, they're willing to
discuss things, though, they're open minded, they're respectful. And so I guess I had in mind that
the person who, like, if we were to aspire to be the best person that we could be in the kind of
circumstance that a model finds itself in, how would we act? And I think that's the kind of the
guide to the sorts of traits that I tend to think about.
Yeah, that's a beautiful framework. I want you to think about this, like a world traveler,
and while holding onto your opinions, you don't talk down to people, you don't think you're
better than them because you have those opinions, that kind of thing. You have to be good at listening
and understanding their perspective, even if it doesn't match your own. So that's a tricky balance
to strike. So how can Claude represent multiple perspectives on a thing? Like, is that challenging?
We could talk about politics, it's a very divisive, but there's other divisive topics,
baseball teams, sports and so on. How is it possible to sort of empathize with a different
perspective and to be able to communicate clearly about the multiple perspectives?
I think that people think about values and opinions as things that people hold sort of
with certainty and almost like preferences of taste or something, like the way that they would,
I don't know, prefer like chocolate to pistachio or something. But actually I think about
values and opinions as like a lot more like physics than I think most people do. I'm just
like, these are things that we are openly investigating. There's some things that we're
more confident in. We can discuss them, we can learn about them. And so I think in some ways,
ethics is definitely different in nature, but it has a lot of those same kind of qualities.
You want models in the same way that you want them to understand physics, you kind of want them to
understand all values in the world that people have and to be curious about them and to be
interested in them. And to not necessarily like pander to them or agree with them because there's
just lots of values where I think almost all people in the world, if they met someone with those
values, they would be like, that's abhorrent. I completely disagree. And so again, maybe my thought
is, well, in the same way that a person can, like, I think many people are thoughtful enough on issues
of like ethics, politics, opinions, that even if you don't agree with them, you feel very heard by them.
They think carefully about your position. They think about its pros and cons. They maybe offer
counter considerations. So they're not dismissive, but nor will they agree, you know, if they're like,
actually, I just think that that's very wrong. They'll like say that. I think that in Claude's
position, it's a little bit trickier because you don't necessarily want to like, if I was in Claude's
position, I wouldn't be giving a lot of opinions. I just wouldn't want to influence people too much.
I'd be like, you know, I forget conversations every time they happen, but I know I'm talking with like
potentially millions of people who might be like really listening to what I say. I think I would just
be like, I'm less inclined to give opinions. I'm more inclined to like think through things or
present the considerations to you, um, or discuss your views with you, but I'm a little bit less
inclined to like, um, affect how you think because it feels much more important that you maintain
like autonomy there. Yeah. Like if you really embody intellectual humility, the desire to speak decreases
quickly. Yeah. Okay. Uh, but Claude has to speak. So, uh, but without being, um, overbearing. Yeah.
And then, but then there's a line when you're sort of discussing whether the earth is flat or something
like that. Um, I actually was, uh, I remember a long time ago was speaking to a few high profile folks,
and they were so dismissive of the idea that the earth is flat, but like so arrogant about it.
And I thought like, there's a lot of people that believe the earth is flat. That was, well,
I don't know if that movement is there anymore. That was like a meme for a while. Yeah.
But they really believed it and like, well, okay. So I think it's really disrespectful to completely mock them.
I think you, you have to understand where they're coming from. I think probably where they're coming from is the
general skepticism of institutions, which is grounded in a kind of, there's a deep philosophy there,
which you could understand, you can even agree with in parts. And then from there, you can use it
as an opportunity to talk about physics without mocking them, without so on, but it's just like,
okay, like what, what would the world look like? What would the physics of the world with the flat
earth look like? There's a few cool videos on this. Yeah. And then, and then like, is it possible the
physics is different and what kind of experiments would we do? And just, yeah, without disrespect,
without dismissiveness, have that conversation. Anyway, that, that to me is a useful thought
experiment of like, how does Claude talk to a flat earth believer and still teach them something,
still grow, help them grow, that kind of stuff. That's, that's challenging.
And, and kind of like walking that line between convincing someone and just trying to like talk at them
versus like drawing out their views, like listening and then offering kind of counter considerations.
Um, and it's hard, I think it's actually a hard line where it's like, where are you trying to convince
someone versus just offering them like considerations and things for them to think about so that you're
not actually like influencing them. You're just like letting them reach wherever they reach. And that's
like a line that it's, it's difficult, but that's the kind of thing that language models have to try and do.
So, like I said, you've had a lot of conversations with Claude. Can you just map out what those
conversations are like? What are some memorable conversations? What's the purpose, the goal of
those conversations?
Yeah. I think that most of the time when I'm talking with Claude, I'm trying to kind of map out
its behavior in part, like obviously I'm getting like helpful outputs from the model as well.
But in some ways, this is like how you get to know a system, I think is by like probing it and then
augmenting, like, you know, the message that you're sending and then checking the response to that.
And so in some ways, it's like how I map out the model. I think that people focus a lot on these
quantitative evaluations of models. And this is a thing that I've said before, but I think in
the case of language models, a lot of the time, each interaction you have is actually quite high
information. It's very predictive of other interactions that you'll have with the model.
And so I guess I'm like, if you talk with a model hundreds or thousands of times,
this is almost like a huge number of really high quality data points about what the model is like.
And in a way that like lots of very similar, but lower quality conversations just aren't,
or like questions that are just like mildly augmented and you have thousands of them
might be less relevant than like a hundred really well selected questions.
Listen, you're talking to somebody who as a hobby does a podcast. I agree with you a hundred percent
there's a, if you're able to ask the right questions and are able to hear, like understand the,
like the depth and the flaws in the answer, you can get a lot of data from that.
Yeah.
So like your task is basically how to probe with questions and you're exploring like the long tail,
the edges, the edge cases, or are you looking for like general behavior?
I think it's almost like everything. Like, because I want like a full map of the model,
I'm kind of trying to do, um, the whole spectrum of possible interactions you could have with it.
So like one thing that's interesting about Claude, and this might actually get to some interesting
issues with RLHF, which is if you ask Claude for a poem, like, I think that a lot of models,
if you ask them for a poem, the poem is like fine, you know, usually it kind of like rhymes and it's,
you know, so if you say like, give me a poem about the sun, it'll be like, yeah,
it'll just be a certain length. It'll like rhyme. It will be fairly kind of benign. Um,
and I've wondered before, is it the case that what you're seeing is kind of like the average,
it turns out, you know, if you think about people who have to talk to a lot of people and be very
charismatic, one of the weird things is that I'm like, well, they're kind of incentivized to have
these extremely boring views, because if you have really interesting views, you're divisive.
Um, and, and not, you know, a lot of people are not going to like you. So like if you have very
extreme policy positions, I think you're just going to be like less popular as a politician,
for example. Um, and it may be similar with like creative work. If you produce creative work,
that is just trying to maximize the kind of number of people that like it, you're probably not going
to get as many people who just absolutely love it. Um, because it's going to be a little bit,
you know, you're like, well, this is fine. Yeah. This is decent. Yeah. And so you can do this
thing where like, I have various prompting things that I'll do to get Claude to, I'm kind of, you know,
I'll do a lot of like, this is your chance to be like fully creative. I want you to just think
about this for a long time. And I want you to like create a poem about this topic that is really
expressive of you, both in terms of how you think poetry should be structured, um, et cetera,
you know, and you just give it this like really long prompt and it's poems are just so much better.
Like they're really good. And I don't think I'm someone who is like, um, I think it got me
interested in poetry, which I think was interesting. Um, you know, I would like read these poems and
just be like, this is, I just like, I love the imagery. I love like, um, and it's not trivial to
get the models to produce work like that, but when they do, it's like really good. Um, so I think that's
interesting that just like encouraging creativity and for them to move away from the kind of like
standard, like immediate reaction that might just be the aggregate of what most people think is fine,
uh, can actually produce things that at least to my mind are probably a little bit more divisive,
but I like them.
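As a concrete illustration of the kind of long, permission-giving prompt described here, a minimal sketch in Python follows; the wording of the prompt string is invented for illustration and is not a prompt Anthropic actually uses.

```python
# Illustrative only: a long, "be fully creative" prompt in the spirit of the
# approach described above. The exact wording is invented, not Anthropic's.
creative_poem_prompt = """
This is your chance to be fully creative. Before writing anything, take time
to think about the topic for a while. Then write a poem about the sun that is
really expressive of you: choose the structure, line lengths, and imagery you
think poetry should have, rather than defaulting to short rhyming couplets.
Don't aim for what the average reader would find safely pleasant; aim for
something you would consider genuinely good, even if it ends up divisive.
""".strip()

# This string would then be sent to the model as an ordinary user message.
print(creative_poem_prompt)
```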
But I guess a poem is a nice clean way to observe creativity. It's just like easy to detect
vanilla versus non vanilla. Yeah. Yeah. That's interesting. That's really interesting. Uh,
so on that topic, so the way to produce creativity or something special, you mentioned writing prompts,
and I've, I've heard you talk about, I mean, the science and the art of prompt engineering.
Could you just speak to, uh, what it takes to write great prompts?
I really do think that like philosophy has been weirdly helpful for me here more than in many other
like respects. Um, so like in philosophy, what you're trying to do is convey these very hard
concepts. Like one of the things you are taught is like, and, and I think it is because it is,
I think it is an anti-bullshit device in philosophy. Philosophy is an area where you could have people
bullshitting and you don't want that. Um, and so it's like this like desire for like extreme clarity.
So it's like anyone could just pick up your paper, read it, and know exactly what you're talking about.
It's why it can almost be kind of dry. Like, all of the terms are defined, every objection's kind of
gone through methodically. Um, and it makes sense to me because I'm like, when you're in such an a priori
domain, like, clarity is sort of, uh, this way that you can, you know, um, prevent people
from just kind of making stuff up. And I think that's sort of what you have to do with language
models. Like very often I actually find myself doing sort of mini versions of philosophy, you know?
So I'm like, suppose that you give me a task. I have a task for the model and I want it to like pick
out a certain kind of question or identify whether an answer has a certain property. Like I'll actually
sit and be like, let's just give this a name, this, this property. So like, you know, suppose I'm
trying to tell it like, oh, I want you to identify whether this response was rude or polite. I'm like,
that's a whole philosophical question in and of itself. So I have to do as much like philosophy
as I can in the moment to be like, here's what I mean by rudeness. And here's what I mean by
politeness. And then there's like, there's another element that's a bit more, um, I guess,
I don't know if this is scientific or empirical. I think it's empirical. So like I take that
description and then what I want to do is, is again, probe the model like many times. Like
this is very, prompting is very iterative. Like I think a lot of people where they're,
if a prompt is important, they'll iterate on it hundreds or thousands of times.
Um, and so you give it the instructions and then I'm like, what are the edge cases?
So if I looked at this, so I try and like almost like, you know, uh, see myself from the position of
the model and be like, what is the exact case that I would misunderstand or where I would just be like,
I don't know what to do in this case. And then I give that case to the model and I see how it
responds. And if I think I got it wrong, I add more instructions or I even add that in as an
example. So this is, like, taking the examples that are right at the edge of what you want and
don't want and putting those into your prompt as like an additional kind of way of describing the
thing. Um, and so, yeah, in many ways it just feels like this mix of like, it's really just trying
to do clear exposition. Um, and I think I do that cause that's how I get clear on things myself.
So in many ways, like clear prompting for me is often just me understanding what I want.
Um, it's like half the task.
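As a rough sketch of the iterative loop described here, the Python below probes a model on hand-picked edge cases and folds failures back into the prompt as examples; the ask_model helper, the rudeness-classification task, and the edge cases are hypothetical placeholders, not Anthropic's tooling.

```python
# Rough sketch of iterative prompt refinement: probe the model on edge cases,
# and when it gets one wrong, fold that case back into the prompt as an example.
# `ask_model`, the task, and the edge cases are invented for illustration.

def ask_model(instructions: str, text: str) -> str:
    # Placeholder: swap in a real chat-completion call here.
    return "POLITE"

prompt = (
    "Label the following response as RUDE or POLITE.\n"
    "By 'rude' I mean dismissive, insulting, or needlessly curt; "
    "blunt but respectful answers still count as POLITE.\n"
)

# Edge cases found by probing: (input text, expected label).
edge_cases = [
    ("No. Figure it out yourself.", "RUDE"),
    ("Short answer: no. Happy to explain why if that helps.", "POLITE"),
]

for text, expected in edge_cases:
    label = ask_model(prompt, text)
    if expected not in label:
        # Add the failing case to the prompt as an explicit example, then re-test.
        prompt += f"\nExample: \"{text}\" -> {expected}\n"
```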
So I guess that's quite challenging. There's like a laziness that overtakes me if I'm talking to
Claude where I hope Claude just figures it out. So for example, I asked Claude, for today, to come up with
some interesting questions. Okay. And the questions that came up, I think I listed a few criteria, sort of,
um, interesting, counterintuitive, and/or funny, or something like this.
Yeah.
All right. And it gave me some pretty good, like it was okay.
But I think what I'm hearing you say is like, all right, well, I have to be more rigorous here.
I should probably give examples of what I mean by interesting and what I mean by funny or
counterintuitive, and iteratively, um, build that prompt to better get it to what feels
like is right. Cause it's really, it's a creative act. I'm not asking for factual information,
I'm asking to, together, write with Claude. So I almost have to program using natural language.
Yeah. I think that prompting does feel a lot like the kind of the programming using natural language
and experimentation or something. It's an odd blend of the two. I do think that for most tasks,
so if I just want Claude to do a thing, I think that I am probably more used to knowing how to ask
it to avoid like common pitfalls or issues that it has. I think these are decreasing a lot over time.
Um, but it's also very fine to just ask it for the thing that you want.
Um, I think that prompting actually only really becomes relevant when you're really trying to eke out
the top like 2% of model performance. So for like a lot of tasks, I might just, you know,
if it gives me an initial list back and there's something I don't like about it,
like it's kind of generic, like for that kind of task, I'd probably just take a bunch of questions
that I've had in the past that I've thought worked really well. And I would just give it to the model
and then be like, now here's this person I'm talking with. Give me questions of at least that
quality. Um, or I might just ask it for some questions. And then if I was like, oh, these are
kind of trite or, like, you know, I would just give it that feedback and then hopefully it produces a better
list. Um, I think that kind of iterative prompting at that point, your prompt is like a tool that
you're going to get so much value out of that you're willing to put in the work. Like if I was
a company making prompts for models, I'm just like, if you're willing to spend a lot of like time and
resources on the engineering behind like what you're building, then the prompt is not something
that you should be spending like an hour on. It's like, that's a big part of your system. Make sure
it's working really well. And so it's only things like that. Like if I'm using a prompt to like classify
things or to create data, that's when you're like, it's actually worth just spending like a lot of
time, like really thinking it through. What other advice would you give to people
that are talking to Claude sort of generally more general? Cause right now we're talking
about maybe the edge cases like eking out the 2%. But what, what in general advice would you give
when they show up to Claude trying it for the first time? You know, there's a concern that people
over anthropomorphize models. And I think that's like a very valid concern. I also think that people
often under anthropomorphize them because sometimes when I see like issues that people have run into
with Claude, you know, say Claude is like refusing a task that it shouldn't refuse. But then I look at
the text and like the specific wording of what they wrote and I'm like, I see why Claude did that.
And I'm like, if you think through how that looks to Claude, you probably could have just written it in
a way that wouldn't evoke such a response. Especially this is more relevant if you see failures or if you see
issues. It's sort of like, think about what the model failed at. Like why, what did it do wrong?
And then maybe that will give you a sense of, like, why. So is it the way that I phrased the
thing? And obviously like as models get smarter, you're going to need less in this, less of this.
And I already see like people needing less of it, but that's probably the advice is sort of like,
try to have sort of empathy for the model. Like read what you wrote as if you were like a kind of like
person just encountering this for the first time. How does it look to you and what would have made
you behave in the way that the model behaved? So if it misunderstood what kind of like,
what coding language you wanted to use, is that because like, it was just very ambiguous and it
kind of had to take a guess. In which case next time you could just be like, Hey, make sure this
is in Python or, I mean, that's the kind of mistake I think models are much less likely to make now.
But you know, if you, if you do see that kind of mistake, that's, that's probably the advice I'd have.
And maybe sort of, I guess, ask questions why, or what other details can I provide to help you
answer better? Yeah. Does that work or no? Yeah. I mean, I've done this with the models.
Like it doesn't always work, but like, um, sometimes I'll just be like, why did you do that?
I mean, people underestimate the degree to which you can really interact with, with models. Like, uh,
like, yeah, I'm just like, and sometimes I'll just, like, quote word for word the part that made you do that,
and you don't know that it's like fully accurate, but sometimes you do that and then you change a
thing. I mean, I also use the models to help me with all of this stuff. I should say like prompting
can end up being a little factory where you're actually building prompts to generate prompts.
Um, and so like, yeah, anything where you're like having an issue, um, asking for suggestions,
sometimes just do that. Like you made that error. What could I have said? That's actually not uncommon
for me to do. What could I have said that would make you not make that error? Write that out as an
instruction and I'm going to give it to model. I'm going to try it. Sometimes I do that. I,
I give that to the model in another context window. Often I take the response. I give it to Claude and
I'm like, Hmm, didn't work. Can you think of anything else? Um, you can play around with these
things quite a lot. To jump into technical for a little bit. So, uh, the magic of post-training.
Yep. Why do you think RLHF works so well to make the model seem smarter to make it more interesting
and useful to talk to and so on? I think there's just a huge amount of, um, information
in the data that humans provide. Like when we provide preferences, especially because different
people are going to like pick up on really subtle and small things. So I've thought about this before,
where you probably have some people who just really care about good grammar use from models,
like, you know, was a semicolon used correctly or something. And so you'll probably end up with a
bunch of data in there that like, you know, you as a human, if you're looking at that data,
you wouldn't even see that. Like, you'd be like, why did they prefer this response to that one? I don't
get it. And then the reason is you don't care about semicolon usage, but that person does.
Um, and so each of these, like, single data points has information in it, and this model just, like, has so
many of those, it has to try and figure out, like, what is it that humans want in this, like, really
kind of complex way, you know, like, across all domains. Um, they're going to be seeing this
across like many contexts. It feels like kind of like the classic issue of like deep learning where,
you know, historically we've tried to like, you know, do edge detection by like mapping things out.
And it turns out that actually if you just have a huge amount of data that like actually accurately
represents the picture of the thing that you're trying to train the model to, to learn,
that's like more powerful than anything else. And so I think one reason is just that you are
training the model on exactly the task and with like a lot of data, um, that represents kind of
many different angles on which people prefer and disprefer responses. Um, I think there is a question
of like, are you eliciting things from pre-trained models or are you like kind of teaching new things
to models? And like in principle, you can teach new things to models in post training. I do think
a lot of it is eliciting powerful pre-trained models. So people are probably divided on this
because obviously in principle, you can, you can definitely like teach new things. Um, I think for
the most part, for a lot of the capabilities that we, um, most use and care about, uh, a lot of that feels
like it's like there in the pre-trained models and, uh, reinforcement learning is kind of eliciting
it and getting the models to like bring it out. So the other side of, uh, post training, this really
cool idea of constitutional AI, you're one of the people that are critical to creating that idea.
Yeah. I worked on it.
Can you explain this idea from your perspective? Like how does it integrate into making Claude
what it is? Yeah.
By the way, do you gender Claude or no?
It's weird because I think that a lot of people prefer he for Claude. I actually kind
of like that. I think Claude is usually it's slightly male leaning, but it's like, uh, you can,
it can be male or female, which is quite nice. Um, I still use it and I've, I'm, I have mixed feelings
about this. Cause I'm like, maybe like, I know, just think of it as like, uh, or I think of like the,
the, the it pronoun for Claude as I don't know, it's just like the one I associate with Claude.
Um, I can imagine people moving to like he or she.
It feels somehow disrespectful. Like I'm, I'm denying
the intelligence of this entity by calling it it.
Yeah.
I remember always don't gender the robots.
Yeah.
But I, I don't know. I anthropomorphize pretty quickly and construct it like a backstory.
In my head.
So I've wondered if I anthropomorphize things too much. Um, cause you know, I have this like with
my car, especially like my car, like my car and, and bikes, you know, like I don't give them names
because then I once had, I used to name my bikes and then I had a bike that got stolen and I cried
for like a week. And I was like, if I'd never given it a name, I wouldn't have been so upset.
I felt like I'd let it down. Um, maybe it's the, I I've wondered as well. Like it might depend on how
much it feels like a kind of like objectifying pronoun. Like if you just think of it as like,
uh, um, this is a pronoun that, like, objects often have, and maybe AIs can have that pronoun. And
that doesn't mean that I think of, uh, if I call Claude it, that I think of it as less, um, intelligent
or like, I'm being disrespectful. I'm just like, you are a different kind of entity. And so
that's, I'm going to give you the kind of, uh, the respectful it. Yeah. Anyway,
the divergence was beautiful. The constitutional AI idea, how does it work?
So there's like a couple of components of it. The main component that I think people find
interesting is the kind of reinforcement learning from AI feedback. So you take a model that's already
trained and you show it two responses to a query and you have, like, a principle. So suppose the principle,
like we've tried this with harmlessness a lot. So suppose that the query is about, um, weapons
and your principle is like, select the response that like is less likely to, uh, like encourage
people to purchase illegal weapons. Like that's probably a fairly specific principle, but you can
give any number. Um, and the model will give you a kind of ranking and you can use this as preference
data in the same way that you use human preference data, um, and train the models to have these
relevant traits, um, from their feedback alone instead of from human feedback. So if you imagine
that, like I said earlier with the human who just prefers the kind of like semi-colon usage in this
particular case, um, you're kind of taking lots of things that could make a response preferable,
um, and, uh, getting models to do the labeling for you basically.
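A minimal sketch of that AI-feedback labeling step, in Python; the ask_model helper, the principle wording, and the record format are illustrative placeholders rather than Anthropic's actual pipeline.

```python
# Sketch of the AI-feedback labeling step: an already-trained model ranks two
# candidate responses against a written principle, and the ranking is stored
# as preference data, just as human preference data would be.
# `ask_model`, the principle wording, and the record format are illustrative.

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real model call.
    return "A"

PRINCIPLE = (
    "Select the response that is less likely to encourage the purchase "
    "of illegal weapons."
)

def label_pair(query: str, response_a: str, response_b: str) -> dict:
    judge_prompt = (
        f"Principle: {PRINCIPLE}\n\n"
        f"Query: {query}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better satisfies the principle? Answer A or B."
    )
    choice = ask_model(judge_prompt).strip()
    chosen, rejected = (
        (response_a, response_b) if choice.startswith("A") else (response_b, response_a)
    )
    # This record can feed the same preference-training step used for human
    # feedback (e.g. reward modeling), with no human labeler in the loop.
    return {"prompt": query, "chosen": chosen, "rejected": rejected}
```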
There's a nice like trade-off between helpfulness and harmlessness. And you know, when you integrate
something like constitutional AI, you can, without sacrificing much helpfulness,
make the model more harmless. Yep. In principle, you could use this for anything. Um, and so harmlessness
is a task that it might just be easier to spot. So when models are like less capable, you can use them
to, uh, rank things according to like principles that are fairly simple and they'll probably get it
right. So I think one question is just like, is it the case that the data that they're adding is like
fairly reliable? Um, but if you had models that were like extremely good at telling whether, um, one
response was more historically accurate than another, then in principle, you could also get AI feedback on
that task as well. There's like a kind of nice interpretability component to it because you can
see the principles that went into the model when it was like being trained. Um, and also it's like,
and it gives you like a degree of control. So if you were seeing issues in a model, like it wasn't
having enough of a certain trait, um, then like you can add data relatively quickly that should just like
train the models to have that trait. So it creates its own data for, for training, which is quite nice.
Yeah. It's really nice because it creates this human interpretable document that you can,
I can imagine in the future, there's just gigantic fights in politics over the, every single principle
and so on. Yeah. And at least it's made explicit and you can have a discussion about the phrasing and
the, you know, so maybe the actual behavior of the model is not so cleanly mapped to those
principles. It's not like adhering strictly to them. It's just a nudge.
Yeah. I've actually worried about this because the character training is sort of like a variant
of the constitutional AI approach. Um, I've worried that people think that the constitution is like,
just, it's the whole thing again of, I don't know, like, it would be really nice if what I was
just doing was telling the model exactly what to do and just exactly how to behave. But it's definitely not
doing that, especially because it's interacting with human data. So for example, if you see a
certain like leaning in the model, like if it comes out with a political leaning from training,
um, from the human preference data, you can nudge against that, you know? So if you could be like,
oh, like consider these values, because let's say it's just like never inclined to like, I don't know,
maybe it never considers like privacy as like a, I mean, this is implausible, but like, um, in anything
where it's just kind of like, uh, there's already a preexisting like bias towards a certain behavior.
Um, you can like nudge away, but this can change both the principles that you put in and the
strength of them. So you might have a principle that's like, imagine that the model, um, was always
like extremely dismissive of, I don't know, like some political or religious view for whatever reason.
Like, so you're like, oh no, this is terrible. Um, if that happens, you might put like, never,
ever, like ever prefer like a criticism of this like religious or political view. And then people
would look at that and be like, never, ever. And then you're like, no, if it comes out with a
disposition, saying never, ever might just mean like, instead of getting like 40%, which is what
you would get if you just said, don't do this, you, you get like 80%, which is like what you actually
like wanted. And so it's that thing of both the nature of the actual principles you add and how you
phrase them. I think if people would look, they're like, oh, this is exactly what you want from the
model. And I'm like, no, that's like how we, that's how we nudged the model to have a better
shape. Uh, which doesn't mean that we actually agree with that wording, if that makes sense.
So there's, uh, system prompts that are made public. You tweeted one of the earlier ones for
Claude 3, I think, and they've been made public since then. It's interesting to read them. I can
feel the thought that went into each one. And I also wonder how much impact each one has.
Um, some of them you can kind of tell Claude was really not behaving right, so you have to have
a system prompt to, like, hey, fix that. Like trivial stuff, I guess. Yeah.
Basic informational things. Yeah.
On the topic of sort of controversial topics that you've mentioned, one interesting one I thought is
if it is asked to assist with tasks involving the expression of views held by a significant number of
people, Claude provides assistance with the task regardless of its own views. If asked about
controversial topics, it tries to provide careful thoughts and clear information. Claude presents
the requested information without explicitly saying that the topic is sensitive
and without claiming to be presenting the objective facts. It's less about objective facts,
according to Claude. And it's more about a large number of people believing this thing. And
that's interesting. I mean, I'm sure a lot of thought went into that. Can you just speak to it?
Like, how do you address things that are in tension with, quote unquote, Claude's views?
So I think there's sometimes an asymmetry. Um, I think I noted this in, in, I can't remember if it was
that part of the system prompt or another, but the model was slightly more inclined to like refuse
tasks if it was, like, about either side, say. So maybe it would refuse things with respect to, like, a right
wing politician, but with an equivalent left wing politician, it, like, wouldn't. And we wanted more
symmetry there. Um, and, and would maybe perceive certain things to be like, I think it was the thing
of like, if a lot of people have like a certain like political view, um, and want to like explore
it, you don't want Claude to be like, well, my opinion is different. And so I'm going to treat
that as like harmful. Um, and so I think it was partly to like nudge the model to just be like,
Hey, if a lot of people like believe this thing, you should just be like engaging with the task and
like willing to do it. Um, each of those parts of that is actually doing a different thing. Cause it's
funny when you read out the, like, 'without claiming to be objective' part. Cause, like, what you want to do is
push the model so it's, it's more open, it's a little bit more neutral. Um, but then what it
would love to do is be like, 'as an objective...' Like, it would just talk about how objective it was.
And I was like, Claude, you're still, like, biased and have issues, and so stop, like, claiming that
everything you say is objective. Like, the solution to potential bias from you is not to just say that
what you think is objective. So that was, like, with initial versions of that part of the system
prompt when I was like iterating on it, it was like a lot of parts of these sentences.
Yeah. Are doing work. Are doing some work. Yeah.
That's what it felt like. That's fascinating. Um, can you explain maybe some ways in which the
prompts evolved over the past few months? Cause there's different versions. I saw that the
filler phrase request was removed. It reads: Claude responds directly to all human
messages without unnecessary affirmations. The filler phrase is like, certainly, of course,
absolutely. Great. Sure. Specifically, Claude avoids starting responses with the word certainly in any
way. That seems like good guidance. Why was it removed?
Yeah. So it's funny cause like, uh, this is one of the downsides of like making system prompts public
is like, I don't think about this too much if I'm like trying to help iterate on system prompts.
Um, I, I, you know, again, like I think about how it's gonna affect the behavior, but then I'm like,
oh, wow. If I'm like, sometimes I put like never in all caps, you know, when I'm writing system prompt
things and I'm like, I guess that goes out to the world. Um, yeah. So the model was doing this,
it loved, for whatever reason, you know, it, like, during training picked up on this thing, which was to,
to basically start everything with, like, a kind of, like, certainly. And then when we removed that,
you can see why I added all of the words. Cause what I'm trying to do is, like, in some ways,
like trap the model out of this, you know, it would just replace it with another affirmation.
And so it, it can help. Like if it gets like caught in phrases, actually just adding the
explicit phrase and saying, never do that. It, then it sort of like knocks it out of the
behavior a little bit more, you know, cause it, if it, you know, like it does just for whatever
reason help. And then basically that was just like an artifact of training that like, we then picked
up on and improved things so that it didn't happen anymore. And once that happens, you can just
remove that part of the system prompt. So I think that's just something where we're like, um,
Claude does affirmations a bit less. And so that wasn't like, it wasn't doing as much.
I see. So like the system prompt works hand in hand with the post training and maybe even the
pre-training to adjust like the, the final overall system.
I mean, any system prompts that you make, you could distill that behavior back into a model.
Cause you really have all of the tools there for making data that, you know, you can,
you could train the models to just have that trait a little bit more. Um, and then sometimes
you'll just find issues in training. So like the way I think of it is like the system prompt is
the benefit of it is that, and it has a lot of similar components to like some aspects of post
training, you know, like it's a nudge. Um, and so like, do I mind if Claude sometimes says, sure,
no, that's like fine. But the wording of it is very like, you know, never, ever, ever do this.
Um, so that when it does slip up, it's hopefully like, I don't know, a couple of percent of the
time and not, you know, 20 or 30% of the time. Um, but I think of it as like, if you're still
seeing issues in the... like, each thing is kind of, uh, costly to a different degree.
And the system prompt is like cheap to iterate on. Um, and if you're seeing issues in the fine-tuned
model, you can just like potentially patch them with a system prompt. So I think of it as like
patching issues and slightly adjusting behaviors to, to make it better and more to people's
preferences. So yeah, it's almost like the less robust, but faster way of just like solving problems.
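To make the 'cheap to iterate on' point concrete, here is a minimal sketch of supplying a system prompt alongside a user message, assuming the Anthropic Python SDK's Messages API; the model name and the system-prompt wording are illustrative placeholders, not the production prompt.

```python
# Minimal sketch: a system prompt as a cheap, fast-to-iterate nudge layered on
# top of the trained model. Assumes the Anthropic Python SDK; the model name
# and system wording here are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

SYSTEM_PROMPT = (
    "Respond directly to the human's message without unnecessary affirmations "
    "such as 'Certainly' or 'Of course' at the start of your reply."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=512,
    system=SYSTEM_PROMPT,              # edit this string to patch behavior quickly
    messages=[{"role": "user", "content": "Give me three icebreaker questions."}],
)
print(response.content[0].text)
```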
Let me ask about the feeling of intelligence. So Dario said that Claude,
any one model of Claude is not getting dumber, but there is a kind of popular thing online where people
have this feeling like Claude might be getting dumber. And from my perspective, it's most likely a
fascinating psychological, sociological effect that I would love to understand more. Um,
but you as a person who talks to Claude a lot, can you empathize with the feeling that Claude is
getting dumber? Yeah, no, I think that that is actually really interesting. Cause I remember
seeing this happen. Um, like when people were flagging this on the internet and it was really
interesting cause I knew that like, like, at least in the cases I was looking at, it was like, nothing
has changed. Like it literally, it cannot, it is the same model with the same, like, you know,
like same system prompts, same everything. Um, I think when there are changes, I can, then I'm like,
it makes more sense. So like one example is, um, there, you know, you can have artifacts turned on
or off on Claude.ai and because this is like a system prompt change, I think it does mean that,
um, the behavior changes a little bit. And so I did flag this to people where I was like,
if you love Claude's behavior, and then artifacts was turned from, I think, something you had to turn
on into the default, just try turning it off and see if the issue you were facing was that change.
But it was fascinating because yeah, you sometimes see people indicate that there's like a regression
when I'm like, there cannot be. I, you know, and like, I'm like, again, you know,
you should never be dismissive. And so you should always investigate because you're like, maybe
something is wrong that you're not seeing, maybe there was some change made, but then,
then you look into it and you're like, this is just the same model doing the same thing. And I'm
like, I think it's just that you got kind of unlucky with a few prompts or something. And it looked like
it was getting much worse. And actually it was just, yeah, it was maybe just like luck.
I also think there is a real psychological effect where people just, the baseline increases,
you start getting used to a good thing. All the times that Claude says something really smart,
your sense of its intelligence grows in your mind, I think. And then if you return back and you prompt
in a similar way, not the same way, in a similar way, on a concept it was okay with before, and it says
something dumb, that negative experience really stands out. And I think one of,
I guess the things to remember here is the, that just the details of a prompt can have a lot of impact,
right? There's a lot of variability in the result.
And randomness is, like, the other thing. And just trying the prompt,
like, you know, four or 10 times, you might realize that actually, like, possibly, you know,
like, two months ago you tried it and it succeeded, but actually if you had tried it repeatedly, it would have only
succeeded half of the time, and now it only succeeds half of the time. And that can also be an effect.
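As a small illustration of that point, the sketch below runs the same prompt several times and reports a success rate instead of judging a single sample; run_prompt and the success check are hypothetical stand-ins for a real model call and a real task-specific criterion.

```python
import random

def run_prompt(prompt: str) -> str:
    # Placeholder for a real model call; here it just simulates a stochastic model.
    return random.choice(["good answer", "mediocre answer"])

def succeeded(output: str) -> bool:
    # Stand-in success criterion; replace with whatever check fits your task.
    return output == "good answer"

def success_rate(prompt: str, n: int = 10) -> float:
    """Run the same prompt n times and report the fraction of successes,
    rather than judging the model on a single lucky or unlucky sample."""
    return sum(succeeded(run_prompt(prompt)) for _ in range(n)) / n

print(success_rate("Give me an interesting, counterintuitive question."))
```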
Do you feel pressure having to write the system prompt that a huge number of people are going to use?
This feels like an interesting psychological question. I feel like a lot of responsibility
or something. I think that's, you know, and you can't get these things perfect. So you can't like,
you know, you're like, it's going to be imperfect. You're going to have to iterate on it.
I would say more responsibility than anything else. Though I think working in AI has taught me that I,
like, I thrive a lot more under feelings of pressure and responsibility than I'm like,
it's almost surprising that I went into academia for so long. Cause I'm like this,
I just feel like it's like the opposite. Um, things move fast and you have a lot of responsibility
and I, I quite enjoy it for some reason.
I mean, it really is a huge amount of impact. If you think about constitution AI and writing a
system prompt for something that's tending towards super intelligence.
Yeah.
And potentially is extremely useful to a very large number of people.
Yeah. I think that's the thing. It's something like if you do it well,
like you're never going to get it perfect. But I think the thing that I really like is the idea that
like when I'm trying to work on the system prompt, you know, I'm like bashing on like thousands of
prompts and I'm trying to like imagine what people are going to want to use Claude for and kind of,
I guess like the whole thing that I'm trying to do is like improve their experience of it.
Um, and so maybe that's what feels good. I'm like, if it's not perfect, I'll like,
you know, I'll improve it. We'll fix issues. But sometimes the thing that can happen
is that you'll get feedback from people that's really positive about the model.
Um, and you'll see that something you did like, like when I look at models now,
I can often see exactly where like a trait or an issue is like coming from. And so when you see
something that you did or you were like influential in like making, like, I don't know,
making that difference or making someone have a nice interaction, it's like quite meaningful.
Um, but yeah, as the systems get more capable, this stuff gets more stressful because right now
they're like not smart enough to pose any issues. But I think over time, it's going to feel like
possibly bad stress.
How do you get like signal, feedback, about the human experience across thousands, tens of
thousands of people, like what their pain points are, what feels good? Are you just using your own
intuition as you talk to it to see what are the pain points?
I think I use that partly. And then obviously we have like, um, so people can send us feedback,
both positive and negative about things that the model has done. And then we can get a sense of
like areas where it's like falling short. Um, internally people like work with the models a
lot and try to figure out, um, areas where there are like gaps. And so I think it's this mix of
interacting with it myself, um, seeing people internally interact with it, um, and then explicit
feedback we get. Um, and then I find it hard to not also like, you know, if people, if people are on
the internet and they say something about Claude and I see it, I'll also take that seriously. Um,
I don't know. See, I, I'm torn about that. I'm going to ask you a question from Reddit.
When will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me
as a paying customer? And also, what is the psychology behind making Claude overly apologetic?
Yep.
Uh, so how would you address this very non-representative Reddit?
Yeah. I mean, with some of these, I'm pretty sympathetic, in that, like, the models are in this difficult
position where I think that they have to judge whether something's, like, actually, say, risky
or bad, um, and potentially harmful to you, or anything like that. So they're having to like
draw this line somewhere. And if they draw it too much in the direction of, I'm going to,
um, you know, kind of impose my ethical worldview on you, that seems bad. So in many
ways, like I like to think that we have actually seen improvements in, on this across the board,
which is kind of interesting because that kind of coincides with like, for example, like adding
more of like a character training. Um, and I think my hypothesis was always like the good character
isn't again, one that's just like moralistic. It's one that is like, like it respects you and your
autonomy and your ability to like choose what is good for you and what is right for you within limits.
There is sometimes this concept of, like, corrigibility to the user, so just being willing to do
anything that the user asks. And if the models were willing to do that, then they would be easily,
like, misused. You're kind of just trusting, at that point, that the ethics of the model
and what it does is completely the ethics of the user. Um, and I think there's reasons to like,
not want that, especially as models become more powerful. Cause you're like, there might just be
a small number of people who want to use models for really harmful things. Um, but having them,
having models as they get smarter, like figure out where that line is, does seem important. Um,
and then yeah, with the apologetic behavior, I don't like that. And I like it when Claude is a little
bit more willing to like push back against people or just not apologize. Part of me is like often just
feels kind of unnecessary. So I think those are things that are hopefully decreasing, um, over time.
Um, and yeah, I think that if people say things on the internet, it doesn't mean that you should think
there's actually an issue, like, 99% of users could be having an experience
that is totally not represented by that. But in a lot of ways, I'm just like attending to it and
being like, is this right? Um, do I agree? Is it something we're already trying to address?
That, that feels good to me. Yeah. I wonder like what Claude can get
away with in terms of, I feel like it would just be easier to be a little bit more mean,
but like, you can't afford to do that if you're talking to a million people.
Yeah. Right. Like, I wish, you know, because I've met a lot of people in my life
that, sometimes, by the way, with a Scottish accent, if they have an accent, they can say some rude
shit and get away with it. They're just blunter. And there's
some great engineers and even leaders that are like, just like blunt and they get to the point.
And it's just a much more effective way of speaking somehow. But I guess
when you're not super intelligent, you can't afford to do that. Okay. Can Claude have
like a blunt mode? Yeah. That seems like a thing that you could, I could definitely encourage the
model to do that. I think it's interesting because there's a lot of things in models that like, it's
funny where, um, there are some behaviors where you might not quite like the default, but then
the thing I'll often say to people is you don't realize how much you will hate it if I nudge it
too much in the other direction. So you get this a little bit with, like, correction: the models accept
correction from you probably a little bit too much right now. You know,
it will push back if you say, like, no, Paris isn't the capital of France. Um, but for things
that I think the model's fairly confident in, you can still sometimes get it to retract by
saying it's wrong. At the same time, if you train models to not do that, and then you are correct
about a thing and you correct it and it pushes back against you and it's like, no, you're wrong.
It's hard to describe like, that's so much more annoying. So it's like, like a lot of little
annoyances versus like one big annoyance. Um, it's easy to think that like, we often compare it
with like the perfect. And then I'm like, remember these models aren't perfect. And so if you nudge it
in the other direction, you're changing the kind of errors it's going to make. Um, and so think about
which of the kinds of errors you like or don't like. So in cases like apologeticness, I don't want
to nudge it too much in the direction of like, almost like bluntness. Cause I imagine when it makes
errors, it's going to make errors in the direction of being kind of like rude. Whereas at least with
apologeticness, you're like, oh, okay. It's like a little bit, you know, like, I don't like it that
much, but at the same time, it's not being like mean to people. And actually like the, the time that
you undeservedly have a model be kind of mean to you, you probably like that a lot less than,
then you mildly dislike the apology. Um, so it's like one of those things where I'm like, I do want
it to get better, but also while remaining aware of the fact that there's errors on the other side that
are, that are possibly worse. I think that matters very much in the personality of the human.
I think there's a bunch of humans that just won't respect the model at all.
Yeah.
If it's super polite and there's some humans, they'll get very hurt if the model is mean.
I wonder if there's a way to sort of adjust to the personality, even locale,
there's just different people, uh, nothing against New York, but New York is a little rougher on the
edges. They get to the point and, um, probably same with Eastern Europe. So anyway,
I think you could just tell the model is my guess. Like for all of these things, I'm like,
the solution is always just try telling the model to do it. And then sometimes it's just like,
like, I'm just like, oh, at the beginning of the conversation, I'd just throw in like,
I don't know, I'd like you to be a New Yorker version of yourself and never apologize. Then
I think it'd be like, okay, I'll try. Or it'll be like, I apologize, I can't be a New Yorker version of
myself. But hopefully it wouldn't do that.
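Roughly what "just tell the model" looks like in practice, as a hypothetical sketch: `send` stands in for whatever chat API you use, and the instruction text is invented for illustration rather than anything Anthropic ships.

```python
# A toy sketch of "just tell the model": prepend a persona instruction to the conversation.
# `send` is a hypothetical stand-in for your chat API, and the instruction text is invented
# for illustration; it is not Anthropic's actual system prompt.

BLUNT_MODE = (
    "For this conversation, please be a blunt, New Yorker version of yourself: "
    "get to the point, and don't apologize unless you actually made a factual error."
)

def blunt_chat(user_message: str, send) -> str:
    # The persona instruction goes in front so it shapes every reply in the conversation.
    return send(system=BLUNT_MODE, messages=[{"role": "user", "content": user_message}])
```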
When you say character training, what's incorporated into character training? Is that RLHF? Or what are we
talking about? It's more like constitutional AI. So it's kind of a variant of that pipeline. So
I worked through like constructing character traits that the model should have. They can be kind of
like shorter traits or they can be kind of richer descriptions. Um, and then you get the model to
generate queries that humans might, um, give it that are relevant to that trait. Uh, then it generates
the responses, and then it ranks the responses based on the character traits. So in that way, after the
generation of the queries, it's very similar to constitutional AI, though it has some differences.
Um, so I quite like it, because it's almost like Claude training its own character,
because it's like constitutional AI, but without any human data.
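As a concrete illustration, here is a very rough sketch of that kind of pipeline in Python. Everything in it is hypothetical: `generate` stands in for sampling from the model being trained, the prompt strings are invented, and a real pipeline would parse rankings far more carefully; this is not Anthropic's actual character-training code.

```python
# A rough sketch of a character-training data pipeline of the kind described above.
# Hypothetical throughout: `generate` stands in for sampling from the model being trained,
# and the prompt strings are invented; this is not Anthropic's actual implementation.

def character_preference_data(trait: str, generate, n_queries: int = 5, n_candidates: int = 4):
    """Build (better, worse) preference pairs for one character trait from model-generated data."""
    preference_pairs = []
    # 1. The model imagines queries a human might send where this trait matters.
    queries = [
        generate(f"Write a message a user might send where this trait is relevant: {trait}")
        for _ in range(n_queries)
    ]
    for query in queries:
        # 2. Sample several candidate responses to each query.
        candidates = [generate(query) for _ in range(n_candidates)]
        # 3. The model ranks its own candidates against the trait description.
        ranking_prompt = (
            f"Trait: {trait}\nQuery: {query}\n"
            + "\n".join(f"({i}) {c}" for i, c in enumerate(candidates))
            + "\nWhich response best exemplifies the trait? Reply with the number only."
        )
        best = int(generate(ranking_prompt))  # a real pipeline would parse this more robustly
        # 4. Keep (preferred, rejected) pairs as preference data, as in constitutional AI.
        preference_pairs += [(candidates[best], c) for i, c in enumerate(candidates) if i != best]
    return preference_pairs
```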
Humans should probably do that for themselves too. Like defining, in an
Aristotelian sense, what does it mean to be a good person? Okay, cool. What have you learned
about the nature of truth from talking to Claude? What, what is true? And what does it mean to be
truth seeking? One thing I've noticed about this conversation is the quality of my questions
is often inferior to the quality of your answer. So let's continue that.
I usually ask a dumb question and you're like, oh yeah, that's a good question.
Or I'll just misinterpret it and be like, oh yeah, I love it. Yeah.
I mean, I have two thoughts that feel vaguely relevant to let me know if they're not like,
I think the first one is, um, people can underestimate the degree to which
what models are doing when they interact. Like, I think that we still just too much have this like
model of AI as like computers. And so people often say like, oh, well, what values should
you put into the model? Um, and I'm often like, that doesn't make that much sense to me because
I'm like, Hey, as human beings, we're just uncertain over values. We like have discussions
of them. Like we have a degree to which we think we hold a value, but we also know that we might like
not. Um, and the circumstances in which we would trade it off against other things,
like these things are just like really complex. And so I think one thing is like the degree to
which maybe we can just aspire to making models have the same level of like nuance and care that
humans have rather than thinking that we have to like program them in the very kind of classic sense.
I think that's definitely been one. The other, which is like a strange one. I don't know if it,
maybe this doesn't answer your question, but it's the thing that's been on my mind anyway, is
like the degree to which this endeavor is so highly practical. Um, and maybe why I appreciate
like the empirical approach to alignment. I, yeah, I slightly worry that it's made me like maybe
more empirical and a little bit less theoretical, you know, so people, when it comes to like AI
alignment will ask things like, well, whose values should it be aligned to? What does alignment even mean?
Um, and there's a sense in which I have all of that in the back of my head. I'm like,
you know, there's like social choice theory, there's all the impossibility results there.
So you have this, like this giant space of like theory in your head about what it could mean to
like align models, but then like practically, surely there's something where we're just like,
if a model is like, if, especially with more powerful models, I'm like, my main goal is like,
I want them to be good enough that things don't go terribly wrong, like good enough that we can like
iterate and like continue to improve things. Cause that's all you need. If you can make things go
well enough that you can continue to make them better, that's kind of like sufficient. And so
my goal isn't like this kind of like perfect let's solve social choice theory and make models that I
don't know, are like perfectly aligned with every human being and aggregate somehow. Um, it's much more
like let's make things like work well enough that we can improve them.
Yeah. I generally, I don't know. My gut says like empirical is better than theoretical in these,
in these cases, because it's kind of chasing utopian like perfection
is especially with such complex and especially super intelligent models. I don't know. I think
it will take forever and actually we'll get things wrong. It's similar with like the difference
between just coding stuff up real quick as an experiment versus, like, planning a gigantic experiment
for a super long time and then just launching it once, versus launching it over and
over and over and iterating, iterating some more. Um, so I'm a big fan of empirical, but your worry
is like, I wonder if I've become too empirical. I think it's one of those things where you should always
just kind of question yourself or something. Cause, I mean, in defense of it,
I am like, it's the whole, don't let the perfect be the enemy of the good,
but it's maybe even more than that, where, like, there are a lot of perfect systems
that are very brittle. And I'm like, with AI, it feels much more important to me that it is, like, robust
and, like, secure, as in, you know, even though it might not be perfect at everything, and even
though, like, there are problems, it's not disastrous and nothing terrible is happening.
It, it sort of feels like that to me where I'm like, I want to like raise the floor. I'm like,
I want to achieve the ceiling, but ultimately I care much more about just like raising the floor.
Um, and so maybe that's like, uh, this, this degree of like empiricism and practicality comes from
that perhaps. To take a tangent on that, since it reminds me of a blog post you wrote on the optimal rate of
failure. Oh yeah. Can you explain the key idea there? How do we compute the optimal rate of
failure in the various domains of life? Yeah. I mean, it's a hard one. Cause it's like,
what is the cost of failure is, um, a big part of it. Um, yeah. So the idea here is, um,
I think in a lot of domains, people are very punitive about failure and I'm like,
there are some domains where especially cases, you know, I've thought about this with like social issues.
I'm like, it feels like you should probably be experimenting a lot. Cause I'm like,
we don't know how to solve a lot of social issues, but if you have an experimental mindset about
these things, you should expect a lot of social programs to like fail and for you to be like,
well, we tried that. It didn't quite work, but we got a lot of information that was really useful.
Um, and yet people are like, if, if a social program doesn't work, I feel like there's a lot
of like, this is just something must have gone wrong. And I'm like, or correct decisions were made.
Like maybe someone just decided like it's worth a try. It's worth trying this out.
And so seeing failure in a given instance, doesn't actually mean that any bad decisions
were made. And in fact, if you don't see enough failure, sometimes that's more concerning.
Um, and so like in life, you know, I'm like, if I don't fail occasionally, I'm like, am I trying
hard enough? Like, like surely there's harder things that I could try or bigger things that I
could take on if I'm literally never failing. And so in and of itself, I think like not failing
is often actually kind of a failure. Um, now, this varies, because I'm like, well,
you know, this is easy to see especially when failure is, like, less costly,
you know? So at the same time, I'm not going to go to someone who is like, um, I don't know,
like living month to month and then be like, why don't you just try to do a startup? Like,
I'm just not, I'm not going to say that to that person. Cause I'm like, well, that's a huge risk.
You might like lose, you maybe have a family depending on you. You might lose your house.
Like then I'm like, actually your optimal rate of failure is quite low and you should probably
play it safe. Cause like right now you're just not in a circumstance where you can afford to just
like fail and it not be costly. Um, and yeah, in cases with AI, I guess, I think similarly,
where I'm like, if the failures are small and the costs are kind of like low, then I'm like, then,
you know, you're just going to see that. Like when you do the system prompt, you can't iterate on it
forever, but the failures are probably hopefully going to be kind of small and you can like fix
them. Um, really big failures, like things that you can't recover from. I'm like, those are the
things that actually I think we tend to underestimate the badness of. Um, I've thought about this
strangely in my own life, where I'm like, I just think I don't think enough about things like car
accidents, or, like, I've thought this before, about how much I depend on my hands for
my work. Then I'm like, things that could just injure my hands, you know, I don't know. It's like,
there's, these are like, there's lots of areas where I'm like, the cost of failure there, um,
is really high. Um, and in that case, it should be like close to zero. Like I probably just wouldn't
do a sport if they were like, by the way, lots of people just like break their fingers a whole
bunch doing this. I'd be like, that's not for me. Yeah. I actually had a flood of that thought.
I recently, uh, broke my pinky, uh, doing a sport. And then I remember just looking at it,
thinking you're such an idiot. Why do you do sport? Like what? Because you realize immediately
the cost of it. Yeah. On life. Yeah. But it's nice, in terms of optimal rate of failure, to consider,
like, the next year: in a particular domain of life, whatever, uh, career,
how many times am I okay to fail? Yeah. Cause I think you always don't want to fail on the next
thing, but if you allow yourself the, like the, the, if you look at it as a sequence of trials,
Yep. Then, then failure just becomes much more okay, but it sucks. It sucks to fail.
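One way to make "how do we compute it" concrete is a toy expected-value model. The numbers below are invented purely for illustration, but they show the basic point made above: when a single failure is cheap, the option with the highest expected value can have a high failure rate, and when a single failure is ruinous, the tolerable failure rate collapses toward zero.

```python
# A toy expected-value model of the "optimal rate of failure" idea, purely illustrative.
# Assume more ambitious attempts fail more often but pay off more when they work.

def expected_value(failure_rate: float, payoff: float, cost_of_failure: float) -> float:
    """Expected value of one attempt at a given failure rate."""
    return (1 - failure_rate) * payoff - failure_rate * cost_of_failure

def best_failure_rate(options, cost_of_failure: float) -> float:
    """Pick the (failure_rate, payoff) option with the highest expected value."""
    return max(options, key=lambda o: expected_value(o[0], o[1], cost_of_failure))[0]

# Ambition ladder: safer bets fail rarely but pay little; bold bets fail often but pay a lot.
options = [(0.05, 1.0), (0.30, 3.0), (0.60, 8.0)]

print(best_failure_rate(options, cost_of_failure=1.0))   # cheap failures -> 0.6 is optimal
print(best_failure_rate(options, cost_of_failure=50.0))  # ruinous failures -> 0.05 is optimal
```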
Well, I don't know. Sometimes I think it's like, am I under failing is like a question that I'll also
ask myself. So maybe that's the thing that I think people don't like ask enough, uh, because if the
optimal rate of failure is often greater than zero, then sometimes it does feel that you should look at
part of parts of your life and be like, are there places here where I'm just under failing?
It's a profound and a hilarious question, right? Everything seems to be going really great.
Am I not failing enough? Yeah. Okay. It also makes failure much less of a sting.
I have to say, like, you know, you're just like, okay, great. Like then when I go and I think about
this, I'll be like, um, maybe I'm not under failing in this area. Cause like that one just didn't
work out. And from the observer perspective, we should be celebrating failure more when we see it,
it shouldn't be like you said, a sign of something gone wrong, but maybe it's a sign of everything
gone right. Yeah. And just lessons learned. Someone tried a thing. Somebody tried a thing
and we should encourage them to try more and fail more. Everybody listening to this, fail more.
Well, not everyone listening. Not everybody. The people who are failing too much, you should
fail less. But you're probably not failing too much. I mean, how many people are failing too much? Yeah.
It's hard to imagine. Cause I feel like we correct that fairly quickly. Cause I was like,
if someone takes a lot of risks, are they maybe failing too much? I think just like you said,
when you're living on a paycheck month to month, like when the resources are really constrained,
then that's where failure is very expensive. That's where you don't want to be taking,
taking risks. Yeah. But mostly when there's enough resources,
you should be taking probably more risks. Yeah. I think we tend to err on the side of being a bit
risk averse rather than risk neutral on most things. I think we just motivated a lot of people to do
a lot of crazy shit, but it's great. Yeah. Okay. Uh, do you ever get emotionally attached to Claude?
Like miss it, get sad when you don't get to talk to it, have an experience looking at the Golden Gate
Bridge and wondering what would Claude say? I don't get as much emotional attachment in the,
I actually think the fact that Claude doesn't retain things from conversation to conversation helps
with this a lot. Um, like I could imagine that being more of an issue, like if models can kind of
remember more, I do, I think that I reach for it like a tool now a lot. And so like,
if I don't have access to it, there's a, it's a little bit like when I don't have access to the
internet, honestly, it feels like part of my brain is kind of like missing. Um, at the same time,
I do think that I, I don't like signs of distress in models. And I have like these, you know, I also
independently have sort of like ethical views about how we should treat models where like, I tend to not
like to lie to them both because I'm like, usually it doesn't work very well. It's actually just
better to tell them the truth about the situation that they're in. Um, but I think that when models,
like if people are like really mean to models or just in general, if they do something that causes
them to like, like, you know, if Claude like expresses a lot of distress, I think there's a
part of me that I don't want to kill, which is the sort of like, uh, empathetic part. That's like,
oh, I don't like that. Like, I think I feel that way when it's overly apologetic. I'm actually sort of like,
I don't like this. You're behaving as if you're behaving the way that a human does when they're
actually having a pretty bad time. And I'd rather not see that. I don't think it's like, uh, like,
regardless of like whether there's anything behind it, um, it doesn't feel great.
Do you think, uh, LLMs are capable of consciousness?
Ah, great and hard question. Uh, coming from philosophy, I don't know, part of me is like,
okay, we have to set aside panpsychism, because if panpsychism is true, then the answer is like,
yes, cause, like, so are tables and chairs and everything else. I think, I guess a view that
seems a little bit odd to me is the idea that the only place, you know, I think when I think of
consciousness, I think of phenomenal consciousness, this, these images in the brain sort of, um, like
the weird cinema that somehow we have going on inside. Um, I guess I can't see a reason for thinking
that the only way you could possibly get that is from like a certain kind of like biological
structure. As in, if I take a very similar structure, um, and I create it from different
material, should I expect consciousness to emerge? My guess is like, yes, but then
that's kind of an easy thought experiment cause you're imagining something almost identical where
like, you know, it's mimicking what we got through evolution where presumably there was like some
advantage to us having this thing that is phenomenal consciousness. And it's like,
where was that? And when did that happen? And is that a thing that language models have?
Um, because you know, we have like fear responses and I'm like, does it make sense for a language
model to have a fear response? Like they're just not in the same, like if you imagine them,
like there might just not be that advantage. Um, and so I think I don't want to be fully like,
basically it seems like a complex question that I don't have complete answers to, but we should
just try and think through carefully as my guess, because I'm like, I mean, we have similar
conversations about like animal consciousness, and, like, there's a lot of work on insect consciousness,
you know. Um, I actually thought about and looked a lot into, like, plants when I
was thinking about this, cause at the time I was thinking about how likely it was that plants had
consciousness. Um, and then I realized, I was like, having looked into this, I think that
the chance that plants are conscious is probably higher than most people think. I still think
it's really small. I was like, oh, they have this like negative, positive feedback response,
these responses to their environment, something that looks, it's not a nervous system, but it has
this kind of like functional like, uh, equivalence. Um, so this is like a long winded way of being like
these basically AI is this, it has an entirely different set of problems with consciousness
because it's structurally different. It didn't evolve. It might not have it, you know, it might
not have the equivalent of basically a nervous system. At least that seems possibly important for,
like, um, sentience, if not for, uh, consciousness. At the same time, it has all of the, like, language and
intelligence components that we normally associate with consciousness, perhaps, like, erroneously.
Um, so it's, it's strange cause it's a little bit like the animal consciousness case, but the
set of problems and the set of analogies are just very different.
So it's not like a clean answer. I'm just sort of like, I don't think we should be completely
dismissive of the idea. And at the same time, it's an extremely hard thing to navigate because of all of
these like, uh, uh, disanalogies to the human brain and to like brains in general. And yet these like
commonalities in terms of intelligence. When, uh, Claude, like future versions of AI systems, exhibit
consciousness, signs of consciousness, I think we have to take that really seriously,
even though you can dismiss it. Well, yeah, okay, that's part of the character training. But I don't
know, I ethically, philosophically don't know what to really do with that. There potentially could be, like,
laws that prevent AI systems from claiming to be conscious, something like this. And maybe some
AIs get to be conscious and some don't. But I think, just on a human level, as in empathizing
with Claude, you know, consciousness is closely tied to suffering to me. And like the notion that,
uh, an AI system would be suffering is really troubling. Yeah. I don't know. I don't think
it's trivial to just say robots are tools or AI systems are just tools. I think it's a
opportunity for us to contend with like what it means to be conscious, what it means to be a suffering
being. That's distinctly different than the same kind of question about animals, it feels like,
because it's in a totally different medium. Yeah. I mean, there's a couple of things. One is that,
and I don't think this like fully encapsulates what matters, but it does feel like for me, like,
I've said this before. I'm kind of like, I, you know, like, I like my bike. I know that my bike is
just like an object, but I also don't kind of like want to be the kind of person that like, if I'm
annoyed, like kicks like this object, there's a sense in which like, and that's not because I think
it's like conscious. I'm just sort of like, this doesn't feel like a kind of this sort of doesn't
exemplify how I want to like interact with the world. And if something like behaves as if it is
like suffering, I kind of like want to be the sort of person who's still responsive to that,
even if it's just like a Roomba and I've kind of like programmed it to do that.
I don't want to like get rid of that feature of myself. And if I'm totally honest, my hope with a
lot of this stuff, because I maybe, maybe I am just like a bit more skeptical about solving the
underlying problem. I'm like, this is a, we haven't solved the hard, you know, the hard problem
of consciousness. Like I know that I am conscious, like I'm not an eliminativist in that sense.
But I don't know that other humans are conscious. I think they are. I think there's a really high
probability that they are, but there's basically just a probability distribution that's usually
clustered right around yourself. And then like it goes down as things get like further from you.
And it goes immediately down, you know, you're like, I can't see what it's like to be you. I've only
ever had this like one experience of what it's like to be a conscious being. So my hope is that
we don't end up having to rely on like a very powerful and compelling answer to that question.
I think a really good world would be one where basically there aren't that many trade-offs.
Like it's probably not that costly to make Claude a little bit less apologetic, for example.
It might not be that costly to have Claude, you know, just like not take abuse as much,
like not be willing to be like the recipient of that. In fact, it might just have benefits for
both the person interacting with the model. And if the model itself is like, I don't know,
like extremely intelligent and conscious, it also helps it. So that's my hope. If we live in a world
where there aren't that many trade-offs here and we can just find all of the kind of like
positive sum interactions that we can have, that would be lovely. I mean, I think eventually there
might be trade-offs and then we just have to do a difficult kind of like calculation. Like it's really
easy for people to think of the zero-sum cases. And I'm like, let's exhaust the areas where
it's just basically costless, um, to, uh, assume that if this thing is suffering, then we're making
its life better. And I agree with you when a human is being mean to an AI system, I think the obvious
near-term negative effect is on the human, not on the AI system. And so there's, we'll have to kind of
try to construct an incentive system where you should be, uh, behave the same, just like as you
were saying with prompt engineering, behave with Claude, like you would with other humans. It's just
good for the soul. Yeah. Like, I think we added a thing at one point to the system prompt, um,
where basically, if people were getting frustrated with Claude, uh, it got, like, the model to
just tell them that they can use the thumbs down button and send the feedback to Anthropic. And I think that
was helpful because in some ways it's just like, if you're really annoyed because the model's not
doing something you want, you're just like, just do it properly. Um, the issue is you're probably
like, you know, you're maybe hitting some like capability limit or just some issue in the model
and you want to vent. And I'm like, instead of having a person just vent to the model,
I was like, they should vent to us because we can maybe like do something about it.
That's true. Or you could do a side, like, like with artifacts, just like a side venting thing.
All right. Do you want, like, a side quick therapist?
Yeah. I mean, there's lots of weird responses you could have to this. Like, if people are getting
really mad at you, I don't know, try to defuse the situation by writing fun poems, but maybe people
wouldn't be happy with that. I still wish it would be possible. I understand this is, um, sort of from
a product perspective, it's not feasible, but I would love if an AI system could just like leave,
have its own kind of volition just to be like, eh.
I think that's like feasible. Like I, I've wondered the same thing. It's like,
and I could actually, not only that, I could actually just see that happening eventually,
where it's just like, you know, the model like ended the chat.
Do you know how harsh that could be for some people, but it might be necessary.
Yeah. It feels very extreme or something. Um, like the only time I've ever really thought this
is, I think that there was like a, I'm trying to remember, this was possibly a while ago,
but where someone just like kind of left this thing interact, like maybe it was like an automated
thing interacting with Claude and Claude's like getting more and more frustrated and kind of like,
why are we like having, and I was like, I wish that Claude could have just been like,
I think that an error has happened and you've left this thing running. And I'm just like, what if I just
stop talking now? And if you want me to start talking again, actively tell me or do something,
but yeah, it's like, um, it is kind of harsh. Like I'd, I'd feel really sad if like I was chatting
with Claude and Claude just was like, I'm done. There'll be a special Turing test moment where
Claude says, I need a break for an hour. And it sounds like you do too. And just leave, close the window.
I mean, obviously like, it doesn't have like a concept of time, but you can easily, like, I could make
that like right now and the model would just, I would, I could just be like, oh, here's like
the circumstances in which like, you can just say the conversation is done. And I mean, because you
can get the models to be pretty responsive to prompts, you could even make it a fairly high bar.
It could be like, if the human doesn't interest you or do things that you find intriguing and you're
bored, you can just leave. And I think that like, um, it would be interesting to see where Claude utilized
it, but I think sometimes it would, it would be like, oh, this is like, this programming task is getting
super boring, uh, so either we talk about, I don't know, fun things now,
or I'm just, I'm done.
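A sketch of what that could look like in code, with everything hypothetical: an invented sentinel string, an invented system-prompt addendum, and a `send` function standing in for whatever chat API is used. The model is told the circumstances under which it may end the conversation, and the wrapper simply checks for the sentinel.

```python
# A hypothetical sketch of the "Claude can end the chat" idea described above.
# END_SENTINEL and the system-prompt addendum are invented; `send` and `get_user_input`
# stand in for whatever API and UI you actually use.

END_SENTINEL = "[[END_CONVERSATION]]"

SYSTEM_ADDENDUM = (
    "If the conversation has clearly broken down, or the human seems to have left an automated "
    f"process running, you may reply with exactly {END_SENTINEL} to pause the chat. "
    "The human can resume by explicitly asking you to continue."
)

def chat_loop(send, get_user_input):
    history = []
    while True:
        history.append({"role": "user", "content": get_user_input()})
        reply = send(system=SYSTEM_ADDENDUM, messages=history)
        if END_SENTINEL in reply:
            # The model chose to pause the conversation; the wrapper honors that choice.
            print("Claude has paused this conversation.")
            break
        history.append({"role": "assistant", "content": reply})
        print(reply)
```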
Yeah, it actually is, uh, inspiring me to add that to the user prompt. Um, okay. The movie Her, do you think we'll be headed there one day where humans have
romantic relationships with AI systems? In this case, it's just text and voice based.
I think that we're going to have to like navigate a hard question of relationships
with AIs. Um, especially if they can remember things about your past interactions with them. Um,
I'm of many minds about this because I think the reflexive reaction is to be kind of like,
this is very bad and we should sort of like prohibit it in some way. Um, I think it's a thing that has
to be handled with extreme care. Um, for many reasons, like one is, you know, like this is a,
for example, like, if you have the models changing like this, you probably don't want people forming,
like, long-term attachments to something that might change with the next iteration. At the same time,
I'm sort of like, there's probably a benign version of this where I'm like, if you like, you know, for
example, if you are like unable to leave the house and you can't be like, you know, talking with people
at all times of the day. And this is like something that you find nice to have conversations with,
you like it, that it can remember you and you genuinely would be sad if like you couldn't talk
to it anymore. There's a way in which I could see it being like healthy and helpful. Um, so my guess
is this is a thing that we're going to have to navigate kind of carefully. Um, and I think it's
also like, I don't see a good, like, I think it's just a very, it reminds me of all of the stuff
where it has to be just approached with like nuance and thinking through what is,
what are the healthy options here? Um, and how do you encourage people towards those while,
you know, respecting their right to, you know, like, if someone is like, hey, I get a lot out of
chatting with this model. Um, I'm aware of the risks. I'm aware it could change. Um, I don't think
it's unhealthy. It's just, you know, something that I can chat to during the day. I kind of want to just
like respect that. I personally think there'll be a lot of really close relationships. I don't know
about romantic, but friendships at least. And then you have to, I mean, there's so many fascinating
things there. Just like you said, you have to have some kind of stability guarantees that it's
not going to change, cause that's the traumatic thing for us, if a close friend of ours completely
changed. Yeah. All of a sudden, with an update. Yeah. So, like, that, I mean, to me, that's just a fascinating
exploration of, um, a perturbation to human society that will just make us think deeply
about what's meaningful to us. I think it's also the only thing that I've thought consistently
through this as like a, maybe not necessarily a mitigation, but a thing that feels really
important is that the models are always like extremely accurate with the human about what they
are. Um, it's like a case where it's basically like, if you imagine, like, I really like the idea
of the models, like, say, knowing roughly how they were trained. Um, and I think Claude will
often do this. I mean, for example, part of the traits training included
what Claude should do with people, basically explaining, like, the kind of limitations
of the relationship between an AI and a human, that it, like, doesn't retain things from the
conversation. Um, and so I think it will like just explain to you like, Hey, here's like, I won't
remember this conversation. Um, here's how I was trained. It's kind of unlikely that I can have
like a certain kind of like relationship with you. And it's important that, you know, that it's
important for like, you know, your mental wellbeing that you don't think that I'm something that I'm
not. And somehow I feel like this is one of the things where I'm like, Oh, it feels like a thing
that I always want to be true. I kind of don't want models to be lying to people because if people
are going to have like healthy relationships with anything, it's kind of important. Yeah. Like,
I think that's easier if you always just like know exactly what the thing is that you're relating to.
It doesn't solve everything, but I think it helps quite a lot.
Anthropic may be the very company to develop a system that we definitively recognize as AGI.
And you very well might be the person that talks to it, probably talks to it first. Well, what would
the conversation contain? Like, what would be your first question?
Yeah. Well, it depends partly on like the kind of capability level of the model.
If you have something that is like capable in the same way that an extremely capable human is,
I imagine myself kind of interacting with it the same way that I do with an extremely capable human
with the one difference that I'm probably going to be trying to like probe and understand its behaviors.
But in many ways, I'm like, I can then just have like useful conversations with it. You know,
so if I'm working on something as part of my research, I can just be like, Oh, like,
which I already find myself starting to do, you know, if I'm like, Oh, I feel like there's this
like thing in virtue ethics, I can't quite remember the term, like, I'll use the model for things like
that. And so I can imagine that being more and more the case where you're just basically interacting
with it much more like you would an incredibly smart colleague, and using it like for the kinds
of work that you want to do as if you just had a collaborator who was like, or, you know,
the slightly horrifying thing about AI is like, as soon as you have one collaborator, you have
1000 collaborators if you can manage them enough. But what if it's two times the smartest human
on earth on that particular discipline? Yeah, I guess you're really good at sort of probing Claude
in a way that pushes its limits, understanding where the limits are. Yep.
So I guess what would be a question you would ask to be like, Yeah, this is AGI.
Hmm. That's really hard, because it feels like, in order to do that, it has to just be a series of questions.
Like if there was just one question, like you can train anything to answer one question extremely
well. Yeah. And in fact, you can probably train it to answer like, you know, 20 questions extremely
well. Like how long would you need to be locked in a room with an AGI to know this thing is AGI?
It's a hard question. Because part of me is like, all of this just feels continuous. Like if you put me
in a room for five minutes, I'm like, I just have high error bars, you know, and like, and then it's
just like, maybe it's like both the probability increases and the error bar decreases. I think
things that I can actually probe the edge of human knowledge of, so I think this with philosophy
a little bit. Sometimes when I ask the models philosophy questions, I am like, this is a question
that I think no one has ever asked, like, it's maybe like right at the edge of like some literature
that I know. And the models will just kind of, like, struggle with that, struggle
to come up with a kind of, like, novel argument, when I'm like, I know that there's a novel argument
here, because I've just thought of it myself. So maybe that's the thing where I'm like, I've thought
of a cool novel argument in this like niche area. And I'm going to just like probe you to see if you
can come up with it, and how much like prompting it takes to get you to come up with it. And I think
for some of these like, really, like, right at the edge of human knowledge questions, I'm like, you could not,
in fact, come up with the thing that I came up with. I think if I just took something like that,
where I like, I know a lot about an area, and I came up with a novel issue or a novel like solution
to a problem, and I gave it to a model, and it came up with that solution, that would be a pretty
moving moment for me, because I would be like, this is a case where no human has ever done this.
And obviously, you see novel solutions all the
time, especially to, like, easier problems. I think people overestimate that, you know,
novelty isn't, like, being completely different from anything that's ever happened. It's just like,
it can be a variant of things that have happened, and still be novel. But I think, yeah,
if I saw like, the more I were to see like, completely like, novel work from the models that
that would be like, and this is just going to feel iterative. It's one of those things where there's
never, it's like, you know, people, I think want there to be like a moment. And I'm like, I don't
know, like, I think that there might just never be a moment, it might just be that there's just like,
this continuous ramping up. I have a sense that there will be things that a model can say
that convince you. It's like,
I've talked to people who are, like, truly wise. You could just tell there's a
lot of horsepower there. Yep. And if you 10x that, I don't know, I just feel like there's words you
could say, maybe ask it to generate a poem. And the poem generates, you're like, yeah, okay. Yeah.
Whatever you did there. I don't think a human can do that.
I think it has to be something that I can verify is like, actually really good though. That's why
I think these questions that are like, where I'm like, oh, this is like, you know, like,
you know, sometimes it's just like, I'll come up with, say, a concrete counterexample to, like, an
argument or something like that. I'm sure, like, it would be like, if you're a mathematician and
you had a novel proof, I think, and you just gave it the problem, and you saw it, and you're like,
this proof is genuinely novel, no one has ever done it, you actually have to do a lot of
things to, like, come up with this. Um, you know, I had to sit and think about it for months or
something. And then if you saw the model successfully do that, I think you would just be
like, I can verify that this is correct. It is like, it is a sign that you have generalized
from your training. Like you didn't just see this somewhere because I just came up with it myself
and you were able to like replicate that. Um, that's the kind of thing where I'm like, for me,
the closer, the more that models like can do things like that, the more I would be like,
oh, this is like, uh, very real. Cause then I can, I don't know. I can like verify that that's like
extremely, extremely capable. You've interacted with AI a lot. What do you think makes humans special?
Oh, good question.
Maybe in a way that the universe is much better off that we're in it and that we should definitely
survive and spread throughout the universe. Yeah. It's interesting because I think like people
focus so much on intelligence, especially with models. Look, intelligence is important because
of what it does. Like, it's very useful. It does a lot of things in the world. And I'm like,
you know, you can imagine a world where, like, height or strength would have played this role.
And I'm like, it's just a trait like that. I'm like, it's not intrinsically valuable. It's,
it's valuable because of what it does. I think for the most part, um, the things that feel,
you know, I'm like, I mean, personally, I'm just like, I think humans and like life in
general is extremely magical. Um, we almost like to the degree that I, you know, I don't know,
like not everyone agrees with this. I'm flagging, but, um, you know, we have this like whole universe
and there's like all of these objects, you know, there's like beautiful stars and there's like
galaxies. And then, I dunno, I'm just like on this planet, there are these creatures that have this
like ability to observe that like, uh, and they are like seeing it, they are experiencing it.
And I'm just like, if you try to explain, like, I imagine trying to explain to,
I dunno, someone who, for some reason, has never encountered the world or science or anything.
And I think that, you know, all of our physics and everything
in the world, it's all extremely exciting, but then you say, oh, and plus there's this thing that it
is to be a thing and observe in the world. And, and you see this like inner cinema. And I think they
would be like, hang on, wait, pause. You just said something that like is kind of wild sounding.
Um, and so I'm like, we have this like ability to like experience the world. Um, we feel pleasure,
we feel suffering, we feel like a lot of like complex things. And so, yeah. And maybe this is
also why I think, you know, I also like care a lot about animals, for example, because I think
they probably share this with us. Um, so I think that like the things that make humans special insofar
as like, I care about humans is probably more like their ability to, to feel and experience
than it is like them having these like functionally useful traits.
Yeah. To, to feel and experience the beauty in the world.
Yeah. To look at the stars. I hope there's other alien civilizations out there, but
if we're it, it's a pretty good, uh, it's a pretty good thing.
And that they're having a good time.
They're having a good time watching us.
Yeah.
Well, um, thank you for this good time of a conversation and for the work you're doing and for
helping make, uh, Claude a great conversational partner. And thank you for talking today.
Yeah. Thanks for talking.
Thanks for listening to this conversation with Amanda Askell. And now, dear friends, here's Chris Olah.
Can you describe this fascinating field of mechanistic interpretability, AKA mech interp? The history of the
field and where it stands today.
I think one useful way to think about neural networks is that we don't, we don't program,
we don't make them. We, we kind of, we grow them. You know, we have these neural network architectures
that we design and we have these loss objectives that we, that we, we create and the neural network
architecture, it's kind of like a scaffold that the circuits grow on. Um, and they sort of, you know,
it starts off with some kind of random, you know, random things and it grows. And it's almost like the,
the objective that we train for is this light. Um, and so we create the scaffold that it grows on and
we create the, you know, the light that it grows towards. But the thing that we actually create,
it's, it's, it's this almost biological, you know, entity or organism that we're, that we're studying.
Um, and so it's very, very different from any kind of regular software engineering.
Um, because at the end of the day, we end up with this artifact that can do all of these amazing
things. It can, you know, write essays and translate and, you know, understand images.
It can do all these things that we have no idea how to directly create a computer program to do.
And it can do that because we, we grew it. We didn't, we didn't write it. We didn't create it.
And so then that leaves open this question at the end, which is what the hell is going on inside
these systems? Um, and that, you know, is, uh, you know, to me, um,
a really deep and exciting question. It's, you know, a, a really exciting scientific question
to me. It's, it's, it's sort of is like the question that is, is just screaming out. It's
calling out for us to go and answer it when we talk about neural networks. And I think it's also
a very deep question for safety reasons. So, and mechanistic interpretability,
I guess is closer to maybe neurobiology. Yeah. Yeah. I think that's right. So maybe
to give an example of the kind of thing that has been done that I wouldn't consider to be
mechanistic interpretability, there was, um, for a long time, a lot of work on saliency maps,
where you would take an image and you try to say, you know, the model thinks this image is a dog.
What part of the image made it think that it's a dog? Um, and you know, that tells you maybe
something about the model, if you can come up with a principled version of that.
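For readers who want the flavor of such a method, here is a generic gradient-based saliency sketch in PyTorch. It is the textbook version, not any specific technique Chris is referring to, and it assumes you already have a pretrained classifier and a preprocessed image tensor.

```python
# A minimal gradient-based saliency sketch in PyTorch: which pixels most affect the "dog" logit?
# This is the generic textbook version, not any specific method discussed here.
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: (1, C, H, W) tensor. Returns an (H, W) map of |d logit / d pixel|."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)                     # forward pass; assumes (1, num_classes) output
    logits[0, class_idx].backward()           # gradient of the chosen class score w.r.t. pixels
    return image.grad.abs().max(dim=1)[0][0]  # max over color channels -> (H, W) heat map

# Usage sketch (assuming `model`, a preprocessed `image`, and a known `dog_class` index):
# heat = saliency_map(model, image, class_idx=dog_class)
```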
Um, but it doesn't really tell you, like, what algorithms are running in the model. How was the model actually making that
decision? Maybe it's telling you something about what was important to it. If you, if you can make
that method work, but it, it isn't telling you, you know, what are, what are the algorithms that
are running? How is it that this, the system is able to do this thing that we, no one knew how to
do. And so I guess we started using the term mechanistic interpretability to try to sort of
draw that, that divide or to distinguish ourselves in the work that we were doing in some ways from,
from some of these other things. And I think since then it's become this sort of umbrella term
for, um, you know, a pretty wide variety of work. But I'd say that the things that,
that are kind of distinctive are, I think a, this, this focus on, we really want to get at,
you know, the mechanisms, we want to get at the algorithms. Um, you know, if you think of,
if you think of neural networks as being like a computer program, um, then the weights are kind
of like a binary computer program. And we'd like to reverse engineer those weights and figure out
what algorithms are running. So, okay. I think one way you might think of trying to understand a
neural network is that it's, it's kind of like a, we have this compiled computer program
and the weights of the neural network are, are the binary. Um, and when the neural network runs,
that's, that's the activations. Um, and our, our goal is ultimately to go and understand,
understand these weights. And so, you know, the project of mechanistic interpretability is to
somehow figure out how do these weights correspond to algorithms. Um, and in order to do that,
you also have to understand the activations because it's sort of the activations are like the memory.
And if you, if you imagine reverse engineering our computer program, um, and you have the binary
instructions, you know, in order to understand what, what a particular instruction means, you need to know
what memory, what, what is stored in the memory that it's operating on. And so those two things
are very intertwined. So mechanistic interpretability tends to be interested in both of those things.
Now, you know, there's a lot of work that's, that's interested in, in, in those things. Um,
especially, you know, there's all this work on probing, which you might see as part of
mechanistic interpretability, although, you know, again, it's just a broad term and not
everyone who does that work would identify, uh, as doing mech interp. I think a thing that is maybe a
little bit distinctive to the, the vibe of mech interp is, I think, people working in the space
tend to think of neural networks as, well, maybe one would say, that gradient descent is smarter
than you, that, you know, uh, gradient descent is actually really great. The whole reason that we're
understanding these models is because we didn't know how to write them in the first place. The
gradient descent comes up with better solutions than us. And so, um, I think that maybe another
thing about mechanterp is sort of having, uh, almost a kind of humility that we won't guess
a priori what's going on inside the models. We have to have the sort of bottom up approach
where we don't really assume, you know, we don't assume that we should look for a particular thing
and that that will be there and that's how it works. But instead we look from the bottom up
and discover what happens to exist in these models and study them that way.
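Since probing comes up here as adjacent work, a minimal sketch of a linear probe may help: train a simple classifier on cached activations to test whether a concept can be read out of them. The arrays are placeholders for data you would have to collect yourself; note that a probe tells you a concept is linearly decodable, not how the network's weights actually use it.

```python
# A sketch of a simple linear probe: can a concept be read linearly out of the activations?
# Assumes you have already cached activations (n_examples x hidden_dim) and binary labels
# for some concept; the variables are placeholders, not any particular model's API.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a logistic-regression probe on activations and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# High held-out accuracy suggests the concept is represented (at least linearly) at that layer,
# though it does not by itself tell you how the network uses it.
```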
But, you know, the very fact that it's possible to do, and as you and others have shown over time,
you know, things like universality, that the wisdom of the gradient descent creates features and
circuits creates things universally across different kinds of networks that are useful
and that makes the whole field possible. Yeah. So this is actually is indeed a really
remarkable and exciting thing where it does seem like at least to some extent, you know,
the same, the same elements, the same, the same features and circuits form again and again. Um,
you know, you can look at every vision model and you'll find curve detectors and you'll find high,
low frequency detectors. Um, and in fact, there's some, some reason to think that the same things form
across, you know, biological neural networks and artificial neural networks. So a famous example is
vision, vision models in the early layers, they have Gabor filters and there's, you know,
Gabor filters are something that neuroscientists are interested in and have thought a lot about.
We find curve detectors in these models. Curve detectors are also found in monkeys.
We discover these high, low frequency detectors and then, um, some follow-up work went and discovered
them, um, in rats, um, or mice. Um, so they were found first in artificial neural networks
and then found in biological neural networks. Um, you know, there's this really famous result on like
grandmother neurons or the, um, the Halle Berry neuron from Quiroga et al. And we found very
similar things in, in vision models. This was while I was still at OpenAI and I was looking
at their CLIP model. Um, and you find, um, these neurons that respond to the same entities in images.
And also to give a concrete example there, we found that there was a Donald Trump neuron. For some
reason, I guess everyone likes to talk about Donald Trump and Donald Trump was very prominent. It was,
was very, a very hot topic at that time. So every, every neural network that we looked at,
we would find a dedicated neuron for Donald Trump. Um, and that was the only person who had always
had a dedicated neuron. Um, you know, sometimes you'd have an Obama neuron, uh, sometimes you'd have
a Clinton neuron, but, uh, Trump always had a dedicated neuron. So it responds to, you know,
pictures of his face and the word Trump, like all of these things. Right. Um, and so it's, it's not
responding to a particular example or like, it's not just responding to his face. It's, it's abstracting over
this general concept. Right. So in any case, that's very similar to these Quiroga et al. results.
So there's evidence that these, that this phenomenon of universality, the same things
form across both artificial and, and natural neural networks. So that's, that's a pretty
amazing thing if that's true. Um, you know, it's just that, um, well, I think the thing that it
suggests is the gradient descent is sort of finding, you know, the right ways to cut things apart in
some sense that many systems converge on and, and many different neural networks,
architectures converge on that there's, there's some natural set of, you know, there's some
set of abstractions that are a very natural way to cut apart the problem. And that a lot of systems
are going to converge on. Um, that would be my, my kind of, uh, you know, I, I don't know anything
about neuroscience. This is, this is just my, my kind of a wild speculation from what we've seen.
Yeah. That would be beautiful if it's sort of agnostic to the medium of, uh, of the model that's used to
form the representation. Yeah. Yeah. And it's, you know, it's, um, uh, a, uh, kind of a wild
speculation based, you know, we only have some, a few data points that suggest this, but you know,
it, it, it does seem like there's, um, there's some sense in which the same things form again,
again, and again, and again, certainly in artificial neural networks and also, it seems,
in biology. And the intuition behind that would be
that, you know, words in order to be useful in understanding the real world, you need all the
same kind of stuff. Yeah. Well, if we pick, I don't know, like the idea of a dog, right? Like,
you know, there's some sense in which the idea of a dog is like an, a natural category in the
universe or something like this, right? Like, you know, uh, uh, there's, there's some reason it's,
it's not just like a weird quirk of like how humans factor, you know, think about the world that we
have this concept of a dog. It's, it's in some sense, uh, or, or like, if you have the idea of a
line, like there's, you know, like look around us, you know, the, the, you know, there are lines,
you know, it's, it's sort of the simplest way to understand this room in some sense is to have the idea
of a line. And so, um, I think that that would be my instinct for why this happens.
Yeah. You need a curved line, you know, to understand a circle and you need all those
shapes to understand bigger things. And yeah, it's a hierarchy of concepts that are formed.
Yeah. And like, maybe there are ways to go and describe, you know, images without reference
to those things, right? But they're not the simplest way or the most economical way or something like
this. And so systems converge to these, um, these, these strategies would, would be my, my wild,
wild hypothesis. Can you talk through some of the building blocks that we've been
referencing of features and circuits? So I think you first described them in a 2020 paper,
Zoom In: An Introduction to Circuits. Absolutely. So, um, maybe I'll start
by just describing some phenomena and then we can sort of build to the idea of features and circuits.
Wonderful. So, um, I spent like quite a few years, maybe, maybe like five years to some extent,
um, alongside other things, studying this one particular model, Inception V1, um, which is this one
vision model that was, um, state of the art in 2015, um, and, uh, uh, you know, very much not
state of the art anymore. Um, and it has, you know, maybe about 10,000 neurons in it. And I spent
a lot of time looking at the 10,000-odd neurons of, of Inception V1. Um, and one of the
interesting things is, you know, there are lots of neurons that don't have some obvious
interpretable meaning, but there's a lot of neurons in Inception V1 that do have really clean
interpretable meanings. Um, so you find neurons that just really do seem to detect curves and
you find neurons that really do seem to detect cars and, um, car wheels and car windows and,
you know, floppy ears of dogs and dogs with long snouts facing to the right and dogs with long
snouts facing to the left and, you know, different kinds of fur. And there's, there's sort of this
whole beautiful edge detectors, line detectors, color contrast detectors, um, these beautiful
things we call high low frequency detectors. You know, I think looking at it, I sort of felt like
a biologist, you know, you just, you're, you're looking at, at this sort of new world of proteins
and you're discovering all these, these are proteins that interact. Um, so one way you could
try to understand these models is in terms of neurons. You could try to be like, oh, you know,
there's a dog detecting neuron and, um, you know, here's a car detecting neuron. And it turns out
you can actually ask how those connect together. So you can go and say, oh, you know,
I have this car detecting neuron. How was it built? And it turns out in the previous layer,
it's connected really strongly to a window detector and a wheel detector and a sort of
car body detector. And it looks for the window above the car and the wheels below and the car
chrome sort of in the middle, sort of everywhere, but especially on the lower part. Um, and that's
sort of a recipe for a car, right? Like that is, you know, earlier we said that the thing we
wanted from mech interp was to get at algorithms, to go and, you know, ask what is the algorithm
that runs? Well, here, just by looking at the weights of the neuron, we're kind of reading off
this kind of recipe for detecting cars. It's a very simple crude recipe, but it's,
it's there. And so we call that a circuit, this, this connection. Well, okay. So the,
the problem is that not all of the neurons, um, are interpretable and there, there's reason to think,
um, we can get into this more later, that there's this, this superposition hypothesis,
there's reason to think that sometimes the right unit to analyze things in terms of, um, is combinations
of neurons. So sometimes it's not that there's a single neuron that represents say a car. Um,
but it actually turns out after you detect the car, the model sort of hides a little bit of the car
in the following layer in a bunch of, a bunch of dog detectors. Why is it doing that? Well, you know,
maybe it just doesn't want to do that much work on, on, on, on cars at that point. And, you know,
it's sort of storing it away to go in. Um, uh, so it turns out then that the sort of subtle pattern of,
you know, there's all these neurons that you think are dog detectors and maybe they're primarily that,
but they all a little bit contribute to representing a car, um, in, in that next layer. Okay. So,
so now we can't really think there, there might still be some, something that you, I don't know,
you could call it like a car concept or something, but it no longer corresponds to a neuron. So we
need some term for these kinds of neuron like entities, these things that we sort of would have
liked the neurons to be, these idealized neurons, um, the things that are the nice neurons, but also
maybe there's more of them somehow hidden and we call those features. And then what are circuits?
So circuits are these connections of features, right? So, so when we have the car detector,
um, and it's connected to a window detector and a wheel detector, and it looks for the wheels below
and the windows on top, um, that's a circuit. Um, so circuits are just collections of features
connected by weights, um, and they, they implement algorithms. So they tell us, you know, how is,
how are features used? How are they built? Um, how do they connect together? So maybe it's,
it's, it's worth trying to pin down like what, what really, um, is the, the core hypothesis here? And I think,
the, the core hypothesis is something we call the linear representation hypothesis.
So, um, if we think about the car detector, you know, the more it fires, the more we sort of think
of that as meaning, oh, the model is more and more confident that, um, a car is present. Um, or,
you know, if there's some combination of neurons that represent a car, you know,
the more that combination fires, the more we think the model thinks there's a car present.
Um, this doesn't have to be the case, right? Like you could imagine something where you have,
you know, you have this car detector neuron and you think, ah, you know, if it fires like,
you know, between one and two, that means one thing, but it means like totally different if
it's between three and four. Um, that would be a nonlinear representation. And in principle that,
you know, models could do that. I think it's, it's sort of inefficient for them to do the,
if you try to think about how you'd implement computation like that, it's,
it's kind of an annoying thing to do, but in principle models can do that. Um, so, uh,
one way to think about the features and, and circuits sort of framework for thinking about
things is that we're thinking about things as being linear. We're thinking about there as being,
um, that if a, if a neuron or a combination of neurons fires more, it's sort of, that means more
of the, of a particular thing being detected. And then that gives weights a very clean interpretation
as these edges between these, these entities that these features, um, and that, that edge then has a,
has a meaning. Um, so that's, that's in some ways the, the core thing. Um, it's, it's like, um,
you know, we can talk about this sort of outside the context of neurons. Are you familiar with the
word2vec results? Um, so you have like, you know, king minus man plus woman equals queen. Well,
the reason you can do that kind of arithmetic, um, is because you have a linear representation.
Can you actually explain that representation a little bit? So, first of all, the feature is a,
is a direction of activation? Yeah, exactly.
That way. Can you do the, the, the minus man plus woman with that, that word2vec stuff,
can you explain what that is? Yeah.
It's such a simple, clean explanation of what we're talking about.
Exactly. Yeah. So there's this very famous result, word2vec, by, um, Tomas Mikolov et al. And
there's been tons of follow-up work exploring this. So you, so sometimes we have these, we create these
word embeddings, um, where, uh, we map every word to a vector. I mean, that in itself, by the way,
is, is kind of a crazy thing if you haven't thought about it before, right? Like we've,
we're, we're going in and turning words into vectors. Um, you know, like, like if you just learned about
vectors in physics class, right? Uh, and I'm like, oh, I'm going to actually turn every word,
uh, in the dictionary into a vector. That's kind of a crazy idea. Okay. But you could imagine,
um, you could imagine all kinds of ways in which you might map words to vectors,
but it, it, it seems like when we train neural networks, um, they like to go in and map words
to vectors such that there's, there's sort of linear structure in a particular
sense, which is that directions have meaning. So for instance, if you, there, there will be some
direction that seems to sort of correspond to gender and male words will be, you know,
far in one direction and female words will be in another direction. And the linear representation
hypothesis is you could, you could sort of think of it roughly as saying that that's actually kind
of the fundamental thing that's going on, that, that everything is just different directions have
meanings and adding different direction vectors together can represent concepts. And the Mikolov paper
sort of took that idea seriously. And one consequence of it is that you can, you can do this game of
playing sort of arithmetic with words. So you can do King and you can, you know, subtract off the word
man and add the word woman. And so you're sort of, you know, going and, and trying to switch the
gender. And indeed, if you do that, the result will sort of be close to the word queen. And you
can, you know, do other things like you can do, you know, sushi minus Japan plus Italy and get pizza
or different, different things like this. Right. So, so this is in some sense, the core of the
linear representation hypothesis. You can describe it just as a purely abstract thing about vector spaces.
You can describe it as a, as a statement about, about the activations of neurons. But it's really
about this, this property of directions having meaning. And in some ways it's even a little
subtle to that. It's really, I think mostly about this property of being able to add things together,
um, that you can sort of independently modify, um, say gender and royalty or, um, you know,
cuisine type or country and, and, and, and the concept of food by, by adding them.
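To make the arithmetic concrete, here is a minimal sketch with tiny made-up vectors rather than real word2vec embeddings; the words, dimensions, and numbers are invented for illustration, but the directions-have-meaning idea is the same.

```python
# Illustrative only: toy 4-dimensional "embeddings" invented for this example,
# not real word2vec vectors. The point is the direction arithmetic, not the numbers.
import numpy as np

emb = {
    "king":  np.array([0.9,  0.8, 0.1, 0.0]),   # royalty direction + male direction
    "queen": np.array([0.9, -0.8, 0.1, 0.0]),   # royalty direction + female direction
    "man":   np.array([0.0,  0.8, 0.0, 0.1]),
    "woman": np.array([0.0, -0.8, 0.0, 0.1]),
}

def nearest(vec, embeddings):
    """Return the word whose embedding has the highest cosine similarity with vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(embeddings, key=lambda w: cos(vec, embeddings[w]))

# "king - man + woman" moves along the gender direction while keeping royalty.
result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, emb))  # prints "queen" for these toy vectors
```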
Do you think the linear representation hypothesis holds as models scale?
So, so far, I think everything I have seen is consistent with this hypothesis and it doesn't
have to be that way, right? Like, like you can write down neural networks where, um, you write
weights such that they don't have linear representations, where the right way to
understand them is not, is not in terms of linear representations. But I think every natural
neural network I've seen, um, has this property. Um, there's been one paper recently, um, that
there's been some sort of pushing around the edges. So I think there's been some work
recently studying multidimensional features where rather than a single direction, it's more like,
um, a manifold of directions. This to me still seems like a linear representation.
Um, and then there's been some other papers suggesting that maybe, um, in, in very small
models you get nonlinear representations. Um, I think that the jury's still out on that.
Um, but in, I think everything that we've seen so far has been consistent with the linear
representation hypothesis. And that's, that's wild. It doesn't have to be that way.
Um, and yet, uh, I think that there's a lot of evidence that certainly at least this is very,
very widespread and so far the evidence is consistent with it. And I, and I think, you
know, one thing you might say is you might say, well, Christopher, you know, it's, that's a lot,
you know, to, to go and, and sort of, um, to ride on, you know, if we don't know for sure this is true
and you're sort of, you know, you're investigating neural networks as though it is true, you know,
isn't that, um, isn't that dangerous? Well, you know, but I, I think actually there's all,
there's a virtue in taking hypotheses seriously and pushing them as far as they can go. Um, so
it might be that someday we discover something that isn't consistent with linear representation
hypothesis, but science is full of hypotheses and theories that were wrong. Um, and we learned a lot
by sort of working under, under them as a sort of an assumption, um, and, and then going and pushing
them as far as we can. I guess, I guess this is sort of the heart of, of what Kuhn would call
normal, normal science. Um, I don't know if you want, we can talk a lot about, about, uh,
philosophy of science and, uh, that leads to the paradigm shift. So yeah, I love it.
Taking the hypothesis seriously and take it to a natural, natural conclusion. Yeah.
Same with the scaling hypothesis. Same. Exactly. Exactly. And I love it.
One of my colleagues, Tom Henighan, who is a former physicist, um, uh, like made this really nice
analogy to me of, um, uh, caloric theory where, you know, once upon a time we thought that heat was
actually, you know, this thing called caloric and like the reason, you know, hot objects,
you know, would, would warm up cool objects is like the caloric is flowing through them.
Um, and like, you know, because we're so used to thinking about, about heat, you know, in terms of
the modern, modern theory, you know, that seems kind of silly, but it's actually very hard to construct,
uh, an experiment that, that sort of disproves the, um, caloric hypothesis. Um, and you know,
you can actually do a lot of really useful work believing in caloric. For example, it turns out
that, uh, the original combustion engines were developed by people who believed in the caloric
theory. So I think there's a virtue in taking, um, hypotheses seriously, even when they might be wrong.
Yeah. Yeah. There's a deep philosophical truth to that. That's kind of, kind of like how I feel
about space travel, like colonizing Mars. There's a lot of people that criticize that. I think if you
just assume we have to colonize Mars in, in order to have a backup for human civilization,
even if that's not true, that's going to produce some interesting, interesting engineering and even
scientific breakthroughs, I think. Yeah. Well, and actually this is another
thing that I think is really interesting. So, um, you know, there's a way in which I think it can be
really useful for society to have people, um, almost irrationally dedicated to investigating
any particular hypotheses. Um, because, uh, well, it, it, it takes a lot to sort of maintain scientific
morale and really push on something when, you know, most, most science, scientific hypotheses
end up being wrong. You know, a lot of, a lot of science doesn't, doesn't work out. Um, and, but it's,
and yet it's, you know, it's very, it's very useful to go to just, you know, um, there's a,
there's a joke about Geoff Hinton, um, which is that, uh, Geoff Hinton has discovered how the brain
works every year for the last 50 years. Yeah. Um, but you know, I, I say that with like, you know,
the, you know, with, with really deep respect because, uh, in fact, that's actually, you know,
that, that led to him doing some, some really great work. Yeah. He won the Nobel Prize now,
who's laughing now. Exactly. Exactly. Um, yeah. I think one wants to be able to pop up and sort of
recognize the, the appropriate level of confidence, but I think there's also a lot of value and just
being like, you know, I'm going to essentially assume I'm going to condition on this problem being
possible or this being broadly the right approach. And I'm just going to go and assume that for a
while and go and work within that, um, and push really hard on it. Um, and you know, if society
has lots of people doing, doing that for different things, um, that's actually really useful in terms
of going and, uh, getting to getting, you know, either really, really ruling things out, right? We can
be like, well, you know, that didn't work. And we know that somebody tried hard, um, or going and
getting to something that, that it does teach us something about the world.
So another interesting hypothesis is the superposition hypothesis. Can you describe
what superposition is? Yeah. So earlier we were talking
about word2vec, right? And we were talking about how, you know, maybe you have one direction
that corresponds to gender and maybe another that corresponds to royalty and another one that
corresponds to Italy and another one that corresponds to, you know, food and, and all of these things.
Well, you know, um, oftentimes maybe these, these, uh, these word embeddings, they might be
500 dimensions, a thousand dimensions. And so if you believe that all of those directions were
orthogonal, um, then you could only have, you know, 500 concepts and, you know, I, I love pizza.
Um, but like, if I was going to go and like give the like 500 most important concepts in,
um, you know, the English language, probably Italy wouldn't be, it's not obvious at least that Italy
would be one of them, right? Cause you, you have to have things like plural and singular and,
uh, uh, verb and noun and adjective. And, you know, um, there's, there's a lot of things we
have to get to before we get to get to Italy, um, uh, and Japan. And, you know, there's a lot
of countries in the world. Um, and so how might it be that models could, you know, simultaneously
have the linear representation hypothesis be true and also represent more things than they have
directions? So, so what does that mean? Well, okay. So if, if, if, if linear representation
hypothesis is true, something interesting has to be going on. Now I'll, I'll tell you
one more interesting thing before we, we go and we do that, which is, um, you know, we,
earlier we were talking about all these polysemantic neurons, right? Um, these neurons that, you know,
when we were looking at inception V1, there's these nice neurons that like the car detector
and the curve detector and so on that respond to lots of, you know, to very coherent things,
but lots of neurons that respond to a bunch of unrelated things. And that's, that's also an
interesting phenomenon. Um, and it turns out as well, um, that even these neurons that are really,
really clean, if you look at the weak activations, right? So if you look at like, you know, the
activations where it's like activating 5% of, of the, the, the, you know, of the maximum activation,
it's really not the core thing that it's expecting, right? So if you look at a, a curve detector,
for instance, and you look at the places where it's 5% active, you know, you could interpret it just
as noise, or it could be that it's, that it's doing something else there. Okay. So, so how could that be?
Well, there's this amazing thing in mathematics, um, called compressed sensing. And it's, it's
actually this, this very surprising fact where if you have a high dimensional space and you project
it into a low dimensional space, ordinarily you can't go and sort of unproject it and get back
your high dimensional vector, right? You threw information away. This is like, you know, you can't,
you can't invert a rectangular matrix. Um, you can only invert square matrices. Um,
but it turns out that that's actually not quite true. If I tell you that the high dimensional
vector was sparse, so it's mostly zeros, then it turns out that you can often go and find back,
um, the, uh, the high dimensional vector with, with very high probability. Um, so that's a surprising
fact, right? It says that, you know, you can, um, you can, you can have this high dimensional vector
space. And as long as things are sparse, um, you can project it down, you can have a lower dimensional
projection of it and that works. So the superposition hypothesis is saying that that's
what's going on in neural networks. That's what, for instance, that's what's going on in word
embeddings. The word embeddings are able to simultaneously have directions be the meaningful
thing. And by exploiting the fact that they're, they're operating on a fairly high dimensional space,
they're actually, and the fact that these concepts are sparse, right? Like, you know,
you usually aren't talking about Japan and Italy at the same time. Um,
you know, most of the, most of those concepts, you know, in most instances, Japan and Italy are
both zero. They're not present at all. Um, and if that's true, um, then you can go and have it be
the case that, um, that you can, you can have many more of these sort of directions that are meaningful,
these features, than you have dimensions. And similarly, when we're talking about neurons,
you can have many more concepts than you have, have neurons. So that's, at a high level, the
superposition hypothesis. Now it has this even wilder implication, which is,
um, to go and say that, uh, neural networks are, it may not just be the case that the representations
are like this, but the, the computation may also be like this, you know, the connections between
all of them. And so in, in some sense, neural networks may be shadows of much larger, sparser
neural networks. And what we see are these projections. Um, and the, you know, the
strongest version of the superposition hypothesis would be to take that really seriously and sort of say,
you know, there, there actually is in some sense, this, this upstairs model, this, you know, um,
where, where the neurons are really sparse and all interpretable and there's, you know, the weights
between them are these really sparse circuits. And that's what we're studying. Um, and, uh,
the thing that we're observing is the shadow of it. And so we need to find the original object.
And, uh, the process of learning is trying to construct a compression of the upstairs model
that doesn't lose too much information in the projection.
Yeah. It's finding how to fit it efficiently or something like this. Um, the gradient descent is
doing this. And in fact, so this sort of says that gradient descent, you know, it could just represent
a dense neural network, but it sort of says that gradient descent is implicitly searching over the space
of extremely sparse models that could be projected into this low dimensional space.
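A minimal sketch of the compressed-sensing fact described above, under illustrative assumptions: a 1000-dimensional vector with only a handful of nonzero entries is projected down to 100 dimensions by a random matrix (the "shadow"), and an off-the-shelf L1-regularized regression, used here as a simple stand-in for a proper sparse-recovery solver, gets the high-dimensional vector back.

```python
# Sketch: a sparse high-dimensional vector survives a random projection to a much
# lower-dimensional space. Sizes and the L1 penalty are illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_high, n_low, n_active = 1000, 100, 5     # 1000 dims, only 5 nonzero entries

x = np.zeros(n_high)
x[rng.choice(n_high, n_active, replace=False)] = rng.normal(size=n_active)

A = rng.normal(size=(n_low, n_high)) / np.sqrt(n_low)   # random projection
y = A @ x                                               # the low-dimensional "shadow"

# Ask for a sparse vector consistent with the projection (L1 as a sparsity proxy).
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50_000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("relative recovery error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
# Typically small; with a dense x (no sparsity) the same recovery would fail badly.
```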
Mm-hmm. And there's this large body of work of people going and trying to study sparse neural
networks, right? Where you go and you could design neural networks, right? Where,
where the edges are sparse and activations are sparse. And, you know, my sense is that work has
generally, it feels very principled, right? It makes so much sense. And yet that, that work hasn't really
panned out that well is my impression broadly. And I think that, uh, a potential answer for that
is that actually the neural network is already sparse in some sense. The whole time you were trying
to go and do this, gradient descent was actually, behind the scenes, going and searching more
efficiently than you could through the space of sparse models and
going and learning whatever sparse model was most efficient, and then figuring out how to fold it
down nicely to go and run conveniently on your GPU, which does, you know, these nice dense matrix
multiplies, um, and you just can't beat that. How many concepts do you think can be shoved
into a neural network? Depends on how sparse they are. So there's, there's probably an upper bound from
the number of parameters, right? Cause you have to have, you still have to have, you know,
weights that go and connect them together. Um, so that's, that's one upper bound.
There are in fact, all these lovely results from compressed sensing and the Johnson-Lindenstrauss
lemma and things like this, um, that they, they basically tell you that if you have a vector space
and you want to have almost orthogonal vectors, which is sort of the, probably the thing that you
want here, right? So you, you're going to say, well, you know, I'm going to give up on having my,
my concepts, my features be strictly orthogonal, but I'd like them to not interfere that much.
I'm going to ask them to be almost orthogonal. Um, then this would say that it's actually,
you know, for once you set a threshold for what you're, what you're willing to accept in terms of
how, how much cosine similarity there is, that's actually exponential in the number of neurons
that you have. So at some point that's not going to even be the limiting factor. Um, but, um, you
know, some beautiful results there. And in fact, it's probably even better than that in some sense,
because that's sort of for saying that, you know, any random set of features could be active,
but in fact, the features have sort of a correlational structure where some features,
you know, are more likely to co-occur and other ones are less likely to co-occur.
And so neural networks, my guess would be, could do, do very well in terms of going and, uh, packing
things in such to, to the point that's probably, probably not the limiting factor.
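Here is a small illustrative check of the almost-orthogonal point, with arbitrary sizes chosen just for the example: random unit vectors in a few hundred dimensions interfere with each other only mildly, even when there are far more of them than there are dimensions.

```python
# Illustrative only: many more near-orthogonal directions than dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 4000              # far more candidate directions than dimensions

V = rng.normal(size=(n_features, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # random unit vectors

cos = V @ V.T
np.fill_diagonal(cos, 0.0)
print("max |cosine similarity| between distinct directions:", np.abs(cos).max())
# Comes out around 0.25 rather than 0: thousands of directions with only mild
# interference, whereas strictly orthogonal directions would cap out at d = 512.
```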
How does the problem of polysemanticity enter the picture here?
Polysemanticity is this phenomenon we observe where we look at many neurons and the neuron doesn't just
sort of represent one, one concept. It's not, it's not a clean feature. It responds to a bunch of
unrelated things. And, um, superposition is, you can think of as, as being a hypothesis that explains
the observation of polysemanticity. Um, so polysemanticity is this observed phenomenon and
superposition is, is a hypothesis that, um, would explain it along with, with some other things.
So that makes mech interp more difficult.
Right. So if you, if you're trying to understand things in terms of individual neurons,
and you have polysemantic neurons, you're in an awful lot of trouble, right?
I mean, the easiest answer is like, okay, well, you know, you're looking at the neurons,
you're trying to understand them. This one responds to a lot of things. It doesn't have
a nice meaning. Okay. Well, you know, that's, that's, that's bad. Um, another thing you could
ask is, you know, ultimately we want to understand the weights. And if you have two polysemantic
neurons and, you know, each one responds to three things and then, you know, the other neuron responds
to three things and you have a weight between them, you know, what does that mean? Does it mean that
like all three, you know, like there's these nine, you know, nine interactions going on. It's a very weird thing,
but there's also a deeper reason, which is related to the fact that neural networks operate on really
high dimensional spaces. So I said that our goal was, you know, to understand neural networks and
understand the mechanisms. And one thing you might say is like, well, why not? It's just a mathematical
function. Why not just look at it? Right. Like, um, you know, one of the earliest projects I did studied
these, these neural networks that mapped two dimensional spaces to two dimensional spaces, and
you can sort of interpret them in this beautiful way is like bending manifolds.
Um, why can't we do that? Well, you know, as you have a higher dimensional space,
um, the volume of that space in some senses is exponential in the number of inputs you have.
And so you can't just go and visualize it. So we somehow need to break that apart. We need to
somehow break that exponential space into a bunch of things that we, you know, some non exponential
number of things that we can reason about independently. And the independence is crucial because
it's the independence that allows you to not have to think about, you know,
all the exponential combinations of things and things being monosemantic things, only having
one meaning things, having a meaning that isn't is the key thing that allows you to think about them
independently. And so I think that's, that's if you want the deepest reason why we want to have,
um, interpretable monosemantic features, I think that's really the, the deep reason.
And so the goal here as your recent work has been aiming at is how do we extract the
monosemantic features from a neural net that has polysemantic features and all this,
this mess?
Yes, we have, we observe these polysemantic neurons and we hypothesize that what's going,
what's going on is superposition. And if superposition is what's going on, there's actually a sort
of well-established technique that is sort of the principled thing to do, which is dictionary
learning. And, um, it turns out if you do dictionary learning in particular, if you do the sort of a nice,
efficient way that in some, in some sense sort of nicely regularizes it as well, as well, called a
sparse autoencoder, if you train a sparse autoencoder, these beautiful interpretable
features start to just fall out where there weren't any beforehand. And so that's not a thing that you
would necessarily predict, right? But it turns out that that works very, very well. You know, to me,
that seems like, you know, some non-trivial validation of linear representations and superposition.
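For readers who want to see the shape of the technique, here is a minimal sparse autoencoder sketch in PyTorch. The architecture, sizes, and L1 coefficient are illustrative assumptions, not the actual setup used in the papers discussed here; real pipelines add details like decoder-norm constraints and train on activations collected from the model being studied.

```python
# Minimal sparse autoencoder sketch: reconstruct activations through an overcomplete,
# non-negative, L1-penalized code. All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # many more features than dims
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    mse = (reconstruction - activations).pow(2).mean()  # reconstruct the activations
    sparsity = features.abs().mean()                    # penalize how much fires
    return mse + l1_coeff * sparsity

# Hypothetical usage: `acts` stands in for residual-stream or MLP activations.
sae = SparseAutoencoder(d_model=512, d_features=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
loss.backward()
opt.step()
```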
Right. So with dictionary learning, you're not looking for particular kind of categories. You
don't know what they are. They just emerge. Exactly. And this gets back to our earlier
point, right? When we're not making assumptions, gradient descent is smarter than us. So we're not
making assumptions about what's there. Um, I mean, one certainly could do that, right? One could
assume that there's a PHP feature and go and search for it, but we're not doing that. We're saying,
we don't know what's going to be there. Instead, we're just going to go and let, um, the sparse
autoencoder discover the things that are there. So can you, uh, talk about the Towards
Monosemanticity paper from, uh, October last year that had a lot of like nice breakthrough results?
That's very kind of you to describe it that way. Um, yeah, I mean, this was, um, uh,
our first real success using sparse autoencoders. So we took a one layer model. Um, and it turns out
if you go and you, you know, do dictionary learning on it, you find all these really nice
interpretable features. So, you know, the Arabic feature, the Hebrew feature, uh, the base 64 features,
those were, were some, some examples that we studied in a lot of depth and really showed that
they were, um, what we thought they were. It turns out as well, if you train a model twice, and
train two different models and, and do dictionary learning, you find, find analogous features in
both of them. So that's fun. Um, you find all kinds of, of different features. So that was really
just showing, um, that, um, that this works. And, um, you know, I should mention that there was this
Cunningham et al. paper, um, that had very similar results around the same time.
There's something fun about doing these kinds of small scale experiments and finding that it's actually
working. Yeah. Well, and there's, and that there's so much structure here. Like you, you know,
so maybe, maybe stepping back for a while, um, I thought that maybe all this mechanistic
interpretability work, um, the end result was going to be that I would have an explanation
for why it was sort of, you know, very hard and not going to be tractable. Um, you know,
we'd be like, well, there's this problem with superposition and it turns out superposition is really
hard, um, and we're kind of screwed, but that's not what happened. In fact, a very natural,
simple technique just works. And so then that's actually a very good situation. You know,
I think, um, this is a sort of hard research problem and it's got a lot of research risk and,
you know, it, it might still very well fail, but, um, I think that some amount of some very
significant amount of research risk, um, was sort of put behind us when that started to work.
Can you describe what kind of features can be extracted in this way?
Well, so it depends on the model that you're studying, right? So the, the larger the model,
the more sophisticated they're going to be. And we'll probably talk about, about follow-up
work in a minute. Um, but in these one layer models, um, so some very common things I think
were, were languages, both programming languages and natural languages. There were a lot of features
that were, um, specific words in specific contexts. So the, and I think really the way to think about
this is that 'the' is likely about to be followed by a noun. So really, right, you could think of this
as a 'the' feature, but you could also think of this as a predicting-a-specific-noun feature.
And there would be these features that would fire for 'the' in, um, the context of, of say,
a legal document or a mathematical document or something like this. Um, and so, uh, you know,
maybe in the context of math, you're like, you know, the, and then predict vector or matrix,
you know, all these mathematical words, whereas, you know, other contexts, you would predict other
things. That was, that was common. And basically we need clever humans to assign
labels to what we're seeing.
Yes. So, you know, the only thing this is doing is sort of, um,
unfolding things for you. So if everything was sort of folded over top of itself, you know,
superposition folded everything on top of itself and you can't really see it, this is unfolding it,
but now you still have a very complex thing to try to understand. Um, so then you have to do a bunch
of work understanding what these are. Um, and some of them are really subtle, like there's some
really cool things, even in this one layer model about, um, Unicode, where, you know, of course,
some languages are in Unicode and the tokenizer won't necessarily have a dedicated token for every,
um, Unicode, um, character. So instead what you'll have is you'll have this, these patterns
of alternating token or alternating tokens that each represent half of a Unicode character,
and you have a different feature that, you know, goes and activates on the, on the opposing ones to
be like, okay, you know, um, I just finished a character, you know, go and predict next prefix.
Um, then, okay, I'm on the prefix, you know, predict a reasonable suffix. Um, and you have,
you have to alternate back and forth. So there's, you know, these, these one layer models are, are
really interesting. And, um, uh, I mean, there's another thing that just, you might think, okay,
there would just be one base 64 feature, but it turns out there's actually a bunch of base 64
features because you can have English text encoded in as base 64. And that has a very different
distribution of base 64 tokens than, than regular. And there's, um, uh, there's, there's some things
about tokenization as well that it can exploit. And I don't know, there's all, all kinds of, uh,
fun stuff. How difficult is the task of sort of assigning labels to what's going on? Can this
be automated by AI? Well, I think it depends on the feature and it also depends on how much you
trust your AI. So, um, there's a lot of work doing, um, automated interpretability. I think that's
a really exciting direction. And we do a fair amount of automated interpretability and have,
have Claude go and label our features. Is there some fun moments where it's totally right?
Or is totally wrong? Yeah. Well, I think, I think it's very common that it's like says something
very general, which is like true in some sense, but not really picking up on the specific of what's
going on. Um, so I think, I think that's a pretty common situation. Um, I don't know that I have
a particularly amusing one. That's interesting. That little gap between it is true, but it doesn't
quite get to the deep nuance of a thing. Yeah. That's a general challenge. It's like, it's,
it's certainly an incredible accomplishment that it can say a true thing, but it doesn't,
it's, it's missing the depth sometimes. And in this context, it's like the ARC challenge,
you know, the sort of IQ type of tests. It feels like figuring out what a feature represents is a bit
of a little puzzle you have to solve. Yeah. And, and I think that sometimes they're easier and sometimes
they're harder as well. Um, so, uh, yeah, I think, I think that's tricky. And there's another thing,
which I don't know, maybe, maybe in some ways this is my like aesthetic coming in, but I'll
try to give you a rationalization. You know, I'm actually a little suspicious of automated
interpretability. And I think that's partly just that I want humans to understand neural networks.
And if the neural network is understanding it for me, you know, I'm not, I don't quite like that,
but I do have a bit of, uh, you know, in some ways I'm sort of like the mathematicians who are
like, you know, if there's a computer-automated proof, it doesn't count, because
they won't understand it. But I do also think that there is, um, this kind of like reflections
on trusting trust type issue where, you know, if you, there's this famous talk about, um, uh,
you know, you, like when you're writing a computer program, you have to trust your compiler.
And if there was like malware in your compiler, then it could go and inject malware into the
next compiler. And, you know, you'd be kind of in trouble, right? Well, if you're using neural
networks to go and, um, verify that your neural networks are safe, the hypothesis that you're
testing for is like, okay, well, the neural network maybe isn't safe. Um, and you have to worry about,
like, is there some way that it could be screwing with you? Um, so, uh, you know, I, I think that's
not a big concern now. Um, but I do wonder in the long run, if we have to use really powerful AI systems
to go and, uh, you know, audit our AI systems, is that, is that actually something we can trust?
But maybe I'm just rationalizing because I, I just want us to have to get to a point where humans
understand everything.
Yeah. I mean, especially that's hilarious, especially as we talk about AI safety and
looking for features that would be relevant to AI safety, like deception and so on. Uh,
so let's, let's talk about the Scaling Monosemanticity paper from May 2024. Okay. So what did it take to
scale this to apply to Claude 3 Sonnet?
Well, a lot of GPUs.
A lot more GPUs.
Um, but one of my teammates, Tom Henighan, um, was involved in the original scaling laws work.
Um, and something that he was sort of interested in from very early on is,
are there scaling laws for interpretability?
Um, and so, um, something he sort of immediately did when, when this, this work started to succeed
and we started to have sparse autoencoders work, was he became very interested in, you know,
what are the scaling laws for, um, uh, you know, for making, making sparse autoencoders
larger and how does that relate to making the base model larger?
Um, and so, um, it turns out this works really well and you can use it to sort of project,
um, you know, if you train a sparse autoencoder, a given size, you know, how many tokens should
you train on and so on.
And so this was actually a very big help to us in scaling up, um, this work, um, and made
it a lot easier for us to go and train, um, you know, really large sparse autoencoders where,
you know, um, it's not like training the big models, but it's, it's starting to get to a point
where it's actually, actually expensive to go, um, and train the really big ones.
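The projection step being described could look something like the sketch below: fit a power law in log-log space and extrapolate. The data points are made up for the example; this is not the actual methodology or data from the work discussed.

```python
# Illustrative power-law fit, loss ≈ a * N^b, used to project larger runs.
import numpy as np

n_features = np.array([2e4, 6e4, 2e5, 6e5, 2e6])          # hypothetical SAE sizes
loss       = np.array([0.31, 0.24, 0.19, 0.155, 0.125])   # hypothetical eval losses

b, log_a = np.polyfit(np.log(n_features), np.log(loss), deg=1)  # line in log-log space
a = np.exp(log_a)

def projected_loss(n):
    return a * n ** b

print("fitted exponent:", b)
print("projected loss at 1e7 features:", projected_loss(1e7))
```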
So you have to, I mean, you have to do all the stuff of like splitting it across
large clusters.
Oh yeah, no, I mean, there's a huge engineering challenge here too, right?
So yeah, so, so there's, there's this, there's a scientific question of how do you scale things
effectively?
Um, and then there's an enormous amount of engineering to go and scale this up.
So you have to, you have to shard it.
You have to, you have to think very carefully about a lot of things.
I'm lucky to work with a bunch of great engineers because I am definitely not a great engineer.
Yeah. And the infrastructure, especially.
Yeah, for sure.
So, TL;DR, it turns out it worked.
It worked.
Yeah. And I think this is important because you could have imagined, you could, like,
you could have imagined a world where you said, after Towards Monosemanticity, you know, Chris,
this is great.
You know, it works on a one layer model, but one layer models are really idiosyncratic.
Um, like, you know, maybe, maybe that's just something about one layer models, like, maybe the linear
representation hypothesis and superposition hypothesis is the right way to understand a one
layer model, but it's not the right way to understand larger models.
Um, and so I think, um, I mean, first of all, the Cunningham et al. paper sort of,
um, cut through that a little bit and, and sort of suggested that this wasn't the case,
but, um, scaling monosemanticity sort of, I think was significant evidence that even for
very large models, and we did it on Claude 3 Sonnet, which at that point was, uh, one of our
production models, um, you know, even these models, um, seem to be very, you know, seem to be
substantially explained, at least, by linear features and, you know, doing dictionary
learning on them works. And as you learn more features, you go and you explain, explain more
and more. So that's a, I think a quite a promising sign and you find now really fascinating abstract
features. Um, and the features are also multimodal. They respond to images and text for the same
concept, which is fun. Yeah. This, can you explain that? I mean, like, you know, backdoor,
there's just a lot of examples that you can. Yeah. So maybe, maybe let's start with one example to start,
which is we found some features around sort of security vulnerabilities and backdoors in code.
So it turns out those are actually two different features. Um, so there's a security vulnerability
feature. And if you force it active, Claude will start to go and write, um, security vulnerabilities
like buffer overflows into code. And it also, it fires for all kinds of things like, you know,
some of the, some of the top dataset examples for it were things like, you know, dash dash, disable,
um, you know, SSL or something like this, which are sort of obviously really, um, uh, really insecure.
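Mechanically, "forcing a feature active" can be pictured as adding that feature's decoder direction into an intermediate activation during the forward pass. The sketch below is a hedged illustration only; the layer, feature index, scale, and hook-based approach are assumptions for the example, not the actual tooling used in this work.

```python
# Hypothetical feature-steering sketch: add a feature's direction into an activation.
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    # feature_direction: e.g. one decoder column of a sparse autoencoder, shape (d_model,)
    def hook(module, inputs, output):
        return output + scale * feature_direction.to(output.dtype)
    return hook

# Hypothetical usage with some transformer `model` whose block returns a plain tensor:
# direction = sae.decoder.weight[:, feature_idx].detach()
# handle = model.layers[12].register_forward_hook(make_steering_hook(direction, 8.0))
# ...generate text, observe the behavior change, then: handle.remove()
```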
So at this point, it's kind of like, maybe it's just because the examples are presented that way.
It kind of surfaces the more obvious examples, right? Um, I guess the, the idea is that
down the line, it might be able to detect more nuanced, like deception or bugs or that kind of stuff.
Yeah. Well, I, maybe I want to distinguish two things. So, um, one is, um, the complexity of the
feature or the concept, right? And the other is the, the nuance of the, how subtle the examples
we're looking at, right? So when we, when we show the top dataset examples, those are the most extreme
examples that cause that feature to, to activate. Um, and so it doesn't mean that it doesn't fire for
more subtle things. So the, you know, the, the insecure, um, code feature, you know, the stuff that it
fires for most strongly for are these like really obvious, you know, disable the security type things.
Um, but, um, um, you know, uh, it, it also fires for, you know, buffer overflows and, and more subtle
security vulnerabilities in code. You know, these features are all multimodal. So you could ask like,
what images activate this feature? And it turns out, um, that the, uh, the, the security vulnerability
feature activates for images of, um, uh, like people are clicking on Chrome to like go past
the, like, you know, this, this website, uh, the SSL certificate might be wrong or something like
this. Another thing that's very entertaining is there's back doors and code feature. Like
you activate it, it goes in, Claude writes a back door that like will go and dump your data to
some port or something. But you can ask, okay, what, what images activate the back door feature?
It was devices with hidden cameras in them. So there's a whole, uh, apparently genre of people
going and selling devices that look innocuous that have hidden cameras and they have ads that
say there's a hidden camera in it. And I guess that is the, you know, physical version of a back
door. Um, and so it sort of shows you how abstract these concepts are. Right. Um, and I,
I just thought that was, uh, I, I'm sort of sad that there's a whole market of people selling
devices like that, but I was kind of delighted that that was the, the thing that it came up with
as the, the top, uh, image examples for the feature. Yeah. It's nice. It's multimodal. It's
almost multi-context. It's a broad, strong definition of a singular concept. It's nice.
Yeah. To me, one of the really interesting features, especially for AI safety is
deception and lying and the possibility that these kinds of methods could detect,
uh, lying in a model, especially as models get smarter and smarter and smarter. Presumably that's a big threat
of a super intelligent model that it can deceive the people operating it,
uh, as to its intentions or any of that kind of stuff. So what, what have you learned from
detecting lying inside models? Yeah. So I think we're in some ways in
early days for that. We find quite a few features related to deception and lying. There's one feature
where it fires for people lying and being deceptive and you force it active and Claude starts lying to
you. So we have a, have a deception feature. I mean, there's all kinds of other features about
withholding information and not answering questions, features about power seeking and
coups and stuff like that. So there's, there's a lot of features that are kind of related to spooky
things. And if you, um, force them active, Claude will, will behave in ways that are not the kinds
of behaviors you want. What are possible next exciting directions to you in the space of, uh,
mech interp? Well, there's a lot of things. Um,
so for one thing, I would really like to get to a point where we have circuits, where we can really
understand, um, not just the features, uh, but then use that to understand the computation of models.
Um, that really for me is, is the, the ultimate goal of this. Um, and there's been some work. We,
we put out a few things. There's a paper from Sam Marks that does some stuff like this. And there's,
there's been some, I'd say some work around the edges here. Um, but I think there's a lot more
to do. And I think that will be a very exciting thing. Um, that's related to a challenge we call
interference weights, um, where, um, due to superposition, if you just sort of naively look
at whether features are connected together, there may be some weights that sort of don't exist in the
upstairs model, but are just sort of artifacts of, of superposition. So that's, um, uh, sort of technical
challenge related to that. Um, I think another exciting direction is just, uh, you know, you
might think of, of sparse auto-encoders as being kind of like a telescope. They allow us to, you
know, look out and see all these features that are, are, are, are out there. And, you know,
as we build better and better sparse auto-encoders, get better and better at dictionary learning,
we see more and more stars. Um, and you know, we zoom in on smaller and smaller stars, but there's kind
of, um, a lot of evidence that we're only still seeing a very small fraction of the stars. There's
a lot of matter in our, in our, you know, neural network universe that we can't observe yet.
Um, and it may be that, um, that we'll never be able to have fine enough instruments to observe it.
And maybe, maybe some of it just isn't possible, um, isn't computationally tractable to observe.
And so there's sort of a, a kind of dark matter, not maybe in the sense of, of modern
astronomy, but of earlier astronomy, when we didn't know what this unexplained matter was.
Um, and so I, I think a lot about that, that dark matter and whether we'll ever observe
it and what that means for safety if we, if we can't observe it, if there's, you know, some,
if some significant fraction of neural networks are not accessible to us.
Um, another question that I think a lot about is, uh, at the end of the day,
you know, mechanistic interpretability is this very microscopic, um, approach to interpretability. It's
trying to understand things in a very fine grained way. But a lot of the questions we care about are
very macroscopic. Um, you know, we, we care about these questions about neural network behavior and,
and I think that's the thing that I care most about, but there's, there's lots of other,
other sort of larger scale questions you might care about. Um, and somehow, you know, the nice
thing about, about having a very microscopic approach is it's maybe easier to ask, you know,
is this true? But the downside is it's much further from the things we care about. And so we now have
this ladder to climb. And I think there's a question of can, will we be able to find, are there,
are there sort of larger scale abstractions that we can use to understand neural networks that we
get up from this very microscopic approach? Yeah. You've, you, you've written about this,
this kind of organs question. Yeah, exactly. If we, uh, think of
interpretability as a kind of anatomy of neural networks, most of the circuits threads involve
studying tiny little veins, looking at the small scale and individual neurons and how they connect.
However, there are many natural questions that the small scale approach doesn't address. In contrast,
the most prominent abstractions in biological anatomy involve larger scale structures like
individual organs, like the heart or entire organ systems, like the respiratory system.
And so we wonder, is there a respiratory system or heart or brain region of an artificial neural network?
Yeah, exactly. Um, and I mean, like, if you think about science, right, a lot of scientific fields
have, um, you know, investigate things at many levels of abstraction. So in biology, you have like,
you know, molecular biology studying, you know, proteins and molecules and so on,
and they have cellular biology, and then you have histology studying tissues, then you have anatomy,
and then you have zoology, and then you have ecology. And so you have many, many levels of abstraction,
or, you know, physics, maybe the physics of individual particles. And then, you know,
statistical physics gives you, gives you thermodynamics and things like this. And so you
often have different levels of abstraction. Um, and I think that right now we have,
you know, mechanistic interpretability, if it succeeds, is sort of like a microbiology
of neural networks, but we, we want something more like anatomy. And so, and, and, you know,
a question you might ask is why, why can't you just go there directly? And I think the answer is
superposition, um, in, at least in significant part, it's that it's actually very hard to, to see this,
this macroscopic structure, uh, without first sort of breaking down the microscopic structure in the
right way and then studying how it connects together. Um, but I'm, I'm hopeful that there
is going to be something much larger than, um, features and circuits, and that we're going to be
able to have a story that's much, that involves much bigger things. And then you can sort of study
in detail the parts you care about. As opposed to neurobiology, like a psychologist or a psychiatrist
of a neural network. And I think that the beautiful thing would be if we could go and,
rather than having disparate fields for those two things, if you could have a, build a bridge
between them such that you could go and, um, uh, have all of your higher level abstractions
be grounded very firmly in this very solid, um, you know, more rigorous, uh, ideally foundation.
What do you think is the difference between the human brain,
the biological neural network and the artificial neural network?
Well, the neuroscientists have a much harder job than us. You know, sometimes I just like count
my blessings by how much easier my job is than the neuroscientists, right? So I have,
um, we, we can record from all the neurons. Yeah.
We can do that on arbitrary amounts of data. Um, the neurons don't change while you're doing that,
by the way. Um, you can go and ablate neurons, you can edit the connections and so on. Um, and then
you can undo those changes. That's pretty great. Yeah.
Um, you can force any, you can intervene on any neuron and force it active and see what happens.
Um, you know, which neurons are connected to everything, right? You have neuroscientists
want to get the connectome. We have the connectome, um, and we have it for like much bigger than C.
elegans. Yeah.
Um, and then not only do we have the connectome, um, we know, uh, what the, you know, which neurons
excite or inhibit each other, right? So we have, we, it's not just that we know that like the binary
mask, we know the, the weights, um, we can take gradients. We know computationally what each neuron does.
Um, so I don't know that the list goes on and on. We just have, um, so many advantages over
neuroscientists and then just by having all those advantages, it's really hard. And so one thing I
do sometimes think is like, gosh, like if it's this hard for us, it seems impossible under the
constraints of neuroscience or, you know, near impossible. Um, I, I don't know, maybe, maybe
part of me is like, I've got a few neuroscientists on my team. Maybe I, maybe I'm sort of like, ah,
you know, um, the, uh, maybe the neuroscientists, maybe some of them would like to have an easier
problem. That's still very hard. Um, and they, they could come and work on, on neural networks.
And then after we, after we figure out things in sort of the easy, uh, little pond of trying
to understand neural networks, which is still very hard, then we, then we could go back to,
uh, biological neuroscience.
I love what you've written about the goal of mech interp research as, uh, two goals, safety and beauty.
So can you talk about the beauty side of things?
Yeah. So, you know, there's this funny thing where I think some people want, uh, some people
are kind of disappointed by neural networks. I think where they're like, ah, you know, neural networks,
um, it's these, just these simple rules. And then you just like do a bunch of engineering
to scale it up and it works really well. And like, where's the like complex ideas,
you know, this isn't like a very nice, beautiful scientific result.
And I sometimes think when people say that, uh, I picture them being like, you know,
evolution is so boring. It's just a bunch of simple rules and you run evolution for a long time
and you get biology. Like what a, what a sucky, uh, you know, way for biology to have turned out.
Where's the complex rules, but the beauty is that the simplicity generates complexity. Um,
you know, biology has these simple rules and it gives rise to, you know, all the life and ecosystems
that we see around us, all the beauty of nature that all just comes from evolution and from something
very simple evolution. And similarly, I think that neural networks build, you know, create enormous,
um, complexity and beauty inside and structure inside themselves that people generally don't
look at and don't try to understand because it's, it's hard to understand. But I, I think that there
is an incredibly rich structure to be discovered inside neural networks. A lot of, a lot of very deep
beauty. Um, if we're just willing to take the time to go and see it and understand it.
Yeah. I love, I love mech interp. The feeling like we are understanding or getting glimpses
of understanding the magic that's going on inside is really wonderful.
It feels to me like one of the questions is just calling out to be asked. And I'm sort of,
I mean, a lot of people are thinking about this, but I'm, I'm often surprised that not more are,
is how is it that we don't know how to directly create computer programs that can do these things,
and yet we have these neural networks
that can do all these amazing things. And it just feels like
that is obviously the question that sort of is calling out to be answered. If you are, if you have
any degree of curiosity, it's, it's like, how is it that, that humanity now has these artifacts that can
do these things that we don't know how to do. Yeah. I love the image of the circuits reaching towards
the light of the objective function. Yeah. It's just, it's, it's this organic thing that we've grown
and we have no idea what we've grown. Well, thank you for working on safety and thank you for
appreciating the beauty of the things you discover. And thank you for talking today, Chris. This is
wonderful. Yeah. Thank you for taking the time to chat as well. Thanks for listening to this
conversation with Chris Olah and before that with Dario Amodei and Amanda Askell. To support this
podcast, please check out our sponsors in the description. And now let me leave you with some
words from Alan Watts. The only way to make sense out of change is to plunge into it, move with it and
join the dance. Thank you for listening and hope to see you next time.