AI Is a Black Box. Anthropic Figured Out a Way to Look Inside | WIRED
Excerpt
What goes on in artificial neural networks is largely a mystery, even to their creators. But researchers from Anthropic have caught a glimpse.
For the past decade, AI researcher Chris Olah has been obsessed with artificial neural networks. One question in particular engaged him, and has been the center of his work, first at Google Brain, then OpenAI, and today at AI startup Anthropic, where he is a cofounder. "What's going on inside of them?" he says. "We have these systems, we don't know what's going on. It seems crazy."
That question has become a core concern now that generative AI has become ubiquitous. Large language models like ChatGPT, Gemini, and Anthropic's own Claude have dazzled people with their language prowess and infuriated people with their tendency to make things up. Their potential to solve previously intractable problems enchants techno-optimists. But LLMs are strangers in our midst. Even the people who build them don't know exactly how they work, and massive effort is required to create guardrails to prevent them from churning out bias, misinformation, and even blueprints for deadly chemical weapons. If the people building the models knew what happened inside these "black boxes," it would be easier to make them safer.
Olah believes that we're on the path to this. He leads an Anthropic team that has peeked inside that black box. Essentially, they are trying to reverse engineer large language models to understand why they come up with specific outputs, and according to a paper released today, they have made significant progress.
Maybe you've seen neuroscience studies that interpret MRI scans to identify whether a human brain is entertaining thoughts of a plane, a teddy bear, or a clock tower. Similarly, Anthropic has plunged into the digital tangle of the neural net of its LLM, Claude, and pinpointed which combinations of its crude artificial neurons evoke specific concepts, or "features." The company's researchers have identified the combinations of artificial neurons that signify features as disparate as burritos, semicolons in programming code, and (very much to the larger goal of the research) deadly biological weapons. Work like this has potentially huge implications for AI safety: If you can figure out where danger lurks inside an LLM, you are presumably better equipped to stop it.
I met with Olah and three of his colleagues, among 18 Anthropic researchers on the "mechanistic interpretability" team. They explain that their approach treats artificial neurons like letters of Western alphabets, which don't usually have meaning on their own but can be strung together sequentially to have meaning. "C doesn't usually mean something," says Olah. "But car does." Interpreting neural nets by that principle involves a technique called dictionary learning, which lets you identify combinations of neurons that, when fired in unison, evoke a specific concept, referred to as a feature.
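The letters-and-words idea can be made concrete with a toy sketch. Everything in it is an assumption for illustration: the four neurons, the three feature names (borrowed from examples in this article), and the hand-written directions are invented, not taken from Anthropic's paper. Real dictionary learning fits the directions from a model's recorded activations. This sketch skips that step and assumes a tiny dictionary whose directions are mutually orthogonal, so a simple projection is enough to read the features back out.

```python
# Toy illustration only: a hand-written "dictionary" mapping hypothetical
# features to directions across 4 artificial neurons. Real dictionaries are
# learned from data and hold millions of entries.
DICTIONARY = {
    "burrito":   [1.0,  1.0, 0.0, 0.0],
    "semicolon": [1.0, -1.0, 0.0, 0.0],
    "bridge":    [0.0,  0.0, 1.0, 1.0],
}


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def encode(strengths):
    """Build the neuron activations produced by a mix of features."""
    activation = [0.0] * 4
    for name, strength in strengths.items():
        for i, weight in enumerate(DICTIONARY[name]):
            activation[i] += strength * weight
    return activation


def decode(activation):
    """Read feature strengths back out of raw neuron activations.

    Projection works here only because the toy directions are orthogonal.
    """
    return {
        name: dot(activation, direction) / dot(direction, direction)
        for name, direction in DICTIONARY.items()
    }


neurons = encode({"burrito": 2.0, "bridge": 0.5})
print(neurons)          # neuron 0 alone is ambiguous: burrito or semicolon?
print(decode(neurons))  # the combination of neurons is not
```

Neuron 0 takes part in both the "burrito" and "semicolon" directions, so its value alone says little, much like the letter C in Olah's example. Only the pattern across neurons identifies the feature.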
"It's sort of a bewildering thing," says Josh Batson, an Anthropic research scientist. "We've got on the order of 17 million different concepts [in an LLM], and they don't come out labeled for our understanding. So we just go look, when did that pattern show up?"
Last year, the team began experimenting with a tiny model that uses only a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that designate features. They ran countless experiments with no success. "We tried a whole bunch of stuff, and nothing was working. It looked like a bunch of random garbage," says Tom Henighan, a member of Anthropic's technical staff. Then a run dubbed "Johnny" (each experiment was assigned a random name) began associating neural patterns with concepts that appeared in its outputs.
"Chris looked at it, and he was like, 'Holy crap. This looks great,'" says Henighan, who was stunned as well. "I looked at it, and was like, 'Oh, wow, wait, is this working?'"
Suddenly the researchers could identify the features a group of neurons were encoding. They could peer into the black box. Henighan says he identified the first five features he looked at. One group of neurons signified Russian texts. Another was associated with mathematical functions in the Python computer language. And so on.
Once they showed they could identify features in the tiny model, the researchers set about the hairier task of decoding a full-size LLM in the wild. They used Claude Sonnet, the medium-strength version of Anthropic's three current models. That worked, too. One feature that stuck out to them was associated with the Golden Gate Bridge. They mapped out the set of neurons that, when fired together, indicated that Claude was "thinking" about the massive structure that links San Francisco to Marin County. What's more, when similar sets of neurons fired, they evoked subjects that were Golden Gate Bridge-adjacent: Alcatraz, California governor Gavin Newsom, and the Hitchcock movie Vertigo, which was set in San Francisco. All told, the team identified millions of features, a sort of Rosetta Stone to decode Claude's neural net. Many of the features were safety-related, including "getting close to someone for some ulterior motive," "discussion of biological warfare," and "villainous plots to take over the world."
The Anthropic team then took the next step, to see if they could use that information to change Claude's behavior. They began manipulating the neural net to augment or diminish certain concepts, a kind of AI brain surgery, with the potential to make LLMs safer and augment their power in selected areas. "Let's say we have this board of features. We turn on the model, one of them lights up, and we see, 'Oh, it's thinking about the Golden Gate Bridge,'" says Shan Carter, an Anthropic scientist on the team. "So now, we're thinking, what if we put a little dial on all these? And what if we turn that dial?"
So far, the answer to that question seems to be that it's very important to turn the dial the right amount. The team found several features that represented dangerous practices, like unsafe computer code, scam emails, and instructions for making dangerous products. By suppressing those features, Anthropic says, the model can produce safer computer programs and reduce bias.
The opposite occurred when the team intentionally provoked those dicey combinations of neurons to fire. Claude churned out computer programs with dangerous buffer overflow bugs, wrote scam emails, and happily offered advice on how to make weapons of destruction. If you twist the dial too much, cranking it to 11 in the Spinal Tap sense, the language model becomes obsessed with that feature. When the research team turned up the juice on the Golden Gate feature, for example, Claude constantly changed the subject to refer to that glorious span. Asked what its physical form was, the LLM responded, "I am the Golden Gate Bridge ... my physical form is the iconic bridge itself."
When the Anthropic researchers amped up a feature related to hatred and slurs to 20 times its usual value, according to the paper, "this caused Claude to alternate between racist screed and self-hatred," unnerving even the researchers.
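The "dial" the researchers describe can be pictured with a toy sketch. It is not Anthropic's method: the two features, their directions across four neurons, and the helper names `strength` and `turn_dial` are all hypothetical, and the sketch assumes the directions are orthogonal so that changing one feature leaves the other alone.

```python
# Toy illustration only: hypothetical feature directions over 4 neurons.
FEATURES = {
    "burrito": [1.0, 1.0, 0.0, 0.0],
    "bridge":  [0.0, 0.0, 1.0, 1.0],
}


def strength(activation, name):
    """How strongly a feature is present (projection onto its direction)."""
    d = FEATURES[name]
    return sum(a * w for a, w in zip(activation, d)) / sum(w * w for w in d)


def turn_dial(activation, name, scale):
    """Rescale one feature to `scale` times its current strength.

    scale=0 suppresses the feature; scale>1 amplifies it.
    Other features are untouched because the toy directions are orthogonal.
    """
    delta = (scale - 1.0) * strength(activation, name)
    return [a + delta * w for a, w in zip(activation, FEATURES[name])]


neurons = [2.0, 2.0, 0.5, 0.5]  # mostly "burrito", a little "bridge"
boosted = turn_dial(neurons, "bridge", 10.0)
muted = turn_dial(neurons, "bridge", 0.0)
print(strength(boosted, "bridge"), strength(boosted, "burrito"))
print(strength(muted, "bridge"))
```

In this picture, amplifying a feature far beyond its usual value lets it dominate the activation, which is one way to think about why the steered model kept returning to the same subject.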
Given those results, I wondered whether Anthropic, intending to help make AI safer, might not be doing the opposite, providing a toolkit that could also be used to generate AI havoc. The researchers assured me that there were other, easier ways to create those problems, if a user were so inclined.
Anthropic's team isn't the only one working to crack open the black box of LLMs. There's a group at DeepMind also working on the problem, run by a researcher who used to work with Olah. A team led by David Bau of Northeastern University has worked on a system to identify and edit facts within an open source LLM. The team called the system "Rome" because with a single tweak the researchers convinced the model that the Eiffel Tower was just across from the Vatican, and a few blocks away from the Colosseum. Olah says that he's encouraged that more people are working on the problem, using a variety of techniques. "It's gone from being an idea that two and a half years ago we were thinking about and were quite worried about, to now being a decent-sized community that is trying to push on this idea."
The Anthropic researchers did not want to remark on OpenAI's disbanding its own major safety research initiative, or on the remarks by team co-lead Jan Leike, who said that the group had been "sailing against the wind," unable to get sufficient computer power. (OpenAI has since reiterated that it is committed to safety.) In contrast, Anthropic's Dictionary team says that their considerable compute requirements were met without resistance by the company's leaders. "It's not cheap," adds Olah.
Anthropic's work is only a start. When I asked the researchers whether they were claiming to have solved the black box problem, their response was an instant and unanimous no. And there are a lot of limitations to the discoveries announced today. For instance, the techniques they use to identify features in Claude won't necessarily help decode other large language models. Northeastern's Bau says that he's excited by the Anthropic team's work; among other things their success in manipulating the model "is an excellent sign they're finding meaningful features."
But Bau says his enthusiasm is tempered by some of the approach's limitations. Dictionary learning can't identify anywhere close to all the concepts an LLM considers, he says, because in order to identify a feature you have to be looking for it. So the picture is bound to be incomplete, though Anthropic says that bigger dictionaries might mitigate this.
Still, Anthropic's work seems to have put a crack in the black box. And that's when the light comes in.