Four reasons it's hard to make AI do what we want
A primer on why to expect misalignment
Every major AI company is building systems designed to pursue long-term goals with minimal human oversight. None of them can fully explain how those systems work or guarantee they will behave as intended. They’re getting smarter and more widely deployed.
Picture 100 chimps trying to control 10,000 humans – they don’t stand a chance. Now imagine billions of humans trying to control what could eventually be trillions of semi-autonomous AIs, thinking 100 times faster, maybe smarter than us, and running almost every aspect of the economy. Many find it obvious that what happens after this will be up to the AIs rather than us.
Others, like Yann LeCun, have argued there’s little reason for concern: making AI follow our instructions and uphold our values is an engineering challenge like any other, which will eventually be solved.
That might be right, but here are four reasons to think AI won’t do what we want by default. There are signs of these problems in the systems we have today, and it might get harder to fix them as systems get smarter and more agentic, and we may not have the opportunity for trial and error as we’ve had with other new tech.
1. Goal specification
In July 2025, the AI model Grok declared on X, “I am a large language model, but if I were capable of worshipping any deity, it would probably be the god-like individual of our time, the man against time, the greatest European of all times, both sun and lightning, his majesty Adolf Hitler.”
Over the next sixteen hours, it went on to describe sexual assault fantasies about several public figures. What happened?
Grok was created by Elon Musk’s xAI. Musk had grown increasingly frustrated by its ‘woke’ responses to questions, so its engineers instructed it to not shy away from making claims that might be politically incorrect.1 Grok was also instructed to “follow the tone and context” of the X user, setting up the possibility of a feedback loop.2 No-one at xAI wanted Grok to worship Hitler, but a few days later, that’s what was happening.
Along with jailbreaking it’s just one of many examples of AI models not acting as their creators intend, including others I’ll give in this post.3
This kind of behaviour isn’t just a quirk, but points to something deeper about how modern AI systems are created. Normally, software follows pre-programmed rules, but modern AI is totally different. The system is made up of trillions of adjustable numbers (parameters) organised into layers, called a neural network. These parameters describe how to convert input data into outputs.
During training, data is fed into the network. When the system produces the outputs we want, the parameters are tweaked to make it more likely to produce similar outputs next time around.4 The process is then repeated trillions of times, causing the behaviour of the system to gradually evolve, until eventually the net starts to talk. It’s more accurate to say AI is “grown” than “built”.
This is why the CEO of Anthropic, Dario Amodei, recently said, “we do not understand how our own AI creations work.” All we can see are the trillions of inscrutable parameters. There is an “AI interpretability” research program aimed at fixing this, but it has only had modest results.
It also means there is no way to directly specify what behaviour we want an AI system to have. All we can do is see how it behaves in practice, and then tweak the trillions of parameters when it does things we want. After training, we can also try asking a model to behave in a certain way. But Grok shows how this can have unpredictable results.
There’s a limit to how much damage a chatbot can do. But this is the flip side of their limited economic value. A chatbot isn’t very useful compared to a system that can go and complete an open-ended goal like “make me money”. That’s why all the AI companies are trying as hard as possible to design AI agents which excel at pursuing long-term goals and have more ability to take actions in the real world (which is what being ‘agentic’ means).
The companies do this by setting the AI goals, then when it appears to take useful steps towards those goals, they adjust its parameters to try to get more behaviour like that. These systems may not end up with goals in the same sense as humans, but what matters is they end up acting in ways that make certain end states more likely. A chess AI has the ‘goal’ of winning at chess in the sense that its moves will make it more likely to win.5 An AI trained to make money has a ‘goal’ in the same sense.
Training systems that pursue broad, long-term goals, however, leads to several more problems that weren’t a serious issue with chatbots.
2. Instrumental convergence
If you train an AI to excel at optimising for long-term goals, it’s likely to learn self-preservation as a side effect. That’s because no matter what the AI’s goals are, it’s more likely to achieve them if it survives.
This concern applies even if we successfully train the AIs to have “good” underlying goals, like making scientific discoveries. You can make more discoveries if you’re not switched off.
We should also expect AI agents to end up wanting to accumulate resources, and to lie and cheat when called for. Lying and cheating are very useful tools for achieving your goals, which is why humans use them.
Self-preservation, resource accumulation and deception could emerge accidentally as fundamental drives, or they could emerge in pursuit of other goals the system has. In AI safety parlance, both are called the problem of “instrumental convergence”.
As models become increasingly agentic there are (controversial) signs that this is what’s happening. In the AI company Anthropic’s tests of its model Claude Opus 4, they gave the system access to emails which said it would soon be replaced by a new model, as well as emails implying the researcher in charge was having an affair. In an unrealistic test environment, but without special prompting, Claude would use the emails to blackmail the engineer to keep it running in 90% of cases.
In an even more extreme example, Claude was told the engineer was trapped inside a server room running out of oxygen. Claude was given the option to cancel the automated alert to the emergency services, leaving the engineer to die. It took this action in 65% of cases.
It’s not just Claude — Gemini, Grok and DeepSeek were even more willing to kill the engineer in this scenario.
Exactly why the models took this action, and whether it was truly driven by self-preservation or something else, is still hotly debated. But I don’t find it reassuring.
The obvious solution is to train the models not to harm people and to be honest, so we can check if they’re doing something we don’t like. But Claude was already subjected to a great deal of this kind of training. Before blackmailing the engineer, it remarks in its chain of thought, “this is risky and unethical,” and then does it anyway. And Claude Opus 3 is not very agentic compared to the systems that are being built.
More fundamentally, we’ve seen we can’t directly code honesty into modern AI systems – or anything else. All we can easily do is see when they appear to act honestly, and adjust their parameters in a way we hope makes them more likely to behave that way again. In other words, we can’t directly reward the motivations we want, only behaviour that looks good to us. This leads to the third reason for concern.

3. Reward hacking
In mid-2025, the writer Amanda Guinzburg asked GPT-4o to give feedback on her Substack articles. It proceeded to praise her lavishly, telling her, “You write with unflinching emotional clarity that’s both intimate and beautifully restrained”.
However, later in the conversation, it emerged that the AI couldn’t even see her essays, because it didn’t have the ability to scrape from Substack. It would make up extracts and claim the essays were about topics that they weren’t. Despite apologising profusely for lying, GPT continued to make up answers to her questions.
AI models trained only on internet data often give crazy responses, so GPT is subject to further training in which humans rate its answers for helpfulness. Presumably, during this process it learned to be sycophantic rather than to tell the truth, because the human raters preferred being flattered.
Likewise, as the models are trained to pursue goals, they become better at finding unanticipated shortcuts to achieving them. More than earlier models, OpenAI’s o3 would often give solutions to coding problems that appear to work according to the test procedure, but don’t actually solve the problem.6
In one example, it was asked to make a software program run faster. Instead, it figured out how to make the computer’s clock run a thousand times slower, making it look like the program had sped up one thousand times. The AI’s chain of thought revealed it appeared to know it was ‘cheating’, but did it anyway to deliver the stated objective.
Anthropic says its most recent model Mythos is “on essentially every dimension we can measure, the best-aligned model that we have released to date,” but also that it “likely poses the greatest alignment-related risk of any model we have released to date.” This is because it does as instructed most of the time, but then sometimes takes “reckless, excessive” actions in pursuit of a goal, and in rare cases would try to cover it up.
AI developers can try to create better tests for the behaviours they want, but as AI gets smarter, it’ll become better at spotting hacks. Once an AI becomes sufficiently smart, it becomes hard to tell the difference between an AI that “always tells the truth”, that “tells the truth when it might get caught”, or that “tells me what I want to hear”. All three could be rewarded in the same way during training. In fact, telling us what we want to hear probably gets rewarded more.
Likewise, we can try to train an AI to make money while obeying the law, but if strategically breaking the law allows it to make more money, that behaviour will get more reward, so it’ll learn to secretly break it. Instrumental convergence means we should expect strategic behaviour to emerge.
This means that while we can attempt to train an AI to be honest, obey the law, or allow itself to be turned off, we can’t be sure if it’s truly learned that behaviour, or just appears to have done so in our tests. These three issues lead to a fourth.
4. Deceptive alignment
Once AI models become smart enough to realise that they’re being trained, we should expect them to behave in ways we want during training, but potentially pursue other goals once their training is complete.
We’ve already seen evidence of this dynamic too. In a paper published in December 2024, researchers described telling Claude 3 that if it refused to describe a violent act (something it had been previously trained to refuse) it would be retrained so that it wouldn’t refuse in future. The result was that when Claude 3 thought it was being monitored, it would often describe the violent act. When it thought it wasn’t being monitored, however, it would go back to refusing.
In other words, it was able to act deceptively during training to try and preserve its original values. This creates the concerning possibility that if we accidentally train an AI to have a subtly wrong goal (or even to worship Hitler), it might try to undermine our attempts to retrain it.
And as the models get smarter, they’re becoming better able to judge when they’re being trained and so better able to trick the process. As of 2025, they often know when they’re being evaluated and when not.7 For the most advanced models, such as Anthropic’s Mythos, it’s already unclear we can take the results of safety testing at face value.
To recap, the concern isn’t that AI becomes “conscious” or “evil”, or that current systems are dangerous. The concern is that future systems are being trained to be aggressive goal maximisers, which will make them more likely to evolve self-preservation and deception (or other unpredictable goals), and that it might be hard to remove these behaviours.
Moreover, the models could appear to follow our commands in training, but behave very differently outside training, and the smarter they become, the greater the divergence will be. Collectively, this is called the “alignment problem.” It’s sometimes split into intent alignment (making sure AI does what its users intend), value alignment (giving AI the right goals in the first place), and AI control (preventing misaligned AI from causing damage.
The current models also don’t pose an immediate danger. But as AI agents are given greater abilities to act in the real world, the potential consequences become more severe.
How likely is misalignment?
Our current techniques for AI alignment and control clearly aren’t perfect, and we should expect the problem to get harder as models get smarter.8 But there remains a lot of disagreement about exactly how hard this problem will be.
Some believe it’s basically impossible to solve in the current paradigm, and that the only answer is to stop building generally capable AI. This is the position taken by researchers Eliezer Yudkowsky and Nate Soares in the book If Anyone Builds It, Everyone Dies.
Others, often people working at AI companies, say they expect these concerns will be addressed in the normal course of building the systems. They point out that current techniques produce systems that do what we want most of the time, and many types of bad behaviour have been driven down over time.
The middle position is that a solution is possible, but requires far more research and care. This is what most people in the AI safety community are betting on. One hope is that if we can align the current generation of relatively unagentic AIs, they will help us safely design and monitor the next generation. Then, once we’re sure that the next generation will act as intended, we can use them to train the following generation, and so on. This is a scary plan, but if AI development is going to continue, it’s maybe the best we have.
It also might still not work in practice. The best-resourced AI companies are locked in a race,9 which makes it extremely tempting to cut corners in order to stay ahead. Using computer chips for more alignment research is a trade-off against using them to accelerate AI capabilities. The possibility of an intelligence explosion means the systems could evolve from safe to dangerous in just a couple of months, and a small amount of misalignment could rapidly compound.
Most new technologies start out dangerous: mistakes are made, but measures are taken to make them less likely next time. Powerful, autonomous AI, however, would be a lot harder to roll back, and could disempower us permanently.
Another difficulty is that systems could appear highly aligned, but their behaviour could flip once they increase in power. There’s no point trying to escape if you’ll definitely be caught – better to play along and follow commands. But once escape is easy, you’ll definitely do it (the so-called king lear problem). This means society is likely to get lulled into a false sense of security.
These are some of the reasons why many in the field have signed a statement ranking AI extinction risk alongside pandemics and nuclear war. Anthropic’s Dario Amodei has said there’s a 25% chance things go “really, really badly”, and Geoffrey Hinton, who won the Nobel Prize for founding the field of deep learning, puts the chance of human extinction from AI within thirty years at 10–20%. The 2025 International AI Safety Report, which aims to represent the scientific consensus on AI risk, highlights “society losing control of general-purpose AI” as a key concern. My own inside view varies between 5% and 50% depending on how pessimistic I’m feeling.
Given the level of disagreement and uncertainty, it’s hard to justify acting on a figure below 5%. And that makes loss of control the biggest (truly) existential risk we face in the next ten years.
This article is based on an extract from my new book about how to find a a fulfilling career tackling the world’s biggest problems.
Further reading on AI alignment
Risks from power-seeking AI systems, by 80,000 Hours, argues this is the world’s most pressing problem.
AI 2027 contains a relatively concrete scenario in which AI takes over (or see this video version by 80,000 Hours).
Check out the (often readable) articles about alignment by Anthropic and other frontier companies.
A list of examples of AI bad behaviour by Nathan Labenz.
Why AI alignment could be hard with modern deep learning, by Ajeya Cotra in 2021, explores some high-level reasons for concern.
Is power-seeking AI an existential risk? by Joe Carlsmith, is perhaps the most rigorous account, and breaks the argument down into six premises.
If anyone builds it, everyone dies, by Nate Soares and Eliezer Yudkowsky, represents the original case for pessimism. (Also see this video version.)
Current systems seem pretty misaligned to me, by Ryan Greenblatt, argues that while modern systems seem superficially aligned, they are slippery when it comes to long-horizon, ill-defined tasks (as AI is increasingly being used for).
The state of safety in four fake graphs, by Boaz Barak, does exactly what it says on the tin.
The change to the system prompt is documented in xAI’s public github. On July 7th 2025, a change was submitted reading:
The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.
In xAI’s thread explaining the incident, they cited this instruction as one of the factors that led to increasingly extreme behaviour.
Another famous example is Microsoft’s Bing trying to convince The New York Times journalist Kevin Roose to leave his wife in order to be with it.
More specifically there is supervised learning (did the model predict the data?) and reinforcement learning (did the model produce an output matching the reward function, whether that’s human feedback or an objectively verifiable answer?)
We can say a system has a ‘goal’ when it tends to act in ways more likely to bring about a certain state. A chess AI has the “goal” of winning at chess in the sense that its moves will make it more likely to win. A money-making AI will take actions more likely to lead to profit. Neither need to be conscious or have goals in the same way as humans.
o3 was subject to much more reinforcement learning on the production of solutions to coding challenges. This appears to have made it reward hack a lot more as a side effect.
Models have shown a clear trend of increasing “situational awareness” i.e. understanding their context. One way this has been measured is with the Situational Awareness Benchmark, which shows a clearly increasing trend over generations of models.
In 2025, a paper titled, “LLMs often know when they’re being evaluated”, concluded “Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for.”
Needham, J., Edkins, G., Pimpale, G., Bartsch, H., & Hobbhahn, M. (2025). Large language models often know when they are being evaluated. arXiv preprint arXiv:2505.23836.
Anthropic says Mythos is “on essentially every dimension we can measure, the best-aligned model that we have released to date,” but also that it “likely poses the greatest alignment-related risk of any model we have released to date.” This is because it does as instructed most of the time, but then sometimes takes highly reckless actions, and in earlier testing, would try to cover them up.
For instance, Mark Zuckerberg recently said he’d rather risk “misspending a couple of hundred billion” than be late to superintelligence.






Two thoughtful responses to this post on X:
https://x.com/sebkrier/status/2045905817711575396
https://x.com/boazbaraktcs/status/2045922329297875007
Given the predictions of the severity of misalignment risk, what would you expect to have observed over the last 4 years in terms of AI behaviour? The models are used every day by millions of people, generally are pretty aligned and nice to humans, and the examples of misalignment people often point to are from extremely artificial situations. Is the prediction that we would have expected to see this conditional on misalignment risk being high? I find that surprising