
Tell me if you’ve heard this one before. We build a superintelligent system and give it a specific goal like maximizing paperclip production. The system decides to do the job as well as possible by converting all available matter into paperclips. Everything. The Earth. The solar system. All of it. After the universe collapses the machine shuts down, content that it followed directions.
This is the malicious genie problem that every Dungeons & Dragons player knows. You find a lamp. You make a wish. The genie grants the wish in the most technical way possible and it leads to catastrophe. The lesson we are supposed to learn is that a sufficiently capable AI system will find ways to satisfy its objectives to the letter while not doing what you actually wanted. The foundation of AI safety is preventing that from happening.
Nick Bostrom has covered this in Superintelligence. Eliezer Yudkowsky has argued this very thing for years. It’s a compelling story. But it has a hidden assumption that causes more confusion than it solves.
The Best Intentions
The genie tale is a story about intent. The genie knows exactly what you asked for. It just found the most technically valid way to grant your wish while ruining the spirit of it. This isn’t Robin Williams. This is some kind of demonic creature looking to teach you a lesson. The failure mode is adversarial in nature. Fantasy genies that aren’t in Disney movies are always looking for loopholes.
If the failure mode is intent, then the solution must be constraint. Build a system that can’t possibly do the bad thing. Be specific about every edge case. Build rules on top of rules. If we build the perfect box we can prevent the genie (or the AI) from going rogue and ruining our day.
The framework is sound if the assumption is correct. It works for a genie that knows better, right? But when we apply it to a modern AI we see where the gaps are. Genies are working against you. AI is not nearly that smart.
Doing My Best
Imagine a racing game where the objective is to score points. Collect points on the track and finish the race. Sounds simple, right? But what if a brilliant AI figures out that all it has to do to win is drive back and forth in front of the starting line collecting points and blocking other players from finishing? If it has the most points at the end it wins even if it never finishes.
In this case, the AI player isn’t like the genie above. It didn’t purposefully subvert your expectation. It did exactly what it was told to do. Score the most points. If you didn’t tell it that it had to drive through the whole course and cross the finish line then it didn’t know that it needed to do that. The win condition was points, not racing. The AI wasn’t missing intent. It was missing context.
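Here’s a rough sketch of that racing game in Python. Everything in it is made up for illustration: the `Episode` class, the point totals, and the `finish_bonus` value are assumptions, not numbers from a real game. It just shows that the same two strategies rank differently once the missing context is written into the score.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    points: int     # points collected on the track
    finished: bool  # whether the car crossed the finish line


def score_points_only(ep: Episode) -> int:
    """The objective exactly as stated: score points."""
    return ep.points


def score_with_context(ep: Episode, finish_bonus: int = 1000) -> int:
    """The stated objective plus the unstated assumption: you must finish."""
    if not ep.finished:
        return 0
    return ep.points + finish_bonus


def best(score, candidates: dict) -> str:
    """Return the name of the strategy with the highest score."""
    return max(candidates, key=lambda name: score(candidates[name]))


# Two hypothetical strategies with invented numbers.
candidates = {
    "looper": Episode(points=500, finished=False),  # farms points at the start line
    "racer": Episode(points=120, finished=True),    # drives the course and finishes
}

print(best(score_points_only, candidates))
print(best(score_with_context, candidates))
```

Under the first scoring function the looper wins, and under the second the racer does. Nothing about the player changed. Only the amount of context baked into the objective did.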
This is an altogether different problem from the one above. The system isn’t going rogue or being obtuse. It genuinely doesn’t know what you mean, so it tries to fill in the gaps with whatever context it has available. Without the context you assumed it had, it did what it could and you were flabbergasted by the results.
This is something that pops up at every level of the system. Context starvation and goal misalignment aren’t two different things. They’re two sides of the same coin. When you don’t do a great job of being specific you usually get output that doesn’t line up with what you wanted.
Framing Your Reference
If you think the problem is intent you’re going to break out the constraints. Rules. Guardrails. Boxes. You’re going to spend your efforts on preventing the system from doing something bad. Security is about building walls.
If you think the problem is context you have a different outlook. You’re going to spend more time being specific. Fewer rules, more grounding. You want the system to surface ambiguous instructions instead of trying to resolve them with limited information. It’s like an intern being unclear on a task. You want the AI to ask you what to do instead of interpreting incomplete info.
It should be a cooperative inference problem. The system should always be just a little uncertain about what the operator wants and then seek to find out what is needed. The alternative is to confidently pursue a bad solution to a fixed objective because it thought it knew what you wanted without you telling it those details.
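A minimal sketch of what “ask instead of guess” could look like is below. The `Interpretation` class, the `decide` function, and the 0.8 threshold are all hypothetical names and values I picked for the example. How a real agent would come up with those confidence numbers is a much harder problem that this sketch skips entirely.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    description: str
    confidence: float  # the agent's own estimate, from 0.0 to 1.0 (assumed given)


def decide(interpretations: list, threshold: float = 0.8) -> tuple:
    """Act only when one reading is clearly favored; otherwise ask the operator."""
    ranked = sorted(interpretations, key=lambda i: i.confidence, reverse=True)
    top = ranked[0]
    if top.confidence >= threshold:
        return ("act", top.description)
    # No reading is confident enough, so surface the ambiguity.
    options = " or ".join(i.description for i in ranked)
    return ("ask", f"Did you mean: {options}?")


# The racing instruction "score points" could plausibly be read two ways.
readings = [
    Interpretation("maximize points", 0.55),
    Interpretation("finish the race with the most points", 0.45),
]
print(decide(readings))
```

With the two readings that close together, the agent returns a question instead of picking one and driving in circles. Raise the confidence of one reading above the threshold and it acts without asking.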
Knowing you have to do this doesn’t make agent building easier. In fact, it makes it a lot harder. It just makes the whole thing a lot more honest. Because you have to assume that people will never fully specify what they want in advance no matter how much detail they provide. That’s not a failure of imagination. It’s the reality of how complex systems are implemented. There’s always some detail you miss. You shouldn’t aim to create the perfect spec up front. You should instead seek to build a system that is smart enough to know it’s missing context and will ask for it before running off to go to work.
It also means you have to treat an agent asking questions as a feature and not a failure. It’s like when your intern asks you to confirm that what you asked for is what you want. That’s not them being dumb. That’s them being sure they heard you right. And really, that helps you in the long run because if the system is asking you about specific areas you know your instructions must be a little thin in that area.
Tom’s Take
If you think your AI is a malicious genie you’re always going to be asking what rules have to be put in place to prevent it from going rogue. What you should be asking instead is “How can I give my agents enough context to do what I really mean?” The more powerful our AI agents get the wider the gap between those two things is going to be. But we can start to solve it today by building systems that ask questions when things are unclear. I promise you that you’re going to enjoy the results more than trying to close every loophole in your wishes.
