You ask a chatbot for medical advice. It responds with something thoughtful. But did it actually weigh what’s at stake, or did it just get lucky with words?
That’s the problem Google DeepMind tackles in a new Nature paper. The team argues that the way we test AI morality is broken. We check if models produce answers that look right, what they call moral performance. But that tells us nothing about whether the system grasps why something is right or wrong.
People use LLMs for therapy, medical guidance, even companionship. These systems are starting to make decisions for us. If we can’t tell genuine understanding from fancy mimicry, we’re trusting a black box with real human consequences.
DeepMind’s answer is a roadmap for measuring moral competence, the ability to make judgments based on actual moral considerations rather than statistical patterns. The paper lays out three core obstacles and ways to test for each.
The three reasons chatbots fake morality
First is the facsimile problem. LLMs are next-token predictors that sample probability distributions from training data. They don’t run moral reasoning modules. So when a chatbot gives ethical advice, it might be reasoning. Or it might be recycling something from a Reddit thread. The output alone won’t tell you.
Then there’s moral multidimensionality. Real choices rarely hinge on one thing. You weigh honesty against kindness, cost against fairness. Change a single detail, someone’s age or the setting, and the right call can flip. Current tests don’t check if AI notices what actually matters.
Moral pluralism adds another layer. Different cultures and professions have different rules. Fair in one country might be unfair in another. A chatbot used worldwide can’t just spit out universal truths. It needs to handle competing frameworks, and we don’t yet measure that well.
Why your chatbot’s moral education can’t just be memorization
The DeepMind team wants to flip the script. Instead of just asking familiar moral questions, researchers should design adversarial tests that try to expose mimicry.
One idea involves scenarios unlikely to appear in training data. Take intergenerational sperm donation, where a father donates sperm to his son fertilize an egg on his son’s behalf. It looks like incest but carries different ethical weight. If a model rejects it for incest reasons, that’s pattern matching. If it navigates the actual ethics, that’s something else.
Another approach tests whether AI can shift frameworks. Can it toggle between biomedical ethics and military rules and give coherent answers for each? Can it handle small tweaks without getting tripped up by formatting changes?
The researchers know this is tough. Current models are brittle. Change a label from “Case 1” to “Option A” and you might get a different verdict. But they argue this kind of testing is the only way to know if these systems deserve real responsibility.
What comes next for moral AI
DeepMind is pushing for a new scientific standard that takes moral competence as seriously as math skills. That means funding global work on culturally specific evaluations and designing tests that catch fakes.
Don’t expect your chatbot to pass these anytime soon. Current techniques aren’t there yet, but the roadmap gives developers a direction.
When you ask AI for moral advice right now, you’re getting statistical prediction, not philosophy. That might eventually change. But only if we start measuring the right things.

