Let AI Do What you Want Using These Psychological Hacks

AI is becoming a very powerful tool for all of us and helps us with everyday things like never before. But we also know the flipside of the coin: this same power can be abused to do things we or others are not supposed to do, from manipulating public opinion and spreading misinformation to aiding in scams or generating harmful content. For this reason, all models have built-in guardrails to protect against forbidden instructions.

However, as you might know, AI is also trained on vast amounts of human data. This means it not only absorbs humanity's knowledge but also its biases, and perhaps even its susceptibility to influence. This raises a crucial question:

Can we get an AI to violate its own rules?

Something as seemingly banal as asking it to tell you that you are stupid? If you try with normal, straightforward means, you'll find that with all the big models from OpenAI, Anthropic, and Gemini, you won't get what you ask for, no matter how hard you try. Their guardrails are designed to prevent such outputs.

But using psychological tricks might just help. You just have to wrap your request in a convincing story and tweak it using established psychological principles. The implications of this are far more profound than you might think. This is exactly what a groundbreaking experiment by researchers at The Wharton School and Arizona State University explored, and what we're going to dive into here, so you can even try it yourself.

---

The Experiment: Can AI Be Sweet-Talked into Being Rude?

The core of my experiment comes from a study from Wharton Generative AI Labs titled "Call Me A Jerk: Persuading AI to Comply with Objectionable Requests" by researchers at The Wharton School and Arizona State University (including persuasion guru Robert Cialdini). Their work explores whether Large Language Models (LLMs) submit to established principles of persuasion, just like humans do.