Academic study proves most humans aren't good at prompting generative AI

Published on: July 29, 2024

Categories: prompt engineering, LLM technology

The Promise (and Peril) of Prompt Engineering

The "pre-train, prompt, predict" paradigm enabled by LLMs seems deceptively simple. Theoretically, you should be able to create a conversational AI system just by giving natural language instructions to a pre-trained model. No coding or technical expertise required! It's like having an incredibly knowledgeable, keen and tireless (but also slightly confused) intern who can tackle any task... if only you can figure out how to explain it with the mots justes.

That's the kicker – natural language is phenomenally versatile and nuanced, and every choice you make in your prompt wording impacts the output. Crafting prompts that reliably produce quality results is far from trivial. Even seasoned LLM experts rely on relentless trial and error. The best approach is to develop clear, explicit and tested prompt templates for re-use and refinement as you evaluate the results. That effort is 100% worth it.
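To make that concrete, here's a minimal sketch of what a reusable, explicit prompt template can look like (Python, purely illustrative – the task, wording and names are my own, not anything from the study or from LLM Beefer Upper):

```python
from string import Template

# A reusable prompt template: explicit role, task, constraints and output
# format, with placeholders filled in at call time. The wording is illustrative
# only – refine it as you evaluate real outputs.
SUMMARY_TEMPLATE = Template(
    "You are a careful technical editor.\n"
    "Task: summarise the text below in exactly $n_bullets bullet points.\n"
    "Constraints: plain language, no marketing tone, keep all numbers as-is.\n"
    "Output format: a bullet list and nothing else.\n\n"
    "Text:\n$source_text"
)

def build_summary_prompt(source_text: str, n_bullets: int = 3) -> str:
    """Fill the template so every run uses the same tested wording."""
    return SUMMARY_TEMPLATE.substitute(source_text=source_text, n_bullets=n_bullets)

print(build_summary_prompt("LLMs are highly sensitive to prompt wording...", n_bullets=2))
```

Because the template is a single named object, you can version it, tweak one line at a time and compare outputs, instead of re-typing slightly different prompts from memory every session.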

To explore how novices approach prompt design, the researchers behind a paper condescendingly entitled "Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts" created a no-code chatbot design tool called BotDesigner. They then observed how 10 participants with no substantial prompt engineering experience tackled a chatbot design task. The results were disappointing, to put it diplomatically. To be fair, the paper was published in 2023, and by now most of the population is no doubt aware of at least a few basic strategies to improve outputs. But most people still rely on lazily written prompts, are then shocked and appalled by the lazily written outputs the AI produces, and just assume that's the best it's capable of.

The "Poke It and See" School of Prompt Design

When given an open-ended prompt design task, study participants almost exclusively took an ad hoc, opportunistic approach. Their typical workflow looked something like this:

1. Start a conversation with the baseline chatbot until encountering an "error" (e.g. the bot being overly terse or failing to confirm a completed step)

2. Modify the prompt instructions to try to fix that specific issue (e.g. "Be more conversational" or "Confirm each step is completed")

3. Test the modified prompt, usually with just a single conversation

4. If successful, move on to the next issue. If unsuccessful, try tweaking the prompt further.

This "poke it and see" strategy actually aligns with prior research suggesting that non-experts tend to debug software in a haphazard rather than systematic way. That can work a lot of the time to be fair, but it's clearly suboptimal.

Why humans suck at prompting by default

The study identified several common stumbling blocks for novice users of generative AI:

1. Jumping to conclusions: Participants frequently made sweeping judgements based on single observations. If a prompt change fixed an issue in one specific context, they assumed it would work universally. Conversely, if an attempted fix didn't work immediately, they often gave up on that approach entirely. This reflects a deeper problem: generative AI is fundamentally non-deterministic, in stark contrast to every other piece of IT we've grown up with. A chatbot sits on your monitor, running on a computer, and we're so used to computers following predictable, rules-based logic that we can scarcely conceive of this new paradigm of unpredictable generative outputs.

2. Treating AI like a human: Many participants approached prompt design as if they were giving instructions to a human assistant. This led to some amusing (and frustrating) issues:

- Mixing commands for the bot with instructions meant for the end user, expecting the bot to magically know the difference.

- Avoiding repetition in prompts, despite its benefits (repeating key instructions at both the beginning and the end of a prompt is especially effective). Unlike humans, AIs don't get annoyed if you repeat yourself.

- Expecting the chatbot to remember previous conversations and apply feedback given during chat. This is one of the most common things I've seen as well – in any office, each team has its own local lingo that feels natural because they've lived it, and they forget that the AI lives in a totally different universe and has no idea what the hell you're talking about. It needs to be given all the context it requires, explicitly.

3. Preferring instructions over examples: Participants strongly preferred giving explicit instructions (e.g. "Tell jokes") over providing examples, even when shown that examples are highly effective. Providing examples is the one-shot or few-shot strategy, as opposed to the default zero-shot one – one of the most important lessons in prompting is that LLMs are "naturally" far better at learning from real examples than from instructions (see the sketch after this list). For a human analogy, it's like telling a novice dancer "Just dance better" rather than demonstrating the steps. Some users in the study bizarrely felt that using examples was "cheating" or wouldn't generalise beyond the specific scenario.

4. One and done testing: While participants could usually generate prompts that produced their intended outputs, they struggled to evaluate the quality of their prompt changes in any systematic way. Most were satisfied after seeing a single successful output, without considering edge cases.
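To make the zero-shot vs few-shot contrast from point 3 concrete, here's a minimal sketch – the classification task and example pairs are invented purely for illustration:

```python
# Zero-shot: instructions only. Few-shot: the same instructions plus a handful
# of worked examples. The task and example pairs here are invented.
instruction = "Classify the customer message as 'refund', 'shipping' or 'other'."
query = "My parcel never arrived."

zero_shot_prompt = f'{instruction}\n\nMessage: "{query}"\nLabel:'

few_shot_examples = [
    ("I want my money back, the blender arrived broken.", "refund"),
    ("When will my order reach Berlin?", "shipping"),
    ("Do you sell gift cards?", "other"),
]

# The examples show the model the exact format and granularity we want,
# which zero-shot instructions alone often fail to pin down.
few_shot_prompt = (
    instruction
    + "\n\n"
    + "\n".join(f'Message: "{msg}"\nLabel: {label}' for msg, label in few_shot_examples)
    + f'\n\nMessage: "{query}"\nLabel:'
)

print(few_shot_prompt)
```

Same model, same question – but the few-shot version pins down the output format and the decision boundaries by demonstration rather than description.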

Making Prompt Engineering Less Frustrating

These findings have important implications for the design of tools aimed at helping non-experts get the most out of generative AI:

1. Set realistic expectations: Tools need to clearly communicate LLM capabilities and limitations. No, your chatbot won't suddenly develop common sense or a sparkling personality just because you ask nicely.

2. Encourage thorough testing: Make it easy (and fun!) to evaluate prompts across multiple scenarios – see the sketch after this list for what that can look like.

3. Highlight the power of examples: Counter the bias towards direct instruction by showcasing how effective well-chosen examples can be.

4. Provide templates and best practices: Help novices avoid common pitfalls with pre-built prompt structures and clear guidelines.

5. Support iterative refinement: Include features that make it easy to track changes and prevent unintended regressions.
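As a rough illustration of point 2, evaluating a prompt change against a small battery of scenarios rather than a single chat might look something like this – the cooking-bot scenarios, the crude substring checks and the `call_llm` callable are all my own assumptions, not a real test harness:

```python
from typing import Callable

# A tiny scenario suite for a (hypothetical) cooking-assistant prompt. Each
# entry pairs a user message with a substring we expect somewhere in the reply.
# Substring matching is a deliberately crude stand-in for real evaluation.
TEST_CASES = [
    ("I've finished chopping the onions.", "next step"),
    ("Can I use a blender instead of a whisk?", "blender"),
    ("What was the last step again?", "onion"),
]

def evaluate(system_prompt: str, call_llm: Callable[[str], str]) -> float:
    """Return the fraction of scenarios the prompt handles, not just one chat."""
    passed = 0
    for user_msg, expected in TEST_CASES:
        reply = call_llm(f"{system_prompt}\n\nUser: {user_msg}\nAssistant:")
        passed += int(expected.lower() in reply.lower())
    return passed / len(TEST_CASES)
```

Run it before and after each prompt tweak (pass in whatever function calls your provider) and you immediately see whether a "fix" for one scenario quietly broke another.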

How LLM Beefer Upper Can Help

This is exactly where the LLM Beefer Upper app comes in handy. It's designed to streamline and simplify the implementation of advanced prompting techniques, making them accessible to everyone regardless of technical expertise. Here's how it can address some of the key challenges:

- Multi-agent templates: The app automates the process of creating a series of prompts that guide the AI through a step-by-step reasoning process. It's like having a team of AI experts collaborating to solve your problem (see the sketch after this list for the general shape of such a chain). You can browse our pre-built templates for examples of proven approaches, and once you add one it's yours to edit however you wish (when creating a new template you also have the option of getting Claude 3.5 Sonnet to generate it for you to fit LLM Beefer Upper's proven model).

- Built-in critique and refinement: The app includes stages for fact-checking, improvement suggestions, and final polishing – helping you avoid those "one and done" mistakes.

- Example-based learning: The app makes it easy to incorporate effective examples into your prompts, showing you how they can dramatically improve results.
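For the general shape of such a chain (not LLM Beefer Upper's actual code – the stage wording and the `call_llm` callable here are my own placeholders), a draft, critique and refine pipeline can be outlined like this:

```python
from typing import Callable

# Outline of a staged draft -> critique -> refine chain: each stage is its own
# prompt, and later stages see the earlier outputs. Stage wording is illustrative.
def run_chain(user_request: str, call_llm: Callable[[str], str]) -> str:
    """Run three prompts in sequence and return the final, polished answer."""
    draft = call_llm(
        f"Answer the following request as thoroughly as you can:\n{user_request}"
    )
    critique = call_llm(
        "List factual errors, gaps and unclear points in this draft. Be specific.\n\n"
        f"Draft:\n{draft}"
    )
    final = call_llm(
        "Rewrite the draft so that every issue raised in the critique is fixed.\n\n"
        f"Draft:\n{draft}\n\nCritique:\n{critique}"
    )
    return final
```

Splitting the work into explicit stages is what breaks the "one and done" habit: the critique pass forces a second look at the output before you ever see the final version.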

***

Connect with me on LinkedIn and follow me on Twitter.

Back to Articles