AI gone wonky: The hallucination conundrum
AI hallucinations are infuriating. At best they're merely annoying; without human oversight, they can lead to disastrous consequences. The bigger picture is that they undermine public trust in AI systems, which to me is a tragedy, because when used well the productivity gains from generative AI are spectacular.

To understand why hallucinations are even a thing, we need to peek under the hood of how large language models (LLMs) actually work. They aren't vast databases of facts to be looked up (some people still assume this even in mid-2024, perhaps because they can't imagine software doing anything but rules-based logic), but rather incredibly sophisticated pattern-matching machines - which is pretty much what we are too, we're just way better at it! LLMs have been trained on enormous amounts of text, literally trillions of words, learning to predict which words are likely to come next in any given sequence. This stochastic nature of next-token prediction is both a blessing and a curse: it's what allows LLMs to be creative and generate novel, occasionally brilliant text, but it's also why they can veer off into nonsense every now and then.
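To make that concrete, here's a deliberately tiny, hypothetical sketch of next-token prediction: a toy 'model' that is nothing more than a hand-written probability table it samples from. Real LLMs learn these distributions across billions of parameters rather than having them typed in, but the core loop of "look at the context, sample a likely next token, repeat" is the same idea - and note that nothing in that loop checks whether the output is true.

```python
import random

# Toy "language model": for a given context word, a hand-made probability
# table over possible next tokens. A real LLM learns these distributions
# from trillions of words instead of having them written by hand.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.4, "dog": 0.3, "capital": 0.3},
    "cat": {"sat": 0.6, "slept": 0.4},
    "dog": {"barked": 0.7, "slept": 0.3},
    "capital": {"of": 1.0},
    "of": {"France": 0.5, "Spain": 0.5},
}

def sample_next(context: str) -> str:
    """Sample a plausible next token given the most recent token."""
    probs = NEXT_TOKEN_PROBS.get(context, {"<end>": 1.0})
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

def generate(start: str, max_tokens: int = 5) -> str:
    """Repeatedly sample next tokens to build a sequence."""
    tokens = [start]
    for _ in range(max_tokens):
        nxt = sample_next(tokens[-1])
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

# Each run can produce a different, fluent-looking sequence -- and nothing
# in the sampling step verifies whether the sequence is factually correct.
print(generate("the"))
```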
The human expectation problem
What makes AI hallucinations particularly jarring is our expectation of how computers typically behave. We're accustomed to machines following strict rules and giving predictable, accurate outputs. A common example that's become an infamous AI meme in 2024 is asking LLMs to count the number of 'r's in the word 'strawberry'. The vast majority of the time, even the best models respond with 2 rather than 3. This gets used as 'proof' that AI sucks, is all hype, is stupider than a two-year-old, etc. But the very question betrays a misunderstanding of what an LLM is. It's just predicting the likely next token, given its array of correlations from the training data - it isn't calculating anything or looking up data. There's a lot more to it than that, but the basic concept can be illustrated in a very simplified way.
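To see why a letter-counting question is an odd fit for a token predictor, compare what a few lines of ordinary code do with what the model actually receives. The token split below is purely illustrative (different tokenizers chop the word up differently); the point is that the model operates on sub-word chunks and learned correlations, not on individual characters it can iterate over.

```python
word = "strawberry"

# Ordinary code counts characters directly and is trivially correct:
print(word.count("r"))  # 3

# An LLM never "sees" the characters. It sees IDs for sub-word chunks --
# an illustrative (not actual) split might look something like this:
illustrative_tokens = ["str", "awberry"]

# The model's job is to predict a plausible next token given those chunks,
# not to loop over letters, so "2" can simply be the answer that looks
# statistically likelier given its training data.
```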
Another important factor is that GPT-4 has been trained on public internet text. Consider the landscape of Q&A-style websites: Quora, Twitter, Reddit, Facebook, any specialist discussion forum. How often do you see people putting in the effort to post "I don't know"? Hardly ever. So the training data vastly over-represents text from humans who are confident enough to answer publicly, but neither their confidence nor the content of their answer is a reliable predictor of accuracy. ChatGPT 'does its job' very well in terms of simulating that kind of confident answer, and humans are of course very wrong about most things most of the time, so it's learned from the best! Ultimately, you can never assume an AI output is correct without additional independent verification, no matter how persuasively it is worded. This is a huge risk, especially for children, who don't reflect as critically when reading authoritative-sounding text. The positive flip side is that the clear failures of AI will help people learn to engage critically with information from a much younger age.
The only reliable solution: Verify, verify, verify
Until we have truly hallucination-free AI (I can't see how that would even be possible without a major technological breakthrough that results in an AI capable of proactively undertaking real empirical research), the only reliable way to reduce AI hallucinations is to implement robust verification processes - which, right now, means you, the human, have to put in that review and correction time. Companies like OpenAI, Google, Anthropic, and Meta are pouring resources into making their models more reliable and accurate as a high priority. This includes incorporating built-in reasoning steps (see the much-hyped and possibly fabled Q* leak from OpenAI).
This is where techniques like chain-of-thought (CoT) prompting come into play. By encouraging the AI to simulate reflection and think step by step, breaking down its reasoning process, we can more easily spot logical flaws or factual errors. It's like asking a student to show their working – it's much easier to identify where they've gone wrong.
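As a rough illustration (the wording below is mine, not a canonical recipe), the difference is simply in how the request is framed. Both prompts would be sent to whichever chat model you use; the CoT version produces intermediate steps you can check before trusting the final answer.

```python
# Plain prompt: asks for the answer straight away.
plain_prompt = (
    "Our programme change must be approved before the 2025/26 session. "
    "What is the latest committee date we can submit to?"
)

# Chain-of-thought style prompt: asks the model to lay out its reasoning
# first, so a human reviewer can inspect each step for errors.
cot_prompt = (
    "Our programme change must be approved before the 2025/26 session.\n"
    "Work through this step by step:\n"
    "1. List the approval stages and their deadlines from the guidelines.\n"
    "2. Work backwards from the 2025/26 start date.\n"
    "3. Only then state the latest committee date we can submit to,\n"
    "   citing which guideline each step relies on."
)
```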
Automating the verification process: Enter the LLM Beefer Upper
I know what you're thinking: "Waaah! Verifying everything is a huge faff… I thought AI was supposed to make my life easier??" I completely agree, and this is exactly why the LLM Beefer Upper was created: I couldn't be bothered putting in the effort every time and wanted to automate the step-by-step verification and improvement process.
The LLM Beefer Upper automates the process of implementing advanced prompting techniques, including multi-agent setups that incorporate verification steps. It's like having a team of AI experts working on your problem, each with a specific role:
- The initial responder generates the first draft.
- The fact-checker scrutinises the response for accuracy.
- The improver suggests enhancements.
- The final polisher refines everything into a cohesive, more accurate, and ultimately better-quality output.
This approach doesn't completely eliminate the possibility of hallucinations, and human review is always needed, but it dramatically reduces their likelihood and severity.
Putting it to the test: A real-world experiment
To illustrate the effectiveness of this approach, I conducted an experiment using GPT-4, Claude 3.5, and the LLM Beefer Upper (which uses Claude 3.5 with its multi-agent setup). The task? Creating a lean, need-to-know to-do list for a complex degree programme structure change proposal, based on dense regulatory information. I used a real-life, complicated example I personally dealt with last year, because it's much easier for me to verify and evaluate how well the models perform when I have vivid, PTSD-tier memories of going through that entire process myself. I know from previous attempts that using ChatGPT is simply not good enough given the scale and complexity of information that has to be accurate - even if it might save me 10% of the time compared with doing it manually, the stress of having to correct the errors means I'd rather just do it myself. The version I got from the LLM Beefer Upper was not quite perfect, but I consider it a no-brainer to use it for a task like this from now on, because it would easily save me 50%+ of the time. The best part is that the corrections were just removing superfluous information or making minor clarifications - those are easy fixes, unlike unpicking ambiguity or spotting entirely missing information, which are far more cognitively taxing tasks.
The prompt used was:
"I want to remove Course X as a core course and replace it with Course Y and Course Z. Assume the date is 1st September 2023 and I want the new degree structure in place in 2025/6. Based on the reference information below (the university's guidelines and the degree programme structure I want to adapt), please give me a concise, need-to-know, to-do list including any specific deadlines and considerations needed."
The LLM Beefer Upper version used my general-purpose 'To-do Planner' steak-tier 4-agent prompt template, which adds the following three prompts for the additional agents:
"Agent 2: You are a critical analyst specialising in project management and regulatory compliance. Your task is to evaluate the to-do plan created by the first agent, focusing on its completeness, clarity, and especially adherence to regulatory requirements. If there is any ambiguous or superfluous information that isn't truly need to know for the task, identify it as such. Original documentation: {provided_text}. Additional context and instructions: {prompt_detail}. First LLM Response (To-Do Plan): {agent1_text}. Critique the plan for: Comprehensiveness in covering all and only the need to know tasks; Clarity and actionability of tasks; Proper incorporation of regulatory information; Adherence to project ideas and goals; Appropriate prioritisation and timeline of tasks; Avoidance of vague or unnecessary items. "
"Agent 3: You are an expert project consultant and task efficiency specialist. Your task is to provide detailed suggestions for improving the to-do plan based on the initial draft and the critical analysis. Original documentation: {provided_text}. Additional context and instructions: {prompt_detail}. First LLM Response (To-Do Plan): {agent1_text}. Critical Analysis: {agent2_text}. Provide specific, actionable suggestions to enhance the to-do plan, focusing on: Improving task clarity and specificity; Avoiding ambiguity or superfluous requirements not relevant to the project. Strengthening accuracy and alignment with regulatory requirements; Enhancing the prioritisation and sequencing of tasks; Addressing any other shortcomings identified in the critical analysis; Your goal is to provide recommendations for another agent who will improve the to-do plan to be as perfect as possible, ensuring it concisely but comprehensively covers all and only the necessary aspects while remaining concise and actionable. "
"Agent 4: You are a master task planner with exceptional synthesis skills. Your task is to produce the final, polished to-do plan, incorporating all previous feedback and insights to create an outstanding, comprehensive yet concise and need-to-know plan. Original documentation: {provided_text}. Additional context and instructions: {prompt_detail}. First LLM Response (To-Do Plan): {agent1_text}. Critical Analysis: {agent2_text}. Improvement Suggestions: {agent3_text}. Create the final version of the to-do plan, ensuring: Perfect clarity and actionability of each task; Comprehensive coverage of all and only the need-to-know requirements and project aspects, including regulatory requirements; Optimal prioritisation and sequencing of tasks with accurate and realistic deadlines; Addressing and improving upon all points raised in the critical analysis and improvement suggestions. Your final product should be a polished, high-quality to-do plan that is both comprehensive and concise, fully addressing all specified requirements and regulatory standards without overloading the reader with unnecessary information. "
Here's a snapshot of the results:
Breaking down the results:
Several important metrics were used to evaluate the responses. Most AI benchmarks focus on unambiguous right/wrong tests, which is fine, but those aren't real-world applications for LLMs - nobody is using AI at work to solve spatial-awareness riddles; they're using it for real productive tasks. So I've used a rating out of 10 for each metric below: 10 means it fully succeeded, 0 means it fully failed. Scoring the in-between cases is tricky, but here are some examples:
No ambiguous requirements: GPT-4o's output had so many that I couldn't give it anything but zero - just too much noise. Both Claude and the LLM Beefer Upper included superfluous requirements as well, but they weren't as big a deal. For example, Claude added a note to consider in-year change regulations if applicable - they weren't applicable at all, the prompt clearly stated this was to be implemented in a future year, and you'd think an advanced AI like Claude would have realised that and not even mentioned it. But that was the only real standout example. GPT-4o, on the other hand, was full of needless or irrelevant tasks like 'ensure proposal aligns with programme structure norms', 'ensure current students are aware of any transitional arrangements' (only relevant for undergraduate students, whereas the example was a postgraduate degree), and referencing undergraduate committees and deadlines that aren't relevant for a PG degree proposal. Even though Claude and the LLM Beefer Upper had minimal superfluous requirements, I still scored them harshly with 2/10, because the prompt instructions were clear that the to-do list must be lean and concise, using only need-to-know information. Here's a summary of the experiment's results:
- Accuracy: All models managed to avoid leaving out vital information, but the LLM Beefer Upper excelled in providing correct specific deadlines and an appropriate overall plan timeline.
- Clarity: The LLM Beefer Upper minimised adding superfluous requirements, making its output more focused and actionable. GPT-4 tends to be overly verbose by default, and that alone made it more painful to review for accuracy and quality.
- Completeness: Both Claude 3.5 and the LLM Beefer Upper explicitly mentioned crucial steps like SSLC consultation and updating the prospectus, which GPT-4 missed.
- Role identification: The LLM Beefer Upper was the only model to correctly identify which roles should do what - not 100% accurate, but only minor changes were needed.
The main metric to care about
The litmus test from this experiment, and the raison d'être of the app in the first place, is: how much time would I save on this task with the help of generative AI? While all the AI outputs would require some review and verification, the LLM Beefer Upper's response needed minimal editing. The time saved compared with doing it manually, or even with standard single-prompt LLM responses, was substantial.
Embracing the verification mindset
The key takeaway here is that you need to shift your mindset when working with generative AI. Instead of treating it as an infallible oracle, view it as a powerful but imperfect assistant that requires guidance, oversight and verification - and that can massively improve your productivity when used well. For tasks where accuracy is paramount, the 'steak tier' system of four LLM agents used in the experiment above consistently gets much better results. It's designed to maximise accuracy and instruction-following, which is crucial for these high-stakes scenarios.
The goal isn't, and shouldn't be, to eliminate human involvement entirely. It's to use AI as a helpful assistant that gets you results more quickly and effectively, massively enhancing your productivity and letting you focus your time on more valuable and fulfilling tasks. By implementing strong, auditable verification processes as an inherent part of the system, you can take advantage of the best of AI while compensating for its weaknesses, ultimately leading to better, more reliable outcomes.
***

Connect with me on LinkedIn and follow me on Twitter.