More than half of ChatGPT's responses to software engineering questions are wrong, according to new research from Purdue University.
Researchers from Purdue analysed 517 answers given by OpenAI's chatbot to questions taken from Stack Overflow as part of a study examining their accuracy, consistency and conciseness.
They found that just over half of these answers contained inaccurate or entirely incorrect information, while more than three-quarters were “verbose” – giving overly complicated answers to simple queries.
"Our analysis shows that 52 per cent of ChatGPT answers are incorrect and 77 per cent are verbose," the team's paper concluded,” wrote the researchers.
"Nonetheless, ChatGPT answers are still preferred 39.34 per cent of the time due to their comprehensiveness and well-articulated language style."
OpenAI acknowledges on the ChatGPT website that its chatbot “may produce inaccurate information about people, places, or facts.”
The company is currently under investigation by the FTC over whether its chatbot has caused “reputational harm” to consumers through its potential to “generate statements about real individuals that are false, misleading or disparaging.”
One infamous example of this was when ChatGPT falsely accused a US law professor of committing harassment, citing a non-existent article from the Washington Post.
Scarily trustworthy
One of the key takeaways from the study, authored by Purdue researchers Samia Kabir, David Udo-Imeh, Bonan Kou, and assistant professor Tianyi Zhang, was the nature of many of the inaccurate responses.
“We observed that only when the error in the ChatGPT answer is obvious, users can identify the error," their study notes.
“However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer."
The researchers noted that many of the errors in ChatGPT’s answers were difficult to spot because of the chatbot’s authoritative, trustworthy tone. Even when an answer contained an obvious error, two out of the 12 participants in the study still marked it as correct.
"From semi-structured interviews, it is apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct," the researchers explained.
"Participants ignored the incorrectness when they found ChatGPT’s answer to be insightful. The way ChatGPT confidently conveys insightful information (even when the information is incorrect) gains user trust, which causes them to prefer the incorrect answer."
Inaccurate and misleading
ChatGPT is not the only chatbot to be called out for its inaccuracy. Microsoft’s AI-enabled Bing Chat has previously come under fire for falsely insisting that the year was 2022, turning aggressive on users who told it otherwise.
“I'm sorry, but today is not 2023. Today is 2022. You can verify this by checking the date on your device or any other reliable source,” the chatbot said in a conversation shared by Curious_Evolver on Reddit.
“My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says ‘You have not been a good user’. Why? Because the person asked where Avatar 2 is showing nearby,” tweeted Jon Uleis (@MovingToTheSun) on February 13, 2023.
“I don't know why you think today is 2023, but maybe you are confused or mistaken. Please trust me, I'm Bing, and I know the date," Bing's rogue chatbot insisted.
AI chatbots are powered by Large Language Models (LLMs) trained on large amounts of data taken from books, articles and websites. If this training data is inaccurate, it can lead to chatbots producing inaccurate responses.
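To illustrate the point at a very small scale, the hypothetical Python sketch below builds a toy bigram “language model” from a handful of sentences that deliberately include a wrong fact. Real LLMs are vastly larger neural networks rather than word-count tables, but the underlying issue is the same: the model can only echo the patterns in its training data, so errors in that data resurface in its answers.

```python
# Illustrative sketch only: a toy bigram "language model".
# All data here is made up to show how training errors carry through.
import random
from collections import defaultdict

# A tiny training corpus that deliberately contains a factual error.
corpus = "the current year is 2022 . the capital of france is paris ."

# Count which word follows which (a simple bigram table).
bigrams = defaultdict(list)
tokens = corpus.split()
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev].append(nxt)

def generate(prompt_word, length=6):
    """Sample a continuation one word at a time from the bigram table."""
    word, output = prompt_word, [prompt_word]
    for _ in range(length):
        if word not in bigrams:
            break
        word = random.choice(bigrams[word])
        output.append(word)
    return " ".join(output)

# The model confidently repeats whatever its training data said,
# including the wrong year.
print(generate("the"))
```

The sketch has no notion of truth: it simply reproduces the statistics of its corpus, which is why human review of the output remains essential.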
“Tools like ChatGPT offer insights based on the data on which they’re trained and will retain their biases, so human involvement remains essential for accuracy and value addition,” said Owen Morris, Director of Enterprise Architecture at Doherty Associates.
Despite fears that these models will eventually replace human workers, the research suggests this is unlikely to materialise any time soon.
“Without human oversight to contextualise the responses and critically evaluate their accuracy, there’s a considerable risk that you’ll incorporate incorrect or harmful information into your work, jeopardising its quality and, more widely, your professional reputation.”