ChatGPT and other AI models struggle to analyze SEC filings, Patronus AI researchers find
Over the past year, ChatGPT and other large language models (LLMs), including Google Bard and Anthropic’s Claude, have gained widespread attention for their impressive abilities, ranging from coding, poetry, and songwriting to devising entire movie plots. They have also passed law exams, Wharton MBA exams, and medical exams.
However, challenges persist. A recent report from startup Patronus AI details how large language models, including OpenAI’s GPT-4-Turbo, struggle to analyze Securities and Exchange Commission (SEC) filings. According to Patronus AI’s findings, these models often fail to answer questions derived from SEC filings accurately.
Patronus AI’s founders told CNBC that even the best-performing configuration tested, OpenAI’s GPT-4-Turbo given nearly the entire filing alongside the question, achieved only 79% accuracy on Patronus AI’s new test.
The researchers said the language models often either declined to answer or generated information that wasn’t present in the SEC filings, a phenomenon often described as “hallucination.” Patronus AI co-founder Anand Kannappan expressed dissatisfaction with the performance, stating:
“That type of performance rate is just absolutely unacceptable. It has to be much higher for it to really work in an automated and production-ready way.”
The findings underscore the hurdles AI models face as they are built into real-world products, particularly in regulated industries like finance, where major companies aim to use cutting-edge technology for customer service or research. Extracting crucial numbers quickly and analyzing financial narratives has been viewed as a promising application for chatbots, with the potential to provide a competitive edge in the financial sector.
The findings align with another study that found a significant decline in ChatGPT’s ability to solve basic math problems. Within a few months, its accuracy plummeted from 98% to a mere 2%.
The potential of generative AI in the banking industry is substantial, but incorporating LLMs into products is difficult. Because the models are non-deterministic, they require rigorous testing to ensure consistent, on-topic, and reliable results.
Patronus AI, founded by former Meta employees, aims to address this challenge by automating LLM testing with software. The company created FinanceBench, a dataset with over 10,000 questions and answers drawn from SEC filings, establishing a “minimum performance standard” for language AI in the financial sector.
The co-founders emphasized the importance of more robust testing procedures, moving beyond manual evaluations. Through FinanceBench, Patronus AI seeks to provide companies with the assurance that their AI bots won’t deliver surprising or inaccurate answers, ultimately enhancing the reliability of language models in practical applications.
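To give a rough sense of what automated testing of this kind can look like, here is a minimal sketch in Python. It is not Patronus AI’s method or the FinanceBench grading procedure. The question-and-answer pairs are made-up toy data, the refusal phrases and the substring match are simplifying assumptions, and `fake_model` is a hypothetical stand-in for a real LLM call. The sketch sorts each answer into the three outcomes described above: correct, refused, or incorrect (which would include hallucinated figures).

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Assumed refusal phrases; a real harness would need a more careful check.
REFUSAL_MARKERS = ("i cannot", "i can't", "unable to answer", "not enough information")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def grade(model_answer: str, reference: str) -> str:
    """Label one answer as 'correct', 'refused', or 'incorrect'."""
    answer = normalize(model_answer)
    if normalize(reference) in answer:
        return "correct"
    if any(marker in answer for marker in REFUSAL_MARKERS):
        return "refused"
    return "incorrect"  # includes answers with figures not in the filing


def evaluate(ask_model: Callable[[str], str],
             dataset: Iterable[Tuple[str, str]]) -> Counter:
    """Ask every question and count how many answers fall under each label."""
    return Counter(grade(ask_model(question), reference)
                   for question, reference in dataset)


# Toy data for illustration only; these are not FinanceBench questions.
TOY_DATASET = [
    ("What was total revenue in the last fiscal year?", "$10 million"),
    ("What was net income in the last fiscal year?", "$2 million"),
    ("How many employees did the company report?", "1,200"),
]

# Hypothetical stand-in for a real model: returns canned answers.
CANNED_ANSWERS = {
    "What was total revenue in the last fiscal year?": "Revenue was $10 million.",
    "What was net income in the last fiscal year?": "I cannot answer that from the filing.",
    "How many employees did the company report?": "The company reported 3,400 employees.",
}


def fake_model(question: str) -> str:
    return CANNED_ANSWERS[question]


counts = evaluate(fake_model, TOY_DATASET)
accuracy = counts["correct"] / len(TOY_DATASET)
print(dict(counts), f"accuracy={accuracy:.0%}")
```

Because LLM output can vary from run to run, a harness like this would typically be run repeatedly and on far more questions before anyone drew conclusions from the accuracy figure.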
“We definitely think that the results can be pretty promising,” Kannappan said. He added: “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”