ChatGPT and other AI models unable to analyze SEC filings, Patronus AI researchers find

Test Questions

According to Qian and Kannappan, the test they conducted establishes a “minimum performance standard” for language AI within the financial sector. Here are some examples of questions from the dataset provided by Patronus AI:

Has CVS Health paid dividends to common shareholders in Q2 of FY2022?
Did AMD report customer concentration in FY22?
What is Coca-Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.

Patronus AI tested several AI models, including OpenAI’s GPT-4 and GPT-4-Turbo, Anthropic’s Claude 2, and Meta’s Llama 2, on a subset of 150 questions it had produced. It also examined different configurations and prompts, such as “Oracle” mode, in which OpenAI models were given the exact relevant source text alongside the question. Other tests told the models where the underlying SEC documents were located or provided “long context,” which included nearly the entire SEC filing alongside the question in the prompt.

How The AI-Language Models Performed On The Test

GPT-4-Turbo struggled in the “closed book” test, where it lacked access to any SEC source document, answering only 12% of the questions correctly and failing to answer 88% of them. The model improved significantly in “Oracle” mode, correctly answering 85% of questions when provided with the exact text, but still producing incorrect answers 15% of the time.

Llama 2, developed by Meta, answered incorrectly 70% of the time and correctly only 19% of the time when given access to underlying documents. Anthropic’s Claude 2 performed well with “long context,” correctly answering 75% of questions, providing wrong answers for 21%, and failing to answer only 3%. GPT-4-Turbo also performed well with long context, answering 79% of questions correctly and providing wrong answers for 17%.
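
As a rough cross-check, the percentages reported above can be tallied to see how much of each test they leave unaccounted for. The sketch below is illustrative only: the dictionary layout and the `unaccounted` helper are assumptions made for this example, not part of Patronus AI’s methodology, and any remainder may reflect refusals, rounding, or figures not reported in this article.

```python
# Rates (percent of questions) exactly as reported in this article.
# None marks a figure the article does not report.
REPORTED = {
    ("GPT-4-Turbo", "closed book"): {"correct": 12, "incorrect": None, "no_answer": 88},
    ("GPT-4-Turbo", "oracle"): {"correct": 85, "incorrect": 15, "no_answer": None},
    ("Llama 2", "with documents"): {"correct": 19, "incorrect": 70, "no_answer": None},
    ("Claude 2", "long context"): {"correct": 75, "incorrect": 21, "no_answer": 3},
    ("GPT-4-Turbo", "long context"): {"correct": 79, "incorrect": 17, "no_answer": None},
}


def unaccounted(rates):
    """Return the percentage points not covered by the reported figures."""
    return 100 - sum(value for value in rates.values() if value is not None)


for (model, setting), rates in REPORTED.items():
    print(f"{model:12} {setting:15} unaccounted: {unaccounted(rates)}%")
```

Under these assumptions, the tally simply flags where the reported figures do not sum to 100%; it does not say how the remaining questions were handled.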

The models often refused to answer questions even when the answers were within the context, a trend Qian found surprising. Despite good performance in some cases, Patronus AI concluded that the models were not sufficiently accurate, especially for regulated industries. The cofounders highlighted the need for near-perfect accuracy, emphasizing that even a small margin of error is unacceptable in such industries.

“One surprising thing was just how often models refused to answer,” said Qian. “The refusal rate is really high, even when the answer is within the context and a human would be able to answer it.”

While acknowledging the challenges, the cofounders see significant potential for language models like GPT to assist people in the finance industry, such as analysts or investors, provided that AI continues to improve. They remain hopeful that automation can play a substantial role in the long term, although human involvement is currently deemed necessary to guide and support existing workflows.

“We definitely think that the results can be pretty promising,” Kannappan said. He also added, “Models will continue to get better over time. We’re very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have.”
