While artificial intelligence continues to deliver groundbreaking tools that simplify various aspects of human life, the issue of hallucination remains a persistent and growing concern.
According to IBM, hallucination in AI is “a phenomenon where, in a large language model (LLM)—often a generative AI chatbot or computer vision tool—the system perceives patterns or objects that are non-existent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.”
OpenAI’s technical report on its latest models—o3 and o4-mini—reveals that these systems are more prone to hallucinations than earlier versions such as o1, o1-mini, and o3-mini, or even the “non-reasoning” model GPT-4o.
To evaluate hallucination tendencies, OpenAI used PersonQA, a benchmark designed to assess how accurately models respond to factual, person-related queries.
“PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers,” the report notes.
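To make the metric concrete, here is a minimal sketch of how a PersonQA-style hallucination rate could be tallied. OpenAI has not published the benchmark's scoring code, so the record format and the field names (`attempted`, `correct`) are assumptions for illustration only; the key point is that accuracy and hallucination rate are computed over attempted answers rather than all queries.

```python
# Hypothetical sketch of scoring a PersonQA-style evaluation.
# Field names and record structure are assumptions, not OpenAI's actual format.
from dataclasses import dataclass


@dataclass
class EvalRecord:
    attempted: bool  # did the model answer rather than abstain?
    correct: bool    # did the answer match the publicly available fact?


def score(records: list[EvalRecord]) -> dict[str, float]:
    attempted = [r for r in records if r.attempted]
    if not attempted:
        return {"accuracy": 0.0, "hallucination_rate": 0.0}
    correct = sum(r.correct for r in attempted)
    return {
        # accuracy is measured only over attempted answers
        "accuracy": correct / len(attempted),
        # a hallucination here is an attempted answer that is wrong
        "hallucination_rate": (len(attempted) - correct) / len(attempted),
    }


if __name__ == "__main__":
    demo = [EvalRecord(True, True), EvalRecord(True, False), EvalRecord(False, False)]
    print(score(demo))  # {'accuracy': 0.5, 'hallucination_rate': 0.5}
```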
The findings are significant: the o3 model hallucinated on 33% of PersonQA queries—roughly double the rates recorded by o1 (16%) and o3-mini (14.8%). The o4-mini model performed even worse, hallucinating 48% of the time.
Despite the results, OpenAI did not offer a definitive explanation for the increase in hallucinations. Instead, it stated that “more research” is needed to understand the anomaly. If larger and more capable reasoning models continue to exhibit increased hallucination rates, the challenge of mitigating such errors may only intensify.
“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” OpenAI spokesperson Niko Felix told TechCrunch.