Robert W Malone, MD @RWMaloneMD - Really important study. Buyer (or author) beware. AI “hallucinations” are a major problem.
Quote:
Abdul Shakoor @abxxai
BREAKING: Someone just tested 35 AI models across 172 billion tokens of real document questions.
The hallucination numbers should end the "just give it the documents" argument forever.
Here is what the data actually showed.
The best model in the entire study, under perfect conditions, fabricated answers 1.19% of the time. That sounds small until you realize it is the floor: the absolute best case, under optimal settings that almost no real deployment uses.
Typical top models sit at 5 to 7% fabrication on document Q&A. Not on questions from memory. Not on abstract reasoning. On questions where the answer is sitting right there in the document in front of it.
The median across all 35 models tested was around 25%.
One in four answers fabricated, even with the source material provided.
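To make the metric concrete, here is a minimal sketch of how a fabrication rate on document Q&A could be computed. Everything named here is hypothetical: `model.ask` stands in for whatever API the study used, and the naive substring support check is a crude stand-in for their grading method, which the thread does not describe.

```python
# Hypothetical sketch: fabrication rate on document-grounded Q&A.

def is_supported(answer: str, document: str) -> bool:
    """Naive grounding check: every answer sentence must appear
    (case-insensitively) somewhere in the source document."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    doc = document.lower()
    return all(s.lower() in doc for s in sentences)

def fabrication_rate(model, qa_pairs) -> float:
    """Fraction of answers not supported by the provided document.
    `qa_pairs` is a list of (document, question) tuples."""
    fabricated = 0
    for document, question in qa_pairs:
        answer = model.ask(document, question)  # hypothetical API
        if not is_supported(answer, document):
            fabricated += 1
    return fabricated / len(qa_pairs)
```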
Then they tested what happens when you extend the context window. Every company selling 128K and 200K context as the hallucination solution needs to read this part carefully.
At 200K context length, every single model in the study exceeded 10% hallucination. The rate nearly tripled compared to optimal shorter contexts.
The longer the context window, the worse the fabrication. The exact feature being sold as the fix is making the problem significantly worse.
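A context-length sweep like the 200K finding could be run by padding each document with unrelated distractor text, as in this sketch. It reuses the hypothetical `fabrication_rate` above; padding with distractors is just one common construction, and the study's actual setup is not specified in the thread.

```python
# Hypothetical sketch: measure fabrication rate as context grows.

def pad_to_length(document: str, distractors: list[str], target_tokens: int,
                  tokens=lambda s: len(s.split())) -> str:
    """Append unrelated distractor passages until the context reaches
    roughly target_tokens (whitespace tokens as a crude proxy)."""
    parts = [document]
    i = 0
    while tokens(" ".join(parts)) < target_tokens and i < len(distractors):
        parts.append(distractors[i])
        i += 1
    return "\n\n".join(parts)

def sweep(model, qa_pairs, distractors,
          lengths=(8_000, 32_000, 128_000, 200_000)):
    """Fabrication rate at each context length."""
    results = {}
    for n in lengths:
        padded = [(pad_to_length(doc, distractors, n), q)
                  for doc, q in qa_pairs]
        results[n] = fabrication_rate(model, padded)
    return results
```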
There is one more finding that does not get talked about enough.
Grounding skill and anti-fabrication skill are completely separate capabilities in these models.
A model that is excellent at finding relevant information in a document is not necessarily good at avoiding making things up. These are two different measurements, and they do not reliably correlate. You cannot assume a model that retrieves well also fabricates less.
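The "two separate skills" claim is the kind of thing a simple correlation check makes concrete. A sketch, with invented example numbers purely for illustration:

```python
# Hypothetical sketch: does grounding skill predict low fabrication?
# `scores` maps model name -> (grounding_score, fabrication_rate);
# the numbers would come from runs like the ones sketched above.
from statistics import correlation  # Pearson; Python 3.10+

def grounding_vs_fabrication(scores: dict[str, tuple[float, float]]) -> float:
    grounding = [g for g, _ in scores.values()]
    fabrication = [f for _, f in scores.values()]
    # A strong negative correlation would mean good retrievers also
    # fabricate less; the study reports the two are largely independent.
    return correlation(grounding, fabrication)

example = {  # invented numbers, for illustration only
    "model_a": (0.92, 0.05),
    "model_b": (0.88, 0.24),  # retrieves well, still fabricates often
    "model_c": (0.61, 0.07),
}
print(grounding_vs_fabrication(example))
```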