AI Model Hallucination Evaluation List: Gemini 2.0 Leads, GPT-4 Closely Follows, Domestic Model ZhiPu Surges Ahead
Vectara recently released a hallucination leaderboard for AI large language models, systematically assessing how often mainstream models hallucinate in text summarization tasks. The leaderboard uses Vectara’s self-developed HHEM-2.1 evaluation model, measuring each model’s hallucination rate by having it summarize 831 short articles and judging the summaries for factual consistency.
In the latest rankings, Google’s Gemini 2.0 Flash topped the chart with a hallucination rate of 0.7%, followed by Gemini 2.0 Pro and OpenAI’s o3-mini-high-reasoning model, both at 0.8%. Notably, the GPT-4 series also performed well, with hallucination rates between 1.5% and 1.7%. The domestic model ZhiPu glm-9b scored a respectable 1.3%, while Qwen came in higher, at between 2.8% and 3.0%. Evaluation of DeepSeek’s latest models, V3 and R1, is still ongoing, and they may be added to the leaderboard later.
The evaluation followed a strict methodology: all models were run with a temperature of 0 to ensure stable, reproducible output, and supplementary metrics such as answer rate and average summary length were reported to prevent models from scoring well by simply copying the source text or producing overly short answers. The leaderboard will be updated regularly, providing a useful reference for users selecting and assessing AI models.
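To make the methodology concrete, here is a minimal sketch of how leaderboard-style metrics could be computed from per-document judgments. This is an illustrative reconstruction, not Vectara’s actual code: the `SummaryResult` type, the `leaderboard_metrics` function, and the assumption that each HHEM judgment reduces to a binary consistent/inconsistent label are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SummaryResult:
    answered: bool       # did the model produce a summary at all?
    consistent: bool     # judged factually consistent with the source (assumed binary)
    summary_words: int   # word count of the produced summary

def leaderboard_metrics(results: list[SummaryResult]) -> dict[str, float]:
    """Compute hallucination rate, answer rate, and average summary length."""
    answered = [r for r in results if r.answered]
    answer_rate = len(answered) / len(results)
    # Hallucination rate is measured only over documents the model actually summarized,
    # which is why answer rate must be reported alongside it: refusing to answer
    # hard documents would otherwise flatter the score.
    hallucination_rate = sum(not r.consistent for r in answered) / len(answered)
    avg_summary_length = sum(r.summary_words for r in answered) / len(answered)
    return {
        "hallucination_rate": hallucination_rate,
        "answer_rate": answer_rate,
        "avg_summary_length": avg_summary_length,
    }

# Toy example: 5 documents, 4 answered, 1 inconsistent summary.
demo = [
    SummaryResult(True, True, 60),
    SummaryResult(True, True, 55),
    SummaryResult(True, False, 70),
    SummaryResult(True, True, 65),
    SummaryResult(False, False, 0),
]
print(leaderboard_metrics(demo))
```

On the toy data above, the hallucination rate is 1/4 = 25% and the answer rate is 4/5 = 80%, showing how the two metrics interact: a model that skips documents lowers its answer rate rather than hiding failures.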
The significance of this evaluation lies in establishing, for the first time, a quantifiable and reproducible standard for assessing AI model hallucination. Although limited to text summarization tasks, the method remains valuable for understanding and improving the real-world performance of AI models.