By: Lockridge Okoth
Publisher: BeInCrypto
Date: May 11, 2026

DeepSeek-R1 Hallucinates 4x More Than V3, Raising Red Flags for Crypto AI Agent Tokens

DeepSeek-R1, the flagship reasoning model from Chinese lab DeepSeek, hallucinates at 14.3% according to Vectara’s HHEM 2.1 benchmark. That is nearly four times higher than its non-reasoning predecessor DeepSeek-V3, which scored 3.9%.

The gap raises hard questions for the crypto sector. A fast-growing class of AI agent tokens now leans on reasoning-style LLMs for autonomous trading, signals, and on-chain execution.

Vectara Data Shows R1 ‘Overhelps’ With False Facts

Vectara ran both DeepSeek models through HHEM 2.1, its dedicated hallucination evaluation framework. The team also cross-checked the results using Google’s FACTS methodology. R1 produced more false or unsupported statements than V3 in every test configuration.

The cause was not reasoning depth alone. Vectara’s analysts found that R1 tends to “overhelp.” The model adds information that does not appear in the source text.

That added detail can be factually correct on its own and still count as a hallucination. The behavior smuggles fabricated context into otherwise sound answers.
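To make the "overhelp" failure mode concrete, here is a deliberately simplified stand-in for a grounded-ness check (not Vectara's actual HHEM model): it flags summary sentences built mostly from words the source never used. The example text and the 0.5 threshold are illustrative assumptions.

```python
# Simplified illustration (NOT Vectara's HHEM): flag summary sentences
# that introduce content absent from the source text. A claim can be
# true in the real world and still count as a hallucination here,
# because it is unsupported by the source.

def unsupported_sentences(source: str, summary: str) -> list[str]:
    source_words = set(source.lower().split())
    flagged = []
    for sentence in summary.split("."):
        words = set(sentence.lower().split())
        # Heuristic: a sentence mostly made of words the source never
        # used is likely adding outside information ("overhelping").
        if words and len(words - source_words) / len(words) > 0.5:
            flagged.append(sentence.strip())
    return flagged

source = ("DeepSeek-R1 showed a higher hallucination rate than "
          "DeepSeek-V3 on the benchmark.")
summary = ("DeepSeek-R1 showed a higher hallucination rate than "
           "DeepSeek-V3 on the benchmark. The lab was founded in "
           "Hangzhou by Liang Wenfeng.")
print(unsupported_sentences(source, summary))
# Flags the second sentence: factually true, but not in the source.
```

Real evaluators use trained entailment models rather than word overlap, but the scoring principle is the same: support in the source text, not truth in the world, is what counts.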

Vectara stated the finding directly in a public post on X.

“DeepSeek-R1 shows a 14.3% hallucination rate, nearly 4x higher than DeepSeek-V3,” Vectara noted in a post.

The pattern is not unique to DeepSeek. Industry trackers note the same trade-off across reasoning-trained models from other labs. Reinforcement learning that sharpens chain-of-thought also rewards bolder and more confident generation.

Why Crypto AI Tokens Sit on This Trade-Off

The crypto market now hosts hundreds of AI agent tokens, led by Virtuals Protocol (VIRTUAL), ai16z (AI16Z), and aixbt (AIXBT).

The category has posted roughly 39.4% growth over a recent 30-day window. Virtuals alone has surpassed $576 million in market capitalization.

[Chart] Virtuals Protocol (VIRTUAL) Price Performance. Source: CoinGecko

Most of these agents wrap a large language model in tooling. That tooling lets the agent post on social media, route trades, mint tokens, or generate market commentary.

When the underlying model fabricates a price level, a partnership, or a contract address, the consequences can land on-chain.

One BeInCrypto analysis of AIXBT showed the agent had shilled 416 tokens with a 19% average return. The same mechanism, however, exposes followers to bad calls when the model fails.

The risk surface scales with autonomy. Read-only agents that summarize sentiment carry far lower stakes than agents that hold treasury keys.

Reasoning models are especially attractive for agents that plan across multiple steps. That is also the use case where Vectara’s 14.3% figure bites hardest.

A single hallucinated fact early in a chain of thought can propagate through every downstream action.
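A back-of-envelope calculation shows why. If each step of a plan has an independent chance of staying faithful, reliability compounds multiplicatively; the independence assumption is a simplification, and the five-step chain is an illustrative choice, not a figure from Vectara.

```python
# Back-of-envelope: errors compound across a multi-step chain.
# Assumes hallucinations are independent per step -- a simplification.

def chain_reliability(hallucination_rate: float, steps: int) -> float:
    """Probability that a multi-step chain contains no hallucination."""
    return (1 - hallucination_rate) ** steps

for label, rate in [("DeepSeek-R1", 0.143), ("DeepSeek-V3", 0.039)]:
    p = chain_reliability(rate, steps=5)
    print(f"{label}: {p:.0%} chance a 5-step plan is hallucination-free")
# DeepSeek-R1: 46%; DeepSeek-V3: 82%
```

Under these assumptions, a per-response gap of 14.3% versus 3.9% widens into roughly 46% versus 82% odds that a five-step plan stays clean end to end.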

LeCun Argues the Problem Is Architectural

Yann LeCun, Meta’s chief AI scientist, has long argued that autoregressive LLMs cannot fully escape hallucination. In his view, the architecture itself lacks any grounded model of the world.

Reinforcement learning on chain-of-thought can paper over the issue inside narrow domains like math and coding. The root cause, however, stays in place.

Other frontier labs disagree. They point to steady progress on benchmark hallucination rates through retrieval augmentation, post-training fine-tunes, and verifier models. Reports from developers, however, often line up with the leaderboard data.

AI researcher xlr8harder, writing on X about a debugging session with R1, summed up the daily experience.

“Deepseek R1 has an interesting unintegrated understanding of its thought traces. … so it defaults to gaslighting me with hallucinations,” they stated.

For crypto agent developers, the practical question is risk management, not architectural philosophy. Designs that route every model claim through a verification step may fare better.

The same goes for agents that lean on smaller, more conservative models for financial actions.
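A minimal sketch of that verify-before-act pattern, under stated assumptions: `fetch_onchain_price` is a hypothetical placeholder for a real oracle or indexer lookup, and the quoted price and 2% tolerance are invented for illustration. The point is structural: no action fires unless the model's claim survives an independent check.

```python
# Hypothetical "verify before act" gate for an on-chain agent.
# fetch_onchain_price is a placeholder, not a real API.
import math
from dataclasses import dataclass

@dataclass
class Claim:
    asset: str
    stated_price: float

def fetch_onchain_price(asset: str) -> float:
    # Placeholder for a real oracle / indexer lookup.
    return {"VIRTUAL": 1.42}.get(asset, float("nan"))

def verify_claim(claim: Claim, tolerance: float = 0.02) -> bool:
    """Reject the claim unless it matches an independent source."""
    actual = fetch_onchain_price(claim.asset)
    if math.isnan(actual):
        return False
    return abs(claim.stated_price - actual) / actual <= tolerance

def maybe_trade(claim: Claim) -> str:
    if not verify_claim(claim):
        return "blocked: claim failed verification"
    return f"executing trade on {claim.asset}"

print(maybe_trade(Claim("VIRTUAL", 1.42)))  # matches the source, passes
print(maybe_trade(Claim("VIRTUAL", 3.10)))  # hallucinated price, blocked
```

The gate turns a model hallucination from an on-chain loss into a logged rejection, which is the difference between the two agent designs described above.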

The next leaderboard cycles and the eventual successors to R1 will show whether the reasoning-versus-accuracy trade-off is being narrowed.

For now, the gap between 14.3% and 3.9% is an operational detail worth watching. It could separate AI agent tokens shipping working products from those shipping promises.

