Gemini's data-analysis abilities fall short of Google's hype

The promise of Google’s leading generative AI models, Gemini 1.5 Pro and 1.5 Flash, lies in their ability to process and analyze vast amounts of data. At least, that’s what Google claims. The models purportedly excel at summarizing extensive documents and searching through film footage to achieve tasks previously deemed impossible.

However, two recent studies cast doubt on these claims. They examined how well Google’s Gemini models, along with others, make sense of very large amounts of data, on the order of the length of “War and Peace.” The findings show that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about such long inputs accurately, with accuracy as low as 40-50%.

Marzena Karpinska, a postdoc at UMass Amherst and a study co-author, points out, “While models like Gemini 1.5 Pro may have the capacity to process long contexts, they often lack actual understanding of the content.”

The limits of Gemini’s context window

A model’s context window is the input data it considers before generating output. Gemini’s latest versions can take in up to 2 million tokens of context, the largest of any commercially available model. Google has showcased Gemini’s long-context reasoning in a range of scenarios, but the studies paint a different picture.

One study asked the models to evaluate true/false statements about works of fiction, where Gemini 1.5 Pro and 1.5 Flash could not lean on prior knowledge of the text and struggled to answer accurately. Overall, their question-answering accuracy was no better than random chance.
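To make "no better than random chance" concrete, here is a minimal sketch of how true/false answers can be scored against a coin-flip baseline. It is an illustration only, not the researchers' scoring method: the function names, the 50% baseline (which assumes an even mix of true and false statements), and the sample answers are all hypothetical.

```python
# Illustrative sketch: score true/false answers against a coin-flip baseline.
# Assumption: statements are evenly split between true and false, so random
# guessing is expected to be right about 50% of the time.

CHANCE_BASELINE = 0.5


def accuracy(predictions, labels):
    """Return the fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled statement is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def beats_chance(predictions, labels, baseline=CHANCE_BASELINE):
    """Return True only if accuracy is strictly above the chance baseline."""
    return accuracy(predictions, labels) > baseline


# Made-up data, not taken from either study.
labels = [True, False, True, False, True, False, True, False]
model_answers = [True, True, False, False, True, True, False, False]

score = accuracy(model_answers, labels)
print(f"accuracy: {score:.0%}, beats chance: {beats_chance(model_answers, labels)}")
```

On this made-up sample the model gets four of eight statements right, which is exactly what guessing would be expected to achieve. A real evaluation would also need enough statements to tell a genuine edge from noise.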

Struggling to reason over video

The second study tested Gemini 1.5 Flash’s ability to reason over videos by answering questions about the images in them. Flash had difficulty accurately transcribing handwritten digits presented in a slideshow-like format, which calls its reasoning abilities into question even on simple tasks.

Overpromising and under-delivering

Google’s promotion of Gemini’s context window as a differentiator in generative AI has come under growing scrutiny because of the gap between what is promised and how the models actually perform. The research suggests that Google’s models, like others on the market, struggle to deliver on their promised capabilities.

Looking ahead

To combat the hype surrounding generative AI, researchers advocate for better benchmarks and third-party critique to ensure realistic evaluations of these models’ capabilities. As the field evolves, a critical reassessment of existing benchmarking practices is necessary to provide accurate insights into the true potential of these cutting-edge technologies.