Super Mario Bros. has become an unlikely stress test for artificial intelligence. Mastering Pokémon might already seem like a daunting task for AI, but researchers from Hao AI Lab at the University of California San Diego argue that Super Mario Bros. poses an even greater challenge.
Anthropic’s Claude 3.7 stole the show in the live Super Mario Bros. games, outperforming other contenders like Claude 3.5, Google’s Gemini 1.5 Pro, and OpenAI’s GPT-4o.
However, this gaming experiment didn’t use the original 1985 release of Super Mario Bros. Instead, it ran in an emulator integrated with GamingAgent, a framework created by Hao AI Lab.
GamingAgent fed each AI basic instructions such as “If an obstacle or enemy is near, move/jump left to dodge,” along with in-game screenshots. The AI then responded with Python code that controlled Mario.
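To make that loop concrete, here is a minimal sketch of how such an agent could work. It is an illustration only, assuming a simple capture-prompt-execute cycle; the names `capture_frame`, `query_model`, and `press`, and the pacing value, are hypothetical placeholders and not GamingAgent’s actual API.

```python
import base64
import time

# Hypothetical sketch of a GamingAgent-style loop; the model receives an
# instruction prompt plus a screenshot and answers with a short Python
# snippet that the agent then executes.

INSTRUCTION = (
    "If an obstacle or enemy is near, move/jump left to dodge. "
    "Reply only with Python code that calls press(button, frames)."
)

def capture_frame() -> bytes:
    """Placeholder: grab the current emulator frame as PNG bytes."""
    return b""

def press(button: str, frames: int) -> None:
    """Placeholder: hold a controller button for the given number of frames."""
    print(f"press {button} for {frames} frames")

def query_model(prompt: str, image_b64: str) -> str:
    """Placeholder: send the prompt and screenshot to a vision-language model."""
    return 'press("right", 10)'  # e.g. the model decides to keep running right

while True:
    screenshot = base64.b64encode(capture_frame()).decode()
    code = query_model(INSTRUCTION, screenshot)  # model answers in Python
    exec(code, {"press": press})                 # run it with only press() exposed
    time.sleep(0.5)                              # real agents are paced by model latency
```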
Despite the simplified setup, Hao AI Lab emphasized that each model had to adapt and strategize effectively within the game environment. Surprisingly, reasoning models like OpenAI’s o1, which work through a problem step by step before answering, struggled compared to non-reasoning models, even though they typically excel on most benchmarks.
The delay in decision-making is a significant drawback for reasoning models when it comes to fast-paced games like Super Mario Bros. In such games, split-second timing can determine success or failure.
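A rough back-of-the-envelope calculation shows why latency hurts: the NES runs at roughly 60 frames per second, so every second a model spends reasoning, the game moves on by about 60 frames. The latencies below are illustrative assumptions, not measured values for any particular model.

```python
FPS = 60  # the NES renders roughly 60 frames per second

# Illustrative decision latencies in seconds, not measured figures.
latencies = {"fast non-reasoning model": 0.5, "slow reasoning model": 5.0}

for name, seconds in latencies.items():
    frames_missed = int(seconds * FPS)
    print(f"{name}: ~{frames_missed} frames pass before the next action")

# A well-timed jump over an enemy has a window of only a handful of frames,
# so hundreds of frames of lag make precise play nearly impossible.
```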
Experts have debated the utility of gauging AI progress through gaming benchmarks. While games provide extensive data for training AI, some argue that the abstract and simplistic nature of games doesn’t necessarily align with real-world challenges.
As Andrej Karpathy, a founding member of OpenAI, put it, there appears to be an “evaluation crisis” surrounding AI metrics and capabilities, leaving uncertainty about the true depth of AI advancements.
In the end, the mesmerizing sight of AI navigating the world of Mario reminds us of the never-ending quest to push the boundaries of artificial intelligence.