Google’s Experimental Gemini Model Tops the Leaderboard, But Stumbles in My Tests

Nov. 16, 2024



Google recently released its experimental ‘Gemini-exp-1114’ model in AI Studio for developers to test. Many speculate that it’s the next-gen Gemini 2.0 model, which Google is expected to release in the coming months. Meanwhile, the search giant tested the model on Chatbot Arena, where users vote on which model offers the best response.

After receiving more than 6,000 votes, Google’s Gemini-exp-1114 model has topped the LMArena leaderboard, outranking ChatGPT-4o and Claude 3.5 Sonnet. However, its ranking drops to fourth position with Style Control, which separates a model’s substance from its presentation and formatting, factors that can sway voters.

Nevertheless, I was curious to test the Gemini-exp-1114 model, so I ran some of the reasoning prompts I have used to compare Gemini 1.5 Pro and GPT-4 in the past. In my testing, I found that Gemini-exp-1114 failed to correctly answer the strawberry question.

It still says there are two r’s in the word ‘strawberry’. On the other hand, OpenAI’s o1-mini model correctly says there are three r’s after thinking for six seconds.

One thing to note, though: the Gemini-exp-1114 model takes some time to respond, which gives the impression that it might be running chain-of-thought (CoT) reasoning in the background, but I can’t say for sure. Some recent reports suggest that LLM scaling has hit a wall, so Google and Anthropic, like OpenAI, are working on inference-time scaling to improve model performance.

Next, I asked the Gemini-exp-1114 model to count ‘q’ in the word ‘vague’, and this time it correctly answered zero. OpenAI’s o1-mini model also gave the right answer. However, on the next question, one which has stumped so many frontier models, Gemini-exp-1114 also disappoints.
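For reference, the letter-counting questions themselves are trivial to verify programmatically, which is exactly why they make such a stark benchmark for LLMs. A minimal Python check (the helper name is my own) confirms the expected answers:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
print(count_letter("vague", "q"))       # 0
```

The models stumble here not because the task is hard, but because they see tokens rather than individual characters, so character-level questions require reasoning instead of simple lookup.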

Reasoning Tests on the Upcoming Gemini Model

In another reasoning question, Gemini-exp-1114 again gets it wrong, saying the answer is four brothers and one sister. OpenAI’s o1-preview gets it right: two sisters and three brothers.

I am surprised that Gemini-exp-1114 ranked first in Hard Prompts on Chatbot Arena. In terms of overall intelligence, OpenAI’s o1 models are the best out there, along with the improved Claude 3.5 Sonnet for coding tasks. So are you disappointed by Google’s upcoming model or do you still think Google can beat OpenAI in the AI race? Let us know in the comments below.
