Llama 3 vs GPT-4: Meta Challenges OpenAI on AI Turf

Apr. 20, 2024



Meta recently introduced its Llama 3 model in two sizes, with 8B and 70B parameters, and open-sourced the models for the AI community. Despite being a much smaller model at 70B parameters, Llama 3 has shown impressive capability, as evident from the LMSYS leaderboard. So we have compared Llama 3 with the flagship GPT-4 model to evaluate their performance in various tests. On that note, let’s go through our comparison between Llama 3 and GPT-4.

1. Magic Elevator Test

Let’s first run the magic elevator test to evaluate the logical reasoning capability of Llama 3 in comparison to GPT-4. And guess what? Llama 3 surprisingly passes the test, whereas the GPT-4 model fails to provide the correct answer. This is pretty surprising since Llama 3 has only 70 billion parameters, whereas GPT-4 is rumored to have a massive 1.7 trillion parameters.

Note: GPT-4 loses on ChatGPT Plus

Next, we ran the classic reasoning question to test the intelligence of both models. In this test, both Llama 3 70B and GPT-4 gave the correct answer without delving into mathematics. Good job Meta!

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

After that, I asked another question to compare the reasoning capability of Llama 3 and GPT-4. In this test, the Llama 3 70B model comes close to giving the right answer but misses out on mentioning the box. The GPT-4 model, on the other hand, rightly answers that “the apples are still on the ground inside the box”. I am going to give it to GPT-4 in this round.

Winner: GPT-4 via ChatGPT Plus

While the question seems quite simple, many AI models fail to get the right answer. However, in this test, both Llama 3 70B and GPT-4 gave the correct answer. That said, Llama 3 sometimes generates wrong output, so keep that in mind.

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

Next, I asked a simple logical question and both models gave a correct response. It’s interesting to see a much smaller Llama 3 70B model rivaling the top-tier GPT-4 model.

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

Next, we ran a complex math problem on both Llama 3 and GPT-4 to find which model wins this test. Here, GPT-4 passes the test with flying colors, but Llama 3 fails to come up with the right answer. It’s not surprising though, since the GPT-4 model has scored great on the MATH benchmark. Keep in mind that I explicitly asked ChatGPT not to use Code Interpreter for mathematical calculations.

Winner: GPT-4 via ChatGPT Plus

Following user instructions is very important for an AI model, and Meta’s Llama 3 70B model excels at it. Asked for 10 sentences ending with the word “mango”, it generated all 10 correctly. GPT-4 could only generate eight such sentences.
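Counting such sentences by hand gets tedious, so here is a minimal sketch of how a response could be scored automatically. It assumes the model returns one sentence per line, optionally numbered; the helper names `ends_with_word` and `score_response` are hypothetical and not part of any tool used in this comparison.

```python
import re


def ends_with_word(sentence: str, word: str) -> bool:
    """Return True if the last alphabetic word in the sentence equals `word`."""
    # Collect runs of letters/apostrophes; digits, quotes and trailing
    # punctuation are ignored, so "mango." and "mango!" both count.
    words = re.findall(r"[A-Za-z']+", sentence)
    return bool(words) and words[-1].lower() == word.lower()


def score_response(response: str, word: str = "mango") -> int:
    """Count the non-empty lines whose final word is `word`.

    Assumption: one sentence per line, as in a numbered list.
    """
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return sum(ends_with_word(line, word) for line in lines)


if __name__ == "__main__":
    sample = (
        "1. I ate a ripe mango.\n"
        "2. Mango is my favorite fruit.\n"
        "3. She sliced the MANGO!"
    )
    print(score_response(sample))
```

The check is deliberately strict: a sentence ending in “mangoes” does not count, which matches the wording of the prompt.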

Winner: Llama 3 70B

Although Llama 3 currently doesn’t have a long context window, we still did the NIAH (needle in a haystack) test to check its retrieval capability. The Llama 3 70B model supports a context length of up to 8K tokens. So I placed a needle (a random statement) inside a 35K-character long text (8K tokens) and asked the model to find the information. Surprisingly, the Llama 3 70B found the text in no time. GPT-4 also had no problem finding the needle.
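For readers who want to try something similar, a rough sketch of how such a test prompt could be assembled is shown here. The filler text, the needle, and the function names `build_haystack` and `needle_retrieved` are all illustrative assumptions, not the exact material used in this test; sending the prompt to a model is left out.

```python
def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Repeat `filler` up to `total_chars` and insert `needle` at a relative depth.

    depth=0.0 places the needle at the start, 1.0 at the end.
    """
    if not filler:
        raise ValueError("filler must not be empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    repeats = total_chars // len(filler) + 1
    body = (filler * repeats)[:total_chars]
    pos = int(len(body) * depth)
    # Pad the needle with spaces so it does not fuse with the filler text.
    return body[:pos] + " " + needle + " " + body[pos:]


def needle_retrieved(answer: str, expected: str) -> bool:
    """Loose check: does the model's answer contain the expected fact?"""
    return expected.lower() in answer.lower()


if __name__ == "__main__":
    needle = "The secret code is 7421."  # hypothetical needle
    haystack = build_haystack("The quick brown fox. ", needle, 35_000, depth=0.5)
    prompt = haystack + "\n\nWhat is the secret code mentioned in the text above?"
    print(len(haystack), needle in haystack)
```

Varying `depth` is the usual way to see whether a model retrieves the needle equally well from the start, middle, and end of the context.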

Of course, this is a small context, but when Meta releases a Llama 3 model with a much larger context window, I will test it again. But for now, Llama 3 shows great retrieval capability.

Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

In almost all of the tests, the Llama 3 70B model has shown impressive capabilities, be it advanced reasoning, following user instructions, or retrieval capability. It lags behind the GPT-4 model only in mathematical calculations. Meta says that Llama 3 has been trained on a larger coding dataset, so its coding performance should also be great.

Bear in mind that we are comparing a much smaller model with the GPT-4 model. Also, Llama 3 is a dense model, whereas GPT-4 is reportedly built on the MoE (mixture of experts) architecture consisting of 8x 222B models. It goes to show that Meta has done a remarkable job with the Llama 3 family of models. When the 400B+ Llama 3 model that Meta says is still in training drops in the future, it should perform even better and may beat the best AI models out there.

It’s safe to say that Llama 3 has upped the game, and by open-sourcing the model, Meta has significantly closed the gap between proprietary and open-source models. We did all these tests on an Instruct model. Models fine-tuned on Llama 3 70B should deliver exceptional performance. Alongside OpenAI, Anthropic, and Google, Meta has now officially joined the AI race.
