Comparing Llama and GPT: Open-Source Versus Closed-Source AI Development


Patryk Szczygło

Sep 15, 2023 • 12 min read

As it stands, GPT-4 is the king of general-purpose large language models. But for building specialized LLM-based products, Llama 2 might prove superior.

In the introductory paper for Llama 2, Meta itself admits that these models “still lag behind other models like GPT-4.”

The tricky thing is, it’s hard to say exactly why GPT-4 dominates. No one outside of OpenAI knows the details of how it’s built because it’s a closed-source model.

This is where Meta’s Llama family differs the most from OpenAI’s GPT. Meta releases its models as open source, or at least kind of open source, while OpenAI keeps GPT closed. This difference in openness significantly impacts how you work with and build products upon each.

Key similarities and differences between Llama 2 and GPT-4

Similarities of Llama and GPT models

  • Both are Large Language Models (LLMs) based on the Transformer architecture.
  • They work in tokens, which are numbers that represent words or chunks of text. The data they were trained on is also tokenized. Llama 2 was trained on 2 trillion tokens, and estimates for GPT-4 hover around 13 trillion. This training enables them to predict the next token.
  • We don’t know exactly what data these models were trained on; the information isn’t public.
  • Their performance depends on their number of parameters (also called weights). The smallest Llama 2 has 7 billion, considered the smallest model size that can do useful things. The largest has 70 billion. The precise number of GPT-4 parameters is unknown, but estimates range between 1 and 2 trillion.
  • Another common aspect of their performance is the context window. It determines how much of your input the model can take in at one time. It’s 4,096 tokens for base Llama 2 and 8,192 for base GPT-4. However, Llama 2’s context can be extended to 32,000 tokens, and GPT-4 also has a 32K version (which isn’t publicly available yet). See the tokenizer sketch below.
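To make tokens and context windows concrete, here’s a quick sketch using OpenAI’s tiktoken library. Llama 2 ships its own SentencePiece tokenizer, so its counts differ slightly; the numbers here are illustrative:

```python
# pip install tiktoken
import tiktoken

# GPT-4 uses the cl100k_base encoding.
enc = tiktoken.encoding_for_model("gpt-4")

text = "Llama 2 was trained on 2 trillion tokens."
tokens = enc.encode(text)      # a list of integer token IDs
print(len(tokens))             # how many tokens this text consumes

# Prompt and response together must fit in the context window:
CONTEXT_WINDOW = 8192          # base GPT-4; base Llama 2 is 4096
print(f"Room left for output: {CONTEXT_WINDOW - len(tokens)} tokens")
```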

Difference #1 - Llama is open source, GPT is proprietary

OpenAI’s research used to be available to all, but the increasing power of their GPT models convinced them to shut the world out. When asked why, the company’s co-founder and Chief Scientist Ilya Sutskever said:

“We were wrong. Flat out, we were wrong. If you believe, as we do, that at some point, AI — AGI — is going to be extremely, unbelievably potent, then it just does not make sense to open-source. It is a bad idea... I fully expect that in a few years it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.”

Mark Zuckerberg is of the opposite opinion:

“Open source drives innovation because it enables many more developers to build with new technology [...] It also improves safety and security because when software is open, more people can scrutinize it to identify and fix potential issues.”

But there’s a small issue here. While closer to being open than GPT-4, Llama 2 isn’t open source to the full extent. One researcher who analyzed how open Meta’s models are stated:

“Meta using the term ‘open source’ for this is positively misleading: There is no source to be seen, the training data is entirely undocumented, and beyond the glossy charts the technical documentation is really rather poor.”

Luckily, unless you plan to recreate Llama 2 in its entirety, this isn’t really a problem for you. You still get benefits that OpenAI doesn’t provide:

  • The ability to download the model, interact with it directly, and host it wherever you want,
  • Access to weights and no extra payment for the option to fine-tune the model.

Weights determine the output of an LLM. Having access to them is helpful both from a research perspective and when you’re building a product and want to fine-tune the model to produce output different from the base model’s.
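As a minimal sketch of what that access looks like in practice (assuming your Hugging Face access request for the meta-llama repos has been approved), you can download the weights and run the model entirely on your own hardware:

```python
# pip install transformers torch accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the weights on whatever GPU/CPU you have.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain open-source vs closed-source AI in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the weights sit on your machine, your prompts and outputs never leave it, which is the practical payoff of open weights.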

Difference #2 - Llama is customizable, GPT is convenient

Llama 2 is the first reliable model that is free to use for commercial purposes, with some limitations: for example, if your app has over 700 million monthly active users, you need to request a special license from Meta.

To start working with it, you need to fill out a form. After being approved, you can choose and download a model from Hugging Face. With a strong enough computer, you should be able to run the smallest version of Llama 2 locally.

As for the bigger ones, you’ll need access to machines built with AI in mind; the most convenient route is a cloud service like Amazon SageMaker.
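If you’d rather not manage GPUs yourself, here’s a hedged sketch of deploying Llama 2 through SageMaker JumpStart. Exact model IDs and payload formats vary with SDK versions, so treat this as a starting point, not a recipe:

```python
# pip install sagemaker  (assumes configured AWS credentials)
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b")
# Deploying provisions a GPU instance, billed by the hour.
predictor = model.deploy(accept_eula=True)

response = predictor.predict({
    "inputs": "Summarize: open models give you control over hosting.",
    "parameters": {"max_new_tokens": 128},
})
print(response)

predictor.delete_endpoint()  # stop paying when you're done
```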

To customize Llama 2, you can fine-tune it for free – well, kind of for free, because fine-tuning can be difficult, costly, and compute-hungry, particularly if you want to do full-parameter fine-tuning on large models.

If you don’t need full-parameter fine-tuning, there are ways to fine-tune Llama models on a single GPU (see the sketch below), or platforms like Gradient that automate this for you.
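One popular single-GPU technique (named here as an example, not the only option) is LoRA via Hugging Face’s peft library: instead of updating all 7 billion weights, you train small low-rank adapter matrices injected into the attention layers. A rough sketch:

```python
# pip install transformers peft accelerate
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)

config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, you’d plug the wrapped model into a standard transformers training loop with your dataset; only the tiny adapters get updated, which is what makes a single GPU enough.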

Fine-tuning can produce fascinating results. This is how you can get a model that outperforms GPT-4 at a specific niche task, for example SQL generation or functional representation.

Code Llama is a good example. It’s a specialized Llama 2 model additionally trained on 500 billion tokens of code data. With some additional fine-tuning, it was able to beat GPT-4 in the HumanEval programming benchmark.

When it comes to working with OpenAI’s models, you need to get an OpenAI API key and pay monthly for the tokens you use. Using their models is more restrictive:

  • You can’t download them or host them yourself, but on the plus side, it means you don’t need to worry about where and how they’re hosted.
  • You can’t fine-tune GPT-4 yet, only GPT-3.5 and a couple of other models for now.
  • Pricing is set per 1,000 tokens (~750 words). For GPT-4 with an 8K context window, it’s currently $0.03 per 1K tokens of input and $0.06 per 1K tokens of output; the sketch below turns this into a per-call cost.
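Here’s what that looks like in practice with the openai Python library as it exists at the time of writing, including the cost arithmetic implied by the pricing above:

```python
# pip install openai  (the pre-1.0, 0.28-era interface)
import openai

openai.api_key = "sk-..."  # your API key from platform.openai.com

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize Llama 2 vs GPT-4."}],
)
print(response.choices[0].message.content)

# Each response reports token usage, which maps directly to cost:
usage = response["usage"]
cost = (usage["prompt_tokens"] / 1000) * 0.03 + \
       (usage["completion_tokens"] / 1000) * 0.06
print(f"This call cost roughly ${cost:.4f}")
```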

How does this translate to the costs of building a product on Llama 2 versus GPT-4?

As one experiment shows, if you need a model for summarizing text:

  • You’ll pay roughly 18 times more to use GPT-4 than the biggest Llama 2 to achieve similar performance.
  • Llama 2 70B will cost about 10% more than GPT-3.5, but the performance difference is worth the extra 10%.

Depending on the use case, it might turn out that you don’t even need the 70 billion-parameter Llama 2, and that 13 or 7 billion will suffice. If so, your expenses will drop even more:

  • With more parameters, the model can process, learn from, and generate more data – but also requires more computational and memory resources, i.e. it’s more expensive to run.
  • It’s also more expensive to fine-tune a model with more parameters, or retrain it with recent data.

Benchmarks and comparisons of Llama and GPT models

Remember that benchmarks are tricky. A great result on a benchmark doesn’t necessarily mean the model will perform better for your use case. Plus, with the many different versions of each model out there, apples-to-apples comparisons are hard. Take these numbers with a grain of salt.

HumanEval

A carefully curated set of 164 programming challenges created by OpenAI to evaluate code generation models.
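To give a flavor of the format, here’s an illustrative problem in the style of HumanEval (not an actual task from the set): the model receives a signature and docstring and must generate a body that passes hidden unit tests.

```python
# Illustrative only - written in the style of a HumanEval task.
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards.
    >>> is_palindrome("level")
    True
    >>> is_palindrome("hello")
    False
    """
    # --- everything below this line is what the model must generate ---
    cleaned = text.lower()
    return cleaned == cleaned[::-1]
```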

  • GPT-4: 67.0% (or as much as 91% in a recent study that augmented it with a reinforcement learning method)
  • Llama 2: 29.9%; however, a fine-tuned Code Llama achieved 73.8%

MMLU

This challenge consists of 57 general-knowledge tasks, including elementary mathematics, US history, computer science, law, and more. It tests world knowledge and problem-solving ability.

  • GPT-4: 86.4%
  • GPT-3.5: 70%
  • Llama 2 70B: 68.9%

LegalBench

Here, the challenge is all about legal reasoning tasks, based on a dataset prepared with law practitioners. Below are averaged scores from 5 different tasks.

  • GPT-4: 77.32%
  • GPT-3.5: 64.9%
  • Llama 2 13B: 50.6%

HellaSwag

HellaSwag evaluates the common sense of models with questions that are trivial for humans.

  • GPT-4: 95.3%
  • Llama 2 70B: 85.3%

AgentBench

A unique benchmark that evaluates LLMs as autonomous agents across different environments, such as an operating system, database, knowledge graph, or digital card game. The numbers represent an overall score, calculated as a weighted average across all environments.

  • GPT-4: 4.41
  • GPT-3.5 Turbo: 2.55
  • Llama 2 13B: 0.55

Winogrande

Here, the models have to tackle 44,000 common-sense problems.

  • GPT-4: 87.5%
  • Llama 2 70B: 80.2%

Just scratching the surface

The Llama and GPT families of models represent the two sides of the AI development coin – open source and closed.

Both are top of their class, but they’re far from your only options. In this article, I mainly wanted to use these models to explain the differences between open and closed AI development.

Hopefully it has helped you decide which approach is better for you. If you’re still not sure, we can provide additional guidance.
