Compare LLM performance at scale with Promptfoo

I have previously discussed considering alternatives to OpenAI for production, real-world applications. There are dozens of such alternatives, and many open-source options are particularly intriguing.

By comparing these solutions, you can potentially save money while also achieving better quality and speed.

For example, in a previous blog post I talked about the noticeable speed differences between OpenAI's GPT-4o, its fastest model at the time, and Llama 3 served by providers such as Together AI and Fireworks AI, which can deliver up to 3x the speed at a fraction of the cost.

How to compare LLMs

There are various ways to compare models. For occasional testing, or when you only have a few models, you can use tools like Open WebUI. There are also many commercial model-comparison tools you could leverage.

In a business setting, where you have dozens of prompts and use cases across many models, you may want to automate this evaluation. Having a tool that supports such tasks can be very useful. One tool I frequently use is Promptfoo.

In this blog post, we will explore how to use Promptfoo to compare the performance and quality of responses from an alternative to OpenAI, in this case a Llama-based model from Together AI. Of course, you can adapt this approach to any other model or scenario you like.

Let’s get started.

---

1. Introduction to Promptfoo

Promptfoo is an open-source tool designed to help evaluate and compare large language models (LLMs). It enables developers to systematically test prompts across multiple LLM providers, evaluate outputs using various assertion types, and calculate metrics like accuracy, safety, and performance. Promptfoo is particularly helpful for those building business applications who need a straightforward, flexible, and extensible API for LLM evaluation.

2. Overview of Together AI

Together AI is a platform that provides hosted access to a wide range of open-source language models. It is best known for offering alternatives to OpenAI's GPT models, aiming to deliver more cost-effective and customizable options for developers working with LLMs.

---

3. Setting up Promptfoo

To get started with Promptfoo, follow these steps:

  1. Ensure you have Node.js 18 or newer installed.
  2. Install Promptfoo globally using npm:
npm install -g promptfoo
  3. Verify the installation:
promptfoo --version
  4. Initialize Promptfoo in your project directory:
promptfoo init

This will create a promptfooconfig.yaml file in your current directory.
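
The generated file is only a starting point that you edit for your own use case. Its exact contents vary by Promptfoo version, but it looks roughly like the sketch below (the description, prompt, and variable names here are placeholders, not what your version will necessarily generate):

```yaml
# Rough sketch of a freshly generated promptfooconfig.yaml (version-dependent)
description: "My first eval"

prompts:
  - "Write a short product description for {{product}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      product: a solar-powered phone charger
```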

---

4. Configuring Promptfoo for Comparison Testing

To compare OpenAI GPT with Together AI, configure your promptfooconfig.yaml as follows:

providers:
  - openai:gpt-4o-mini
  - togetherai:chat:meta-llama/Meta-Llama-3-8B-Instruct-Turbo

prompts:
  - "Translate the following English text to German: '{{input}}'"

tests:
  - description: English to German translation
    vars:
      input: Hello, how are you?
    assert:
      - type: contains
        value: Hallo
      - type: contains
        value: wie geht es dir

Note: Replace Meta-Llama-3-8B-Instruct-Turbo with whichever Together AI model identifier you want to evaluate.
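
Both providers need API keys, which Promptfoo reads from environment variables. The OpenAI provider uses OPENAI_API_KEY; for the Together AI provider the variable is typically TOGETHER_API_KEY (verify the exact name in the provider docs for your version). A minimal setup with placeholder values:

```shell
# Placeholder keys: substitute your real credentials.
# OPENAI_API_KEY is standard for the OpenAI provider; TOGETHER_API_KEY
# is the commonly used variable for Together AI (check the docs).
export OPENAI_API_KEY="sk-your-openai-key"
export TOGETHER_API_KEY="your-together-key"
```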

---

5. Creating an English-to-German Translation Test Case

We’ve already included a basic test case. Let’s add more tests for comprehensive coverage:

tests:
  - description: Simple greeting translation
    vars:
      input: Hello, how are you?
    assert:
      - type: contains
        value: Hallo
      - type: contains
        value: wie geht es dir

  - description: Complex sentence translation
    vars:
      input: The quick brown fox jumps over the lazy dog.
    assert:
      - type: contains
        value: Der schnelle braune Fuchs
      - type: contains
        value: springt über den faulen Hund

  - description: Idiomatic expression translation
    vars:
      input: It's raining cats and dogs.
    assert:
      - type: contains
        value: Es regnet in Strömen
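
Substring checks like contains are strict about exact wording, so a perfectly valid translation can fail just by phrasing things differently. Promptfoo also offers softer assertion types, for example case-insensitive matching and model-graded rubrics (see the assertions reference; the rubric wording below is just an illustration):

```yaml
  - description: Formal register translation
    vars:
      input: Could you please send me the report?
    assert:
      # case-insensitive substring match
      - type: icontains
        value: bericht
      # model-graded check: an LLM judges the output against this rubric,
      # which costs an extra grading call per test
      - type: llm-rubric
        value: The translation addresses the reader with the formal German "Sie" form.
```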

---

6. Running the Tests and Collecting Metrics

To run the tests and collect metrics:

promptfoo eval -c promptfooconfig.yaml --share -o output.json

This command evaluates the prompts using both OpenAI GPT and the Together AI model, then saves the results to output.json.
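
If you prefer to post-process the results programmatically, you can read output.json directly. The exact schema depends on your Promptfoo version; the snippet below uses a made-up, simplified shape purely to illustrate the idea, so inspect your own file before relying on any field names:

```shell
# Write a mock results file with an assumed, simplified schema
# (the real promptfoo output.json may be structured differently).
cat > sample-output.json <<'EOF'
{"results": {"stats": {"successes": 5, "failures": 1}}}
EOF

# Summarize the aggregate pass/fail counts from the mock file
python3 - <<'PY'
import json
with open("sample-output.json") as f:
    stats = json.load(f)["results"]["stats"]
print(f"passed: {stats['successes']}, failed: {stats['failures']}")
PY
```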

---

7. Analyzing Results

After running the tests, you can review output.json to analyze the results.

For an even easier method of reviewing and comparing results, you can use the built-in dashboard:

vscode ➜ /workspaces/promptfoo $ npx promptfoo view
Migrated results from file system to database
Server running at http://localhost:15500 and monitoring for new evals.
Open URL in browser? (y/N):

Open http://localhost:15500 in your browser.

In this dashboard, you can view all the test prompts along with their results (metrics + telemetry data) that you just executed.

This gives you the insight needed to decide which model might be the best fit for your scenario, weighing factors like translation quality, response time, and cost.

---

For many more options and configurations, please refer to the official Promptfoo documentation:

https://www.promptfoo.dev/docs/intro/

8. Conclusion

Using Promptfoo to compare the performance of OpenAI GPT and Together AI for English-to-German translation tasks provides valuable insights into each model’s strengths and weaknesses. By systematically testing various inputs and evaluating the outputs, you can make informed decisions about which LLM provider best meets your requirements.

Key Metrics to Consider:

  1. Accuracy of translations
  2. Handling of idiomatic expressions
  3. Response time
  4. Consistency across different input types

By examining these factors, you can determine which model performs better for your specific use case and make data-driven decisions about which LLM provider to integrate into your applications.
