Compare LLMs' performance at scale with Promptfoo
Explore effective alternatives to OpenAI with Promptfoo, an open-source tool for comparing large language models (LLMs). Learn how to evaluate models like Together AI's Llama 3 against OpenAI's GPT-4o, discovering insights on speed, cost, and performance. Make informed decisions for your applications.

I have previously discussed considering alternatives to OpenAI for productive, real-world applications. There are dozens of such alternatives, and many open-source options are particularly intriguing.
By comparing these solutions, you can potentially save money while also achieving better quality and speed.
For example, in a previous blog post I talked about the noticeable speed differences between models like OpenAI's GPT-4o, its fastest model, and Llama 3 on Together AI and Fireworks AI. The latter can deliver up to 3x the speed at a fraction of the cost.
How to compare LLMs
There are various ways to compare models. You can use tools like Open WebUI if you're doing it occasionally or if you have just a few models to test. There are also many commercial model comparison tools you could leverage.

In a business setting, where you have dozens of prompts and use cases across many models, you may want to automate this evaluation. Having a tool that supports such tasks can be very useful. One tool I frequently use is Promptfoo.
In this blog post, we will explore how to use Promptfoo to compare the performance and quality of responses from alternative models to OpenAI, in this case a Llama-based model from Together AI. Of course, you can adapt this approach to any other model or scenario you like.
Let’s get started.
---
1. Introduction to Promptfoo
Promptfoo is an open-source tool designed to help evaluate and compare large language models (LLMs). It enables developers to systematically test prompts across multiple LLM providers, evaluate outputs using various assertion types, and calculate metrics like accuracy, safety, and performance. Promptfoo is particularly helpful for those building business applications who need a straightforward, flexible, and extensible API for LLM evaluation.
2. Overview of Together AI
Together AI is a platform that provides access to a wide range of open-source language models. It is known for offering alternatives to OpenAI's GPT models, aiming to deliver more cost-effective and customizable options for developers working with LLMs.
---
3. Setting up Promptfoo
To get started with Promptfoo, follow these steps:
- Ensure you have Node.js 18 or newer installed.
- Install Promptfoo globally using npm:
npm install -g promptfoo
- Verify the installation:
promptfoo --version
- Initialize Promptfoo in your project directory:
promptfoo init
This will create a promptfooconfig.yaml file in your current directory.
---
4. Configuring Promptfoo for Comparison Testing
To compare OpenAI GPT with Together AI, configure your promptfooconfig.yaml as follows:
providers:
  - openai:gpt-4o-mini
  - togetherai:chat:meta-llama/Meta-Llama-3-8B-Instruct-Turbo

prompts:
  - "Translate the following English text to German: '{{input}}'"

tests:
  - description: English to German translation
    vars:
      input: Hello, how are you?
    assert:
      - type: contains
        value: Hallo
      - type: contains
        value: wie geht es dir

Note: Replace togetherai:chat:meta-llama/Meta-Llama-3-8B-Instruct-Turbo with the identifier of whichever Together AI model you want to test.
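Before running an eval, both providers need credentials. A minimal sketch, assuming Promptfoo's default environment variables for these two providers (OPENAI_API_KEY and TOGETHER_API_KEY); the key values shown are placeholders:

```shell
# Set the API keys Promptfoo reads for each provider
export OPENAI_API_KEY="sk-..."    # from your OpenAI account
export TOGETHER_API_KEY="..."     # from your Together AI account
```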
---
5. Creating an English-to-German Translation Test Case
We’ve already included a basic test case. Let’s add more tests for comprehensive coverage:
tests:
  - description: Simple greeting translation
    vars:
      input: Hello, how are you?
    assert:
      - type: contains
        value: Hallo
      - type: contains
        value: wie geht es dir
  - description: Complex sentence translation
    vars:
      input: The quick brown fox jumps over the lazy dog.
    assert:
      - type: contains
        value: Der schnelle braune Fuchs
      - type: contains
        value: springt über den faulen Hund
  - description: Idiomatic expression translation
    vars:
      input: It's raining cats and dogs.
    assert:
      - type: contains
        value: Es regnet in Strömen
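Exact substring checks are brittle for translations, since a model may produce a valid variant that doesn't match. Promptfoo offers richer assertion types for such cases; a sketch using the case-insensitive icontains and the model-graded llm-rubric assertions (the test content here is hypothetical):

```yaml
  - description: Formal greeting translation
    vars:
      input: Good morning, Mr. Smith.
    assert:
      - type: icontains        # case-insensitive substring match
        value: guten morgen
      - type: llm-rubric       # graded by an LLM against a rubric
        value: Is a polite, grammatically correct German translation
```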
---
6. Running the Tests and Collecting Metrics
To run the tests and collect metrics:
promptfoo eval -c promptfooconfig.yaml --share -o output.json
This command evaluates the prompts against both OpenAI GPT and the Together AI model, then saves the results to output.json.
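You can also post-process output.json programmatically, for example to tally pass rates per provider. A minimal Python sketch; the nested structure assumed here (results.results[*].provider.id and success) reflects recent Promptfoo output and may differ between versions, and the sample data is hypothetical:

```python
from collections import Counter

def pass_rates(eval_output: dict) -> dict:
    """Compute the fraction of passing tests per provider."""
    passes, totals = Counter(), Counter()
    for row in eval_output["results"]["results"]:
        provider = row["provider"]["id"]
        totals[provider] += 1
        if row["success"]:
            passes[provider] += 1
    return {p: passes[p] / totals[p] for p in totals}

# Hypothetical sample mimicking the assumed schema;
# in practice: eval_output = json.load(open("output.json"))
sample = {"results": {"results": [
    {"provider": {"id": "openai:gpt-4o-mini"}, "success": True},
    {"provider": {"id": "openai:gpt-4o-mini"}, "success": True},
    {"provider": {"id": "togetherai:chat:meta-llama/Meta-Llama-3-8B-Instruct-Turbo"}, "success": True},
    {"provider": {"id": "togetherai:chat:meta-llama/Meta-Llama-3-8B-Instruct-Turbo"}, "success": False},
]}}

print(pass_rates(sample))
```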
---
7. Analyzing Results

After running the tests, you can review output.json to analyze the results.
For an even easier method of reviewing and comparing results, you can use the built-in dashboard:
vscode ➜ /workspaces/promptfoo $ npx promptfoo view
Migrated results from file system to database
Server running at http://localhost:15500 and monitoring for new evals.
Open URL in browser? (y/N):
Open http://localhost:15500 in your browser.

In this dashboard, you can view all the test prompts along with their results (metrics + telemetry data) that you just executed.
This gives you the insight needed to decide which model might be the best fit for your scenario(s).
Consider factors like:
- Speed (tokens per second)
- Quality based on defined metrics (we used very simple metrics here, but there are plenty of additional built-in metrics and deterministic metrics available)
- Cost factors, especially with large volumes of data
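The cost factor in particular is easy to estimate up front. A small Python sketch; the traffic numbers and per-million-token prices are hypothetical, so substitute each provider's current pricing:

```python
def monthly_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Estimate monthly spend in USD, given prices per million tokens."""
    tokens_in = requests_per_day * avg_input_tokens * days
    tokens_out = requests_per_day * avg_output_tokens * days
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

# Hypothetical prices; check each provider's pricing page
gpt = monthly_cost(10_000, 500, 200, price_in_per_m=0.15, price_out_per_m=0.60)
llama = monthly_cost(10_000, 500, 200, price_in_per_m=0.18, price_out_per_m=0.18)
print(f"GPT-4o-mini: ${gpt:,.2f}/month, Llama 3 8B: ${llama:,.2f}/month")
```

At large volumes, small per-token differences compound quickly, which is why this factor deserves the same systematic attention as quality and speed.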


---
For many more options and configurations, please refer to the official Promptfoo documentation:
https://www.promptfoo.dev/docs/intro/
8. Conclusion
Using Promptfoo to compare the performance of OpenAI GPT and Together AI for English-to-German translation tasks provides valuable insights into each model’s strengths and weaknesses. By systematically testing various inputs and evaluating the outputs, you can make informed decisions about which LLM provider best meets your requirements.
Key Metrics to Consider:
- Accuracy of translations
- Handling of idiomatic expressions
- Response time
- Consistency across different input types
By examining these factors, you can determine which model performs better for your specific use case and make data-driven decisions about which LLM provider to integrate into your applications.